Hey guys, I’d love to know how well checkpointing actually works when running Airflow on spot instances. Is it really worth it? [Checkpointing saves the state of a process during execution so it can be restored after a failure.]
I wrote this article on building fault-tolerant Airflow pipelines on spot instances (https://spot.rackspace.com/blog/building-fault-tolerant-airf...), and one decision I made was to use S3 as the external state layer and checkpoint task outputs there. Here’s a quick summary (rough code sketch after the list):
1. Each task writes its output to a specific S3 path.
2. When a worker node is preempted mid-task, Airflow retries the task, and the new pod reads directly from S3, picking up the last successfully written output from the upstream task.
3. Writes use replace=True, so if a task was interrupted mid-write and left a partial file, the retry simply overwrites it, keeping execution idempotent.
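To make that concrete, here’s a minimal TaskFlow-style sketch of the pattern, not the exact code from the article; the bucket name, key layout, and the toy payload/transform are placeholders you’d swap for your own:

```python
# Minimal sketch: checkpoint each task's output to S3 so a retried task
# (e.g. after a spot preemption) can pick up the upstream result again.
import pendulum
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "my-pipeline-state"  # hypothetical bucket name


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def checkpointed_pipeline():

    @task(retries=3)
    def extract(ds=None):
        hook = S3Hook(aws_conn_id="aws_default")
        key = f"checkpoints/extract/{ds}.json"
        payload = '{"rows": 123}'  # stand-in for real extracted data
        # replace=True keeps the write idempotent: if the task died mid-write
        # and left a partial object, the retry simply overwrites it.
        hook.load_string(payload, key=key, bucket_name=BUCKET, replace=True)
        return key

    @task(retries=3)
    def transform(upstream_key: str, ds=None):
        hook = S3Hook(aws_conn_id="aws_default")
        # The retried pod reads the last successfully written upstream output
        # straight from S3, so nothing held in the old worker's memory is lost.
        data = hook.read_key(key=upstream_key, bucket_name=BUCKET)
        out_key = f"checkpoints/transform/{ds}.json"
        hook.load_string(data.upper(), key=out_key, bucket_name=BUCKET, replace=True)
        return out_key

    transform(extract())


checkpointed_pipeline()
```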
This is a very simple implementation, but I’m curious what checkpointing methods you all apply in production, or if it’s even something you bother with at all.
From this setup, one big question I keep coming back to is whether the overhead of writing to S3 ends up eating into the cost savings of using spot instances in the first place.
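For what it’s worth, here’s how I’ve been framing that math; every number below is a placeholder, not a real AWS price, so plug in your own:

```python
# Purely illustrative back-of-the-envelope comparison: S3 request overhead per
# DAG run vs. compute saved by running the same run on spot capacity.
S3_PUT_PER_1K = 0.005          # placeholder $ per 1,000 PUT requests
S3_GET_PER_1K = 0.0004         # placeholder $ per 1,000 GET requests
CHECKPOINT_WRITES_PER_RUN = 50
CHECKPOINT_READS_PER_RUN = 50
ON_DEMAND_HOURLY = 0.20        # placeholder instance price
SPOT_DISCOUNT = 0.70           # placeholder 70% discount vs. on-demand
RUN_HOURS = 1.0

checkpoint_cost = (CHECKPOINT_WRITES_PER_RUN * S3_PUT_PER_1K
                   + CHECKPOINT_READS_PER_RUN * S3_GET_PER_1K) / 1000
compute_savings = ON_DEMAND_HOURLY * SPOT_DISCOUNT * RUN_HOURS

print(f"checkpointing request overhead per run: ${checkpoint_cost:.5f}")
print(f"spot compute savings per run:           ${compute_savings:.2f}")
```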