Hey guys, I’d love to know how well checkpointing actually works when running Airflow on spot instances. Is it really worth it? [Checkpointing saves the state of a process during execution so it can be restored after a failure.]
I wrote this article on building fault-tolerant Airflow pipelines on spot instances (https://spot.rackspace.com/blog/building-fault-tolerant-airf...), and one decision I made was to use S3 as the external state layer and checkpoint task outputs there. Here’s a quick summary (rough code sketch after the list):
1. Each task writes its output to a specific S3 path.
2. When a worker node is preempted mid-task, Airflow retries the task, and the new pod reads directly from S3, picking up the last successfully written output from the upstream task.
3. Writes use replace=True, so if a task was interrupted mid-write and left a partial file, the retry simply overwrites it, keeping execution idempotent.
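To make that concrete, here’s a minimal TaskFlow-style sketch of the pattern, not the exact code from the article; the bucket name, key layout, and the toy payload/transform are placeholders you’d swap for your own:

```python
# Minimal sketch: checkpoint each task's output to S3 so a retried task
# (e.g. after a spot preemption) can pick up the upstream result again.
import pendulum
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "my-pipeline-state"  # hypothetical bucket name


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def checkpointed_pipeline():

    @task(retries=3)
    def extract(ds=None):
        hook = S3Hook(aws_conn_id="aws_default")
        key = f"checkpoints/extract/{ds}.json"
        payload = '{"rows": 123}'  # stand-in for real extracted data
        # replace=True keeps the write idempotent: if the task died mid-write
        # and left a partial object, the retry simply overwrites it.
        hook.load_string(payload, key=key, bucket_name=BUCKET, replace=True)
        return key

    @task(retries=3)
    def transform(upstream_key: str, ds=None):
        hook = S3Hook(aws_conn_id="aws_default")
        # The retried pod reads the last successfully written upstream output
        # straight from S3, so nothing held in the old worker's memory is lost.
        data = hook.read_key(key=upstream_key, bucket_name=BUCKET)
        out_key = f"checkpoints/transform/{ds}.json"
        hook.load_string(data.upper(), key=out_key, bucket_name=BUCKET, replace=True)
        return out_key

    transform(extract())


checkpointed_pipeline()
```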
This is a very simple implementation, but I’m curious what checkpointing methods you all apply in production, or if it’s even something you bother with at all.
From this setup, one big question I keep coming back to is whether the overhead of writing to S3 ends up eating into the cost savings of using spot instances in the first place.
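For what it’s worth, here’s how I’ve been framing that math; every number below is a placeholder, not a real AWS price, so plug in your own:

```python
# Purely illustrative back-of-the-envelope comparison: S3 request overhead per
# DAG run vs. compute saved by running the same run on spot capacity.
S3_PUT_PER_1K = 0.005          # placeholder $ per 1,000 PUT requests
S3_GET_PER_1K = 0.0004         # placeholder $ per 1,000 GET requests
CHECKPOINT_WRITES_PER_RUN = 50
CHECKPOINT_READS_PER_RUN = 50
ON_DEMAND_HOURLY = 0.20        # placeholder instance price
SPOT_DISCOUNT = 0.70           # placeholder 70% discount vs. on-demand
RUN_HOURS = 1.0

checkpoint_cost = (CHECKPOINT_WRITES_PER_RUN * S3_PUT_PER_1K
                   + CHECKPOINT_READS_PER_RUN * S3_GET_PER_1K) / 1000
compute_savings = ON_DEMAND_HOURLY * SPOT_DISCOUNT * RUN_HOURS

print(f"checkpointing request overhead per run: ${checkpoint_cost:.5f}")
print(f"spot compute savings per run:           ${compute_savings:.2f}")
```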