How to use checkpointing to save on long-running tasks

Allie Hajian
  • Updated

Running a long job but worried that you will lose your work if your job is interrupted in the middle of a task? Task-level checkpointing lets you restart your workflow at an intermediate step, saving you from having to recompute work done up to the checkpoint. Learn about checkpointing functionality as well as detailed steps and examples of how to implement it in your own WDL. 

Why use checkpointing?

Terra already has one money-saving feature that allows you to resume a failed workflow - "call-caching". With call-caching - which is turned on by default - if a workflow fails, you don't have to rerun tasks that already completed.

However, you do have to run the failed task from scratch. This is especially relevant if your long job fails because it ran out of memory,  encountered a transient Google Cloud error, or is preempted in the middle of a task after ten hours of computation. Until now, you've potentially lost hours of execution costs within the failed task. Even at the discounted preemptible rate, wasting compute time is painful. 

What is checkpointing? Checkpointing, introduced with Cromwell 55, allows you to specify an intermediate checkpointFile value in a task's runtime section. This checkpoint file is copied to cloud storage every ten minutes and can be restored automatically on subsequent attempts if the job is interrupted. Once the final output is successfully generated, the checkpoint file is deleted.

Learn more about checkpointing in this Terra blog post

Read the full checkpointing documentation here 

Checkpointing doesn't work with failed jobs Checkpointing is great for recovering from interruptions

  • Preemptible virtual machines (VMs) interrupted in the middle of their operation 

  • Running out of memory

  • Transient Google Cloud issues causing interruptions

It won't help you jump back into the middle of failed jobs if your workflow fails.

Checkpointing versus call caching

Call caching and checkpointing are similar functions - they both keep you from having to redo work when a workflow run fails. However, with call caching you restart the workflow before the failed task, while checkpointing allows you to resume the analysis partway through the task - at the checkpoint.  

Checkpointing_Restart-from-task2_Diagram.png

Checkpointing lets you keep more of the failed run (the orange part represents the calculation that must be rerun). 

 

Checkpointing_Restart-from-checkpoint_Diagram.png

 

How does checkpointing affect cloud charges?

Charges accrue from storing the checkpoint file while the task runs. There may be more charges from the transfer between the VM and the cloud storage bucket, depending on their locations. These costs should be minor, however, especially balanced against the performance and cost benefits of restoring from the checkpoint when a worker VM is preempted.

Since the checkpoint file is deleted once the task completes, there are no further charges.

When is the checkpoint file not deleted?
If the task is aborted or otherwise stopped externally, i.e., if Cromwell's operation is interrupted, the checkpoint file will NOT be deleted and storage charges continue to accrue indefinitely, or until the file is deleted manually.

Checkpointing example WDL

Because a WDL must include a command that is checkpoint-aware, you need to build checkpointing into your WDL to take advantage of the feature. The following WDL demonstrates the use of the checkpointFile optimization. 

Expand for example WDL code

To make the checkpointing work, the runtime section specifies checkpointFile: "my_checkpoint".

It starts by trying to restore state from the my_checkpoint file (or starts at 1 if the checkpoint is empty)

Then it counts up to 100, printing out the current counter value and a date timestamp at each value

version 1.0

workflow count_wf {
# Count to 2100 at 1/second => 35 minutes to complete, but
# luckily the state can be checkpointed every 10 minutes in
# case of preemption:
call count { input: count_to = 2100 }
}

task count {
input {
Int count_to
}

command <<<
# Note: Cromwell will stage the checkpoint file on recovery attempts.
# This task checks the 'my_checkpoint' file for a counter value, or else
# initializes the counter at '1':
FROM_CKPT=$(cat my_checkpoint | tail -n1 | awk '{ print $1 }')
FROM_CKPT=${FROM_CKPT:-1}

echo '--' >> my_checkpoint
for i in $(seq $FROM_CKPT ~{count_to})
do
echo $i $(date) >> my_checkpoint
sleep 1
done
>>>

runtime {
docker: "ubuntu:latest"
preemptible: 3
# Note: This checkpointFile attribute is what signals to Cromwell to save
# the designated checkpoint file:
checkpointFile: "my_checkpoint"
}

output {
# Note: This task also uses the checkpoint as its output. This is not
# required for checkpointing to work:
Array[String] out = read_lines("my_checkpoint")
}
}

Was this article helpful?

0 out of 0 found this helpful

Comments

0 comments

Please sign in to leave a comment.