How to use checkpointing to save on long-running tasks

Allie Hajian

Running a long job, but worried that you'll lose your work if it's interrupted in the middle of a task? Task-level checkpointing allows you to resume a task from an intermediate step, saving you from having to recompute the work done up to the checkpoint. This article gives an overview of the checkpointing functionality as well as detailed steps and examples of how to implement it in your own WDL.

Why use checkpointing?

Terra already has one money-saving feature that lets you resume a failed workflow: call-caching. With call-caching - which is turned on by default - if a workflow fails, you don't have to rerun tasks that have already completed.

However, you do have to rerun the failed task from scratch. This is especially relevant if your long job runs out of memory, encounters a transient GCP error, or is preempted in the middle of a task after ten hours of computation. Until now, you've potentially lost hours of computation - and the associated execution costs - within the failed task. Even at the discounted preemptible rate, wasting compute time is painful.



What is checkpointing?

 

Checkpointing, introduced with Cromwell 55, allows you to specify an intermediate checkpointFile value in a task's runtime section. This checkpoint file is copied to cloud storage every ten minutes and can be restored automatically on subsequent attempts if the job is interrupted. Once the final output has been successfully generated, the checkpoint file is deleted.

Learn more about checkpointing in this Terra blog post

Read the full checkpointing documentation here  
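
In practice, the only WDL change checkpointing requires is the extra runtime attribute (your command also needs to read from and write to that file, as in the full example further down). As a minimal sketch, pulled from that example, the runtime section looks like this:

runtime {
  docker: "ubuntu:latest"
  preemptible: 3
  # Signals Cromwell to copy my_checkpoint to cloud storage every ten minutes
  # and restore it when the task is retried after an interruption:
  checkpointFile: "my_checkpoint"
}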

 



Checkpointing doesn't work with failed jobs

 

Checkpointing is great for recovering from interruptions such as:

  • A preemptible VM being reclaimed in the middle of the task
  • The task running out of memory
  • Transient GCP issues interrupting the job

It won't, however, let you jump back into the middle of a task if the workflow itself fails - a failed task still has to rerun from the start.

Checkpointing versus call-caching

Call-caching and checkpointing serve a similar purpose - they both keep you from having to redo work when a workflow run fails. However, with call-caching you restart the workflow at the beginning of the failed task, while checkpointing allows you to resume the analysis partway through the task - at the checkpoint.

Checkpointing_Restart-from-task2_Diagram.png
Restarting from the beginning of the failed task (call-caching): the orange part represents the calculation that must be rerun.

Checkpointing_Restart-from-checkpoint_Diagram.png
Restarting from the checkpoint (checkpointing): checkpointing allows you to keep more of the failed run.

 

How does checkpointing affect cloud charges?

Charges will accrue from storing the checkpoint file while the task is running. There may be additional charges from the transfer between the VM and the cloud storage bucket depending on their locations. These costs should be minor, however, especially balanced against the performance and cost benefits of being able to restore from the checkpoint when a worker VM is preempted.

Since the checkpoint file is deleted once the task completes, there will be no further charges.

When is the checkpoint file not deleted?

If the task is aborted or otherwise stopped externally (i.e., if Cromwell's operation is interrupted), the checkpoint file will NOT be deleted, and storage charges will continue to accrue until you delete the file manually.
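
If you do end up with a leftover checkpoint file, you can remove it from your workspace bucket yourself. The commands below are a rough sketch - the bucket name and execution-directory path are placeholders, and the exact layout of your submission's folders will differ:

# Find leftover checkpoint files in your workspace bucket
# (the bucket name and the 'my_checkpoint' file name are placeholders):
gsutil ls gs://your-workspace-bucket/**/my_checkpoint

# Delete a leftover checkpoint file to stop the storage charges:
gsutil rm gs://your-workspace-bucket/path/to/call-count/my_checkpoint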

Checkpointing example WDL

Because a WDL must include a command that is checkpoint-aware, you will need to build checkpointing into your WDL to take advantage of the feature. The following WDL demonstrates the use of the checkpointFile optimization. 


To make the checkpointing work, the runtime section specifies checkpointFile: "my_checkpoint". The command:

  • Starts by attempting to restore state from the my_checkpoint file (or starts the counter at 1 if the checkpoint is empty)
  • Counts up to the requested value (2,100 in this example), printing the current counter value and a date timestamp at each step

version 1.0

workflow count_wf {
  # Count to 2100 at 1/second => 35 minutes to complete, but
  # luckily the state can be checkpointed every 10 minutes in
  # case of preemption:
  call count { input: count_to = 2100 }
}

task count {
  input {
    Int count_to
  }

  command <<<
    # Note: Cromwell will stage the checkpoint file on recovery attempts.
    # This task checks the 'my_checkpoint' file for a counter value, or else
    # initializes the counter at '1':
    FROM_CKPT=$(cat my_checkpoint | tail -n1 | awk '{ print $1 }')
    FROM_CKPT=${FROM_CKPT:-1}

    echo '--' >> my_checkpoint
    for i in $(seq $FROM_CKPT ~{count_to})
    do
      echo $i $(date) >> my_checkpoint
      sleep 1
    done
  >>>

  runtime {
    docker: "ubuntu:latest"
    preemptible: 3
    # Note: This checkpointFile attribute is what signals to Cromwell to save
    # the designated checkpoint file:
    checkpointFile: "my_checkpoint"
  }

  output {
    # Note: This task also uses the checkpoint as its output. This is not
    # required for checkpointing to work:
    Array[String] out = read_lines("my_checkpoint")
  }
}
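
To make the resume behavior concrete, here is roughly what my_checkpoint might contain after a preemption and a second attempt (the counter values and timestamps are illustrative only). Because the checkpoint is copied to cloud storage every ten minutes, the restored file can be slightly stale, so the counter may repeat some values:

--
1 Tue Mar 2 18:02:11 UTC 2021
2 Tue Mar 2 18:02:12 UTC 2021
...
842 Tue Mar 2 18:16:13 UTC 2021   <- last value captured in the synced checkpoint
--                                <- the second attempt appends a new marker
842 Tue Mar 2 18:31:40 UTC 2021   <- counting resumes from the restored value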
