When you launch a workflow in Terra, there's a lot that happens behind the scenes to get the work done. Usually you don't need to care about any of it, but we're about to release a system upgrade that may cause some of your workflows to fail where they worked previously. We'd like to explain what's going to happen so you can deal with any failures quickly and efficiently. You shouldn't need to change the logic of your scripts, but in some cases you may need to adjust the amount of CPU/memory or boot disk space requested by specific tasks.
Under the hood, the Terra workflow management system (Cromwell) connects to a Google Cloud service called Pipelines API, or PAPI for short. Last year, Google rolled out a new version of the PAPI service (v2) that brought major reliability improvements, as well as support for custom machines and GPUs. This was an exciting upgrade but we didn't enable it right away in FireCloud/Terra because we wanted to test the new service thoroughly before making the switch. Over the next few months, we worked closely with the PAPI team at Google to hammer the system, identify any issues and get them fixed. Now we're at a point where we're confident that PAPI v2 is ready for primetime, so we're going to switch it on in Terra.
Overall we expect this upgrade to improve the reliability of workflow execution quite substantially. You should see fewer transient workflow errors and a higher rate of success on any retries when things do go wrong. You are also likely to see the cost of some of your workflows go down.
HOWEVER, you may experience some new failures due to insufficient resource allocation, i.e. some of your jobs may fail because they don't have enough CPU/memory or boot disk space.
Let's address the CPU/memory thing first. The "problem" is that with PAPI v2, each of your jobs will receive the exact amount of resources it requested (within some constraints). It sounds obvious that this is what should happen, but it's actually a new feature: the first version of PAPI only provided access to predefined machine configurations, and if you requested anything else, the system rounded up your request to the nearest predefined configuration available. It's as if you previously ordered shirts by giving chest and length measurements, but someone was systematically translating those measurements to Small, Medium or Large, and rounding up whenever you were in between two sizes. With PAPI v2 you can get your shirts tailored to fit, which is great because Google will only charge you for what you asked for, not "the next size up". Hence the lower workflow costs.
The catch is that some of your jobs may have secretly needed more CPU/memory than you were requesting for them, and you got away with it this whole time because PAPI was "helpfully" rounding up your requests. Take away the rounding up, and those jobs will no longer succeed until you increase their CPU/memory allocation.
For any job that fails unexpectedly following the upgrade, check its stdout log output; if the program you were running logged that it ran out of memory, just increase its memory allocation and retry the workflow. For any workflow that uses an input variable for memory, you should be able to do this in the tool inputs panel. If that value is hardcoded in your WDL script, you will unfortunately need to edit the script itself. We recommend taking this opportunity to parameterize resource allocation settings and/or look into the possibility of using autosizing to set default input-dependent values. In addition, we made a cheat sheet to help you identify what standard machine presets the original PAPI would have requested for any given values of memory and CPU. See these instructions on how to use it effectively. We hope this will help you estimate appropriate values for any jobs that fail due to the upgrade to PAPI v2.
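To make the parameterization advice above concrete, here is a minimal sketch of a WDL task whose memory and CPU requests are exposed as inputs rather than hardcoded, so they can be adjusted from the inputs panel without editing the script. The task name, default values, Docker image and the 10 GB of disk headroom are illustrative assumptions, not recommendations; the disk line shows one common way to autosize a default from the input file.

```wdl
version 1.0

# Hypothetical task: counts the lines in a file.
task count_lines {
  input {
    File input_file

    # Resource settings exposed as inputs (illustrative defaults).
    Int memory_gb = 4
    Int cpu_count = 1

    # Autosized default: input size rounded up, plus assumed headroom.
    Int disk_gb = ceil(size(input_file, "GB")) + 10
  }

  command <<<
    wc -l < ~{input_file} > line_count.txt
  >>>

  output {
    Int line_count = read_int("line_count.txt")
  }

  runtime {
    docker: "ubuntu:20.04"
    memory: memory_gb + " GB"
    cpu: cpu_count
    disks: "local-disk " + disk_gb + " HDD"
  }
}
```

With a task written this way, a job that runs out of memory after the upgrade can be retried by raising memory_gb in the workflow inputs, leaving the script untouched.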
The other source of potential failures is boot disk space. PAPI v2 manages Docker images differently, and in a very small number of cases, typically involving extremely large Docker images, we have seen jobs run out of boot disk space. Normally the workflow management system assigns the boot disk size automatically based on the size of the compressed Docker image, which is why it is rarely specified in WDL scripts, but you can request a specific size if needed by using the bootDiskSizeGb key in the runtime block.
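As a rough illustration of the bootDiskSizeGb key mentioned above, the sketch below sets an explicit boot disk size for a task that uses a large image. The task name, image name, tool command and the 30 GB default are all hypothetical placeholders; pick a size that comfortably fits your own uncompressed Docker image.

```wdl
version 1.0

# Hypothetical task that runs a tool shipped in a very large Docker image.
task run_large_tool {
  input {
    # Exposed as an input so it can be raised without editing the script.
    Int boot_disk_gb = 30
  }

  command <<<
    large_tool --version > version.txt
  >>>

  output {
    String tool_version = read_string("version.txt")
  }

  runtime {
    # Placeholder image name, not a real published image.
    docker: "my-registry/large-tools:1.0"
    bootDiskSizeGb: boot_disk_gb
  }
}
```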
We hope you will find this upgrade as worthwhile as we expect, and that the side effects we describe here will not cause too much disruption to your work. As always, we are at your service to help resolve any issue you experience while using Terra. Don't hesitate to reach out to our frontline support team at email@example.com if you need help with any of this. You can also post on the community forum or peruse it to see if other users have posted a similar experience.