Workflows can fail for a lot of different reasons. If you're lucky, you have the wrong inputs plugged into the configuration, which is typically fast to fail and fast to fix (to learn more about how to avoid these sorts of errors, see Workflow setup: Configuring inputs). More complex errors come from the workflow code: bugs, or limitations in the analysis software package you're calling on. And of course you can fall victim to transient errors when something goes wrong on the Google Cloud Platform itself (even giants sometimes stumble).
In this document we'll go over some basic resources to help investigate failed workflows on Terra. This isn’t a guide for solving all errors but a doc to help diagnose failed submissions. The information below is to help guide you as you drill down to find the root cause of these more complex errors, so you can be up and running. Descriptions of more complicated errors are always welcome on the Terra Forum, where our team is happy to help.
At this point we will assume you have a workspace set up with a data table and workflow configuration loaded. You’ve launched your workflow, but your submission has failed. Don’t despair! Information in the Job History and Job Manager can help get your workflow up and running soon enough, as long as you know how to access and use the information.
To troubleshoot, you can access increasingly granular information, starting with submission-level status and drilling down to workflow-level information (within a submission) and task-level information (within each workflow).
If your workflow seems to be stuck or stalled
If your workflow is not progressing from submitted to running, or seems to be taking longer than expected to run, you may be bumping up against a Google Cloud resource quota. See How to troubleshoot and fix stalled workflows.
For general information about monitoring submissions, see Overview: Job History (monitor and troubleshoot).
To understand what's happening behind the scenes, see How the workflows system works.
Step 1: Check high-level submissions status
The Job History page includes a list of all the workflows within the submission along with high-level submission status (queued, running, done, a red triangle etc.) and a few columns of metadata (such as the submission date and submission ID).
This page will help you identify a failed submission. To figure out why it failed, you will need to dig deeper.
If your workflow failed immediately
If your workflow failed right when you submitted it, the problem is almost certainly that Terra could not find an input file. If you are using inputs from the data table, this could be because the attribute name (in the inputs section of the workflow card) did not match the column header in the data table. Or it could mean you don't have access to the bucket where the data is stored (or your authorization link has expired).
For additional guidance, please see Step 3: Specify inputs from a table in How to configure workflow inputs.
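A quick way to catch a mismatch before you resubmit is to compare the attribute names in your inputs against the column headers of the data table. The snippet below is a minimal sketch, not a Terra API: it assumes you have the table as tab-separated text and your inputs written as `this.<attribute>` expressions. The function name, input names, and table contents are all hypothetical.

```python
import csv
import io


def missing_attributes(table_tsv, input_expressions):
    """Return inputs whose `this.<attribute>` reference has no matching column.

    table_tsv: tab-separated text whose first row holds the column headers.
    input_expressions: dict of input name -> expression (hypothetical names).
    """
    reader = csv.reader(io.StringIO(table_tsv), delimiter="\t")
    columns = set(next(reader, []))
    problems = {}
    for input_name, expression in input_expressions.items():
        if not expression.startswith("this."):
            # Literal values and other references are out of scope for this check.
            continue
        attribute = expression[len("this."):]
        if attribute not in columns:
            problems[input_name] = attribute
    return problems


# Hypothetical table and inputs, for illustration only.
table = "sample_id\tbam_file\tbam_index\nNA12878\tgs://b/a.bam\tgs://b/a.bai\n"
inputs = {
    "wf.bam": "this.bam_file",
    "wf.index": "this.bam_idx",  # typo: the column is called bam_index
    "wf.ref": "workspace.ref",
}
print(missing_attributes(table, inputs))
```

Any input the function reports points at an attribute name to correct in the workflow configuration (or a column to add to the table).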
For details about a particular (failed) submission
Click on the link in the "Submission" column.
Step 2: Check workflow-level status
If you cannot find an obvious problem with the submission, you can check further details about each workflow within the submission. The Job History submission page lists each workflow's status (failed, queued, etc.) and links to more information in the Job Manager, the Dashboard, and the execution directory.
Submission-level details in Job History
For more detail about a particular (failed) workflow, select one of the three icons at the right.
TIP If you don't see these icons
If your job failed because it never started (if Terra could not find your input files to localize, for example), you won't see these options. Check (1) that you specified the correct input file attribute and (2) that you have access to the data (i.e., your link to controlled data has not expired).
Step 3: Check workflow details in Job Manager
If Job Manager won’t load
Job Manager may fail to load if your job produced huge amounts of metadata. In these cases, skip to the Workflow Dashboard (below).
Job Manager is your go-to location for a more thorough breakdown of your workflow. Here you can find information about each individual task in the workflow, including
- Error messages
- Links to log files, Google Cloud execution directories, and Compute details
- A timing diagram.
Check the following for clues by clicking on or hovering over the appropriate icon.
- Error messages: Can help you identify and investigate which task failed. See below for examples of error messages often found for failed workflows. You can see error messages (listed by task) right on this page in the Error tab, or by hovering over the icon in the Errors card.
- Backend log: A step-by-step report of actions during the execution of the task (i.e., Docker setup, localization, stdout from tools run within the command block of the task, and finally, the delocalization and Docker shutdown steps).
- Execution directory: Includes task-level details and generated outputs for a particular workflow within your submission. Found in the Google Cloud console, it is where you can view or download stderr, stdout, and backend logs for each task within the workflow.
- Compute details: Information on the workflow at the Google Pipelines worker level, including timestamps for the execution of worker tasks and VM configuration information. Use this section to understand or validate the configuration of your worker VM (memory, disk size, machine type, etc.). You can also check this section if you suspect your workflow failed due to a transient Google issue.
The message displayed under Failure Messages isn't always short and sweet and, if interpreted incorrectly, can lead you down the wrong debugging path. Instead, use the message to identify and investigate which task failed.
Error Message Examples
Below are some common errors and their possible meanings. There isn’t a solution for all of them, so feel free to post your error on the Terra forum, where the team can help you work through the message.
The maximum time a non-preemptible PAPI VM can survive is 1 week (168 hours). This is a default set by Google Pipelines API. If you are running into this error, we recommend increasing the CPU or memory, using larger disks, or using SSDs rather than HDDs in order to speed up the work so that it doesn't run out of time. You can alternatively try to chunk the work into smaller, separate workflows.
The “command block” component of a task failed and generated a non-zero return code. Check the stderr/stdout file for a stack trace or exception. To learn more about this error, see this article.
PAPI error code 9. Please check the log file for more details. To learn more about this error, see Error message: PAPI error code 9.
This error means the job exited in such an abrupt way that the machine where the command was running has essentially crashed. It can also mean that the job was preempted (this is less common). To rule out this last possibility, check whether the requested machine was preemptible.
To learn more about this error, see Error message: PAPI error code 10.
When localizing lots of files at once, the command length (in characters) can get too long, and you will see an error message similar to the one shown above. In this example, the workflow failed while trying to localize thousands of inputs to a single task. To fix an issue like this, you can create a tar.gz of all the input files and provide it as a single input to the workflow, then localize and extract it within the task.
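As a rough illustration of the tar.gz approach, here is a minimal Python sketch using the standard `tarfile` module. The function names and file paths are hypothetical; in a real workflow the extraction step would run inside your task's command block rather than in a separate script.

```python
import tarfile
from pathlib import Path


def bundle_inputs(input_paths, archive_path):
    """Pack many local files into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in input_paths:
            path = Path(path)
            # Store only the file name so the archive unpacks into a flat folder.
            # Files that share a name would collide; rename them first if needed.
            archive.add(str(path), arcname=path.name)
    return archive_path


def unpack_inputs(archive_path, destination):
    """Extract the archive and return the sorted names of the unpacked files."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
    return sorted(p.name for p in Path(destination).iterdir())
```

With this pattern the workflow localizes one archive instead of thousands of individual files, which keeps the localization command short.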
This error message is just saying that your command or tool failed in some way, returning a non-zero return code which Cromwell considers a failure. To troubleshoot, we recommend searching for “error” in the log file for your task. If your command/tool produced a useful error message, you may find a solution by searching for that message in your search engine of choice.
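If the log is long, searching by eye gets tedious. The sketch below shows one way to pull out every line that mentions "error" (ignoring case), along with a line of context on either side. It assumes you have already downloaded the log and read it into a string; the function name and the sample log text are made up for illustration.

```python
def find_error_lines(log_text, keyword="error", context=1):
    """Return (line_number, snippet) pairs for lines that mention the keyword."""
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if keyword.lower() in line.lower():
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            # Line numbers are 1-based to match what a text editor shows.
            hits.append((index + 1, "\n".join(lines[start:end])))
    return hits


# Made-up log text, for illustration only.
sample_log = (
    "Starting tool\n"
    "Reading input\n"
    "ERROR: file not found: sample.bam\n"
    "Exiting\n"
)
for line_number, snippet in find_error_lines(sample_log):
    print(f"line {line_number}:\n{snippet}\n")
```

The message on the matching line is usually the best phrase to paste into a search engine or a Terra forum post.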
Log files/execution directory
If it's not immediately obvious what failed, the best sources of information are log files, which you can access directly from the Job Manager interface by clicking on the icon at the left (looks like a bullet list). These files are generated by Cromwell when executing any task and are placed in the task's folder along with its output. In Terra, we add quick links to these files to make troubleshooting easier.
The backend log gives a step-by-step report of actions during the execution of the task. These details include information about Docker setup, localization (the step of copying files from your Google bucket into the Docker container), stdout from tools run within the command block of the task, and finally, the delocalization and Docker shutdown steps.
If there were problems with any of these, you will see them here.
You can also see this in the Google Cloud Platform console by clicking the link at the bottom.
If your log stopped abruptly
Some log files seem to stop abruptly, before reaching the delocalization stage. This is almost certainly because the task has run out of memory. We recommend retrying with more memory to see if your job gets farther. See Out Of Memory Retry to learn more about how to configure your workflow to immediately retry certain tasks if the only error was running out of memory.
The execution directory is the folder in your workspace bucket that contains the task-level details and generated outputs for a particular workflow within your submission. For example, if you ran a workflow on two different samples, and one failed while the other succeeded, you can open the execution directory for the failed one to find information that helps explain the failure.
To learn more, see Overview: Execution directory.
Compute details displays information on the workflow at the Google Pipelines worker level, including timestamps for the execution of worker tasks and VM configuration information. You can use this section to understand or validate the configuration of your worker VM (memory, disk size, machine type, etc.). You can also check this section if you suspect your workflow failed due to a transient Google issue.
Timing diagram
This is a visual representation of the things Terra did as it was running the workflow - how much actual clock time was spent on commands within each task. It can be helpful for understanding what to look into further, especially if any part took much longer than expected.
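If you would rather see the same idea as numbers, the sketch below ranks tasks by wall-clock time. It assumes you have copied each task's start and end timestamps by hand into a dictionary in ISO 8601 format; the task names and times are invented, and this is not how Terra itself builds the diagram.

```python
from datetime import datetime


def task_durations(tasks):
    """Map each task name to its wall-clock duration in seconds, longest first."""
    durations = {}
    for name, (start, end) in tasks.items():
        started = datetime.fromisoformat(start)
        finished = datetime.fromisoformat(end)
        durations[name] = (finished - started).total_seconds()
    # Sort so the slowest task, usually the first one to investigate, comes first.
    return dict(sorted(durations.items(), key=lambda item: item[1], reverse=True))


# Invented task names and timestamps, for illustration only.
example_tasks = {
    "sort": ("2024-01-01T12:30:00", "2024-01-01T12:45:00"),
    "align": ("2024-01-01T10:00:00", "2024-01-01T12:30:00"),
}
for name, seconds in task_durations(example_tasks).items():
    print(f"{name}: {seconds / 60:.0f} min")
```

A task that dominates the list is a good candidate for more CPU, memory, or disk, or for splitting into smaller pieces.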
Alternate step 4: Workflow Dashboard (middle icon)
If Job Manager fails to load (if your job produced huge amounts of metadata, for example), you can access much of the same information in the Workflow Dashboard.
Remember that when troubleshooting, you should automatically head to the Monitor tab and check stdout, stderr, and task log for your failed task.
If there isn’t a stdout or stderr file, use the task log and the message explanations in this document to help you solve the problem. Of course, if you are having any trouble with Terra troubleshooting, you can ask your question on the Terra forum.