Workflows can fail for many different reasons. If you're lucky, the problem is wrong inputs plugged into the configuration, which is typically fast to fail and fast to fix (to learn more about how to avoid these sorts of errors, see this article). More complex errors come from the workflow code itself: bugs, or limitations in the analysis software package you're calling. And of course you can fall victim to transient errors when something goes wrong on Google Cloud itself (even giants sometimes stumble).
In this document we'll go over some basic strategies for investigating failed workflows on Terra. This isn't a guide to solving every error, but it should help you diagnose failed submissions. The information below will guide you as you drill down to find the root cause of these more complex errors, so you can get back up and running quickly. Descriptions of more complicated errors are always welcome on the Terra Forum, where our team is happy to help.
At this point we will assume that you have a workspace set up with a data table and workflow configuration loaded. You’ve launched your workflow, but your submission has failed. Don’t despair! There is some information you can gather that will be helpful in getting your workflow up and running.
High-level submission status
When you click on Job History, you'll see a list of the workflow(s) within the submission along with the high-level submission status (succeeded, failed, running, etc.) and a few columns of metadata (such as the submission ID). For details about a particular submission, click the link in the "Submission" column.
Note that if your workflow failed immediately, the problem is almost certainly that it could not find an input file. If you are using inputs from the data table, this could be because the attribute name (in the inputs section of the workflow card) did not match the column header in the data table. Or it could mean that you don't have access to the bucket where the data are stored.
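If you suspect a mismatch, one quick check is to compare the workflow's input attribute against the data table's column headers, and to confirm you can read the bucket. This is only a sketch: the file, bucket, and column names below are hypothetical stand-ins for your real data table export.

```shell
# Hypothetical data table export; "participant.tsv", "bam_file", and the
# bucket path are stand-ins for your real table and data.
printf 'entity:participant_id\tbam_file\n' > participant.tsv
printf 'sample1\tgs://my-bucket/sample1.bam\n' >> participant.tsv

# The workflow input attribute must match one of these headers exactly:
head -1 participant.tsv | tr '\t' '\n'

# To rule out a permissions problem, check that you can list the object
# (requires an authenticated gcloud/gsutil session, so it is commented out):
# gsutil ls gs://my-bucket/sample1.bam
```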
The next level contains further details about each workflow within the submission and its status (failed, queued, etc.). From here you can access more detail by selecting one of the three icons at the right.
If your job failed because it never started, you won't see these options.
Job manager (far left icon)
1. Error message
The message listed under Failure Messages isn't always short and sweet, and, if interpreted incorrectly, can lead you down the wrong debugging path. Instead, use the message to identify which task failed, then investigate that task.
Error Message Examples (click "+" to expand)
The job was stopped before the command finished. PAPI error code 4. User specified operation timeout reached
The maximum time a non-preemptible PAPI VM can survive is currently configured to one week. In this example, the task had been running for 168 hours (i.e., one week) before it died. We believe the "user" in "user specified operation timeout" refers to PAPI's role as a user of Compute Engine, not the actual end user of our product.
PAPI error code 9
The "command block" component of the task failed and generated a non-zero return code. Consult the stderr/stdout files for a stack trace or exception. To learn more about this error, see this article.
Job exit code 3 [See PAPI error code 9]
PAPI error code 9. Please check the log file for more details. To learn more about this error, see this article.
PAPI error code 10
This error means the job exited in such an abrupt way that the machine where the command was running has essentially crashed. It can also mean that the job was preempted (this is less common). To rule out this last possibility, check whether the requested machine was preemptible. To learn more about this error, see this article.
The task run request has exceeded the maximum PAPI request size (146800064 bytes)
When localizing lots of files at once, the command length (in characters) becomes too long and you see the above error. The exact upper limit on this is unclear. In one example, a user had thousands of inputs being localized to a task when the workflow failed. To work around this, try creating a tar.gz of all the inputs, localizing that single archive, and unzipping it inside the task.
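The workaround above can be sketched as follows; the directory and file names are hypothetical stand-ins for your real inputs:

```shell
# Stand-in input files; in practice these are your thousands of real inputs.
mkdir -p inputs
printf 'data' > inputs/reads_1.txt
printf 'data' > inputs/reads_2.txt

# Bundle everything into one archive so the task localizes a single file
# instead of thousands:
tar -czf inputs.tar.gz inputs

# Inside the task's command block, unpack before running the analysis:
rm -r inputs
tar -xzf inputs.tar.gz
ls inputs
```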
The <task name> exited with return code 250 which has not been declared as a valid return code. See ‘continueOnReturnCode’ runtime attribute for more details.
This error message is just saying that a non-zero return code was generated by the command, and thus Cromwell considers it a failure.
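A minimal illustration of how such a non-zero return code surfaces (the 250 here simply mimics the example above; whitelisting real codes is done through the continueOnReturnCode runtime attribute named in the message):

```shell
# Capture the exit status of a command that fails with code 250.
# Cromwell treats any non-zero code as a task failure unless it has been
# declared valid via continueOnReturnCode.
rc=0
sh -c 'exit 250' || rc=$?
echo "return code: $rc"     # prints: return code: 250
echo "$rc" > rc.txt
```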
If there is a string "error" in the stderr/stdout file
If there is an “error” type string returned, the best thing to do is Google it to see who else has run into this before.
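A quick way to pull out such strings is to grep the file; the stderr content below is a made-up stand-in for a real task's output:

```shell
# Stand-in stderr content; the real file lives in the task's execution directory.
printf 'INFO  loading reference\nValueError: invalid literal for int()\n' > stderr.txt

# Case-insensitive search with line numbers; the hits are the strings to Google:
grep -in 'error' stderr.txt
```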
2. Log files
If it's not immediately obvious what failed, the best sources of information are log files, which you can access directly from the Job manager interface by clicking on the icon at the left. These files are generated by Cromwell when executing any task and are placed in the task's folder along with its output. In Terra, we add quick links to these files to make troubleshooting easier.
Backend (Cromwell) log (click "+" to expand)
You can also see this in Google Cloud Platform console by clicking the link at the bottom.
Execution directory (click "+" to expand)
1. [taskname].log (formerly JES log)
A log file tracking the events that occurred while performing the task, such as pulling the Docker image, localizing files, etc. Occasionally a workflow will fail without stderr and stdout files, leaving you with only this JES log. More on this in the next section.
2. stderr and stdout
Standard Error (stderr): A file containing error messages produced by the commands executed in the task. A good place to start for a failed task, as many common task level errors are indicated in the stderr file.
Standard Out (stdout): A file containing log outputs generated by commands in the task. Not all commands generate log outputs and so this file may be empty.
Compute details (click "+" to expand)
[TaskName].log (formerly JES log) - This log is often difficult to decipher, so it's usually better to proceed to the other log files. However, in some cases your submitted job will fail with no stderr or stdout files; in those cases you'll have to suck it up and unravel the meaning behind the [TaskName].log messages.
Workflow Dashboard (middle icon)
Execution directory (icon at right)
The Execution directory, which lives in the Google Cloud Platform console, includes a wealth of details on the API side of things. For more information about what goes on under the hood, see this article.
The error examples discussed above are pretty simple and very common. Be sure to use the Failure Message to identify which task failed before digging deeper.
Remember that when troubleshooting, you should automatically head toward the log files, starting with the failed task's stderr.
In cases where there isn't a stdout or stderr file, use the [TaskName].log (formerly JES log) and the execution directory.
Of course, if you are having any trouble with Terra troubleshooting, you can ask your question on the Terra Forum, where our team is happy to help.