Troubleshooting Workflows: Tips and Tricks

Sushma Chaluvadi

Workflows can fail for a lot of different reasons. If you're lucky, the problem is simply the wrong inputs plugged into the configuration, which is typically fast to fail and fast to fix (to learn more about how to avoid these sorts of errors, see this article). More complex errors come from the workflow code itself: bugs, or limitations in the analysis software package you're calling. And of course you can fall victim to transient errors when something goes wrong on Google Cloud itself (even giants sometimes stumble).

In this document we'll go over some basic strategies for investigating failed workflows on Terra. This isn't a guide to solving every error, but rather a doc to help you diagnose failed submissions. The information below will guide you as you drill down to find the root cause of these more complex errors, so you can get back up and running quickly. Descriptions of more complicated errors are always welcome on the Terra Forum, where our team is happy to help.

Overview

At this point we will assume that you have a workspace set up with a data table and workflow configuration loaded. You’ve launched your workflow, but your submission has failed. Don’t despair! There is some information you can gather that will be helpful in getting your workflow up and running.

Note: If you are looking for general information about how submissions work, see How to monitor and troubleshoot in the Job History Tab.

High-level submission status

When you click on Job History, you’ll see a list of all the workflows within the submission along with high-level submission status (succeeded, failed, running, etc.) and a few columns of metadata (such as the submission ID). For details about a particular submission, click on the link in the "Submission" column. 

Troubleshooting-Job_History_Scren_shot.png

Note that if your workflow failed immediately, the problem is almost certainly that it could not find an input file. If you are using inputs from the data table, this could be because the attribute name (in the inputs section of the workflow card) did not match the column header in the data table. Or it could mean that you don't have access to the bucket where the data is stored.

For additional guidance, please see the section titled 6. Configure Inputs in How to set up a workflow analysis.

Workflow-level status

The next level contains further details about each workflow within the submission and its status (failed, queued, etc.). From here you can access more detail by selecting one of the three icons at the right.

Troubleshooting_Job-History-submision_Screen_shot.png

Note: If you don't see these icons, it's because your job failed before it ever started.

Job Manager (far left icon)

Job Manager is your go-to location for a more thorough breakdown of your workflow. Here you can find information about each individual task in the workflow, including log files, links to Google Cloud execution directories, error messages, and a timing diagram.

Note: Job Manager may fail to load if your job produced huge amounts of metadata. In that case, skip to the Workflow Dashboard (described below).


Troubleshooting_Job_Manager_Screen_shot.png

1. Error message

The message listed under Failure Messages isn't always short and sweet and, if interpreted incorrectly, can lead you down the wrong debugging path. Instead, use the message to identify which task failed, and start your investigation there.

Error Message Examples 

Below we've provided some common errors and their possible meanings as an aid. There isn't a solution for all of them, so feel free to post your error on the Terra forum so the team can help you work through the message.
  • The job was stopped before the command finished. PAPI error code 4. User specified operation timeout reached
    The maximum time a non-preemptible PAPI VM can survive is 1 week (168 hours). This is a default set by the Google Pipelines API. If you are running into this error, we recommend increasing the CPU or memory, using larger disks, or using SSDs rather than HDDs to speed up the work so it doesn't run out of time (see the runtime sketch after this list). You can alternatively try to chunk the work into smaller, separate workflows.
  • PAPI error code 9
    The "command block" portion of the task failed and returned a non-zero exit code. Check the stderr/stdout files for a stack trace or exception. To learn more about this error, see this article.

  • Job exit code 3 [See PAPI error code 9]
    This is the same situation as PAPI error code 9; please check the log file for more details. To learn more about this error, see this article.

  • PAPI error code 10
    This error means the job exited so abruptly that the machine where the command was running has essentially crashed. It can also mean that the job was preempted (this is less common). To rule out this last possibility, check whether the requested machine was preemptible. To learn more about this error, see this article.

  • The task run request has exceeded the maximum PAPI request size (146800064 bytes)
    When localizing lots of files at once, the command length in physical characters can get too long and you will see an error message similar to the one shown above. In one example, a workflow failed while attempting to localize thousands of inputs to a single task. To fix an issue like this, you can create a tar.gz of all the input files and provide it as a single input to the workflow, then localize and unzip it within the task (see the sketch after this list).

  • Job ____ exited with return code _ which has not been declared as a valid return code
    This error message is simply saying that your command or tool failed in some way, returning a non-zero return code, which Cromwell considers a failure (see the note on valid return codes after this list). To troubleshoot, we recommend searching for "error" in the log file for your task. If your command/tool produced a useful error message, you may find a solution by searching for that message in your search engine of choice.
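
For reference, resource requests like the ones mentioned for the timeout and preemption errors live in the task's runtime section of the WDL. Below is a minimal sketch; the task name, tool, and values are placeholder examples chosen for illustration, not recommendations for your workflow.

version 1.0

task resource_hungry_step {
  input {
    File input_file
  }
  command <<<
    # Placeholder command; replace with the tool that is running out of time
    my_tool --input ~{input_file} --output results.txt
  >>>
  output {
    File results = "results.txt"
  }
  runtime {
    docker: "ubuntu:20.04"         # example image
    cpu: 8                         # more cores, if the tool can use them
    memory: "32 GB"                # more memory for memory-bound steps
    disks: "local-disk 500 SSD"    # a larger disk, on SSD rather than HDD
    preemptible: 0                 # non-preemptible, to rule out preemption
  }
}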

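Here is a sketch of the tar.gz workaround for the request-size error; again, the task, tool, and file names are hypothetical. The idea is to pass one archive instead of thousands of individual File inputs and unpack it inside the command block.

version 1.0

task process_bundled_inputs {
  input {
    File input_bundle   # a single tar.gz containing the many input files
  }
  command <<<
    mkdir inputs
    # Unpack the archive inside the task instead of localizing each file separately
    tar -xzf ~{input_bundle} -C inputs
    # Placeholder: run your tool over the unpacked files
    my_tool --input-dir inputs --output results.txt
  >>>
  output {
    File results = "results.txt"
  }
  runtime {
    docker: "ubuntu:20.04"
    disks: "local-disk 200 HDD"   # size the disk to hold the unpacked files
  }
}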

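Finally, the "valid return code" message refers to the fact that, by default, Cromwell treats only an exit code of 0 as success. If you are certain a tool exits non-zero even when it succeeds, one option is to declare additional valid codes with the continueOnReturnCode runtime attribute, as in the sketch below; whenever possible, fix the underlying error instead.

runtime {
  docker: "ubuntu:20.04"
  # Treat exit codes 0 and 3 as success instead of failing the task.
  # Only do this if the non-zero code is genuinely expected behavior.
  continueOnReturnCode: [0, 3]
}
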
2. Log files

If it's not immediately obvious what failed, the best sources of information are the log files, which you can access directly from the Job Manager interface by clicking the icon at the left. These files are generated by Cromwell when executing any task and are placed in the task's folder along with its outputs. In Terra, we add quick links to these files to make troubleshooting easier.

Task log

This gives a step-by-step report of actions during the execution of the task. These details include information about Docker setup, localization (the step of copying files from your Google bucket into the Docker container), stdout from tools run within the command block of the task, and finally, the delocalization and Docker shutdown steps.
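
To make that mapping concrete, here is a minimal, hypothetical task annotated with the log stage each piece corresponds to:

version 1.0

task count_lines {
  input {
    File sample_list                      # localization: copied from your bucket into the container
  }
  command <<<
    # stdout printed here shows up in the task log (and in the stdout file)
    echo "Counting lines in ~{sample_list}"
    wc -l ~{sample_list} > line_count.txt
  >>>
  output {
    File line_count = "line_count.txt"    # delocalization: copied back to your bucket
  }
  runtime {
    docker: "ubuntu:20.04"                # Docker setup: this image is pulled and started first
  }
}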

You can also view this log in the Google Cloud Platform console by clicking the link at the bottom.

Troubleshooting-Backend-log_Screen_shot.png

Note: If your log seems to stop abruptly, before reaching the delocalization stage, the task has almost certainly run out of memory. We recommend retrying with more memory to see if your job gets further.

Execution directory 

Clicking this icon takes you to the exact folder in your workspace's Google Cloud Storage bucket where you can find the stderr, stdout, and backend log files. From there, you can open those files to view their contents or download them. If your task generates outputs, this directory is where you can find them as well.

Troubleshooting_Execution-directory_Screen_shot.png

1. taskname.log

A log file tracking the events that occurred while performing the task, such as pulling the Docker image, localizing files, etc. This is the same log mentioned in the previous section. Occasionally a workflow will fail without stderr and stdout files, leaving you with only a task log.

2. stderr and stdout

Standard Error (stderr): A file containing error messages produced by the commands executed in the task. A good place to start for a failed task, as many common task level errors are indicated in the stderr file.

Standard Out (stdout): A file containing log outputs generated by commands in the task. Not all commands generate log outputs and so this file may be empty.

Compute details

This section displays information on the workflow at the Google Pipelines worker level, including timestamps for the execution of worker tasks and VM configuration information. You can use this section to understand or validate the configuration of your worker VM (memory, disk size, machine type, etc.). You can also check this section if you suspect your workflow failed due to a transient Google issue.

Troubleshooting-TaskName-log-in-UI_Screen_shot.png

Workflow Dashboard (middle icon)

Troublehooting_Workflow-Dashboard_Screen_shot.png

The Workflow Dashboard includes 1) the error message and 2) links to Job Manager (described above) and the Execution directory (see the next section).

Execution directory (icon at right)

The Execution directory, which opens in the Google Cloud Platform console, includes a wealth of detail on the API side of things. For more information about what goes on under the hood, see this article.

Troubleshooting_Execution-directory-1_Screen_shot.png

 

Summary

The error examples discussed above are simple and very common. Be sure to check your inputs before launching a workflow to avoid these kinds of failures.

Remember that when troubleshooting, your first stop should be the Job History tab, where you can check the stdout, stderr, and task log files for your failed task.

In cases where there isn't a stdout or stderr file, use the task log and the message explanations in this document to help you solve the problem.

Of course, if you are having any trouble troubleshooting in Terra, you can ask your question on the Terra forum.
