How to troubleshoot failed workflows


Learn about basic resources that can help you investigate failed workflows on Terra. This isn't a guide to solving every error, but it can help you diagnose failed submissions. Descriptions of more complicated errors are always welcome on the Terra Forum, where our team is happy to help.

Overview

Workflows fail for many different reasons. If you're lucky, you have the wrong inputs plugged into the configuration, which is typically fast to fail and fast to fix (to learn how to avoid these sorts of errors, see How to configure workflow inputs). More complex errors come from the workflow code: bugs, or limitations in your analysis software package. And of course, you can fall victim to transient errors when something goes wrong on Google Cloud itself (even giants sometimes stumble). The information below will help you as you drill down to find the root cause of more complex errors.

At this point, we assume you have a workspace set up with a data table and workflow configuration loaded. You’ve launched your workflow, but your submission has failed. Don’t despair! Information in the Job History and Job Manager can help you get your workflow up and running. 

To troubleshoot a failed workflow, start with high-level information about the workflow submission and work your way down to workflow- and task-level logs.  

What to do if your workflow seems to be stuck or stalled
If your workflow's status does not progress from "submitted" to "running," or takes longer than expected to run, you may be up against a Google Cloud resource quota. See How to troubleshoot and fix stalled workflows to learn how to solve this issue.

What's going on behind the scenes? For general information about monitoring submissions, see Overview: Job History (monitor and troubleshoot).

To understand what's happening behind the scenes when a workflow is running, see How the workflows system works.

Step 1: Check high-level submission status

Your workspace's Job History tab lists all past workflow submissions.

Screenshot of Job History page with an arrow to a row with a failed submission under the Submissions (click for details) column at left

Clicking on one of these submissions reveals more information about each workflow in the submission, including its high-level status (queued, running, done, or failed, which is shown as a red triangle), workflow ID, and error messages.

This page will help you identify a failed submission. To figure out why it failed, you need to dig deeper. 

What to do if your workflow failed immediately
If your workflow failed right when you submitted it, the problem is almost certainly that Terra could not find an input file. That can happen for a couple of reasons:
    1. If you are using inputs from a data table, the column name that you entered in the workflow's inputs section may not match the column's name in the data table (see the sketch below).
    2. If the file is stored outside of your workspace, you may not have access to the bucket where the data is stored (or your authorization link has expired).

For additional guidance, please see Step 3: Specify inputs from a table in How to configure workflow inputs.
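
To make the first reason above concrete, here is a minimal, hypothetical WDL sketch (the workflow name, input name, and column name are all invented). If a workflow declares the input below and you map it to this.bam_file in the workflow's inputs table, the data table must actually contain a bam_file column whose values point to files you have permission to read.

    version 1.0

    # Hypothetical workflow, trimmed to just its inputs to illustrate the wiring;
    # a real workflow would also contain calls to tasks.
    workflow sample_qc {
      input {
        # In the Terra inputs table this might be mapped to: this.bam_file
        # That only resolves if the data table actually has a "bam_file" column
        # whose values are paths your account can read.
        File bam_file
      }
    }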

Step 2: Check workflow-level status

If you can't find an obvious problem with the submission, you can look for clues about each workflow within the submission. The Job History page for a particular submission lists each workflow's status (failed, queued, etc.), error messages, and links to more information in the Job Manager, the Workflow Dashboard, and the execution directory.

Submission-level details in Job History
Screenshot of submission-level details in the Job History with icons for links to the Job Manager, Dashboard, and execution directory circled
For more detail about a particular (failed) workflow, select one of the three icons at the right.

What to do if you don't see these icons
If your job failed because it never even started (for example, because Terra could not find your input files), you won't see the Job Manager, Workflow Dashboard, and execution directory icons in the "links" column. If this happens, follow the steps in the tip box in Step 1 (What to do if your workflow failed immediately) to check that you correctly specified the file input and that you have access to the file.

Step 3: Check workflow details in the Job Manager

What to do if Job Manager won't load
Job Manager may fail to load if your job produced huge amounts of metadata. In these cases, skip to the Workflow Dashboard (below).

The Job Manager (the checklist icon in the "links" section of the workflow's Job History summary) is your go-to location for a more thorough breakdown of how your workflow was run. Here you can find information about each individual task in the workflow, including:

  1. Error messages
  2. Links to log files, Google Cloud execution directories, and Compute details
  3. A timing diagram

Screenshot of Job Manager page highlighting 1. the failure message (center), 2. log file icons and 3. timing diagram tab (far right)

Check the following for clues by clicking on or hovering over the appropriate icon. 

Error messages: Help identify and investigate which task failed. See below for examples of error messages often found for failed workflows. You can see error messages (listed by task) in the Errors tab (as above), or by hovering over the triangular error icon in the Errors card.

Backend log: A step-by-step report of how each task within the workflow was executed (i.e., Docker setup, localization, stdout from tools run within the command block of the task, and finally, the delocalization and Docker shutdown steps). Click this icon in the Errors card to view the log within Terra, or click the icon in the "log files" section to see the log in the Google Cloud console.

Execution directory: Includes task-level details and generated outputs for a particular workflow within your submission. Found in the Google Cloud console, this is where you can view or download stderr, stdout, and backend logs for each task within the workflow.

Compute details: Information on the workflow at the Google Pipelines worker level, including timestamps for the execution of worker tasks and VM configuration information. Use this section to understand or validate the configuration of your worker VM (memory, disk size, machine type, etc.), or check it if you suspect your workflow failed due to a transient Google issue.

Note: This information is only available for 42 days after the pipeline (VM) started. This is a Google lifecycle policy and there's no workaround to retrieve the data after 42 days.

Common error messages

The message displayed under Failure Message isn't always short and sweet, and, if interpreted incorrectly, it can lead you down the wrong debugging path. Instead, use the message to identify and investigate which task failed.

Below are some common errors and their possible meanings.

Not all of these errors have a straightforward solution, so if you're having trouble diagnosing your error, get in touch with Terra's support team, who can help you work through the message.

  • The maximum time a nonpreemptible PAPI VM can survive is one week (168 hours). This is a default set by the Google Pipelines API. If you run into this error, we recommend increasing the CPU or memory, using larger disks, or using SSDs rather than HDDs to speed up the work so that it doesn't run out of time (the runtime sketch after this list illustrates these changes). Alternatively, you can try to chunk the work into smaller, separate workflows.

  • Check stderr/stdout: the "command block" component of the task failed, generating a nonzero return code. Consult the stderr/stdout file for a stack trace or exception. To learn more about this error, see this article.

  • PAPI error code 9. Please check the log file for more details. To learn more about this error, see Error message: PAPI error code 9.

  • PAPI error code 10. This error means the job exited so abruptly that the machine where the command was running essentially crashed, or (less commonly) that the job was preempted. To rule out the latter possibility, check whether the requested machine was preemptible (the runtime sketch after this list shows how to request a non-preemptible VM).

    To learn more about this error, see Error message: PAPI error code 10.

  • When localizing lots of files at once, the command length in physical characters can get too long, and the task fails with an error about the command length. In one example, a user had thousands of inputs being localized to a task when the workflow failed. To fix an issue like this, you can create a tar.gz of all the input files and provide it as a single input to the workflow, then localize and unzip it within the task (see the tar.gz sketch after this list).

  • This error message is just saying that your command or tool failed in some way, returning a non-zero return code, which Cromwell considers a failure. To troubleshoot, we recommend searching for “error” in the log file for your task. If your command/tool produced a useful error message, you may find a solution by searching for that message in your search engine of choice.

  • This error can occur if you do not have access to a file that your workflow is trying to access. To learn how to troubleshoot this error, see Error message: AccessDeniedException.
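
To illustrate the resource-related fixes above, here is a minimal, hypothetical WDL runtime sketch (the task name, command, and values are invented): more CPU and memory, a larger SSD disk, and preemptible set to 0 to request a non-preemptible VM.

    version 1.0

    # Hypothetical task used only to illustrate runtime tuning.
    task long_running_step {
      input {
        File input_file
      }
      command <<<
        # your real tool command goes here
        wc -l ~{input_file}
      >>>
      output {
        File line_count = stdout()
      }
      runtime {
        docker: "ubuntu:20.04"
        cpu: 16                       # more cores can shorten wall-clock time for multithreaded tools
        memory: "64 GB"               # more memory for memory-hungry tools
        disks: "local-disk 500 SSD"   # SSD rather than HDD speeds up I/O
        preemptible: 0                # 0 requests a non-preemptible VM (rules out preemption)
      }
    }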
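
And here is a rough sketch of the tar.gz workaround; the task name, file names, and commands are again hypothetical, and the ls | wc line stands in for your real analysis command.

    version 1.0

    # Hypothetical sketch of the tar.gz workaround: pass one archive
    # instead of thousands of individual File inputs.
    task process_bundled_inputs {
      input {
        File inputs_tar_gz   # e.g., created beforehand with: tar -czf inputs.tar.gz *.vcf
      }
      command <<<
        set -euo pipefail
        mkdir inputs
        tar -xzf ~{inputs_tar_gz} -C inputs   # unpack inside the task's container
        ls inputs | wc -l                     # stand-in for your real analysis command
      >>>
      output {
        File file_count = stdout()
      }
      runtime {
        docker: "ubuntu:20.04"
        disks: "local-disk 200 HDD"   # size the disk for the archive plus the extracted files
      }
    }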

Backend log

If it's not immediately obvious what failed, the best sources of information are log files.

Access the backend log directly from the Job Manager interface by clicking on the icon with a cloud superimposed on a page in the "errors" card. This gives a step-by-step report of actions during the execution of the task. These details include information about Docker setup, localization (the step of copying files from your Google bucket into the Docker container), stdout from tools run within the command block of the task, and finally, the delocalization and Docker shutdown steps.

If there were problems with any of these steps, you will see them in the backend log. 

You can also see this file in the Google Cloud console by clicking the link at the bottom.

Screenshot of backend log file with an arrow to the link to the file in the Google Cloud console (bottom right)

If your log stopped abruptly
Some log files seem to stop abruptly without reaching the delocalization stage. This is almost certainly because the task ran out of memory. We recommend retrying with more memory to see if your job gets further. See Out Of Memory Retry to learn more about how to configure your workflow to immediately retry certain tasks if the only error was running out of memory.
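
A quick manual fix is to raise the task's memory runtime attribute and resubmit. The automatic retry behavior described in the linked article also depends on the task allowing retries (via maxRetries). The sketch below is hypothetical, with invented names and values.

    version 1.0

    # Hypothetical sketch: give a memory-hungry task more headroom and allow a retry.
    task memory_hungry_step {
      input {
        File input_file
      }
      command <<<
        # your real tool command goes here
        sort ~{input_file} > sorted.txt
      >>>
      output {
        File sorted = "sorted.txt"
      }
      runtime {
        docker: "ubuntu:20.04"
        memory: "32 GB"   # raised after the out-of-memory failure
        maxRetries: 1     # lets the task be retried; the out-of-memory retry feature
                          # described in the linked article builds on this attribute
      }
    }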

Execution directory

The execution directory is the folder in your workspace bucket that contains the task-level details and generated outputs for a particular workflow within your submission. For example, did you run a workflow on two different samples, and one failed while the other succeeded? Access the execution directory to help find a solution for the one that failed.

To learn more, see Overview: Execution directory

Compute details

This displays information on the workflow at the Google Pipelines worker level, including timestamps for the execution of worker tasks and VM configuration information. Use this section to understand or validate the configuration of your worker VM (memory, disk size, machine type, etc.). Or, check this section if you suspect your workflow failed due to a transient Google issue.

Screenshot of TaskName log in Terra UI

Timing diagram

This is a visual representation of what Terra did as it ran the workflow - how much actual clock time was spent on commands within each task. It can help you understand what to look into further, especially if any part took much longer than expected. 

Alternate Step 3: Workflow Dashboard (middle icon)

If Job Manager fails to load (e.g., if your job produced huge amounts of metadata), you can access much of the same information in the Workflow Dashboard.

Screenshot of Workflow Dashboard in Job History tab with the error message (1) and links to error message, Job Manager, and execution directory highlighted
The Workflow Dashboard includes 1) error messages and 2) links to the Job Manager (above) and the execution directory (see section below).

Remember - when troubleshooting, head to the Job History tab and check the stdout, stderr, and task log files for your failed task.

If there isn't a stdout or stderr file, use the task log and the message explanations in this document to help solve the problem. Of course, if you get stuck while troubleshooting, ask your question on the Terra Forum.

 
