We can probably all agree that the main point of running analyses in the form of workflows is to automate as much of the work as you can. So when something goes wrong, you want to be able to identify the nature of the problem as quickly and easily as possible.
Today we want to draw your attention to some important improvements that were recently made to the Job Manager interface, which we believe will make it easier to deal with workflow execution issues on Terra. Read on for details.
As a special note to those of you still using the original FireCloud portal interface, we expect you'll appreciate that “operation details” are finally accessible in the Terra interface via the backend log. Between that and the other refinements to Job Manager detailed below, we're getting much closer to parity with the FireCloud interface. So if that aspect of the platform is what has been keeping you from making the hop over to Terra, we hope you'll check this out and let us know what you think. Our goal is to get the Terra interface to the point where you feel you can easily do everything you've been doing in FireCloud -- and more!
Troubleshooting failed workflows
Workflows can fail for a lot of different reasons. If you're lucky, you just have the wrong inputs plugged into the configuration, which is typically fast to fail and fast to fix. More complex are errors in the workflow code, bugs, or limitations in the analysis software package you're calling on. And of course you can fall victim to transient errors when something goes wrong on Google Cloud itself (even giants sometimes stumble).
If it's not immediately obvious what failed, the best sources of information are log files, which you can now conveniently access directly in the Job Manager interface. Here's how you do it:
- The Job History tab gives high-level status on the success or failure of your submissions. To help understand the root cause, click into the submission to see workflow-level errors:
- You’re now looking at the workflow or workflows within the submission. Click the “View” link to open the Job Manager interface, which details errors about specific workflows:
- Once in the Job Manager interface, you can preview the tail end of your logs by clicking on the icons.
From the list view:
From the card view:
- Then expand to the full log by clicking on the link in the preview.
Functions of the log icons (from left to right):
- Backend (Cromwell) log - A step-by-step report of actions during the execution of the task. These details include information about Docker setup, localization (the step of copying files from your Google bucket into the Docker container), stdout from tools run within the command block of the task, and finally, the delocalization and Docker shutdown steps.
- Execution directory - Clicking on this icon will redirect you to the exact folder/directory where you can find your stderr, stdout, and backend logs in the Google Cloud Storage bucket. From there, you can open those files to view their contents or you can download them. If your task generates outputs, this directory is where you can find them as well.
- Compute details - A report of the actions taken on the Google side by the Pipelines API (PAPI) to execute the task, including things like the request we send to Google, the exact events as tracked by Google, timestamps of what happened when, and if there were errors. This is where you would find information about errors that are unrelated to the WDL code or configuration. This information is great for debugging when failures happen before a task starts or after a task completes.
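If you prefer to inspect logs outside the browser, the idea behind the execution directory and the log preview can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket path below is made up, the helper names `log_urls` and `tail` are hypothetical, and the only file names used are the `stdout` and `stderr` files mentioned above.

```python
from collections import deque


def log_urls(execution_dir: str) -> dict:
    """Build the URLs of the stdout and stderr files inside an execution directory."""
    base = execution_dir.rstrip("/")
    return {name: f"{base}/{name}" for name in ("stdout", "stderr")}


def tail(text: str, n: int = 10) -> list:
    """Return the last n lines of a log, similar in spirit to the log preview."""
    return list(deque(text.splitlines(), maxlen=n))


# Hypothetical execution directory; use the one the "Execution directory" icon opens.
example_dir = "gs://my-bucket/my-submission/my-workflow/call-my_task/"
urls = log_urls(example_dir)

# Stand-in for the contents of a downloaded log file.
sample_log = "\n".join(f"line {i}" for i in range(1, 101))
preview = tail(sample_log, n=3)

print(urls["stderr"])
print(preview)
```

Once you have downloaded a log from the bucket, you would pass its contents to `tail` in place of `sample_log`.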
Abort a specific workflow
If you have one submission with several workflows where perhaps just one or a handful of workflows are misbehaving (they look like they're stuck at an intermediate step, for example), you might want to abort the one problem workflow without stopping the entire submission. Job Manager now has the ability to do exactly that, through the action called Abort Job.
- Click into your submission (from the Job History tab), which will list all of the workflows so you can isolate the anomalous one.
- Press “View” on the one you want to abort, and use the “Abort Job” button. This stops that workflow without affecting the rest of the submission.
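When a submission contains many workflows, it can help to decide up front what "stuck" means. The sketch below shows one possible rule: flag any workflow that is still running after a time limit you choose. It operates on hand-written records rather than on anything retrieved from Terra; the record fields (`id`, `status`, `start`), the `find_stuck` helper, and the six-hour limit are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stuck(workflows, now, max_runtime):
    """Return the IDs of workflows still running after more than max_runtime."""
    return [
        wf["id"]
        for wf in workflows
        if wf["status"] == "Running" and now - wf["start"] > max_runtime
    ]


# Hypothetical snapshot of one submission, copied by hand from the workflow list.
now = datetime(2019, 6, 1, 12, 0, tzinfo=timezone.utc)
workflows = [
    {"id": "wf-a", "status": "Succeeded", "start": now - timedelta(hours=9)},
    {"id": "wf-b", "status": "Running", "start": now - timedelta(hours=8)},
    {"id": "wf-c", "status": "Running", "start": now - timedelta(hours=1)},
]

stuck = find_stuck(workflows, now, max_runtime=timedelta(hours=6))
print(stuck)
```

The workflows flagged this way are the candidates to open with “View” and stop with “Abort Job”; the rest of the submission keeps running.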
Coming Soon - workflow details pages that load faster
In case you're curious, we're currently working on some improvements to the machinery that retrieves information about workflows from Terra's internal database, to make the main workflow details pages load faster. We expect to be ready to release those improvements soon, and in a future blog post we're planning to share some benchmarking results that will give you a more concrete sense of what to expect.