Terra sometimes misreports workflow times Answered

Edited November 24, 2020 00:53
7 comments

I ran the TOPMed Aligner WDL on 10 controlled-access CRAMs to benchmark costs as part of BioData Catalysts W3.1's benchmarking goal. I noticed the time that some steps took to complete seemed off, so I manually checked the logs alongside my team. In at least four instances the logs indicate that the time that the tasks actually took to execute was far above what is shown on Terra's UI. All of them (highlighted yellow) should have a reported time of about 24 hours or so, but they're being reported as either about 8 hours or as no time at all.

(The other zeros in the pre-align step are likely not directly related to this issue; according to the logs they were preempted within about a minute or two. They are still a bit concerning because it's unusual for Google to preempt disks so quickly and so often, especially when compared to the align step's preemptibles. But since that has little effect on cost or time it takes to execute a workflow I think that's less of priority.)

If these misreports are also what's going through FISS, then this is a likely culprit to the workflow cost estimator notebook under development by Brian Hannafious being about $50 off for this submission's total $80ish cost. That notebook is being used to generate estimates for Biodata Catalyst and AnVIL users, so there is a practical reason for these time estimates being correctly reported.

Comments

7 comments

Jason Cerrato
- November 24, 2020 14:06
Hi Ash O'Farrell,

Thanks for writing in—we'll be happy to take a closer look at this.

Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.

1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter on your keyboard.
2. Click Save.

Let us know the workspace name or share a link. Please also let us know the submission IDs, workflow IDs, and tasks that correspond to the questionable time entries. We''ll investigate as soon as we can.

Kind regards,

Jason

0
Ash O'Farrell
- November 24, 2020 16:13
Alrighty, that account has been added. Like I mentioned earlier it has controlled access data so it has an auth domain. The workspace is named Aligner Cost Estimator and the workflow submission in question is 5b3f2466-8267-4f0f-a760-2ee72570e01f.

0
Jason Cerrato
- November 24, 2020 16:18
Hi Aisling,

I had missed that point about it being controlled-access. Can you provide the workflow ID(s) associated with these strange costs and point to which IDs/tasks correspond to which odd value?

Jason

0
Ash O'Farrell
- Edited November 24, 2020 18:34
Submission ID: 5b3f2466-8267-4f0f-a760-2ee72570e01f

78e5da68-7b30-4a77-8be6-b1df9aa8a6a9

In Pre-Align, a disk is preempted immediately (weird but not a big concern). On the second attempt, also on a preeempt, it is reported that the run was successful and it took 0h and 0m. However, the log indicates it took quite a bit longer.

2020/11/18 20:34:13 Starting container setup.

[...]

2020/11/19 18:37:02 Done delocalization.

9b37305a-1501-4fd7-b7ff-d308827b74d6

Similar story to the above except the preempts failed thrice and the reported 0h 0m successful run was on a nonpreempt.

2020/11/18 15:44:45 Starting container setup.

[...]

2020/11/19 15:20:53 Done delocalization.

8b6e9afc-b222-474f-9c48-b3ca4c19bb4a

This one is different than the other two in that it seems like all steps are misreported, not just the pre-align step. Here, pre-align is reported as 0h 0m but the log shows:

2020/11/17 20:38:16 Starting container setup.

[...]

2020/11/18 19:57:39 Done delocalization.

The align step doesn't have a complete log in Terra's UI. But based on the parts that are visible ([main] Real time: 126.273 sec; CPU: 1444.153 sec and the fact it took a few minutes to delocalize) indicates it shouldn't be 0h 0m. However since the final cost of this submission is low, it is possible it was indeed unusually quick.

The post-align step is also incorrect, but in a different way. Rather than reporting 0h 0m it reports 7h 49m. However, the correct time should be closer to 22 hours.

2020/11/19 07:08:47 Starting container setup.

[...]

2020/11/20 04:51:31 Done delocalization.

6ab77bf2-2780-4735-bbfb-53b9fb70631a

This one is similar to the one immediately above -- it seems every estimate, for every task, is underreported.

There might be more but these are the four out of ten that stuck out to me and my team.

0
Jason Cerrato
- November 24, 2020 20:40
Hi Aisling,

Perfect, thank you so much for this detailed breakdown. One last ask since we can't actually see your workspace—can you send us screenshots for these four cases where you're seeing the runtime showing as a different number than expected?

We've already validated your first case and are looking to do the rest. Having screenshots to assist will be helpful for when we bring this to our engineers for a deeper investigation.

Kind regards,

Jason

0
Ash O'Farrell
- November 24, 2020 20:47
78e5da68-7b30-4a77-8be6-b1df9aa8a6a9

9b37305a-1501-4fd7-b7ff-d308827b74d6

8b6e9afc-b222-474f-9c48-b3ca4c19bb4a

6ab77bf2-2780-4735-bbfb-53b9fb70631a

Also -- all of these are run on full-sized CRAM files from the same TOPMed project, so it seems unlikely this is an artefact of running on unharmonized or downsampled data.

0
Samantha (she/her)
- November 25, 2020 17:37
Hi Ash O'Farrell,

Thanks for reporting this. I've created a bug ticket for our engineering team to further investigate the issue.

Best,

Samantha

0