Job seems stuck indefinitely at the delete intermediate files step and does not complete
A couple of days ago I started a large job ($33.35) to process DNA microarray data from 27,778 samples. The job more or less completed in 40 hours:
The OUTPUTS tab has been populated and I can indeed access the output files just fine. However, the job is still labeled as Running. My best guess is that it is stuck somewhere in the delete intermediate files step. The bulk of this operation would certainly be the deletion of the 27,778 files created in the first scattered task. That task was scattered across 580 shards, so those 27,778 files effectively ended up in 580 directories with ~48 files each.
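To give an idea of the shard structure, here is a minimal WDL sketch (the workflow, task, and variable names are made up for illustration; this is not the actual workflow):

```wdl
version 1.0

workflow process_arrays {
  input {
    # hypothetical: 580 batch lists, each naming ~48 of the 27,778 samples
    Array[File] batch_lists
  }

  # One shard per batch: each shard writes its per-sample intermediate
  # files into its own execution directory in the workspace bucket.
  scatter (batch in batch_lists) {
    call process_batch { input: batch = batch }
  }

  output {
    Array[Array[File]] intermediates = process_batch.per_sample_files
  }
}

task process_batch {
  input {
    File batch
  }
  command <<<
    # placeholder work: one small output file per sample in the batch
    while read -r sample; do
      echo "processed" > "${sample}.out"
    done < ~{batch}
  >>>
  output {
    Array[File] per_sample_files = glob("*.out")
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```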
When I looked at those directories about an hour ago, each one contained 30-35 files, for a total of 18,180 files. So roughly a third of these intermediate files had indeed been deleted, but then the deletion step must have frozen, as when I checked again later the number was still 18,180. So I am not sure what went wrong.
These are the LABELS for the job:
caas-collection-name "7d031797-0cee-429f-8028-3fc54b388807"
cromwell-workflow-id "cromwell-b5dd7578-a071-4409-84c5-cdf303b8b0c8"
submission-id "2eee2296-b8a5-4ad6-84de-e7a3f914b1f9"
workspace-id "7d031797-0cee-429f-8028-3fc54b388807"
Comments
Hi Giulio,
Thanks for flagging this up. We'll be happy to dig into this a little more and see what's going on here. I'll be in touch with updates!
Kind regards,
Jason
It looks like someone did something ... the job used to say "Started Jul 22, 1:20 AM" and now it says "Started: Today, 5:33 PM" and "Ended: Today, 5:34 PM (0h 0m)":
And it claims to have failed while trying to read, with read_tsv() and read_lines(), some variables in the workflow that had previously been read successfully, because the files do not exist. The files, of course, do not exist because they are intermediate files and were deleted. It looks as if the job restarted after some of the intermediate files had been deleted.
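To illustrate the failure mode with a minimal sketch (the task and variable names are made up, not taken from the actual workflow): when a workflow reads a task's output file at the workflow level with read_lines() or read_tsv(), a restart evaluates those expressions again, and if the intermediate files backing them have already been deleted, the reads fail and take the workflow down with them:

```wdl
version 1.0

workflow resume_example {
  call make_lists

  # Workflow-level reads of intermediate outputs. On a restart, Cromwell
  # evaluates these expressions again; if the files backing them have
  # already been deleted, read_lines()/read_tsv() fail the workflow.
  Array[String] samples      = read_lines(make_lists.sample_list)
  Array[Array[String]] table = read_tsv(make_lists.metrics_tsv)

  output {
    Int n_samples = length(samples)
    Int n_rows    = length(table)
  }
}

task make_lists {
  command <<<
    printf 'NA12878\nNA12891\nNA12892\n' > samples.txt
    printf 'NA12878\t0.99\nNA12891\t0.98\n' > metrics.tsv
  >>>
  output {
    File sample_list = "samples.txt"
    File metrics_tsv = "metrics.tsv"
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}
```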
> It looks like someone did something
> The files, of course, do not exist because they are intermediate files and were deleted. It looks as if the job restarted after some of the intermediate files got deleted.
Exactly. I tried to restart your workflow, but the restart ran into the newly discovered issue you outlined above: restarting a workflow whose intermediates have been partially deleted fails the workflow.
Two things:
Thank you Khalid. I am fine for this particular workflow. I wrote the WDL and I know exactly what it does, so I know how to handle this. I am more worried about this happening to other users who will be using this WDL on their data. Do you have a guess about what went wrong while deleting intermediate files? Is it related to the large number of (small) intermediate files?
> Is it related to the large number of (small) intermediate files?
Yes. The large number of files to delete hit a pathological combination of issues inside Cromwell. The Terra/Cromwell team is already looking into this, and an early patch is already in review, hopefully to be released in the next few days.
The related issue of restarting a workflow in the process of deleting intermediates has not been triaged, so I'm not sure yet what the team has planned for that case.
Okay! There is no immediate hurry on my side. I am very happy to know that a solution is in the making. I know I am pushing the limits of Terra a bit, but overall I have been very positively impressed by what is actually achievable in Terra, and I think the users who end up using this very complicated workflow I wrote will agree. I wish the whole Terra/Cromwell team a great weekend. :-)
Hi Giulio Genovese,
Just wanted to let you know that the fix for this issue is live on Terra!
Kind regards,
Jason
Super! That's awesome and very timely. Thank you Jason! :-)