Job seems stuck indefinitely at the delete intermediate files step and does not complete

A couple of days ago I started a large job ($33.35) to process DNA microarray data from 27,778 samples. The job more or less completed in 40 hours.

The OUTPUTS tab has been populated and I can indeed access the output files just fine. However, it is still labeled as Running. My best guess is that it is stuck somewhere in the delete intermediate files step. The main bulk of this operation would certainly be the deletion of 27,778 files created in the first scattered task. This was a task scattered in 580 shards so that effectively those 27,778 files were localized in 580 directories with ~48 files each.

When I looked at those directories about an hour ago, each one contained 30-35 files, for a total of 18,180 files. So roughly a third of the intermediate files have indeed been deleted, but the deletion step must have frozen after that: when I checked again later, the count was still 18,180. I am not sure what went wrong.
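For anyone who wants to repeat this check, here is a minimal sketch of the per-shard count. It assumes you already have a recursive listing of the task's execution directory (for example from `gsutil ls -r`) and that the scattered files sit under `shard-N/` subdirectories. The bucket and task names in the sample listing are made up for illustration.

```python
import re
from collections import Counter

# Assumption: each scattered shard writes into a ".../shard-<index>/" directory.
SHARD_RE = re.compile(r"/(shard-\d+)/")


def count_files_per_shard(object_paths):
    """Count objects per shard directory in a recursive listing."""
    counts = Counter()
    for path in object_paths:
        # Skip directory placeholders and the "dir/:" headers printed by `gsutil ls -r`.
        if path.endswith("/") or path.endswith(":"):
            continue
        match = SHARD_RE.search(path)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical listing, for illustration only.
listing = [
    "gs://example-bucket/run/call-first_task/shard-0/a.txt",
    "gs://example-bucket/run/call-first_task/shard-0/b.txt",
    "gs://example-bucket/run/call-first_task/shard-1/c.txt",
    "gs://example-bucket/run/call-first_task/shard-1/",
]
counts = count_files_per_shard(listing)
total = sum(counts.values())
```

Taking two listings some time apart and comparing `total` is enough to tell whether the deletion is still making progress.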

These are the LABELS for the job:

caas-collection-name "7d031797-0cee-429f-8028-3fc54b388807"

cromwell-workflow-id "cromwell-b5dd7578-a071-4409-84c5-cdf303b8b0c8"

submission-id "2eee2296-b8a5-4ad6-84de-e7a3f914b1f9"

workspace-id "7d031797-0cee-429f-8028-3fc54b388807"

Comments

8 comments

  • Comment author
    Jason Cerrato

    Hi Giulio,

    Thanks for flagging this up. We'll be happy to dig into this a little more and see what's going on here. I'll be in touch with updates!

    Kind regards,

    Jason

  • Comment author
    Giulio Genovese

    It looks like someone did something ... the job used to say "Started Jul 22, 1:20 AM" and now it says "Started: Today, 5:33 PM" and "Ended: Today, 5:34 PM (0h 0m)".

    And it claims to have failed while trying to read some variables in the workflow (that were previously successfully read) with read_tsv() and read_lines() as the files do not exist. The files, of course, do not exist because they are intermediate files and were deleted. It looks as if the job restarted after some of the intermediate files got deleted.

  • Comment author
    Khalid Shakir

    > It looks like someone did something

    > The files, of course, do not exist because they are intermediate files and were deleted. It looks as if the job restarted after some of the intermediate files got deleted.

    Exactly. I tried to restart your workflow, but the restart ran into the newly noticed issue you outlined above: restarting a workflow with partially deleted intermediates fails the workflow.

    Two things:

    • Even though the workflow is now "Failed", the workflow outputs should still be available. Let us know if you can't access the GCS paths via the Web UI, and we can help you use the Terra REST API call "Get workflow outputs", which should work.
    • As you noticed, only some of the intermediate files were deleted by Cromwell. There are still others left in GCS. The current version of "delete intermediates" will NOT go back and try to clean up the rest. So if you want those intermediates deleted, it will need to be through some other procedure for now. We have a feature in our backlog to delete intermediates for previously completed workflows, not only for just-finished workflows, but it's still a ways off.
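One possible "other procedure" is sketched below, as an assumption rather than an official Terra or Cromwell tool. Given a listing of the workflow's execution directory and the paths of the final outputs, it returns the objects that are not final outputs. The function name and sample paths are hypothetical, and the resulting list should be reviewed by hand before anything is deleted.

```python
def leftover_intermediates(object_paths, final_outputs):
    """Return listed objects that are not among the final workflow outputs."""
    keep = set(final_outputs)
    return sorted(
        path
        for path in object_paths
        # Ignore blank lines, directory placeholders, and "dir/:" listing headers.
        if path.strip()
        and not path.endswith(("/", ":"))
        and path not in keep
    )


# Hypothetical paths, for illustration only.
listing = [
    "gs://example-bucket/run/call-first_task/shard-0/a.txt",
    "gs://example-bucket/run/call-first_task/shard-1/c.txt",
    "gs://example-bucket/run/call-last_task/result.tsv",
    "gs://example-bucket/run/call-last_task/:",
    "",
]
outputs = ["gs://example-bucket/run/call-last_task/result.tsv"]
to_delete = leftover_intermediates(listing, outputs)
```

The reviewed list can then be written one path per line and piped to `gsutil -m rm -I`, which reads the objects to remove from standard input.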


  • Comment author
    Giulio Genovese

    Thank you Khalid. I am fine for this particular workflow. I wrote the WDL and I know exactly what it does so I know how to handle this. I am more worried about this happening to other users that will be using this WDL on their data. Do you have a guess about what went wrong while deleting intermediate files? Is it related to the large number of (small) intermediate files?

  • Comment author
    Khalid Shakir

    > Is it related to the large number of (small) intermediate files?

    Yes. The large number of files to delete hit a pathological combo of issues inside Cromwell. This is already being looked at by the Terra/Cromwell team with an early patch already in review, hopefully being released in the next few days.

    The related issue of restarting a workflow in the process of deleting intermediates has not been triaged, so I'm not sure yet what the team has planned for that case.

  • Comment author
    Giulio Genovese

    Okay! There is no immediate hurry on my side. I am very happy to know that a solution is in the making. I know I am pushing the limits of Terra a bit, but overall I have been very positively impressed by what was actually achievable in Terra and I think the users that will end up using this very complicated workflow I wrote will also agree on this. I wish a great weekend to the whole Terra/Cromwell team. :-)

  • Comment author
    Jason Cerrato

    Hi Giulio Genovese,

    Just wanted to let you know that the fix for this issue is live on Terra!

    Kind regards,

    Jason

  • Comment author
    Giulio Genovese

    Super! That's awesome and very timely. Thank you Jason! :-)

