Cromwell workflow getting aborted intermittently without any exception
Hi Team,
We are facing another issue with large-file (68 GB) processing: the workflow gets aborted all of a sudden, without any exception, on AWS-enabled Cromwell. The behaviour is intermittent; sometimes a run succeeds and sometimes it fails on another attempt.
We have tried increasing the memory assigned to the Docker container and the Xms values, but with no luck. We also tried increasing the timeout settings in the Cromwell config.
Most of the time it happens in the SamSplitter (used to split the large file) and SamToFastqAndBwaMemAndMba tasks. I'm also copying the timeout settings we maintain in the Cromwell config below.
Please help us resolve this issue.
Timeout Setting in Cromwell config:
akka {
  http {
    server {
      request-timeout = 1800s
      idle-timeout = 2400s
    }
    client {
      request-timeout = 1800s
      connecting-timeout = 300s
    }
  }
}
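For context, the memory increase mentioned above was made in the task runtime sections, roughly along the lines of the sketch below (the values here are illustrative, not our exact settings):

runtime {
  docker: "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386"
  memory: "32 GB"   # illustrative value, bumped up from the pipeline default for the 68 GB input
  cpu: 4            # illustrative value
}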
Workflow log stopping without any exception, even while a process was still running:
[2019-05-03 14:34:00,51] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Initializing to Running
[2019-05-03 15:27:11,95] [info] AwsBatchAsyncBackendJobExecutionActor [461a6066UBTAB.CQYM:0:1]: Status change from Running to Succeeded
[2019-05-03 15:43:25,43] [info] Workflow polling stopped
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] 0 workflows released by cromid-abdb07d
[2019-05-03 15:43:25,46] [info] Aborting all running workflows.
[2019-05-03 15:43:25,46] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-05-03 15:43:25,47] [info] JobExecutionTokenDispenser stopped
[2019-05-03 15:43:25,47] [info] WorkflowStoreActor stopped
[2019-05-03 15:43:25,47] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-05-03 15:43:25,47] [info] WorkflowManagerActor Aborting all workflows
[2019-05-03 15:43:25,47] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: Aborting workflow
[2019-05-03 15:43:25,47] [info] WorkflowLogCopyRouter stopped
[2019-05-03 15:43:25,47] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: Aborting workflow
[2019-05-03 15:43:25,47] [info] c68f0e1b-9998-4cca-9749-40586e3d097f-SubWorkflowActor-SubWorkflow-SplitRG:0:1 [c68f0e1b]: Aborting workflow
[2019-05-03 15:43:25,55] [info] Attempted CancelJob operation in AWS Batch for Job ID fd00c634-7a5f-453c-afad-9dbdead31a91. There were no errors during the operation
[2019-05-03 15:43:25,55] [info] We have normality. Anything you still can't cope with is therefore your own problem
[2019-05-03 15:43:25,55] [info] https://www.youtube.com/watch?v=YCRxnjE7JVs
[2019-05-03 15:43:25,55] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: AwsBatchAsyncBackendJobExecutionActor [c68f0e1b:SLRG.SamSplitter:NA:1] Aborted StandardAsyncJob(fd00c634-7a5f-453c-afad-9dbdead31a91)
[2019-05-03 15:56:31,10] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Running to Succeeded
[2019-05-03 15:56:31,81] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: WorkflowExecutionActor [461a6066] aborted: SubWorkflow-SplitRG:0:1
[2019-05-03 15:56:32,09] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: WorkflowExecutionActor [155c13d0] aborted: SubWorkflow-UBTAB:-1:1
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows are aborted
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows finished
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor stopped
[2019-05-03 15:56:33,07] [info] Connection pools shut down
[2019-05-03 15:56:33,07] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] SubWorkflowStoreActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,07] [info] JobStoreActor stopped
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-05-03 15:56:33,08] [info] DockerHashActor stopped
[2019-05-03 15:56:33,08] [info] IoProxy stopped
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] WriteMetadataActor Shutting down: 72 queued messages to process
[2019-05-03 15:56:33,08] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,09] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,09] [info] WriteMetadataActor Shutting down: processing 0 queued messages
[2019-05-03 15:56:33,09] [info] ServiceRegistryActor stopped
[2019-05-03 15:56:33,11] [info] Database closed
[2019-05-03 15:56:33,11] [info] Stream materializer shut down
[2019-05-03 15:56:33,12] [info] WDL HTTP import resolver closed
Comments
Hi ssb_cromwell - I am not familiar with AWS-enabled Cromwell. I am asking around for help and we will get back to you soon.
Thanks, @TiffanyMiller, we urgently need your help on this blocker and will be waiting for your response.
Hi ssb_cromwell -
This looks new to us.
We haven't tested the AWS backend with large files, so it's quite possible that you have uncovered a genuine bug. I'll create a GitHub issue where you can follow progress: https://github.com/broadinstitute/cromwell/issues/4960.
Because Cromwell's AWS backend is still in alpha, I can't promise any timeline on getting a fix out (but @rmunshi might have more of an idea).
One possible idea did occur to me based on your logs: this appears to be a race condition between the final output copying and Cromwell's run-mode shutdown logic (I'm assuming you're running Cromwell like "java -jar cromwell.jar run x.wdl ..."?). If that is the case, you might be able to make progress by using Cromwell in server mode, which should let you avoid Cromwell shutting itself down too soon after the workflow completes.
Here's a link to the Cromwell docs discussing run mode vs server mode: https://cromwell.readthedocs.io/en/develop/Modes/#server
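As a rough sketch (paths, port, and heap size below are placeholders, adjust for your environment), server mode looks something like:

java -Xmx4g -Dconfig.file=/path/to/your.conf -jar cromwell-<version>.jar server

and then the workflow is submitted over the REST API (or via the Swagger UI at http://localhost:8000), e.g.:

curl -X POST "http://localhost:8000/api/workflows/v1" \
  -F workflowSource=@x.wdl \
  -F workflowInputs=@x.inputs.json

That way the server process stays up after the workflow finishes, instead of tearing itself down the way run mode does.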
Hi @Chris,
Based on your suggestion, I tried running the workflow twice against Cromwell in server mode via the Swagger UI.
But again it is getting stopped abruptly without any exception. Please find the log below for your reference.
Do you think it could be due to a lack of memory or compute resources when running with a large file? If so, why does it sometimes process SamSplitter successfully and sometimes stop partway through?
Moreover, we are not executing anything on the Cromwell server host itself; we are creating Batch jobs which have their own compute and memory assignments. So why would this issue occur? Please suggest!
Hi @ssb_cromwell,
It is strange that your logs say the docker image for this task is in the Google Container Registry. Can you please confirm whether you're using a docker image from DockerHub or Quay? Thanks!
2019-05-11 08:09:02,405 INFO - dockerImage: us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386
Hi @Ruchi Munshi,
Thanks for your response!
Yes, we are using the docker image below. Do you see anything wrong with it? Should I be using it some other way? Please suggest!
docker: "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386"
Regards,
Hey @ssb_cromwell,
I didn't expect the Google docker images to work on the Amazon backend -- it might be best to use a similar docker image hosted in DockerHub:
https://github.com/broadinstitute/cromwell/blob/develop/centaur/src/main/resources/integrationTestCases/germline/single-sample-workflow/processing-for-variant-discovery-gatk4.hg38.wgs.aws.inputs.json#L37
"broadinstitute/genomes-in-the-cloud:2.3.0-1501082129"
This is a general best practice and may be unrelated to your issue.
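In practice that just means pointing the task's runtime section at the DockerHub image instead, e.g. (sketch only; keep your other runtime attributes as they are):

runtime {
  docker: "broadinstitute/genomes-in-the-cloud:2.3.0-1501082129"
  ...
}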
Hi @Ruchi Munshi, As suggested, I tried using the DockerHub docker image rather than the one from GCR.
But to our surprise, this time it didn't even get past the SamSplitter task, which used to succeed with the GCR docker image. Please find the logs below for your reference.
To give you further detail, we are running Cromwell on EC2 (t2.medium) and using AWS Batch to spin up the docker containers and process the file. Currently, we are trying to process a whole-genome file of about 68 GB. Could you please advise whether we should be trying some other config or parameters?
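In case it helps anyone reproducing this setup: the backend section of a Cromwell AWS Batch config typically looks roughly like the following (a sketch based on the documented AWS example; the bucket and queue ARN are placeholders, not a verbatim copy of our config):

backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        root = "s3://<our-bucket>/cromwell-execution"
        auth = "default"
        default-runtime-attributes {
          queueArn = "arn:aws:batch:<region>:<account-id>:job-queue/<our-queue>"
        }
      }
    }
  }
}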
Error while using the DockerHub docker image rather than the GCR one:
2019-05-11 19:55:19,509 cromwell-system-akka.dispatchers.backend-dispatcher-120 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from - to Initializing
2019-05-11 19:55:24,762 cromwell-system-akka.dispatchers.backend-dispatcher-120 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from Initializing to Running
2019-05-11 20:40:22,960 cromwell-system-akka.dispatchers.backend-dispatcher-1151 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from Running to Failed
2019-05-11 20:50:32,644 cromwell-system-akka.dispatchers.backend-dispatcher-1151 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(b1d542d6)UBTAB.CQYM:0:1]: Status change from Running to Succeeded
2019-05-11 20:50:35,164 cromwell-system-akka.dispatchers.engine-dispatcher-24 ERROR - WorkflowManagerActor Workflow cc34fddf-416b-498b-bc53-6349a3498a32 failed (during ExecutingWorkflowState): Job SLRG.SamSplitter:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: s3://cromw/ce/WGSS/cc34fd498b-bc53-6349a3498a32/call-UBTAB/UBTAB/b1d542d6-5aef-2c8f432abb9/call-SplitRG/shard-0/SLRG/1aec4219-e35b-47b4-50fb5c3039/call-SamSplitter/SamSplitter-stderr.log.
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/cromw/ce/WGSS/cc34fd498b-bc53-6349a3498a32/call-UBTAB/UBTAB/b1d542d6-5aef-2c8f432abb9/call-SplitRG/shard-0/SLRG/1aec4219-e35b-47b4-50fb5c3039/call-SamSplitter/tmp.80418deb
USAGE: PicardCommandLine <program name> [-h]
Available Programs:
--------------------------------------------------------------------------------------
Alpha Tools: Tools that are currently UNSUPPORTED until further testing and maturation.
    CollectIndependentReplicateMetrics      Estimates the rate of independent replication of reads within a bam.
    CollectWgsMetricsWithNonZeroCoverage    Collect metrics about coverage and performance of whole genome sequencing (WGS) experiments.
    UmiAwareMarkDuplicatesWithMateCigar     Identifies duplicate reads using information from read positions and UMIs.
--------------------------------------------------------------------------------------
Fasta: Tools for manipulating FASTA, or related data.
    CreateSequenceDictionary                Creates a sequence dictionary for a reference sequence.
    ExtractSequences                        Subsets intervals from a reference sequence to a new FASTA file.
    NonNFastaSize                           Counts the number of non-N bases in a fasta file.
    NormalizeFasta                          Normalizes lines of sequence in a FASTA file to be of the same length.
--------------------------------------------------------------------------------------
Fingerprinting Tools: Tools for manipulating fingerprints, or related data.
    CheckFingerprint                        Computes a fingerprint from the supplied input (SAM/BAM or VCF) file and compares it to the provided genotypes
    ClusterCrosscheckMetrics                Clusters the results of a CrosscheckFingerprints run by LOD score.
    CrosscheckFingerprints                  Checks if all fingerprints appear to come from the same individual.
    CrosscheckReadGroupFingerprints         DEPRECATED: USE CrosscheckFingerprints. Checks if all read groups appear to come from the same individual.
--------------------------------------------------------------------------------------
Illumina Tools: Tools for manipulating data specific to Illumina sequencers.
    CheckIlluminaDirectory                  Asserts the validity for specified Illumina basecalling data.
Thanks for trying out server mode; sorry to hear that it didn't work out. I've updated https://github.com/broadinstitute/cromwell/issues/4960 with the results. Unfortunately, it sounds like we do have a bug in the delocalization process, and we'll need to investigate that fully.
Thanks -
Chris
Hi @Ruchi Munshi, Could you please provide any further suggestions on this? Let me know if I need to run it some other way.
@Chris, thanks for updating the case; I hope we will get some clue soon!
Regards!
I just want to report another instance of the failure: the workflow triggered around 14 Batch jobs and their status changed to "Running", but then all of a sudden the Cromwell server exited.
When I checked the Batch dashboard, all 14 jobs had succeeded, taking around 2 hours per job, but there was no Cromwell server left to capture the results.
Please suggest whether this is a lack of resources or something else.
Hi Team,
Any updates, please!
Regards
Satwinder Singh
Hi @ssb_cromwell, can you give us a few more weeks to figure out the issue? The person who is looking into it is on vacation this week.
Thanks,
Saloni
@ssb_cromwell This server shutdown is really difficult to debug without a concrete error message, so I'm mainly going off experiences I've seen elsewhere until an actual error message can be located.
> getting stopped abruptly without any exception
> we are running Cromwell on EC2 (t2.medium)
A t2.medium has 4 GB of memory. Do you have the command line you're using to start Cromwell in server mode? I'm especially curious what `-Xmx` setting is being used along with the t2.medium, and whether a larger instance and/or different `-Xmx` settings would help.
Because of the "clean shutdown" messages from Cromwell I'm assuming something external is sending Cromwell a SIGINT or more likely a SIGTERM. The SIGTERMs are sent from things like the Linux OOM Killer, that should be visible/logged somewhere.
Otherwise, if Cromwell is shutting down due to an internally generated signal, it's tough to understand why without more logs. If we're truly stuck, we may need to update https://broadworkbench.atlassian.net/browse/BA-4960 to add logging for where the shutdown signals are originating, internally or externally.
Lastly re: the shutdowns, as the AWS backend is in active development, I would make sure you're trying the latest Cromwell 42 to see if that helps too.
Separately, regarding:
> Error while using the DockerHub docker image rather than the GCR one
> USAGE: PicardCommandLine <program name> [-h]
This is a specific error that occurs when the docker image doesn't match what the task expects to run via its command block. If you haven't already, please split this into a new issue so that debugging your WDL task + docker doesn't get lost in the Cromwell server shutdown discussion.
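As an illustration only (this is a sketch, not your actual task): SamSplitter-style tasks generally call Picard via a path baked into the image, along the lines of:

command {
  java -Xms3000m -jar /usr/gitc/picard.jar \
    SplitSamByNumberOfReads \
    INPUT=${input_bam} \
    ...
}

If the image you switch to lays Picard out differently, or ships a Picard build that doesn't recognize the requested subcommand, Picard falls back to printing its usage/program listing and exits 1, which matches the stderr you posted.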
Thanks for your patience with all of this.
-k