Cromwell workflow getting aborted intermittently without any exception
Hi Team,
We are facing another issue with large-file (68 GB) processing: the workflow gets aborted all of a sudden, without any exception, on AWS-enabled Cromwell. The behaviour is intermittent; sometimes a run succeeds and sometimes it fails on another attempt.
We have tried increasing the memory assigned to the Docker container and the Xms values, but with no luck. We also tried increasing the timeout settings in the Cromwell config.
Most of the time it happens in the SamSplitter (used to split the large file) and SamToFastqAndBwaMemAndMba tasks. I'm also copying the timeout settings we maintain in the Cromwell config below.
Please help us resolve this issue.
Timeout Setting in Cromwell config:
akka {
  http {
    server {
      request-timeout = 1800s
      idle-timeout = 2400s
    }
    client {
      request-timeout = 1800s
      connecting-timeout = 300s
    }
  }
}
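For context, the memory increase mentioned above was made in the task runtime sections, roughly along the lines of the sketch below (the values here are illustrative, not our exact settings):

runtime {
  docker: "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386"
  memory: "32 GB"   # illustrative value, bumped up from the pipeline default for the 68 GB input
  cpu: 4            # illustrative value
}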
Workflow log stopping without any exception, even while a process was still running:
[2019-05-03 14:34:00,51] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Initializing to Running
[2019-05-03 15:27:11,95] [info] AwsBatchAsyncBackendJobExecutionActor [461a6066UBTAB.CQYM:0:1]: Status change from Running to Succeeded
[2019-05-03 15:43:25,43] [info] Workflow polling stopped
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] 0 workflows released by cromid-abdb07d
[2019-05-03 15:43:25,46] [info] Aborting all running workflows.
[2019-05-03 15:43:25,46] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-05-03 15:43:25,47] [info] JobExecutionTokenDispenser stopped
[2019-05-03 15:43:25,47] [info] WorkflowStoreActor stopped
[2019-05-03 15:43:25,47] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-05-03 15:43:25,47] [info] WorkflowManagerActor Aborting all workflows
[2019-05-03 15:43:25,47] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: Aborting workflow
[2019-05-03 15:43:25,47] [info] WorkflowLogCopyRouter stopped
[2019-05-03 15:43:25,47] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: Aborting workflow
[2019-05-03 15:43:25,47] [info] c68f0e1b-9998-4cca-9749-40586e3d097f-SubWorkflowActor-SubWorkflow-SplitRG:0:1 [c68f0e1b]: Aborting workflow
[2019-05-03 15:43:25,55] [info] Attempted CancelJob operation in AWS Batch for Job ID fd00c634-7a5f-453c-afad-9dbdead31a91. There were no errors during the operation
[2019-05-03 15:43:25,55] [info] We have normality. Anything you still can't cope with is therefore your own problem
[2019-05-03 15:43:25,55] [info] https://www.youtube.com/watch?v=YCRxnjE7JVs
[2019-05-03 15:43:25,55] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: AwsBatchAsyncBackendJobExecutionActor [c68f0e1b:SLRG.SamSplitter:NA:1] Aborted StandardAsyncJob(fd00c634-7a5f-453c-afad-9dbdead31a91)
[2019-05-03 15:56:31,10] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Running to Succeeded
[2019-05-03 15:56:31,81] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: WorkflowExecutionActor [461a6066] aborted: SubWorkflow-SplitRG:0:1
[2019-05-03 15:56:32,09] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: WorkflowExecutionActor [155c13d0] aborted: SubWorkflow-UBTAB:-1:1
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows are aborted
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows finished
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor stopped
[2019-05-03 15:56:33,07] [info] Connection pools shut down
[2019-05-03 15:56:33,07] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] SubWorkflowStoreActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,07] [info] JobStoreActor stopped
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-05-03 15:56:33,08] [info] DockerHashActor stopped
[2019-05-03 15:56:33,08] [info] IoProxy stopped
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] WriteMetadataActor Shutting down: 72 queued messages to process
[2019-05-03 15:56:33,08] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,09] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,09] [info] WriteMetadataActor Shutting down: processing 0 queued messages
[2019-05-03 15:56:33,09] [info] ServiceRegistryActor stopped
[2019-05-03 15:56:33,11] [info] Database closed
[2019-05-03 15:56:33,11] [info] Stream materializer shut down
[2019-05-03 15:56:33,12] [info] WDL HTTP import resolver closed
Comments
Hi ssb_cromwell - I am not familiar with AWS-enabled Cromwell. I am asking around for help and we will get back to you soon.
Thanks, @TiffanyMiller, we urgently need your help on this blocker and will be waiting for your response.
Hi ssb_cromwell -
This looks new to us.
We haven't tested the AWS backend with large files, so it's quite possible that you have uncovered a genuine bug. I'll create a GitHub issue where you can follow progress: https://github.com/broadinstitute/cromwell/issues/4960.
Because Cromwell's AWS backend is still in alpha, I can't promise any timeline on getting a fix out (but @rmunshi might have more of an idea).
One possible idea did occur to me based on your logs: this appears to be a race condition between the final output copying and Cromwell's run-mode shutdown logic (I'm assuming you're running Cromwell like "java -jar cromwell.jar run x.wdl ..."?). If that is the case, you might be able to make progress by using Cromwell in server mode, which should let you avoid Cromwell shutting itself down too soon after the workflow completes.
Here's a link to the Cromwell docs discussing run mode vs server mode: https://cromwell.readthedocs.io/en/develop/Modes/#server
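As a rough sketch (paths, port, and heap size below are placeholders, adjust for your environment), server mode looks something like:

java -Xmx4g -Dconfig.file=/path/to/your.conf -jar cromwell-<version>.jar server

and then the workflow is submitted over the REST API (or via the Swagger UI at http://localhost:8000), e.g.:

curl -X POST "http://localhost:8000/api/workflows/v1" \
  -F workflowSource=@x.wdl \
  -F workflowInputs=@x.inputs.json

That way the server process stays up after the workflow finishes, instead of tearing itself down the way run mode does.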
Hi @Chris,
Based on your suggestion, I tried running the workflow twice against Cromwell in server mode via the Swagger UI.
But again it is getting stopped abruptly without any exception. Please find the log below for your reference.
Do you think it could be due to a lack of memory or compute resources when running with a large file? If so, why does it sometimes process SamSplitter successfully and sometimes stop partway through?
Moreover, we are not executing anything on the Cromwell server host itself; we are creating Batch jobs which have their own compute and memory assignments. So why would this issue occur? Please suggest!
Hi @ssb_cromwell,
It is strange that your logs say the docker image for this task is in the Google Container Registry. Can you please confirm whether you're using a docker image from DockerHub or Quay? Thanks!
2019-05-11 08:09:02,405 INFO - dockerImage: us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386
Hi @Ruchi Munshi,
Thanks for your response!
Yes, we are using the docker image below. Do you see anything wrong with it? Should I be using it some other way? Please suggest!
docker: "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386"
Regards,
Hey @ssb_cromwell,
I didn't expect the Google docker images to work on the Amazon backend -- it might be best to use a similar docker image hosted in DockerHub:
https://github.com/broadinstitute/cromwell/blob/develop/centaur/src/main/resources/integrationTestCases/germline/single-sample-workflow/processing-for-variant-discovery-gatk4.hg38.wgs.aws.inputs.json#L37
"broadinstitute/genomes-in-the-cloud:2.3.0-1501082129"
This is a general best practice and may be unrelated to your issue.
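In practice that just means pointing the task's runtime section at the DockerHub image instead, e.g. (sketch only; keep your other runtime attributes as they are):

runtime {
  docker: "broadinstitute/genomes-in-the-cloud:2.3.0-1501082129"
  ...
}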
Hi @Ruchi Munshi, As suggested, I tried using the DockerHub docker image rather than the one from GCR.
But to our surprise, this time it didn't even get past the SamSplitter task, which used to succeed with the GCR docker image. Please find the logs below for your reference.
To give you further detail, we are running Cromwell on EC2 (t2.medium) and using AWS Batch to spin up the docker containers and process the file. Currently, we are trying to process a whole-genome file of about 68 GB. Could you please advise whether we should be trying some other config or parameters?
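In case it helps anyone reproducing this setup: the backend section of a Cromwell AWS Batch config typically looks roughly like the following (a sketch based on the documented AWS example; the bucket and queue ARN are placeholders, not a verbatim copy of our config):

backend {
  default = "AWSBATCH"
  providers {
    AWSBATCH {
      actor-factory = "cromwell.backend.impl.aws.AwsBatchBackendLifecycleActorFactory"
      config {
        root = "s3://<our-bucket>/cromwell-execution"
        auth = "default"
        default-runtime-attributes {
          queueArn = "arn:aws:batch:<region>:<account-id>:job-queue/<our-queue>"
        }
      }
    }
  }
}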
Error while using the DockerHub docker image rather than the GCR one:
2019-05-11 19:55:19,509 cromwell-system-akka.dispatchers.backend-dispatcher-120 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from - to Initializing
2019-05-11 19:55:24,762 cromwell-system-akka.dispatchers.backend-dispatcher-120 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from Initializing to Running
2019-05-11 20:40:22,960 cromwell-system-akka.dispatchers.backend-dispatcher-1151 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from Running to Failed
2019-05-11 20:50:32,644 cromwell-system-akka.dispatchers.backend-dispatcher-1151 INFO - AwsBatchAsyncBackendJobExecutionActor [UUID(b1d542d6)UBTAB.CQYM:0:1]: Status change from Running to Succeeded
2019-05-11 20:50:35,164 cromwell-system-akka.dispatchers.engine-dispatcher-24 ERROR - WorkflowManagerActor Workflow cc34fddf-416b-498b-bc53-6349a3498a32 failed (during ExecutingWorkflowState): Job SLRG.SamSplitter:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: s3://cromw/ce/WGSS/cc34fd498b-bc53-6349a3498a32/call-UBTAB/UBTAB/b1d542d6-5aef-2c8f432abb9/call-SplitRG/shard-0/SLRG/1aec4219-e35b-47b4-50fb5c3039/call-SamSplitter/SamSplitter-stderr.log.
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/cromw/ce/WGSS/cc34fd498b-bc53-6349a3498a32/call-UBTAB/UBTAB/b1d542d6-5aef-2c8f432abb9/call-SplitRG/shard-0/SLRG/1aec4219-e35b-47b4-50fb5c3039/call-SamSplitter/tmp.80418deb
USAGE: PicardCommandLine <program name> [-h]
Available Programs:
--------------------------------------------------------------------------------------
Alpha Tools: Tools that are currently UNSUPPORTED until further testing and maturation.
    CollectIndependentReplicateMetrics      Estimates the rate of independent replication of reads within a bam.
    CollectWgsMetricsWithNonZeroCoverage    Collect metrics about coverage and performance of whole genome sequencing (WGS) experiments.
    UmiAwareMarkDuplicatesWithMateCigar     Identifies duplicate reads using information from read positions and UMIs.
--------------------------------------------------------------------------------------
Fasta: Tools for manipulating FASTA, or related data.
    CreateSequenceDictionary                Creates a sequence dictionary for a reference sequence.
    ExtractSequences                        Subsets intervals from a reference sequence to a new FASTA file.
    NonNFastaSize                           Counts the number of non-N bases in a fasta file.
    NormalizeFasta                          Normalizes lines of sequence in a FASTA file to be of the same length.
--------------------------------------------------------------------------------------
Fingerprinting Tools: Tools for manipulating fingerprints, or related data.
    CheckFingerprint                        Computes a fingerprint from the supplied input (SAM/BAM or VCF) file and compares it to the provided genotypes
    ClusterCrosscheckMetrics                Clusters the results of a CrosscheckFingerprints run by LOD score.
    CrosscheckFingerprints                  Checks if all fingerprints appear to come from the same individual.
    CrosscheckReadGroupFingerprints         DEPRECATED: USE CrosscheckFingerprints. Checks if all read groups appear to come from the same individual.
--------------------------------------------------------------------------------------
Illumina Tools: Tools for manipulating data specific to Illumina sequencers.
    CheckIlluminaDirectory                  Asserts the validity for specified Illumina basecalling data.
Thanks for trying out server mode; sorry to hear that it didn't work out. I've updated https://github.com/broadinstitute/cromwell/issues/4960 with the results. Unfortunately, it sounds like we do have a bug in the delocalization process, and we'll need to investigate that fully.
Thanks -
Chris
Hi @Ruchi Munshi, Could you please provide any further suggestions on this? Let me know if I need to run it some other way.
@Chris, thanks for updating the case; I hope we will get some clue soon!
Regards!
I just want to report another instance of the failure: the workflow triggered around 14 Batch jobs and their status changed to "Running", but then all of a sudden the Cromwell server exited.
When I checked the Batch dashboard, all 14 jobs had succeeded, taking around 2 hours per job, but there was no Cromwell server left to capture the results.
Please suggest whether this is a lack of resources or something else.
Hi Team,
Any updates, please!
Regards
Satwinder Singh
Hi @ssb_cromwell, can you give us a few more weeks to figure out the issue? The person who is looking into it is on vacation this week.
Thanks,
Saloni
@ssb_cromwell This server shutdown is really difficult to debug without a concrete error message, so I'm mainly going off experiences I've seen elsewhere until an actual error message can be located.
> getting stopped abruptly without any exception
> we are running Cromwell on EC2 (t2.medium)
A t2.medium has 4 GB of memory. Do you have the command line you're using to start Cromwell in server mode? I'm especially curious what `-Xmx` setting is being used along with the t2.medium, and whether a larger instance and/or different `-Xmx` settings would help.
Because of the "clean shutdown" messages from Cromwell I'm assuming something external is sending Cromwell a SIGINT or more likely a SIGTERM. The SIGTERMs are sent from things like the Linux OOM Killer, that should be visible/logged somewhere.
Otherwise, if Cromwell is shutting down due to an internally generated signal, it's tough to understand why without more logs. If we're truly stuck, we may need to update https://broadworkbench.atlassian.net/browse/BA-4960 to add logging for where the shutdown signals are originating, internally or externally.
Lastly re: the shutdowns, as the AWS backend is in active development, I would make sure you're trying the latest Cromwell 42 to see if that helps too.
Separately, regarding:
> Error while using the DockerHub docker image rather than the GCR one
> USAGE: PicardCommandLine <program name> [-h]
This is a specific error that occurs when the docker image doesn't match what the task expects to run via its command block. If you haven't already, please split this into a new issue so that debugging your WDL task + docker doesn't get lost in the Cromwell server shutdown discussion.
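As an illustration only (this is a sketch, not your actual task): SamSplitter-style tasks generally call Picard via a path baked into the image, along the lines of:

command {
  java -Xms3000m -jar /usr/gitc/picard.jar \
    SplitSamByNumberOfReads \
    INPUT=${input_bam} \
    ...
}

If the image you switch to lays Picard out differently, or ships a Picard build that doesn't recognize the requested subcommand, Picard falls back to printing its usage/program listing and exits 1, which matches the stderr you posted.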
Thanks for your patience with all of this.
-k