Cromwell workflow getting aborted intermittently without any exception

Post author
ssb_cromwell

Hi Team,

We are facing yet another issue with large-file (68 GB) processing: the workflow is getting aborted all of a sudden, without any exception, on AWS-enabled Cromwell. The behaviour is intermittent; sometimes the run succeeds and sometimes the same run fails on another attempt.

We have tried increasing the memory assigned to the Docker container and the Xms value, but no luck. We have also tried increasing the timeout settings in the Cromwell config.

Most of the time it happens in the SamSplitter (which splits the large file) and SamToFastqAndBwaMemAndMba tasks. I am also copying the timeout settings we maintain in our Cromwell config below.

Please help us to resolve this issue.

Timeout Setting in Cromwell config:
akka {
  http {
    server {
      request-timeout = 1800s
      idle-timeout = 2400s
    }
    client {
      request-timeout = 1800s
      connecting-timeout = 300s
    }
  }
}
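
For completeness, this is roughly how the config is passed to Cromwell at startup (the paths below are placeholders for our actual ones; individual akka settings can also be overridden as JVM properties):

# Placeholder paths, shown only to illustrate how the config file (and ad hoc
# akka overrides) are supplied to the Cromwell JVM at startup.
java -Dconfig.file=/opt/cromwell/cromwell.conf \
     -Dakka.http.server.request-timeout=1800s \
     -jar cromwell.jar ...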

Workflow log stopped without any exception, even while a process was still running:

[2019-05-03 14:34:00,51] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Initializing to Running
[2019-05-03 15:27:11,95] [info] AwsBatchAsyncBackendJobExecutionActor [461a6066UBTAB.CQYM:0:1]: Status change from Running to Succeeded
[2019-05-03 15:43:25,43] [info] Workflow polling stopped
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-05-03 15:43:25,45] [info] 0 workflows released by cromid-abdb07d
[2019-05-03 15:43:25,46] [info] Aborting all running workflows.
[2019-05-03 15:43:25,46] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-05-03 15:43:25,47] [info] JobExecutionTokenDispenser stopped
[2019-05-03 15:43:25,47] [info] WorkflowStoreActor stopped
[2019-05-03 15:43:25,47] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-05-03 15:43:25,47] [info] WorkflowManagerActor Aborting all workflows
[2019-05-03 15:43:25,47] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: Aborting workflow
[2019-05-03 15:43:25,47] [info] WorkflowLogCopyRouter stopped
[2019-05-03 15:43:25,47] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: Aborting workflow
[2019-05-03 15:43:25,47] [info] c68f0e1b-9998-4cca-9749-40586e3d097f-SubWorkflowActor-SubWorkflow-SplitRG:0:1 [c68f0e1b]: Aborting workflow
[2019-05-03 15:43:25,55] [info] Attempted CancelJob operation in AWS Batch for Job ID fd00c634-7a5f-453c-afad-9dbdead31a91. There were no errors during the operation
[2019-05-03 15:43:25,55] [info] We have normality. Anything you still can't cope with is therefore your own problem
[2019-05-03 15:43:25,55] [info] https://www.youtube.com/watch?v=YCRxnjE7JVs
[2019-05-03 15:43:25,55] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: AwsBatchAsyncBackendJobExecutionActor [c68f0e1b:SLRG.SamSplitter:NA:1] Aborted StandardAsyncJob(fd00c634-7a5f-453c-afad-9dbdead31a91)
[2019-05-03 15:56:31,10] [info] AwsBatchAsyncBackendJobExecutionActor [c68f0e1bSLRG.SamSplitter:NA:1]: Status change from Running to Succeeded
[2019-05-03 15:56:31,81] [info] 461a6066-0463-485c-9de3-763d0658f236-SubWorkflowActor-SubWorkflow-UBTAB:-1:1 [461a6066]: WorkflowExecutionActor [461a6066] aborted: SubWorkflow-SplitRG:0:1
[2019-05-03 15:56:32,09] [info] WorkflowExecutionActor-155c13d0-09e9-4ad7-b4c4-9cd2b1099c14 [155c13d0]: WorkflowExecutionActor [155c13d0] aborted: SubWorkflow-UBTAB:-1:1
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows are aborted
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor All workflows finished
[2019-05-03 15:56:32,84] [info] WorkflowManagerActor stopped
[2019-05-03 15:56:33,07] [info] Connection pools shut down
[2019-05-03 15:56:33,07] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] SubWorkflowStoreActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,07] [info] JobStoreActor stopped
[2019-05-03 15:56:33,07] [info] CallCacheWriteActor stopped
[2019-05-03 15:56:33,07] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-05-03 15:56:33,08] [info] DockerHashActor stopped
[2019-05-03 15:56:33,08] [info] IoProxy stopped
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,08] [info] WriteMetadataActor Shutting down: 72 queued messages to process
[2019-05-03 15:56:33,08] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-05-03 15:56:33,09] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2019-05-03 15:56:33,09] [info] WriteMetadataActor Shutting down: processing 0 queued messages
[2019-05-03 15:56:33,09] [info] ServiceRegistryActor stopped
[2019-05-03 15:56:33,11] [info] Database closed
[2019-05-03 15:56:33,11] [info] Stream materializer shut down
[2019-05-03 15:56:33,12] [info] WDL HTTP import resolver closed

Comments

15 comments

  • Comment author
    Tiffany Miller

    Hi ssb_cromwell - I am not familiar with AWS-enabled Cromwell. I am asking for help and we will get back to you soon.

    0
  • Comment author
    ssb_cromwell

    Thanks, @TiffanyMiller, we urgently need your help on this blocker and will be waiting for your response.

     

    0
  • Comment author
    Chris Llanwarne
    • Edited

    Hi ssb_cromwell -

     

    This looks new to us.

    We haven’t tested the AWS backend with large files so it’s quite possible that you have uncovered a genuine bug. I’ll create a github issue for you to follow the progress: https://github.com/broadinstitute/cromwell/issues/4960.

    Because Cromwell’s AWS backend is still in alpha, I can’t promise any timeline on getting a fix out (but @rmunshi might have more of an idea).

     

    One possible idea did occur to me based on your logs: this appears to be a race condition between the final output copying and Cromwell's run-mode shutdown logic (I'm assuming you're running Cromwell like "java -jar cromwell.jar run x.wdl ..."?). If that is the case, you might be able to make progress by running Cromwell in server mode, which should keep Cromwell from shutting itself down too soon after the workflow completes.
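
    Roughly, the difference looks like this (jar, file names, and port here are placeholders/defaults, so adjust to your setup):

    # Run mode: Cromwell executes a single workflow and shuts itself down when it finishes
    java -jar cromwell.jar run pipeline.wdl --inputs inputs.json

    # Server mode: Cromwell stays up and accepts submissions over REST (Swagger UI on port 8000 by default)
    java -jar cromwell.jar server

    # Submitting a workflow to the running server
    curl -X POST http://localhost:8000/api/workflows/v1 \
      -F workflowSource=@pipeline.wdl \
      -F workflowInputs=@inputs.json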

     

     

    0
  • Comment author
    Chris Llanwarne

    Here's a link to the Cromwell docs discussing run mode vs server mode: https://cromwell.readthedocs.io/en/develop/Modes/#server

    0
  • Comment author
    ssb_cromwell
    • Edited

    Hi @Chris, 

    Based on your suggestion, I tried running the workflow twice in server mode via the Swagger UI.

    But again it stopped abruptly without any exception. Please find the log below for your reference.


    Do you think it could be due to a lack of memory or compute resources when running against a large file? If so, why does it sometimes process SamSplitter successfully and sometimes stop partway through?

    Moreover, we are not executing anything on the Cromwell server machine itself; we are creating batch jobs, which have their own compute and memory assignments. So why would this issue happen? Please suggest!

    2019-05-11 08:09:02,050 cromwell-system-akka.dispatchers.backend-dispatcher-132 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(b552116e)SLRG.SamSplitter:NA:1]: `set -e
    mkdir output_dir
    total_reads=$(samtools view -c /cromwell_root/cromwellw-results/samplefiles/NA12878.unmapped.bam)
    java -Dsamjdk.compression_level=2 -Xms8000m -jar /usr/gitc/picard.jar SplitSamByNumberOfReads \
      INPUT=/cromwell_root/samplefiles/NA12878.unmapped.bam \
      OUTPUT=output_dir \
      SPLIT_TO_N_READS=48000000 \
      TOTAL_READS_IN_INPUT=$total_reads`

    2019-05-11 08:09:02,405  INFO  - Submitting job to AWS Batch
    2019-05-11 08:09:02,405  INFO  - dockerImage: us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386
    2019-05-11 08:09:02,405  INFO  - jobQueueArn: arn:aws:batch::job-queue/default-a91a-0a605950ff58
    2019-05-11 08:09:02,405  INFO  - taskId: SLRG.SamSplitter-None-1
    2019-05-11 08:09:02,405  INFO  - hostpath root: SLRG/SamSplitter/b552116e-4a8b-40f6-9291-fbaad6cd71aa/None/1
    2019-05-11 08:09:05,729 cromwell-system-akka.dispatchers.backend-dispatcher-138 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(b552116e)SLRG.SamSplitter:NA:1]: job id: 9da7c284-facf-4382-891f-234e7062c52e
    2019-05-11 08:09:05,778 cromwell-system-akka.dispatchers.backend-dispatcher-137 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(b552116e)SLRG.SamSplitter:NA:1]: Status change from - to Initializing
    2019-05-11 08:09:09,375 cromwell-system-akka.dispatchers.backend-dispatcher-137 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(b552116e)SLRG.SamSplitter:NA:1]: Status change from Initializing to Running
    2019-05-11 09:04:43,755 cromwell-system-akka.dispatchers.backend-dispatcher-1395 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(fc65ea33)UBTAB.CQYM:0:1]: Status change from Running to Succeeded
    2019-05-11 09:21:10,228  INFO  - Workflow polling stopped
    2019-05-11 09:21:10,234  INFO  - Http server unbound.
    2019-05-11 09:21:10,240  INFO  - Shutting down WorkflowStoreActor - Timeout = 5 seconds
    2019-05-11 09:21:10,248  INFO  - Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
    2019-05-11 09:21:10,255  INFO  - Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
    2019-05-11 09:21:10,256  INFO  - Shutting down WorkflowManagerActor - Timeout = 5 seconds
    2019-05-11 09:21:10,257  INFO  - 1 workflows released by cromid-8d7a2ea
    2019-05-11 09:21:10,257  INFO  - WorkflowStoreActor stopped
    2019-05-11 09:21:10,257  INFO  - WorkflowLogCopyRouter stopped
    2019-05-11 09:21:10,257  INFO  - JobExecutionTokenDispenser stopped
    2019-05-11 09:21:10,260  INFO  - Connection pools shut down
    2019-05-11 09:21:10,260  INFO  - WorkflowManagerActor stopped
    2019-05-11 09:21:10,261  INFO  - Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
    2019-05-11 09:21:10,261  INFO  - Shutting down JobStoreActor - Timeout = 1800 seconds
    2019-05-11 09:21:10,262  INFO  - Shutting down CallCacheWriteActor - Timeout = 1800 seconds
    2019-05-11 09:21:10,262  INFO  - Shutting down ServiceRegistryActor - Timeout = 1800 seconds
    2019-05-11 09:21:10,262  INFO  - Shutting down DockerHashActor - Timeout = 1800 seconds
    2019-05-11 09:21:10,262  INFO  - JobStoreWriterActor Shutting down: 0 queued messages to process
    2019-05-11 09:21:10,263  INFO  - JobStoreActor stopped
    2019-05-11 09:21:10,263  INFO  - SubWorkflowStoreActor stopped
    2019-05-11 09:21:10,264  INFO  - CallCacheWriteActor Shutting down: 0 queued messages to process
    2019-05-11 09:21:10,264  INFO  - CallCacheWriteActor stopped
    2019-05-11 09:21:10,266  INFO  - Shutting down IoProxy - Timeout = 1800 seconds
    2019-05-11 09:21:10,279  INFO  - Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
    2019-05-11 09:21:10,279  INFO  - Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
    2019-05-11 09:21:10,282  INFO  - IoProxy stopped
    2019-05-11 09:21:10,283  INFO  - Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
    2019-05-11 09:21:10,286  INFO  - DockerHashActor stopped
    2019-05-11 09:21:10,290  INFO  - WriteMetadataActor Shutting down: 0 queued messages to process
    2019-05-11 09:21:10,291  INFO  - KvWriteActor Shutting down: 0 queued messages to process
    2019-05-11 09:21:10,295  INFO  - ServiceRegistryActor stopped
    2019-05-11 09:21:10,311  INFO  - Database closed
    2019-05-11 09:21:10,311  INFO  - Stream materializer shut down
    2019-05-11 09:21:10,315  INFO  - WDL HTTP import resolver closed
    0
  • Comment author
    Ruchi Munshi

    Hi @ssb_cromwell,


    It is strange that your logs say the docker image for this task is in the Google Container Registry. Can you please confirm you're using a docker image from dockerhub or quay? Thanks!

    2019-05-11 08:09:02,405  INFO  - dockerImage: us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386
    0
  • Comment author
    ssb_cromwell

    Hi @Ruchi Munshi,

    Thanks for your response!

    Yes, we are using the docker image below. Do you see anything wrong with it? Should I use it some other way? Please suggest!

    docker: "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386"

    Regards,

    0
  • Comment author
    Ruchi Munshi
    • Edited

    Hey @ssb_cromwell,

    I didn't expect the Google docker images to work on the Amazon backend -- it might be best to use a similar docker image hosted on Docker Hub:

    https://github.com/broadinstitute/cromwell/blob/develop/centaur/src/main/resources/integrationTestCases/germline/single-sample-workflow/processing-for-variant-discovery-gatk4.hg38.wgs.aws.inputs.json#L37

    "broadinstitute/genomes-in-the-cloud:2.3.0-1501082129"

    This is a general best practice and may be unrelated to your issue.
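
    As a quick, purely illustrative sanity check, pulling the image on the machine (or AMI) where the jobs run confirms it is reachable from Docker Hub:

    # Verify the suggested Docker Hub image can be pulled where the Batch jobs run
    docker pull broadinstitute/genomes-in-the-cloud:2.3.0-1501082129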

    0
  • Comment author
    ssb_cromwell

    Hi @Ruchi Munshi, As suggested, I tried using the AWS-friendly docker image instead of the one from GCR.
    But to our surprise, this time it didn't even get past the SamSplitter task, which used to succeed with the GCR image. Please find the logs below for your reference.

    Just to give you further details, we are running Cromwell on EC2 (t2.medium) and using AWS Batch to spin up containers and process the file. Currently we are trying to process a whole-genome file of about 68 GB. Could you please suggest anything further, in case we should be trying other config or parameters?

     

    Error while using the AWS-suggested docker image rather than the GCP one:

    2019-05-11 19:55:19,509 cromwell-system-akka.dispatchers.backend-dispatcher-120 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from - to Initializing

    2019-05-11 19:55:24,762 cromwell-system-akka.dispatchers.backend-dispatcher-120 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from Initializing to Running

    2019-05-11 20:40:22,960 cromwell-system-akka.dispatchers.backend-dispatcher-1151 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(1aec4219)SLRG.SamSplitter:NA:1]: Status change from Running to Failed

    2019-05-11 20:50:32,644 cromwell-system-akka.dispatchers.backend-dispatcher-1151 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(b1d542d6)UBTAB.CQYM:0:1]: Status change from Running to Succeeded

    2019-05-11 20:50:35,164 cromwell-system-akka.dispatchers.engine-dispatcher-24 ERROR - WorkflowManagerActor Workflow cc34fddf-416b-498b-bc53-6349a3498a32 failed (during ExecutingWorkflowState): Job SLRG.SamSplitter:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.

    Check the content of stderr for potential additional information: s3://cromw/ce/WGSS/cc34fd498b-bc53-6349a3498a32/call-UBTAB/UBTAB/b1d542d6-5aef-2c8f432abb9/call-SplitRG/shard-0/SLRG/1aec4219-e35b-47b4-50fb5c3039/call-SamSplitter/SamSplitter-stderr.log.

    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/cromw/ce/WGSS/cc34fd498b-bc53-6349a3498a32/call-UBTAB/UBTAB/b1d542d6-5aef-2c8f432abb9/call-SplitRG/shard-0/SLRG/1aec4219-e35b-47b4-50fb5c3039/call-SamSplitter/tmp.80418deb

    USAGE: PicardCommandLine <program name> [-h]
    Available Programs:
    --------------------------------------------------------------------------------------
    Alpha Tools:                                     Tools that are currently UNSUPPORTED until further testing and maturation.
        CollectIndependentReplicateMetrics           Estimates the rate of independent replication of reads within a bam.
        CollectWgsMetricsWithNonZeroCoverage         Collect metrics about coverage and performance of whole genome sequencing (WGS) experiments.
        UmiAwareMarkDuplicatesWithMateCigar          Identifies duplicate reads using information from read positions and UMIs.
    --------------------------------------------------------------------------------------
    Fasta:                                           Tools for manipulating FASTA, or related data.
        CreateSequenceDictionary                     Creates a sequence dictionary for a reference sequence.
        ExtractSequences                             Subsets intervals from a reference sequence to a new FASTA file.
        NonNFastaSize                                Counts the number of non-N bases in a fasta file.
        NormalizeFasta                               Normalizes lines of sequence in a FASTA file to be of the same length.
    --------------------------------------------------------------------------------------
    Fingerprinting Tools:                            Tools for manipulating fingerprints, or related data.
        CheckFingerprint                             Computes a fingerprint from the supplied input (SAM/BAM or VCF) file and compares it to the provided genotypes
        ClusterCrosscheckMetrics                     Clusters the results of a CrosscheckFingerprints run by LOD score.
        CrosscheckFingerprints                       Checks if all fingerprints appear to come from the same individual.
        CrosscheckReadGroupFingerprints              DEPRECATED: USE CrosscheckFingerprints. Checks if all read groups appear to come from the same individual.
    --------------------------------------------------------------------------------------
    Illumina Tools:                                  Tools for manipulating data specific to Illumina sequencers.
        CheckIlluminaDirectory                       Asserts the validity for specified Illumina basecalling data.



    0
  • Comment author
    Chris Llanwarne

    Thanks for trying out server mode, sorry to hear that it didn't work out. I've updated https://github.com/broadinstitute/cromwell/issues/4960 with the results. Unfortunately it sounds like we do have a bug in the delocalization process and we'll need to investigate that fully.

     

    Thanks - 

     

    Chris

    0
  • Comment author
    ssb_cromwell

    Hi @Ruchi Munshi, could you please provide any further suggestions on this? Let me know if I need to run this in some other way.

    @Chris, thanks for updating the issue; I hope we will get some clue soon!

     

    Regards!

    0
  • Comment author
    ssb_cromwell
    • Edited

    I just want to report another instance of the failure: the workflow triggered around 14 batch jobs and their status changed to "Running", but then all of a sudden the Cromwell server exited.

    When I checked the Batch dashboard, all 14 jobs had succeeded, taking around 2 hours per job, but there was no Cromwell server left to capture the responses.

    Please suggest whether it is a lack of resources or something else.

    2019-05-15 21:49:26,337 cromwell-system-akka.dispatchers.backend-dispatcher-2436 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:8:1]: Status change from Initializing to Running

    2019-05-15 21:49:27,881 cromwell-system-akka.dispatchers.backend-dispatcher-2438 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:7:1]: Status change from Initializing to Running

    2019-05-15 21:49:28,293 cromwell-system-akka.dispatchers.backend-dispatcher-2438 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:1:1]: Status change from Initializing to Running

    2019-05-15 21:49:30,063 cromwell-system-akka.dispatchers.backend-dispatcher-2436 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:13:1]: Status change from Initializing to Running

    2019-05-15 21:49:31,584 cromwell-system-akka.dispatchers.backend-dispatcher-2438 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:2:1]: Status change from Initializing to Running

    2019-05-15 21:49:32,808 cromwell-system-akka.dispatchers.backend-dispatcher-2438 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:11:1]: Status change from Initializing to Running

    2019-05-15 21:49:37,595 cromwell-system-akka.dispatchers.backend-dispatcher-2436 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:3:1]: Status change from Initializing to Running

    2019-05-15 21:49:42,213 cromwell-system-akka.dispatchers.backend-dispatcher-2438 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:12:1]: Status change from Initializing to Running

    2019-05-15 22:01:06,126 cromwell-system-akka.dispatchers.backend-dispatcher-2638 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:0:1]: Status change from Initializing to Running

    2019-05-15 22:01:12,106 cromwell-system-akka.dispatchers.backend-dispatcher-2673 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:10:1]: Status change from Initializing to Running

    2019-05-15 22:01:44,126 cromwell-system-akka.dispatchers.backend-dispatcher-2673 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:4:1]: Status change from Initializing to Running

    2019-05-15 22:01:53,368 cromwell-system-akka.dispatchers.backend-dispatcher-2673 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:9:1]: Status change from Initializing to Running

    2019-05-15 22:02:58,998 cromwell-system-akka.dispatchers.backend-dispatcher-2689 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:6:1]: Status change from Initializing to Running

    2019-05-15 22:03:15,885 cromwell-system-akka.dispatchers.backend-dispatcher-2689 INFO  - AwsBatchAsyncBackendJobExecutionActor [UUID(80ea6e9e)SLRG.SFBMM:5:1]: Status change from Initializing to Running

    2019-05-15 22:43:55,092  INFO  - Workflow polling stopped

    2019-05-15 22:43:55,099  INFO  - Http server unbound.

    2019-05-15 22:43:55,106  INFO  - Shutting down WorkflowStoreActor - Timeout = 5 seconds

    2019-05-15 22:43:55,109  INFO  - Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds

    2019-05-15 22:43:55,112  INFO  - WorkflowStoreActor stopped

    2019-05-15 22:43:55,115  INFO  - Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds

    2019-05-15 22:43:55,116  INFO  - Shutting down WorkflowManagerActor - Timeout = 5 seconds

    2019-05-15 22:43:55,116  INFO  - WorkflowLogCopyRouter stopped

    2019-05-15 22:43:55,120  INFO  - 1 workflows released by cromid-1acbbc6

    2019-05-15 22:43:55,120  INFO  - JobExecutionTokenDispenser stopped

    2019-05-15 22:43:55,130  INFO  - Connection pools shut down

    2019-05-15 22:43:55,130  INFO  - WorkflowManagerActor stopped

    2019-05-15 22:43:55,131  INFO  - Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds

    2019-05-15 22:43:55,131  INFO  - Shutting down JobStoreActor - Timeout = 1800 seconds

    2019-05-15 22:43:55,131  INFO  - Shutting down CallCacheWriteActor - Timeout = 1800 seconds

    2019-05-15 22:43:55,131  INFO  - Shutting down ServiceRegistryActor - Timeout = 1800 seconds

    2019-05-15 22:43:55,131  INFO  - Shutting down DockerHashActor - Timeout = 1800 seconds

    2019-05-15 22:43:55,131  INFO  - Shutting down IoProxy - Timeout = 1800 seconds

    2019-05-15 22:43:55,132  INFO  - CallCacheWriteActor Shutting down: 0 queued messages to process

    2019-05-15 22:43:55,132  INFO  - JobStoreWriterActor Shutting down: 0 queued messages to process

    2019-05-15 22:43:55,133  INFO  - CallCacheWriteActor stopped

    2019-05-15 22:43:55,133  INFO  - JobStoreActor stopped

    2019-05-15 22:43:55,138  INFO  - SubWorkflowStoreActor stopped

    2019-05-15 22:43:55,142  INFO  - IoProxy stopped

    2019-05-15 22:43:55,143  INFO  - Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false

    2019-05-15 22:43:55,145  INFO  - Shutting down connection pool: curAllocated=1 idleQueues.size=1 waitQueue.size=0 maxWaitQueueLimit=256 closed=false

    2019-05-15 22:43:55,146  INFO  - WriteMetadataActor Shutting down: 0 queued messages to process

    2019-05-15 22:43:55,147  INFO  - KvWriteActor Shutting down: 0 queued messages to process

    2019-05-15 22:43:55,149  INFO  - Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false

    2019-05-15 22:43:55,149  INFO  - DockerHashActor stopped

    2019-05-15 22:43:55,152  INFO  - ServiceRegistryActor stopped

    2019-05-15 22:43:55,177  INFO  - Database closed

    2019-05-15 22:43:55,177  INFO  - Stream materializer shut down

    2019-05-15 22:43:55,179  INFO  - WDL HTTP import resolver closed
    0
  • Comment author
    ssb_cromwell

    Hi Team, 

    Any updates, please!

     

    Regards

    Satwinder Singh

    0
  • Comment author
    Saloni Shah

    Hi @ssb_cromwell, can you give us a few more weeks to figure out the issue? The person who is looking into it is on vacation this week.

     

    Thanks,

    Saloni

    0
  • Comment author
    Khalid Shakir

    @ssb_cromwell This server shutdown is really difficult to debug without a concrete error message, so I'm mainly going off experiences I've seen elsewhere until an actual error message can be located.

    > getting stopped abruptly without any exception

    > we are running Cromwell on Ec2(t2.medium)

    Googling the t2.medium says that it has 4GB of memory. Additionally, do you have the command line you're using to start Cromwell in server mode? I'm especially curious what `-Xmx` setting is being used along with the t2.medium, and whether a larger instance and/or different `-Xmx` settings would help.
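
    For example (just a sketch with illustrative values, not a recommendation), something like this would show the flags the current process was started with, and what an explicit heap cap on a 4GB instance might look like, leaving headroom for the OS:

    # See what JVM flags the running Cromwell process was started with
    ps -ef | grep '[c]romwell.jar'

    # Illustrative restart of server mode with an explicit heap cap; the config path is a placeholder
    java -Xms2g -Xmx2g -Dconfig.file=/opt/cromwell/cromwell.conf -jar cromwell.jar server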

    Because of the "clean shutdown" messages from Cromwell I'm assuming something external is sending Cromwell a SIGINT or more likely a SIGTERM. The SIGTERMs are sent from things like the Linux OOM Killer, which should be visible/logged somewhere.
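
    On the EC2 host, something along these lines (again, just a sketch) should surface any OOM-killer or kernel-level evidence after an unexpected shutdown:

    # Kernel log evidence that a process was killed for memory pressure
    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

    # Equivalent check via systemd's journal, if available on the host
    journalctl -k | grep -i oom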

    Otherwise, if Cromwell is shutting down due to an internally generated signal, it's tough to understand why without more logs. If we're truly stuck we may need to update https://broadworkbench.atlassian.net/browse/BA-4960 to add logging showing where the shutdown signals originate, internally or externally.

    Lastly re: the shutdowns, as the AWS backend is in active development, I would make sure you're trying the latest Cromwell 42 to see if that helps too.
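
    The release jars are published on the Cromwell GitHub releases page; the usual URL pattern looks like the following (worth double-checking against the releases page itself):

    # Fetch the Cromwell 42 release jar and confirm the version it reports
    wget https://github.com/broadinstitute/cromwell/releases/download/42/cromwell-42.jar
    java -jar cromwell-42.jar --version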

     

    Separately, regarding:

    > Error while using AWS specific docker image rather than GCP

    > USAGE: PicardCommandLine <program name> [-h]

    This is a specific error message where the built docker image doesn't match what the task expects to run via its command block. If you haven't already, please separate this into a new issue so that debugging your WDL task+docker doesn't get lost in the Cromwell server shutdown discussion.
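
    One quick way to check for that mismatch (a sketch, assuming the Docker Hub tag suggested earlier in this thread) is to look inside the image for the path the WDL command block expects:

    # Does the jar the task invokes actually exist at that path in this image?
    docker run --rm broadinstitute/genomes-in-the-cloud:2.3.0-1501082129 ls -l /usr/gitc/picard.jar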

     

    Thanks for your patience with all of this.
    -k

    0
