inconsistent importing of ints from Terra data table

Post author
Mitch Cunningham

Hi, I'm trying to run the following workflow:

```
version 1.0

workflow SingleM_SRA {
  input {
    String SRA_accession_num
    Int metagenome_size_in_bp
    String Download_Method_Order
    File? GCloud_User_Key_File
    Boolean GCloud_Paid
    String? AWS_User_Key_Id
    String? AWS_User_Key
  }
  call download_and_extract_ncbi {
    input:
      SRA_accession_num = SRA_accession_num,
      metagenome_size_in_bp = metagenome_size_in_bp,
      GCloud_User_Key_File = GCloud_User_Key_File,
      GCloud_Paid = GCloud_Paid,
      AWS_User_Key_Id = AWS_User_Key_Id,
      AWS_User_Key = AWS_User_Key,
      Download_Method_Order = Download_Method_Order
  }
  call singlem {
    input:
      collections_of_sequences = download_and_extract_ncbi.extracted_reads,
      srr_accession = SRA_accession_num
  }
  output {
    File SingleM_tables = singlem.singlem_otu_table_gz
  }
}

task download_and_extract_ncbi {
  input {
    String SRA_accession_num
    Int metagenome_size_in_bp
    String Download_Method_Order
    File? GCloud_User_Key_File
    Boolean GCloud_Paid
    String? AWS_User_Key_Id
    String? AWS_User_Key
    String dockerImage = "gcr.io/maximal-dynamo-308105/download_and_extract_ncbi:dev9.11e56131"
  }

  command {
    python /ena-fast-download/bin/kingfisher \
      -r ~{SRA_accession_num} \
      --gcp-user-key-file ~{if defined(GCloud_User_Key_File) then (GCloud_User_Key_File) else "undefined"} \
      ~{if (GCloud_Paid) then "--allow-paid-from-gcp" else ""} \
      --output-format-possibilities fastq \
      -m ~{Download_Method_Order}
  }
  runtime {
    docker: dockerImage
    disks: "local-disk 50 SSD"
  }
  output {
    Array[File] extracted_reads = glob("*.fastq")
  }
}

task singlem {
  input {
    Array[File] collections_of_sequences
    String srr_accession
    String memory = "3.5 GiB"
    String disks = "local-disk 50 SSD"
    String dockerImage = "gcr.io/maximal-dynamo-308105/singlem:0.13.2-dev10.a6cc1b4"
  }
  command {
    export INPUT=`/singlem/extras/sra_input_generator.py --fastq-dump-outputs ~{sep=' ' collections_of_sequences} --min-orf-length 72`
    if [ ! -z "$INPUT" ]
    then
      /opt/conda/envs/env/bin/time /singlem/bin/singlem pipe \
        $INPUT \
        --archive_otu_table ~{srr_accession}.singlem.json --threads 2 \
        --assignment-method diamond \
        --diamond-prefilter \
        --diamond-prefilter-performance-parameters '--block-size 0.5 --target-indexed -c1' \
        --diamond-prefilter-db /pkgs/53_db2.0-attempt4.0.60.faa.dmnd \
        --min_orf_length 72 \
        --singlem-packages `ls -d /pkgs/*spkg` \
        --diamond-taxonomy-assignment-performance-parameters '--target-indexed -c1' \
        --working-directory-tmpdir && gzip ~{srr_accession}.singlem.json
    fi
  }
  runtime {
    docker: dockerImage
    memory: memory
    disks: disks
    cpu: 2
  }
  output {
    File singlem_otu_table_gz = "~{srr_accession}.singlem.json.gz"
  }
}
```

I'm running it on the following data set, imported to Terra from a TSV file as a data table:

```
entity:sample_id  sra_accession  metagenome_size_in_bp  singlem_table
ERR2560573        ERR2560573     2130993842             undefined
ERR2709812        ERR2709812     2228931958             undefined
```

I'm mapping this.metagenome_size_in_bp in the data table to variable metagenome_size_in_bp in the workflow.

I'm mapping this.sra_accession in the data table to variable SRA_accession_num in the workflow.

When I run this, the first sample (ERR2560573) works fine, but the second sample (ERR2709812) gives the following error:

Workflow input processing failed (Caused by [reason 1 of 1]: Failed to evaluate input 'metagenome_size_in_bp' (reason 1 of 1): For input string: "2228931958")

I've tried running the same analysis on 50 samples and get about 25 failures, all with the same error.

Can somebody please suggest what I can do to get this working?

Many thanks

Mitch

Comments

6 comments

  • Comment author
    Ben Woodcroft

    Our current theory is that the value supplied for the input 'metagenome_size_in_bp' sometimes exceeds 2^31 - 1 (2147483647), which might be out of range for an Int?
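
A quick way to see which rows of a data table would be affected under this theory is to scan the TSV before uploading it. The following is a minimal sketch, assuming the limit is that of a signed 32-bit integer; the function name `find_int32_overflows` is hypothetical and is not part of Terra or Cromwell.

```python
import csv
import io

# Assumption: WDL Int values are limited to the signed 32-bit range.
INT32_MAX = 2**31 - 1  # 2147483647


def find_int32_overflows(tsv_text: str, column: str) -> list[tuple[str, int]]:
    """Return (row id, value) pairs whose `column` value exceeds INT32_MAX.

    The row id is taken from the first column of the table.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    id_column = reader.fieldnames[0]
    overflows = []
    for row in reader:
        value = int(row[column])
        if value > INT32_MAX:
            overflows.append((row[id_column], value))
    return overflows


# The two rows from the data table in the original post.
example = (
    "entity:sample_id\tsra_accession\tmetagenome_size_in_bp\tsinglem_table\n"
    "ERR2560573\tERR2560573\t2130993842\tundefined\n"
    "ERR2709812\tERR2709812\t2228931958\tundefined\n"
)
print(find_int32_overflows(example, "metagenome_size_in_bp"))
# prints [('ERR2709812', 2228931958)]
```

Only the second sample is flagged, which matches the sample that failed in the original post.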

  • Comment author
    Samantha (she/her)

    Hi Mitch Cunningham,

    Thanks for writing in. Can you share the workspace where you are seeing this issue with GROUP_FireCloud-Support@firecloud.org by clicking the Share button in your workspace? The Share option is in the three-dots menu at the top-right.

    1. Add GROUP_FireCloud-Support@firecloud.org to the User email field and press enter on your keyboard.
    2. Click Save.

    Let us know the workspace name, as well as the relevant submission and workflow IDs. We'll be happy to take a closer look as soon as we can.

    Best,

    Samantha

  • Comment author
    Mitch Cunningham

    Hi Samantha,
    I've added GROUP_FireCloud-Support@firecloud.org to our workspace.
    Here's a link to a run that works OK with metagenome_size_in_bp = 2147483647, i.e. (2^31)-1:
    https://app.terra.bio/#workspaces/firstterrabillingaccount/Terra-Workflows-Quickstart%20copy%20copy/job_history/b94873df-0840-459b-a153-59a19a5b7ec4
    Here's a link to the exact same workflow, but run with the values 2147483648 and 2147483649 for the same variable:
    https://app.terra.bio/#workspaces/firstterrabillingaccount/Terra-Workflows-Quickstart%20copy%20copy/job_history/64b5823b-a422-4e89-a4fb-473505ce6b20
    Both of these are failing with the message:
    Workflow input processing failed (Caused by [reason 1 of 1]: Failed to evaluate input 'metagenome_size_in_bp' (reason 1 of 1): For input string: "2147483648")
    Our theory is that there is a limit on the size of ints at (2^31)-1, and that any int larger than this is rejected.
    We should clarify that we've figured out a workaround for our current project on Terra, so the outcome won't break the analysis. However, we would appreciate it if you could confirm whether there is such a limit and update the docs accordingly, so that we can better understand the requirements should we wish to run other projects on Terra in future.
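
The boundary behaviour reported in the two runs above can be mimicked with a strict signed 32-bit parse. This is only an illustrative sketch: `parse_int32` is a hypothetical helper, and the assumption that the workflow engine parses `Int` inputs this way is inferred from the observed cutoff, not taken from its source code. The error text is copied from the message quoted above purely for comparison.

```python
# Assumption: Int inputs must fit in the signed 32-bit range.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def parse_int32(text: str) -> int:
    """Parse text as a signed 32-bit integer, rejecting out-of-range values."""
    value = int(text)
    if not INT32_MIN <= value <= INT32_MAX:
        # Same wording as the failure reported in the thread.
        raise ValueError(f'For input string: "{text}"')
    return value


# The three values tried in the two linked runs.
for text in ("2147483647", "2147483648", "2147483649"):
    try:
        print(text, "accepted as", parse_int32(text))
    except ValueError as err:
        print(text, "rejected:", err)
```

Under this assumption, 2147483647 is accepted and both 2147483648 and 2147483649 are rejected, as in the two runs.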

  • Comment author
    Mitch Cunningham

    Also, we are shortly looking to run around 7,000 instances of this workflow, which might necessitate up to 14,000 WDL tasks in total. Can you please advise whether there are any hidden limits (e.g. maximum concurrent containers) that we need to increase to get this to work? If this is successful, we would like to scale up further to 50,000 to 100,000 instances. Again, are there any special steps or configuration we would need to run at this scale?

    Also, apologies for tagging this question onto an existing forum post. I'm more than happy to post it as a separate forum post or support request, as appropriate.

    Many thanks,
    Mitch

  • Comment author
    Samantha (she/her)

    Hi Mitch Cunningham,

    I brought this to our engineers and they confirmed that you are hitting an integer size limit. Currently, Cromwell only supports WDL 1.0: integers are 32-bit and floats are 64-bit. The size limit will be raised once Cromwell adopts WDL 1.1. However, there is no fixed timeline for that implementation.
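
To put the two sizes mentioned above in perspective, the sketch below compares the signed 32-bit integer maximum with the range of whole numbers a 64-bit float can hold exactly. It assumes the floats are IEEE 754 doubles, which is what Python's `float` uses; `survives_float64` is a hypothetical helper. It says nothing about whether declaring the input as `Float` behaves well on Terra, which is untested here.

```python
INT32_MAX = 2**31 - 1        # largest signed 32-bit integer
FLOAT64_EXACT_MAX = 2**53    # every integer up to this magnitude is exact in a 64-bit float


def survives_float64(value: int) -> bool:
    """Return True if value is unchanged after a round trip through a 64-bit float."""
    return int(float(value)) == value


print(INT32_MAX)                                # 2147483647
print(survives_float64(2228931958))             # True: the failing value fits exactly
print(survives_float64(FLOAT64_EXACT_MAX + 1))  # False: beyond the exact range
```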

    As for your other question, I would suggest reading through this support doc for some helpful information on scaling your workflows.

    Please let me know if you have any other questions.

    Best,

    Samantha

  • Comment author
    Ben Woodcroft

    Thanks Samantha,

    It was easy enough to fix in our current case: we just provided the input in Gbp rather than bp and adjusted the resource calculations, as we didn't need to be too exact.
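
A conversion along these lines could be applied to the data table column before upload. This is a minimal sketch of one way to do it, not the code actually used: the thread does not say whether whole or fractional Gbp were supplied, so rounding up to whole Gbp (which keeps the value an integer well inside the 32-bit range) is an assumption, and `bp_to_whole_gbp` is a hypothetical name.

```python
BP_PER_GBP = 10**9


def bp_to_whole_gbp(size_in_bp: int) -> int:
    """Round a size in base pairs up to the next whole gigabase pair."""
    # Ceiling division using only integer arithmetic.
    return -(-size_in_bp // BP_PER_GBP)


# The two metagenome sizes from the original data table.
for size_in_bp in (2130993842, 2228931958):
    print(size_in_bp, "bp ->", bp_to_whole_gbp(size_in_bp), "Gbp")
```

Both example values round up to 3 Gbp, so any resource calculation in the workflow would need to multiply by 10^9 (or use per-Gbp factors) to stay on the same scale.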

    Re scaling, we have run into some quota issues (particularly around external IP counts) but Jason is being quite helpful on that front.

    ben
