
Parallel processing of jobs for running GWAS in Hail

Comments

30 comments

  • Jason Cerrato

    Hello!

    We're happy to get you acquainted with Terra for your analysis. Depending on your level of familiarity with Terra, you may want to start with one of these:

    If you're very new: https://support.terra.bio/hc/en-us/sections/360006866192-New-users-overview

    If you know your way around: https://support.terra.bio/hc/en-us/articles/360037117492-Getting-Started-with-WDL

    Let us know if you have any questions!

    Kind regards,

    Jason

  • apampana

    I have run some analyses, but using others' pipelines. I haven't constructed a pipeline on my own.

     

  • Jason Cerrato

    Once you've familiarized yourself with WDLs, you can look at this example of a WDL that uses a subworkflow, which in turn uses a Docker image as its runtime in order to run Python code:

    Main WDL: https://github.com/HumanCellAtlas/skylab/blob/master/pipelines/optimus/Optimus.wdl

    Sub-workflow Attach10xBarcodes.wdl: https://github.com/HumanCellAtlas/skylab/blob/master/library/tasks/Attach10xBarcodes.wdl

    If you want to use Python code, you can consider setting your WDL up in a similar way and setting up a Docker image for the runtime. Definitely start by familiarizing yourself with WDLs, and let us know if you have any questions. We're happy to help!
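    For illustration only, here is a rough sketch of that structure; all file names, workflow/task names, and the image are hypothetical placeholders, not taken from the Optimus pipeline:

    # main.wdl (hypothetical)
    version 1.0

    import "greet.wdl" as sub

    workflow Main {
      input { String name }
      call sub.GreetWF { input: name = name }
      output { File result = GreetWF.out }
    }

    # greet.wdl (hypothetical) -- the subworkflow's task declares a Docker image as its runtime
    version 1.0

    workflow GreetWF {
      input { String name }
      call Greet { input: name = name }
      output { File out = Greet.out }
    }

    task Greet {
      input { String name }
      command { python3 -c 'print("Hello, ~{name}")' }
      runtime { docker: "python:3.8-slim" }
      output { File out = stdout() }
    }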

    Jason

  • apampana

    Great, I will have a look at it and get back to you. I have to do this quickly, so I'll check it out and let you know if I have any questions.

     

  • apampana

    Does anyone know how to deal with this? I used dsub to submit a job and got this error: https://files.slack.com/files-pri/T0CMFS7GX-FTRNL1QDA/image.png

  • Jason Cerrato

    Hello,

    Can you let us know what the result is if you remove the single quotes around the path to the script?

    Jason

  • apampana

    (base) akhil@DESKTOP-QV1Q2MS:~$ dsub --image gcr.io/jhs-project-243319/hail_latest:latest --provider google-v2 --project jhs-project-243319 --regions "us-east1" --logging gs://jhs_data_topmed/ --output OUT=gs://jhs_data_topmed/out.txt --input gs://jhs_data_topmed --script gs://jhs_data_topmed/phewas_jhs_lmm.py --disk-size 300 --wait --min-ram 64 --preemptible 2 --retries 2
    Job: phewas-jhs--akhil--200210-155723-12
    Provider internal-id (operation): projects/jhs-project-243319/operations/12890280361931164766
    Launched job-id: phewas-jhs--akhil--200210-155723-12
    To check the status, run:
    dstat --provider google-v2 --project jhs-project-243319 --jobs 'phewas-jhs--akhil--200210-155723-12' --users 'akhil' --status '*'
    To cancel the job, run:
    ddel --provider google-v2 --project jhs-project-243319 --jobs 'phewas-jhs--akhil--200210-155723-12' --users 'akhil'
    Waiting for job to complete...
    Monitoring for failed tasks to retry...
    *** This dsub process must continue running to retry failed tasks.
    phewas-jhs--akhil--200210-155723-12 (attempt 1) failed. Retrying.
    Failure message: CommandException: No URLs matched
    2020-02-10 20:58:39 WARNING: Sleeping 10s before the next attempt of failed gsutil command
    2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
    CommandException: No URLs matched
    2020-02-10 20:58:50 WARNING: Sleeping 10s before the next attempt of failed gsutil command
    2020-02-10 20:58:50 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
    CommandException: No URLs matched
    2020-02-10 20:59:01 WARNING: Sleeping 10s before the next attempt of failed gsutil command
    2020-02-10 20:59:01 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
    CommandException: No URLs matched
    2020-02-10 20:59:12 ERROR: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed

    Provider internal-id (operation): projects/jhs-project-243319/operations/8806285692181215445
    phewas-jhs--akhil--200210-155723-12 (attempt 2) failed. Retrying.
    Failure message: CommandException: No URLs matched
    2020-02-10 20:58:39 WARNING: Sleeping 10s before the next attempt of failed gsutil command
    2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
    CommandException: No URLs matched
    2020-02-10 20:58:50 WARNING: Sleeping 10s before the next attempt of failed gsutil command
    2020-02-10 20:58:50 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
    CommandException: No URLs matched
    2020-02-10 20:59:01 WARNING: Sleeping 10s before the next attempt of failed gsutil command
    2020-02-10 20:59:01 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
    CommandException: No URLs matched
    2020-02-10 20:59:12 ERROR: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed

    Provider internal-id (operation): projects/jhs-project-243319/operations/9577110751754765975
    ['Error in phewas-jhs--akhil--200210-155723-12 - code 9: Execution failed: while running "localization": unexpected exit status 1 was not ignored']
    JobExecutionError: One or more jobs finished with status FAILURE or CANCELED during wait.
    phewas-jhs--akhil--200210-155723-12

  • Jason Cerrato

    Hi Akhil,

    Can you share the script you are looking to use, located at gs://jhs_data_topmed/phewas_jhs_lmm.py, so we can take a look? Also, so that I understand your reasoning, where is the path /mnt/data/input/gs/jhs_data_topmed that you are copying to located?

    Kind regards,

    Jason

  • apampana

    I am using the Linux subsystem on Windows, and there is no path like that. When I submit the job to run the counting and so on, it throws this error. I will keep the code here. I am using dsub to submit the job. When I use --use-private-address I don't have any problem, but the job doesn't run.

  • Jason Cerrato

    Hmm, based on the error message, it looks like it's trying to copy to that location. For example:

    2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
    CommandException: No URLs matched

    You may find some helpful information for this issue online by searching the error message. However, we will not be able to continue investigating this specific issue with dsub as it is not a part of the Terra platform. If you run this work in Terra, we are happy to take a look at any issues you run into.

    If you have any further questions, please let us know!

    Kind regards,

    Jason

  • apampana

    I have been searching online and haven't gotten any answers, so I'm asking in places where I might get one. By the way, is there a way to convert Python code to WDL?

     

  • Jason Cerrato

    Your best bet will be to use an example of a WDL that uses Python code within it as a reference for writing your own. For instance: https://github.com/klarman-cell-observatory/cumulus/blob/master/workflows/drop-seq/dropseq_count.wdl#L190

    Using python <<CODE in your WDL, you can start a block of Python commands, which ends with a line that says CODE. You will just need to ensure that the Docker runtime has Python so that the commands can be run.
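    To make that concrete, here is a minimal sketch of a task using that pattern (the task name, input, and image are placeholders I'm making up, not taken from the linked pipeline):

    version 1.0

    task count_lines {
      input {
        File infile
      }
      command {
        python <<CODE
        # the heredoc body is plain Python; ~{infile} is filled in by the WDL engine
        with open("~{infile}") as f:
            print(sum(1 for _ in f))
        CODE
      }
      runtime {
        docker: "python:3.8-slim"   # any image that has Python on the PATH
      }
      output {
        Int n_lines = read_int(stdout())
      }
    }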

    You can see another example here: https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#alternative-heredoc-syntax

  • Jason Cerrato

    The previous message applies if you want to execute Python commands within the WDL. You can also run the Python script itself from within your WDL: simply use a Docker image that contains your Python script as the runtime, as is done here: https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#runtime-section
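    As a rough sketch of that second approach (the image name and the path /app/phewas.py inside it are hypothetical; the image would need to be built with the script baked in):

    version 1.0

    task run_phewas {
      input {
        File phenotypes
      }
      command {
        # the script ships inside the image, so the command just invokes it
        python /app/phewas.py --phenotypes ~{phenotypes} > results.txt
      }
      runtime {
        docker: "us.gcr.io/my-project/phewas:latest"   # hypothetical image containing /app/phewas.py
      }
      output {
        File results = "results.txt"
      }
    }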

  • apampana

    I will have a look at the info you sent me. How do I run multiple jobs at a time?

     

  • apampana

    Also, are there any examples of running Hail scripts on Terra? I have been searching everywhere but haven't been able to find Hail-related examples.

  • Jason Cerrato

    You can easily run multiple jobs at once with a single workflow in Terra. For example, you can choose to run the same workflow on anywhere from one to thousands of samples, and Terra will automatically run the workflow on each sample as its own job. It may be worth reading up on this section: https://support.terra.bio/hc/en-us/articles/360036379771-Get-started-running-workflows

    Once you are comfortable with that information, you can copy the workspace Terra-Workflows-Quickstart and test it out yourself: https://app.terra.bio/#workspaces/fc-product-demo/Terra-Workflows-Quickstart
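    As an aside, you can also fan out over a list of inputs from within a single WDL using a scatter block, so that each element runs as its own job. Here is a minimal sketch with hypothetical names; the run_gwas task is only a placeholder for whatever per-phenotype work you define yourself:

    version 1.0

    workflow gwas_per_phenotype {
      input {
        Array[String] phenotypes
        File genotype_data
      }
      # one shard, and therefore one job, per phenotype
      scatter (pheno in phenotypes) {
        call run_gwas { input: phenotype = pheno, genotype_data = genotype_data }
      }
      output {
        Array[File] summary_stats = run_gwas.results
      }
    }

    task run_gwas {
      input {
        String phenotype
        File genotype_data
      }
      command {
        # placeholder: the real association test for one phenotype would go here
        echo "GWAS for ~{phenotype} using ~{genotype_data}" > ~{phenotype}.results.txt
      }
      runtime { docker: "python:3.8-slim" }
      output { File results = "~{phenotype}.results.txt" }
    }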

    We have notebook runtimes with Hail, should you be interested in using Hail in a Jupyter notebook. Would that work for your needs? We don't have a way to start up or connect to a user's Spark cluster from Terra Cromwell at the moment, so the interactive notebook would be the way to go for the time being.

    https://support.terra.bio/hc/en-us/articles/360027237871-Terra-s-Jupyter-Notebooks-environment-Part-I-Key-components

  • apampana

    I want to run multiple GWAS on different phenotypes at a time. I don't think an interactive session is a good fit for that.

  • apampana

    Basically, spinning up multiple instances to run multiple GWAS at a time in parallel.

     

  • Jason Cerrato

    Just to make sure I am understanding this correctly: you are looking to run multiple GWAS at a time in parallel using a Hail script in a WDL. Is this correct?

  • apampana

    Yes, I want to do exactly that. One instance or one task == one GWAS.

  • Jason Cerrato

    This functionality is on the agenda to be built in the future, but it is not slated to be built in the short term. Apologies for any inconvenience this causes.

  • apampana

    No worries, I will figure something out. Thank you for helping.

  • Jason Cerrato

    If you have any further questions, please let us know!

  • apampana

    I want to know how you installed Hail, Hadoop, and Spark in Terra. Do you have any code that I can check regarding the installation?

  • Jason Cerrato

    Hi Akhil,

    Are you looking to know how to install these in a notebook environment or in a docker image for workflows? Or are you looking to find out how we installed any of these somewhere in particular?

    Kind regards,

    Jason

  • apampana

    I am looking for how you installed Hail, Hadoop, and Spark in Terra, in a general sense. If I have a VM and want to install Hail, Hadoop, and Spark on that VM, how do I do it? I have a Hail Docker image, but when I try to install Hadoop on the VM it doesn't work properly, so Docker is also fine; I can create a Docker image and install that on the VM to run.

     

     

  • Jason Cerrato

    I saw that Alex Baumann has provided an example script that submits from WDL to a dataproc cluster. He also mentioned that if you have a docker image with Spark you can run the job locally. Whether it be in a notebook or in a workflow, I would say installing what you need through a Docker image is the best way to go if you want to run something in Terra. Does that help? If anything is unclear, please let me know.

  • apampana

    If I need to install Hadoop, does that also need to be done using Docker? I just don't know where I stand; I'm very confused about how Dataproc and virtual machines work. I want to do this on a virtual machine which doesn't have Hadoop on it. Does Terra create virtual machines with a custom image? (I don't want to go through Dataproc, as it's more costly than virtual machines, right?)

     

     

  • Jason Cerrato

    Hi Akhil,

    I've spoken with some members of the notebooks team to get more clarity on the situation you're facing here. They've confirmed that all Terra clusters are Dataproc clusters, but they expect to have the option of using single (GCE) VMs in a couple of months' time. These Dataproc clusters have Hadoop installed on them by default, so if you use one of the default runtimes (like the Hail runtime) or a custom runtime, it will have Hadoop installed.

    I hope this answers your question. More information on Dataproc clusters can be found here: https://cloud.google.com/dataproc/

    Jason

  • Jason Cerrato

    You can see what is installed by default on the clusters here: https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4
