Parallel processing of jobs for running GWAS in Hail
Hello,
I am planning to run LMM GWAS on 1,300 phenotypes and would like some insight into how to do it in Terra. I already have the pipeline set up in Python; I want to know how to create a workflow that runs parallel jobs for multiple phenotypes in Terra. Please let me know. Thank you.
PS: I am trying dsub for now but want to try Terra too. Thank you.
Comments
Hello!
We're happy to get you acquainted with Terra for your analysis. Depending on your level of familiarity with Terra, you may want to start with one of these:
If you're very new: https://support.terra.bio/hc/en-us/sections/360006866192-New-users-overview
If you know your way around: https://support.terra.bio/hc/en-us/articles/360037117492-Getting-Started-with-WDL
Let us know if you have any questions!
Kind regards,
Jason
I have run some analyses, but only with other people's pipelines. I haven't constructed a pipeline on my own.
Once you familiarize yourself with WDLs, you can look at this example of a WDL with a subworkflow that uses a Docker image as its runtime in order to run Python code:
Main WDL: https://github.com/HumanCellAtlas/skylab/blob/master/pipelines/optimus/Optimus.wdl
Sub-workflow Attach10xBarcodes.wdl: https://github.com/HumanCellAtlas/skylab/blob/master/library/tasks/Attach10xBarcodes.wdl
If you want to use Python code, you can consider setting your WDL up in a similar way and setting up a Docker image for the runtime. Definitely start by familiarizing yourself with WDLs, and let us know if you have any questions. We're happy to help!
Jason
Great, I will have a look into it and get back to you. I have to do this quickly, so I will check and let you know if I have any questions.
Does anyone know how to deal with this? I used dsub to submit a job and got this error: https://files.slack.com/files-pri/T0CMFS7GX-FTRNL1QDA/image.png
Hello,
Can you let us know what the result is if you remove the single quotes around the path to the script?
Jason
(base) akhil@DESKTOP-QV1Q2MS:~$ dsub --image gcr.io/jhs-project-243319/hail_latest:latest --provider google-v2 --project jhs-project-243319 --regions "us-east1" --logging gs://jhs_data_topmed/ --output OUT=gs://jhs_data_topmed/out.txt --input gs://jhs_data_topmed --script gs://jhs_data_topmed/phewas_jhs_lmm.py --disk-size 300 --wait --min-ram 64 --preemptible 2 --retries 2
Job: phewas-jhs--akhil--200210-155723-12
Provider internal-id (operation): projects/jhs-project-243319/operations/12890280361931164766
Launched job-id: phewas-jhs--akhil--200210-155723-12
To check the status, run:
dstat --provider google-v2 --project jhs-project-243319 --jobs 'phewas-jhs--akhil--200210-155723-12' --users 'akhil' --status '*'
To cancel the job, run:
ddel --provider google-v2 --project jhs-project-243319 --jobs 'phewas-jhs--akhil--200210-155723-12' --users 'akhil'
Waiting for job to complete...
Monitoring for failed tasks to retry...
*** This dsub process must continue running to retry failed tasks.
phewas-jhs--akhil--200210-155723-12 (attempt 1) failed. Retrying.
Failure message: CommandException: No URLs matched
2020-02-10 20:58:39 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:58:50 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:58:50 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:59:01 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:59:01 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:59:12 ERROR: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
Provider internal-id (operation): projects/jhs-project-243319/operations/8806285692181215445
phewas-jhs--akhil--200210-155723-12 (attempt 2) failed. Retrying.
Failure message: CommandException: No URLs matched
2020-02-10 20:58:39 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:58:50 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:58:50 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:59:01 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:59:01 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:59:12 ERROR: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
Provider internal-id (operation): projects/jhs-project-243319/operations/9577110751754765975
['Error in phewas-jhs--akhil--200210-155723-12 - code 9: Execution failed: while running "localization": unexpected exit status 1 was not ignored']
JobExecutionError: One or more jobs finished with status FAILURE or CANCELED during wait.
phewas-jhs--akhil--200210-155723-12
Hi Akhil,
Can you share the script you are looking to use, located at gs://jhs_data_topmed/phewas_jhs_lmm.py, for us to take a look? Also, so that I understand your setup, where is the path /mnt/data/input/gs/jhs_data_topmed that you are copying to?
Kind regards,
Jason
I am using the Linux subsystem on Windows, and there is no path like that. When I submit a job to run counting and so on, it throws this error. I will post the code here. I am using dsub to submit the job. When I use --use-private-address I don't get this error, but the job does not run.
Hmm, based on the error message, it looks like it's trying to copy to that location. Example:
2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
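A possible reading of this log, offered as an assumption rather than a confirmed diagnosis: the command passes the bare bucket gs://jhs_data_topmed to --input, and the gsutil cp shown in the log has nothing to match without an object path or a recursive copy. The /mnt/data/input/... path is where dsub stages inputs on the worker VM, not a path on the submitting machine. dsub also provides an --input-recursive flag for directories. The Python sketch below uses a hypothetical function name and a simple heuristic to suggest which flag a given URL probably needs:

```python
from urllib.parse import urlparse


def suggest_dsub_input_flag(url: str) -> str:
    """Guess whether a gs:// URL names a single object or a directory.

    Heuristic only (an assumption for illustration): a bucket root or a
    path ending in "/" is treated as a directory; anything else is
    treated as a single object or wildcard pattern.
    """
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {url!r}")
    object_path = parsed.path.lstrip("/")
    if not object_path or object_path.endswith("/"):
        return "--input-recursive"
    return "--input"


# The bucket root from the failing command looks like a directory input.
print(suggest_dsub_input_flag("gs://jhs_data_topmed"))
# The script path from the same command looks like a single object.
print(suggest_dsub_input_flag("gs://jhs_data_topmed/phewas_jhs_lmm.py"))
```

A directory path written without a trailing slash would be misclassified by this heuristic, so treat the result as a hint only.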
You may find some helpful information for this issue online by searching the error message. However, we will not be able to continue investigating this specific issue with dsub as it is not a part of the Terra platform. If you run this work in Terra, we are happy to take a look at any issues you run into.
If you have any further questions, please let us know!
Kind regards,
Jason
I have been searching online and haven't found an answer, so I am asking in the places where I might get one. By the way, is there a way to convert Python code to WDL?
Your best bet will be to use an example of a WDL that uses Python code within it as a reference for writing your own. For instance: https://github.com/klarman-cell-observatory/cumulus/blob/master/workflows/drop-seq/dropseq_count.wdl#L190
Using python <<CODE in your WDL, you can start a block for python commands, which ends with a line that says CODE. You will just need to ensure that the docker runtime has Python so that the commands can be run.
You can see another example here: https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#alternative-heredoc-syntax
The previous message applies if you want to execute Python commands within the WDL. You can also run the Python script itself from within your WDL. Simply use a Docker image that contains your Python script as the runtime, as is done here: https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#runtime-section
I will have a look at the info you sent me. How do I run multiple jobs at a time?
Also, are there any examples of running Hail scripts on Terra? I am searching everywhere but haven't been able to find Hail-related examples.
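For the script-in-a-Docker-image route, it helps if the script takes its inputs on the command line, so that a task's command section can call it once per phenotype. Below is a minimal sketch, not the actual pipeline: the argument names, the output naming scheme, and the placeholder where the analysis would run are all assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the per-run inputs (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Run one GWAS for one phenotype.")
    parser.add_argument("--phenotype", required=True,
                        help="name of the phenotype column to analyze")
    parser.add_argument("--output-dir", required=True,
                        help="directory or bucket prefix for the results")
    return parser.parse_args(argv)


def output_path(output_dir: str, phenotype: str) -> str:
    """Build one result path per phenotype so parallel runs never collide."""
    return f"{output_dir.rstrip('/')}/{phenotype}.lmm.tsv"


def main(argv=None) -> str:
    args = parse_args(argv)
    out = output_path(args.output_dir, args.phenotype)
    # Placeholder: the existing pipeline code for a single phenotype
    # would run here and write its results to `out`.
    print(f"Would run GWAS for {args.phenotype} and write {out}")
    return out
```

In the real script, main() would be called at the bottom of the file, and the task command would look something like `python script.py --phenotype bmi --output-dir results` (hypothetical names).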
You can easily run multiple jobs at once with a single workflow in Terra. For example, you can choose to run the same workflow on anywhere from one to thousands of samples, and Terra will automatically run the workflow on each sample as its own job. It may be worth reading up on this section: https://support.terra.bio/hc/en-us/articles/360036379771-Get-started-running-workflows
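To make "one job per sample" concrete for this use case, one option is a data table with one row per phenotype, so that the workflow can be launched on every row. The sketch below writes such a table as a tab-separated load file. The `entity:<type>_id` header follows my understanding of Terra's load file convention; the table name `phenotype` and the column `phenotype_column` are hypothetical, so check the current Terra documentation before relying on this.

```python
import csv
import io


def phenotype_table_tsv(phenotypes, entity_type="phenotype"):
    """Return TSV text with one data-table row per phenotype.

    The first column header uses the `entity:<type>_id` form; the
    entity type and the second column name are illustrative only.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([f"entity:{entity_type}_id", "phenotype_column"])
    for name in phenotypes:
        # Row ID and the column the workflow should analyze.
        writer.writerow([name, name])
    return buf.getvalue()


print(phenotype_table_tsv(["bmi", "ldl"]), end="")
```

The resulting file could be uploaded to the workspace's Data tab, and the workflow input for the phenotype would then be mapped to the `phenotype_column` attribute of each row.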
Once you are comfortable with that information, you can copy the workspace Terra-Workflows-Quickstart and test it out yourself: https://app.terra.bio/#workspaces/fc-product-demo/Terra-Workflows-Quickstart
We have notebook runtimes with Hail, should you be interested in using Hail in a Jupyter notebook. Would that work for your needs? We don't have a way to start up or connect to a user's Spark cluster from Terra Cromwell at the moment, so the interactive notebook would be the way to go for the time being.
https://support.terra.bio/hc/en-us/articles/360027237871-Terra-s-Jupyter-Notebooks-environment-Part-I-Key-components
I want to run multiple GWAS on different phenotypes at a time. I don't think an interactive session fits that.
Basically, I want to spin up multiple instances to run multiple GWAS in parallel.
Just to make sure I am understanding this correctly, you are looking to run multiple GWAS at a time in parallel using a Hail script in a WDL. Is this correct?
Yes, I want to do exactly that. One instance or one task == one GWAS.
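Outside Terra, dsub (the tool mentioned in the original question) supports this one-task-per-phenotype pattern through a tab-separated tasks file passed with --tasks, where the header row declares parameters such as `--env NAME` or `--output NAME` and each following row becomes one task. The sketch below generates such a file; the variable names PHENOTYPE and OUT, the output naming scheme, and the bucket prefix are assumptions for illustration.

```python
def dsub_tasks_tsv(phenotypes, output_prefix):
    """Return the text of a dsub tasks file: one row (task) per phenotype.

    Each task receives the phenotype name in the PHENOTYPE environment
    variable and a per-phenotype output location in OUT (hypothetical names).
    """
    prefix = output_prefix.rstrip("/")
    lines = ["--env PHENOTYPE\t--output OUT"]
    for name in phenotypes:
        lines.append(f"{name}\t{prefix}/{name}.tsv")
    return "\n".join(lines) + "\n"


# Example with two phenotypes and a placeholder bucket prefix.
print(dsub_tasks_tsv(["bmi", "ldl"], "gs://my-bucket/results"), end="")
```

The script would then read the phenotype from `os.environ["PHENOTYPE"]` and write its result to the path in `os.environ["OUT"]`, and the file would be submitted with `--tasks` in place of per-job `--env` and `--output` flags.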
This functionality is on the agenda for being built in the future, but it is not slated for build in the short term. Apologies for any inconvenience this causes.
No worries, I will figure something out. Thank you for helping.
If you have any further questions, please let us know!
I would like to know how you installed Hail, Hadoop, and Spark in Terra. Is there any code I can look at regarding the installation?
Hi Akhil,
Are you looking to know how to install these in a notebook environment or in a docker image for workflows? Or are you looking to find out how we installed any of these somewhere in particular?
Kind regards,
Jason
I am asking how you installed Hail, Hadoop, and Spark in Terra in general. If I have a VM and want to install Hail, Hadoop, and Spark on it, how do I do it? I have a Hail Docker image, but when I try to install Hadoop on the VM it does not work properly. Docker is also fine: I can create a Docker image and install it on the VM to run.
I saw that Alex Baumann has provided an example script that submits from WDL to a Dataproc cluster. He also mentioned that if you have a Docker image with Spark you can run the job locally. Whether it be in a notebook or in a workflow, I would say installing what you need through a Docker image is the best way to go if you want to run something in Terra. Does that help? If anything is unclear, please let me know.
If I need to install Hadoop, does that also need to be done using Docker? I just don't know where I stand; I am very confused about how Dataproc and virtual machines work. I want to do this on a virtual machine that doesn't have Hadoop on it. Does Terra create virtual machines with a custom image? (I don't want to go through Dataproc, since it is more costly than virtual machines, right?)
Hi Akhil,
I've spoken with some members of the notebooks team to get more clarity on the situation you're facing here. They've confirmed that all Terra clusters are Dataproc clusters, but they expect to have the option of using single (GCE) VMs in a couple of months' time. These Dataproc clusters have Hadoop installed on them by default, so whether you use one of the default runtimes (like the Hail runtime) or a custom runtime, it will have Hadoop installed.
I hope this answers your question. More information on Dataproc clusters can be found here: https://cloud.google.com/dataproc/
Jason
You can see what is installed by default on the clusters here: https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4