Parallel processing of jobs for running GWAS in Hail

February 06, 2020 19:39
30 comments

Hello,

I am planning to run LMM GWAS on 1300 phenotypes and would like some insight into how to do it in Terra. I already have the pipeline set up in Python; I want to know how to create a workflow that runs parallel jobs for multiple phenotypes in Terra. Please let me know. Thank you.

PS: I am trying dsub as of now but want to try Terra too. Thank you.

Comments

Jason Cerrato
- February 06, 2020 20:02
Hello!

We're happy to get you acquainted with Terra for your analysis. Depending on your level of familiarity with Terra, you may want to start with one of these:

If you're very new: https://support.terra.bio/hc/en-us/sections/360006866192-New-users-overview

If you know your way around: https://support.terra.bio/hc/en-us/articles/360037117492-Getting-Started-with-WDL

Let us know if you have any questions!

Kind regards,

Jason

apampana
- February 06, 2020 20:19
I have run some analyses, but using other people's pipelines. I haven't constructed a pipeline on my own.

Jason Cerrato
- February 06, 2020 20:40
Once you familiarize yourself with WDLs, you can look to this example of a WDL that uses a subworkflow which uses a docker image for its runtime in order to use Python code:

Main WDL: https://github.com/HumanCellAtlas/skylab/blob/master/pipelines/optimus/Optimus.wdl

Sub-workflow Attach10xBarcodes.wdl: https://github.com/HumanCellAtlas/skylab/blob/master/library/tasks/Attach10xBarcodes.wdl

If you want to use Python code, you can consider setting your WDL up in a similar way and setting up a Docker image for the runtime. Definitely start by familiarizing yourself with WDLs, and let us know if you have any questions. We're happy to help!

Jason

apampana
- February 06, 2020 21:11
Great, I will have a look into it and get back to you. I have to do this quickly, so I will check and let you know if I have any questions.

apampana
- February 10, 2020 16:22
Does anyone know how to deal with this? I used dsub to submit a job and got this error: https://files.slack.com/files-pri/T0CMFS7GX-FTRNL1QDA/image.png

Jason Cerrato
- February 10, 2020 20:51
Hello,

Can you let us know what the result is if you remove the single quotes around the path to the script?

Jason

apampana
- Edited February 10, 2020 21:05
(base) akhil@DESKTOP-QV1Q2MS:~$ dsub --image gcr.io/jhs-project-243319/hail_latest:latest --provider google-v2 --project jhs-project-243319 --regions "us-east1" --logging gs://jhs_data_topmed/ --output OUT=gs://jhs_data_topmed/out.txt --input gs://jhs_data_topmed --script gs://jhs_data_topmed/phewas_jhs_lmm.py --disk-size 300 --wait --min-ram 64 --preemptible 2 --retries 2
Job: phewas-jhs--akhil--200210-155723-12
Provider internal-id (operation): projects/jhs-project-243319/operations/12890280361931164766
Launched job-id: phewas-jhs--akhil--200210-155723-12
To check the status, run:
dstat --provider google-v2 --project jhs-project-243319 --jobs 'phewas-jhs--akhil--200210-155723-12' --users 'akhil' --status '*'
To cancel the job, run:
ddel --provider google-v2 --project jhs-project-243319 --jobs 'phewas-jhs--akhil--200210-155723-12' --users 'akhil'
Waiting for job to complete...
Monitoring for failed tasks to retry...
*** This dsub process must continue running to retry failed tasks.
phewas-jhs--akhil--200210-155723-12 (attempt 1) failed. Retrying.
Failure message: CommandException: No URLs matched
2020-02-10 20:58:39 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:58:50 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:58:50 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:59:01 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:59:01 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:59:12 ERROR: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed

Provider internal-id (operation): projects/jhs-project-243319/operations/8806285692181215445
phewas-jhs--akhil--200210-155723-12 (attempt 2) failed. Retrying.
Failure message: CommandException: No URLs matched
2020-02-10 20:58:39 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:58:50 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:58:50 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:59:01 WARNING: Sleeping 10s before the next attempt of failed gsutil command
2020-02-10 20:59:01 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
2020-02-10 20:59:12 ERROR: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed

Provider internal-id (operation): projects/jhs-project-243319/operations/9577110751754765975
['Error in phewas-jhs--akhil--200210-155723-12 - code 9: Execution failed: while running "localization": unexpected exit status 1 was not ignored']
JobExecutionError: One or more jobs finished with status FAILURE or CANCELED during wait.
phewas-jhs--akhil--200210-155723-12

Jason Cerrato
- February 11, 2020 16:23
Hi Akhil,

Can you share the script you are looking to use, located at gs://jhs_data_topmed/phewas_jhs_lmm.py, for us to take a look? So that I understand your reasoning, where is the path /mnt/data/input/gs/jhs_data_topmed that you are copying to?

Kind regards,

Jason

apampana
- February 11, 2020 16:32
I am using the Linux subsystem in Windows and there is no path like that. When I submit the job to run the counting and so on, it throws this error. I will keep the code here. I am using dsub to submit the job. When I use --use-private-address I don't have any problem, but the job is not running.

Jason Cerrato
- February 11, 2020 16:45
Hmm, based on the error message, it looks like it's trying to copy to that location. Example:

2020-02-10 20:58:39 WARNING: gsutil -mq cp gs://jhs_data_topmed /mnt/data/input/gs/jhs_data_topmed
CommandException: No URLs matched
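
One possible reading of this log, offered as an assumption rather than a confirmed diagnosis: the dsub command passed --input gs://jhs_data_topmed, which names a whole bucket rather than a single object, so a non-recursive gsutil cp has nothing to match. The /mnt/data/input/... destination appears to be where dsub places inputs on the worker VM, not a path on the submitting machine. A small Python check along these lines could flag such URIs before submission (is_bare_bucket is a hypothetical helper name, not part of dsub or gsutil):

```python
from urllib.parse import urlparse


def is_bare_bucket(uri: str) -> bool:
    """Return True when a gs:// URI names a bucket but no object or prefix."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # The path is "" or "/" when nothing follows the bucket name.
    return parsed.path.strip("/") == ""


# The two URIs below are the ones that appear in the dsub command in this thread.
for uri in ("gs://jhs_data_topmed", "gs://jhs_data_topmed/phewas_jhs_lmm.py"):
    print(uri, "-> bare bucket" if is_bare_bucket(uri) else "-> object or prefix")
```

The dsub documentation describes --input NAME=gs://bucket/path/file for single files and --input-recursive NAME=gs://bucket/path for directories; which of the two fits depends on what the script actually reads.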

You may find some helpful information for this issue online by searching the error message. However, we will not be able to continue investigating this specific issue with dsub as it is not a part of the Terra platform. If you run this work in Terra, we are happy to take a look at any issues you run into.

If you have any further questions, please let us know!

Kind regards,

Jason

apampana
- February 11, 2020 16:52
I have been searching online and haven't gotten any response, so I am asking in the places where I may get an answer. By the way, is there a way to convert Python code to WDL?

Jason Cerrato
- February 11, 2020 18:21
Your best bet will be to use an example of a WDL that uses Python code within it as a reference for writing your own. For instance: https://github.com/klarman-cell-observatory/cumulus/blob/master/workflows/drop-seq/dropseq_count.wdl#L190

Using python <<CODE in your WDL, you can start a block for python commands, which ends with a line that says CODE. You will just need to ensure that the docker runtime has Python so that the commands can be run.

You can see another example here: https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#alternative-heredoc-syntax

Jason Cerrato
- February 11, 2020 18:24
The previous message is useful if you want to execute Python commands within the WDL. You can also run the Python script itself from within your WDL. Simply use a Docker image that contains your Python script as the runtime, as is done here: https://github.com/openwdl/wdl/blob/master/versions/development/SPEC.md#runtime-section
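
As a minimal sketch of what such a script's entry point might look like if each workflow task should handle exactly one phenotype: the script takes the phenotype name as a command-line argument, so a WDL task command could call it with a different value per task. The names run_gwas and --phenotype and the output file pattern are assumptions for illustration; run_gwas only stands in for the existing Hail pipeline and does no real analysis here.

```python
import argparse


def run_gwas(phenotype: str) -> str:
    """Hypothetical placeholder for the existing Hail pipeline.

    It only returns the name of the results file it would write.
    """
    return f"{phenotype}.results.tsv"


def parse_args(argv=None):
    """Parse the single phenotype name this run is responsible for."""
    parser = argparse.ArgumentParser(description="Run one GWAS for one phenotype.")
    parser.add_argument("--phenotype", required=True,
                        help="name of the phenotype column to analyse")
    return parser.parse_args(argv)


def main(argv=None) -> str:
    args = parse_args(argv)
    output_path = run_gwas(args.phenotype)
    print(f"phenotype={args.phenotype} output={output_path}")
    return output_path


# Example invocation with an explicit argument list; inside a container the
# same values would normally come from the command line (argv=None).
main(["--phenotype", "bmi"])
```

Keeping the phenotype as an argument means the same Docker image and script can be reused unchanged for every phenotype.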

apampana
- February 11, 2020 19:22
I will have a look at the info you sent me. How do I run multiple jobs at a time?

apampana
- February 11, 2020 19:46
Also, are there any examples of running Hail scripts on Terra? I am searching everywhere but haven't been able to find Hail-related examples.

Jason Cerrato
- February 11, 2020 21:10
You can run multiple jobs at once with a single workflow easily in Terra. For example, you can select to run the same workflow on one to thousands of samples, and Terra will automatically run the workflow on each sample as its own job. It may be worth reading up on this section: https://support.terra.bio/hc/en-us/articles/360036379771-Get-started-running-workflows
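
To illustrate the one-row-per-job idea: Terra data tables can be populated from a tab-separated load file whose first column header has the form entity:<type>_id. The sketch below assumes a hypothetical table type called phenotype and a hypothetical extra column phenotype_name, and builds such a file with one row per phenotype, so that a workflow launched on the table would get one job per row.

```python
import csv
import io


def phenotype_table_tsv(phenotypes):
    """Build the text of a tab-separated load file with one row per phenotype.

    The table type "phenotype" and the column "phenotype_name" are
    illustrative assumptions, not names required by Terra.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["entity:phenotype_id", "phenotype_name"])
    for name in phenotypes:
        # Use the phenotype name both as the row ID and as the workflow input.
        writer.writerow([name, name])
    return buffer.getvalue()


# Two made-up phenotype names; a real run would list all 1300.
print(phenotype_table_tsv(["bmi", "ldl"]), end="")
```

Row IDs may need to be restricted to characters that Terra accepts in entity names, so real phenotype names might have to be sanitised first.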

Once you are comfortable with that information, you can copy the workspace Terra-Workflows-Quickstart and test it out yourself: https://app.terra.bio/#workspaces/fc-product-demo/Terra-Workflows-Quickstart

We have notebook runtimes with Hail, should you be interested in using Hail in a Jupyter notebook. Would that work for your needs? We don't have a way to start up or connect to a user's Spark cluster from Terra Cromwell at the moment, so the interactive notebook would be the way to go for the time being.

https://support.terra.bio/hc/en-us/articles/360027237871-Terra-s-Jupyter-Notebooks-environment-Part-I-Key-components

apampana
- February 11, 2020 21:17
I want to run multiple GWAS on different phenotypes at a time. I don't think an interactive session fits that.

apampana
- February 11, 2020 21:18
Basically, spinning up multiple instances to run multiple GWAS at a time in parallel.

Jason Cerrato
- February 12, 2020 14:01
Just to make sure I am understanding this correctly, you are looking to run multiple GWAS at a time in parallel using a Hail script in a WDL. Is this correct?

apampana
- February 12, 2020 15:56
Yes, I want to do exactly that. One instance or one task == one GWAS.

Jason Cerrato
- February 18, 2020 16:45
This functionality is on the agenda for being built in the future, but it is not slated for build in the short term. Apologies for any inconvenience this causes.

apampana
- February 18, 2020 17:20
No worries, I will figure something out. Thank you for helping.

Jason Cerrato
- February 18, 2020 17:43
If you have any further questions, please let us know!

apampana
- February 18, 2020 17:50
I want to know how you installed Hail, Hadoop, and Spark in Terra. Is there any code that I can check regarding the installation?

Jason Cerrato
- February 18, 2020 17:52
Hi Akhil,

Are you looking to know how to install these in a notebook environment or in a docker image for workflows? Or are you looking to find out how we installed any of these somewhere in particular?

Kind regards,

Jason

apampana
- February 18, 2020 18:27
I am looking for how you installed Hail, Hadoop, and Spark into Terra, in general terms. If I have a VM and want to install Hail, Hadoop, and Spark on that VM, how do I do it? I have a Hail Docker image, but when I try to install Hadoop on the VM it does not work properly. Docker is also fine: I can create a Docker image and install it on the VM to run.

Jason Cerrato
- February 20, 2020 18:55
I saw that Alex Baumann has provided an example script that submits from WDL to a dataproc cluster. He also mentioned that if you have a docker image with Spark you can run the job locally. Whether it be in a notebook or in a workflow, I would say installing what you need through a Docker image is the best way to go if you want to run something in Terra. Does that help? If anything is unclear, please let me know.

apampana
- Edited February 20, 2020 19:07
If I need to install Hadoop, does that also need to be done using Docker? I just don't know where I stand; I am very confused about how Dataproc and virtual machines work. I want to do it in a virtual machine that doesn't have Hadoop on it. Does Terra create virtual machines with a custom image? (I don't want to go through Dataproc, as it's more costly than virtual machines, right?)

Jason Cerrato
- February 20, 2020 20:42
Hi Akhil,

I've spoken with some members of the notebooks team to get more clarity on the situation you're facing here. They've confirmed that all Terra clusters are Dataproc clusters, but they expect to have the option of using single (GCE) VMs in a couple months' time. These dataproc clusters have hadoop installed on them by default, so if you use one of the default runtimes (like the Hail runtime) or you use a custom runtime, it will have hadoop installed.

I hope this answers your question. More information on Dataproc clusters can be found here: https://cloud.google.com/dataproc/

Jason

Jason Cerrato
- February 20, 2020 20:46
You can see what is installed by default on the clusters here: https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4
