Localizing folders with many files

Post author
Marco Baggio

I am running the Ensembl Variant Effect Predictor (VEP) as part of my workflow. I stored the VEP database in a sub-folder of the Google bucket associated with the workspace.

The VEP tool (run via Docker) expects a path to a folder containing that database. Localizing each file individually is challenging, as there are more than 15,000 of them in the database. I tried passing an array of all the files as input to the workflow, but the front end complains that the input is too big. I also do not see an option to change which directories Docker binds (that is the solution I use on my local infrastructure).

I now pass a single file, a zip archive containing the database, which needs to be localized and unzipped as the first step of the workflow.
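
For reference, the zip approach boils down to a task shaped roughly like this (the task name, VEP flags, and disk sizes below are illustrative rather than my exact workflow):

version 1.0

task run_vep_from_zip {
  input {
    File vep_cache_zip   # zip archive of the VEP database, stored in the workspace bucket
    File input_vcf
  }
  command <<<
    # unpack the database, then point VEP at the resulting folder
    mkdir vep_cache
    unzip -q ~{vep_cache_zip} -d vep_cache
    vep --offline --cache --dir_cache vep_cache -i ~{input_vcf} -o annotated.vcf
  >>>
  runtime {
    docker: "ensemblorg/ensembl-vep:latest"
    disks: "local-disk 100 HDD"   # room for the archive plus the extracted database
  }
  output {
    File annotated_vcf = "annotated.vcf"
  }
}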

Since this seems rather inelegant, can anyone suggest a better solution? 

Thank you so much!

Marco

Comments

9 comments

  • Comment author
    Marco Baggio

    An alternative way to localize multiple files is to list them in a text file, one path per line (obtained for example with gsutil ls), and use the read_lines function from the WDL specification.

    In my example above the input would contain

    File vep_list
    Array[File] vep_data = read_lines(vep_list)

    where vep_list lists all the files contained in the VEP database.
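
    A fuller sketch of how this wires together (the task body below is just a placeholder showing where the localized files end up):

    version 1.0

    workflow vep_with_file_list {
      input {
        File vep_list    # one gs:// path per line, e.g. from: gsutil ls gs://<bucket>/vep_cache/** > vep_list.txt
      }
      Array[File] vep_data = read_lines(vep_list)

      call run_vep { input: vep_data = vep_data }
    }

    task run_vep {
      input {
        Array[File] vep_data
      }
      command <<<
        # all listed files are localized before this command runs; the database
        # directory can then be derived from the path of any one of them
        echo "localized ~{length(vep_data)} database files"
      >>>
      runtime {
        docker: "ensemblorg/ensembl-vep:latest"
      }
    }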

    In my testing, this is much slower (by a factor of 2-3) compared to the zip archive method of my original post, so I will keep using that for the time being.

    0
  • Comment author
    Samantha (she/her)

    Hi Marco,

    Are you using a custom Docker image that you created? If so, I wonder if another option that would work for you is to update the Dockerfile to copy the folder into the Docker image itself, so it can be accessed in your task's VM once it gets created.
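
    The task would then no longer need the database as an input at all - something along these lines, where the image name and in-container path are just placeholders for whatever you use:

    version 1.0

    task run_vep_baked_cache {
      input {
        File input_vcf
      }
      command <<<
        # /opt/vep_cache is wherever the Dockerfile COPYs the database to (placeholder path)
        vep --offline --cache --dir_cache /opt/vep_cache -i ~{input_vcf} -o annotated.vcf
      >>>
      runtime {
        docker: "us.gcr.io/my-project/vep-with-cache:latest"   # placeholder custom image
      }
      output {
        File annotated_vcf = "annotated.vcf"
      }
    }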

    Best,

    Samantha

    0
  • Comment author
    Marco Baggio

    Hi Samantha,

    Thank you, that is indeed a great idea! I am just worried about adding 26GB to the image (it currently sits at 1GB), but I will test it and see how it compares to the current solution (and report back).

    Cheers,

    Marco

    0
  • Comment author
    Samantha (she/her)

    Hi Marco,

    Just a heads up, you may also want to increase the bootDiskSizeGb value for your task, given the potential size of your image, since the default is only 10GB. See https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/#bootdisksizegb for more information.
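
    For example, in your task's runtime section (the sizes here are just placeholders):

    runtime {
      docker: "us.gcr.io/my-project/vep-with-cache:latest"   # placeholder ~27GB image
      bootDiskSizeGb: 40                                     # default is 10GB
      disks: "local-disk 50 HDD"
    }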

    Best,

    Samantha

    1
  • Comment author
    Marco Baggio

    Hi Samantha,

    Pulling the bigger Docker image takes about twice as long as the fastest method (localizing the zip archive and unzipping it), i.e. ~60 minutes vs ~30 minutes (probably due to limitations of hub.docker.com; another registry may give better performance).

    I decided to go with the latter option anyway, as it seems more "robust": for example, should I upgrade to a different version of Ensembl VEP, I can ensure that the database is always the correct one when building the Docker image. The difference in cost seems small enough to justify it.

    Thank you again for the suggestion!

    Marco

    0
  • Comment author
    alexander solovyov
    • Edited

    Hi Marco,

    I am setting up a workflow with a custom STAR index, which consists of multiple files, and ran into the same problem. I thought about the following solution (but have not tried it yet):

    1. Pack the index directory into an image - it can be an ISO file, or just a plain file (e.g., created with dd) formatted with some filesystem (say xfs) into which the index files are copied.

    2. Localize the image and mv it to index_image

    3. Mount it (e.g., https://www.cyberciti.biz/tips/how-to-mount-iso-image-under-linux.html for ISO image). For xfs:

    mkdir star_index
    mount -o loop -t xfs index_image /cromwell_root/star_index

    In theory this could save the ~5 minutes needed to extract 30GB from the archive.

    As far as I know, everything runs as root within the VM, so we should have the privileges to mount things as long as the mount command is available in our Docker image - I need to check this as well.
    Again, I have not tested it myself yet.
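
    A rough sketch of how steps 2-3 might look inside a WDL task (untested; the image and index names are illustrative):

    version 1.0

    task align_with_mounted_index {
      input {
        File index_image   # xfs (or ISO) image file containing the STAR index
      }
      command <<<
        # the job should run as root, so loop-mounting ought to work as long as
        # the mount command exists in the image and mounting is permitted
        mkdir -p star_index
        mount -o loop -t xfs ~{index_image} star_index
        ls star_index   # index files are now visible without unpacking anything
      >>>
      runtime {
        docker: "my-org/star-with-mount:latest"   # placeholder image
      }
    }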

    Unrelated: you can push your Docker image to GCR: https://support.terra.bio/hc/en-us/articles/360035638032-Publish-a-Docker-container-image-to-Google-Container-Registry-GCR-
    To the best of my knowledge, if it is in the same region (e.g., us-central), Google does not charge you for the network traffic when you localize it.

    0
  • Comment author
    Erik Wolfsohn

    Hi Alexander,

    Did your idea to mount the database as an image work, and if so, did you see any time/compute cost savings? I am running into the same issue while setting up a workflow for the NCBI PGAP annotation pipeline. The reference directory is over 100 GB and contains ~1,300 files.

    Best,
    Erik

    0
  • Comment author
    alexander solovyov
    • Edited

    Hi Erik,

    We have not done the run yet, will post an update if we manage to make it work.

    Best,

    Alexander.

    0
  • Comment author
    Daniella Matute

    Hello alexander solovyov and Erik Wolfsohn,

    Are there any updates from your runs?

    Best,

    Dany 

    0
