Localizing folders with many files

Post author
Marco Baggio

I am running the Ensembl Variant Effect Predictor (VEP) as part of my workflow. I stored the VEP database in a sub-folder of the Google bucket associated with the workspace.

The VEP tool (run via Docker) expects a path to a folder containing that database. Localizing each file individually is challenging, as there are more than 15,000 of them in the database. I tried passing an array of all the files as input to the workflow, but the front end complains that the input is too big. I also do not see an option to change which directories Docker binds (that is the solution I use on my local infrastructure).

I now pass a single file, a zip archive containing the database, which needs to be localized and unzipped as the first step of the workflow.
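
For reference, the zip approach boils down to a task shaped roughly like this (the task name, VEP flags, and disk sizes below are illustrative rather than my exact workflow):

version 1.0

task run_vep_from_zip {
  input {
    File vep_cache_zip   # zip archive of the VEP database, stored in the workspace bucket
    File input_vcf
  }
  command <<<
    # unpack the database, then point VEP at the resulting folder
    mkdir vep_cache
    unzip -q ~{vep_cache_zip} -d vep_cache
    vep --offline --cache --dir_cache vep_cache -i ~{input_vcf} -o annotated.vcf
  >>>
  runtime {
    docker: "ensemblorg/ensembl-vep:latest"
    disks: "local-disk 100 HDD"   # room for the archive plus the extracted database
  }
  output {
    File annotated_vcf = "annotated.vcf"
  }
}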

Since this seems rather inelegant, can anyone suggest a better solution? 

Thank you so much!

Marco

Comments

9 comments

  • Comment author
    Marco Baggio

    An alternative way to localize multiple files is to list them in a text file, one path per line (obtained for example with gsutil ls), and use the read_lines function from the WDL specification.

    In my example above the input would contain

    File vep_list
    Array[File] vep_data = read_lines(vep_list)

    where vep_list lists all the files contained in the VEP database.
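
    A fuller sketch of how this wires together (the task body below is just a placeholder showing where the localized files end up):

    version 1.0

    workflow vep_with_file_list {
      input {
        File vep_list    # one gs:// path per line, e.g. from: gsutil ls gs://<bucket>/vep_cache/** > vep_list.txt
      }
      Array[File] vep_data = read_lines(vep_list)

      call run_vep { input: vep_data = vep_data }
    }

    task run_vep {
      input {
        Array[File] vep_data
      }
      command <<<
        # all listed files are localized before this command runs; the database
        # directory can then be derived from the path of any one of them
        echo "localized ~{length(vep_data)} database files"
      >>>
      runtime {
        docker: "ensemblorg/ensembl-vep:latest"
      }
    }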

    In my testing, this is much slower (by a factor of 2-3) compared to the zip archive method of my original post, so I will keep using that for the time being.

    0
  • Comment author
    Samantha (she/her)

    Hi Marco,

    Are you using a custom Docker image that you created? If so, I wonder if another option that would work for you is to update the Dockerfile to copy the folder into the Docker image itself, so it can be accessed in your task's VM once it gets created.
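
    The task would then no longer need the database as an input at all - something along these lines, where the image name and in-container path are just placeholders for whatever you use:

    version 1.0

    task run_vep_baked_cache {
      input {
        File input_vcf
      }
      command <<<
        # /opt/vep_cache is wherever the Dockerfile COPYs the database to (placeholder path)
        vep --offline --cache --dir_cache /opt/vep_cache -i ~{input_vcf} -o annotated.vcf
      >>>
      runtime {
        docker: "us.gcr.io/my-project/vep-with-cache:latest"   # placeholder custom image
      }
      output {
        File annotated_vcf = "annotated.vcf"
      }
    }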

    Best,

    Samantha

    0
  • Comment author
    Marco Baggio

    Hi Samantha,

    Thank you, that is indeed a great idea! I am just worried about adding 26GB to the image (it currently sits at 1GB), but I will test it and see how it compares to the current solution (and report back).

    Cheers,

    Marco

    0
  • Comment author
    Samantha (she/her)

    Hi Marco,

    Just a heads up, you may also want to increase the bootDiskSizeGb value for your task, given the potential size of your image, since the default is only 10GB. See https://cromwell.readthedocs.io/en/stable/RuntimeAttributes/#bootdisksizegb for more information.
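
    For example, in your task's runtime section (the sizes here are just placeholders):

    runtime {
      docker: "us.gcr.io/my-project/vep-with-cache:latest"   # placeholder ~27GB image
      bootDiskSizeGb: 40                                     # default is 10GB
      disks: "local-disk 50 HDD"
    }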

    Best,

    Samantha

    1
  • Comment author
    Marco Baggio

    Hi Samantha,

    Pulling the bigger Docker image takes about twice as long as the fastest method (localizing the zip archive and unzipping it), i.e. ~60 minutes vs ~30 minutes (probably due to limitations of hub.docker.com; another registry may give better performance).

    I decided to go with the latter option anyway, as it seems more "robust": for example, should I upgrade to a different version of Ensembl VEP, I can ensure that the database is always the correct one when building the Docker image. The difference in cost seems small enough to justify it.

    Thank you again for the suggestion!

    Marco

    0
  • Comment author
    alexander solovyov
    • Edited

    Hi Marco,

    I am setting up a workflow with a custom STAR index, which consists of multiple files, and ran into the same problem. I thought about the following solution (but have not tried it yet):

    1. Pack the index directory into an image - it can be an ISO file, or just a plain file (e.g., created with dd) formatted with some filesystem (say xfs) into which the index files are copied.

    2. Localize the image and mv it to index_image

    3. Mount it (e.g., https://www.cyberciti.biz/tips/how-to-mount-iso-image-under-linux.html for ISO image). For xfs:

    mkdir star_index
    mount -o loop -t xfs index_image /cromwell_root/star_index

    In theory this could save the ~5 minutes needed to extract 30GB from the archive.

    As far as I know, everything runs as root within the VM, so we should have the privileges to mount things as long as the mount command is available in our Docker image - I need to check this as well.
    Again, I have not tested it myself yet.
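
    A rough sketch of how steps 2-3 might look inside a WDL task (untested; the image and index names are illustrative):

    version 1.0

    task align_with_mounted_index {
      input {
        File index_image   # xfs (or ISO) image file containing the STAR index
      }
      command <<<
        # the job should run as root, so loop-mounting ought to work as long as
        # the mount command exists in the image and mounting is permitted
        mkdir -p star_index
        mount -o loop -t xfs ~{index_image} star_index
        ls star_index   # index files are now visible without unpacking anything
      >>>
      runtime {
        docker: "my-org/star-with-mount:latest"   # placeholder image
      }
    }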

    Unrelated: you can push your Docker image to GCR: https://support.terra.bio/hc/en-us/articles/360035638032-Publish-a-Docker-container-image-to-Google-Container-Registry-GCR-
    To the best of my knowledge, if it is in the same region (e.g., us-central), Google does not charge you for the network traffic when you localize it.

    0
  • Comment author
    Erik Wolfsohn

    Hi Alexander,

    Did your idea to mount the database as an image work, and if so, did you see any time/compute cost savings? I am running into the same issue while setting up a workflow for the NCBI PGAP annotation pipeline. The reference directory is over 100 GB and contains ~1,300 files.

    Best,
    Erik

    0
  • Comment author
    alexander solovyov
    • Edited

    Hi Erik,

    We have not done the run yet, will post an update if we manage to make it work.

    Best,

    Alexander.

    0
  • Comment author
    Daniella Matute

    Hello alexander solovyov and Erik Wolfsohn,

    Are there any updates from your runs?

    Best,

    Dany 

    0
