Switch Localization/Delocalization From gsutil to gcloud storage
When executing workflows, I believe localization and delocalization are done with gsutil rather than gcloud. I haven't benchmarked it, but localization does seem slow on Terra for large files, and this is one area where gcloud alpha storage should help.
4 comments
Hi Julian,
Thank you for writing in! I've sent this request to our development team for consideration, and I'll be happy to follow up with you if this feature gets built.
Kind regards,
Josh
Chiming in to agree. Right now, localization and delocalization are pretty slow. Even after doing what's suggested here plus switching to SSDs, I'm seeing speeds of about 150 GB/hour. For comparison, real-life transfer speed to an external hard drive connected via USB 3.0 seems to be over twice that. I'll be the first to admit my benchmarking was casual, but this back-of-the-napkin calculation indicates that file localization on Terra using big SSDs can be significantly slower than consumer-grade file transfer between HDDs.
For tasks that already take a while, this increases the chance that your VM will be taken away when everything is effectively done but files are still being delocalized (especially if you use preemptibles). In my current work, I've found tasks that aren't feasible to run on Terra because a quick calculation indicates that delocalization alone could take up most of the 168-hour time limit allotted to non-preemptible VMs.
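For anyone who wants to repeat that quick calculation, here is a rough sketch in Python. It only uses the two figures quoted above (about 150 GB/hour and the 168-hour limit); the function names and the 10,000 GB example size are made up for illustration, and decimal units (1 GB = 1000 MB) are assumed.

```python
# Back-of-the-napkin transfer estimates using the figures quoted in this thread.
# Assumption: decimal units, i.e. 1 GB = 1000 MB.


def gb_per_hour_to_mb_per_s(rate_gb_per_hour: float) -> float:
    """Convert a throughput in GB/hour to MB/s."""
    return rate_gb_per_hour * 1000 / 3600


def transfer_hours(size_gb: float, rate_gb_per_hour: float) -> float:
    """Hours needed to move size_gb at a steady rate_gb_per_hour."""
    return size_gb / rate_gb_per_hour


def max_gb_in_window(rate_gb_per_hour: float, window_hours: float) -> float:
    """Upper bound on data moved if the whole window were spent copying."""
    return rate_gb_per_hour * window_hours


if __name__ == "__main__":
    observed_rate = 150.0   # GB/hour, the casual measurement quoted above
    time_limit = 168.0      # hours, the limit quoted above
    example_size = 10_000.0  # GB, a hypothetical output size for illustration

    print(f"Observed rate: {gb_per_hour_to_mb_per_s(observed_rate):.1f} MB/s")
    print(f"Most data movable in the limit: "
          f"{max_gb_in_window(observed_rate, time_limit):.0f} GB")
    hours = transfer_hours(example_size, observed_rate)
    print(f"{example_size:.0f} GB would take about {hours:.1f} hours "
          f"({hours / time_limit:.0%} of the limit)")
```

Plugging in your own output size and measured rate shows how much of the time limit delocalization alone would consume.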
Hi Ash,
Thank you for voicing your support for this feature request! I will make a note of your concerns, which our product team will take into consideration when prioritizing new features.
Kind regards,
Pamela
Hi
Just chiming in again to say how big a difference this makes in my tests. There's also a gsutil compatibility layer which could get some of the benefits, but going to full `gcloud storage cp` takes throughput from about 13 MB/s to about 1 GB/s. It's hard because I can see the localization script and exactly where it could be changed :-)
It means that for some fastq files, the copy goes from an hour to just a few minutes, which has some demonstrable cost consequences too. So I've had to start running the `gcloud storage` copies manually myself. It's unfortunate because I lose a lot of the niceties of Terra (especially where I can use `Size` to estimate costs).
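For reference, the manual workaround looks roughly like the sketch below. This is not Terra's localization script, just a minimal wrapper around `gcloud storage cp` that could be called from inside a task. The helper names and the bucket path are hypothetical, and it assumes the gcloud CLI is installed and authenticated on the VM. It defaults to a dry run that only prints the command.

```python
import shlex
import subprocess
from typing import List


def build_cp_command(source: str, destination: str,
                     recursive: bool = False) -> List[str]:
    """Build the argument list for a `gcloud storage cp` invocation."""
    cmd = ["gcloud", "storage", "cp"]
    if recursive:
        cmd.append("--recursive")  # copy directories/prefixes recursively
    cmd += [source, destination]
    return cmd


def copy(source: str, destination: str, recursive: bool = False,
         dry_run: bool = True) -> List[str]:
    """Print the command (dry run) or execute it; raises if gcloud fails."""
    cmd = build_cp_command(source, destination, recursive)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical bucket and file names, for illustration only.
    copy("gs://example-bucket/reads/sample_R1.fastq.gz", "./inputs/")
```

Passing `dry_run=False` actually runs the copy. The catch is the one mentioned above: inputs copied this way are passed around as plain strings rather than files, so the workflow can no longer call `Size` on them.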