Best practices for running large data sets

April 08, 2021 23:26
3 comments

I am running a prototype to see how my lab can use Terra.

I ran a small test and found that it took about 6 hours to run our pipeline on 10 samples. We use preemptible containers. Overall, the cost was really low.

For the next part of my test, I plan to process about 400 samples. Here are some questions about how best to go about this:

1) My workflow does not use or need scatter. My 10-sample test caused 10 containers to spin up at the same time. If I selected all 400 samples at once, would this impact the availability of Terra to other users? What if I want to run on all 17,000 GTEx samples?

2) Given that we are using preemptible containers, I assume some will fail. Is there an easy way to select and re-run them?

3) Is there an easy way to know when everything is done? I do not want to have to babysit my batch jobs.

4) In an ideal world, the total wall-clock time for running all samples would be about the same as for running 10, and the overall cost would scale linearly. What should I expect?

Kind regards

Andy

Comments


Jason Cerrato
- April 12, 2021 18:26
Hi Andrew,

Thank you for writing in. We'll take a look at your questions and get back to you as soon as we can!

Kind regards,

Jason

Jason Cerrato
- April 13, 2021 15:23
Hi Andy,

Here are some answers for your questions:
1. Running large submissions would not affect other Terra users when it comes to Google Cloud resources, as those are defined by your billing project. Large submissions might have a minor impact on job queueing time, but our system is designed to handle submissions so that they don't hold up other submissions for long, if at all.
2. So long as you haven't changed your method configuration for your workflow, you will see a "Relaunch Failures" button on your job history page which will allow you to automatically kick off a resubmission of failed jobs!
3. Our development team is aware of the need for a notification system for workflow submissions. Unfortunately, we do not yet have one in place. I'm happy to follow up on this thread if I hear that it's been built!
4. Yes, that's the ideal scenario! I would say you can expect close to linear scaling, assuming nothing goes wrong: no preemptible failures, no data access issues, no resource configuration problems. Of course, the real world is hardly so smooth. You'll likely see some variation based on how your machine utilizes its memory, how often it fails due to preemption, etc. The big thing may be to ensure that your workflow doesn't have any spots where it can get caught in an infinite loop under certain conditions, as this could result in a huge bill. If the workflow is set to fail gracefully at the appropriate times, all data access is set up in advance, the data is similar in size and form across your workflows, and you've done your tests to get a good sense of what to expect, you should be in good form.
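Until a notification system exists, one possible stopgap for questions 2 and 3 is to tally workflow statuses yourself. The snippet below is only a minimal sketch: `summarize_submission` and the status names in `TERMINAL_STATUSES` are assumptions for illustration, not a Terra API, and how you obtain the list of statuses (for example, by copying them from the job history page) is left out entirely.

```python
from collections import Counter

# Assumed names for "finished" states; check them against what your
# job history page actually reports before relying on this.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Aborted"}


def summarize_submission(statuses):
    """Tally per-workflow statuses for one submission (hypothetical helper).

    `statuses` is a plain list of status strings, one per workflow.
    Returns whether every workflow has finished, the per-status counts,
    and how many workflows would need relaunching.
    """
    counts = Counter(statuses)
    # An empty list is treated as "not done" rather than vacuously done.
    done = bool(statuses) and all(s in TERMINAL_STATUSES for s in statuses)
    return {
        "done": done,
        "counts": dict(counts),
        "failed": counts.get("Failed", 0),
    }


if __name__ == "__main__":
    example = ["Succeeded"] * 8 + ["Failed", "Running"]
    print(summarize_submission(example))
```

Once `done` is true and `failed` is greater than zero, the "Relaunch Failures" button described in answer 2 is the place to go.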
Here is some general guidance about scaling you may want to read through before launching your bigger submissions: https://support.terra.bio/hc/en-us/articles/360059028911-Scaling-your-workflow-submissions
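To make the "close to linear" expectation from answer 4 concrete, here is a rough back-of-the-envelope sketch. It assumes the pilot's wall-clock time is roughly the per-sample run time (the 10 samples ran in 10 containers at once), that cost per sample stays constant, and that each rerun after a failure costs about one extra sample. The names `PilotRun` and `estimate_scaling`, the `max_concurrent` cap, and the dollar figure in the example are all hypothetical; the thread gives no actual cost.

```python
import math
from dataclasses import dataclass


@dataclass
class PilotRun:
    samples: int             # samples in the pilot (10 in this thread)
    wall_clock_hours: float  # pilot wall-clock time (about 6 hours in this thread)
    total_cost_usd: float    # placeholder: no dollar figure is given in the thread


def estimate_scaling(pilot, target_samples, max_concurrent=None, rerun_fraction=0.0):
    """Ideal-case extrapolation from a pilot run (illustrative only).

    max_concurrent: assumed cap on simultaneously running samples;
        None means everything runs at once.
    rerun_fraction: assumed fraction of samples that must be rerun once.
    Returns (estimated_cost_usd, estimated_wall_clock_hours).
    """
    per_sample_cost = pilot.total_cost_usd / pilot.samples
    # Linear cost, inflated by the share of samples that run a second time.
    cost = per_sample_cost * target_samples * (1.0 + rerun_fraction)
    # If samples cannot all run at once, they proceed in successive waves.
    if max_concurrent is None or max_concurrent >= target_samples:
        waves = 1
    else:
        waves = math.ceil(target_samples / max_concurrent)
    # Extra wall-clock time spent on reruns is ignored in this sketch.
    hours = pilot.wall_clock_hours * waves
    return cost, hours


if __name__ == "__main__":
    pilot = PilotRun(samples=10, wall_clock_hours=6.0, total_cost_usd=2.0)
    print(estimate_scaling(pilot, 400))
    print(estimate_scaling(pilot, 400, max_concurrent=100, rerun_fraction=0.1))
```

Plugging in your own pilot cost, and comparing the estimate against the actual 400-sample run, gives a sanity check before attempting something the size of the 17,000 GTEx samples.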

If you have any questions, please let us know!

Kind regards,

Jason
Andrew Davidson
- April 13, 2021 15:49
Thanks Jason

