In this guest blog post, Timothy Majarian, a Computational Associate in the Manning lab, gives us a glimpse of the lab's journey from its local high-performance compute cluster to cloud computing with Terra.
Back in 2017, our lab had just begun the transition from our local high-performance compute cluster to cloud computing. We were new to everything: a platform called FireCloud (soon to become Terra), Workflow Description Language (WDL) and so much more. While the transition certainly took some time -- with many missteps, help requests, and even some frantic forum searches -- we eventually became (semi-)pro users. Nowadays, Terra is essential to nearly all of our projects within the Manning lab.
It’s our hope that our story can dispel, or at least quell, some of the initial concerns felt by new-to-the-cloud researchers and help make the journey smoother by sharing what we’ve learned along the way.
As a lab, we’re interested in complex disease genetics, particularly among people of diverse populations. We predominantly study type 2 diabetes in large epidemiological cohorts, made possible by the Trans-Omics for Precision Medicine (TOPMed) program, sponsored by the National Institutes of Health’s National Heart, Lung and Blood Institute. TOPMed is part of a precision medicine initiative that focuses on whole genome sequencing (WGS) of many, many individuals - some 100,000+ study participants from 30+ cohorts. Given the massive scale of these data, the cross-institutional collaborative nature of our project, and the computational demands of our planned genome-wide association analyses with WGS data, we had no choice but to move our analyses and data storage to the cloud.
Before we could even touch the WGS data, we had to develop the tools and workflows needed for our analysis plan. We spent a summer learning WDL and Docker: how to set up the correct compute environment, describe inputs, scatter jobs to run in parallel on multiple virtual machines, localize the right scripts and ensure the outputs were gathered correctly.
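The scatter-gather pattern mentioned above can be sketched outside of WDL. The Python below is only an analogy, not our actual workflow: `run_association` is a hypothetical stand-in for a per-chromosome task, and local threads stand in for the separate virtual machines a scattered WDL task would run on.

```python
from concurrent.futures import ThreadPoolExecutor


def run_association(chromosome: str) -> dict:
    """Hypothetical per-shard task; a real one would run the association test."""
    return {"chromosome": chromosome, "output": f"{chromosome}.assoc.csv"}


def scatter_gather(chromosomes: list, max_workers: int = 4) -> list:
    """Run one task per shard in parallel, then gather results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs,
        # much like a scatter block gathering its outputs into an array.
        return list(pool.map(run_association, chromosomes))


if __name__ == "__main__":
    shards = [f"chr{i}" for i in range(1, 23)]
    gathered = scatter_gather(shards)
    print(len(gathered), gathered[0]["output"])
```

The point to take away is the shape of the computation: independent shards in, an ordered collection of outputs back, which is the part we most needed to understand before writing our own workflows.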
Lesson #1: First, get comfortable with the basics
Lesson #1 came out of these initial efforts (and we cannot stress it enough): take the time to really learn how WDL works, and develop and test your workflows with small datasets in a local computing environment. Too often we were stuck with an esoteric error message that could easily have been avoided had we better understood the particulars of WDL. Understanding how inputs and outputs are managed will spare you many headaches.
Lesson #2: Avoid wasting resources (i.e. money and time) by developing and testing locally
Next came development and testing, where concerns over cost usually begin to arise. While it is true that a large-scale analysis in the cloud can quickly become expensive, there are some simple ways to ensure that compute costs don’t run away from you. Again, we found that the initial investment of learning WDL is the best way to prevent unnecessary costs and avoid running a large job only to realize that your results either don’t make sense or are simply missing due to a mistake in your workflow.
Lesson #3: After your workflow succeeds locally, always test it in Terra with a small dataset
Testing locally is a great way to start the development cycle, but it won’t solve every issue involved in moving your analysis to the cloud. Mirroring the conditions of cloud-based analysis on your local machine is possible but, depending on your local setup, there may be slight differences. We found, for example, that local file paths don’t always behave the same as Google Cloud Storage URLs when used as workflow inputs or outputs.
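One small way to see the difference between the two kinds of input is to look at how each string parses. The sketch below is illustrative only: `describe_input` is a hypothetical helper, and the bucket and file names are made up.

```python
from urllib.parse import urlparse


def describe_input(path: str) -> dict:
    """Classify a workflow input as a Google Cloud Storage URL or a local path."""
    parsed = urlparse(path)
    if parsed.scheme == "gs":
        # For gs://bucket/object, the bucket is the network location and
        # the object name is the path without its leading slash.
        return {"kind": "gcs", "bucket": parsed.netloc, "name": parsed.path.lstrip("/")}
    return {"kind": "local", "bucket": None, "name": path}


if __name__ == "__main__":
    print(describe_input("gs://example-bucket/genotypes/chr22.vcf.gz"))
    print(describe_input("/home/user/genotypes/chr22.vcf.gz"))
```

A script that assumes every input is a readable local file will behave differently when handed the first form, which is why a small test run in the cloud is worth doing even after a clean local run.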
We often use a single chromosome for GWAS testing, which helps us minimize cost when we first move to running our workflows in Terra, while still using a representative dataset. Note that this strategy also pays off when running the full analysis: Terra’s caching feature will detect previous results and port them over rather than rerun the entire analysis. In the example above, the test chromosome will be skipped and its previously generated results linked over.
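The idea behind reusing the test chromosome's results can be illustrated with a toy cache keyed on a hash of each task's inputs. This is a simplified sketch of the concept, not Terra's actual mechanism; `cache_key`, `run_with_cache` and the task name are hypothetical.

```python
import hashlib
import json


def cache_key(task_name: str, inputs: dict) -> str:
    """Derive a stable key from the task name and its inputs."""
    payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(task_name, shards, cache, run):
    """Run `run(shard)` only for shards whose key is not already cached.

    Returns (results by shard, list of shards that were actually executed).
    """
    results, executed = {}, []
    for shard in shards:
        key = cache_key(task_name, {"chromosome": shard})
        if key not in cache:
            cache[key] = run(shard)
            executed.append(shard)
        results[shard] = cache[key]
    return results, executed


if __name__ == "__main__":
    cache = {}
    fake_task = lambda chrom: f"{chrom}.assoc.csv"  # stand-in for real work
    run_with_cache("assoc", ["chr22"], cache, fake_task)  # small test run
    _, executed = run_with_cache("assoc", ["chr21", "chr22"], cache, fake_task)
    print(executed)  # only the shard that was not run before
```

Because the key depends only on the task and its inputs, the shard from the test run is found in the cache and skipped during the full run.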
Having worked our way through these missteps, with the workflow on version I-can’t-even-count-that-high, we created the final version of our GWAS workflow for WGS data (with scripts developed with our TOPMed collaborators) and were able to dive into our genetic association results. If you’re interested, you can find our publicly available GWAS workflow here.
The payoff - a scalable, shareable and secure analysis
Our association analysis used WGS data from over 50,000 individuals and tested ~44.5 million genetic variants. Terra allowed us to share these results securely with colleagues across the US and the globe, and perform follow-up analyses like LD score regression and GCTA heritability analysis using the interactive Jupyter notebooks built into the Terra platform. All the while, we monitored costs to make sure we remained within our budget.
Resources and workflows that don’t require programming skills
While we went the route of developing our own tools, there are many workflows developed by the broader community that you can use in Terra. Two great resources for workflows are the Dockstore and the Broad Methods Repository. For users new to cloud computing, searching for applicable methods in either repository lets you learn the specifics of a workflow’s implementation, adapt it if necessary, and hit the ground running with your own development and analysis life cycle. When using pre-developed workflows, we would stress the importance of understanding the exact statistical methods underlying the workflow. While some tools, like file conversions, are ubiquitous, the vast majority should be used with careful consideration of your particular use case.
Final takeaway: initial investments in understanding cloud computing pay off in the end
What I hope you’ve taken away from our story is this: though you’ll need to make some initial investments to do your research using a cloud-based infrastructure, those investments will pay off many times over - especially as data grow and science trends towards even more collaboration across institutions. The practices outlined here - learning the details of WDL and Docker, testing locally, testing in the cloud, and, finally, running a full analysis - have helped our lab maximize the benefits of cloud computing. As in any discipline, a solid conceptual understanding of the tools and techniques you’ll use will greatly reduce both the time and cost of adopting cloud-based research. This includes workflow development, statistical and computational methods, and the particulars of the datasets you’ll be using. The lessons learned in developing our first pipelines, particularly testing locally and in the cloud, have continued to yield positive results as we design and implement new workflows to use in the cloud.
For more on some of the tools that we have developed, and an interactive example of running GWAS in the cloud, head over to our public workspace available through Terra’s showcase & tutorials page. The workspace was developed for the 2019 American Society of Human Genetics annual conference and leads a user through the common steps of a genome-wide association analysis.