Glossary of terms related to cloud-based genomics

Anton Kovalsky

This document is a glossary for scientists who already have some familiarity with high-performance computing (HPC) - such as current users of local HPC clusters in university departments or research institutes - and who have recently begun transitioning to cloud-based HPC solutions.

You can also get a great primer by visiting our New to Cloud resource page, or picking up a copy of Geraldine Van der Auwera and Brian O'Connor's riveting book, Genomics in the Cloud.

 

Glossary

1. API

2. Batch processing

3. Cloud computing

4. Cloud data storage

5. Cloud environment

6. Cluster

7. Containers

8. CPU

9. CWL

10. Data Biosphere

11. dbGaP

12. Docker

    - Docker daemon

    - Docker image

    - Docker Hub

13. eRA

14. FAIR principles

15. Framework services/FaaS

16. GA4GH

17. GATK

18. GCP

     - gcloud

     - GCP console

     - GCP project

     - Google buckets

     - Google Cloud SDK

     - Google cloud shell

     - gsutil

19. GPU

20. Hail

21. Interoperability

22. Jupyter Notebooks

23. Node

24. Portability

25. Secure Shell (SSH)

26. Spark

27. TCGA

28. TensorFlow

29. TPU

30. Virtual machine

31. WDL


 

API

An Application Programming Interface (API) is an interface that lets you interact with an application programmatically - for example, by running scripts - rather than by clicking through its user interface. APIs make it possible to automate your interactions with an application.
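
For example, a short script can query a service's API and retrieve the same information you would otherwise look up by clicking through its web interface. The sketch below is a minimal illustration using Python's requests library; the endpoint URL and access token are hypothetical placeholders, not a real service.

    import requests

    # Hypothetical REST endpoint and access token - placeholders only.
    API_URL = "https://api.example.org/v1/workspaces"
    ACCESS_TOKEN = "EXAMPLE_TOKEN"

    # Fetch the list of workspaces programmatically and repeatably,
    # instead of clicking through a web page.
    response = requests.get(API_URL, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
    response.raise_for_status()

    for workspace in response.json():
        print(workspace["name"])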

 

Batch processing

Batch processing is a type of computation in which multiple jobs are launched and processed by an automated system, either simultaneously or in sequence. Ideally, batch processing maximizes the efficiency of your workflows without requiring you to babysit every individual task.
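
As a toy illustration, the Python sketch below launches several independent jobs and collects their results as they finish; align_sample is a hypothetical stand-in for whatever task each job actually performs.

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def align_sample(sample_id: str) -> str:
        # Hypothetical placeholder for a real unit of work (e.g. aligning one sample).
        return f"{sample_id}: aligned"

    if __name__ == "__main__":
        samples = ["NA12878", "NA12891", "NA12892"]

        # Submit every job up front and let the executor schedule them;
        # no babysitting of individual tasks required.
        with ProcessPoolExecutor(max_workers=3) as executor:
            futures = {executor.submit(align_sample, s): s for s in samples}
            for future in as_completed(futures):
                print(future.result())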

 

Cloud computing

Cloud computing refers to computation performed on a "cloud infrastructure" - a network of computers, interfaced via the internet, that lets users log in and requisition customized computational resources on demand. The idea is to replace the need for expensive supercomputers dedicated to the needs of a small group by allowing users to request virtual storage and computational resources as needed.

 

Cloud data storage

Cloud data storage is a type of storage service offered by cloud providers that has certain advantages, most notably the ability to compute in the cloud without moving any data, but also expert security protections, ease of secure sharing, and reliable emergency backup. GCP has four different storage "classes" for data, based on differing retrieval needs. On one end of the spectrum is "Standard Storage", which is used for data that needs to be accessed frequently but does not need to persist in the long term (e.g., intermediate files needed for batch computation); on the other end is "Archive Storage", which is used for things like backups for disaster recovery or long-term storage required for legal or regulatory reasons.
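
For instance, with the google-cloud-storage Python client you can choose a storage class when creating a bucket. This is a minimal sketch, assuming a GCP project with billing enabled and application-default credentials already configured; the bucket name is a hypothetical placeholder.

    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()

    # Hypothetical bucket name; bucket names must be globally unique.
    bucket = client.bucket("my-lab-archive-bucket-example")

    # Pick a storage class based on how often the data will be accessed:
    # "STANDARD" for frequently accessed files, "ARCHIVE" for long-term backups.
    bucket.storage_class = "ARCHIVE"
    new_bucket = client.create_bucket(bucket, location="US")

    print(f"Created {new_bucket.name} with storage class {new_bucket.storage_class}")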

 

Cloud environment

A cloud environment is the set of hardware and software specifications that defines a virtual machine available through Terra. On Terra, you can create a virtual machine for interactive analysis and visualization - a subset of analyses that use popular interactive applications (e.g. Jupyter, RStudio, Galaxy) and code packages (e.g. Bioconductor, Seurat, numpy). This virtual machine will have whatever processing and storage specifications you choose from the available options, and it will come preinstalled with an operating system and any number of code packages that are part of the Docker image used to create it. The sum of these elements is known as a "cloud environment". Your cloud environment configuration persists as long as you want to keep it, so it is essentially the specification list - both hardware and software - for the virtual machine you've created for interactive analysis.

 

Cluster

A cluster is a set of computers that have been networked together in order to function as a single system. This is effectively a way to assemble a supercomputer on an ad hoc basis. A cluster is composed of nodes (see: "node").

 

Containers

If you're unfamiliar with the concept of containerized tools, you may want to read the entries related to Docker first. In the context of cloud computing, "containers" are runnable snapshots of computational environments that contain all necessary dependencies so that the versions of the tools included in a given container always function in a consistent way. "Containerizing" a set of tools is a way to isolate code dependencies and control reproducibility. 
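
Containers are usually built and run with the docker command-line tool, but they can also be driven programmatically. The sketch below uses the Docker SDK for Python and assumes Docker is installed and running locally; the image tag is just a small public example.

    import docker  # pip install docker; requires a running Docker daemon

    client = docker.from_env()

    # Run a throwaway container from a public image and capture its output.
    # Because the image pins its own interpreter and dependencies, the result
    # is the same on any machine that can run the container.
    output = client.containers.run("python:3.11-slim", ["python", "--version"], remove=True)
    print(output.decode().strip())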

 

CPU

A central processing unit (CPU) - also sometimes just called a "processor" - is the most basic element of a computational machine, sometimes described as the brain of a computer. The simplest computational machine will have a single CPU, and this is where computing processes like calculation and code compilation actually take place. As computers have evolved and become more sophisticated, so have CPUs. Even entry-level personal computers and laptops today come with multi-core CPUs, which effectively function as multiple CPUs working in parallel. In addition to CPUs, there are also more specialized types of processing units, such as GPUs and TPUs.

 

CWL

Common Workflow Language (CWL) is a programming language designed to describe computational workflows in data-intensive fields. A key feature of CWL is portability: it is designed to be interoperable, so CWL-based workflows should be reproducible on multiple platforms. CWL is supported by Cromwell, although not by Terra. (See also: "WDL")

 

Data Biosphere

The Data Biosphere is a concept proposed in 2017 by the Global Alliance for Genomics and Health that envisions a data ecosystem containing modular, interoperable components that can be assembled into diverse data environments. You can read about the proposal in the original Medium article here. Terra is a platform built to serve as a portal giving researchers and medical practitioners user-friendly access to as many global bioinformatic resources as possible, in line with this vision of a highly accessible data ecosystem.

 

dbGaP

The database of Genotypes and Phenotypes (dbGaP) is an online database developed by the National Center for Biotechnology Information to archive and distribute data from studies investigating the interaction of genotype and phenotype in humans.

 

Docker

Docker is an application that creates virtualized snapshots of operating environments called "containers". Key Docker concepts include:

  • Docker images - These are the templates that contain instructions for creating a container with a specific set of tools, packages, and preconfigured server environments.
  • Docker daemon - Sometimes just called "dockerd", this is a persistent background process that listens for Docker API requests and manages Docker objects.
  • Docker Hub - The platform that hosts Docker images.

 

eRA

The electronic Research Administration (eRA) is a program created by the National Institutes of Health to provide electronic systems that support the receipt, processing, review, award, and monitoring of grant funds, with the goal of increasing life expectancy and reducing the burdens of illness and disability by facilitating the funding of medical research.

 

FAIR principles

The FAIR principles are a set of guidelines for creating and curating data in a way that maximizes its machine-actionability with minimal human intervention. The acronym FAIR stands for findability, accessibility, interoperability, and reusability, and FAIR data is data that adheres to these principles.

 

FaaS

Frameworks as a Service (FaaS) is a type of software product somewhere between Software as a Service (SaaS) and Platforms as a Service (PaaS), in that it provides more structure than PaaS but less structure than SaaS. Whereas SaaS might be a highly specialized set of software solutions and PaaS might be an open-ended platform for hosting data, FaaS provides a platform along with a foundation for rapidly developing specialized applications. Some cloud-based services are called FaaS solutions because, in addition to cloud-based storage, they provide configurable computational services designed with specific types of tasks in mind (like image processing or genome analysis).

 

GA4GH

The Global Alliance for Genomics and Health (GA4GH) is a policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework.

 

GATK

The Broad Institute's Genome Analysis Toolkit (GATK) is an internationally recognized set of bioinformatic tools for human germline variant discovery and related topics in human genomics, including somatic variant discovery, copy number variation, and variant annotation/evaluation. It is also becoming increasingly popular in research of non-human genetics, especially in areas related to disease research.

 

GCP

The Google Cloud Platform (GCP) is a suite of cloud computing services that includes storage and various types of computation. Terra's compute resources primarily run on Google Cloud infrastructure, although other vendors may also be supported. Some of the features associated with GCP are listed below:

  • gcloud - The gcloud package is a set of command line tools (also sometimes referred to as a command-line interface aka CLI) used to create and manage Google Cloud resources.

  • GCP console - The Google Cloud Console is a web-based user interface that allows easy access to all of your GCP projects, news and documentation, all GCP APIs, and access to the Google Cloud Shell.

  • GCP project - A GCP project is a set of configuration settings that define how an app like Terra interacts with Google services and what resources it uses. A GCP project can give access to multiple users, be linked to billing resources, and be linked to resources on Terra.

  • Google buckets - A bucket is simply a storage location in GCP. Everything you store in Google's cloud storage is contained in a "bucket". This article on key terms in the GCP documentation explains a lot about buckets and how they relate to other concepts in GCP.

  • Google Cloud SDK - The Google Cloud SDK is a set of tools that allows you to manage GCP resources such as projects, billing, and storage. It includes packages like gcloud and gsutil, which provide command-line tools for doing things like creating GCP projects and shuttling data between locations.

  • Google Cloud Shell - The Google Cloud Shell is a lightweight interactive environment hosted on GCP that allows you to quickly and easily try out GCP functionality. It is a free-of-charge virtual machine with 1 CPU core and 5 GB of persistent disk storage that you can spin up by just clicking the terminal icon while you're in the GCP console.

  • gsutil - gsutil is a command-line tool that is part of the Cloud SDK and is useful for a variety of functions related to Google Cloud Storage. You can use gsutil to do things like create and delete buckets, as well as download, upload, delete, move, and rename objects within those buckets (a Python sketch of the same operations appears after this list).
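
The same object operations that gsutil performs from the command line can also be scripted with the google-cloud-storage Python client. Below is a minimal sketch that assumes application-default credentials are configured; the bucket, object, and file names are hypothetical placeholders.

    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()
    bucket = client.bucket("my-terra-workspace-bucket-example")  # hypothetical bucket

    # Roughly equivalent to: gsutil cp results.vcf gs://<bucket>/outputs/results.vcf
    blob = bucket.blob("outputs/results.vcf")
    blob.upload_from_filename("results.vcf")

    # Roughly equivalent to: gsutil ls gs://<bucket>/outputs/
    for obj in client.list_blobs(bucket, prefix="outputs/"):
        print(obj.name)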

 

GPU

A graphics processing unit (GPU) is a special class of processing unit (see: "CPU") that was originally designed to accelerate the rendering of three-dimensional graphics. More recently, GPUs have been used increasingly to accelerate bioinformatics algorithms such as sequence alignment and image analysis.

 

Hail

Hail is an open-source library for scalable data exploration and analysis, with a particular emphasis on genomics. It is essentially a library of tools for analyzing structured tabular and matrix data. Hail contains a collection of primitives for operating on data in parallel, as well as a suite of functionality for processing genetic data.
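
A minimal sketch of what working with Hail looks like, assuming Hail is installed (pip install hail) and using a hypothetical VCF path:

    import hail as hl  # pip install hail

    hl.init()  # starts a local, Spark-backed Hail session

    # Hypothetical path; on Terra this would typically be a gs:// URL.
    mt = hl.import_vcf("gs://my-bucket-example/cohort.vcf.bgz", reference_genome="GRCh38")

    # A MatrixTable holds variants (rows) by samples (columns).
    n_variants, n_samples = mt.count()
    print(f"{n_variants} variants across {n_samples} samples")

    # Basic per-sample quality control, computed in parallel.
    mt = hl.sample_qc(mt)
    mt.sample_qc.call_rate.show(5)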

 

Interoperability

In the context of cloud computing, the principle of interoperability helps ensure that data and tools can be used across different platforms, in order to maximize the ability of researchers and practitioners (such as medical professionals) to collaborate. In practice, this involves a lot of standardization of data and tools with respect to things like data formats, programming languages, and execution engines.

 

Jupyter Notebooks

A Jupyter Notebook is a programming environment that provides a convenient visual interface for creating, testing, and/or using code written in certain programming languages popular in data science, most notably Python and R.

 

Node

In cloud computing, a node refers to a single computer or device that is part of the network. A node is also sometimes simply called a single "machine". Multiple nodes networked together are referred to as a cluster (see: "cluster").

 

Portability

In the context of cloud computing, portability is the ability to migrate data and code applications between cloud platforms with minimal compatibility issues. Workflows are considered portable when they can be reliably reproduced on different platforms (see also: "Interoperability").

 

Secure Shell (SSH)

Secure Shell (aka Secure Socket Shell, aka SSH) is a network protocol that allows computers to communicate securely over an unsecured network. You can use SSH to connect to virtual machines, such as those accessed via the Google Cloud Shell.
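
As a quick illustration, the sketch below uses the paramiko Python library to open an SSH connection and run a single command; the host address, username, and key path are hypothetical placeholders.

    import paramiko  # pip install paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

    # Hypothetical VM address and credentials.
    client.connect("203.0.113.10", username="researcher", key_filename="/home/researcher/.ssh/id_rsa")

    # Run a command on the remote machine over the encrypted connection.
    stdin, stdout, stderr = client.exec_command("hostname")
    print(stdout.read().decode().strip())

    client.close()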

 

Spark

Spark is a data processing framework that allows users to run tasks on very large data sets by parallelizing computation across multiple processors or nodes. Terra comes with integrated Spark capabilities that you can learn more about here.
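
A minimal PySpark sketch, assuming pyspark is installed; it reads a hypothetical table of variant calls and counts records per chromosome in parallel.

    from pyspark.sql import SparkSession

    # Start (or attach to) a Spark session; on Terra this would be backed by
    # the cluster configured in your cloud environment.
    spark = SparkSession.builder.appName("variant-counts").getOrCreate()

    # Hypothetical tabular file of variant calls with a "chrom" column.
    df = spark.read.csv("gs://my-bucket-example/variants.csv", header=True)

    # The groupBy/count is distributed across the cluster's worker nodes.
    df.groupBy("chrom").count().show()

    spark.stop()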

 

TCGA

The Cancer Genome Atlas (TCGA) is an NIH-funded program that aims to strategically coordinate data such as gene expression, copy number variation, and clinical information in an effort to accelerate understanding of the molecular basis of cancer. TCGA is a landmark collaborative effort between the National Cancer Institute and the National Human Genome Research Institute that has so far molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. Terra users can access TCGA-based data by requesting access via dbGaP/eRA Commons.

 

TensorFlow

TensorFlow is an open-source software package developed at Google for training deep learning neural networks.
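
A minimal sketch of what that looks like in practice: defining and training a tiny Keras model on synthetic data (the shapes and data are purely illustrative).

    import numpy as np
    import tensorflow as tf

    # Synthetic data: 100 examples with 10 features each, and binary labels.
    features = np.random.rand(100, 10).astype("float32")
    labels = np.random.randint(0, 2, size=(100, 1))

    # A small fully connected neural network defined with the Keras API.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Training iteratively adjusts the network's weights via backpropagation.
    model.fit(features, labels, epochs=3, batch_size=16, verbose=1)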

 

TPU

A tensor processing unit (TPU) is a type of application-specific integrated circuit - a very sophisticated processing unit that specializes in (you guessed it) processing tensors, which are mathematical constructs relating large arrays of data. TPUs are designed to handle neural network machine learning, and were initially developed at Google, particularly to run with TensorFlow.

 

Virtual machine (VM)

A virtual machine (aka VM) is a virtual construct that is functionally equivalent to a computer - complete with processing power and storage capacity - whose technical specifications are determined by what a user requests, rather than by the hardware where the computation and storage actually take place. This is what makes cloud computing so flexible: when you create a virtual machine, it's just like setting up a new computer, except the power and configuration are determined by whatever you choose when creating that machine, and you can create, delete, modify, and replace these virtual machines on demand.

 

WDL

Workflow Description Language (WDL) is a community-driven programming language stewarded by the community at openwdl.org. It is designed for describing data-intensive computational workflows, with a focus on accessibility for scientists without deep programming expertise. As with CWL, portability is a key factor in its design; what differentiates WDL from CWL is that WDL is designed to be more human-readable, whereas CWL is primarily optimized for machine-readability.
