Glossary of terms related to cloud-based genomics

Anton Kovalsky

This document is a glossary for scientists who already have some familiarity with high-performance computing (HPC) - such as current users of local HPC clusters in university departments or research institutes - and who have recently begun transitioning to cloud-based HPC solutions.

You can also get a great primer by visiting our New to Cloud resource page, or picking up a copy of Geraldine Van der Auwera and Brian O'Connor's riveting book, Genomics in the Cloud.

 

Glossary

1. API

2. Batch processing

3. Cloud computing

4. Cloud data storage

5. Cloud environment

6. Cluster

7. Containers

8. CPU

9. CWL

10. Data Biosphere

11. dbGaP

12. Docker

    - Docker daemon

    - Docker image

    - Docker Hub

13. eRA

14. FAIR principles

15. Framework services/FaaS

16. GA4GH

17. GATK

18. GCP

     - gcloud

     - GCP console

     - GCP project

     - Google buckets

     - Google Cloud SDK

     - Google cloud shell

     - gsutil

19. GPU

20. Hail

21. Interoperability

22. Jupyter Notebooks

23. Node

24. Portability

25. Secure Shell (SSH)

26. Spark

27. TCGA

28. TensorFlow

29. TPU

30. Virtual machine

31. WDL


 

API

An Application Programming Interface (API) is an interface that lets you interact with an application programmatically - for example, by running scripts - rather than by clicking through its user interface. APIs make it possible to automate your interactions with an application.
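
For example, a short script can query a service's API and retrieve the same information you would otherwise look up by clicking through its web interface. The sketch below is a minimal illustration using Python's requests library; the endpoint URL and access token are hypothetical placeholders, not a real service.

    import requests

    # Hypothetical REST endpoint and access token - placeholders only.
    API_URL = "https://api.example.org/v1/workspaces"
    ACCESS_TOKEN = "EXAMPLE_TOKEN"

    # Fetch the list of workspaces programmatically and repeatably,
    # instead of clicking through a web page.
    response = requests.get(API_URL, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
    response.raise_for_status()

    for workspace in response.json():
        print(workspace["name"])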

 

Batch processing

Batch processing is a type of computation in which multiple jobs are launched and processed by an automated system, either simultaneously or in sequence. Ideally, batch processing maximizes the efficiency of your workflows without requiring you to babysit every individual task.
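
As a toy illustration, the Python sketch below launches several independent jobs and collects their results as they finish; align_sample is a hypothetical stand-in for whatever task each job actually performs.

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def align_sample(sample_id: str) -> str:
        # Hypothetical placeholder for a real unit of work (e.g. aligning one sample).
        return f"{sample_id}: aligned"

    if __name__ == "__main__":
        samples = ["NA12878", "NA12891", "NA12892"]

        # Submit every job up front and let the executor schedule them;
        # no babysitting of individual tasks required.
        with ProcessPoolExecutor(max_workers=3) as executor:
            futures = {executor.submit(align_sample, s): s for s in samples}
            for future in as_completed(futures):
                print(future.result())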

 

Cloud computing

Cloud computing refers to computation performed on a "cloud infrastructure" - a network of computers, interfaced via the internet, that lets users log in and requisition customized computational resources on demand. The idea is to replace the need for expensive supercomputers dedicated to the needs of a small group by allowing users to request virtual storage and computational resources as needed.

 

Cloud data storage

Cloud data storage is a type of storage service offered by cloud providers that has certain advantages, most notably the ability to compute in the cloud without moving any data, but also expert security protections, ease of secure sharing, and reliable emergency backup. GCP has four different storage "classes" for data, based on differing retrieval needs. On one end of the spectrum is "Standard Storage", which is used for data that needs to be accessed frequently but does not need to persist in the long term (e.g., intermediate files needed for batch computation); on the other end is "Archive Storage", which is used for things like backups for disaster recovery or long-term storage required for legal or regulatory reasons.
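
For instance, with the google-cloud-storage Python client you can choose a storage class when creating a bucket. This is a minimal sketch, assuming a GCP project with billing enabled and application-default credentials already configured; the bucket name is a hypothetical placeholder.

    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()

    # Hypothetical bucket name; bucket names must be globally unique.
    bucket = client.bucket("my-lab-archive-bucket-example")

    # Pick a storage class based on how often the data will be accessed:
    # "STANDARD" for frequently accessed files, "ARCHIVE" for long-term backups.
    bucket.storage_class = "ARCHIVE"
    new_bucket = client.create_bucket(bucket, location="US")

    print(f"Created {new_bucket.name} with storage class {new_bucket.storage_class}")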

 

Cloud environment

A cloud environment is the set of hardware and software specifications that defines a virtual machine available through Terra. On Terra, you can create a virtual machine for interactive analysis and visualization - a subset of analyses that use popular interactive applications (e.g. Jupyter, RStudio, Galaxy) and code packages (e.g. Bioconductor, Seurat, numpy). This virtual machine will have whatever processing and storage specifications you choose from the available options, and it will come preinstalled with an operating system and any number of code packages that are part of the Docker image used to create it. The sum of these elements is known as a "cloud environment". Your cloud environment configuration persists as long as you want to keep it, so it is essentially the specification list - both hardware and software - for the virtual machine you've created for interactive analysis.

 

Cluster

A cluster is a set of computers that have been networked together in order to function as a single system. This is effectively a way to assemble a supercomputer on an ad hoc basis. A cluster is composed of nodes (see: "node").

 

Containers

If you're unfamiliar with the concept of containerized tools, you may want to read the entries related to Docker first. In the context of cloud computing, "containers" are runnable snapshots of computational environments that contain all necessary dependencies so that the versions of the tools included in a given container always function in a consistent way. "Containerizing" a set of tools is a way to isolate code dependencies and control reproducibility. 
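
Containers are usually built and run with the docker command-line tool, but they can also be driven programmatically. The sketch below uses the Docker SDK for Python and assumes Docker is installed and running locally; the image tag is just a small public example.

    import docker  # pip install docker; requires a running Docker daemon

    client = docker.from_env()

    # Run a throwaway container from a public image and capture its output.
    # Because the image pins its own interpreter and dependencies, the result
    # is the same on any machine that can run the container.
    output = client.containers.run("python:3.11-slim", ["python", "--version"], remove=True)
    print(output.decode().strip())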

 

CPU

A central processing unit (CPU) - also sometimes just called a "processor" - is the most basic element of a computational machine, sometimes described as the brain of a computer. The simplest computational machine will have a single CPU, and this is where computing processes like calculation and code compilation actually take place. As computers have evolved and become more sophisticated, so have CPUs. Even entry-level personal computers and laptops today come with multi-core CPUs, which effectively function as multiple CPUs working in parallel. In addition to CPUs, there are also more specialized types of processing units, such as GPUs and TPUs.

 

CWL

Common Workflow Language (CWL) is a programming language designed to describe computational workflows in data-intensive fields. A key feature of CWL is portability: it is designed to be interoperable, so CWL-based workflows should be reproducible on multiple platforms. CWL is supported by Cromwell, although not by Terra. (See also: "WDL")

 

Data Biosphere

The Data Biosphere is a concept proposed in 2017 by the Global Alliance for Genomics and Health that envisions a data ecosystem containing modular, interoperable components that can be assembled into diverse data environments. You can read about the proposal in the original Medium article here. Terra is a platform built to serve as a portal giving researchers and medical practitioners user-friendly access to as many global bioinformatic resources as possible, in line with this vision of a highly accessible data ecosystem.

 

dbGaP

The database of Genotypes and Phenotypes (dbGaP) is an online database developed by the National Center for Biotechnology Information to archive and distribute data from studies investigating the interaction of genotype and phenotype in humans.

 

Docker

Docker is an application that creates virtualized snapshots of operating environments called "containers". Key Docker concepts include:

  • Docker images - These are the templates that contain instructions for creating a container with a specific set of tools, packages, and preconfigured server environments.
  • Docker daemon - Sometimes just called "dockerd", this is a persistent background process that listens for Docker API requests and manages Docker objects.
  • Docker Hub - The platform that hosts Docker images.

 

eRA

The electronic Research Administration (eRA) is a program created by the National Institutes of Health to provide electronic systems that support the receipt, processing, review, award, and monitoring of grant funds, with the goal of increasing life expectancy and reducing the burdens of illness and disability by facilitating the funding of medical research.

 

FAIR principles

The FAIR principles are a set of guidelines for creating and curating data in a way that maximizes its machine-actionability with minimal human intervention. The acronym FAIR stands for findability, accessibility, interoperability, and reusability, and FAIR data is data that adheres to these principles.

 

FaaS

Frameworks as a Service (FaaS) is a type of software product somewhere between Software as a Service (SaaS) and Platforms as a Service (PaaS), in that it provides more structure than PaaS but less structure than SaaS. Whereas SaaS might be a highly specialized set of software solutions and PaaS might be an open-ended platform for hosting data, FaaS provides a platform along with a foundation for rapidly developing specialized applications. Some cloud-based services are called FaaS solutions because, in addition to cloud-based storage, they provide configurable computational services designed with specific types of tasks in mind (like image processing or genome analysis).

 

GA4GH

The Global Alliance for Genomics and Health (GA4GH) is a policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework.

 

GATK

The Broad Institute's Genome Analysis Toolkit (GATK) is an internationally recognized set of bioinformatic tools for human germline variant discovery and related topics in human genomics, including somatic variant discovery, copy number variation, and variant annotation/evaluation. It is also becoming increasingly popular in research of non-human genetics, especially in areas related to disease research.

 

GCP

The Google Cloud Platform (GCP) is a suite of cloud computing services that includes storage and various types of computation. Terra's compute resources primarily run on Google Cloud infrastructure, although other vendors may also be supported. Some of the features associated with GCP are listed below:

  • gcloud - The gcloud package is a set of command line tools (also sometimes referred to as a command-line interface aka CLI) used to create and manage Google Cloud resources.

  • GCP console - The Google Cloud Console is a web-based user interface that allows easy access to all of your GCP projects, news and documentation, all GCP APIs, and access to the Google Cloud Shell.

  • GCP project - A GCP project is a set of configuration settings that define how an app like Terra interacts with Google services and what resources it uses. A GCP project can give access to multiple users, be linked to billing resources, and be linked to resources on Terra.

  • Google buckets - A bucket is simply a storage location in GCP. Everything you store in Google's cloud storage is contained in a "bucket". This article on key terms in the GCP documentation explains a lot about buckets and how they relate to other concepts in GCP.

  • Google Cloud SDK - The Google Cloud SDK is a set of tools that allows you to manage GCP resources such as projects, billing, and storage. It includes packages like gcloud and gsutil, which provide command-line tools for doing things like creating GCP projects and shuttling data between locations.

  • Google Cloud Shell - The Google Cloud Shell is a lightweight interactive environment hosted on GCP that allows you to quickly and easily try out GCP functionality. It is a free-of-charge virtual machine with 1 CPU core and 5 GB of persistent disk storage that you can spin up by just clicking the terminal icon while you're in the GCP console.

  • gsutil - gsutil is a command-line tool that is part of the Cloud SDK and is useful for a variety of functions related to Google Cloud Storage. You can use gsutil to do things like create and delete buckets, as well as download, upload, delete, move, and rename objects within those buckets (a Python sketch of the same operations appears after this list).
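
The same object operations that gsutil performs from the command line can also be scripted with the google-cloud-storage Python client. Below is a minimal sketch that assumes application-default credentials are configured; the bucket, object, and file names are hypothetical placeholders.

    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()
    bucket = client.bucket("my-terra-workspace-bucket-example")  # hypothetical bucket

    # Roughly equivalent to: gsutil cp results.vcf gs://<bucket>/outputs/results.vcf
    blob = bucket.blob("outputs/results.vcf")
    blob.upload_from_filename("results.vcf")

    # Roughly equivalent to: gsutil ls gs://<bucket>/outputs/
    for obj in client.list_blobs(bucket, prefix="outputs/"):
        print(obj.name)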

 

GPU

A graphics processing unit (GPU) is a special class of processing unit (see: "CPU") that was originally designed to accelerate the rendering of three-dimensional graphics. More recently, GPUs have been used increasingly to accelerate bioinformatics algorithms such as sequence alignment and image analysis.

 

Hail

Hail is an open-source library for scalable data exploration and analysis, with a particular emphasis on genomics. It is essentially a library of tools for analyzing structured tabular and matrix data. Hail contains a collection of primitives for operating on data in parallel, as well as a suite of functionality for processing genetic data.
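
A minimal sketch of what working with Hail looks like, assuming Hail is installed (pip install hail) and using a hypothetical VCF path:

    import hail as hl  # pip install hail

    hl.init()  # starts a local, Spark-backed Hail session

    # Hypothetical path; on Terra this would typically be a gs:// URL.
    mt = hl.import_vcf("gs://my-bucket-example/cohort.vcf.bgz", reference_genome="GRCh38")

    # A MatrixTable holds variants (rows) by samples (columns).
    n_variants, n_samples = mt.count()
    print(f"{n_variants} variants across {n_samples} samples")

    # Basic per-sample quality control, computed in parallel.
    mt = hl.sample_qc(mt)
    mt.sample_qc.call_rate.show(5)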

 

Interoperability

In the context of cloud computing, the principle of interoperability helps ensure that data and tools can be used across different platforms, in order to maximize the ability of researchers and practitioners (such as medical professionals) to collaborate. In practice, this involves a lot of standardization of data and tools with respect to things like data formats, programming languages, and execution engines.

 

Jupyter Notebooks

A Jupyter Notebook is a programming environment that provides a convenient visual interface for creating, testing, and/or using code written in certain programming languages popular in data science, most notably Python and R.

 

Node

In cloud computing, a node refers to a single computer or device that is part of the network. A node is also sometimes simply called a single "machine". Multiple nodes networked together are referred to as a cluster (see: "cluster").

 

Portability

In the context of cloud computing, portability is the ability to migrate data and code applications between cloud platforms with minimal compatibility issues. Workflows are considered portable when they can be reliably reproduced on different platforms (see also: "Interoperability").

 

Secure Shell (SSH)

Secure Shell (aka Secure Socket Shell, aka SSH) is a network protocol that allows computers to communicate securely over an unsecured network. You can use SSH to connect to virtual machines, such as those accessed via the Google Cloud Shell.
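
As a quick illustration, the sketch below uses the paramiko Python library to open an SSH connection and run a single command; the host address, username, and key path are hypothetical placeholders.

    import paramiko  # pip install paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

    # Hypothetical VM address and credentials.
    client.connect("203.0.113.10", username="researcher", key_filename="/home/researcher/.ssh/id_rsa")

    # Run a command on the remote machine over the encrypted connection.
    stdin, stdout, stderr = client.exec_command("hostname")
    print(stdout.read().decode().strip())

    client.close()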

 

Spark

Spark is a data processing framework that allows users to run tasks on very large data sets by parallelizing computation across multiple processors or nodes. Terra comes with integrated Spark capabilities that you can learn more about here.
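
A minimal PySpark sketch, assuming pyspark is installed; it reads a hypothetical table of variant calls and counts records per chromosome in parallel.

    from pyspark.sql import SparkSession

    # Start (or attach to) a Spark session; on Terra this would be backed by
    # the cluster configured in your cloud environment.
    spark = SparkSession.builder.appName("variant-counts").getOrCreate()

    # Hypothetical tabular file of variant calls with a "chrom" column.
    df = spark.read.csv("gs://my-bucket-example/variants.csv", header=True)

    # The groupBy/count is distributed across the cluster's worker nodes.
    df.groupBy("chrom").count().show()

    spark.stop()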

 

TCGA

The Cancer Genome Atlas (TCGA) is an NIH-funded program that aims to strategically coordinate data such as gene expression, copy number variation, and clinical information in an effort to accelerate understanding of the molecular basis of cancer. TCGA is a landmark collaborative effort between the National Cancer Institute and the National Human Genome Research Institute that has so far molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. Terra users can access TCGA-based data by requesting access via dbGaP/eRA Commons.

 

TensorFlow

TensorFlow is an open-source software package developed at Google for training deep learning neural networks.
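
A minimal sketch of what that looks like in practice: defining and training a tiny Keras model on synthetic data (the shapes and data are purely illustrative).

    import numpy as np
    import tensorflow as tf

    # Synthetic data: 100 examples with 10 features each, and binary labels.
    features = np.random.rand(100, 10).astype("float32")
    labels = np.random.randint(0, 2, size=(100, 1))

    # A small fully connected neural network defined with the Keras API.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Training iteratively adjusts the network's weights via backpropagation.
    model.fit(features, labels, epochs=3, batch_size=16, verbose=1)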

 

TPU

A tensor processing unit (TPU) is a type of application-specific integrated circuit - a very sophisticated processing unit that specializes in (you guessed it) processing tensors, which are mathematical constructs relating large arrays of data. TPUs are designed to handle neural network machine learning, and were initially developed at Google, particularly to run with TensorFlow.

 

Virtual machine (VM)

A virtual machine (aka VM) is a virtual construct that is functionally equivalent to a computer - complete with processing power and storage capacity - whose technical specifications are determined by what a user requests, rather than by the hardware where the computation and storage actually take place. This is what makes cloud computing so flexible: when you create a virtual machine, it's just like setting up a new computer, except the power and configuration are determined by whatever you choose when creating that machine, and you can create, delete, modify, and replace these virtual machines on demand.

 

WDL

Workflow Description Language (WDL) is a community-driven programming language stewarded by the community at openwdl.org. It is designed for describing data-intensive computational workflows, with a focus on accessibility for scientists without deep programming expertise. As with CWL, portability is a key factor in its design; what differentiates WDL from CWL is that WDL is designed to be more human-readable, whereas CWL is primarily optimized for machine-readability.
