This document is a glossary for scientists who already have some familiarity with high-performance computing (HPC) - such as current users of local HPC clusters in university departments or research institutes - and who have recently begun transitioning to cloud-based HPC solutions.
You can also get a great primer by visiting our New to Cloud resource page, or picking up a copy of
2. Batch processing
3. Cloud computing
4. Cloud data storage
5. Cloud environment
10. Data Biosphere
- Docker daemon
- Docker image
14. FAIR principles
15. Framework services/FaaS
- GCP console
- GCP project
- Google buckets
- Google Cloud SDK
- Google cloud shell
22. Jupyter Notebooks
25. Secure Shell (SSH)
30. Virtual machine
An Application Programming Interface (API) allows you to interact with an application programmatically - by running scripts, for example - rather than by clicking through the user interface. APIs enable you to automate your interactions with an application.
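As a minimal sketch of what interacting programmatically can look like, the Python example below starts a tiny stand-in web service on the local machine and then queries it from a script in a loop. The service, the `/jobs/<id>` path, and the job IDs are hypothetical and exist only for illustration; they are not part of any real application's API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for an application that exposes an API."""

    def do_GET(self):
        # Reply to every GET request with a small JSON document.
        body = json.dumps({"path": self.path, "status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request log lines


def query_status(base_url, job_ids):
    """Ask the API about several jobs in a loop instead of clicking through a UI."""
    opener = build_opener(ProxyHandler({}))  # talk to localhost directly, no proxy
    results = {}
    for job_id in job_ids:
        with opener.open(f"{base_url}/jobs/{job_id}") as response:
            results[job_id] = json.loads(response.read().decode("utf-8"))
    return results


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), StatusHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

statuses = query_status(base_url, ["a1", "b2", "c3"])
for job_id, info in statuses.items():
    print(job_id, info["status"])

server.shutdown()
server.server_close()
```

The point of the sketch is the loop in `query_status`: once an application offers an API, checking three jobs or three thousand is the same few lines of code.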
Batch processing is a type of computation in which multiple jobs are launched and processed by an automated system in some combination of simultaneous or sequential order. Ideally, batch processing maximizes the efficiency of workflows, freeing you from having to babysit every individual task.
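The idea can be sketched in a few lines of Python. The example below is an illustration only, using the standard library's thread pool rather than a real cluster scheduler; the `gc_fraction` job and the sample names are hypothetical. With two workers and four jobs, up to two jobs run at the same time while the rest wait in a queue, which is the mix of simultaneous and sequential execution described above.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(sequence):
    """Hypothetical job: fraction of G and C bases in one DNA sequence."""
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)


# Hypothetical inputs; in practice each job might be one sample or one file.
jobs = {
    "sample_1": "ATGCGC",
    "sample_2": "AATT",
    "sample_3": "GGCC",
    "sample_4": "ATGC",
}

# Two workers: up to two jobs run at once, the others queue until a worker is free.
with ThreadPoolExecutor(max_workers=2) as pool:
    gc_values = list(pool.map(gc_fraction, jobs.values()))

# pool.map returns results in the same order as the inputs.
results = dict(zip(jobs.keys(), gc_values))
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```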
Cloud computing is computation run on a "cloud infrastructure" - a network of computers, accessed over the internet, that lets users log in and requisition customized computational resources on demand. The idea is to replace the need for expensive supercomputers dedicated to the needs of a small group. Instead, users request virtual storage and computational resources as needed.
Cloud data storage
A service offered by cloud providers for storing unstructured data (digital files). Cloud data storage lets you compute in the cloud without moving any data, and it also offers advantages such as expert security protections, ease of secure sharing, and reliable emergency backup.
Google Cloud has four different storage "classes" for data based on different retrieval needs. On one end of the spectrum is "Standard" storage, used for data that needs to be accessed frequently over a limited period of time but does not need to persist in the long term (e.g., intermediate files for batch computation). On the other end of the spectrum is "Archive" storage, used for disaster-recovery backups or long-term storage (required for legal or regulatory reasons, for example). The "Nearline" and "Coldline" classes sit in between.
A set of hardware and software specifications that defines a type of virtual machine available through Terra for running an interactive analysis (Galaxy, Jupyter notebook, or RStudio). On Terra, you can create a virtual machine (VM) for interactive analysis and visualization - a subset of analyses that utilize some popular interactive applications (e.g., Jupyter, RStudio, Galaxy) and code packages (e.g., Bioconductor, Seurat, numpy). The virtual machine includes processing and storage specifications chosen from available options, and also comes preinstalled with an operating system and code packages that are part of the Docker image used to create the virtual machine. The sum of these elements is known as a "cloud environment". Your cloud environment configuration persists as long as you want to keep it.
A set of computers that have been networked together to function as a single system. A cluster is composed of nodes (see: "node") and functions like a supercomputer on an ad hoc basis.
If you're unfamiliar with the concept of containerized tools, you may want to read the entries related to Docker first. In the context of cloud computing, "containers" are runnable snapshots of computational environments that contain all necessary dependencies so that the versions of the tools included in a given container always function in a consistent way. "Containerizing" a set of tools is a way to isolate code dependencies and control reproducibility.
A central processing unit (CPU) - also sometimes just called a "processor" - is the most basic element of a computational machine, sometimes described as the brain of a computer. The simplest computational machine will have a single CPU, and this is where computing processes like calculation and code compiling actually take place. As computers have evolved and become more sophisticated, so have CPUs. Even entry-level personal computers and laptops available today come with multicore CPUs that effectively function as multiple CPUs working in parallel. Besides CPUs, there are more specialized types of processing units, such as GPUs and TPUs.
Common Workflow Language (CWL) is a language for describing computational workflows in data-intensive fields. A key feature of CWL is portability: it is designed to be interoperable, so CWL-based workflows should be reproducible on multiple platforms. CWL is supported by Cromwell, although not by Terra. (See also: "WDL")
The database of Genotypes and Phenotypes (dbGaP) is an online database developed by the National Center for Biotechnology Information to archive and distribute data from studies investigating the interaction of genotype and phenotype in humans.
The Data Biosphere is a concept proposed in 2017 by the Global Alliance for Genomics and Health that envisions a data ecosystem which contains modular and interoperable components that can be assembled into diverse data environments. You can read about the proposal in the original Medium article here. Terra is a platform built to serve as a portal for researchers and medical practitioners to have user-friendly access to as many global bioinformatic resources as possible, in line with the vision for a data ecosystem that is highly accessible.
Docker is an application that creates virtualized snapshots of operating environments called "containers". Key Docker terms include:
- Docker images - These are the templates that contain instructions for creating a container with a specific set of tools, packages, and preconfigured server environments.
- Docker daemon - Sometimes just called "dockerd", this is a persistent background process that listens for requests from the Docker API and manages Docker objects.
- Docker Hub - A platform that hosts Docker images.
The Electronic Research Administration (eRA) is a program created by the National Institutes of Health (NIH). It provides electronic systems that support the receipt, processing, review, award, and monitoring of grant funds. By facilitating the funding of medical research, it aims to help increase life expectancy and reduce the burdens of illness and disability.
The FAIR principles are a set of guidelines for creating and curating data in a way that maximizes its machine-actionability with minimal human intervention. The acronym FAIR stands for findability, accessibility, interoperability, and reusability, and FAIR data are data that adhere to these principles.
Framework as a Service (FaaS) is a type of software product that sits somewhere between Software as a Service (SaaS) and Platform as a Service (PaaS). It provides more structure than PaaS but less structure than SaaS. Whereas SaaS might be a highly specialized set of software solutions and PaaS might be an open-ended platform for hosting data, FaaS provides a platform along with a foundation for rapidly developing specialized applications. Some cloud-based services are called FaaS solutions because, in addition to cloud-based storage, they provide configurable computational services designed with specific types of tasks in mind (like image processing or genome analysis).
The Global Alliance for Genomics and Health (GA4GH) is a policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework.
The Broad Institute's Genome Analysis Toolkit (GATK) is an internationally recognized set of bioinformatic tools for human germline variant discovery and related topics in human genomics, including somatic variant discovery, copy number variation, and variant annotation/evaluation. It is becoming increasingly popular in research of nonhuman genetics, especially in areas related to disease research.
Google Cloud is a suite of cloud-computing services that includes a variety of functions for storage and various types of computation. The compute resources of Terra primarily run on the Google Cloud infrastructure, although other vendors may also be supported. Some of the features associated with Google Cloud are listed below:
gcloud - gcloud is a set of command-line tools (a command-line interface, or CLI) used to create and manage Google Cloud resources.
Google Cloud console - The Google Cloud console is a web-based user interface that provides access to all of your Google Cloud projects, news and documentation, all Google Cloud APIs, and the Google Cloud Shell.
Google project - A Google Cloud project is a set of configuration settings that define how an app like Terra interacts with Google services and what resources it uses. A project can grant access to multiple users and can be linked to billing resources and to resources on Terra.
Google buckets - A bucket is a storage location in Google Cloud. Everything you store in Google Cloud Storage is contained in a bucket. The article on key terms in the Google Cloud documentation explains more about buckets and how they relate to other concepts in Google Cloud.
Google Cloud SDK - The Google Cloud SDK is a set of tools for managing Google Cloud resources such as projects, billing, and storage. It includes gcloud and gsutil, which provide command-line tools for tasks like creating Google Cloud projects and moving data between locations.
Google cloud shell - The Google Cloud Shell is a lightweight interactive environment hosted on Google Cloud that lets you quickly try out Google Cloud functionality. It is a free-of-charge virtual machine with 1 CPU core and 5 GB of persistent disk storage that you can spin up by clicking the terminal icon while you're in the Google Cloud console.
gsutil - gsutil is a package of command line tools that is part of Cloud SDK and is useful for a variety of functions related to Google Cloud storage. You can use gsutil to do things like create and delete buckets, as well as download, upload, delete, move, and rename objects within those buckets.
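Because gsutil is a command-line tool, it is easy to drive from a script. The Python sketch below only assembles the argument list for a `gsutil cp` call and prints it. The helper names, the bucket `gs://my-example-bucket`, and the file names are hypothetical, and actually running the command assumes gsutil is installed and authenticated on your machine.

```python
import shutil
import subprocess


def gsutil_copy_command(source, destination, recursive=False):
    """Build the argument list for a `gsutil cp` call (hypothetical helper)."""
    command = ["gsutil"]
    if recursive:
        # -m runs the transfer in parallel; cp -r copies a whole directory tree.
        command += ["-m", "cp", "-r"]
    else:
        command += ["cp"]
    command += [source, destination]
    return command


def run_if_available(command):
    """Run the command only when gsutil is on the PATH; otherwise return None."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=True)


# Hypothetical bucket and file names, used for illustration only.
upload = gsutil_copy_command("results.tsv", "gs://my-example-bucket/results.tsv")
print(" ".join(upload))
```

Passing the list to `run_if_available(upload)` would perform the copy on a machine where gsutil is set up. Building the command as a list, rather than a single string, avoids shell-quoting problems with unusual file names.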
A graphics processing unit (GPU) is a special class of processing unit (see: "CPU") that was originally designed to accelerate the rendering of three-dimensional graphical output. More recently, GPUs have been used increasingly to accelerate bioinformatics tasks such as sequence alignment and image analysis.
Hail is an open-source library for scalable data exploration and analysis, with a particular emphasis on genomics. It is essentially a library of tools for analyzing structured tabular and matrix data. Hail contains a collection of primitives for operating on data in parallel, as well as a suite of functionality for processing genetic data.
In the context of cloud computing, the principle of interoperability helps ensure that data and tools can be run across different platforms to maximize the ability of researchers and practitioners (such as medical professionals) to collaborate. In practice, this involves a lot of synchronization of data and tools with respect to data formats, programming languages, and execution engines.
A Jupyter Notebook is a programming environment that provides a convenient visual interface for creating, testing, and/or using code written in certain programming languages popular in data science, most notably Python and R.
In cloud computing, a node refers to a single computer or device that is part of the network. A node is also sometimes simply called a single "machine". Multiple nodes networked together are referred to as a cluster (see: "cluster").
In the context of cloud computing, portability is the ability to migrate data and code applications between cloud platforms with minimal compatibility issues. Workflows are considered portable when they can be reliably reproduced on different platforms (see also: "Interoperability").
Secure Shell (SSH)
Secure Shell (aka Secure Socket Shell, aka SSH) is a network protocol that lets you securely log in to, and run commands on, a remote computer over an unsecured network. You can use SSH to connect to virtual machines such as those accessed via Google Cloud Shell.
Spark is a data processing framework that lets users run tasks on very large data sets by parallelizing the computation across multiple processors or across the nodes of a cluster. Terra comes with integrated Spark capabilities.
The Cancer Genome Atlas is an NIH-funded program that aims to strategically coordinate data such as gene expression, copy number variation and clinical information in an effort to accelerate understanding of the molecular basis of cancer. TCGA is a landmark collaborative effort between the National Cancer Institute and National Human Genome Research Institute that has so far molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. Terra users can access TCGA-based data by requesting access via dbGaP/eRA Commons.
TensorFlow is an open-source software package developed at Google for training deep learning neural networks.
A tensor processing unit (TPU) is a type of application-specific integrated circuit (ASIC): a processing unit specialized for operations on tensors, which are multidimensional arrays of numbers. TPUs are designed to handle neural network machine learning and were developed at Google, initially to run with TensorFlow.
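To make "processing tensors" a little more concrete, the sketch below multiplies two small matrices (rank-2 tensors) in plain Python. It uses neither a TPU nor TensorFlow, and the `matmul` helper and the numbers are made up for illustration. Matrix multiplication is the kind of arithmetic that dominates neural network workloads, and it is this arithmetic that specialized hardware is built to perform at much larger scale.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists (rank-2 tensors)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


# Made-up numbers: 2 examples with 2 features each, and a 2x2 weight matrix.
inputs = [[1.0, 2.0], [3.0, 4.0]]
weights = [[0.5, -1.0], [0.25, 2.0]]

# One layer of a neural network is, at its core, a product like this one.
outputs = matmul(inputs, weights)
print(outputs)
```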
Virtual machine (VM)
A virtual machine (aka VM) is a virtual construct that is functionally equivalent to a computer - complete with processing power and storage capacity - whose technical specifications are determined by what a user requests, rather than by the hardware where the computation and storage actually take place. This is what makes cloud computing so flexible. Creating a virtual machine is just like setting up a new computer, except that its power and configuration are whatever you choose at creation time, and you can create, delete, modify, and replace virtual machines on demand.
Workflow Description Language (WDL) is a community-driven workflow language stewarded by the OpenWDL community at openwdl.org. It is designed for describing data-intensive computational workflows, with a focus on accessibility for scientists without deep programming expertise. As with CWL, portability is a key factor in its design. What differentiates WDL from CWL is that WDL is designed to be more human-readable, whereas CWL is primarily optimized for being machine-readable.