Accessing workspace data tables using AnVIL's R package

Anton Kovalsky

Learn to work with workspace tables and data resources directly from a Jupyter notebook or RStudio analysis using the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) R package.

What the AnVIL R package does

The package includes R-based utilities for working with data in tables and other data resources, as well as convenient functions for moving files to and from Google Cloud Storage. Once the R package is installed in your analysis app, you can use its commands in a notebook or RStudio analysis for tasks such as bringing data from a table into the interactive analysis or manipulating tables from within a notebook or RStudio.

To learn more about the R package code, see AnVIL's R documentation page.

How to use AnVIL's R package

In RStudio

All RStudio Cloud Environments use the R/Bioconductor base image by default. This means you can use the AnVIL R package commands out of the box when running RStudio.

See How to launch and customize your RStudio app

In a Jupyter notebook

To use the AnVIL R package in a Jupyter Notebook on Terra, you'll need to create a Jupyter Cloud Environment using the R/Bioconductor base image. If you don't use this image, you'll need to figure out what dependencies are missing from the environment you're using and install them yourself (but why bother!). See step-by-step instructions below.

Using the R/Bioconductor base image in a notebook (step-by-step)

  • 1. To see your Jupyter Cloud Environment configuration form, with the current values of your cloud environment, click the gear icon at the top right of any workspace to reveal your Cloud Environment pane.
    Start-Jupyter-environment-by-clicking-gear-icon_Screen_shot.png

    2. If you haven't yet created or customized a cloud environment, you will see the defaults in the Cloud Environment pane. Select the Customize button at the bottom right.
    Cloud-Environment_Default_Screen_shot.png

    3. When you select Customize, or if Jupyter is running already, you'll see the Cloud Environment configuration page (screenshot below). Note that it has fewer options and is much simpler to adjust in Terra than the equivalent Google Cloud Platform interface!
    Cloud-Environment-Setup_Screen_shot.png

    4. Select the R/Bioconductor option under the Application configuration menu within the Jupyter Cloud Environment pane:
    Screen_Shot_2021-07-18_at_11.22.01_PM.png

    5. Once you have finished any additional customizations, click the blue Create button to start or recreate your Jupyter Cloud Environment with the R/Bioconductor image.

    This environment will come preinstalled with all of the dependencies you'll need to load and use the AnVIL package:

    BiocManager::install("AnVIL")
    library("AnVIL")
  • 1. To see your Jupyter Cloud Environment configuration form, with the current values of your cloud environment, click the cloud icon in the right sidebar.

    2. Click the Jupyter gear icon (Environment settings) in the Cloud Environment Details pane.

    This will surface the Jupyter Cloud Environment pane.

    3. If you haven't yet created or customized a cloud environment, you will see the defaults in the Cloud Environment pane. Select the Customize button at the bottom right.
    Jupyter-Cloud-Environment-defaults_Screen_shot.png

    4. When you select Customize, or if Jupyter is running already, you'll see the Jupyter Cloud Environment configuration page (screenshot below). Note that it has fewer options and is much simpler to adjust in Terra than the equivalent Google Cloud Platform interface!
    Jupyter-Cloud-Environment-pane_Screen_shot.png

    5. Select the R/Bioconductor option under the Application configuration menu within the Jupyter Cloud Environment pane:
    Screen_Shot_2021-07-18_at_11.22.01_PM.png

    6. Once you have finished any additional customizations, click the blue Create button to start or recreate your Jupyter Cloud Environment with the R/Bioconductor image.

    This environment will come preinstalled with all of the dependencies you'll need to load and use the AnVIL package:

    BiocManager::install("AnVIL")
    library("AnVIL")
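
If you re-run a notebook often, you may prefer to skip the install step when the package is already present. The following is a minimal sketch rather than part of the AnVIL package: ensure_bioc_package is a hypothetical helper name, and the sketch assumes the environment can reach the Bioconductor repositories whenever an install is actually needed.

```r
# Hypothetical helper (not provided by AnVIL): install a Bioconductor
# package only when it is not already available in this environment.
ensure_bioc_package <- function(pkg) {
  # requireNamespace() returns TRUE when the package can be loaded.
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # nothing had to be installed
  }
  # BiocManager itself comes from CRAN, so install it first if missing.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkg)
  invisible(TRUE)  # an install was attempted
}
```

In a notebook you would call ensure_bioc_package("AnVIL") once, followed by library(AnVIL). On the R/Bioconductor image the dependencies are already in place, so the install step should be quick either way.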

Useful R commands

Once you've loaded the AnVIL code library, you'll have access to a host of functions that you can use to interact with data referenced in tables in your workspace Data tab. See a curated list of operations and their commands below.

List all tables in the data tab of your workspace 

Use the function avtables()
Screen_Shot_2021-07-18_at_10.16.52_PM.png

View the contents of a specific table in your workspace

Use the function avtable(), passing the name of the table (as it appears in the Data tab) as the function's argument.

avtables() versus avtable()
Note the difference between the plural ("avtables"), the function that lists all the tables, and the singular ("avtable"), the function that shows the contents of one table.

Store table in a variable

You can even store the table in a variable in typical R fashion:
Screen_Shot_2021-07-18_at_10.26.55_PM.png

The avtable function returns a "tibble" (learn more about tibbles and how to use them here), which is a slightly revamped type of R data frame designed to facilitate cleaner code.
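
The list, view, and store steps above can be sketched as one small function. This is an illustration under stated assumptions: read_workspace_table is a hypothetical helper name, the code only works inside a Terra cloud environment with the AnVIL package installed, and it assumes the tibble returned by avtables() has a column called table holding the table names.

```r
# Hypothetical helper: confirm a table exists in the current workspace,
# then return its contents as a tibble.
read_workspace_table <- function(table_name) {
  # Plural: one row per table shown in the workspace Data tab.
  available <- AnVIL::avtables()
  # Assumption: table names are in the `table` column of that tibble.
  if (!table_name %in% available$table) {
    stop("No table named '", table_name, "' in this workspace")
  }
  # Singular: the rows of the named table.
  AnVIL::avtable(table_name)
}
```

For example, sample_set <- read_workspace_table("sample_set") would store the table in a variable, after which dim(sample_set) and colnames(sample_set) give a quick overview. The table name sample_set is illustrative; use a name that avtables() reports for your workspace.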

Extract columns from a table

To extract the data in a column, use the $ operator or double square brackets [[ ]]. See example below.
Screen_Shot_2021-07-18_at_11.54.23_PM.png

x$<column name>
The $ symbol extracts a column based on the name

x[[ <column number> ]] or x[[ <column name>]]
Square brackets can extract the column based either on the column number or on the column name (expressed as a string).
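
The three forms can be tried without a workspace. The sketch below uses a small stand-in data frame with made-up column names and values in place of a real table; a tibble returned by avtable() accepts the same operators.

```r
# Stand-in for a workspace table; the columns and values are invented.
x <- data.frame(
  name = c("sample_a", "sample_b"),
  cell_count = c(1200L, 950L),
  stringsAsFactors = FALSE
)

by_dollar   <- x$name        # column selected by name with $
by_name     <- x[["name"]]   # column selected by name given as a string
by_position <- x[[1]]        # column selected by position (first column)

# All three return the same character vector.
identical(by_dollar, by_name) && identical(by_name, by_position)
```

The string form x[["name"]] is the one to reach for when the column name is held in a variable, for example inside a loop over several columns.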

Access data tables from other workspaces

You can do this using the avworkspace() function, assuming you have the proper permissions.

Empty argument
When you first load the AnVIL library, running the avworkspace() function with no arguments prints the Billing Project and workspace name for the workspace where the notebook is running.

Argument includes Billing Project/workspace
However, if you run this function with an argument that includes a valid Billing Project/workspace to which you have access - for instance a featured workspace (or any public workspace) - you'll be able to run all of the functions described above on the tables contained in the Data tab of that workspace!

How do you know what Billing Project a public workspace belongs to?
The Billing Project is shown at the top of the screen when you are viewing the workspace, to the left of the workspace name:
2021-07-19_08-35-24.png
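
Calling avworkspace() with a "Billing Project/workspace" string changes the workspace that later calls use. If you would rather read one table from another workspace without switching, avtable() also accepts namespace (the Billing Project) and name (the workspace name) arguments. The sketch below is a hedged illustration: split_workspace and read_table_from are hypothetical helper names, and the AnVIL call only works inside a Terra cloud environment for a workspace you have permission to read.

```r
# Hypothetical helper: split "billing-project/workspace" into its two parts.
split_workspace <- function(workspace) {
  parts <- strsplit(workspace, "/", fixed = TRUE)[[1]]
  # Expect exactly two non-empty parts separated by a single slash.
  stopifnot(length(parts) == 2L, all(nzchar(parts)))
  list(namespace = parts[1], name = parts[2])
}

# Hypothetical helper: read one table from another workspace without
# changing the workspace used by the rest of the session.
read_table_from <- function(workspace, table_name) {
  ws <- split_workspace(workspace)
  AnVIL::avtable(table_name, namespace = ws$namespace, name = ws$name)
}
```

For instance, read_table_from("featured-workspaces-hca/HCA_Optimus_Pipeline", "sample_set") would return that public workspace's sample_set table, assuming it is still available under that name.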

Example: using the R package in a notebook

Below is an example of these commands from the HCA_Optimus_Pipeline public workspace.
Screen_Shot_2021-07-19_at_12.02.34_AM.png

For more detail, check out this video preview of using AnVIL in RStudio.

Comments

  • Comment author
    Tiffany Miller

    In the linked video, the following code is used:

    # Install and load packages
    BiocManager::install(c("AnVIL", "SingleCellExperiment", "LoomExperiment", "singleCellTK"))
    library(reticulate)
    library(AnVIL)
    library(SingleCellExperiment)
    library(LoomExperiment)
    library(singleCellTK)

    # Access workspace data tables
    avworkspace("featured-workspaces-hca/HCA_Optimus_Pipeline")
    avtables()
    sample_set <- avtable("sample_set")
    sample_data <- sample_set %>% dplyr::slice(3)
    View(sample_data)

    # Create a directory for the data and copy the files into the cloud environment
    sample_name <- sample_data %>% pull(name)
    sample_dir <- paste("data", sample_name, sep="/")
    # recursive = TRUE also creates the parent "data" directory if it is missing
    dir.create(sample_dir, recursive = TRUE)
    file_cols <- c("cell_metrics", "cell_calls", "gene_metrics", "matrix", "matrix_row_index", "matrix_col_index", "loom_output_file")
    for (col in file_cols) { gsutil_cp(sample_data %>% pull(col), sample_dir) }

    # Working with the loom file
    sce_loom <- LoomExperiment::import("data/pbmc_human_v3/pbmc_human_v3.loom", type="SingleCellLoomExperiment")
    sce_loom
    sce_optimus <- importOptimus(sample_dir, sample_name,
        matrixLocation = "sparse_counts.npz",
        colIndexLocation = "sparse_counts_col_index.npy",
        rowIndexLocation = "sparse_counts_row_index.npy",
        cellMetricsLocation = "merged-cell-metrics.csv.gz",
        geneMetricsLocation = "merged-gene-metrics.csv.gz",
        emptyDropsLocation = "empty_drops_result.csv"
    )
