Learn how to quickly and conveniently access regions of large genomic data files in Terra using GA4GH Data Repository Services API Uniform Resource Identifiers (DRS URI) and SAMtools - with the help of terra-notebook-utils. This article covers how to ensure you’re using the necessary software versions and then provides examples of using these tools together effectively
Introduction to SAMtools
SAMtools - a suite of programs for reading/writing/editing/indexing/viewing SAM/BAM/CRAM formats - provides powerful capabilities to work with genomic data files. SAMtools itself does not support interoperable DRS URIs as inputs. However, terra-notebook-utils (TNU) can “resolve” DRS URIs to signed URLs, which SAMtools does support as inputs. This lets you quickly and efficiently access regions of large genomic data files without having to download the whole file.
1. Verify/update SAMtools to the current version
SAMtools software version requirementsSAMtools is a command-line tool preinstalled in Terra Cloud Environments. Since SAMtools support for signed URLs as inputs was added relatively recently, it’s necessary to use a recent enough version of SAMtools. SAMtools 1.13 or later is required.
1.1 Check the current version of SAMtools
As of 11/14/2023, the current version of SAMtools is 1.18, and that version is used in the examples later in this document. To check the current version of SAMtools, run the following in a workspace Cloud Environment Terminal.
$ samtools --version
What to expect (output)
samtools 1.18
Using htslib 1.18
Copyright (C) 2023 Genome Research Ltd.
1.2. Update SAMtools in the Terra Cloud Environment (if needed)
You must use a startup script to update SAMtools because you need superuser/root access to run the commands. For step-by-step instructions, see Using a startup script to launch a pre-configured cloud environment.
Startup script file to update SAMtools to the current version
#! /usr/bin/env bash
apt update
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install "SAMtools>=1.13"
2. Check for/install current version of terra-notebooks-utils
You'll use terra-notebook-utils to resolve DRS URIs to signed URLs which may be used as inputs to SAMtools.
Caveats
- These features are available through both the command line and a Python API.
- These may be used both inside Terra (Cloud Environment Terminal and Notebooks) and outside of Terra (local systems, GCE instance, etc.).
- If you are working outside of Terra, you must use a Linux system to run terra-notebook-utils (TNU will not run in macOS or Microsoft Windows).
2.1. Check the current version of terra-notebook-utils
The current version of TNU works with Python 3.10 and later. It is installed using the pip command.
Check that the currently installed terra-notebook-utils (tnu) version is 0.13.0 or later.
$ tnu -–version
2.2. Install an updated version (if needed)
If the currently installed version is less than 0.13.0, update to the latest version in either the terminal or a Jupyter Notebook.
Example code (Terra Cloud Environment Jupyter Terminal)
$ pip install –upgrade –no-cache-dir terra-notebook-utils
Example code (Terra Cloud Environment Jupyter Notebook)
%pip install --upgrade --no-cache-dir terra-notebook-utils
Remember to restart the Jupyter kernel after running a pip install!!
3. Use SAMtools + signed URLs to interact with genomic files
Access regions of BAM/CRAMs with an index file (view) | View alignments interactively (tview)
Have patience when using SAMtools the first timeThe first execution of some SAMtools commands may take much longer while SAMtools downloads reference files. SAMtools caches these reference files, and subsequent executions of the command will run much faster.
Option 1: Access regions of BAM/CRAM files with index file (view)
The SAMtools view
command accesses regions of BAM/CRAM files using the corresponding index file.
view command caveats
- The examples below use the -X option, which allows the index file to be at a location different than the data file. This was added in SAMtools version 1.10.
- For additional information and options available, see the SAMtools-view man page.
Region Access Requires an Index File
Region access to large genomic data files (e.g., BAM/CRAM) file requires a corresponding index file (e.g., BAI/CRAI). Some data sources also provide the index file. If an index file is not available from the data source, it is possible to create one, although it requires reading the entire large genomic data file and is time-consuming.
Region access
The commands below use NHGRI AnVIL CRAM and CRAI files from the open-access AnVIL 1000 Genomes dataset. These same techniques may also be used with DRS URIs from other Terra-supported DRS data sources (e.g., NHLBI BioData Catalyst, NCI Cancer Research Data Commons (CRDC), and Kids First DRC).
- CRAM DRS URIs:
drs://drs.anv0:v2_a51f9287-fe9e-3349-93aa-db8f2d33fef2
- cram_signed_url: a variable that will hold the value of the cram signed URL (generated by tnu)
- CRAI DRS URIs:
drs://drs.anv0:v2_04ccfee7-1a46-3743-bbef-c76d14a92429
- crai_signed_url: a variable that will hold the value of the crai signed URL (generated by tnu)
These examples use the bash command line to illustrate region access. The first two commands run tnu
on the CRAM and CRAI file DRS URIs, respectively, and assign the signed URL (generated by tnu) to the variables. The last command runs samtools view
using the signed URLs as input. This same approach may be used in Python with the help of pysam
, although examples are not provided here.
View a region
$ cram_signed_url=$(tnu drs access drs://drs.anv0:v2_a51f9287-fe9e-3349-93aa-db8f2d33fef2)
$ crai_signed_url=$(tnu drs access drs://drs.anv0:v2_04ccfee7-1a46-3743-bbef-c76d14a92429)
$ samtools view -X "$cram_signed_url" "$crai_signed_url" chr3:10,000-11,000 | more
Extract a Region to a new BAM File
$ cram_signed_url=$(tnu drs access drs://drs.anv0:v2_a51f9287-fe9e-3349-93aa-db8f2d33fef2)
$ crai_signed_url=$(tnu drs access drs://drs.anv0:v2_04ccfee7-1a46-3743-bbef-c76d14a92429)
$ samtools view -X "$cram_signed_url" "$crai_signed_url" chr3:10,000-11,000 -bu --output sample_chr3_10kto11k.bam
Extract a Region to a new CRAM File
$ cram_signed_url=$(tnu drs access drs://drs.anv0:v2_a51f9287-fe9e-3349-93aa-db8f2d33fef2)
$ crai_signed_url=$(tnu drs access drs://drs.anv0:v2_04ccfee7-1a46-3743-bbef-c76d14a92429)
$ samtools view -X "$cram_signed_url" "$crai_signed_url" chr3:10,000-11,000 -u --output sample_chr3_10kto11k.cram
Option 2: View alignments interactively (tview command)
The samtools tview
command provides simple, interactive viewing of alignments and may be used in the Terra Cloud Environment Terminal.
tview command caveats
- The
tview
command does not support an option to have the index file in a different location. Instead, it assumes it is the current directory.
tview example
These examples use the bash command line to illustrate region access. The first two commands run tnu
on the CRAM and CRAI file DRS URIs, respectively, and assign the signed URL (generated by tnu) to the variables. The last command runs samtools tview
using the signed URLs as input. This same approach may be used in Python with the help of pysam
, although examples are not provided here.
Note that the index file must be stored in the local directory. Below we use getm
, a new tool installed with terra-notebook-utils, to quickly download the index file to the current directory, then run the tview
interactive viewer.
1. Use the following commands in a terminal. This first pair runs tnu
on the CRAM and CRAI file DRS URIs, respectively, and assign the signed URL (generated by tnu) to the variables.
$ cram_signed_url=$(tnu drs access drs://drs.anv0:v2_a51f9287-fe9e-3349-93aa-db8f2d33fef2)
$ crai_signed_url=$(tnu drs access drs://drs.anv0:v2_04ccfee7-1a46-3743-bbef-c76d14a92429)
2. Download the index file to the current directory.
$ getm "$crai_signed_url"
3. Run the tview interactive viewer on the CRAM signed URL.
$ samtools tview "$cram_signed_url"
4. In the viewer, press ?
for help.
+--------------------------------------+
| -=- Help -=-
| ? This window
| Arrows Small scroll movement
| h,j,k,l Small scroll movement
| H,J,K,L Large scroll movement
| ctrl-H Scroll 1k left
| ctrl-L Scroll 1k right
| space Scroll one screen
| backspace Scroll back one screen
| g Go to specific location
| m Color for mapping qual
| n Color for nucleotide
| b Color for base quality
| c Color for cs color
| z Color for cs qual
| . Toggle on/off dot view
| s Toggle on/off ref skip
| r Toggle on/off rd name
| N Turn on nt view
| C Turn on cs view
| i Toggle on/off ins
| v Inverse video
| q Exit
|
| Underline: Secondary or orphan
| Blue: 0-9 Green: 10-19
| Yellow: 20-29 White: >=30
+--------------------------------------+
5. Type the letter g
to check the alignment starting from a region in the format like chr3:10900 or =10900
when viewing the same reference sequence.
+--------------------------------------------------------+
|>chr1 length: 248956422 |
| |
| |
| |
| |
| |
+--------------------------------------------------------+
| Goto: chr3:10900
+--------------------------------------------------------+
What to expect
4. Type q
to exit. the interactive viewer and return to the terminal prompt.
Additional resources
For more information and options, see the SAMtools-tview man page.