Bring your own data to Terra on Azure

Anton Kovalsky

If you're interested in using Terra on Azure, please email terra-enterprise@broadinstitute.org.

Learn how to use Blob Storage, Azure's equivalent of a Google bucket. Your workspace cloud storage is where you can upload unstructured data (such as CRAM, BAM, or TSV files) in one of three ways: 1) the Terra file manager (for uploads of less than 5 GB at a time), 2) the Microsoft Azure Storage Explorer app, or 3) the AzCopy command line (for uploads exceeding 5 GB).

What is Blob Storage?

Azure Blob (Binary Large OBject) Storage is Microsoft's cloud-based object storage solution. Blob Storage is optimized for storing massive amounts of unstructured data (data that doesn't adhere to a particular data model or definition, such as text or binary data). 

Blob Storage access

You'll use Shared Access Signature (SAS) tokens to access the Blobs in your storage container. The dashboard of your Terra on Azure workspace generates a SAS token (in the form of a Storage SAS URL) that you can use to read from and write to your storage blob.

Screen_Shot_2023-01-18_at_10.10.47_PM.png

SAS token caveats

  • The tokens available in your dashboard expire after eight hours.
  • You can use multiple valid tokens concurrently.
  • You should not share your SAS token with others.

Storage container URL format

https://{unique-identifier}.blob.core.windows.net/sc-{terra-workspace-ID}?{SAS-token-string}

The Storage SAS URL includes the storage container URL (before the ? symbol) with an appended temporary SAS token (everything after the ? symbol).

SAS URL example

https://{unique-identifier}.blob.core.windows.net/sc-{terra-workspace-ID}?sv=2021-06-08&spr=https&st=2022-11-17T20%3A07%3A28Z&se=2022-11-18T04%3A22%3A28Z&sr=c&sp=racwdl&sig=t7ny7DbPxaPVkwgvihnWdIOZkqtdl5djCIA%2BvaNoNY4%3D
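For reference, the query parameters in the appended token follow Azure's documented SAS conventions. The annotated breakdown below uses the example values above, which are illustrative only:

sv=2021-06-08    # storage service version used to authorize the request
spr=https        # allowed protocol (HTTPS only)
st=...           # start of the token's validity window (URL-encoded UTC timestamp)
se=...           # expiry time; the token stops working after this
sr=c             # signed resource type (c = container)
sp=racwdl        # granted permissions: read, add, create, write, delete, list
sig=...          # HMAC signature that authorizes the request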

Option 1: Upload using Terra’s File Manager (small files)

The file manager is designed as a user-friendly interface for the convenience of all users. It is best for small numbers of small files. For larger files (or larger numbers of files), we recommend the Microsoft Azure Storage Explorer app or, if you are comfortable with the CLI, the AzCopy command line interface.

This method supports uploads of up to 5 GB per file. Note that while it's possible, uploading files that are several GB can take quite a while. A 4 GB file could take ~50 minutes to upload, depending on your internet connection.

1. Select the folder icon in the right-side panel from anywhere in your Terra on Azure workspace.

Screen_Shot_2023-01-17_at_5.57.58_PM.png

2. Selecting the folder will open up a new screen where you can upload your files to your workspace’s Azure storage container.

Screen_Shot_2023-01-17_at_5.58.16_PM.png

3. Select Upload and choose a file to upload from your local machine.

4. After uploading your file, you should see it appear in the list under the Name column.

Screen_Shot_2023-01-17_at_5.58.26_PM.png

5. Click the link to open a pop-up window with additional details (such as the file size and Azure storage location) and an option to download the file to your local machine.

Popup example

Screen_Shot_2023-01-17_at_5.58.43_PM.png

Option 2: Microsoft Azure Storage Explorer App

2.1. Download and set up Microsoft Azure Storage Explorer locally.

No need to sign in!

2.2. Go to the Dashboard of your Terra on Azure workspace.

2.3. Click Cloud info (on the right-hand side).

2.4. Click copy to clipboard (file icon) to the right of Storage SAS URL.

SAS token caveats
  • SAS tokens currently expire after 8 hours.
  • You can have more than one valid SAS token for a storage blob at a time.
  • Don't share your SAS token with anyone; anyone with the token can access your workspace storage!

2.5. Go to your local Storage Explorer app.

2.6. Under Storage accounts, choose Attach a resource.

2.7. In the pop-up, select Blob container.

2.8. How will you connect? Select Shared access signature URL (SAS).

2.9. In the display name field, enter any name you like.

2.10. Paste the Storage SAS URL from your Terra on Azure Workspace in the Blob container SAS URL field. 

What the SAS URL includes
  • Your permissions on the blob (Read, Add, Create, Write, Delete, List)
  • Blob storage ID/"resource name"
  • SAS token

2.11. Select Connect.

What to expect/do next

A new pop-up should confirm that you successfully added the connection. You can now use the built-in file directory to transfer large files to your workspace cloud storage.

Option 3: Upload using AzCopy Command Line Interface (CLI)

The AzCopy command line interface (CLI) may be less comfortable if you have limited experience with terminal commands, but it is necessary when uploading more than 5 GB of data at a time.

1. Download AzCopy and set it up on your local machine.

2. Copy files from your local machine to your Azure blob storage using the command line interface (CLI).

azcopy copy [source] [destination] [flags]

The destination is your Storage SAS URL (remember the double quotes!): "https://[account].blob.core.windows.net/[container]/[path/to/blob]?[SAS]"

You can find this URL in your Workspace Dashboard under Cloud Information (copy the Storage SAS URL by clicking the copy icon). Remember that your SAS token expires after 8 hours.

Example command

azcopy copy "/Users/user/Downloads/SRR17259545/SRR17259545_1.fastq.gz" "https://sa226344b664da26ad6863.blob.core.windows.net/sc-226344b6-1f90-4754-ac2e-64da26ad6863?sv=2021-06-08&spr=https&st=2022-11-17T20%3A07%3A28Z&se=2022-11-18T04%3A22%3A28Z&sr=c&sp=racwdl&sig=t7ny7DbPxaPVkwgvihnWdIOZkqtdl5djCIA%2BvaNoNY4%3D" --put-md5
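To avoid quoting mistakes with the long URL, you can store the Storage SAS URL in a shell variable first. A minimal bash sketch (the variable name and file path are illustrative):

export SAS_URL="https://[account].blob.core.windows.net/[container]?[SAS]"   # paste from your workspace dashboard
azcopy copy "/path/to/local/file.fastq.gz" "$SAS_URL" --put-md5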

3. Once your upload is complete, you can check what is available in your storage blob with this command:

azcopy list "https://[account].blob.core.windows.net/[container]/[path/to/blob]?[SAS]"

What to expect

This outputs the filenames but does not include the full path URL. See the example below.

INFO: SRR17259545/SRR17259545_1.fastq.gz;  Content Length: 27.67 MiB

To reference this data in an analysis in your workspace, you will need to concatenate the storage container URL and the path to your file, without the SAS token.

For example
https://sa226344b664da26ad6863.blob.core.windows.net/sc-226344b6-1f90-4754-ac2e-64da26ad6863/SRR17259545/SRR17259545_1.fastq.gz

You can also get the path to the file by going to the Workspace File Manager shown in the section above.
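You can also build that path in the shell. Here is a minimal bash sketch (variable names are illustrative) that strips the SAS token from the Storage SAS URL and appends the blob path reported by azcopy list:

SAS_URL="https://[account].blob.core.windows.net/[container]?[SAS]"   # from the workspace dashboard
CONTAINER_URL="${SAS_URL%%\?*}"   # drop the '?' and everything after it (the SAS token)
echo "${CONTAINER_URL}/SRR17259545/SRR17259545_1.fastq.gz"   # full path to reference in an analysis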

Copy from a Google Bucket to an Azure storage container

You will use AzCopy to do this. Be careful: this operation will incur egress charges.

Step 1: Set up AzCopy on your local machine

For step-by-step instructions, see Microsoft documentation on AzCopy.

Step 2: Set up authentication with Google Cloud

2.1. Create a Service Account in Google Cloud Console. For more background details, see What is a service account (video)?

2.2. Select the Terra Billing project you will use. 

2.3. Create a secure key to use for authentication. This key will generally be downloaded locally in JSON format. Don’t share this key with anyone.

2.4. After you have a service account key, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the absolute path of the service account key file (citation).

Screen_Shot_2023-01-17_at_6.19.39_PM.png
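For example, in a bash shell (the key path here is illustrative):

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"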

Step 3: Find the file's Authenticated URL

3.1. In your Terra on Google workspace, use the file management system or data table to find the URL of the data you wish to copy to Azure. It will start with gs://.

3.2. Select View this file in the Google Cloud Storage Browser.

3.3. Click on the file name and copy the Authenticated URL. It will start with https:// and end with ?authuser=0.

Step 4: Use AzCopy to move data from Google Cloud to Azure

4.1. Log into your Terra on Azure Workspace where you would like your data to be placed.

4.2. On the dashboard of this workspace, under Cloud Information, copy the temporary SAS token associated with your workspace’s cloud storage.

4.3. Now that you can authenticate to both cloud storage containers, use azcopy in your local terminal to copy the data. Be mindful that this will incur egress charges: Azure generally charges $0.08 and Google charges $0.11 to egress 1 GB of data (citation).
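For example, copying a 100 GB dataset out of Google Cloud at $0.11 per GB would cost roughly $11 in egress fees.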

Command example

azcopy copy 'https://storage.cloud.google.com/<bucket-name>/<directory-name>/<filename>?authuser=0' 'https://<storage-account-name>.blob.core.windows.net/<container-name>/<directory-name>?<SAS>' --recursive=true

Note: You must use an authenticated URL from Google Cloud (e.g., starting with https://). A gs:// URL will not work.
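Once the copy finishes, you can confirm the transfer by listing the destination container, reusing the same Storage SAS URL from your workspace dashboard:

azcopy list "https://<storage-account-name>.blob.core.windows.net/<container-name>?<SAS>"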
