gsutil tutorial

Yashasvika Duggal

Learn how to use gsutil to manage buckets and objects in Terra. gsutil is a Python command line tool for managing buckets and objects on Google Cloud Storage. It is distributed with the Google Cloud SDK (gcloud), is fully open source on GitHub, and is under active development. Because Terra stores workspace data in Google buckets, understanding gsutil makes it easier to navigate around Terra.

Overview: gsutil in a nutshell

gsutil is a Python tool for navigating and managing Google Cloud Storage, which is where data related to workspaces is stored. It lets you interact with Google Cloud Storage from a terminal, either on your local machine or in a workspace.

Tasks in Google Cloud you can do with gsutil

  • Creating and deleting buckets.
  • Uploading, downloading, and deleting objects.
  • Listing buckets and objects.
  • Moving, copying, and renaming objects.

1. Install/open gsutil terminal

To use gsutil, you'll need to start a Jupyter Cloud Environment in a Terra workspace or open a terminal on your local machine.

Note that for some tasks you may need to use a particular instance of the terminal. For example, when moving files from local storage to the cloud, you need to use a local terminal instance. 

Step-by-step instructions to set up gsutil
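
Once a terminal is open, a quick check like the one below can confirm that gsutil is available before you go further. This is a minimal sketch, not part of the official setup instructions; the GSUTIL_STATUS variable is a hypothetical name used only for illustration.

```shell
# Check whether gsutil is on the PATH of this terminal.
if command -v gsutil >/dev/null 2>&1; then
  GSUTIL_STATUS="found"
  # Print the installed gsutil version.
  gsutil version
else
  GSUTIL_STATUS="missing"
  echo "gsutil not found on PATH; open a Terra terminal or install the Google Cloud SDK" >&2
fi

echo "gsutil status: $GSUTIL_STATUS"
```

If the status is "missing" on your local machine, follow the setup instructions linked above before continuing.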

2. Set environment variables

Commands are often cleaner and easier to work with if you set environment variables before running them. Bucket URIs and project IDs are long and not human-friendly, so storing them in variables helps avoid errors when executing gsutil commands.

Variables you will need for this tutorial

  • The "bucket" - the URI for workspace storage
  • The "project ID" - the workspace Google Project ID. 
  • 1. To find the workspace storage (i.e., Google bucket) ID, click on cloud information on the right hand side of the workspace dashboard. 

    bucket_URI.png

    2. Next to the Bucket Name is the gsutil address. 

    Syntax for accessing resources: gsutil uses the prefix gs:// to indicate a resource in Cloud Storage.

    To make the bucket name functional as an address, you will need to add gs:// to the start of the bucket name. For example, gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6.

  • 1. To find the Google Project ID of your workspace, click on Cloud Information in the workspace dashboard (right side).

    project-to-bill.png

    2. Copy your workspace Google Project ID. This is the billing ID associated with the workspace.

Run the following code to set the BUCKET and PROJECT_ID variables on the appropriate terminal (either in a Terra workspace or local machine).

BUCKET='gs://your-bucket-address'
PROJECT_ID='your_Google_project_ID'

Example: Setting environment variables (Python)

BUCKET='gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6'
PROJECT_ID='terra-47b6f28c'
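
If you copied only the bare bucket name from the dashboard, the gs:// prefix can be added when setting the variable. The sketch below assumes a shell terminal and uses the example values from this tutorial; BUCKET_NAME is a hypothetical helper variable, not something Terra defines.

```shell
# Hypothetical helper variable holding the name as copied from the dashboard.
BUCKET_NAME='fc-392080b2-a7b1-40c8-9550-5c971be3f7e6'

# Add the gs:// prefix only if it is not already there.
case "$BUCKET_NAME" in
  gs://*) BUCKET="$BUCKET_NAME" ;;
  *)      BUCKET="gs://$BUCKET_NAME" ;;
esac

# export makes the variables visible to commands started from this terminal.
export BUCKET
export PROJECT_ID='terra-47b6f28c'

echo "BUCKET=$BUCKET"
echo "PROJECT_ID=$PROJECT_ID"
```

Checking the echoed values before running any gsutil command is a cheap way to catch a missing or doubled prefix.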

R syntax: RStudio uses a different syntax to set variables. In R, you can assign a value to a variable using <- instead of =.

Built in gsutil help

If you ever need help while working with gsutil, you can run the following command in your terminal.

gsutil help

This will open up a list of all available commands as well as a brief description of their function. 

Reference this article for more help using your local machine. 

3. Run commands in cloud environment terminal

There are many possible gsutil commands. To practice the ones most used in Terra, follow the instructions below.

For a comprehensive list, see Google Cloud documentation

  • Step 1. Open a terminal configured to run gsutil. This can be either your Terra workspace or local machine.

    Step 2. List the files in your workspace bucket with the following command.

    gsutil ls $BUCKET

    set-and-list-BUCKET.png

    Using environment variables

    Remember that you set the BUCKET variable to be the gsutil URI in the above section.

  • You can copy a file to your Google bucket using the copy command. This command works in both Python and R environments.

    Copying files in Python

    Step 1. To copy a file (e.g., ubams.list) from your interactive analysis environment to your workspace Google bucket, run the following gsutil cp command.

    gsutil cp [file name] $BUCKET

    Copying files in RStudio

    When using R, you will need to adjust the code slightly to save and load R objects from the workspace bucket.

    Step 1. Run the following command on your R terminal.

    system('gsutil cp [file name] [destination]')

    Example code

    system('gsutil cp ubams.list gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6 2>&1', intern = TRUE)

    Step 2. Verify that the file has been uploaded to the bucket by running the following.

    gsutil ls $BUCKET
  • Requester pays is a useful setting in Google Cloud Storage (i.e. Google buckets) that allows dataset owners to make data available without incurring egress fees when someone reads or copies data from a different region.

    To learn more, see Requester Pays buckets.

    Step 1. Find the Project ID of the workspace you want to bill. This is the PROJECT_ID variable you set above.

    Step 2. In the workspace notebook, in a workflow, or on the local command line, run the following code.

     gsutil -u $PROJECT_ID cp <gs://path/to/file> <destination>

    requester-pays-gsutil.png

    Requester pays caveats: When you use the -u flag for egress, you are charging a workspace/project for the egress, which is likely different from the workspace you are egressing from. If you are mistaken about the bucket in question being a "Requester Pays" bucket and use this command, you may inadvertently charge the bucket owner for the egress.

    If Requester Pays is turned on and you do not provide the -u flag, the command will fail.
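
Because the -u flag decides who gets billed, it can help to print the billing project and the full command before running it. The following is a minimal sketch under that assumption; SRC, DEST, and CMD are hypothetical names, and the source path is the same placeholder used in the command above.

```shell
# Example project ID from this tutorial; replace with your own.
PROJECT_ID='terra-47b6f28c'
SRC='gs://path/to/file'   # placeholder source object
DEST='.'                  # current directory

# The -u flag needs a value, so stop early if the billing project is empty.
if [ -z "$PROJECT_ID" ]; then
  echo "PROJECT_ID is empty; set it before copying" >&2
else
  # Show the billing project and the exact command for review.
  echo "Egress will be billed to: $PROJECT_ID"
  CMD="gsutil -u $PROJECT_ID cp $SRC $DEST"
  echo "$CMD"
fi
```

The sketch only prints the command. Once the billing project looks right, run the printed line yourself.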

  • You can save images and tables into a file using gsutil.

    For this command, you can use an image file you already have on your local machine or you can download and use the following image:
    cute_cat.png

    Step 1. Find the files you want to upload to your Google bucket. If you're using the above photo, download to your local machine before moving onto step 2.

    Step 2. Run the following command in the terminal on your local machine.

    gsutil cp [file-name.png] $BUCKET

    Note: To save all images within a folder, use the wildcard *.png in place of the file name.
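
Before uploading with a wildcard, it can be useful to preview which files the pattern matches. The sketch below does this in a scratch directory with made-up file names (plot_a.png, plot_b.png, notes.txt); WORKDIR and MATCHES are hypothetical variable names, and the upload line is left commented out so nothing is sent by accident.

```shell
# Create a scratch directory with two hypothetical PNG files and one text file.
WORKDIR=$(mktemp -d)
touch "$WORKDIR/plot_a.png" "$WORKDIR/plot_b.png" "$WORKDIR/notes.txt"

# Preview which files the wildcard matches; notes.txt is not included.
MATCHES=$(cd "$WORKDIR" && ls ./*.png)
echo "$MATCHES"

# The upload itself would then be:
# gsutil cp "$WORKDIR"/*.png "$BUCKET"
```

In your own folder, replace the scratch directory with the folder that holds your images.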

  • The gsutil setmeta command allows you to set or remove metadata on objects.

    For this command you can use an image file you already have on your local machine or you can download and use the following image:
    cute_cat.png

    Step 1: Find the file path for the image you want to upload to your Google bucket. If you are using the above photo, download to your local machine before moving onto step 2.

    Step 2: Upload the image to your Google bucket using the gsutil cp command on your local terminal.

    gsutil cp [file path] $BUCKET

    Step 3: Move to your workspace terminal and enter the following command. When you have a large number of objects, use gsutil -m to perform a parallel update:

    gsutil -m setmeta -h 'Content-Type:image/png' $BUCKET/[file name]
  • Step 1. You can use the -m flag to copy files in parallel.

    gsutil -m cp -R $BUCKET [local file path]

    Step 2. To tune parallelization by configuring the thread count and the number of sliced download components, use the -o option.

    gsutil -o 'GSUtil:parallel_thread_count=1' \
      -o 'GSUtil:sliced_object_download_max_components=8' \
      cp gs://[bucket URL]/[file name] [local file path]

How to find the URI for individual files in workspace Data tab

1. Click on the workspace Data page.

2. Click the Files icon at the bottom left of the page. This will open a list of files in workspace storage (i.e., Google bucket).

3. Double click any file to open file details which include gsutil information. 

Demo walk-through

Large_GIF__1494x586_.gif

Copying to a different directory: This text box contains the full gsutil cp command, made up of the file URL followed by a period ("."), which stands for the current directory. To copy the file somewhere else, change directories before running the command or replace the period with another path.

4. Move local files (run commands in local terminal)

Transferring data to or from local storage always requires running gsutil in a local terminal instance.

  • gsutil is particularly well suited to moving large files or large numbers of files. For smaller files, it is much easier to upload through the Data tab in Terra.

    Step 1. Run the following command on the terminal of your local machine.

    Permissions requirements: You must be an Owner or Writer of the workspace to upload data to the workspace.

    gsutil cp [local file path] $BUCKET

    Example code

    To upload the file "Example.bam.tsv" from a local machine to the workspace bucket, run:

    gsutil cp /Users/yduggal/Documents/Example.bam.tsv gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6

    copy-file-localmachine-to-bucket.png
    Note: If you want to copy all files in the directory, you can use the wildcard * instead of a specific file name.

    Step 2. Verify that the file has been uploaded to the bucket by running the following.

    gsutil ls $BUCKET
  • Often you will need to download data from a bucket to your local machine.

    Step 1. Run the following code in your local terminal. This is the reverse of uploading from a local machine to Terra: the bucket path comes first and the local path second.

    gsutil cp $BUCKET/[file name] [local file path]

    Make sure to leave a space before entering the [local file path].

    gsutil cp gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6/ubams.list Users/Documents

    downloading_from-Terra-to-localmachine.png

    If you're downloading folders, you'll need to use the -R flag to copy the folder and its contents.

    gsutil cp -R $BUCKET [local file path]

Practice workspace

Getting comfortable with gsutil usually takes some time and practice. To practice moving data between buckets, the workspace, and your local machine, try the gsutil tutorial workspace.
