gsutil tutorial

Yashasvika Duggal

Learn how to use gsutil to manage buckets and objects in Terra. gsutil is a Python command-line tool that lets you manage buckets and objects on Google Cloud Storage, which makes it useful for navigating around Terra. It ships with the Google Cloud CLI (gcloud), is fully open source on GitHub, and is under active development.

Overview

gsutil is a useful Python tool for navigating and managing Google Cloud Storage, which is where data related to workspaces is stored. gsutil lets users interact with Google Cloud from a terminal on their local machine or in a workspace. It can help with a wide range of tasks in Google Cloud, including the following:

  • Creating and deleting buckets.
  • Uploading, downloading, and deleting objects.
  • Listing buckets and objects.
  • Moving, copying, and renaming objects.

Setting up environment for gsutil

To use gsutil, you will need to either start a cloud environment in a Terra workspace and open its terminal, or use a terminal on your local machine.

1. Find the workspace bucket name

Finding workspace bucket name in your workspace dashboard

1.1. Click Cloud Information on the right-hand side of the workspace dashboard.

bucket_URI.png

1.2. Next to the Bucket Name is the gsutil address. 

Note: Syntax for accessing resources. gsutil uses the prefix gs:// to indicate a resource in Cloud Storage. To make the bucket name work as an address, add gs:// to the start of the bucket name. For example, gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6.
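The prefix rule above can be sketched as a small shell function. This is only an illustration: to_gs_uri is a hypothetical helper name, not part of gsutil, and the bucket name is the example from this tutorial.

```shell
# Hypothetical helper (not part of gsutil): turn a bucket name into a gs:// URI.
to_gs_uri() {
  case "$1" in
    gs://*) printf '%s\n' "$1" ;;        # already a URI, leave it unchanged
    *)      printf 'gs://%s\n' "$1" ;;   # bare bucket name, add the prefix
  esac
}

BUCKET="$(to_gs_uri fc-392080b2-a7b1-40c8-9550-5c971be3f7e6)"
echo "$BUCKET"
```

Because the function leaves an existing gs:// prefix alone, it is safe to call on a value copied from either the Bucket Name field or a full file URL.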

Finding the URI for individual files in workspace Data tab

1.3. Alternatively, click on the Data tab of a workspace.

1.4. Click the Files button in the bottom left. This will open a list of files in the workspace.

1.5. Double-click any file to open the file details, which include gsutil information.

Large_GIF__1494x586_.gif

Note: This text box contains the full gsutil cp command: the file URL followed by a period ("."), which stands for the current directory as the destination. To copy the file somewhere else, change directories first or replace the period with another path.

2. Find the Google Project ID

2.1. To find the Google Project ID of your workspace, click on Cloud Information in the workspace dashboard. 

project-to-bill.png

2.2. Copy your Google Project ID. This is the billing ID associated with the workspace.

3. Setting variables

Oftentimes, the code is cleaner and easier to work with if you set variables prior to running the commands.

For the gsutil commands in this tutorial, the two important variables to set are the "bucket", which is the gsutil URI, and the "project ID", which is the Google Project ID.

3.1. Set BUCKET and PROJECT_ID in the appropriate terminal (either in a Terra workspace or on your local machine).

BUCKET='gs://your-bucket-address'
PROJECT_ID='your_Google_project_ID'

For example:

BUCKET='gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6'
PROJECT_ID='terra-47b6f28c'

Note: R uses different syntax to set variables. In R (including RStudio), you assign a value to a variable with <-.
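Before running the commands later in this tutorial, it can help to confirm that both variables are set and that BUCKET carries the gs:// prefix. The following is a minimal sketch using the example values from above; check_vars is a hypothetical helper, not part of gsutil.

```shell
BUCKET='gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6'
PROJECT_ID='terra-47b6f28c'

# Hypothetical helper (not part of gsutil): fail early on missing or malformed values.
check_vars() {
  case "$BUCKET" in
    gs://?*) ;;                                              # looks like a Cloud Storage URI
    *) echo "BUCKET should start with gs://" >&2; return 1 ;;
  esac
  if [ -z "$PROJECT_ID" ]; then
    echo "PROJECT_ID is empty" >&2
    return 1
  fi
}

check_vars && echo "Using $BUCKET (billing project: $PROJECT_ID)"
```

The check only inspects the variables' shape. It does not contact Google Cloud, so it cannot tell whether the bucket or project actually exists.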

Practice workspace

Getting comfortable with gsutil can take some time and practice. We have provided the following workspace to practice moving around buckets and data within the workspace and local machine. 

If you ever need help while working with gsutil, you can type the following command into your terminal.

gsutil help

This will open up a list of all available commands as well as a brief description of their function. 

Reference this article for more help with using your local machine. 

Commands to be used in cloud environment 

There are a multitude of commands possible with gsutil. A comprehensive list can be found in the Google Cloud documentation. The commands most often used in the Terra Platform are described below.

  • Step 1. Open a terminal configured to run gsutil. This can be either in your Terra workspace or on your local machine.
    Step 2. List the files in your workspace bucket with the following command:
    gsutil ls $BUCKET
    set-and-list-BUCKET.png
    Note: In the section above, we set the BUCKET variable to the gsutil URI.
  • You can copy a file to your Google bucket using the copy command. This command works in both Python and R environments.

    Copying files in Python

    Step 1. To copy a file (ex: ubams.list) from your interactive analysis to your workspace Google bucket, open the terminal of your Terra cloud environment.
    Step 2. Run the following gsutil cp command:

    gsutil cp [file name] $BUCKET

    Copying files in RStudio

    When using R, you will need to adjust the code slightly to save and load R objects from the workspace bucket. 
    Step 1. Run the following command on your R terminal. 
    system('gsutil cp [file name] [destination]')
    Example:
    system('gsutil cp ubams.list gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6 2>&1', intern = TRUE)
    Step 2. Verify that the file has uploaded to the bucket by running:
    gsutil ls $BUCKET
  • Requester Pays is a Google Cloud bucket setting that lets dataset owners make data available without paying egress fees when someone reads the data or copies it to a different region. The requester is billed instead.
    You can learn how to set up "Requester Pays" for your own workspaces/buckets here.
    Step 1. Find the Project ID of the workspace you want to charge. This is the PROJECT_ID variable set above.
    Step 2. In the workspace notebook, in a workflow, or on the local command line, run the following command. If Requester Pays is turned on and you do not provide the -u flag, the command will fail.

     gsutil -u $PROJECT_ID cp <gs://path/to/file> <destination>

    requester-pays-gsutil.png
    A note of caution: When you use the -u flag, the project you name is charged for the egress, and it is likely different from the workspace you are egressing from. If you are mistaken about the bucket being a "Requester Pays" bucket and use this command anyway, the request still goes through, and your project may be charged for egress that the bucket owner would otherwise have paid.

  • You can save images and tables to your workspace bucket using gsutil.
    For this command you can use an image file you already have on your local machine or you can download and use the following image:
    cute_cat.png

    Step 1. Find the files you want to upload to your Google bucket. If you are using the above photo, download it to your local machine before moving on to step 2.
    Step 2. Run the following command in the terminal on your local machine. 
    gsutil cp [file-name.png] $BUCKET
    Note: To save all images within a folder use the wildcard *.png.
  • The gsutil setmeta command allows you to set or remove metadata on objects.

    For this command you can use an image file you already have on your local machine or you can download and use the following image:
    cute_cat.png
    Step 1: Find the file path for the image you want to upload to your Google bucket. If you are using the above photo, download to your local machine before moving onto step 2. 
    Step 2: Upload the image to your Google bucket using the gsutil cp command on your local terminal.

    gsutil cp [file path] $BUCKET

    Step 3: Move to your workspace terminal and enter the following command. When you have a large number of objects, use the gsutil -m option to perform the update in parallel:

    gsutil -m setmeta -h 'Content-Type:image/png' $BUCKET/[file name]
  • Downloading data using parallelization

    Step 1. You can use the -m flag to copy the files in parallel.

    gsutil -m cp -R $BUCKET [local file path]

    Step 2. It's also possible to maximize parallelization by configuring the thread count using -o.

    gsutil -o 'GSUtil:parallel_thread_count=1' \
      -o 'GSUtil:sliced_object_download_max_components=8' \
      cp gs://[bucket URL]/[file name] [local file path]
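Since several of the commands above copy data or bill a project, it can be useful to preview them before running them. The sketch below shows one way to do that. The run wrapper and the DRY_RUN variable are hypothetical names introduced here, not gsutil features, and ./downloads is an assumed local folder.

```shell
BUCKET='gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6'
PROJECT_ID='terra-47b6f28c'

# Hypothetical wrapper (not part of gsutil): print the command when DRY_RUN=1,
# otherwise execute it unchanged.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

DRY_RUN=1   # set to 0 once the printed commands look right

run gsutil ls "$BUCKET"
run gsutil cp ubams.list "$BUCKET"
run gsutil -u "$PROJECT_ID" cp "$BUCKET/ubams.list" .
run gsutil -m cp -R "$BUCKET" ./downloads
```

With DRY_RUN=1, nothing is sent to Google Cloud and each line is only echoed with its variables expanded. This makes it easy to check which project follows -u before any egress is charged. The printed form drops quoting, so treat it as a preview, not as text to paste back into the terminal.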

Installing gsutil to a local machine 

You will need to install gsutil on your local machine. The official installation and update method is through the Google Cloud CLI. Full instructions to install gsutil on your operating system can be found on Google Cloud's install gsutil page.

In order to install gsutil your system will need to meet the following requirements:

  • Linux/Unix, Mac OS, or Windows (XP or later).
  • Versions 5.0 and up require Python 3

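Step 1. Open a terminal on your local system.
Step 2. Run the following script to install the Google Cloud SDK. This assumes you have already downloaded and extracted the SDK archive and are running the script from the folder that contains the extracted google-cloud-sdk directory.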

./google-cloud-sdk/install.sh

Step 3. Run gcloud init to initialize the gcloud CLI

./google-cloud-sdk/bin/gcloud init

Step 4. After installing gsutil, it is very important to authenticate your installation using the following code:

gcloud auth login

Step 5. You will now be prompted to choose an account to continue to Google Cloud SDK.
authenticate-account-GoogleCloudSDK.png
Make sure the Google account you use is the same account associated with Terra, otherwise you will not have authorization to use some of the commands. 
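After the installation steps above, a quick check that both tools are reachable from your terminal can save some confusion later. This is a minimal sketch that uses only standard shell features. have_cmd is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

for tool in gcloud gsutil; do
  if have_cmd "$tool"; then
    echo "$tool found at $(command -v "$tool")"
  else
    echo "$tool not found on PATH; re-run the install step or open a new terminal"
  fi
done
```

If a tool is reported as missing right after installation, opening a new terminal window is often enough, because the installer may have updated your shell profile.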

Commands to be used in local terminal

    • gsutil is particularly well suited to large files or thousands of files. For smaller files, it is much easier to upload through the Data tab on the Terra Platform.
      Step 1. Run the following command in the terminal of your local machine.
      Note: You must be an Owner or Writer of the workspace to upload data to the workspace.

      gsutil cp [local file path] $BUCKET

      Example: To upload a file "Example.bam.tsv" from local machine into our workspace bucket:

      gsutil cp /Users/yduggal/Documents/Example.bam.tsv gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6

      copy-file-localmachine-to-bucket.png
      Note: If you want to copy all files in the directory, you can use the wildcard * instead of a specific file name.

      Step 2. Verify that the file has uploaded to the bucket by running:

      gsutil ls $BUCKET
    • Often you will need to download data from a bucket to your local machine.
      Step 1. Run the following command in your local terminal. It is the reverse of uploading from your local machine to the Terra Platform.

      gsutil cp $BUCKET/[file name] [local file path]

      Note: Make sure to leave a space before the [local file path].

      Example:

      gsutil cp gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6/ubams.list Users/Documents
      downloading_from-Terra-to-localmachine.png
      If you are downloading folders, use the -R flag to copy the folder and its contents:
      gsutil cp -R $BUCKET [local file path]
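When the destination of a download is an existing folder, the copied file keeps the object's name. You can therefore work out where a file will land before you run the copy. The sketch below illustrates this. local_path_for is a hypothetical helper (not part of gsutil), and the folder is taken from the upload example earlier in this article.

```shell
# Hypothetical helper (not part of gsutil): where will this object land locally?
local_path_for() {   # usage: local_path_for <gs-uri> <local-dir>
  # ${1##*/} keeps the text after the last slash (the file name);
  # ${2%/} drops one trailing slash from the folder, if present.
  printf '%s/%s\n' "${2%/}" "${1##*/}"
}

SRC='gs://fc-392080b2-a7b1-40c8-9550-5c971be3f7e6/ubams.list'
DEST_DIR='/Users/yduggal/Documents'   # example folder; change to your own

echo "After the download, look for: $(local_path_for "$SRC" "$DEST_DIR")"
```

Once the gsutil cp command has finished, listing that path with ls -l is a simple way to confirm the download arrived.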
