Moving data to/from a Google bucket (workspace or external)

Anton Kovalsky

Explore how to add data to, or download data from, a Google bucket (your workspace bucket or an external bucket). Choose the best approach based on how many files you are moving and how large they are, whether you are moving data to or from local storage, and how comfortable you are with the different options.

Upload and download using the Terra interface 

  • Recommended for small numbers (one to ten) of files
  • Only for transfers between the workspace bucket and local storage (e.g. laptop)

Note that this is the familiar sort of transfer you use when uploading or downloading a file from the internet. Because your local storage has no cloud-native "path", you can only transfer files stored on the machine running your browser.

You'll start from the workspace Data page. Click on the File icon (bottom of the left column) and follow the instructions below.

Upload files from local storage to the workspace bucket

1. Select the "Files" icon on the left side of the screen:
S10a_May31_2019.png

2. In the "Files" section, click on the "+" button in the lower right corner:
Data_tab_files_upload_Screen_Shot.png

Download files from the workspace bucket to local storage

1. Select the "Files" icon on the bottom of the left column (underneath "Other Data").

2. Find the file you want to download (note that you may have to navigate down many levels of file folders) to access the file you want:

3. Click on the file to download. This will open a popup window (screenshot below).

4. The popup window offers multiple choices for copying the data, as well as the cost. 

Moving-data_File-icon-Screen_Shot.png

Moving-data_File-details-modal.png

Download files from a workspace Data table to local storage

1. Click on one of the table tabs on the left side of the screen. The example below uses a tab labeled "sample".

2. Any files available for download will be shown as a link in the relevant sample row:

3. Clicking on a file link will open a pop-up window (below) describing the size and cost of the download. 
Files_to_download_Screen_Shot.png Download_costs_Screen_Shot.png

4. Clicking on the “Download for $0.56” link will initiate the download immediately; there is no further confirmation step before the download starts. However, you can cancel the download at any time during the process.

5. Repeat for any additional files you would like to download.

Upload and download data files in a terminal using gsutil

  • Works well for all size transfers
  • Ideal for large file sizes or 1000s of files
  • Can be used for transfers between local storage and a bucket, workspace VM or persistent disk and a Google bucket, as well as between Google buckets (external and workspace)


An intro to gsutil

  gsutil is a Python application that lets you access Cloud Storage from the command line in a terminal. The terminal you use can be run on your local machine (local instance) or built into the workspace Cloud Environment (workspace instance).


gsutil in a terminal - Step-by-step instructions

Depending on where your data are stored, you may need to run a particular instance of the terminal. If you are moving data to or from the workspace Cloud Environment (VM memory or Persistent Disk), use the workspace terminal instance. If you are moving data to or from local storage, use a local terminal instance.

Move-data_Storage-and-transfer-options_Diagram.png

Which instance should you use (plus step-by-step instructions)

VM/PD <-> Google bucket
Google bucket <-> Google bucket

Use the workspace terminal instance

1. Run in a workspace
    a. Spin up a Cloud Environment
    b. Start the built-in terminal

2. Run gsutil commands

Local storage <-> Google bucket
Google bucket <-> Google bucket

Use a local terminal instance

1. Run locally
    a. Open a terminal on your local machine
    b. Set up gsutil locally (if not already installed)

2. Run gsutil commands

For Google bucket to Google bucket transfers, you can use either instance.

1. Run in a Terra workspace - required to move data to/from workspace VM/PD 

Scroll to the top right corner of any workspace page to see these icons, which lead to one of the best-kept secrets of Terra: a command-line interface. Click on the (>_) icon (to the left of the play/pause button) and you'll be able to access what resembles a UNIX terminal.

Screenshot of runtime icon with terminal icon

If a Cloud Environment is not already running, you will need to start one first, since the terminal runs on that virtual machine.

From here, you can perform command-line tasks, including running gsutil commands.

1. Run in a local terminal - required to move to/from local storage

First, open a terminal on your local machine. Then follow Google’s installation instructions for the Cloud SDK, or follow the directions below to install the Google Cloud SDK, which includes gsutil:

1. Run the following command in your terminal (note that it is only supported in bash shells):
curl https://sdk.cloud.google.com | bash
Alternatively, download google-cloud-sdk.zip or google-cloud-sdk.tar.gz and unpack it.

2. Restart your shell: exec -l $SHELL or open a new bash shell in your Terminal.

3. Run gcloud init to authenticate.

Before uploading/downloading data using gsutil, use the ls command to look at the buckets you have access to:

  • Run gsutil ls to see all of the Cloud Storage buckets under the workspace's project ID.
    • Before running this command, be sure to set a default project using gcloud config set project [myProject].
  • Run gsutil ls -p [project name] to list buckets for a specific project.
    • You will need to have owner access to the project to run this command.
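To see how the two listing commands fit together, here is a minimal sketch. The project ID is a made-up placeholder, and the commands are only assembled into strings here; with the Cloud SDK installed you would run them directly:

```shell
# Placeholder project ID; substitute your workspace's Google project.
PROJECT="my-terra-project"

# Set the default project so a plain `gsutil ls` knows which buckets to list.
SET_PROJECT_CMD="gcloud config set project ${PROJECT}"

# Or list buckets for a specific project (requires Owner access on it).
LIST_CMD="gsutil ls -p ${PROJECT}"

echo "$SET_PROJECT_CMD"
echo "$LIST_CMD"
```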

2. Run gsutil commands

Once in a terminal (either on your local machine or in a Terra workspace), you can copy data from one place to another using the cp command:

gsutil cp where_to_copy_data_from/filename where_to_copy_data_to

Additional details on the gsutil cp command can be found here.

Note: you must be an Owner or Writer to upload to a Google bucket, including the workspace bucket!
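To make the template concrete, here is a sketch that fills it in for an upload from local storage to a workspace bucket. The local path and bucket name are made-up placeholders, and the command is only assembled into a string here; with gsutil installed you would run it directly:

```shell
# Hypothetical local file and workspace bucket (placeholders, not real).
LOCAL_FILE="/Users/me/Documents/example.bam"
WORKSPACE_BUCKET="gs://fc-00000000-1111-2222-3333-444444444444"

# Fill in the template:
#   gsutil cp where_to_copy_data_from/filename where_to_copy_data_to
UPLOAD_CMD="gsutil cp ${LOCAL_FILE} ${WORKSPACE_BUCKET}/"
echo "$UPLOAD_CMD"
```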

Create a manifest (log) file using the -L flag, which takes the name of a CSV log file as its argument:

gsutil cp -L manifest.csv where_to_copy_data_from/filename where_to_copy_data_to
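As a sketch (the log-file and bucket names below are placeholders), the manifest flag slots in before the source and destination. gsutil writes one CSV row per transferred object, and re-running the same command with the same log file skips objects already recorded as copied successfully, which is useful for resuming large transfers:

```shell
# Placeholder names for illustration only.
MANIFEST="transfer_manifest.csv"
SRC="gs://My_GCP_bucket/Example.bam"
DEST="gs://fc-00000000-1111-2222-3333-444444444444/gene_files/"

# Assemble the command; with gsutil installed you would run it directly.
CP_CMD="gsutil cp -L ${MANIFEST} ${SRC} ${DEST}"
echo "$CP_CMD"
```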

Example: Copy from external bucket to workspace bucket

To copy the file "Example.bam" from an external bucket "gs://My_GCP_bucket" into the "gene_files" folder in a workspace bucket "gs://fc-7ac2cfe6-4ac5-4a00-add1-c9b3c84a36b7":
gsutil cp gs://My_GCP_bucket/Example.bam gs://fc-7ac2cfe6-4ac5-4a00-add1-c9b3c84a36b7/gene_files/

Finding the full path to workspace bucket
In Terra, you can find the full path to the workspace bucket by clicking the Clipboard icon on the right side of the workspace Dashboard:

Moving-data_Google-bucket_Screen_Shot.png

Example: Download data from workspace bucket to local storage

Note that in order to do this, you must use gsutil in a terminal on your local machine.

To download data from a bucket, reverse the order of the bucket URL and local file path:

gsutil cp [bucket URL]/[file name] [local file path]

Make sure to leave a space between the bucket URL and the local file path:

gsutil cp gs://WorkspaceBucket/gene_files/example.bam /Users/Documents


Example: Download data from a requester pays bucket

To download data from a bucket that has requester-pays enabled, run the command this way:
gsutil -u [google-billing-project] cp [bucket URL]/[file name] [local file path]

To learn more about accessing files from a requester-pays enabled Google bucket, see this article.

Example: Downloading folders of data

If you're downloading folders, you'll need to use the -R flag to copy the folder and its contents:

gsutil cp -R gs://example-bucket/folder1 [local file path]

Example: Downloading data quickly using parallelization

1. If you're downloading folders or many files, you can use the -m flag to copy the files in parallel:

gsutil -m cp -R gs://example-bucket/folder1 [local file path]

More gsutil instructions for working with large data can be found here, and an explanation of -m can be found here.

2. It's also possible to maximize parallelization by configuring the thread count using -o. Try the following (an explanation can be found in this article):

gsutil -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=8' cp gs://[bucket URL]/[file name] [local file path]

File validation / checksum generation

Per Google: At the end of every upload or download the gsutil cp command validates that the checksum it computes for the source file/object matches the checksum the service computes. If the checksums do not match, gsutil will delete the corrupted object and print a warning message. This very rarely happens, but if it does, please contact gs-team@google.com.
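To see what that validation compares, here is a sketch of the local half of the check: computing an MD5 digest for a file, as gsutil does after a transfer before comparing it against the checksum Cloud Storage reports. The file and its contents are made up, and gsutil itself is not needed for this illustration:

```shell
# Create a small throwaway file to stand in for a downloaded object.
TMP_FILE=$(mktemp)
printf 'hello world\n' > "$TMP_FILE"

# Compute its MD5 digest; gsutil compares a digest like this against
# the checksum the Cloud Storage service reports for the object.
LOCAL_MD5=$(md5sum "$TMP_FILE" | cut -d' ' -f1)
echo "$LOCAL_MD5"

rm -f "$TMP_FILE"
```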

Troubleshooting

gcloud authorization error 

You may have trouble accessing your Terra workspaces if you have authorized your gcloud SDK installation with a Google Account that is not registered in Terra and applied to your workspace. You can verify which Google Account you’ve authorized with gcloud by running the following command: gcloud auth list

    1. If the Google ID returned matches the one on your Terra workspace, you should be able to access your workspace.  If not, please contact your Project Manager.
    2. If the Google ID returned does not match the one on your Terra workspace, run the following command to specify the correct account:
      gcloud auth login [Google account]
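The check above can be sketched as a small script. The account addresses are placeholders, and the output of `gcloud auth list` is mimicked with a variable rather than actually invoked:

```shell
# Pretend output of `gcloud auth list` (placeholder addresses).
ACTIVE_ACCOUNT="researcher@example.com"
TERRA_ACCOUNT="researcher@example.com"   # the account registered in Terra

# If the accounts differ, re-authenticate with the Terra-registered one.
if [ "$ACTIVE_ACCOUNT" = "$TERRA_ACCOUNT" ]; then
  NEXT_STEP="ok: accounts match"
else
  NEXT_STEP="gcloud auth login ${TERRA_ACCOUNT}"
fi
echo "$NEXT_STEP"
```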

gsutil errors on Unix

When working on a Unix system without a graphical browser, you will need to tell gcloud not to try to start a browser. It will then give you a URL you can paste into your desktop browser.

To tell the system not to start a browser, use the command

gcloud auth login --no-launch-browser

 


Comments

6 comments

  • xiao li

    I got this error message: ServiceException: 401 Anonymous caller does not have storage.objects.list access to <my bucket link> while I was following the instructions for gsutil uploading.

    I did authenticate my account and am able to see my buckets under the Google console. Is this related to the service account?

  • Anton Kovalsky

    Hi xiao li, thanks for posting your question. May I ask, are you positive that the email you used to register for Terra has the necessary access to the bucket(s) in question? If it is not a bucket you created, you may need to contact the bucket's owner to add permission for you.

    If you are certain that the permissions should all line up, you should submit a question to customer support (use the "contact us" link at the bottom of the main menu in the Terra interface) specifying the bucket and the email address you used to authenticate.

  • xiao li

    Hi Anton, the bucket was created by Terra when the workspace was created (I suppose it was created by the corresponding service account). I can see that bucket from my Google Cloud console, and I was able to upload my file directly from the Google Cloud console. I used `gcloud auth login` from my console, followed the instructions for authentication, and switched to the correct project ID. Is there something I might have missed doing this?

  • Jason Cerrato

    Hi xiao li, I would be happy to take a closer look at your case here. Can you create a new support request through the UI, or email support@terra.bio with details about which steps you are following in this article, as well as information about the workspace and bucket you are working with, and the email address you are using to authenticate? Please also share the workspace with GROUP_FireCloud-Support@firecloud.org if possible.

  • xiao li

    Sounds good! I will do that.

  • Dan Lu

    Thanks for the helpful tutorial!! Following the exact steps in the section "Set up gsutil in a local terminal" with a fresh gsutil 4.60 install:
    Running gsutil ls gave a "You are attempting to perform an operation that requires a project id, with none configured." error;
    Running gsutil ls -p [workspace project name] gave a "403. [email] does not have storage.buckets.list access to the Google Cloud project." error.
    Whereas gsutil ls -l gs://workspace_bucket_name would work.
    For anyone else testing their gsutil setup.
