Saving data from an interactive analysis to workspace storage

Anton Kovalsky
  • Updated

Saving data from an interactive cloud environment (such as an instance of Galaxy, Jupyter notebooks, or RStudio) is a useful trick in some situations. If you're worried about losing work that you've done in an interactive environment because you need to delete or modify your persistent disk (PD), you can use "gsutil" to copy it to your workspace bucket. Below, you can find step-by-step instructions for doing this from either Jupyter or RStudio.

Be careful not to lose data when reducing disk size!

Important reminder: Reducing the size of your persistent disk (PD) mid-analysis can lead to loss of data stored on that disk. Before you reduce your PD size, it's a good idea to save your data first.

To learn more, see the documentation on Detachable Persistent Disks.

Why copy generated data to workspace storage?

Below are the primary reasons you might want to copy data generated in a notebook analysis to workspace storage (or an external Google bucket).

Use generated data as input for a workflow

Files generated by a notebook are not automatically saved in workspace storage (Google bucket) and are not accessible outside your personal virtual Jupyter Cloud Environment.

Share generated data with collaborators - even in a shared workspace

For the same reason, you will need to copy data to workspace storage if you want colleagues to have access. This is true even if you are working in a shared workspace, since each user has their own Cloud Environment and Persistent Disk that is inaccessible to anyone else.

Archive data

If you want to archive data, especially if you want to copy it to less expensive Nearline or Coldline storage, you will first need to copy it to an external bucket. 

To safeguard data when re-creating or deleting the PD

There are times when you may need to reconfigure your Cloud Environment (if you are moving between a notebook and RStudio analysis, for example) or delete your PD. In some cases, you can lose all or some generated data unless you explicitly save your output to workspace or external storage (i.e. Google bucket). As an example, if you want to decrease your PD (because you overestimated how much you would need and don't want to continue to pay for unused space), you would want to back up data before decreasing the disk size, in case the part of the disk that is deleted includes some generated data. 

Notebook (i.e. .ipynb) files are autosaved to workspace storage

When working in a Jupyter notebook on Terra, your notebook is regularly auto-saved to your workspace bucket, so normally you don't need to worry about saving changes to the notebook itself (i.e. code or documentation cells). Outputs that are displayed in the notebook itself (plots, for example) are autosaved too. However, output files (matrices, for example) that aren't displayed in the notebook are saved to the PD, not to workspace storage.

Use notebook code to copy data to workspace storage

You can explicitly save generated outputs to permanent cloud storage (the workspace bucket) using code in the notebook itself by following the directions below.

Step 1. Set environment variables in a Jupyter Notebook

Setting the environment variables lets the notebook grab variables such as the workspace name and Google bucket directly. This makes cleaner and more flexible notebooks that don't require you to hardcode these variables in.

Run the commands below in a code cell:

  • Python

    import os

    BILLING_PROJECT_ID = os.environ['WORKSPACE_NAMESPACE']
    WORKSPACE = os.environ['WORKSPACE_NAME']
    bucket = os.environ['WORKSPACE_BUCKET']
  • R

    project <- Sys.getenv('WORKSPACE_NAMESPACE')
    workspace <- Sys.getenv('WORKSPACE_NAME')
    bucket <- Sys.getenv('WORKSPACE_BUCKET')
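If one of these variables is not set (for example, when the notebook is run outside Terra), `os.environ[...]` fails with a bare KeyError. The following is a minimal sketch of a more readable check, not Terra's own code: `read_workspace_env` is a hypothetical helper, and the values in the example call are made up so the cell runs anywhere.

```python
import os

# Variable names taken from the commands above.
REQUIRED_VARS = ("WORKSPACE_NAMESPACE", "WORKSPACE_NAME", "WORKSPACE_BUCKET")


def read_workspace_env(environ=os.environ):
    """Return the three workspace settings, or raise a readable error.

    Hypothetical helper for illustration; it is not part of Terra.
    """
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Are you running inside a Terra Cloud Environment?"
        )
    return {
        "billing_project": environ["WORKSPACE_NAMESPACE"],
        "workspace": environ["WORKSPACE_NAME"],
        "bucket": environ["WORKSPACE_BUCKET"],
    }


# Example with made-up values, so this also runs outside Terra.
example = read_workspace_env({
    "WORKSPACE_NAMESPACE": "my-billing-project",
    "WORKSPACE_NAME": "my-workspace",
    "WORKSPACE_BUCKET": "gs://fc-example-bucket",
})
print(example["bucket"])
```

Inside a running Cloud Environment, calling `read_workspace_env()` with no argument reads the real environment.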

Step 2. Save output files to a bucket with bash commands

The workspace storage is a Google bucket, so basic bash commands in notebooks need to be preceded by "gsutil."

These commands will only work if you have run the commands above to set the environment variables. Once you execute the code below, the data files should be visible in the workspace bucket.

To save all files, use the commands below. If you want to copy individual files, you can replace `*` with the file name to copy.

  • Python

    # Copy all files generated in the notebook into the bucket
    !gsutil cp ./* $bucket

    # Run list command to see if file is in the bucket
    !gsutil ls $bucket
  • R

    # Copy all files generated in the notebook into the bucket
    system(paste0("gsutil cp ./* ",bucket),intern=TRUE)

    # Run list command to see if file is in the bucket
    system(paste0("gsutil ls ",bucket),intern=TRUE)
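If you prefer plain Python over the `!` shell shortcut, the same copy can be assembled as an argument list and passed to `subprocess`. This is a sketch under stated assumptions: `build_copy_command` and `copy_to_bucket` are hypothetical helper names, the optional `subfolder` is an illustration, and the command is only printed unless you turn off `dry_run`.

```python
import glob
import os
import subprocess


def build_copy_command(pattern, bucket, subfolder=""):
    """Build a `gsutil cp` argument list for the files matching `pattern`.

    Hypothetical helper: only regular files are included, directories are skipped.
    """
    files = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if not files:
        raise FileNotFoundError(f"No files match {pattern!r}")
    # A trailing slash makes gsutil treat the destination as a folder.
    destination = bucket.rstrip("/") + "/"
    if subfolder:
        destination += subfolder.strip("/") + "/"
    return ["gsutil", "cp", *files, destination]


def copy_to_bucket(pattern, bucket, subfolder="", dry_run=True):
    """Print the command; run it only when dry_run is False."""
    command = build_copy_command(pattern, bucket, subfolder)
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)  # requires gsutil on the PATH
    return command
```

For example, `copy_to_bucket("./*.csv", bucket, subfolder="outputs")` prints the command that would copy only the CSV files. Add `dry_run=False` once the printed command looks right.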

Use gsutil to copy data from the PD to workspace storage

Below are instructions for saving files or folders from your Cloud Environment storage to your workspace bucket.

  • Step 1: Find the files in the PD

    1.1. Your Cloud Environment comes with its own storage, which you can access with command line tools by clicking on the terminal icon in the Cloud Environment sidebar (as long as your Jupyter Cloud Environment is running).

    This opens a command line terminal directly to your cloud environment virtual machine.

    Terminal-in-sidebar_Screen_shot.png

    When you first open this terminal, you're in your PD /home directory. If you use the ls command to list the contents of this directory, you'll notice that the files do NOT necessarily correspond to the notebooks listed in your notebooks tab or the "notebooks" folder in your workspace bucket files (see these three examples below).

    1.2. List files in "notebooks" folder (in Terminal) using the ls command:
    Screen_Shot_2022-02-11_at_9.44.08_AM.png

    Compare to what's listed in the Analyses or Data tabs (workspace storage files) of the workspace:

    Files in Analyses tab:
    Notebook-files-in-PD_Analyses-tab-view_Screen_shot.png

    Files in workspace storage:
    Notebook-files-in-PD_Data-tab-view_Screen_shot.png

    Why are the notebooks different in different places? The .ipynb files of the notebooks in your workspace exist before you've launched a Jupyter Cloud Environment. That's what you see stored in workspace storage and listed in the Analyses tab.

    However, launching Jupyter does not automatically bring the notebooks into your Cloud Environment. The .ipynb files aren't copied to the PD until you open your notebook - either for editing, or in playground mode.

    1.3. When you open any notebook (for example, in "Edit" mode), Terra creates a new folder in the /home directory of your Jupyter Cloud Environment. The folder is named after your workspace.

    Screen_Shot_2022-02-11_at_10.50.32_AM.png

    Notebook exists (with workspace name):
    Screen_Shot_2022-02-11_at_10.48.23_AM.png

    1.4. If you list the contents of this new directory, you'll find a folder named /edit. This subdirectory contains copies of all of the .ipynb files in your workspace, and these files include whatever edits you've saved to those notebooks during your current interactive session.

    Screen_Shot_2022-02-11_at_10.52.53_AM.png

    Persistent Disk file structure

    It's useful to understand the file structure of your Cloud Environment and workspace bucket storage, and how to transfer things between the locations. This is especially true if you're generating output files from both workflows and interactive analyses (Galaxy, Jupyter, and RStudio) as they are stored by default in different locations.
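The folder layout from step 1.4 can also be inspected from Python instead of `ls`. The sketch below assumes the layout described above (a folder named after the workspace, containing an "edit" subfolder); `list_edited_notebooks` is a hypothetical helper, and the home directory to pass in depends on your environment.

```python
from pathlib import Path


def list_edited_notebooks(home_dir, workspace_name):
    """Return the .ipynb files under <home_dir>/<workspace_name>/edit, sorted by name."""
    edit_dir = Path(home_dir) / workspace_name / "edit"
    if not edit_dir.is_dir():
        return []  # no notebook has been opened in this environment yet
    return sorted(edit_dir.glob("*.ipynb"))
```

A call such as `list_edited_notebooks(Path.home(), "my-workspace")` returns an empty list until at least one notebook has been opened. This matches the behavior described in step 1.3.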

Step 2: Copy files (gsutil commands)

2.1. Use gsutil to copy any of these files (or even the entire folder containing all of the files) to your workspace storage. Copy the address of your workspace bucket from the dashboard and use it as the destination for the copy.

  • To copy a single file, use the following command.

    gsutil cp [file name] gs://[workspace bucket address]
    HINT: You can find the Workspace bucket address in the right side of the Dashboard page.
    Screen_Shot_2021-09-10_at_7.24.34_AM.png
  • To copy a folder, add the -r argument to the cp command so that all of its contents are copied recursively.

    gsutil cp -r [folder name] gs://[workspace bucket address]

    HINT: You can find the Workspace bucket address in the Cloud Information section on the right side of the Dashboard page.
    Workspace-Cloud-Information_Screen_shot.png

Expected output (screenshot below)

Screen_Shot_2021-09-10_at_7.40.05_AM.png
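A common slip when pasting the bucket address is doubling or dropping the gs:// prefix. Here is a small sketch that normalizes the address and builds either form of the command (with or without -r). The function names, file names, and bucket name are illustrative assumptions, and nothing is executed.

```python
def bucket_url(address):
    """Return `address` as a gs:// URL without a trailing slash."""
    name = address.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.strip("/")
    if not name:
        raise ValueError("Bucket address is empty")
    return "gs://" + name


def copy_command(source, address, recursive=False):
    """Build the gsutil cp argument list; add -r when copying a folder."""
    command = ["gsutil", "cp"]
    if recursive:
        command.append("-r")
    command += [source, bucket_url(address) + "/"]
    return command


# Made-up file, folder, and bucket names for illustration.
print(copy_command("results.tsv", "fc-example-bucket"))
print(copy_command("my-workspace/edit", "gs://fc-example-bucket/", recursive=True))
```

Both calls produce the same destination, `gs://fc-example-bucket/`, whether or not the pasted address already included the prefix.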

2.2. The files you've copied are now safely in workspace storage, regardless of what you do with your Jupyter Cloud Environment or persistent disk! You can find and download them either by navigating to Files in the Data tab, or by browsing workspace storage (Google bucket) directly. See the two options below.

  • Notebook-copied-to-workspace-storage_In-Data-tab_Files_Screen_shot.png
  • Step 1: Open browser (in Dashboard)
    Cloud-Information_Open-bucket-in-browser_Screen_shot.png

    Step 2: Find files (GCP console)
    Notebook-copied-to-workspace-bucket_Screen_shot.png

How to copy RStudio data to workspace storage

To move generated data to permanent cloud storage, follow the directions below. Note that this can be workspace storage or an external Google bucket. 

Step 1. Work in the built-in RStudio terminal

You can access a bash terminal from the Terminal tab in the main RStudio pane:
RStudio-terminal-function_Screen_shot.png

Step 2. Set the variable "bucket" for the destination storage

Setting a variable lets you copy and paste the commands from this documentation unchanged.

WORKSPACE_BUCKET is an environment variable that is pre-defined when you use the terminal in Terra. Using environment variables lets RStudio grab the workspace Google bucket directly, so you don't have to hardcode the bucket into the commands that move the data. Use the syntax below:

To use the workspace bucket for storage, run the command bucket="$WORKSPACE_BUCKET".

To save data to an external Google bucket, run the command bucket="gs://<your-bucket-name>".

Step 3. Save files to "bucket" with bash commands 

Note: workspace storage is a Google bucket, so basic bash commands in the RStudio terminal need to be preceded by "gsutil."

To copy all files in the current directory into the bucket, use the command:
gsutil cp * "$bucket"

To make sure the files are in the bucket, you can run the following:
gsutil ls "$bucket"

Be careful when copying all files

Using `*` can mean copying a lot of large files, which can be expensive. Be sure to check the size of the files in the bucket after copying! If you want to copy individual files, you can replace `*` with the file name to copy.
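One way to avoid surprises is to total up what `*` would match before copying anything. The sketch below only reads local file sizes. `summarize_files` and `human_readable` are hypothetical helpers, and like the copy command above, the check skips subfolders.

```python
from pathlib import Path


def summarize_files(directory=".", pattern="*"):
    """Return (file_count, total_bytes) for regular files matching `pattern`."""
    files = [p for p in Path(directory).glob(pattern) if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def human_readable(num_bytes):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


count, total = summarize_files(".")
print(f"{count} files, {human_readable(total)} would be copied")
```

If the total is larger than expected, narrow the pattern (for example `summarize_files(".", "*.csv")`) and copy only those files.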

Additional resources

To learn more about your workspace Cloud Environment storage, see Detachable Persistent Disks.

For additional bash capabilities, see Using the terminal and interactive shell in Terra.

A deeper dive: Terra's Cloud Environment

To understand what's under the hood and why RStudio and notebooks have these characteristics, see this article about key notebook components or this article about key notebook operations.
