
Using Data from Google Bucket in R Jupyter Notebook or changing work directory


9 comments

  • Sushma Chaluvadi

    Hello,

    I haven't tested this myself just yet, but this Notebook has a section, "Option 2: Save file as tsv to your workspace bucket," that shows how to copy a file BACK to your workspace bucket. Perhaps try reversing the source and destination strings in:

    # copy tsv to workspace bucket; ws_bucket holds the gs:// path of the workspace bucket
    system(str_glue('gsutil cp traits_for_analysis.tsv {ws_bucket}/tsv-objects/ 2>&1'), intern = TRUE)
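
    Reversed, a minimal sketch of the bucket-to-notebook copy might look like this (the file and folder names are just the ones from the example above, not anything specific to your workspace):

    # copy the tsv from the workspace bucket back into the notebook's working directory
    system(str_glue('gsutil cp {ws_bucket}/tsv-objects/traits_for_analysis.tsv . 2>&1'), intern = TRUE)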

    Possibly useful!

  • Dan Spagnolo

    Doesn't seem to be doing what I need.

    It's not clear to me how to load data sitting in a Google Bucket into a Terra R notebook, e.g. reading gs://my-directory/file with something as simple as read.table(). This would be quite easy to do in a corresponding Python notebook.

    All the guides I am seeing in the Terra documentation reference BigQuery, which I am not using. I just have a standard Google Bucket.

  • Sushma Chaluvadi

    I'm not sure that the below still gets you exactly what you had in mind, but I was able to read a file named `inputs.yaml` from my Workspace bucket into the R kernel Notebook Runtime (renamed as inputs_R.yaml) with the following commands:

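    A minimal sketch of commands along those lines, assuming str_glue() is already available in the session and that the runtime exposes the workspace bucket path in the WORKSPACE_BUCKET environment variable:

    # gs:// path of the workspace bucket, as exposed on Terra notebook runtimes
    ws_bucket <- Sys.getenv('WORKSPACE_BUCKET')
    # copy inputs.yaml from the workspace bucket to the notebook's local disk, renaming it inputs_R.yaml
    system(str_glue('gsutil cp {ws_bucket}/inputs.yaml inputs_R.yaml 2>&1'), intern = TRUE)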

  • Dan Spagnolo

    Thanks! This looks like it is working, though I had to hard-code the bucket directory since I don't have str_glue() installed and wasn't sure what library it is part of.
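
    What I mean is something along these lines, with my real bucket path replaced by a placeholder and base R paste0() standing in for str_glue():

    # hard-coded workspace bucket path (placeholder)
    ws_bucket <- 'gs://my-workspace-bucket'
    # build the gsutil command with base R paste0() instead of str_glue()
    system(paste0('gsutil cp ', ws_bucket, '/inputs.yaml inputs_R.yaml 2>&1'), intern = TRUE)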

    So is this copying the file from the Google bucket to the virtual disk that is running the notebook? Is there a way to bypass doing so and just setwd() to a Google bucket? One of the files I need to use is quite large. In my Python notebook I just use the bucket directory as part of the file path, and it seems to work without anything being copied to the notebook directory.

    The file is also not in my Terra Google bucket, though I can copy it there if needed.

  • Sushma Chaluvadi

    I think library(readr) is the one you need for str_glue(); at least, that is the one that I had to import to get the str_glue() command to work without an error!


    Yes, effectively, these commands copy the file from the Terra workspace bucket into the Notebook runtime environment's persistent disk. At this time there isn't a way to bypass that, though I believe there is work planned to find a way to "mount" buckets to the runtime environments, which would remove the intermediate copying step.

    If your file is in a non-Terra Google bucket you *should* be able to copy it with the same commands. You do have to grant your Proxy Group (found in the "Profile" section from the upper left-hand "hamburger" menu) Storage Object Viewer permissions on your external bucket.
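
    For example, whoever administers the external bucket could grant that access with a one-line gsutil command, sketched here wrapped in system() like the other snippets (the proxy group address and bucket name are placeholders):

    # grant the Terra proxy group read access to objects in the external bucket
    # (must be run by someone who can change that bucket's IAM policy)
    system('gsutil iam ch group:PROXY_GROUP_ADDRESS@firecloud.org:objectViewer gs://my-external-bucket', intern = TRUE)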

    Re: "In my Python notebook I just use the bucket directory as part of the file path and it seems to work without anything copying to the notebook directory." I'm not quite sure what you mean by this, but if you can paste a line of code that might help. It will also allow the Terra support team to share ideas from their end!

  • Sushma Chaluvadi

    I was re-running these cells today and realized that I needed to add library(stringr) as well. Just a note in case there are errors when using the above code.

  • Jason Cerrato

    Hi Dan Spagnolo,

    Just wanted to check in here to see how things were going. Were Sushma's suggestions helpful in moving you forward?

    Kind regards,

    Jason

  • Dan Spagnolo

    Yes, Sushma's suggestions helped me greatly. Thanks Sushma Chaluvadi! I did have to use stringr as you mentioned, not readr.

    Is this the standard way to access data in an R notebook, Jason Cerrato?

    My confusion was that in a Python notebook, if my gsutil link is gs://my-directory and I set bucket = "gs://my-directory", then I could simply use open(bucket+my_file), or, in the case of importing files in Hail, import_bgen(bucket+my_file).

    R notebooks have the extra step of copying the needed files to the Notebook runtime environment's persistent disk. One of the files I need to use is 5 GB, and I got a warning that I should install and use crcmod with gsutil, but I was luckily able to download it without that install.
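
    In case that warning becomes a blocker for anyone else, a minimal sketch of installing crcmod from inside the notebook (assuming pip3 is on the runtime's PATH; the compiled extension may also need a C compiler on the image):

    # install the compiled crcmod module so gsutil can checksum large transfers quickly
    system('pip3 install --user -U crcmod 2>&1', intern = TRUE)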

  • Jason Cerrato

    Hi Dan,

    Sushma's suggestion is probably the best way to go about it at this time, as there isn't really a standard way of doing this in R. That said, our notebooks product manager has some plans to investigate ways we can create/support mechanisms for importing data down the road.

    Kind regards,

    Jason


