Strategy for cleaning up a workspace bucket
I have a workspace associated with a now-published manuscript, so I'm not producing new results, but I would like to keep all relevant workflow runs and data results around. I want to get rid of a bunch of old files (including years' worth of old workflow runs), but I have a complicated set of workflow runs, with arbitrary directory names in the Google Cloud bucket, that contain relevant data and need to stick around.
Is there an efficient strategy for searching through a workspace bucket to sort by time and/or size? I have enough runs in the workspace that straightforward strategies based on "gsutil ls" and postprocessing of that output end up failing or never finishing. Thanks!
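For concreteness, the kind of gsutil postprocessing I've been attempting looks roughly like this (the bucket name is a placeholder):

    # list every object with its size and creation time, then sort by size
    gsutil ls -l gs://my-workspace-bucket/** | sort -n -k1 > objects_by_size.txt
    # or sort by creation time instead (second column)
    gsutil ls -l gs://my-workspace-bucket/** | sort -k2 > objects_by_time.txt

These work fine on small test buckets, but on this workspace they fail or hang before completing.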
Comments
Hi Kenneth,
Thanks for writing in! You're correct that gsutil ls is the easiest way to look at all the files in a bucket, and it's the approach we normally recommend. However, it doesn't sound like that will work for you, so I'm going to investigate and see if I can find a better solution. One question in the meantime: are you running gsutil from within a Terra Cloud Environment, or from your computer's terminal using the gcloud SDK?
Please let me know if you have any questions.
Best,
Josh
Thanks, Josh! So far, I've been using gcloud from a terminal on my laptop.
Hi Kenneth,
Thanks for the reply! One alternative solution for reducing the size of a bucket would be to delete any intermediate files that still exist. You can read more about that here: Saving storage costs by deleting intermediate files. At the bottom of that document is a link to a notebook file that will help with deleting the files.
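If it helps to see the idea in command-line form, here is a rough sketch (not the notebook's exact logic) of listing everything under a single submission directory and removing whatever isn't a log file. The bucket name and submission directory are placeholders, and you'd want to confirm that nothing referenced in your data tables shows up in the removal list first:

    # preview everything under one submission directory except log/stderr/stdout files
    gsutil ls gs://my-workspace-bucket/SUBMISSION_ID/** \
      | grep -v -e '\.log$' -e '/stderr$' -e '/stdout$'
    # once the preview looks right, pipe the same list into rm
    gsutil ls gs://my-workspace-bucket/SUBMISSION_ID/** \
      | grep -v -e '\.log$' -e '/stderr$' -e '/stdout$' \
      | gsutil -m rm -I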
As for sorting through the existing files, you could go through the Files section of the Data tab in Terra or the bucket browser in the GCP console. These will let you view the files manually, but there aren't options to sort them by size.
As it turns out, the best way to do this is still with gsutil ls, so I have a few options you could try. The first would be to use the terminal of a Cloud Environment inside Terra to list and delete the files using gsutil. You may need to give the environment more resources (CPU and memory) than the default, so I would gradually try higher settings to see if you can find one that allows the command to finish without failing.
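For example, from the Cloud Environment's terminal you could write the listing to a file so that any partial output is preserved, then inspect it there (the bucket name is a placeholder):

    # stream the full listing to a file inside the Cloud Environment
    gsutil ls -l gs://my-workspace-bucket/** > all_objects.txt
    # show the 50 largest objects
    sort -rn -k1 all_objects.txt | head -50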
The second option would be to manually view the files and move or delete a few of them at a time until you can run the gsutil ls command from your laptop without issue.
Please let me know if that information was helpful or if you have any questions.
Best,
Josh
I do think the strategy of deleting intermediate files could get the job done (I'm fine getting rid of any workflow products other than logs and files that are actually tracked as results in the Data Model). I tried using the Remove_Workflow_Intermediates.ipynb notebook strategy to get this done, but I actually hit the same obstacle that I had encountered when trying to get this done "manually" -- I see
ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
during the portion of the notebook where "gsutil ls -l gs://bucket_name/**" is called.
Hi Kenneth,
Thanks for the reply! I know we're actively working to improve that notebook, but it seems it might be running into an issue specific to your workspace. Would it be possible to share your workspace with Support so that I can take a look?
Once we have access, I'll take a look as soon as I can!
Please let me know if you have any questions.
Best,
Josh
Workspace link: https://terra.biodatacatalyst.nhlbi.nih.gov/#workspaces/kw-bdc-strides-credits/TOPMed-gene-diet
Hi Kenneth,
Thanks for providing access! After looking over the workspace and the Google Bucket, I have a few suggestions that may be helpful.
First, you could try opening the workspace's bucket in the GCP console and deleting directories and files manually. You can get to the workspace bucket by clicking the "Open bucket in browser" link under Cloud Information on the workspace's dashboard. While this isn't the fastest way to remove old data, it can hopefully reduce the number of directories to the point where you can run gsutil ls without timeouts.
Second, you could try the gsutil du command, which shows the size of objects (files and directories) within a Google bucket. While running it on the entire bucket at once might hit the same timeout errors as gsutil ls, you could run it on one directory at a time, which will at least give you a sense of the size of child directories and files. You can find the names of the main directories by opening the workspace bucket in the GCP console as described above, and you can find more information on the gsutil du command here: https://cloud.google.com/storage/docs/gsutil/commands/du
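For instance, a per-directory summary might look something like this (the bucket name is a placeholder):

    # print a human-readable total size for each top-level directory, one at a time
    for dir in $(gsutil ls gs://my-workspace-bucket/); do
        gsutil du -s -h "$dir"
    done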
Third, you could start a Cloud Environment and run the gsutil ls or gsutil du commands from its terminal. You may get the same timeout errors at first, but you could try gradually adding more resources (CPU and memory) to the environment to help with this.
Please let me know if any of that information was helpful or if you have any questions.
Best,
Josh
Thanks for all of these suggestions! Ultimately, I was able to make progress by filtering the Job History page in a semi-targeted way and going one-by-one to the associated GCP UI pages to delete non-logfile subdirectories. Slow, but eventually got the overall bucket trimmed down enough to successfully use some of the gsutil-based approaches.
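In case it's useful to anyone finding this later, once the bucket was trimmed down the gsutil side of the cleanup looked roughly like this (the bucket name and submission directory are placeholders):

    # size of each top-level directory in bytes, smallest to largest
    gsutil du -s gs://my-workspace-bucket/* | sort -n
    # remove an entire submission directory that's no longer needed
    gsutil -m rm -r gs://my-workspace-bucket/SUBMISSION_ID/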
Hi Kenneth,
Thanks for the update! That's great news! I'm glad some of my suggestions were helpful.
Best,
Josh