Cloud logging

Post author: Devin McCabe

I've always found it odd that although my Terra submissions are ostensibly running in Google Cloud, I can't monitor their progress in GCP Logs Explorer.

Instead, Terra creates an enormous number of redundant log files in GCS. On a recent submission, I found the same log line in four different files:

  • submissions/intermediates/<submission-id>/<method-name>/<workflow-id>/call-<task-name>/<task-name>.log
  • submissions/intermediates/<submission-id>/<method-name>/<workflow-id>/call-<task-name>/pipelines-logs/action/13/stdout
  • submissions/intermediates/<submission-id>/<method-name>/<workflow-id>/call-<task-name>/pipelines-logs/output
  • submissions/intermediates/<submission-id>/<method-name>/<workflow-id>/call-<task-name>/stderr

This is obviously wasteful from a storage perspective, and not just in terms of the data duplication.

Although there is no de jure per-blob cost (a zero-byte blob costs nothing merely to exist), the fees associated with blob operations, retrieval, Autoclass transitions, etc., constitute a de facto per-blob cost. I'm especially mindful of the sheer number of blobs in a Terra bucket when I try to list files, e.g. for automopping. A heavily used workspace can easily contain billions of blobs.
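
To make that concrete, here is a minimal sketch of what merely walking a workspace bucket costs, using the google-cloud-storage Python client. The bucket name is a placeholder, but the arithmetic isn't: every page of results is a separate billed list operation.

    # Hypothetical sketch: counting the list operations needed just to
    # enumerate a Terra workspace bucket. The bucket name is a placeholder.
    from google.cloud import storage

    client = storage.Client()
    bucket_name = "fc-00000000-0000-0000-0000-000000000000"  # placeholder

    # list_blobs paginates at up to 1,000 results per page, and each page
    # fetch is a billed Class A operation. A bucket with a billion blobs
    # therefore needs on the order of a million list calls just to scan.
    list_ops = 0
    for page in client.list_blobs(bucket_name, page_size=1000).pages:
        list_ops += 1
        for blob in page:
            pass  # e.g. decide whether an intermediate blob is safe to mop
    print(f"Issued {list_ops} list operations")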

My hot take is that the correct number of log files Terra generates during a submission should be zero. From a philosophical standpoint, logs aren't files; they're streams of text. If they were instead streamed to Cloud Logging, they could easily be tailed, rotated, monitored, aggregated, analyzed, exported, etc. With the current setup, you can't really do any of those things.
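
For the sake of argument, here is roughly what that would look like with the google-cloud-logging Python client. The log name and payload fields are hypothetical (Terra/Cromwell would have to emit these entries itself), but the point is that filtering, tailing, and export then come from the service rather than from bucket listings.

    # Hypothetical sketch: one structured Cloud Logging entry per log line,
    # instead of the same line duplicated across four GCS blobs. The log
    # name and field names here are made up for illustration.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client()
    logger = client.logger("terra-submissions")  # hypothetical log name

    logger.log_struct(
        {
            "message": "task started",
            "submission_id": "<submission-id>",
            "workflow_id": "<workflow-id>",
            "task": "<task-name>",
        },
        severity="INFO",
    )

    # Entries are then queryable server-side, with no bucket listing at all:
    log_filter = 'logName:"terra-submissions" AND jsonPayload.task="<task-name>"'
    for entry in client.list_entries(filter_=log_filter):
        print(entry.payload["message"])

Retention, rotation, and export to BigQuery or back to GCS would then fall out of Cloud Logging's sink and retention configuration instead of ad hoc blob management.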
