Cloud logging

Post author: Devin McCabe

I've always found it odd that although my Terra submissions are ostensibly running in Google Cloud, I can't monitor their progress in GCP Logs Explorer.

Instead, Terra creates an enormous number of redundant log files in GCS. On a recent submission, I found the same log line in four different files:

  • submissions/intermediates/<submission-id>/<method-name>/<workflow-id>/call-<task-name>/<task-name>.log
  • submissions/intermediates/<submission-id>/<method-name>/<workflow-id>/call-<task-name>/pipelines-logs/action/13/stdout
  • submissions/intermediates/<submission-id>/<method-name>/<workflow-id>/call-<task-name>/pipelines-logs/output
  • submissions/intermediates/<submission-id>/<method-name>/<workflow-id>/call-<task-name>/stderr

This is obviously wasteful from a storage perspective, and not just in terms of the data duplication.

Although there is no de jure per-blob cost (a zero-byte blob costs nothing merely to exist), the fees associated with blob operations, retrieval, Autoclass transitions, etc., constitute a de facto per-blob cost. I'm especially mindful of the sheer number of blobs in a Terra bucket when I try to list files, e.g. for automopping. A heavily used workspace can easily contain billions of blobs.
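
To make that concrete, here is a minimal sketch of what merely walking a workspace bucket costs, using the google-cloud-storage Python client. The bucket name is a placeholder, but the arithmetic isn't: every page of results is a separate billed list operation.

    # Hypothetical sketch: counting the list operations needed just to
    # enumerate a Terra workspace bucket. The bucket name is a placeholder.
    from google.cloud import storage

    client = storage.Client()
    bucket_name = "fc-00000000-0000-0000-0000-000000000000"  # placeholder

    # list_blobs paginates at up to 1,000 results per page, and each page
    # fetch is a billed Class A operation. A bucket with a billion blobs
    # therefore needs on the order of a million list calls just to scan.
    list_ops = 0
    for page in client.list_blobs(bucket_name, page_size=1000).pages:
        list_ops += 1
        for blob in page:
            pass  # e.g. decide whether an intermediate blob is safe to mop
    print(f"Issued {list_ops} list operations")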

My hot take is that the correct number of log files Terra generates during a submission should be zero. From a philosophical standpoint, logs aren't files; they're streams of text. If they were instead streamed to Cloud Logging, they could easily be tailed, rotated, monitored, aggregated, analyzed, exported, etc. With the current setup, you can't really do any of those things.
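
For the sake of argument, here is roughly what that would look like with the google-cloud-logging Python client. The log name and payload fields are hypothetical (Terra/Cromwell would have to emit these entries itself), but the point is that filtering, tailing, and export then come from the service rather than from bucket listings.

    # Hypothetical sketch: one structured Cloud Logging entry per log line,
    # instead of the same line duplicated across four GCS blobs. The log
    # name and field names here are made up for illustration.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client()
    logger = client.logger("terra-submissions")  # hypothetical log name

    logger.log_struct(
        {
            "message": "task started",
            "submission_id": "<submission-id>",
            "workflow_id": "<workflow-id>",
            "task": "<task-name>",
        },
        severity="INFO",
    )

    # Entries are then queryable server-side, with no bucket listing at all:
    log_filter = 'logName:"terra-submissions" AND jsonPayload.task="<task-name>"'
    for entry in client.list_entries(filter_=log_filter):
        print(entry.payload["message"])

Retention, rotation, and export to BigQuery or back to GCS would then fall out of Cloud Logging's sink and retention configuration instead of ad hoc blob management.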
