Job Manager page chokes on large submissions
With many samples (>3000) across multiple tasks, including scatter-gather, the Job Manager chokes and returns an error when I click on one of my running jobs. The same happens with large finished jobs, which is a particular problem for failed jobs, where I need to inspect which task or shard caused the failure. In FireCloud I can open the corresponding job manager without trouble (it's a bit slow, but it works). Is there a solution?
Error:
[ASCII-art banner reading "Job Manager"]
Job Manager is running but encountered a problem getting data from its workflow server.
504: OK
Comments
10 comments
Hello Damian-
If you are able, can you share the name of your workspace and share the workspace so that we can look at the submissions that have caused this error? We can take a closer look.
Hi Sushma,
It's here: https://app.terra.bio/#workspaces/rjxmicrobiome/rjxmicrobiome/job_history/905ce8dc-20cc-42c4-84fb-acaf7585e142
Damian -
Thank you for sharing your workspace. After a bit of digging, we believe this is an issue that we are also encountering internally. When you saw the error you described above, do you recall whether the URL had the pattern <job-manager-url>/?jobs/undefined, with "undefined" being the keyword to look for?
It seems that with large submissions or an overload of the system, Job Manager shows this Server Error when the UUID for the workflow has yet to be generated but the View link is enabled.
For the time being, you should be able to see Job Manager if you wait a few minutes and refresh.
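As a quick way to tell the two cases apart, here is a minimal sketch that checks whether a Job Manager link ends in a real workflow UUID or in the placeholder "undefined". The function names are hypothetical and are not part of Job Manager or Terra; the sketch only inspects the URL text and makes no request to the server.

```python
import uuid
from urllib.parse import urlparse


def workflow_id_from_url(url: str) -> str:
    """Return the last segment of the URL path, or of the query string
    when the path is empty (as in '<job-manager-url>/?jobs/undefined')."""
    parsed = urlparse(url)
    path_tail = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    if path_tail:
        return path_tail
    return parsed.query.rstrip("/").rsplit("/", 1)[-1]


def has_valid_workflow_id(url: str) -> bool:
    """True if the trailing segment parses as a UUID, False otherwise
    (for example when it is the literal string 'undefined')."""
    candidate = workflow_id_from_url(url)
    try:
        uuid.UUID(candidate)
    except ValueError:
        return False
    return True
```

If `has_valid_workflow_id` returns False, the link was probably generated before the workflow UUID existed, and waiting and refreshing may help. If it returns True and the page still fails, the cause is likely something else, such as a timeout while loading a large job.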
Hi Sushma,
It's the following link that doesn't generate a report:
https://job-manager.dsde-prod.broadinstitute.org/jobs/1b7ee88f-c59d-4ba1-9d50-7c02b8bc074b
I tried refreshing a few times, still the same error.
Could your team double check? I can give you access if needed.
Damian
Damian -
Thanks for sharing the link. It looks like you shared the link to the workspace, but can you also add GROUP_FireCloud-Support@firecloud.org as a Writer to your workspace so we can look into this?
Sushma,
I added you through the Terra interface on the group management page, but I could only add you as a member or an admin (I selected the former). You are also in the rjxmicrobiome workspace as a writer.
Another issue that came up just now: in the Terra interface, when I try to share a workspace and start typing in the "User Email" field, I get:
Hi Damian -
We have fixed the sharing workspaces error and it should no longer be an issue!
Hey Damian,
I looked into this, and it seems that Job Manager loads much more information than FireCloud does, so the page takes longer to load and eventually times out. We are looking at ways to speed up the page load and have some suggestions. I should have an update in a few days.
Hi Ruchi,
Thank you! While you are at it, we are also seeing Job Manager fail to load in the opposite case, a job with only a single sample (https://job-manager.dsde-prod.broadinstitute.org/jobs/ad3cd0f9-326d-4344-9e36-9047833ac25e). Could you confirm what's wrong?
Damian
I submitted a large job that contained several thousand targets in a single scatter. The Job Manager fails to display any helpful status information when I try to explore it. Screenshot attached.