Job Manager page chokes on large submissions

Post author
dplichta

With many samples (>3000) across multiple tasks, including scatter-gather, the Job Manager chokes and returns an error when I click on one of my running jobs. The same happens with large finished jobs, which is a particular problem for failed jobs, where I need to find out which task or shard failed. In FireCloud I can view the corresponding job manager page (it's a bit slow, but it works). Is there a solution?

Error:

[ASCII-art "Job Manager" banner]

 

Job Manager is running but encountered a problem getting data from its workflow server.

Click here to start over.

504: OK

 

Comments

10 comments

  • Comment author
    Sushma Chaluvadi

    Hello Damian-

    If you are able, can you share the name of your workspace and share the workspace so that we can look at the submissions that have caused this error? We can take a closer look.

    0
  • Comment author
    dplichta

    Hi Sushma,

     

    It's here: https://app.terra.bio/#workspaces/rjxmicrobiome/rjxmicrobiome/job_history/905ce8dc-20cc-42c4-84fb-acaf7585e142

    rjxmicrobiome/Map_Gene_Abundance_mhhOqI5a85M
     
    Damian
    0
  • Comment author
    Sushma Chaluvadi

    Damian -

    Thank you for sharing your workspace. After a bit of digging, we believe this is an issue that we are encountering internally as well. When you see the error you described above, do you happen to recall whether the URL had the pattern <job-manager-url>/?jobs/undefined, with "undefined" being the keyword to look for?

    It seems that with large submissions or an overload of the system, Job Manager shows this Server Error when the UUID for the workflow has yet to be generated but the View link is enabled. 

    For the time being, you should be able to see Job Manager if you wait a few minutes and refresh.

    0
  • Comment author
    dplichta

    Hi Sushma,

    It's the following link that doesn't generate a report:

    https://job-manager.dsde-prod.broadinstitute.org/jobs/1b7ee88f-c59d-4ba1-9d50-7c02b8bc074b

    I tried refreshing a few times, still the same error.

    Could your team double check? I can give you access if needed.

    Damian

    0
  • Comment author
    Sushma Chaluvadi

    Damian -

    Thanks for sharing the link. It looks like you shared the link to the workspace, but can you also add GROUP_FireCloud-Support@firecloud.org as a Writer to your workspace so we can look into this?

     

    0
  • Comment author
    dplichta

    Sushma,

    I added you through the Terra interface on the group management page, but I could only add you as a member or an admin (I selected the former). You are also in the rjxmicrobiome workspace as a writer.

    Another issue that came up now: in Terra interface when I try to share a workspace and start typing in the field "User Email", I get:

     

    0
  • Comment author
    Sushma Chaluvadi

    Hi Damian -

    We have fixed the sharing workspaces error and it should no longer be an issue!

    0
  • Comment author
    Ruchi Munshi

    Hey Damian,

    I looked into this a bit, and it seems that Job Manager loads much more information than FireCloud does, so the page takes longer to load and times out. We are looking at ways to speed up the page load and have some suggestions. I should have an update in a few days.

    0
  • Comment author
    dplichta

    Hi Ruchi,

    Thank you! While you are at it, we are also seeing Job Manager fail to load in the opposite case, a job with only a single sample (https://job-manager.dsde-prod.broadinstitute.org/jobs/ad3cd0f9-326d-4344-9e36-9047833ac25e). Could you confirm what's wrong?

    Damian

    0
  • Comment author
    Brian Haas

    I submitted a large job that contained several thousand targets in a single scatter. The Job Manager fails to display helpful status info when I try to explore it. Screenshot attached.

    0
