Common issue/error "cheat sheet"

Lin Betancourt
  • Updated

Learn how to identify and solve some common errors you may experience in Terra. For more troubleshooting tips, see How to troubleshoot failed workflows.

Web Portal: Internal Server Error or Gateway Timeout

What it means

Terra may be temporarily down.

What to do

Please wait five minutes and refresh your browser page. If you still see this message, please let us know through the Terra Community Forum.

VM preemption (50001)

Issue in the statusEvents field for a job

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to Spot Preemption with exit code 50001.

What happened

This issue occurs when a Spot VM for the job is preempted during run time.

Solution

To resolve the issue, do one of the following:

  • Retry the task either by using automated task retries or manually re-running the job.
  • To guarantee there is no preemption, use VMs with the standard provisioning model instead.
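Both options above are set in the job configuration. Terra normally submits jobs on your behalf, so the following is only a minimal sketch for the case where you build a Batch job configuration yourself. The helper name build_job_config and its parameters are hypothetical; the field names (maxRetryCount, provisioningModel) are the ones used in the Batch job JSON.

```python
import json


def build_job_config(script_text, max_retry_count=3, use_spot=True):
    """Return a Batch-style job config dict with retries and a provisioning model.

    Hypothetical helper for illustration only. It sets just the two options
    discussed above and omits everything else a real job would need.
    """
    # Batch accepts 0 to 10 retries per task; 0 means a task is never retried.
    if not 0 <= max_retry_count <= 10:
        raise ValueError("max_retry_count must be between 0 and 10")
    return {
        "taskGroups": [
            {
                "taskSpec": {
                    "runnables": [{"script": {"text": script_text}}],
                    # Automated task retries: a failed task is attempted again.
                    "maxRetryCount": max_retry_count,
                }
            }
        ],
        "allocationPolicy": {
            "instances": [
                # STANDARD VMs are not preempted; SPOT VMs can be.
                {"policy": {"provisioningModel": "SPOT" if use_spot else "STANDARD"}}
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_job_config("echo hello", use_spot=False), indent=2))
```

Passing use_spot=False selects the standard provisioning model, which avoids exit code 50001 at a higher cost per VM.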

VM reporting timeout (50002)

Issue in the statusEvents field for a job

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to Batch no longer receives VM updates with exit code 50002.

What happened

This issue occurs when a timeout in the backend causes Batch to stop receiving updates from a VM for the job. Many hardware or software failures can cause a VM to become unresponsive. For example, a VM might crash due to a temporary host event or insufficient resources.

Solution

  1. In case the issue is temporary and resolves itself, retry the task either by using automated task retries or manually re-running the job.
  2. If the issue persists, identify and resolve what is causing the VM to be unresponsive by doing one or more of the following:
    • Recommended: Get support through Google Cloud Support or the Batch label on Cloud Forums.
    • Try to identify and resolve the issue yourself. For example, if you are familiar with Compute Engine, you can try to troubleshoot the job's VMs by doing the following:
      • To identify the names of your job's VMs, do the following:
        • View logs for the job.
        • Filter logs for entries that contain the phrase "report agent state:".
        • Review the logs to determine the VM for each attempt of each task. Each log entry is similar to the following, which contains one "instance:" phrase and one or more "task_id:" phrases:
          report agent state: ... instance:"INSTANCE_NAME" ... task_id:"task/JOB_UID-group0-TASK_INDEX/TASK_RETRIES/0 ..."

          This log includes the following values:
          • INSTANCE_NAME: The name of the VM.
          • JOB_UID: The unique ID (UID) of the job.
          • TASK_INDEX: The index of the task.
          • TASK_RETRIES: The attempt of the task that ran on this VM, which is formatted as the number of retries. For example, this value is 0 for the first attempt of a task. Each task is only attempted once unless you enable automated task retries.
      • Troubleshoot your job's VMs using the Compute Engine documentation. For example, see Troubleshooting VM shutdowns and reboots and Troubleshooting VM startup.
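If you have exported the job's logs as text, the lookup described above can be scripted. The following is a minimal sketch, assuming each log entry is a single line in the format shown earlier and that the task group is named group0 as in that example. The names AgentReport and parse_agent_reports are hypothetical.

```python
import re
from typing import List, NamedTuple


class AgentReport(NamedTuple):
    instance_name: str  # INSTANCE_NAME: the name of the VM
    job_uid: str        # JOB_UID: the unique ID of the job
    task_index: int     # TASK_INDEX: the index of the task
    task_retries: int   # TASK_RETRIES: 0 for the first attempt


# One instance:"..." phrase per entry.
_INSTANCE_RE = re.compile(r'instance:"(?P<instance>[^"]+)"')
# One or more task_id:"task/JOB_UID-group0-TASK_INDEX/TASK_RETRIES/..." phrases.
_TASK_RE = re.compile(
    r'task_id:"task/(?P<job_uid>[^"/]+?)-group0-(?P<index>\d+)/(?P<retries>\d+)/'
)


def parse_agent_reports(log_line: str) -> List[AgentReport]:
    """Return one AgentReport per task_id found in a 'report agent state:' entry."""
    if "report agent state:" not in log_line:
        return []
    instance = _INSTANCE_RE.search(log_line)
    if instance is None:
        return []
    return [
        AgentReport(
            instance.group("instance"),
            match.group("job_uid"),
            int(match.group("index")),
            int(match.group("retries")),
        )
        for match in _TASK_RE.finditer(log_line)
    ]
```

Entries that do not contain the phrase, or that lack an instance: phrase, produce an empty list, so the function can be applied to every line of an exported log.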

VM rebooted during execution (50003)

Issue in the statusEvents field for a job

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to VM is rebooted during task execution with exit code 50003.

What happened

This issue occurs when a VM for a job unexpectedly reboots during run time.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

VM and task are unresponsive (50004)

Issue in the statusEvents field for a job

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to tasks cannot be canceled with exit code 50004.

What happened

This issue occurs when a task reaches the unresponsive time limit and cannot be canceled.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.

Task runs over the maximum runtime (50005)

Issue in the statusEvents field for a job

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to task runs over the maximum runtime with exit code 50005.

What happened

This issue occurs in the following cases:

  • A task's run time exceeds the time limit specified in the maxRunDuration field.
  • A runnable's run time exceeds the time limit specified in the timeout field.

To identify specifically which time limit was exceeded, view logs for the job and find a log that mentions the 50005 exit code. The textPayload field of this log indicates where and when the time limit was exceeded.

Important: Due to a known issue, the logs that Batch generates for an exceeded timeout don't indicate whether the task's timeout or the runnable's timeout was exceeded. For a workaround that explains how to identify which timeout was exceeded, see the known issue.

Solution

To resolve the issue, first try to verify the total run time required by the task or runnable that exceeded the time limit. Then, do one of the following:

  • If you expect this error only occasionally, such as for a task or runnable with an inconsistent run time, recreate the job and configure it to automate task retries, which can increase the success rate.
  • Otherwise, if the task or runnable consistently needs more time to finish than the current timeout allows, try restructuring your workflow to divide the work among multiple tasks. If this isn't possible, please add a description of your use case and upvote this feature request.
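Dividing the work among multiple tasks usually means splitting the inputs into smaller groups, each handled by its own task. The following is a minimal sketch of one way to do that split. The function name split_into_shards is hypothetical, and how each shard is handed to a task depends on your workflow.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_shards(items: Sequence[T], shard_count: int) -> List[List[T]]:
    """Split items into at most shard_count contiguous, near-equal shards."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Every shard gets `base` items; the first `extra` shards get one more.
    base, extra = divmod(len(items), shard_count)
    shards: List[List[T]] = []
    start = 0
    for index in range(shard_count):
        size = base + (1 if index < extra else 0)
        shards.append(list(items[start:start + size]))
        start += size
    # Drop empty shards when there are fewer items than requested shards.
    return [shard for shard in shards if shard]
```

Each shard can then be passed to a separate task, so that no single task has to process the full input within the time limit.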

VM recreated during execution (50006)

Issue in the statusEvents field for a job

Task state is updated from PRE-STATE to FAILED on zones/ZONE/instances/INSTANCE_ID due to VM is recreated during task execution with exit code 50006.

What happened

This issue occurs when a VM for a job is unexpectedly recreated during run time.

Solution

To resolve this issue, retry the task either by using automated task retries or manually re-running the job.
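When you are scanning many statusEvents messages, the exit codes covered above (50001 through 50006) can be looked up in a small table. The following is a minimal sketch, assuming each message ends with the phrase "with exit code NNNNN" as in the examples above. The names BATCH_EXIT_CODES and describe_status_event are hypothetical.

```python
import re
from typing import Optional

# Exit codes and section titles as listed in this article.
BATCH_EXIT_CODES = {
    50001: "VM preemption",
    50002: "VM reporting timeout",
    50003: "VM rebooted during execution",
    50004: "VM and task are unresponsive",
    50005: "Task runs over the maximum runtime",
    50006: "VM recreated during execution",
}

_EXIT_CODE_RE = re.compile(r"exit code (\d+)")


def describe_status_event(description: str) -> Optional[str]:
    """Return 'CODE: label' for a known exit code in the message, else None."""
    match = _EXIT_CODE_RE.search(description)
    if match is None:
        return None
    code = int(match.group(1))
    label = BATCH_EXIT_CODES.get(code)
    return f"{code}: {label}" if label is not None else None
```

A result of None means the message either has no exit code or has one that this article does not cover.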

Error message: AccessDeniedException

The error text

The error might look something like this:

AccessDeniedException: 403 serviceaccountname@serviceaccountdomain.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission ‘storage.objects.list’ denied on resource (or it may not exist).
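To find out which account and which permission are involved, you can pull both out of the error text. The following is a minimal sketch, assuming the message follows the wording of the example above. The names AccessDenied and parse_access_denied are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class AccessDenied(NamedTuple):
    account: str     # the account that was denied access
    permission: str  # the permission it is missing


_DENIED_RE = re.compile(
    r"AccessDeniedException: 403 (?P<account>\S+@\S+) "
    r"does not have (?P<permission>[\w.]+) access"
)


def parse_access_denied(message: str) -> Optional[AccessDenied]:
    """Extract the account and the missing permission, or None if no match."""
    match = _DENIED_RE.search(message)
    if match is None:
        return None
    return AccessDenied(match.group("account"), match.group("permission"))
```

The extracted account is the one to check against the bucket's permissions, and the extracted permission indicates the access it lacks.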

What's happening

You might encounter this error when running a workflow that uses a pet service account to access data stored outside of your Terra workspace.

This means that an account that needs to access your data for an analysis doesn’t have the necessary access permissions. This could happen for several reasons:

  1. You never gave the account permissions to access the Google Cloud bucket where the data is stored.
  2. The account isn’t registered for Terra.
  3. You added the account to the Google Cloud bucket, but the underlying Google ID has since changed.

What to do

1. Check whether the account has the necessary permissions for the Google Cloud bucket, and adjust the permissions if necessary, by following the instructions in How to access external Google Cloud resources.

    • Note: you must have an owner or admin role on the Google Cloud bucket in order to edit the bucket's permissions.

2. Register the account in Terra. If the error persists after verifying that the account has permission to access the Google bucket, it's possible that the account has not been registered in Terra. 

3. Start over with a new service account. If the error persists after you've completed steps 1 and 2, the account may be attached to an orphaned Terra account. This can happen if a service account was deleted and then re-created (for example, when someone left an institution). In this case, the best solution is to create a new service account and add it to the Google bucket instead. For step-by-step instructions, see How to use a service account in Terra and Best practices for using service accounts in Terra.

Google Cloud Billing account does not appear in the Billing menu

What to do 

  1. Verify that terra-billing@firecloud.org is a billing account user on your Google Cloud Billing account. Refer to the instructions in How to set up billing in Terra to add Terra as a user on your Google Cloud Billing account. After doing this, you should be able to create a Terra Billing project and link it to your Google Cloud Billing account.
  2. Check that you enabled Terra billing permissions for the correct Google ID. Your Google Cloud Billing account may also be missing because you enabled Terra billing permissions for a Google ID (e.g., a Gmail address) that is different from your Terra user ID. This can happen if you have both a professional and a personal Google account and are signed in to your personal account (not the account used for your Terra user ID) in the browser where you're working.

How to switch Google accounts in Chrome

Please enable Terra billing permissions for the Google ID under which you created your Cloud Billing account. You may need to close your Terra session and/or log out of the incorrect Google account before re-enabling Terra billing permissions.

1. In your Chrome browser, sign in to Google.

2. On the top right (to the left of the three vertical dots), select your profile image or initial.

3. On the menu, select or add the account corresponding to your Terra user ID.

Expired Google Billing account trial

What's happening

Google Billing accounts set up using a free trial will automatically expire after 60 days unless you upgrade to a paid account. If your Google Billing account expired, you will be unable to create a new Terra Billing Project.

What to do

To verify that your Google Billing account is active, log in to the Google Developers Console Billing page. You can select Show all accounts to display an expired Google Billing account. Google will prompt you to upgrade your free account to a full account backed by a credit card or bank account. 

If your Google Billing account is inactive, we suggest that you create a new Google Billing account.

Didn't find your problem/solution? Try these resources

  • Bugs & Feature Requests: List of known bugs, limitations, and requested enhancements.
  • Forum: Ask the support team a question directly or see if other users have asked your question. The support team tries to respond to every question within one business day.

 
