Workflow error: invalid value for field 'resource.machineType'
Hi - This is my first time running a workflow in Terra and I would like help interpreting the error. I used the public workflow "bcftools_filter_bedtools_intersect", which takes in a VCF and intersects it with a BED file. I tried the workflow on a small sample VCF with no problems. Then I tried it on my own VCF, which is 1.3 TB, with the same BED file and got a 'not enough space' error. I then changed the number of cores (tried 4, then 1), the disk (2,700 GB), and the memory (1.4 TB).
Now I get an error that I don't understand: "Task bcftools_filter_bedtools_intersect.filter_intersect:NA:1 failed. The job was stopped before the command finished. PAPI error code 3. Execution failed: allocating: creating instance: inserting instance: Invalid value for field 'resource.machineType': 'zones/us-central1-c/machineTypes/custom-216-1433600'. Custom Machine type with name 'custom-216-1433600' does not exist." I have no idea where or how I specified such a specific machine, or how to tell it to choose another. Thinking it might have been random, I tried again with the number of cores reduced to 1, but got the same error. Help interpreting and troubleshooting this would be appreciated.
Comments
Hi Diane,
Thank you for writing in about this issue! Can you share the workspace where you are seeing this issue with Terra-Support@firecloud.org by clicking the Share button? The Share option is in the three-dots menu at the top-right of your workspace.
Please provide us with the relevant submission ID and workflow configuration as well, so we can take a closer look.
Kind regards,
Pamela
Hi - I realized I had replied via email, so it didn't come through as a reply to this question. I have shared the workspace, and here are the details:
https://app.terra.bio/#workspaces/Epi4K_CandidateGenes/Epi25K%20Analysis
submission: 3d9bb9d4-6d94-49b1-ba69-819fd55f590a
Workflow Configuration
Now that it has been a week, I also wanted to ask about the job I mentioned that seems to be running without error (baac0b95-bd34-439d-84e8-80111feb393b). I am very new to this, and since it has been 7 days, I am unsure how to tell whether something is actually happening and it simply needs more time, or whether the job has stalled but still shows as running.
Hi Diane Shao,
For some reason, I am still unable to access your workspace. Could you please ensure that it is shared with Terra-Support@firecloud.org? Regarding your second message, just for clarification: has this particular job been running for 7 days?
Kind regards,
Pamela
You're right, the share didn't go through before, but I think I have successfully shared it now.
This submission gave the error in the initial post:
3d9bb9d4-6d94-49b1-ba69-819fd55f590a
This submission running now for 1 week:
baac0b95-bd34-439d-84e8-80111feb393b
Hi Diane,
Thank you for sharing your workspace. With the N1 machines that Terra uses by default, you can only request up to 96 CPUs and 624 GB of memory (6.5 GB per CPU); this is a limit set by Google. You can find more information about machine types and limits here. The machine type in your error, 'custom-216-1433600', encodes 216 CPUs and 1,433,600 MB (about 1.4 TB) of memory: requesting 1.4 TB of memory forces the backend to pair it with roughly 216 CPUs to keep within the 6.5 GB-per-CPU ratio, which is well over the 96-CPU limit. That is likely why the job ran successfully when you didn't specify the memory. For the job that is currently still running, did you use the same memory specifications?
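For reference, here is a minimal sketch of what runtime attributes within those N1 limits could look like in the workflow's WDL. The numbers are illustrative only, not tuned for your data, and the right values will depend on what the tools in the task actually need:

runtime {
  cpu: 16
  memory: "96 GB"               # 16 CPUs x 6.5 GB/CPU allows up to 104 GB on an N1 custom machine
  disks: "local-disk 2700 HDD"  # scratch disk sized to hold the input VCF plus outputs
}

The key point is to keep the memory request consistent with the CPU count (at most 6.5 GB per CPU) and to stay under the overall 96-CPU / 624 GB ceiling.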
Kind regards,
Pamela
Thank you for clarifying (I had no idea about the CPU limit). For the job that is currently still running, I had left the memory field blank. That is why I'm worried that it may have stalled. Is it correct that I would receive an error, or some notification, if there were a problem? So if it shows as running, it likely just needs more time?
Hi Diane,
Yes, I would assume that if something were wrong, you would receive an error message and the job would fail. Given that it still says the job is running, I would wait to see if it progresses further. If there is no progression, don't hesitate to reach back out so we can look further into what might be going on.
Kind regards,
Pamela
Hi Pamela - I think that is my question. How do I know if there is "progression"?
Hi Diane,
That's a great question! You can monitor the progress of the job by clicking on the Job History tab and then on the relevant submission. From there, the Job Manager (first screenshot) shows the tasks and inputs that are being run, and the backend log (second screenshot) shows the specific tasks being executed. In your case, it looks like this job may not be progressing as it should: there isn't anything present in the timing diagram, and not much appears in the backend log. It's possible that you have not allocated enough RAM to the job, which could prevent it from progressing. I would recommend taking a look at this article, which goes through some troubleshooting steps for when workflows aren't running as they should.
Please let me know if you have any additional questions.
Kind regards,
Pamela
Hi Diane,
I wanted to jump in to add that an apparent lack of progress in the backend log doesn't necessarily mean nothing is happening in the job. It all depends on what tools you're running and what feedback those commands give, if any; some commands run for long periods without producing output. It's helpful to look at your log, see which step was running last, and judge whether the time that step is taking is within the range you would expect given the resources (CPUs, memory, etc.) you've allocated for that task.
Kind regards,
Jason
Hi Diane,
Thank you for writing in about the new error. Where are you seeing this error? When I click on the job manager for your most recent workflow, I only see the following error:
"Task bcftools_filter_bedtools_intersect.filter_intersect:NA:1 failed. The job was stopped before the command finished. PAPI error code 4. User specified operation timeout reached"
Could you send the relevant submission ID or a screenshot of the access error you're seeing? It does seem likely that the error is occurring due to the "anonymous" caller. It's possible that the job did not reach the task that required the file until a few days in, which is when the error occurred.
Kind regards,
Pamela
So strange. Now I also cannot find where I saw that note about the anonymous user. Perhaps it is indeed just a script that took too long to run and timed out. Do you have a recommendation for a workflow that can break down my extremely large VCF into chunks (perhaps by chromosome or some other parameter) to reduce the run time?
Hi Diane,
I would suggest first trying to allocate more RAM to the job to see if that allows it to run faster. I'm not aware of a specific workflow or guidelines for breaking up the VCF into smaller pieces, but I can look into it if that is what you decide you'd like to do.
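In the meantime, one common pattern (not an official Terra workflow, just a sketch under the assumption that your VCF is bgzipped and indexed) is to scatter over chromosomes and have each shard extract one region with bcftools before running the filter/intersect step. The task name, file names, and docker image below are all placeholders:

version 1.0

task subset_vcf_by_region {
  input {
    File vcf        # bgzipped VCF (.vcf.gz)
    File vcf_index  # matching .tbi index; bcftools -r requires the VCF to be indexed
    String region   # e.g. "chr1"
  }
  command <<<
    set -euo pipefail
    # Pull out a single chromosome/region into its own compressed VCF
    bcftools view -r ~{region} -Oz -o ~{region}.vcf.gz ~{vcf}
    bcftools index -t ~{region}.vcf.gz
  >>>
  output {
    File region_vcf = "~{region}.vcf.gz"
    File region_vcf_index = "~{region}.vcf.gz.tbi"
  }
  runtime {
    docker: "quay.io/biocontainers/bcftools:1.9--ha228f0b_4"  # assumed image; any image that ships bcftools works
    cpu: 2
    memory: "8 GB"
    disks: "local-disk 3000 HDD"  # each shard still localizes the full input VCF, so size the disk for it
  }
}

In the calling workflow you would scatter over a list of regions (for example, scatter (region in regions) with regions = ["chr1", "chr2", ...]) and run the downstream filtering on each shard. Keep in mind that every shard still copies in the full 1.3 TB VCF, so this mainly reduces per-task run time and memory pressure rather than total I/O.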
Kind regards,
Pamela