Hail does not import on Spark Cluster

Post author
Cecile Avery

Hello!

Apologies if this is a basic question. I have been using a Jupyter notebook to develop a filtering script, but it has been running extremely slowly on a single node. I had read that if the operation on each row is independent, it may be beneficial to use a Spark cluster for parallelization.

I thought it would be as simple as updating my environment, but when I try to import Hail on the Spark cluster, I get the following error:

Invalid maximum heap size: -Xmx0m
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Am I missing some other environment configuration? I couldn't find any examples of loading Hail beyond the basic:

import hail as hl
hl.init(default_reference='GRCh38', idempotent=True)

Thank you!

Comments

3 comments

  • Comment author
    Josh Evans

    Hi Cecile,

    Thanks for writing in! Have you changed the configuration within Terra itself to use a Spark cluster? That setting can be found on the Cloud Environment's configuration page.

    You can read more about this at this link: https://support.terra.bio/hc/en-us/articles/5075814468379-Starting-and-customizing-your-Jupyter-app

    Please give that a try if you haven't already. If you have, please let me know and I'll provide some other troubleshooting steps.

    Let me know if you have any questions.

    Best,

    Josh

  • Comment author
    Cecile Avery

    Hi Josh,

    Thank you for your response!

    Yes, I have changed the configuration itself to use the Spark cluster. The environment updates accordingly, but when I re-run my notebook, Hail does not import. On a single node, I do not run into any problems.

  • Comment author
    Josh Evans

    Hi Cecile,

    Thanks for the reply. I tried replicating the error you were getting but was unable to do so. When I ran the commands you provided, though, I did get an error saying that using hl.init with a default_reference argument is deprecated, so I used hl.default_reference('GRCh38') instead and was able to load the genome.
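
    For reference, here is a minimal sketch of that change. It assumes a Hail release in which hl.default_reference accepts a genome name (older releases may only support passing default_reference to hl.init). The init_hail helper is a hypothetical name used for illustration and is not part of Hail.

```python
def init_hail(reference: str = "GRCh38"):
    """Initialise Hail, then set the default reference genome as a separate step.

    `init_hail` is a hypothetical helper, not a Hail function. Hail is imported
    inside the function so that an import or JVM start-up failure is raised
    only when the helper is actually called.
    """
    import hail as hl

    # Start (or reuse) the Hail context without the deprecated
    # default_reference argument.
    hl.init(idempotent=True)

    # Set the default reference genome separately.
    hl.default_reference(reference)
    return hl


# Usage in a notebook cell:
# hl = init_hail()
```

    If the JVM error still appears, it is raised while Hail starts up, before the reference genome is set, so this change alone may not resolve it.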

    I would suggest trying that in case it is part of the issue. If that doesn't help, you may want to delete everything and try again. (Please export any data before deletion.)

    If you're still getting the error at that point, then I'd like to look over all the code you're using with this cluster. If you're using a notebook file with this Cloud Environment, please share your workspace with Support and I'll take a look as soon as I can.  If not, please add all of the code and commands that you are running before the two that you mention at the beginning of this thread.
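
    When sharing those details, the output of a quick environment check may also be useful. The sketch below looks for a zero heap or memory size (such as the -Xmx0m in the error above) in environment variables that commonly carry JVM or Spark launch options. It is an assumption that the setting comes from one of these variables on this cluster; it could equally come from a Spark configuration file. The function name is illustrative.

```python
import os
import re

# Environment variables that commonly carry JVM or Spark launch options.
# Whether any of them is set on a given cluster is an assumption to verify.
JVM_OPTION_VARS = (
    "PYSPARK_SUBMIT_ARGS",
    "SPARK_SUBMIT_OPTS",
    "JAVA_TOOL_OPTIONS",
    "_JAVA_OPTIONS",
    "JDK_JAVA_OPTIONS",
)

# Matches a zero size after -Xmx, or after "memory=" / "memory ",
# for example "-Xmx0m" or "--driver-memory 0m".
ZERO_SIZE = re.compile(r"(?:-Xmx|memory[=\s])0+[kmgt]?(?![0-9])", re.IGNORECASE)


def find_zero_heap_settings(env=None):
    """Return {variable: value} for option variables that request a zero memory size."""
    env = os.environ if env is None else env
    suspects = {}
    for name in JVM_OPTION_VARS:
        value = env.get(name, "")
        if ZERO_SIZE.search(value):
            suspects[name] = value
    return suspects


# Print every option variable that is set, then any that look suspicious.
for name in JVM_OPTION_VARS:
    if name in os.environ:
        print(f"{name}={os.environ[name]}")
print("Zero-size settings:", find_zero_heap_settings())
```

    An empty result only rules out these variables; it does not show that the memory settings are correct.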

    Please let me know if that information was helpful or if you have any questions.

    Best,

    Josh

