Terra Terminal default configuration does not support using `gsutil` to copy/download Google composite files Completed
Current problem description:
The Terra Terminal default configuration does not support using `gsutil` to copy/download Google composite files.
For example, in Terra, previewing a DOS/DRS URI, then copying the provided `gsutil` command into a Terra Terminal window as follows:
```
$ gsutil cp gs://org-humancellatlas-dss-checkout-staging/blobs/b14003b2fefd97b1488b1702e5ebb247a0dfc3d3438821f6a04d57d61ca8a3ff.3654ad5a518bcf4222460e7daf3977d430a2587b.26361563a02af50391021fc3c379351f-5.c97722b5 .
```
produces the following output:
```
Copying gs://org-humancellatlas-dss-checkout-staging/blobs/b14003b2fefd97b1488b1702e5ebb247a0dfc3d3438821f6a04d57d61ca8a3ff.3654ad5a518bcf4222460e7daf3977d430a2587b.26361563a02af50391021fc3c379351f-5.c97722b5...
==> NOTE: You are downloading one or more large file(s), which would
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").
CommandException:
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see "gsutil help crcmod".
To download regardless of crcmod performance or to skip slow integrity
checks, see the "check_hashes" option in your boto config file.
NOTE: It is strongly recommended that you not disable integrity checks. Doing so
could allow data corruption to go undetected during uploading/downloading.
```
The output points to the instructions printed by `gsutil help crcmod`, which are lengthy and involve more than some Terra users may want, or be able, to carry out.
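Before working through those instructions, it may help to confirm that the compiled extension really is missing. The following is a minimal sketch, assuming `gsutil version -l` prints a `compiled crcmod:` line in the form shown in the `gsutil help crcmod` text quoted later in this report. The function name `crcmod_is_compiled` is made up for illustration and is not part of `gsutil`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `gsutil version -l` output on stdin and
# returns 0 if it reports a compiled crcmod, 1 otherwise.
crcmod_is_compiled() {
  grep -q '^[[:space:]]*compiled crcmod:[[:space:]]*True'
}

# Only query gsutil when it is actually on PATH.
if command -v gsutil >/dev/null 2>&1; then
  if gsutil version -l 2>/dev/null | crcmod_is_compiled; then
    echo "compiled crcmod detected"
  else
    echo "compiled crcmod not detected; composite object downloads may be refused"
  fi
fi
```

If the second message appears, the `CommandException` above is the expected result of copying a composite object with the default settings.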
Desired behavior:
The Terra Terminal default/initial configuration should already be set up so that `gsutil` works for Google composite files, without errors or warnings such as those shown above.
Steps to reproduce:
The full sequence used to produce this problem is listed below:
1. Access the HCA Data Browser, here: https://dev.data.humancellatlas.org/explore/projects
In this case, we are obtaining our large sample file by exporting data from HCA to Terra,
although the issue with `gsutil` being able to copy composite files is more general than just HCA data.
2. Select the checkbox next to the project titled "Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns." (you may have to forward through multiple pages to find it)
3. Select the `Export to Terra (Demo)` button at the top of the page.
4. In "Select Export File Types", select `bam` and `fastq.gz`, then select the `Export to Terra` button.
5. In the Terra import form, select `create a new workspace`, provide a workspace name and a billing project, then select `Create Workspace`.
6. In the Terra workspace, select the Data tab, then select the `participant` table.
7. Scroll to find the data column named `__bam_0__dos_url`, then select one of the values in this column, which will display the preview dialog. The size of the selected file will likely be a few hundred megabytes. Copy the `gsutil` command displayed in the preview dialog.
8. Start the Terra Terminal, by clicking the icon towards the top right of the Terra UI. Wait for the Terminal to start ...
9. Paste the `gsutil` command into the Terminal and press the Return key.
10. Observe the following warning/error:
```
==> NOTE: You are downloading one or more large file(s), which would
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").
CommandException:
Downloading this composite object requires integrity checking with CRC32c,
but your crcmod installation isn't using the module's C extension, so the
hash computation will likely throttle download performance. For help
installing the extension, please see "gsutil help crcmod".
To download regardless of crcmod performance or to skip slow integrity
checks, see the "check_hashes" option in your boto config file.
NOTE: It is strongly recommended that you not disable integrity checks. Doing so
could allow data corruption to go undetected during uploading/downloading.
```
The file has not been copied. Instead, the user is told to follow the instructions produced by the command `gsutil help crcmod`, which are rather involved:
```
$ gsutil help crcmod
NAME
crc32c - CRC32C and Installing crcmod
OVERVIEW
Google Cloud Storage provides a cyclic redundancy check (CRC) header that
allows clients to verify the integrity of object contents. For non-composite
objects Google Cloud Storage also provides an MD5 header to allow clients to
verify object integrity, but for composite objects only the CRC is available.
gsutil automatically performs integrity checks on all uploads and downloads.
Additionally, you can use the ``gsutil hash`` command to calculate a CRC for
any local file.
The CRC variant used by Google Cloud Storage is called CRC32C (Castagnoli),
which is not available in the standard Python distribution. The implementation
of CRC32C used by gsutil is provided by a third-party Python module called
`crcmod <https://pypi.python.org/pypi/crcmod>`_.
The crcmod module contains a pure-Python implementation of CRC32C, but using
it results in very poor performance. A Python C extension is also provided by
crcmod, which requires compiling into a binary module for use. gsutil ships
with a precompiled crcmod C extension for macOS; for other platforms, see
the installation instructions below.
At the end of each copy operation, the ``gsutil cp`` and ``gsutil rsync``
commands validate that the checksum of the source file/object matches the
checksum of the destination file/object. If the checksums do not match,
gsutil will delete the invalid copy and print a warning message. This very
rarely happens, but if it does, please contact gs-team@google.com.
CONFIGURATION
To determine if the compiled version of crcmod is available in your Python
environment, you can inspect the output of the ``gsutil version`` command for
the "compiled crcmod" entry:
$ gsutil version -l
...
compiled crcmod: True
...
If your crcmod library is compiled to a native binary, this value will be
True. If using the pure-Python version, the value will be False.
To control gsutil's behavior in response to crcmod's status, you can set the
"check_hashes" configuration variable. For details on this variable, see the
surrounding comments in your boto configuration file. If "check_hashes"
is not present in your configuration file, rerun ``gsutil config`` to
regenerate the file.
INSTALLATION
These installation instructions assume that:
- You have ``pip`` installed. Consult the `pip installation instructions
<https://pip.pypa.io/en/stable/installing/>`_ for details on how
to install ``pip``.
- Your installation of ``pip`` can be found in your ``PATH`` environment
variable. If it cannot, you may need to replace ``pip`` in the commands
below with the full path to the executable.
- You are installing the crcmod package for use with your system installation
of Python, and thus use the ``sudo`` command. If installing crcmod for a
different Python environment (e.g. in a virtualenv), you should omit
``sudo`` from the commands below.
CentOS, RHEL, and Fedora
------------------------
Note that CentOS 6 and similar variants use Python 2.6 by default, which will
not run gsutil. To enable Python 2.7 and compile/install crcmod on CentOS 6:
sudo su # Run as root; need shell session with Python 2.7 enabled
yum install gcc python-devel python-setuptools redhat-rpm-config
source /opt/rh/python27/enable # Make default `python` executable use 2.7.X
python -m pip install -U pip # Upgrade old default version of pip
python -m pip uninstall crcmod
python -m pip install --no-cache-dir -U crcmod
exit # Exit su session
To compile and install crcmod on OS versions that use Python 2.7 by default:
sudo yum install gcc python-devel python-setuptools redhat-rpm-config
sudo pip uninstall crcmod
sudo pip install --no-cache-dir -U crcmod
Debian and Ubuntu
-----------------
To compile and install crcmod:
sudo apt-get install gcc python-dev python-setuptools
sudo pip uninstall crcmod
sudo pip install --no-cache-dir -U crcmod
Enterprise SUSE
-----------------
To compile and install crcmod:
sudo zypper install gcc python-devel
sudo pip uninstall crcmod
sudo pip install --no-cache-dir -U crcmod
macOS
-----
gsutil distributes a pre-compiled version of crcmod for macOS, so you shouldn't
need to compile and install it yourself. If for some reason the pre-compiled
version is not being detected, please let the Google Cloud Storage team know
(see ``gsutil help support``).
To compile manually on macOS, you will first need to install
`XCode <https://developer.apple.com/xcode/>`_ and then run:
sudo pip install -U crcmod
Windows
-------
An installer is available for the compiled version of crcmod from the Python
Package Index (PyPi) at the following URL:
https://pypi.python.org/pypi/crcmod/1.7
MSI installers are available for the 32-bit versions of Python 2.7.
Make sure to install to a 32-bit Python directory. If you're using 64-bit
Python it won't work with 32-bit crcmod, and instead you'll need to install
32-bit Python in order to use crcmod.
Note: If you have installed crcmod and gsutil hasn't detected it, it may have
been installed to the wrong directory. It should be located at
<python_dir>\files\Lib\site-packages\crcmod\
In some cases, the installer will incorrectly install to
<python_dir>\Lib\site-packages\crcmod\
Manually copying the crcmod directory to the correct location should resolve
the issue.
```
Comments
11 comments
Hi all,
crcmod is now available by default in our Jupyter base image! Thanks again for writing in and voicing your support.
Kind regards,
Jason
Has anyone worked to fix this issue? I am having a similar problem in Jupyter notebooks.
For example if I run:
I get the following error:
I am launching the Jupyter notebook with the Python VM (Python 3.7.9, pandas...). When I checked the installed packages it looks like this VM should have CRC32c.
But I am still having issues using gsutil to download chunked files.
Hi Julian Lucas,
Thank you for writing in. A member of our notebooks team will look into this and we'll get back to you as soon as we can.
Kind regards,
Jason
Hi Julian,
You can provide your runtime a startup script with the Debian/Ubuntu installation instructions found here to get this working while our notebooks team considers adding this to the default runtimes.
Your script can look something like this:
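#!/usr/bin/env bash
# Install the build tools needed to compile crcmod's C extension (-y: no prompt)
apt-get install -y gcc python3-dev python3-setuptools
# Remove the existing crcmod, then reinstall so the C extension gets compiled
pip3 uninstall -y crcmod
pip3 install --no-cache-dir -U crcmod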
Simply save this .sh script to a Google bucket (like your workspace bucket) and provide its path where it says Startup Script in your Cloud environment configuration.
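As a rough sketch of the upload step, the snippet below copies the script to a bucket and prints the path to paste into the Startup Script field. Everything named here is an assumption for illustration: the file name `install_crcmod.sh`, the `scripts/` prefix, the helper `startup_script_uri`, and the expectation that a `WORKSPACE_BUCKET` environment variable holds the workspace bucket's `gs://` URI. If that variable is not set in your environment, pass the bucket explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the gs:// destination for a startup script.
#   $1 = bucket URI (e.g. gs://my-bucket), $2 = local script path
startup_script_uri() {
  local bucket="${1%/}"   # drop a single trailing slash, if present
  printf '%s/scripts/%s\n' "$bucket" "$(basename "$2")"
}

# Upload only when gsutil, the bucket variable, and the script are all present.
# WORKSPACE_BUCKET and install_crcmod.sh are assumptions, not confirmed names.
if command -v gsutil >/dev/null 2>&1 \
   && [ -n "${WORKSPACE_BUCKET:-}" ] \
   && [ -f install_crcmod.sh ]; then
  dest="$(startup_script_uri "$WORKSPACE_BUCKET" install_crcmod.sh)"
  gsutil cp install_crcmod.sh "$dest"
  echo "Startup script path: $dest"
fi
```

The printed path is what goes into the Startup Script field of the Cloud environment configuration.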
I've confirmed this works by running the specified gsutil command:
If you have any questions, please let us know.
Kind regards,
Jason
Thanks, Jason. This works for me.
-Julian
Hi Julian,
Glad to hear. We'll be happy to let you know if a compiled crcmod gets added to our default runtimes.
Kind regards,
Jason
I could use this too! I'm using the RStudio image -- is there any way to deploy this?
Thanks!
Mike
Hi Michael Schatz,
Thanks for writing in! Are you looking to get a compiled crcmod working in RStudio, or use gsutil more generally? Are you working with DOS/DRS files?
Kind regards,
Jason
I'm looking to get crcmod working with gsutil when run from the RStudio terminal console. But I already figured out I can install the crcmod package at the terminal using the conda installation of pip.
Thanks!
Mike
Hi Michael Schatz,
Gotcha - thanks for letting us know! The Interactive Analysis team will be working on building in support for using startup scripts with RStudio environments. This can work as an alternative solution to your conda install, once it's made available.
I'm glad to hear you found a solution of your own!
Kind regards,
Jason
Startup scripts are now supported for RStudio!