This release note covers May 18, 2023 through May 25, 2023. The release includes back-end updates to workflows, interactive analysis (Notebooks, Galaxy, RStudio), the user interface, the Data Repository, and the Google and Azure integrations, all in support of upcoming features.
Data Repo
- There is a new dataset endpoint, 'lookupDatasetColumnStatisticsById', which returns statistics about text and numeric columns. It currently supports GCP datasets only; support for Azure is coming soon. For text columns (data types string, text, fileref, dirref), it returns the list of unique values with an occurrence count for each value. For numeric columns (data types float, float64, int, int64), it returns the minimum and maximum values found in the column.
- If a GCP-backed dataset has a dedicated (unique) service account, TDR now bypasses the costly permission check on every user-specified GCS file. Because such a service account is unique to the dataset, flights still fail correctly if the service account lacks the needed file permissions, and they fail fast if the caller does not have permission to initiate the flight in question. In benchmarks, file ingests ran up to 2.5x faster with this change.
- When a snapshot is created from a dataset with secure monitoring enabled, TDR registers a policy in the Terra Policy Service so that the snapshot can only be exported into a workspace with appropriate security/monitoring constraints.
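
To make the column statistics described for 'lookupDatasetColumnStatisticsById' concrete, here is a minimal local sketch of the same summary logic: unique values with counts for text columns, and minimum and maximum for numeric columns. It does not call TDR. The function name, the result keys ('values', 'value', 'count', 'minValue', 'maxValue'), and the skipping of null values are assumptions made for illustration and may not match the actual response format.

```python
from collections import Counter

# Data types as listed in the release note.
TEXT_TYPES = {"string", "text", "fileref", "dirref"}
NUMERIC_TYPES = {"float", "float64", "int", "int64"}


def column_statistics(values, data_type):
    """Summarize one column (hypothetical helper, not part of TDR)."""
    # Assumption: null values are ignored when computing statistics.
    present = [v for v in values if v is not None]

    if data_type in TEXT_TYPES:
        # Unique values with an occurrence count per value.
        counts = Counter(present)
        return {
            "values": [
                {"value": value, "count": count}
                for value, count in sorted(counts.items())
            ]
        }

    if data_type in NUMERIC_TYPES:
        # Minimum and maximum of the column values.
        if not present:
            return {"minValue": None, "maxValue": None}
        return {"minValue": min(present), "maxValue": max(present)}

    raise ValueError(f"No statistics defined for data type: {data_type}")


# Example usage with made-up column data.
text_stats = column_statistics(["blood", "saliva", "blood", None], "string")
numeric_stats = column_statistics([42, 7, 19], "int64")
```

Sorting the unique values is only there to make the output deterministic in this sketch; the note does not say how the endpoint orders its results.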