cbiohub is a Python package and CLI tool designed to simplify the analysis of data from cBioPortal, including the datasets hosted on the cBioPortal Datahub. Unlike existing API clients, which fetch slices of data via the REST API, cbiohub supports bulk analysis of harmonized datasets. By using combined parquet files instead of per-study CSV/TSV files, it enables faster data loading and querying.
cbiohub features:
- A data module for ingesting and converting cBioPortal Datahub files into parquet format
- An analysis module leveraging DuckDB for efficient local data exploration
With parquet’s widespread compatibility, cbiohub allows seamless integration with other programming languages and data warehousing tools.
For convenience, pre-combined parquet files for datahub are available from Hugging Face.
For example, you can download the cBioPortal Datahub files:
git clone git@github.com:cbioportal/datahub ~/git/datahub
Now ingest them, i.e. convert them into parquet files on your local machine:
cbiohub data ingest ~/git/datahub/public/
By default, all the data gets stored in ~/cbiohub/. Combine all the study data together into a single study:
cbiohub data combine
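Conceptually, combining means stacking the per-study tables into one table while recording which study each row came from. The sketch below illustrates that idea using only the standard library. The `combine_studies` helper, the `study_id` column, and the toy TSV contents are assumptions made for illustration; this is not cbiohub's actual implementation, which works on parquet files.

```python
import csv
import io

# Toy per-study sample tables (made-up contents), keyed by a hypothetical study ID.
STUDY_TSVS = {
    "study_a": "SAMPLE_ID\tCANCER_TYPE\nS1\tMelanoma\nS2\tThyroid Cancer\n",
    "study_b": "SAMPLE_ID\tCANCER_TYPE\nS1\tMelanoma\n",
}


def combine_studies(study_tsvs):
    """Stack per-study rows into one list, tagging each row with its study."""
    combined = []
    for study_id, text in study_tsvs.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            # Keep the study ID so identical sample IDs from different
            # studies remain distinguishable after combining.
            combined.append({"study_id": study_id, **row})
    return combined


combined = combine_studies(STUDY_TSVS)
for row in combined:
    # Study-qualified sample key in the study:sample style used by the CLI output.
    print(f"{row['study_id']}:{row['SAMPLE_ID']}")
```

Keeping the study ID on every row is what allows a single query to span all studies while still reporting results per study.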
Now you can use the cbiohub package to analyze the data quickly. For example, you can load the combined study data into a pandas DataFrame:
import cbiohub
df = cbiohub.get_combined_df()
Or you can use the cbiohub cli to do quick analyses:
> cbiohub find BRAF V600E
✅ Variant found in 3595 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...
Search for the same BRAF V600E variant but with a specific genomic change (A>T):
> cbiohub find 7 140453136 140453136 A T
✅ Variant found in 3571 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...
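The two searches return slightly different counts (3595 vs. 3571) because the same protein change can result from more than one nucleotide change. The sketch below illustrates the difference on toy records. The column names follow the MAF convention and are an assumption about how the combined data is laid out; the records and both helper functions are hypothetical.

```python
# Toy mutation records (made up), using MAF-style column names.
RECORDS = [
    {"sample": "s1", "Hugo_Symbol": "BRAF", "HGVSp_Short": "p.V600E", "Chromosome": "7",
     "Start_Position": 140453136, "End_Position": 140453136,
     "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"},
    {"sample": "s2", "Hugo_Symbol": "BRAF", "HGVSp_Short": "p.V600E", "Chromosome": "7",
     "Start_Position": 140453136, "End_Position": 140453136,
     "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"},
    # Same protein change, but caused by a two-base substitution.
    {"sample": "s3", "Hugo_Symbol": "BRAF", "HGVSp_Short": "p.V600E", "Chromosome": "7",
     "Start_Position": 140453135, "End_Position": 140453136,
     "Reference_Allele": "CA", "Tumor_Seq_Allele2": "TT"},
    # Different protein change at the same codon.
    {"sample": "s4", "Hugo_Symbol": "BRAF", "HGVSp_Short": "p.V600K", "Chromosome": "7",
     "Start_Position": 140453136, "End_Position": 140453137,
     "Reference_Allele": "AC", "Tumor_Seq_Allele2": "TT"},
]


def find_by_protein(records, gene, change):
    """Match on gene symbol and protein change, e.g. ("BRAF", "V600E")."""
    return [r for r in records
            if r["Hugo_Symbol"] == gene and r["HGVSp_Short"] == f"p.{change}"]


def find_by_genomic(records, chrom, start, end, ref, alt):
    """Match on the exact genomic change."""
    return [r for r in records
            if (r["Chromosome"], r["Start_Position"], r["End_Position"],
                r["Reference_Allele"], r["Tumor_Seq_Allele2"]) == (chrom, start, end, ref, alt)]


by_protein = find_by_protein(RECORDS, "BRAF", "V600E")
by_genomic = find_by_genomic(RECORDS, "7", 140453136, 140453136, "A", "T")
print(len(by_protein), len(by_genomic))
```

Every genomic match is also a protein-level match here, but not the other way around, which mirrors the smaller count returned by the genomic search.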
Determine the variant frequency across different cancer types:
> cbiohub variant-frequency 7 140453136 140453136 A T
✅ Variant frequency per CANCER_TYPE:
CANCER_TYPE        altered  total  freq
Thyroid Cancer         770   1774  43.4
Melanoma               831   2902  28.6
Histiocytosis           42    160  26.3
Colorectal Cancer      501   6479   7.7
...
Instead of displaying the results, you can also get the SQL directly with the --sql flag. Under the hood, cbiohub uses DuckDB to run the SQL queries. By piping the output to duckdb, you can run the SQL queries directly (and edit them to your liking):
> cbiohub variant-frequency BRAF V600E --sql | duckdb
┌───────────────────────────────────────┬─────────┬───────┬────────┐
│              CANCER_TYPE              │ altered │ total │  freq  │
│                varchar                │  int64  │ int64 │ double │
├───────────────────────────────────────┼─────────┼───────┼────────┤
│ Thyroid Cancer                        │     775 │  1774 │   43.7 │
...
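To make the shape of such a query concrete, here is a minimal sketch of a per-cancer-type frequency aggregation. It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies. The table layout, column names, and toy rows are assumptions, and the SQL that cbiohub actually emits may look different.

```python
import sqlite3

# In-memory database with one row per sample; `altered` is 1 if the
# sample carries the variant of interest, else 0 (toy data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (sample_id TEXT, cancer_type TEXT, altered INTEGER)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("s1", "Thyroid Cancer", 1),
        ("s2", "Thyroid Cancer", 0),
        ("s3", "Melanoma", 1),
        ("s4", "Melanoma", 0),
        ("s5", "Melanoma", 0),
        ("s6", "Melanoma", 0),
    ],
)

# Frequency = percentage of altered samples per cancer type.
QUERY = """
SELECT cancer_type AS CANCER_TYPE,
       SUM(altered) AS altered,
       COUNT(*) AS total,
       ROUND(100.0 * SUM(altered) / COUNT(*), 1) AS freq
FROM samples
GROUP BY cancer_type
ORDER BY freq DESC
"""

rows = con.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The freq value in the DuckDB output above is consistent with this definition: 775 altered out of 1774 samples is about 43.7 percent.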
After data ingestion, cbiohub mainly provides a convenient command line interface to run SQL queries against a set of harmonized parquet files.
To remove all local parquet files, run:
cbiohub clean
To set up the development environment, install the development dependencies:
poetry install
You can run the cli using e.g.:
poetry run cbiohub data ingest ~/git/datahub/public/
and
poetry run cbiohub find BRAF V600E
You can also use IPython for interactive exploration:
poetry run ipython
- For the variant-frequency command, handle gene panels
- Add a GitHub Action to datahub that uses cbiohub to push combined parquet data to Hugging Face (https://huggingface.co/datasets/cBioPortal/datahub)