cbiohub

WARNING ⚠️: This package is still under construction.

cbiohub is a Python package and CLI tool that simplifies the analysis of cBioPortal data, including the studies hosted on the cBioPortal Datahub. Unlike existing API clients, which retrieve slices of data through the REST API, cbiohub supports bulk analysis of harmonized datasets. It works from combined parquet files instead of per-study CSV/TSV files, which makes loading and querying the data faster.

cbiohub features:

A data module for ingesting and converting cBioPortal Datahub files into parquet format
An analysis module leveraging DuckDB for efficient local data exploration

Because parquet is widely supported, the files cbiohub produces can also be read from other programming languages and data warehousing tools.

For convenience, pre-combined parquet files for datahub are available from Hugging Face.

Usage

Analyze Local files

Step 1: Obtain data files

You can, for example, download the cBioPortal Datahub files:

git clone git@github.com:cbioportal/datahub ~/git/datahub

Step 2: Ingest and Combine

Next, ingest the files, i.e. convert them into parquet files on your local machine:

cbiohub data ingest ~/git/datahub/public/

By default, all data is stored in ~/cbiohub/. Combine the data from all studies into a single combined study:

cbiohub data combine
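To illustrate the idea behind ingesting and combining, here is a minimal sketch using only the Python standard library. It is not cbiohub's implementation: the helper names (read_tsv, combine_studies), the study IDs, and the toy rows are made up for illustration, and the real tool writes parquet files rather than Python lists.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text, skipping metadata lines that start with '#'."""
    data_lines = [line for line in text.splitlines() if not line.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(data_lines)), delimiter="\t"))


def combine_studies(tables):
    """Concatenate per-study rows, tagging each row with its study ID."""
    return [
        {"study_id": study_id, **row}
        for study_id, rows in tables.items()
        for row in rows
    ]


# Toy inputs, made up for illustration (hypothetical studies and samples).
study_a = "#Sample Identifier\tCancer Type\nSAMPLE_ID\tCANCER_TYPE\nA-01\tMelanoma\nA-02\tMelanoma\n"
study_b = "SAMPLE_ID\tCANCER_TYPE\nB-01\tThyroid Cancer\n"

combined = combine_studies({"study_a": read_tsv(study_a), "study_b": read_tsv(study_b)})
for row in combined:
    print(row)
```

Keeping the study ID as a column is what makes it possible to query across all studies at once and still trace each row back to its source.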

Step 3: Analyze

Now you can use the cbiohub package to analyze the data quickly. For example, you can load the combined study data into a pandas DataFrame:

import cbiohub

df = cbiohub.get_combined_df()

Or you can use the cbiohub CLI for quick analyses:

> cbiohub find BRAF V600E
✅ Variant found in 3595 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...
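Each hit in the output above has the form study_id:sample_id. If you want to post-process such output, a small sketch like the following counts hits per study. The function name summarize_hits is hypothetical and not part of cbiohub; the input lines are copied from the example output.

```python
from collections import Counter


def summarize_hits(lines):
    """Count samples per study from lines of the form 'study_id:sample_id'."""
    per_study = Counter()
    for line in lines:
        study_id, sep, sample_id = line.strip().partition(":")
        # Skip the summary line, the '...' marker, and anything else
        # that is not a plain 'study_id:sample_id' pair.
        if not sep or not sample_id or " " in study_id:
            continue
        per_study[study_id] += 1
    return per_study


output_lines = [
    "✅ Variant found in 3595 samples across 117 studies:",
    "kirp_tcga:TCGA-AL-3467-01",
    "kirp_tcga:TCGA-UZ-A9PP-01",
    "...",
]

hits = summarize_hits(output_lines)
print(dict(hits))
```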

Search for the same BRAF V600E variant, this time by genomic coordinates and a specific nucleotide change (A>T):

> cbiohub find 7 140453136 140453136 A T
✅ Variant found in 3571 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...
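The genomic search returns slightly fewer samples (3571 vs. 3595), presumably because more than one nucleotide change can produce the same protein change. The sketch below shows the difference between the two kinds of lookup on a few toy records. The rows are made up for illustration, the column names follow the MAF convention, and whether cbiohub's parquet files use exactly these names is an assumption.

```python
# Toy mutation records (illustrative values, not real query results).
rows = [
    {"sample": "s1", "Hugo_Symbol": "BRAF", "HGVSp_Short": "p.V600E",
     "Chromosome": "7", "Start_Position": 140453136, "End_Position": 140453136,
     "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"},
    # Same protein change, encoded by a different (dinucleotide) change.
    {"sample": "s2", "Hugo_Symbol": "BRAF", "HGVSp_Short": "p.V600E",
     "Chromosome": "7", "Start_Position": 140453135, "End_Position": 140453136,
     "Reference_Allele": "CA", "Tumor_Seq_Allele2": "TT"},
    {"sample": "s3", "Hugo_Symbol": "KRAS", "HGVSp_Short": "p.G12D",
     "Chromosome": "12", "Start_Position": 25398284, "End_Position": 25398284,
     "Reference_Allele": "C", "Tumor_Seq_Allele2": "T"},
]


def find_by_protein(rows, gene, change):
    """Return samples whose gene and protein change match, e.g. BRAF V600E."""
    return [r["sample"] for r in rows
            if r["Hugo_Symbol"] == gene and r["HGVSp_Short"] == "p." + change]


def find_by_genomic(rows, chrom, start, end, ref, alt):
    """Return samples matching an exact genomic change."""
    return [r["sample"] for r in rows
            if (r["Chromosome"], r["Start_Position"], r["End_Position"],
                r["Reference_Allele"], r["Tumor_Seq_Allele2"])
            == (chrom, start, end, ref, alt)]


protein_hits = find_by_protein(rows, "BRAF", "V600E")
genomic_hits = find_by_genomic(rows, "7", 140453136, 140453136, "A", "T")
print(protein_hits, genomic_hits)
```

In this toy data, the genomic lookup is a strict subset of the protein-level lookup, which mirrors the pattern in the counts above.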

Determine the variant frequency across different cancer types:

> cbiohub variant-frequency 7 140453136 140453136 A T
✅ Variant frequency per CANCER_TYPE:
CANCER_TYPE                              altered    total    freq
Thyroid Cancer                               770     1774    83.1
Melanoma                                     831     2902    64.5
Histiocytosis                                 42      160    55
Colorectal Cancer                            501     6479    31.5
...

Instead of displaying the results, you can print the underlying SQL with the --sql flag. Under the hood, cbiohub uses DuckDB to run its SQL queries, so you can pipe the output to duckdb to run a query directly (and edit it to your liking):

> cbiohub variant-frequency BRAF V600E  --sql | duckdb
┌───────────────────────────────────────┬─────────┬───────┬────────┐
│              CANCER_TYPE              │ altered │ total │  freq  │
│                varchar                │  int64  │ int64 │ double │
├───────────────────────────────────────┼─────────┼───────┼────────┤
│ Thyroid Cancer                        │     775 │  1774 │   43.7 │
...
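In the DuckDB output above, freq is consistent with altered as a percentage of total (775 / 1774 ≈ 43.7). A query of that general shape can be sketched as follows. This is not the SQL that cbiohub generates: the table and column names are hypothetical, the rows are toy data, and Python's built-in sqlite3 is used as a stand-in for DuckDB so the example runs without extra dependencies.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: one row per sample, with a flag for the variant.
con.execute("CREATE TABLE samples (sample_id TEXT, cancer_type TEXT, is_altered INTEGER)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("s1", "Thyroid Cancer", 1),
        ("s2", "Thyroid Cancer", 1),
        ("s3", "Thyroid Cancer", 0),
        ("s4", "Melanoma", 1),
        ("s5", "Melanoma", 0),
        ("s6", "Melanoma", 0),
        ("s7", "Melanoma", 0),
    ],
)

query = """
SELECT cancer_type,
       SUM(is_altered)                              AS altered,
       COUNT(*)                                     AS total,
       ROUND(100.0 * SUM(is_altered) / COUNT(*), 1) AS freq
FROM samples
GROUP BY cancer_type
ORDER BY freq DESC
"""

result = con.execute(query).fetchall()
for cancer_type, altered, total, freq in result:
    print(f"{cancer_type:<20} {altered:>7} {total:>7} {freq:>6}")
```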

After data ingestion, cbiohub mainly provides a convenient command-line interface for running SQL queries against a set of harmonized parquet files.

Clean

Remove all local parquet files.

cbiohub clean

Development

To set up the development environment, install the development dependencies:

poetry install

You can then run the CLI through Poetry, e.g.:

poetry run cbiohub data ingest ~/git/datahub/public/

and

poetry run cbiohub find BRAF V600E

You can also use IPython for interactive exploration:

poetry run ipython

TODO

Handle gene panels in the variant-frequency command
Add a GitHub Action for datahub that uses cbiohub to push the combined parquet data to Hugging Face (https://huggingface.co/datasets/cBioPortal/datahub)
