Python package and CLI tool simplifying the analysis of data from cBioPortal

cBioPortal/cbiohubpy

cbiohub

WARNING ⚠️: This package is still under construction.

cbiohub is a Python package and CLI tool designed to simplify the analysis of data from cBioPortal, including data hosted on the cBioPortal Datahub. Unlike existing API clients, which retrieve slices of data through the REST API, cbiohub supports bulk analysis of harmonized datasets. By using combined parquet files instead of per-study CSV/TSV files, it enables faster data loading and querying.

cbiohub features:

  • A data module for ingesting and converting cBioPortal Datahub files into parquet format
  • An analysis module leveraging DuckDB for efficient local data exploration

Because parquet is widely supported, the files cbiohub produces can also be read from other programming languages and data warehousing tools.

For convenience, pre-combined parquet files for the Datahub are available from Hugging Face.

Usage

Analyze Local files

Step 1: Obtain data files

For example, you can download the cBioPortal Datahub files:

git clone git@github.com:cbioportal/datahub ~/git/datahub

Step 2: Ingest and Combine

Next, ingest the files, i.e. convert them into parquet files on your local machine:

cbiohub data ingest ~/git/datahub/public/

By default, all data is stored in ~/cbiohub/. Next, combine the data from all studies into a single combined study:

cbiohub data combine

Step 3: Analyze

Now you can use the cbiohub package to analyze the data quickly. For example, you can load the combined study data into a pandas DataFrame:

import cbiohub

df = cbiohub.get_combined_df()

Or you can use the cbiohub CLI to do quick analyses:

> cbiohub find BRAF V600E
✅ Variant found in 3595 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...

Search for the same BRAF V600E variant but with a specific genomic change (A>T):

> cbiohub find 7 140453136 140453136 A T
✅ Variant found in 3571 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...
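The coordinate search returns slightly fewer samples than the protein-level search, which would be expected if more than one nucleotide change can produce the same protein change. The sketch below illustrates that difference on made-up records; the `Mutation` fields and both helper functions are hypothetical and do not reflect cbiohub's internal schema.

```python
from collections import namedtuple

# Hypothetical record layout; real column names may differ.
Mutation = namedtuple(
    "Mutation", "sample gene protein_change chrom start end ref alt"
)

# Toy records, invented for illustration only.
MUTATIONS = [
    Mutation("study_a:S1", "BRAF", "V600E", "7", 140453136, 140453136, "A", "T"),
    Mutation("study_a:S2", "BRAF", "V600E", "7", 140453135, 140453136, "CA", "TT"),
    Mutation("study_b:S3", "BRAF", "V600E", "7", 140453136, 140453136, "A", "T"),
    Mutation("study_b:S4", "KRAS", "G12D", "12", 25398284, 25398284, "C", "T"),
]


def find_by_protein_change(mutations, gene, protein_change):
    """Return samples whose mutation matches at the protein level."""
    return [
        m.sample
        for m in mutations
        if m.gene == gene and m.protein_change == protein_change
    ]


def find_by_coordinates(mutations, chrom, start, end, ref, alt):
    """Return samples whose mutation matches the exact genomic change."""
    key = (chrom, start, end, ref, alt)
    return [
        m.sample
        for m in mutations
        if (m.chrom, m.start, m.end, m.ref, m.alt) == key
    ]


by_protein = find_by_protein_change(MUTATIONS, "BRAF", "V600E")
by_coords = find_by_coordinates(MUTATIONS, "7", 140453136, 140453136, "A", "T")

print(len(by_protein), "samples match at the protein level")
print(len(by_coords), "samples match the exact A>T change")
```

In this toy data the coordinate match is a strict subset of the protein-level match.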

Determine the variant frequency across different cancer types:

> cbiohub variant-frequency 7 140453136 140453136 A T
✅ Variant frequency per CANCER_TYPE:
CANCER_TYPE                              altered    total    freq
Thyroid Cancer                               770     1774    83.1
Melanoma                                     831     2902    64.5
Histiocytosis                                 42      160    55
Colorectal Cancer                            501     6479    31.5
...

Instead of displaying the results, you can print the underlying SQL with the --sql flag. Under the hood, cbiohub uses DuckDB to run its SQL queries, so you can pipe the output to duckdb to run a query directly (or edit it to your liking first):

> cbiohub variant-frequency BRAF V600E  --sql | duckdb
┌───────────────────────────────────────┬─────────┬───────┬────────┐
│              CANCER_TYPE              │ altered │ total │  freq  │
│                varchar                │  int64  │ int64 │ double │
├───────────────────────────────────────┼─────────┼───────┼────────┤
│ Thyroid Cancer                        │     775 │  1774 │   43.7 │
...

After data ingestion, cbiohub mainly provides a convenient command line interface for running SQL queries against a set of harmonized parquet files.

Clean

Remove all local parquet files.

cbiohub clean

Development

To set up the development environment, install the development dependencies:

poetry install

For example, you can run the CLI with:

poetry run cbiohub data ingest ~/git/datahub/public/

and

poetry run cbiohub find BRAF V600E

You can also use IPython for interactive exploration:

poetry run ipython

TODO
