Update FAQ for Upcoming GDC Data Release #10894

Merged
merged 16 commits into from
Aug 1, 2024
45 changes: 33 additions & 12 deletions docs/user-guide/faq.md
@@ -36,9 +36,9 @@
* [The data today is different than the last time i looked. What happened?](/user-guide/faq.md#the-data-today-is-different-than-the-last-time-i-looked-what-happened)
* [How do I access data from AACR Project GENIE?](/user-guide/faq.md#how-do-i-access-data-from-aacr-project-genie)
* [TCGA](/user-guide/faq.md#tcga)
* [How does TCGA data in cBioPortal compare to TCGA data in Genome Data Commons?](/user-guide/faq.md#how-does-tcga-data-in-cbioportal-compare-to-tcga-data-in-genome-data-commons)
* [What are the TCGA studies sourced from the Genomic Data Commons (GDC)?](#what-are-the-tcga-studies-sourced-from-the-genomic-data-commons-gdc)
* [How do the different TCGA datasets compare?](#how-do-the-different-tcga-datasets-compare)
* [What happened to TCGA Provisional datasets?](/user-guide/faq.md#what-happened-to-tcga-provisional-datasets)
* [What are TCGA Firehose Legacy datasets and how do they compare to the publication-associated datasets and the PanCancer Atlas datasets?](/user-guide/faq.md#what-are-tcga-firehose-legacy-datasets-and-how-do-they-compare-to-the-publication-associated-datasets-and-the-pancancer-atlas-datasets)
* [Where do the thresholded copy number call in TCGA Firehose Legacy data come from?](/user-guide/faq.md#where-do-the-thresholded-copy-number-call-in-tcga-firehose-legacy-data-come-from)
* [Which studies have MutSig and GISTIC results? How do these results compare to the data in the TCGA publications?](/user-guide/faq.md#which-studies-have-mutsig-and-gistic-results-how-do-these-results-compare-to-the-data-in-the-tcga-publications)
* [How can I download the PanCancer Atlas data?](/user-guide/faq.md#how-can-i-download-the-pancancer-atlas-data)
@@ -49,7 +49,7 @@
* [How are protein domains in the mutational lollipop diagrams specified?](/user-guide/faq.md#how-are-protein-domains-in-the-mutational-lollipop-diagrams-specified)
* [What is the difference between a “splice site” mutation and a “splice region” mutation?](/user-guide/faq.md#what-is-the-difference-between-a-splice-site-mutation-and-a-splice-region-mutation)
* [What do “Amplification”, “Gain”, “Deep Deletion”, “Shallow Deletion” and "-2", "-1", "0", "1", and "2" mean in the copy-number data?](/user-guide/faq.md#what-do-amplification-gain-deep-deletion-shallow-deletion-and--2--1-0-1-and-2-mean-in-the-copy-number-data)
* [What is GISTIC? What is RAE?](/user-guide/faq.md#what-is-gistic-what-is-rae)
* [What is GISTIC? What is RAE? What is ASCAT?](/user-guide/faq.md#what-is-gistic-what-is-rae-what-is-ascat)
* [RNA](/user-guide/faq.md#rna)
* [Does the portal store raw or probe-level data?](/user-guide/faq.md#does-the-portal-store-raw-or-probe-level-data)
* [What are mRNA and microRNA Z-Scores?](/user-guide/faq.md#what-are-mrna-and-microrna-z-scores)
@@ -126,6 +126,8 @@ You can bookmark your query results and share the URL with collaborators. We sto
The cBioPortal is an exploratory analysis tool for large-scale cancer genomic data sets. It hosts data from large consortium efforts, like [TCGA](https://cancergenome.nih.gov/) and [TARGET](https://ocg.cancer.gov/programs/target), as well as publications from individual labs. You can quickly view genomic alterations across a set of patients or a set of cancer types, perform survival analysis, and run group comparisons. If you want to explore specific genes or a pathway of interest in one or more cancer types, the cBioPortal is probably where you want to start.

By contrast, the [Genomic Data Commons (GDC)](https://gdc.cancer.gov/) aims to be the definitive place for full-download and access to all data generated by TCGA and TARGET. If you want to download raw mRNA expression files or full segmented copy number files, the GDC is probably where you want to start.

As of August 2024, the public cBioPortal contains datasets sourced from the GDC through [ISB-CGC BigQuery](https://bq-search.isb-cgc.org/search?status=current). Currently, TCGA and CPTAC are supported, with more programs coming in the future. For an explanation of how these studies differ from their non-GDC counterparts, [see below](#how-do-the-different-tcga-datasets-compare).
#### Does the cBioPortal provide a Web Service API? R interface? MATLAB interface?
Yes, the cBioPortal provides a [Swagger API](https://www.cbioportal.org/api/swagger-ui.html), and [R/MATLAB interfaces](/web-API-and-Clients.md#r-client).
#### Can I use cBioPortal with my own data?
@@ -169,7 +171,7 @@ Check out the [Data Sets Page](https://www.cbioportal.org/datasets) for the comp
#### Which resources are integrated for variant annotation?
cBioPortal supports the annotation of variants from several different databases. These databases provide information about the recurrence of, or prior knowledge about, specific amino acid changes. For each variant, the number of occurrences of mutations at the same amino acid position present in the COSMIC database are reported. Furthermore, variants are annotated as “hotspots” if the amino acid positions were found to be recurrent linear hotspots, as defined by the Cancer Hotspots method ([cancerhotspots.org](https://www.cancerhotspots.org/)), or three-dimensional hotspots, as defined by 3D Hotspots ([3dhotspots.org](https://www.3dhotspots.org/)). Prior knowledge about variants, including clinical actionability information, is provided from three different sources: OncoKB ([www.oncokb.org](https://www.oncokb.org/)), CIViC ([civicdb.org](https://civicdb.org/)), as well as My Cancer Genome ([mycancergenome.org](https://www.mycancergenome.org/)). For OncoKB, exact levels of clinical actionability are displayed in cBioPortal, as defined by [the OncoKB paper](https://ascopubs.org/doi/full/10.1200/PO.17.00011).
#### What version of the human reference genome is being used in cBioPortal?
The [public cBioPortal](https://www.cbioportal.org) is currently using hg19/GRCh37.
The [public cBioPortal](https://www.cbioportal.org) largely uses hg19/GRCh37. However, there are studies that use the hg38/GRCh38 reference genome, including datasets sourced from the GDC through ISB-CGC BigQuery.
#### How does cBioPortal handle duplicate samples or sample IDs across different studies?
The cBioPortal generally assumes that samples or patients that have the same ID are actually the same. This is important for cross-cancer queries, where each sample should only be counted once. If a sample is part of multiple cancer cohorts, its alterations are only counted once in the Mutations tab (it will be listed multiple times in the table, but is only counted once in the lollipop plot). However, other tabs (including OncoPrint and Cancer Types Summary) will count the sample twice. For this reason, we advise against querying multiple studies that contain the same samples (e.g., TCGA PanCancer Atlas and TCGA Firehose Legacy).
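
To see why overlapping studies inflate counts, here is a minimal sketch in Python. It is not cBioPortal code: the function name, the study keys, and the sample IDs are all hypothetical, and it only illustrates the difference between counting each sample ID once and counting it once per study.

```python
def count_altered_samples(altered_by_study):
    """Compare de-duplicated and per-study counts of altered samples.

    altered_by_study is a hypothetical mapping from a study identifier to
    the sample IDs that carry an alteration in that study.
    Returns (unique_count, per_study_count).
    """
    unique_ids = set()
    per_study_count = 0
    for sample_ids in altered_by_study.values():
        ids_in_study = set(sample_ids)
        unique_ids.update(ids_in_study)       # same ID is treated as the same sample
        per_study_count += len(ids_in_study)  # same ID counted again for each study
    return len(unique_ids), per_study_count


# Two hypothetical studies that share samples S2 and S3.
overlapping = {
    "study_a": ["S1", "S2", "S3"],
    "study_b": ["S2", "S3", "S4"],
}
unique_count, per_study_count = count_altered_samples(overlapping)
print(unique_count, per_study_count)
```

With the two shared samples above, the de-duplicated count is 4 while the per-study count is 6, which is the kind of inflation that querying overlapping studies can produce.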
#### Are there any normal tissue samples available through cBioPortal?
@@ -186,16 +188,21 @@ If you need to reference an old version of a dataset, you can find previous vers
Data from AACR Project GENIE are provided in a [dedicated instance of cBioPortal](https://www.cbioportal.org/genie/). You can also download GENIE data from the [Synapse Platform](https://synapse.org/genie). Note that you will need to register before accessing the data. Additional information about AACR Project GENIE can be found on the [AACR website](https://www.aacr.org/Research/Research/Pages/aacr-project-genie.aspx).

### TCGA
#### How does TCGA data in cBioPortal compare to TCGA data in Genome Data Commons?
We do not currently load the mutation data from the GDC. Instead, we have the original mutation data generated by the individual TCGA sequencing centers. The source of the data is the Broad Firehose (or the publication pages for data that matches a specific manuscript). These data are usually a combination of two mutation callers, but they differ by center (typically a variant caller like MuTect plus an indel caller), and sequencing centers have modified their mutation calling pipelines over time.
#### What happened to TCGA Provisional datasets?
We renamed TCGA Provisional datasets to TCGA Firehose Legacy to better reflect that this data comes from a legacy processing pipeline. The exact same data is now available in TCGA Firehose Legacy studies.
#### What are TCGA Firehose Legacy datasets and how do they compare to the publication-associated datasets and the PanCancer Atlas datasets?
The Firehose Legacy dataset (formerly Provisional datasets) for each TCGA cancer type contains all data available from the Broad Firehose. The publication datasets reflect the data that were used for each of the publications. The samples in a published dataset are usually a subset of the firehose legacy dataset, since manuscripts were often written before TCGA completed their goal of sequencing 500 tumors.
#### What are the TCGA studies sourced from the Genomic Data Commons (GDC)?
The GDC TCGA studies mirror the data in the [Cancer Gateway in the Cloud (ISB-CGC)](https://bq-search.isb-cgc.org/search?status=current), which is hosted on Google BigQuery and in turn pulls its data from the GDC. Our [NCI-CRDC pipeline](https://github.com/cBioPortal/nci-crdc-pipeline) pulls data from ISB-CGC and transforms it into cBioPortal-formatted files. The resulting studies are intended to be a pure reflection of what is available in ISB-CGC; we do not augment them with data from our other TCGA studies. For more information on how ISB-CGC handles GDC data, see [Programs and Data Sets](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/Hosted-Data.html) and [GDC Overview](https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/data/GDC_top.html).

There can be differences between firehose legacy and published data. For example, the mutation data in the publication usually underwent more QC, and false positives might have been removed or, in rare cases, false negatives added. RNA-Seq and copy-number values may also differ slightly, as different versions of analysis pipelines could have been used. Additionally, due to additional curation during the publication process, the clinical data for the publication may be of higher quality or may contain a few more data elements, sometimes derived from the genomic data (e.g., genomic subtypes).
#### How do the different TCGA datasets compare?
The Firehose Legacy dataset (formerly Provisional datasets) for each TCGA cancer type contains all data available from the Broad Firehose. The publication datasets reflect the data that were used for each of the publications. The samples in a published dataset are usually a subset of the Firehose Legacy dataset, since manuscripts were often written before TCGA completed their goal of sequencing 500 tumors.

There can be differences between Firehose Legacy and published data. For example, the mutation data in the publication usually underwent more QC, and false positives might have been removed or, in rare cases, false negatives added. RNA-Seq and copy-number values may also differ slightly, as different versions of analysis pipelines could have been used. Additionally, due to additional curation during the publication process, the clinical data for the publication may be of higher quality or may contain a few more data elements, sometimes derived from the genomic data (e.g., genomic subtypes).

The TCGA PanCancer Atlas datasets derive from an effort to unify TCGA data across all tumor types. Publications resulting from this effort can be found at the [TCGA PanCancer Atlas site](https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html). In the cBioPortal, data from the PanCancer Atlas is divided by tumor type, but these studies have uniform clinical elements and consistent processing and normalization of mutation, copy number, and mRNA data, which makes them well suited to comparative analyses.

TCGA studies not sourced from GDC have the original mutation data generated by the individual TCGA sequencing centers. The source of the data is the Broad Firehose (or the publication pages for data that matches a specific manuscript). These data are usually a combination of two mutation callers, e.g. a variant caller like MuTect plus an indel caller. Note that the specific tools used and the overall process for identifying mutations can vary between centers and may have changed over time.

TCGA studies sourced from GDC use a newer version of the human reference genome, GRCh38 instead of GRCh37. For more information about the GDC data processing pipeline, see [GDC Data Processing](https://gdc.cancer.gov/about-data/gdc-data-processing). Transformations specific to our NCI-CRDC pipeline are documented in the [cBioPortal Datahub](https://github.com/cBioPortal/datahub).
#### What happened to TCGA Provisional datasets?
We renamed TCGA Provisional datasets to TCGA Firehose Legacy to better reflect that this data comes from a legacy processing pipeline. The exact same data is now available in TCGA Firehose Legacy studies.
#### Where do the thresholded copy number call in TCGA Firehose Legacy data come from?
Thresholded copy number calls in the TCGA Firehose Legacy datasets are generated by the GISTIC 2.0 algorithm and obtained from the Broad Firehose.
#### Which studies have MutSig and GISTIC results? How do these results compare to the data in the TCGA publications?
@@ -223,13 +230,27 @@ These levels are derived from copy-number analysis algorithms like GISTIC or RAE
* 2 or Amplification indicate a high-level amplification (more copies, often focal)

Note that these calls are putative. We consider the deep deletions and amplifications as biologically relevant for individual genes by default. Note that these calls are usually not manually reviewed, and due to differences in purity and ploidy between samples, there may be false positives and false negatives.
#### What is GISTIC? What is RAE?
#### What is GISTIC? What is RAE? What is ASCAT?
Copy number data sets within the portal are often generated by the [GISTIC](https://www.ncbi.nlm.nih.gov/sites/entrez?term=18077431) or [RAE](https://www.ncbi.nlm.nih.gov/sites/entrez?term=18784837) algorithms. Both algorithms attempt to identify significantly altered regions of amplification or deletion across sets of patients. Both algorithms also generate putative copy number calls for each gene in each patient, which are then loaded into the portal.

For TCGA studies, the table in all_thresholded.by_genes.txt (which is the part of the GISTIC output that is used to determine the copy-number status of each gene in each sample in cBioPortal) is obtained by applying both low- and high-level thresholds to the gene copy levels of all the samples. The entries with value +/- 2 exceed the high-level thresholds for amplifications/deep deletions, and those with +/- 1 exceed the low-level thresholds but not the high-level thresholds. The low-level thresholds are just the 'amp_thresh' and 'del_thresh' noise threshold input values to GISTIC (typically 0.1 or 0.3) and are the same for every sample.

By contrast, the high-level thresholds are calculated on a sample-by-sample basis and are based on the maximum (or minimum) median arm-level amplification (or deletion) copy number found in the sample. The idea, for deletions anyway, is that this level is a good approximation for hemizygous losses given the purity and ploidy of the sample. The actual cutoffs used for each sample can be found in a table in the output file sample_cutoffs.txt. All GISTIC output files for TCGA are available at: gdac.broadinstitute.org.
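
The two-tier thresholding described above can be sketched in Python as follows. This is an illustration only, not GISTIC's actual code: the function name, the argument names, and the convention that deletion thresholds are passed as negative values are assumptions made for this example, as is the requirement that each high-level cutoff is at least as extreme as the corresponding low-level threshold.

```python
def threshold_gene_copy_level(value, low_amp, low_del, high_amp, high_del):
    """Convert a continuous gene copy level into a call in {-2, -1, 0, 1, 2}.

    low_amp / low_del:   noise thresholds shared by all samples
                         (low_del is negative, e.g. -0.1).
    high_amp / high_del: cutoffs specific to one sample
                         (high_del is negative).
    """
    # Assumption of this sketch: high-level cutoffs are never less extreme
    # than the low-level thresholds.
    if high_amp < low_amp or high_del > low_del:
        raise ValueError("high-level cutoffs must be at least as extreme as low-level ones")
    if value > high_amp:
        return 2   # exceeds the sample's high-level amplification cutoff
    if value > low_amp:
        return 1   # exceeds only the low-level amplification threshold
    if value < high_del:
        return -2  # below the sample's high-level deletion cutoff
    if value < low_del:
        return -1  # below only the low-level deletion threshold
    return 0       # within the noise thresholds


# Hypothetical cutoffs for one sample.
calls = [threshold_gene_copy_level(v, 0.1, -0.1, 0.8, -0.9)
         for v in (0.05, 0.5, 1.2, -0.5, -1.3)]
print(calls)  # [0, 1, 2, -1, -2]
```

Because the high-level cutoffs are passed in per call, the same continuous value can yield different discrete calls in two samples with different purity and ploidy.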

[ASCAT (Allele-Specific Copy number Analysis of Tumors)](https://www.pnas.org/doi/full/10.1073/pnas.1009843107) is an algorithm designed to analyze allele-specific copy number variations (CNVs) in tumor DNA. Copy number data from the GDC analysis pipelines is provided in ASCAT format; more detail is available on the [GDC website](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/#ascat-pipelines). ASCAT data is not supported directly by cBioPortal; instead, it is converted to discrete GISTIC data using the following thresholds:

| ASCAT Value | GISTIC Value | Meaning |
|---|---|---|
| TCN = 0 | -2 | Deep loss |
| TCN = 1 | -1 | Single-copy loss |
| TCN = 2 | 0 | Diploid |
| 2 < TCN < 7 | 1 | Low-level gain |
| 7 ≤ TCN | 2 | Amplification |

where TCN is the total copy number from ASCAT.

The final conversion threshold (7 ≤ TCN) is somewhat flexible and can vary between studies depending on the data used. We chose 7 and applied it to all GDC studies because it gave the most consistent results between the GDC TCGA and PanCancer Atlas studies.
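
The conversion table above can be expressed as a small Python function. This is a sketch for illustration rather than the NCI-CRDC pipeline's actual code: the function and parameter names are hypothetical, and the amplification threshold is exposed as a parameter only to reflect that the cutoff of 7 can vary between studies.

```python
def ascat_tcn_to_gistic(tcn, amp_threshold=7):
    """Map an ASCAT total copy number (TCN) to a discrete GISTIC-style value.

    amp_threshold is the lowest TCN reported as an amplification (2);
    the default of 7 matches the table above.
    """
    if tcn < 0:
        raise ValueError("total copy number cannot be negative")
    if tcn == 0:
        return -2  # deep loss
    if tcn == 1:
        return -1  # single-copy loss
    if tcn == 2:
        return 0   # diploid
    if tcn < amp_threshold:
        return 1   # low-level gain
    return 2       # amplification


print([ascat_tcn_to_gistic(tcn) for tcn in (0, 1, 2, 3, 6, 7, 12)])
# [-2, -1, 0, 1, 1, 2, 2]
```

Passing a different `amp_threshold` changes only the boundary between low-level gain and amplification; the mappings for TCN values of 0, 1, and 2 are fixed.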

### RNA
#### Does the portal store raw or probe-level data?
No, the portal only contains gene-level data. Data for different isoforms of a given gene are merged. Raw and probe-level data for data sets are available via [NCBI GEO](https://www.ncbi.nlm.nih.gov/geo/), [dbGaP](https://www.ncbi.nlm.nih.gov/gap/) or through the [GDC](https://portal.gdc.cancer.gov/). See the cancer type description on the main query page or refer to the original publication for links to the raw data.