Acronyms for diseases #26

dhimmel · 2016-09-26T14:45:06Z

In another discussion @gwaygenomics shared acronyms for TCGA diseases as a text file (tcga_dictionary.txt). The contents are:

tissue	acronym
adrenocortical cancer	ACC
bladder urothelial carcinoma	BLCA
breast invasive carcinoma	BRCA
cervical & endocervical cancer	CESC
cholangiocarcinoma	CHOL
colon adenocarcinoma	COAD
diffuse large B-cell lymphoma	DLBC
esophageal carcinoma	ESCA
glioblastoma multiforme	GBM
head & neck squamous cell carcinoma	HNSC
kidney chromophobe	KICH
kidney clear cell carcinoma	KIRC
kidney papillary cell carcinoma	KIRP
acute myeloid leukemia	LAML
brain lower grade glioma	LGG
liver hepatocellular carcinoma	LIHC
lung adenocarcinoma	LUAD
lung squamous cell carcinoma	LUSC
mesothelioma	MESO
ovarian serous cystadenocarcinoma	OV
pancreatic adenocarcinoma	PAAD
pheochromocytoma & paraganglioma	PCPG
prostate adenocarcinoma	PRAD
rectum adenocarcinoma	READ
sarcoma	SARC
skin cutaneous melanoma	SKCM
stomach adenocarcinoma	STAD
testicular germ cell tumor	TGCT
thyroid carcinoma	THCA
thymoma	THYM
uterine corpus endometrioid carcinoma	UCEC
uterine carcinosarcoma	UCS
uveal melanoma	UVM

dhimmel · 2016-09-26T14:47:15Z

My question is whether these acronyms are suitable for inclusion in automated workflows. For example, are they standardized across TCGA datasets? Furthermore, if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation?

Also some brainstorming on areas where the acronyms are more useful than the full names.

gwaybio · 2016-09-26T14:57:55Z

> are they standardized across TCGA datasets?

Yes, the acronyms are standardized, and more strictly than the disease names.

> if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation?

In almost every version of the clinical matrix I've seen, the disease is included with the acronym in two separate columns. This is definitely a concern with this iteration of the data, but considering the full data will be made public soon (late October, I think) we should be ok.

Also some brainstorming on areas where the acronyms are more useful than the full names.

In all disease-specific plots the acronym will be better (takes up less space!)
In the "disease selector" screen, the user can select based on acronym. TCGA disease-types also have designated colors for visualization purposes, we should also adhere to those (E.g. in one of the original pan cancer studies - only 12 diseases, but I believe colors are picked for all 33)

dhimmel · 2016-09-26T15:24:50Z

@gwaygenomics nice --- the standardization makes me more comfortable here. I see many benefits to the abbreviations. For example, covariates.tsv would be much nicer with these space-free and short names.
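To illustrate why space-free acronyms make for tidier covariate columns, here is a minimal sketch. It assumes the disease covariates are encoded as one indicator column per disease, which this thread does not confirm; the sample IDs and the `one_hot_by_acronym` helper are hypothetical.

```python
def one_hot_by_acronym(sample_acronyms):
    """Build one 0/1 indicator column per acronym (assumed encoding)."""
    # Sorted acronyms give a stable, space-free set of column names.
    columns = sorted(set(sample_acronyms.values()))
    encoded = {
        sample: {col: int(col == acronym) for col in columns}
        for sample, acronym in sample_acronyms.items()
    }
    return columns, encoded


# Hypothetical sample IDs mapped to acronyms from the table above.
sample_acronyms = {"S1": "THYM", "S2": "ACC", "S3": "THYM"}
columns, encoded = one_hot_by_acronym(sample_acronyms)
print(columns)
print(encoded["S1"])
```

With full names, the same columns would be headed "adrenocortical cancer" and "thymoma", which are longer and contain spaces.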

What about the TCGA Study Abbreviations page from the Genomic Data Commons? Should we use it instead of the TCGA Data Portal, which claims to be deprecated?

> TCGA disease-types also have designated colors for visualization purposes, we should also adhere to those

Agree. I think we will want a file called diseases.tsv in this repository with columns for name, abbreviation, and color.

As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?

gwaybio · 2016-09-26T15:30:17Z

Yes, let's use the GDC.

> As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?

Doesn't seem to be consistent anywhere I look. I've seen disease, tissue, cohort, acronym, and now abbreviation. I do not have a preference!

dhimmel · 2016-09-26T18:51:48Z

I compared the disease names in our diseases.tsv at 54140cf to the GDC listing. Since GDC seems to use sentence case whereas Xena Browser uses all lowercase, I converted GDC names to lowercase and looked for Xena diseases without a match. The following table shows my manual mapping of the diseases which didn't match:

Xena Browser Disease Name	GDC Study Name	GDC Study Abbreviation
adrenocortical cancer	Adrenocortical carcinoma	ACC
cervical & endocervical cancer	Cervical squamous cell carcinoma and endocervical adenocarcinoma	CESC
diffuse large B-cell lymphoma	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma	DLBC
head & neck squamous cell carcinoma	Head and Neck squamous cell carcinoma	HNSC
kidney clear cell carcinoma	Kidney renal clear cell carcinoma	KIRC
kidney papillary cell carcinoma	Kidney renal papillary cell carcinoma	KIRP
pheochromocytoma & paraganglioma	Pheochromocytoma and Paraganglioma	PCPG
testicular germ cell tumor	Testicular Germ Cell Tumors	TGCT
uterine corpus endometrioid carcinoma	Uterine Corpus Endometrial Carcinoma	UCEC
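The comparison described above (lowercase the GDC names, then look for Xena names without an exact match) could be sketched roughly as follows. This is an illustration, not the code that was actually run: the inputs are small hand-typed excerpts, the casing of the GDC names for BLCA and THYM is assumed, and `unmatched_xena_names` is a hypothetical helper.

```python
# Hand-typed excerpts; the real inputs would be diseases.tsv and the GDC listing.
xena_names = [
    "adrenocortical cancer",
    "bladder urothelial carcinoma",
    "kidney clear cell carcinoma",
    "thymoma",
]
gdc_studies = {
    "Adrenocortical carcinoma": "ACC",
    "Bladder Urothelial Carcinoma": "BLCA",  # casing assumed
    "Kidney renal clear cell carcinoma": "KIRC",
    "Thymoma": "THYM",  # casing assumed
}


def unmatched_xena_names(xena_names, gdc_studies):
    """Return Xena names with no case-insensitive exact match among GDC names."""
    gdc_lower = {name.lower() for name in gdc_studies}
    return [name for name in xena_names if name.lower() not in gdc_lower]


unmatched = unmatched_xena_names(xena_names, gdc_studies)
print(unmatched)
```

Only the names that fail this exact match need a manual mapping like the table above.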

Alerting @jingchunzhu and @maryjgoldman that the Xena disease names have diverged from the GDC names.

So I have a few thoughts/questions:

- Should we use abbreviations rather than full names because they may be more standardized?
- @jingchunzhu or @maryjgoldman, is there a Xena-provided mapping from disease to abbreviation?

@gwaygenomics, you've convinced me that these abbreviations are important enough that we should add them to our workflow. Hopefully, we can find a solution on the upstream/automated side, but I'm willing to settle for a manual solution as a fallback.

maryjgoldman · 2016-09-26T19:29:25Z

Yes, we haven't started pulling data from the GDC yet (we're still using data from cgHub), so we haven't pulled in their names.

I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.

dhimmel · 2016-09-26T21:00:57Z

> I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.

Hmm. The GDC mapping we found uses different disease names than Xena. @maryjgoldman, you're saying GDC would know how to resolve the differences?

I Googled for some of the disease names from Xena and their abbreviations. There were really only three hits. Conveniently, the two GitHub hits are from @gwaygenomics and @jingchunzhu:

- from @gwaygenomics there's tcga_dictionary.tsv
- from @jingchunzhu there's TCGAUtil.py. It looks like the cancerGroupTitle dictionary could include the mapping we need.

@jingchunzhu do you have any comments on cancerGroupTitle in TCGAUtil.py and whether this would be the right mapping for us?

gwaybio · 2016-09-27T13:10:17Z

@jingchunzhu - also (slightly unrelated), do you know where to find an updated code tables report for TCGA barcodes? The code tables report page here is deprecated.

Manually created `download/diseases.tsv` from @gwaygenomics `tcga_dictionary.tsv` file at https://git.io/vPvTb. See cognoma#26. Added acronym column to `samples.tsv`. In `covariates.tsv` use acronym rather than full disease name for more manageable column names. Simplified parts of `4.covariates.ipynb`. Moved n_mutation computation for samples to `2.TCGA-process.ipynb`.
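As a rough sketch of the "added acronym column" step, the join below looks up each sample's disease in a manually maintained table and fails loudly when a disease has no acronym, which is the breaking dependency raised earlier in the thread. The column names (`sample_id`, `disease`, `acronym`), the miniature TSV contents, and the helper functions are assumptions for illustration, not the repository's actual code.

```python
import csv
import io

# Hypothetical miniature stand-ins for download/diseases.tsv and samples.tsv.
DISEASES_TSV = "disease\tacronym\nadrenocortical cancer\tACC\nthymoma\tTHYM\n"
SAMPLES_TSV = "sample_id\tdisease\nS1\tthymoma\nS2\tadrenocortical cancer\n"


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def add_acronyms(samples, diseases):
    """Return copies of the sample rows with an added 'acronym' field."""
    lookup = {row["disease"]: row["acronym"] for row in diseases}
    annotated = []
    for sample in samples:
        disease = sample["disease"]
        if disease not in lookup:
            # A newly added disease must be entered in the table by hand.
            raise KeyError(f"no acronym recorded for disease: {disease!r}")
        annotated.append({**sample, "acronym": lookup[disease]})
    return annotated


diseases = read_tsv(DISEASES_TSV)
samples = read_tsv(SAMPLES_TSV)
annotated = add_acronyms(samples, diseases)
print(annotated)
```

Raising on an unknown disease, rather than leaving the acronym blank, makes a missing manual entry visible as soon as the upstream data changes.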

dhimmel mentioned this issue Sep 26, 2016: Simplify API docs for a minimalist design cognoma/core-service#24 (merged)

dhimmel added the task label Sep 26, 2016

This was referenced Sep 27, 2016:
Add disease acronyms and update covariates.tsv #27 (merged)
Color coding TCGA disease/cancer types cBioPortal/cbioportal#1728 (closed)

dhimmel closed this as completed Oct 14, 2016
