Acronyms for diseases #26

Closed
dhimmel opened this issue Sep 26, 2016 · 8 comments

@dhimmel
Member

dhimmel commented Sep 26, 2016

In another discussion @gwaygenomics shared acronyms for TCGA diseases as a text file (tcga_dictionary.txt). The contents are:

| tissue | acronym |
| --- | --- |
| adrenocortical cancer | ACC |
| bladder urothelial carcinoma | BLCA |
| breast invasive carcinoma | BRCA |
| cervical & endocervical cancer | CESC |
| cholangiocarcinoma | CHOL |
| colon adenocarcinoma | COAD |
| diffuse large B-cell lymphoma | DLBC |
| esophageal carcinoma | ESCA |
| glioblastoma multiforme | GBM |
| head & neck squamous cell carcinoma | HNSC |
| kidney chromophobe | KICH |
| kidney clear cell carcinoma | KIRC |
| kidney papillary cell carcinoma | KIRP |
| acute myeloid leukemia | LAML |
| brain lower grade glioma | LGG |
| liver hepatocellular carcinoma | LIHC |
| lung adenocarcinoma | LUAD |
| lung squamous cell carcinoma | LUSC |
| mesothelioma | MESO |
| ovarian serous cystadenocarcinoma | OV |
| pancreatic adenocarcinoma | PAAD |
| pheochromocytoma & paraganglioma | PCPG |
| prostate adenocarcinoma | PRAD |
| rectum adenocarcinoma | READ |
| sarcoma | SARC |
| skin cutaneous melanoma | SKCM |
| stomach adenocarcinoma | STAD |
| testicular germ cell tumor | TGCT |
| thyroid carcinoma | THCA |
| thymoma | THYM |
| uterine corpus endometrioid carcinoma | UCEC |
| uterine carcinosarcoma | UCS |
| uveal melanoma | UVM |

@dhimmel
Member Author

dhimmel commented Sep 26, 2016

My question is whether these acronyms are suitable for inclusion in automated workflows. For example, are they standardized across TCGA datasets? Furthermore, if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation?

Also some brainstorming on areas where the acronyms are more useful than the full names.
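
To make the "breaking dependency" question concrete, here is a minimal sketch of the two behaviors a workflow could take when a disease has no recorded abbreviation. The `abbreviate` function and its `strict` flag are hypothetical names used only for illustration, and the mapping is a small subset of the list above.

```python
# Subset of tcga_dictionary.txt as listed in this issue (illustration only).
ABBREVIATIONS = {
    "adrenocortical cancer": "ACC",
    "glioblastoma multiforme": "GBM",
    "uveal melanoma": "UVM",
}


def abbreviate(disease, strict=True):
    """Return the abbreviation for a disease name.

    strict=True raises on an unknown disease (the breaking dependency).
    strict=False returns None so the caller can fall back to the full name.
    """
    try:
        return ABBREVIATIONS[disease]
    except KeyError:
        if strict:
            raise KeyError(f"no abbreviation recorded for {disease!r}") from None
        return None
```

With `strict=True`, a newly added disease stops the pipeline until someone adds its abbreviation; with `strict=False`, the pipeline keeps running and the gap has to be handled downstream.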

@gwaybio
Member

gwaybio commented Sep 26, 2016

> are they standardized across TCGA datasets?

Yes, the acronyms are standardized, and more strictly than the disease names.

> if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation?

In almost every version of the clinical matrix I've seen, the disease is included with the acronym in two separate columns. This is definitely a concern with this iteration of the data, but considering the full data will be made public soon (late October, I think) we should be ok.

> Also some brainstorming on areas where the acronyms are more useful than the full names.

  1. In all disease-specific plots the acronym will be better (takes up less space!)
  2. In the "disease selector" screen, the user can select based on acronym. TCGA disease types also have designated colors for visualization purposes, and we should adhere to those (e.g., one of the original pan-cancer studies covered only 12 diseases, but I believe colors are picked for all 33).

@dhimmel
Member Author

dhimmel commented Sep 26, 2016

@gwaygenomics nice --- the standardization makes me more comfortable here. I see many benefits to the abbreviations. For example, covariates.tsv would be much nicer with these space-free and short names.
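
As a rough illustration of why the short names help in a file like covariates.tsv, the sketch below builds indicator-column names from full disease names and from acronyms. The `indicator_columns` helper and the `disease_` prefix are assumptions for this example, not the repository's actual naming.

```python
def indicator_columns(labels, prefix="disease_"):
    """Return one column name per distinct label, in sorted order."""
    return [f"{prefix}{label}" for label in sorted(set(labels))]


# Two diseases from the list above, written both ways.
full_names = ["head & neck squamous cell carcinoma", "glioblastoma multiforme"]
acronyms = ["HNSC", "GBM"]

long_columns = indicator_columns(full_names)
short_columns = indicator_columns(acronyms)
```

The acronym-based names contain no spaces or ampersands, so they are easier to use as column headers and in formulas.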

What about the TCGA Study Abbreviations page from the Genomic Data Commons? Should we use it instead of the TCGA Data Portal, which claims to be deprecated?

> TCGA disease-types also have designated colors for visualization purposes, we should also adhere to those

Agree. I think we will want a file called diseases.tsv in this repository with columns for name, abbreviation, and color.
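
A minimal sketch of how such a file might be read, assuming a tab-separated layout with the three columns named above. The `load_diseases` helper is hypothetical, and the color values are placeholders, not the designated TCGA colors, which this thread does not list.

```python
import csv
import io

# Stand-in for diseases.tsv. The colors are placeholders, NOT TCGA's colors.
DISEASES_TSV = (
    "name\tabbreviation\tcolor\n"
    "glioblastoma multiforme\tGBM\t#000000\n"
    "uveal melanoma\tUVM\t#000000\n"
)


def load_diseases(handle):
    """Map each abbreviation to its row (name, abbreviation, color)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["abbreviation"]: row for row in reader}


diseases = load_diseases(io.StringIO(DISEASES_TSV))
```

Keying the rows by abbreviation lets plotting code look up both the full name and the color from the short label it already carries.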

As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?

@dhimmel dhimmel added the task label Sep 26, 2016
@gwaybio
Member

gwaybio commented Sep 26, 2016

Yes, let's use the GDC.

> As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?

Doesn't seem to be consistent anywhere I look. I've seen disease, tissue, cohort, acronym, and now abbreviation. I do not have a preference!

@dhimmel
Member Author

dhimmel commented Sep 26, 2016

I compared the disease names in our diseases.tsv at 54140cf to the GDC listing. Since GDC seems to use sentence case whereas Xena Browser uses all lowercase, I converted GDC names to lowercase and looked for Xena diseases without a match. The following table shows my manual mapping of the diseases which didn't match:

| Xena Browser Disease Name | GDC Study Name | GDC Study Abbreviation |
| --- | --- | --- |
| adrenocortical cancer | Adrenocortical carcinoma | ACC |
| cervical & endocervical cancer | Cervical squamous cell carcinoma and endocervical adenocarcinoma | CESC |
| diffuse large B-cell lymphoma | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | DLBC |
| head & neck squamous cell carcinoma | Head and Neck squamous cell carcinoma | HNSC |
| kidney clear cell carcinoma | Kidney renal clear cell carcinoma | KIRC |
| kidney papillary cell carcinoma | Kidney renal papillary cell carcinoma | KIRP |
| pheochromocytoma & paraganglioma | Pheochromocytoma and Paraganglioma | PCPG |
| testicular germ cell tumor | Testicular Germ Cell Tumors | TGCT |
| uterine corpus endometrioid carcinoma | Uterine Corpus Endometrial Carcinoma | UCEC |
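
The comparison described above (lowercase the GDC names, then look for Xena names without a match) can be sketched as follows. This is not the code actually used; `unmatched` is a hypothetical helper, and the lists hold only a handful of names for illustration.

```python
# A few Xena Browser names (already lowercase) and GDC study names.
xena_names = [
    "adrenocortical cancer",
    "head & neck squamous cell carcinoma",
    "thymoma",
    "sarcoma",
]
gdc_names = [
    "Adrenocortical carcinoma",
    "Head and Neck squamous cell carcinoma",
    "Thymoma",
    "Sarcoma",
]


def unmatched(xena, gdc):
    """Return Xena names with no case-insensitive exact match in GDC."""
    gdc_lower = {name.lower() for name in gdc}
    return [name for name in xena if name.lower() not in gdc_lower]


missing = unmatched(xena_names, gdc_names)
```

Here `missing` holds the two names that differ in wording ("cancer" vs. "carcinoma", "&" vs. "and"). Names like these need a manual mapping, because case folding alone cannot reconcile them.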

Alerting @jingchunzhu and @maryjgoldman that the Xena disease names have diverged from the GDC names.

So I have a few thoughts/questions:

  • Should we use abbreviations rather than full names because they may be more standardized?
  • @jingchunzhu or @maryjgoldman, is there a Xena-provided mapping from disease to abbreviation?

@gwaygenomics, you've convinced me that these abbreviations are important enough that we should add them to our workflow. Hopefully, we can find a solution on the upstream/automated side, but I'm willing to settle for a manual solution as a fallback.

@maryjgoldman

Yes, we haven't started pulling data from the GDC yet (we're still using data from cgHub), so we haven't pulled in their names.

I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.

@dhimmel
Member Author

dhimmel commented Sep 26, 2016

> I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.

Hmm. The GDC mapping we found uses different disease names than Xena. @maryjgoldman, you're saying GDC would know how to resolve the differences?

I Googled for some of the disease names from Xena and their abbreviations. There were really only three hits. Conveniently, the two GitHub hits are from @gwaygenomics and @jingchunzhu:

@jingchunzhu do you have any comments on cancerGroupTitle in TCGAUtil.py and whether this would be the right mapping for us?

@gwaybio
Member

gwaybio commented Sep 27, 2016

@jingchunzhu - also (slightly unrelated), would you know where to find an updated code tables report for TCGA barcodes? The code tables report page here is deprecated.

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Sep 27, 2016
Manually created `download/diseases.tsv` from @gwaygenomics
`tcga_dictionary.tsv` file at https://git.io/vPvTb. See cognoma#26.

Added acronym column to `samples.tsv`. In `covariates.tsv` use acronym rather
than full disease name for more manageable column names.

Simplified parts of `4.covariates.ipynb`. Moved n_mutation computation for
samples to `2.TCGA-process.ipynb`.
dhimmel added a commit that referenced this issue Sep 29, 2016
Manually created `download/diseases.tsv` from @gwaygenomics
`tcga_dictionary.tsv` file at https://git.io/vPvTb. See #26.

Added acronym column to `samples.tsv`. In `covariates.tsv` use acronym rather
than full disease name for more manageable column names.

Simplified parts of `4.covariates.ipynb`. Moved n_mutation computation for
samples to `2.TCGA-process.ipynb`.
@dhimmel dhimmel closed this as completed Oct 14, 2016