Acronyms for diseases #26
Comments
My questions are whether these acronyms are suitable for inclusion in automated workflows. For example, are they standardized across TCGA datasets? Furthermore, if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation? Also, some brainstorming on areas where the acronyms are more useful than the full names.
Yes, the acronyms are standardized - stricter than disease names
In almost every version of the clinical matrix I've seen, the disease is included with the acronym in two separate columns. This is definitely a concern with this iteration of the data, but considering the full data will be made public soon (late October, I think) we should be ok.
@gwaygenomics nice --- the standardization makes me more comfortable here. I see many benefits to the abbreviations. For example, What about the TCGA Study Abbreviations page from the Genomic Data Commons? Should we use it instead of the TCGA Data Portal, which claims to be deprecated?
Agree. I think we will want a file called As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?
Yes, let's use the GDC.
Doesn't seem to be consistent anywhere I look. I've seen
I compared the disease names in our
Alerting @jingchunzhu and @maryjgoldman that the Xena disease names have diverged from the GDC names. So I have a few thoughts/questions:
@gwaygenomics, you've convinced me that these abbreviations are important enough that we should add them to our workflow. Hopefully, we can find a solution on the upstream/automated side, but I'm willing to settle for a manual solution as a fallback.
Yes, we haven't started pulling data from the GDC yet (we're still using data from cgHub), so we haven't pulled in their names. I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.
Hmm. The GDC mapping we found uses different disease names than Xena. @maryjgoldman, you're saying GDC would know how to resolve the differences? I Googled for some of the disease names from Xena and their abbreviations. There were really only three hits. Conveniently, the two GitHub hits are from @gwaygenomics and @jingchunzhu:
@jingchunzhu do you have any comments on
@jingchunzhu - also (slightly unrelated), would you know where an updated
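One way to check how far the Xena and GDC disease names have diverged is to compare the two name lists after light normalization. The following is only a sketch: the function names are made up for illustration, the example names are placeholders rather than real Xena or GDC entries, and it assumes that case and whitespace differences should not count as a mismatch.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def compare_names(xena_names, gdc_names):
    """Return (names only in Xena, names only in GDC) after normalization."""
    xena = {normalize(n): n for n in xena_names}
    gdc = {normalize(n): n for n in gdc_names}
    only_xena = sorted(xena[k] for k in xena.keys() - gdc.keys())
    only_gdc = sorted(gdc[k] for k in gdc.keys() - xena.keys())
    return only_xena, only_gdc


# Placeholder names, not taken from either data source.
xena_example = ["Breast Invasive Carcinoma", "lung  adenocarcinoma", "Example disease A"]
gdc_example = ["breast invasive carcinoma", "Lung Adenocarcinoma", "Example Disease B"]

only_xena, only_gdc = compare_names(xena_example, gdc_example)
print("Only in Xena:", only_xena)
print("Only in GDC:", only_gdc)
```

Names left over on either side are the ones that would need a manual mapping or an upstream fix.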
Manually created `download/diseases.tsv` from @gwaygenomics `tcga_dictionary.tsv` file at https://git.io/vPvTb. See cognoma#26. Added acronym column to `samples.tsv`. In `covariates.tsv` use acronym rather than full disease name for more manageable column names. Simplified parts of `4.covariates.ipynb`. Moved n_mutation computation for samples to `2.TCGA-process.ipynb`.
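The steps in this commit message (map each sample's disease name to an acronym, then use the acronym for covariate column names) could look roughly like the sketch below. It is not the code from the notebooks: the column names `acronym`, `disease`, and `sample_id`, the sample IDs, and the `acronym_` column prefix are all assumptions, and it uses only the standard library rather than whatever the notebooks use. Failing loudly on an unmapped disease is one way to surface the "breaking dependency" concern raised earlier in the thread.

```python
import csv
import io

# Hypothetical excerpt of download/diseases.tsv (column names assumed).
DISEASES_TSV = (
    "acronym\tdisease\n"
    "BRCA\tbreast invasive carcinoma\n"
    "LUAD\tlung adenocarcinoma\n"
)

# Hypothetical sample rows; the sample IDs are placeholders.
SAMPLES_TSV = (
    "sample_id\tdisease\n"
    "TCGA-XX-0001-01\tbreast invasive carcinoma\n"
    "TCGA-XX-0002-01\tlung adenocarcinoma\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def add_acronyms(samples, diseases):
    """Attach an acronym to every sample; raise if a disease has no mapping."""
    lookup = {row["disease"]: row["acronym"] for row in diseases}
    unmapped = sorted({s["disease"] for s in samples} - set(lookup))
    if unmapped:
        raise KeyError(f"No acronym for: {unmapped}")
    return [{**s, "acronym": lookup[s["disease"]]} for s in samples]


def acronym_covariates(samples):
    """Build 0/1 indicator columns named after acronyms, keyed by sample ID."""
    acronyms = sorted({s["acronym"] for s in samples})
    return {
        s["sample_id"]: {f"acronym_{a}": int(s["acronym"] == a) for a in acronyms}
        for s in samples
    }


samples = add_acronyms(read_tsv(SAMPLES_TSV), read_tsv(DISEASES_TSV))
covariates = acronym_covariates(samples)
print(covariates)
```

Short column names such as `acronym_BRCA` are easier to handle than columns built from full disease names, which is the motivation the commit message gives.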
In another discussion @gwaygenomics shared acronyms for TCGA diseases as a text file (tcga_dictionary.txt). The contents are: