Access to the full mutational data can be obtained here: https://drive.google.com/open?id=0B1tQDSL9FmNLTmo1dl9SRF9USUE
Note: the file is ~1.8 GB in size. md5sum: 94a3538afdbfbc8b52c75914c60c9f78
The first 34 columns are in standard MAF format and are described here:
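To check a download against the checksum above, something like the following Python sketch may help. It reads the file in chunks so the ~1.8 GB download does not have to fit in memory. The function names and the chunk size are illustrative choices, not part of this repository.

```python
import hashlib

# Checksum quoted in this file for the full mutational data download.
EXPECTED_MD5 = "94a3538afdbfbc8b52c75914c60c9f78"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling verify_download() with the local path of the downloaded file should return True for an intact copy.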
https://wiki.nci.nih.gov/x/eJaPAQ
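Because the file is large, it can be convenient to stream it row by row rather than load it whole. The sketch below assumes the download is an uncompressed, tab-delimited text file whose first non-comment line is a header row of column names; neither assumption is confirmed here, so adjust as needed. The function name iter_maf_rows is hypothetical.

```python
import csv


def iter_maf_rows(path):
    """Yield one dict per variant row, keyed by the header's column names.

    Assumes a tab-delimited text file with a header row; lines starting
    with '#' (such as a version line) are skipped.
    """
    with open(path, newline="") as handle:
        data_lines = (line for line in handle if not line.startswith("#"))
        # QUOTE_NONE: treat quote characters as ordinary data.
        reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

All values come back as strings, so numeric columns such as t_depth need an explicit int() conversion before arithmetic.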
Subsequent columns include:
35. HGVSc - the coding sequence of the variant in HGVS recommended format
36. HGVSp - the protein sequence of the variant in HGVS recommended format
37. HGVSp_Short - same as HGVSp, but using 1-letter amino-acid codes
38. Transcript_ID - transcript onto which the consequence of the variant has been mapped
39. Exon_Number - the exon number (out of total number)
40. t_depth - read depth across this locus in tumor BAM
41. t_ref_count - read depth supporting the reference allele in tumor BAM
42. t_alt_count - read depth supporting the variant allele in tumor BAM
43. n_depth - read depth across this locus in normal BAM
44. n_ref_count - read depth supporting the reference allele in normal BAM
45. n_alt_count - read depth supporting the variant allele in normal BAM
The next column is relevant to analyses that consider the effect of the variant on all alternate
isoforms of the gene, or on non-coding/regulatory transcripts. The effects are sorted first by
transcript biotype priority, then by effect severity, and finally by decreasing order of transcript
length. Each effect in the list is in the format [SYMBOL,Consequence,HGVSp,Transcript_ID,RefSeq].
46. all_effects - a semicolon delimited list of all possible variant effects, sorted by priority
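A minimal Python sketch of splitting an all_effects value into its individual effects is shown below, assuming each entry has exactly the five comma-separated fields listed above. Entries with any other field count are returned untouched rather than guessed at. The Effect type and the function name are hypothetical, and the sample values in the usage lines are made-up placeholders.

```python
from typing import List, NamedTuple, Tuple


class Effect(NamedTuple):
    symbol: str
    consequence: str
    hgvsp: str
    transcript_id: str
    refseq: str


def parse_all_effects(value: str) -> Tuple[List[Effect], List[str]]:
    """Split an all_effects string into Effect records, keeping file order.

    Returns (effects, unparsed), where unparsed holds any entries that do
    not have exactly five comma-separated fields.
    """
    effects, unparsed = [], []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon or an empty cell
        fields = entry.split(",")
        if len(fields) == 5:
            effects.append(Effect(*fields))
        else:
            unparsed.append(entry)
    return effects, unparsed


# Placeholder input, not real data.
effects, unparsed = parse_all_effects("GENE1,missense_variant,p.A1B,TX1,RS1;GENE1,intron_variant,,TX2,;")
top_priority = effects[0]  # first entry is the highest-priority effect
```

Since the list is already sorted by priority, the first parsed entry corresponds to the highest-priority effect.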
All remaining columns are straight out of Ensembl's VEP annotator, as described here:
http://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#output
47. Allele - the variant allele used to calculate the consequence
48. Gene - stable Ensembl ID of affected gene
49. Feature - stable Ensembl ID of feature
50. Feature_type - type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature
51. Consequence - consequence type of this variation
52. cDNA_position - relative position of base pair in cDNA sequence
53. CDS_position - relative position of base pair in coding sequence
54. Protein_position - relative position of amino acid in protein
55. Amino_acids - only given if the variation affects the protein-coding sequence
56. Codons - the alternative codons with the variant base in upper case
57. Existing_variation - known identifier of existing variation
58. ALLELE_NUM - allele number from input; 0 is reference, 1 is first alternate, etc.
59. DISTANCE - shortest distance from variant to transcript
60. STRAND - the DNA strand (1 or -1) on which the transcript/feature lies
61. SYMBOL - the gene symbol
62. SYMBOL_SOURCE - the source of the gene symbol
63. HGNC_ID - gene identifier from the HUGO Gene Nomenclature Committee
64. BIOTYPE - biotype of transcript
65. CANONICAL - a flag indicating if the transcript is denoted as the canonical transcript for this gene
66. CCDS - the CCDS identifier for this transcript, where applicable
67. ENSP - the Ensembl protein identifier of the affected transcript
68. SWISSPROT - UniProtKB/Swiss-Prot identifier of protein product
69. TREMBL - UniProtKB/TrEMBL identifier of protein product
70. UNIPARC - UniParc identifier of protein product
71. RefSeq - RefSeq identifier for this transcript
72. SIFT - the SIFT prediction and/or score, with both given as prediction (score)
73. PolyPhen - the PolyPhen prediction and/or score
74. EXON - the exon number (out of total number)
75. INTRON - the intron number (out of total number)
76. DOMAINS - the source and identifier of any overlapping protein domains
77. GMAF - Non-reference allele and frequency of existing variant in 1000 Genomes
78. AFR_MAF - Non-reference allele and frequency of existing variant in 1000 Genomes combined African population
79. AMR_MAF - Non-reference allele and frequency of existing variant in 1000 Genomes combined American population
80. ASN_MAF - Non-reference allele and frequency of existing variant in 1000 Genomes combined Asian population
81. EAS_MAF - Non-reference allele and frequency of existing variant in 1000 Genomes combined East Asian population
82. EUR_MAF - Non-reference allele and frequency of existing variant in 1000 Genomes combined European population
83. SAS_MAF - Non-reference allele and frequency of existing variant in 1000 Genomes combined South Asian population
84. AA_MAF - Non-reference allele and frequency of existing variant in NHLBI-ESP African American population
85. EA_MAF - Non-reference allele and frequency of existing variant in NHLBI-ESP European American population
86. CLIN_SIG - clinical significance of variant from dbSNP
87. SOMATIC - somatic status of existing variation(s)
88. TUMORTYPE - Sample tumor type
89. Is_Ref - TRUE/FALSE flag of whether Reference_Allele matches hg19 reference
90. Ref_Tri - Reference trinucleotide context of mutation
91. Amino_Acid_Length - Amino acid length of protein
92. Amino_Acid_Position - For protein-coding mutations, codon in which the mutation resides
93. ccf - Cancer cell fraction calculated from absCN-seq (See Methods)
94. Master_ID - Unique sample identifier: Tumor_Sample_Barcode, Source of sample, and cancer type delimited by '.' (period)
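As an illustration of the Master_ID layout, the Python sketch below splits a value into its three period-delimited parts. It assumes that the source and cancer type never contain a period themselves, so it splits from the right in case a Tumor_Sample_Barcode does; that assumption is not confirmed by this file. The function name and the sample value are hypothetical.

```python
def split_master_id(master_id):
    """Split Master_ID into (tumor_sample_barcode, source, cancer_type).

    Splits on the last two periods only, so periods inside the barcode
    (if any) are preserved.
    """
    parts = master_id.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected three '.'-delimited parts, got: {master_id!r}")
    barcode, source, cancer_type = parts
    return barcode, source, cancer_type


# Placeholder value, not a real sample identifier.
barcode, source, cancer_type = split_master_id("SAMPLE-01.SourceA.TypeX")
```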