Provenance Domain
Name: human_proteoform_glycosylation_sites_unicarbkb_glytoucan.csv
Title: Glycosylation Sites [UniCarbKB]
Created: 2018-07-09T11:34:02-05:00
Created by: Rahi Navelkar [[email protected]]
Modified: 2018-11-22T14:44:02-05:00
Modified by: Rahi Navelkar [[email protected]]
Digital Signature: RYFNNKE22594E007JKV457
Review status: Approved
Contribution: Matthew Campbell - [email protected] [contributedBy], Brian Fochtman - [email protected] [createdBy], Robel Kahsay - [email protected] [createdBy], Rahi Navelkar [curatedBy]
License: Data - Attribution 4.0 International CC BY 4.0 [https://creativecommons.org/licenses/by/4.0/]
Scripts - GNU General Public License v3.0 [https://www.gnu.org/licenses/gpl-3.0.en.html]
Readme - Attribution 4.0 International CC BY 4.0 [https://creativecommons.org/licenses/by/4.0/]
Usability Domain
List of human [taxid:9606] proteins with information on glycosylation sites from the UniCarbKB database [https://academic.oup.com/nar/article/42/D1/D215/1052197, https://doi.org/10.1093/nar/gkt1128].
The file also includes GlyTouCan accessions and UniCarbKB structure ids for the associated glycan structures.
Description Domain
Keywords: protein, canonical, glycosylation, glycan
Pipeline Steps:
Step 1: The input file was retrieved directly from the source and "cleaned" for further processing.
Step 2: The glycosylation type (linkage type) was assigned using a Python script [make-proteoform_glycosylation_sites_unicarbkb_glytoucan-csv-step2.py] based on the motif label provided by the author.
Step 3: The UniProtKB protein accessions were mapped to canonical accessions, and the final data was quality checked using a Python script [make-proteoform_glycosylation_sites_unicarbkb_glytoucan-csv-step3.py]. Records that met one or more of the following criteria were flagged and eliminated (eliminated records can be found in the log file listed under the Output Subdomain):
a. The protein accession is not included in the UniProt protein list [UniProt Nov-2017 Release].
b. The reported amino acid does not match the amino acid at the corresponding position in the FASTA sequence [UniProt Nov-2017 Release].
c. The id (UniCarbKB structure id) is not present in the input file.
d. The glycosylation type (linkage type) was not retrieved in step 2.
e. A serine or threonine is reported for an N-linked glycan structure.
f. An asparagine is reported for an O-linked glycan structure.
g. The glycosylation type (linkage type) has both "N-linked;O-linked" assigned in step 2.
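The flagging criteria a to g can be read as a set of independent checks on each record. The following is a minimal sketch under stated assumptions, not the step 3 script itself: the function name `flag_record`, the dictionary-based record layout, the three-letter to one-letter mapping, and 1-based site positions are all assumptions made for illustration. It is written for Python 3, whereas the pipeline lists Python 2.7.13.

```python
# Hypothetical sketch of criteria a-g; NOT the actual GlyGen step 3 script.
# Assumption: amino acids are given as three-letter codes (Asn, Ser, Thr).
THREE_TO_ONE = {"Asn": "N", "Ser": "S", "Thr": "T"}


def flag_record(record, canonical_seqs, known_structure_ids):
    """Return the letters (a-g) of every criterion the record meets.

    record              -- dict keyed by the output column headers
    canonical_seqs      -- dict: canonical accession -> protein sequence
    known_structure_ids -- set of UniCarbKB structure ids from the input file
    """
    flags = []

    seq = canonical_seqs.get(record["uniprotkb_canonical_ac"])
    if seq is None:
        # a. accession not in the protein list
        flags.append("a")
    else:
        # b. residue at the site (assumed 1-based) must match the sequence
        pos = int(record["glycosylation_site"])
        expected = THREE_TO_ONE.get(record["amino_acid"])
        if pos < 1 or pos > len(seq) or seq[pos - 1] != expected:
            flags.append("b")

    # c. structure id missing from the input file
    if record["uckb_id"] not in known_structure_ids:
        flags.append("c")

    # Split a value such as "N-linked;O-linked" into its individual types
    raw_type = record.get("glycosylation_type", "")
    types = set(t for t in raw_type.split(";") if t)

    if not types:
        flags.append("d")  # d. no linkage type assigned in step 2
    if types == {"N-linked"} and record["amino_acid"] in ("Ser", "Thr"):
        flags.append("e")  # e. Ser/Thr reported for an N-linked glycan
    if types == {"O-linked"} and record["amino_acid"] == "Asn":
        flags.append("f")  # f. Asn reported for an O-linked glycan
    if {"N-linked", "O-linked"} <= types:
        flags.append("g")  # g. both linkage types assigned
    return flags


# Made-up placeholder data for illustration only
canonical_seqs = {"P00000-1": "MNGSAT"}
known_ids = {"101", "102"}

good = {"uniprotkb_canonical_ac": "P00000-1", "glycosylation_site": "2",
        "uckb_id": "101", "amino_acid": "Asn",
        "glycosylation_type": "N-linked"}
bad_linkage = {"uniprotkb_canonical_ac": "P00000-1", "glycosylation_site": "4",
               "uckb_id": "102", "amino_acid": "Ser",
               "glycosylation_type": "N-linked"}
unknown = {"uniprotkb_canonical_ac": "Q00000-1", "glycosylation_site": "2",
           "uckb_id": "999", "amino_acid": "Asn",
           "glycosylation_type": "N-linked;O-linked"}

good_flags = flag_record(good, canonical_seqs, known_ids)
bad_linkage_flags = flag_record(bad_linkage, canonical_seqs, known_ids)
unknown_flags = flag_record(unknown, canonical_seqs, known_ids)
```

Returning every matching criterion, rather than stopping at the first, mirrors the "one or more" wording and would let a log entry state all the reasons a record was eliminated.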
Execution Domain
Script Access Type: Text
Scripts: make-proteoform_glycosylation_sites_unicarbkb_glytoucan-csv-step2.py, make-proteoform_glycosylation_sites_unicarbkb_glytoucan-csv-step3.py
Script Location: https://github.com/glygener/glygen-backend/blob/master/integration/
Script Driver: manual
Platform: CentOS7
Software Prerequisites:
Name: Python
Version: 2.7.13
I/O Domain
Input Subdomain:
name: unicarbkb_human_2018_10_31_02_22_23.clean.csv
mediatype: txt
source/uri: http://data.glygen.org/datasets/source/unicarbkb_human_2018_10_31_02_22_23.clean.csv
name: human_glytoucan_140918_2018_10_31_02_17_32.txt
mediatype: txt
source/uri: http://data.glygen.org/datasets/source/human_glytoucan_140918_2018_10_31_02_17_32.txt
name: human_protein_all.fasta
mediatype: fasta
source/uri: http://data.glygen.org/GLYDS00053
Output Subdomain:
name: human_glycosylation_types.csv
mediatype: txt
source/uri: http://data.glygen.org/datasets/source/human_glycosylation_types.csv
name: human_glycosylation_types.log
mediatype: txt
source/uri: http://data.glygen.org/datasets/logs/human_glycosylation_types.log
name: human_proteoform_glycosylation_sites_unicarbkb_glytoucan.log
mediatype: txt
source/uri: http://data.glygen.org/datasets/logs/human_proteoform_glycosylation_sites_unicarbkb_glytoucan.log
name: human_proteoform_glycosylation_sites_unicarbkb_glytoucan.csv
mediatype: csv
source/uri: http://data.glygen.org/GLYDS00040
Content:
Column Headers:
uniprotkb_canonical_ac: Accession assigned to the protein isoform chosen as the canonical sequence in the UniProtKB database
glycosylation_site: Site on the protein sequence where glycosylation is observed
evidence: NCBI PubMed Id (PMID) as evidence for the entry
uckb_id: UniCarbKB data identifier
glytoucan_acc: Unique accession assigned to the registered glycan structure in the GlyTouCan database
amino_acid: Three-letter code of the amino acid
glycosylation_type: Type of glycosylation [Linkage type]
Statistics [Unique Values]:
uniprotkb_acc_canonical: 58
glycosylation_site: 162
evidence: 127
uckb_id: 807
glytoucan_acc: 675
amino_acid: 3
glycosylation_type: 2
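The unique-value statistics above can be recomputed from the output CSV by counting the distinct values in each column. The following is a minimal sketch, assuming the file has a header row with the column names listed under Column Headers. The function name `unique_value_counts` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def unique_value_counts(handle):
    """Count the distinct values in each column of a CSV file-like object."""
    reader = csv.DictReader(handle)
    seen = {name: set() for name in (reader.fieldnames or [])}
    for row in reader:
        for name in seen:
            seen[name].add(row[name])
    return {name: len(values) for name, values in seen.items()}


# Made-up placeholder rows for illustration only (not real records)
SAMPLE_CSV = (
    "uniprotkb_canonical_ac,glycosylation_site,evidence,uckb_id,"
    "glytoucan_acc,amino_acid,glycosylation_type\n"
    "P00000-1,2,00000001,101,G00000AA,Asn,N-linked\n"
    "P00000-1,2,00000001,102,G00000BB,Asn,N-linked\n"
    "P00000-1,6,00000002,103,G00000CC,Thr,O-linked\n"
)

counts = unique_value_counts(io.StringIO(SAMPLE_CSV))
for column, n in counts.items():
    print("%s: %d" % (column, n))
```

To run it against the real output, pass an open file handle for human_proteoform_glycosylation_sites_unicarbkb_glytoucan.csv instead of the in-memory sample.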