This repository collects contributions related to the "Annotations on Structures" topic in the COVID-19 Biohackathon April 5-11 2020.
The context is SWISS-MODEL's involvement in an EU project to combat COVID-19. To accelerate our plan to map relevant annotations onto SARS-CoV-2 protein structures, we collect tools and platforms that can automatically generate such annotations from the latest data.
We mainly hope to receive two types of contributions:
- Find/generate relevant sequence data (see issues list for inspirational ideas) to be displayed on structures (see section on SWISS-MODEL's annotation system). This should be scripted to enable automated fetching of the latest data.
- Write reusable scripts to map the sequence data onto the frame of reference of proteins (this might need translation from positions on the genome to positions on the proteins of SARS-CoV-2 as listed here). These scripts are expected to be useful for the scripts from the first point.
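As a rough illustration of the genome-to-protein translation mentioned above, the following minimal sketch converts a 1-based genome position into a 1-based residue number. The function name `genome_to_protein_pos` and the `cds_segments` parameter are hypothetical, and the coordinates in the usage lines are toy values, not real SARS-CoV-2 coordinates. It assumes a forward-strand coding region given as ordered, inclusive segments; two segments sharing a base (as in a ribosomal frameshift join) are resolved by taking the first segment that contains the position.

```python
def genome_to_protein_pos(genome_pos, cds_segments):
    """Map a 1-based genome position to a 1-based protein position.

    cds_segments: ordered list of (start, end) tuples, 1-based and inclusive,
    describing the coding region on the forward strand.
    Returns None if the position lies outside the coding region.
    """
    consumed = 0  # coding nucleotides covered by earlier segments
    for start, end in cds_segments:
        if start <= genome_pos <= end:
            cds_index = consumed + (genome_pos - start)  # 0-based index in the CDS
            return cds_index // 3 + 1
        consumed += end - start + 1
    return None


# Toy coordinates for illustration only.
print(genome_to_protein_pos(104, [(101, 130)]))
print(genome_to_protein_pos(5, [(1, 4), (4, 9)]))
```

Real coordinates would have to be taken from the genome record in use, and a stop codon at the end of the coding region is not treated specially here.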
Additional topics of interest:
- For visualization experts: alternative ways to visualize the protein structures.
- For RDF/JSON-LD experts: define an RDF ontology and map our JSON data (example) to RDF for use in other knowledge graph efforts. PDBj has made some efforts to map structures to RDF, but they focus on experimental metadata, whereas we consider structural coverage of the proteins more relevant. SIFTS mappings are probably a better starting point here. With a minimal "@context" section referring to UniProt, we might also be able to turn our existing JSON into valid JSON-LD.
- For protein modelling experts: custom modelling of proteins of interest (e.g. using careful expert-curated target-template alignments or combinations of templates).
- Programming languages used within SWISS-MODEL: Python (3.6), C++
- Dealing with protein structure and sequence data: OpenStructure
Please follow the biohackathon's code of conduct and this project's contribution guidelines.
NOTE: this is work-in-progress and subject to change.
The beta-server of SWISS-MODEL is used to allow users to upload annotations: https://beta.swissmodel.expasy.org/repository/covid_annotation_upload
Both the user annotations and the display of the viral polyprotein (R1AB_SARS2) are still work-in-progress and may have bugs. If you find problems with these prototype SWISS-MODEL features, please open an issue in this GitHub project and we will try to address it as soon as possible.
The annotation format is a plain-text format:
- One line per annotation
- Each annotation consists of 5 or 6 comma- or tab-separated values:
  - ID (UniProtKB AC or MD5 checksum of the sequence)
  - Start position (1-based)
  - End position
  - Color value
  - Reference (optional)
  - Annotation comment
- Example:
P0DTD1	3400	3450	#FF00FF	https://swissmodel.expasy.org/repository/	My Awesome Annotation
P0DTC2	230	330	#FFA500	A text reference	One more!
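To check annotation files before uploading them, the format described above could be parsed roughly as follows. This is a minimal sketch, not the parser used by SWISS-MODEL: the names `Annotation` and `parse_annotation_line` are hypothetical, a line is assumed to be tab-separated if it contains a tab and comma-separated otherwise, and the color value is not validated because the format only asks for a "color value".

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    seq_id: str               # UniProtKB AC or MD5 checksum of the sequence
    start: int                # 1-based start position
    end: int                  # end position
    color: str                # color value, kept as given
    reference: Optional[str]  # optional reference (None if absent)
    comment: str              # annotation comment


def parse_annotation_line(line):
    """Parse one annotation line with 5 or 6 fields."""
    line = line.rstrip("\r\n")
    # Assumption: tabs take precedence over commas as the separator.
    sep = "\t" if "\t" in line else ","
    fields = [field.strip() for field in line.split(sep)]
    if len(fields) == 5:
        seq_id, start, end, color, comment = fields
        reference = None
    elif len(fields) == 6:
        seq_id, start, end, color, reference, comment = fields
    else:
        raise ValueError("expected 5 or 6 fields, got %d" % len(fields))
    start_pos, end_pos = int(start), int(end)
    if start_pos < 1 or end_pos < start_pos:
        raise ValueError("invalid range %d-%d" % (start_pos, end_pos))
    return Annotation(seq_id, start_pos, end_pos, color, reference, comment)


example = "P0DTC2\t230\t330\t#FFA500\tA text reference\tOne more!"
print(parse_annotation_line(example))
```

Note that with comma-separated input, a comment or reference that itself contains a comma would be split into extra fields; tab-separated input avoids this.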
- UniProtKB ACs with links can be found in UniProtKB
- Our SARS-CoV-2 page shows mapping to mature proteins and the correspondence to RefSeq and GenBank.
- For cleaved proteins, use the parent protein. For instance, an annotation on nsp3 (Non-structural protein 3) must be reported on P0DTD1 (the "parent" protein) with an offset of 818 (as nsp3 starts at position 819 of P0DTD1).
- ViralZone has a well-described overview of the proteome here.
- We propose to ignore the shorter polyprotein (P0DTC1, R1A_SARS2) as it's cleaved into the same mature proteins as the longer one (P0DTD1, R1AB_SARS2) with the exception of a very short peptide (Non-structural protein 11 (nsp11), YP_009725312.1).
- Two proteins of unknown function (P0DTD2 and P0DTD3) are missing from our SARS-CoV-2 page but can safely be used to map annotations and we will provide structures if possible.
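The offset rule for cleaved proteins can be sketched as below, using the nsp3 example from the hints above (offset 818 on P0DTD1). The helper names `to_parent_position` and `make_annotation_line` are hypothetical, and the output is assumed to be tab-separated; offsets for other mature proteins would have to be looked up separately.

```python
NSP3_OFFSET = 818  # nsp3 starts at position 819 of P0DTD1


def to_parent_position(mature_pos, offset):
    """Convert a 1-based position on a mature protein to the parent protein."""
    if mature_pos < 1:
        raise ValueError("positions are 1-based")
    return mature_pos + offset


def make_annotation_line(parent_ac, start, end, offset, color, comment,
                         reference=None):
    """Build one tab-separated annotation line in parent-protein numbering."""
    fields = [
        parent_ac,
        str(to_parent_position(start, offset)),
        str(to_parent_position(end, offset)),
        color,
    ]
    if reference is not None:
        fields.append(reference)  # the reference field is optional
    fields.append(comment)
    return "\t".join(fields)


# Residues 10-20 of nsp3, reported on the parent polyprotein P0DTD1.
print(make_annotation_line("P0DTD1", 10, 20, NSP3_OFFSET, "#FF00FF", "demo"))
```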
We are also actively working on extending the structural coverage of the SARS-CoV-2 proteome by using protein structure predictions from colleagues participating in CASP.
Protein structure predictions of SARS-CoV-2 have already proven useful to several research projects. To list a few examples which used our models:
- A potential role for integrins in host cell entry by SARS-CoV-2, Antiviral Research
- Targeting Novel Coronavirus 2019: A Systematic Drug Repurposing Approach to Identify Promising Inhibitors Against 3C-like Proteinase and 2'-O-Ribose Methyltransferase
- Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding, The Lancet
- Insilico Medicine publishes molecular structures for the key protein target of 2019-nCoV
- Targeting 2019-nCoV: GHDDI Info Sharing Portal
Thanks goes to these wonderful people (emoji key):
- Gerardo Tauriello 📆
- Xavier Robin 🔧 📖
- bienchen 🔧
- Andrew W 🔧 🎨
- schdaude 🔧
- BarbaraTerlouw 🤔
- Vasilis J Promponas 🤔
- Ben Busby 🤔 🖋
- Laura Blum 🖋
This project follows the all-contributors specification. Contributions of any kind welcome!