The methods sections of the PST manuscript map to different repositories or to files located in this main repository. Files referenced in these notebooks are located in the DRYAD repository (datasets, supplementary data, or supplementary tables). The supplementary tables are also available with the manuscript itself.
The files should have the same names as those referenced in these notebooks. However, because the combined size of all datasets and files exceeds 170 GB, the files are grouped into separate tarballs in the DRYAD repository. The DRYAD README lists which tarball contains each file.
Be warned that some of these analyses can require up to 1 TB of memory if you reproduce them with the full datasets.
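
Because the files are spread across several tarballs, it can help to confirm which downloaded archive holds a given file before extracting anything. The DRYAD README remains the authoritative mapping; the snippet below is only a minimal sketch that assumes the tarballs have already been downloaded into one local directory and have names matching `*.tar*`. The function name `find_in_tarballs` and the directory layout are hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path


def find_in_tarballs(filename, tarball_dir):
    """Return the names of tarballs in `tarball_dir` that contain `filename`.

    Assumption: archives are local and named like *.tar or *.tar.gz.
    Only the base name of each archive member is compared.
    """
    hits = []
    for tar_path in sorted(Path(tarball_dir).glob("*.tar*")):
        # The default read mode detects compression (gz, bz2, xz) automatically.
        with tarfile.open(tar_path) as tar:
            for member in tar:
                if Path(member.name).name == filename:
                    hits.append(tar_path.name)
                    break  # one match per archive is enough
    return hits
```

Scanning large compressed archives is slow, so this is best used as a spot check rather than as a replacement for the README lookup.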
- ESM2 protein language model embeddings
- Modified Leave-One-Group-Out cross validation and hyperparameter tuning
  - Part of the specific implementation is also found here
- GenSLM open reading frame (ORF) and genome embeddings
- Hyena-DNA genome embeddings
- Tetranucleotide frequency vectors as simple genome embeddings
- Clustering genome and protein embeddings
- Genome and protein clustering evaluation
- Average amino acid identity (AAI)
  - Averaging AAI over each genome cluster is found here
- Average amino acid identity (AAI) genome clustering
- Protein functional annotation
- Protein attention scaling and analysis
- Protein annotation improvement
- Protein function co-clustering
- Protein functional module detection
- Capsid structure searches
- Graph-based host prediction framework
- Constructing the virus-host interaction network
  - Specific knowledge graphs can be found here
- Host prediction model evaluation
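
As an illustration of the simplest embedding in the list above, a tetranucleotide frequency vector can be sketched in a few lines. This is a generic version under stated assumptions, not necessarily the implementation used in the notebooks: it counts overlapping 4-mers on the given strand only (no reverse-complement merging), ignores 4-mers containing ambiguous bases, and normalizes the counts to sum to 1. The function name `tetranucleotide_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides in a fixed (lexicographic) order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return a 256-dimensional vector of normalized 4-mer frequencies.

    Assumptions: single strand only, overlapping windows, and 4-mers
    with non-ACGT characters (e.g. N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only count 4-mers made of unambiguous bases.
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]
```

For example, the sequence `ACGTACGT` contains five overlapping 4-mers, of which `ACGT` occurs twice, so its entry in the vector is 0.4.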