
Feature request: support chemical prediction and generation models with string representations #49

Open
linjing-lab opened this issue Nov 23, 2024 · 2 comments

Comments

@linjing-lab

I noticed tracel-ai through the Burn framework; this software is well suited to high-performance prediction workloads, such as robotics or prediction over a data lake. Several pretrained molecular models use RoBERTa as their base model, such as ChemBERTa, ChemBERTa-2, MFBERT, SELFormer, and Semi-RoBERTa. Several pretrained protein models likewise build on RoBERTa, such as ESM-1b, ESM-2, PromptProtein, and KeAP. These are encoder-only tasks that are compatible with tracel-ai's models from an inference-performance perspective, so I recommend the repository provide Burn-based string-sequence examples for molecules, proteins, genomics, and multi-modal datasets.

This repository has a CRAFT model that might be used in structure-based tasks, but its design is not clear enough for realistic applications such as MolCRAFT, which works in a continuous parameter space for drug design. Chemical compatibility is constrained mainly by how faithfully the string characters are interpreted, not only by abstract model design. Abstract interpretation can always export new distributed operators that reflect machine memory and time; I think tracel-ai should feature more decoding tasks and seek low memory use when mapping strings into a continuous space. Multi-objective chemical prediction could then happen in a single pipeline, from interpretation through to a distributed streaming pattern.

@antimora
Collaborator

antimora commented Nov 25, 2024

@linjing-lab Thanks for the feedback. So that we understand your request, can you confirm that the following issue description is accurate? This is my interpretation of the issue you raised.


Title: Support for Chemical and Biological Sequence Models Utilizing String Representations

Description:

To enhance the repository's applicability in cheminformatics and bioinformatics, it is proposed to integrate models capable of processing chemical and biological sequences represented as strings. This includes handling molecular structures via SMILES (Simplified Molecular Input Line Entry System) and protein sequences through amino acid representations.
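For illustration, a minimal character-level SMILES tokenizer in Rust is sketched below. All names here are hypothetical, and real models such as ChemBERTa use learned subword (BPE) vocabularies rather than single characters, so this only shows the shape of the string-to-ids step that such models would need on the Rust side.

```rust
use std::collections::HashMap;

/// Hypothetical character-level SMILES tokenizer (illustration only;
/// production models like ChemBERTa use learned subword vocabularies).
struct SmilesTokenizer {
    vocab: HashMap<char, usize>,
    unk_id: usize,
}

impl SmilesTokenizer {
    /// Build a vocabulary over a small SMILES alphabet.
    fn new() -> Self {
        let alphabet = "CNOPSFIBrcl()[]=#+-123456789@/\\";
        let vocab: HashMap<char, usize> = alphabet
            .chars()
            .enumerate()
            .map(|(i, c)| (c, i))
            .collect();
        let unk_id = vocab.len(); // last id reserved for unknown characters
        Self { vocab, unk_id }
    }

    /// Map a SMILES string to a sequence of token ids, one per character.
    fn encode(&self, smiles: &str) -> Vec<usize> {
        smiles
            .chars()
            .map(|c| *self.vocab.get(&c).unwrap_or(&self.unk_id))
            .collect()
    }
}

fn main() {
    let tokenizer = SmilesTokenizer::new();
    // Aspirin in SMILES notation.
    let ids = tokenizer.encode("CC(=O)Oc1ccccc1C(=O)O");
    println!("{ids:?}");
}
```

Note that character-level splitting breaks two-letter atoms such as Br and Cl into separate tokens, which is one reason real SMILES tokenizers use regex- or vocabulary-based splitting instead.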

Proposed Enhancements:

  1. Incorporate Molecular Models:

    • Develop and include models similar to ChemBERTa, ChemBERTa-2, MFBERT, SELFormer, and Semi-RoBERTa, which are based on the RoBERTa architecture and designed for molecular data processing (a rough Burn-based sketch follows this list).
  2. Integrate Protein Sequence Models:

    • Add models akin to ESM-1b, ESM-2, PromptProtein, and KeAP, which utilize RoBERTa for protein sequence analysis.
  3. Enhance Existing Models:

    • Refine the current CRAFT model to improve its design clarity and functionality, enabling support for continuous parameter spaces in drug design, similar to MolCRAFT.
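To make the request concrete, here is a rough sketch of how an encoder-only sequence model over tokenized SMILES or amino acid strings might be assembled from Burn's building blocks. This is not an existing model in this repository, and Burn's config and `init` signatures change between versions, so treat the exact API calls (`EmbeddingConfig`, `TransformerEncoderConfig`, `TransformerEncoderInput`) as assumptions based on recent releases.

```rust
use burn::module::Module;
use burn::nn::transformer::{TransformerEncoder, TransformerEncoderConfig, TransformerEncoderInput};
use burn::nn::{Embedding, EmbeddingConfig};
use burn::tensor::{backend::Backend, Int, Tensor};

/// Hypothetical encoder-only model for tokenized SMILES or amino acid
/// sequences (positional embeddings omitted for brevity).
#[derive(Module, Debug)]
pub struct SequenceEncoder<B: Backend> {
    token_embedding: Embedding<B>,
    encoder: TransformerEncoder<B>,
}

impl<B: Backend> SequenceEncoder<B> {
    /// Assumed config API; exact signatures depend on the Burn version.
    pub fn new(vocab_size: usize, device: &B::Device) -> Self {
        let d_model = 256;
        Self {
            token_embedding: EmbeddingConfig::new(vocab_size, d_model).init(device),
            // Arguments: d_model, d_ff, n_heads, n_layers.
            encoder: TransformerEncoderConfig::new(d_model, 1024, 8, 6).init(device),
        }
    }

    /// [batch, seq_len] token ids -> [batch, seq_len, d_model] features.
    pub fn forward(&self, tokens: Tensor<B, 2, Int>) -> Tensor<B, 3> {
        let embedded = self.token_embedding.forward(tokens);
        self.encoder.forward(TransformerEncoderInput::new(embedded))
    }
}
```

Pretrained RoBERTa-style weights (ChemBERTa, ESM-2, and the like) would additionally need a weight-conversion path into Burn's record format, which is probably the larger part of the work.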

Objective:

These enhancements aim to broaden the repository's utility in fields such as drug discovery and genomics by providing high-performance models built with the Burn framework, capable of efficient inference on molecular and protein sequence data.

@linjing-lab
Author

From a system-execution standpoint, unified pipelines connect the database to its concrete analysis and align downstream tasks through checkpoints. A model compiled for a specific usage can then serve as a new end-to-end checkpoint over simple daily distributions. The enhancements aim to produce interpretable models through capable training and to deploy that clarity and functionality efficiently at inference. Statistical collection always reveals latent contributions in the datasets, which helps align data citation and mapping rules, and selective models built from them can control datasets as they grow. I think Rust pretrained models need to be applied at scale to system sequences; searching across real-world topics enables self-sustaining applications under continuously abstractive rules.
