Awesome Scientific Language Models


A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, materials science, biology, medicine, geoscience), covering different model sizes (from 100M to 100B parameters) and modalities (e.g., language, graph, vision, table, molecule, protein, genome, climate time series).

This repository accompanies our survey paper A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery and will be continuously updated.

NOTE 1: To avoid ambiguity, when we talk about the number of parameters in a model, "Base" refers to 110M (i.e., BERT-Base), and "Large" refers to 340M (i.e., BERT-Large). Other numbers will be written explicitly.

NOTE 2: In each subsection, papers are sorted chronologically. If a paper has a preprint (e.g., arXiv or bioRxiv) version, its publication date follows the preprint server; otherwise, it follows the conference proceedings or journal.

NOTE 3: Contributions are welcome. If you have a paper to suggest, feel free to reach out to [email protected] or submit a pull request. For format consistency, we will include a paper only once (1) it has a version with author names AND (2) its GitHub and/or Hugging Face links are available.

Contents

General

Language

Language + Graph

  • (SPECTER) SPECTER: Document-level Representation Learning using Citation-informed Transformers ACL 2020
    [Paper] [GitHub] [Model (Base)]

  • (OAG-BERT) OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services KDD 2022
    [Paper] [GitHub]

  • (ASPIRE) Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity NAACL 2022
    [Paper] [GitHub] [Model (Base)]

  • (SciNCL) Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings EMNLP 2022
    [Paper] [GitHub] [Model (Base)]

  • (SPECTER 2.0) SciRepEval: A Multi-Format Benchmark for Scientific Document Representations EMNLP 2023
    [Paper] [GitHub] [Model (113M)]

  • (SciPatton) Patton: Language Model Pretraining on Text-Rich Networks ACL 2023
    [Paper] [GitHub]

  • (SciMult) Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding EMNLP 2023 Findings
    [Paper] [GitHub] [Model (138M)]

Mathematics

Language

Language + Vision

  • (Inter-GPS) Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning ACL 2021
    [Paper] [GitHub]

  • (Geoformer) UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression EMNLP 2022
    [Paper] [GitHub]

  • (SCA-GPS) A Symbolic Character-Aware Model for Solving Geometry Problems ACM MM 2023
    [Paper] [GitHub]

  • (UniMath-Flan-T5) UniMath: A Foundational and Multimodal Mathematical Reasoner EMNLP 2023
    [Paper] [GitHub]

  • (G-LLaVA) G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

Other Modalities (Table)

  • (TAPAS) TAPAS: Weakly Supervised Table Parsing via Pre-training ACL 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (TaBERT) TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables ACL 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (GraPPa) GraPPa: Grammar-Augmented Pre-training for Table Semantic Parsing ICLR 2021
    [Paper] [GitHub] [Model (355M)]

  • (TUTA) TUTA: Tree-Based Transformers for Generally Structured Table Pre-training KDD 2021
    [Paper] [GitHub]

  • (RCI) Capturing Row and Column Semantics in Transformer Based Question Answering over Tables NAACL 2021
    [Paper] [GitHub] [Model (12M)]

  • (TABBIE) TABBIE: Pretrained Representations of Tabular Data NAACL 2021
    [Paper] [GitHub]

  • (TAPEX) TAPEX: Table Pre-training via Learning a Neural SQL Executor ICLR 2022
    [Paper] [GitHub] [Model (140M)] [Model (406M)]

  • (FORTAP) FORTAP: Using Formulas for Numerical-Reasoning-Aware Table Pretraining ACL 2022
    [Paper] [GitHub]

  • (OmniTab) OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-Based Question Answering NAACL 2022
    [Paper] [GitHub] [Model (406M)]

  • (ReasTAP) ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples EMNLP 2022
    [Paper] [GitHub] [Model (406M)]

  • (Table-GPT) Table-GPT: Table-tuned GPT for Diverse Table Tasks SIGMOD 2024
    [Paper]

  • (TableLlama) TableLlama: Towards Open Large Generalist Models for Tables NAACL 2024
    [Paper] [GitHub] [Model (7B)]

  • (TableLLM) TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios arXiv 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

Physics

Language

  • (astroBERT) Building astroBERT, a Language Model for Astronomy & Astrophysics arXiv 2021
    [Paper] [Model (Base)]

  • (AstroLLaMA) AstroLLaMA: Towards Specialized Foundation Models in Astronomy AACL 2023 Workshop
    [Paper] [Model (7B)]

  • (AstroLLaMA-Chat) AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets Research Notes of the AAS 2024
    [Paper] [Model (7B)]

  • (PhysBERT) PhysBERT: A Text Embedding Model for Physics Scientific Literature arXiv 2024
    [Paper] [Model (Base)]

  • (Astro-HEP-BERT) Astro-HEP-BERT: A Bidirectional Language Model for Studying the Meanings of Concepts in Astrophysics and High Energy Physics arXiv 2024
    [Paper] [Model (Base)]

Chemistry and Materials Science

Language

  • (ChemBERT) Automated Chemical Reaction Extraction from Scientific Literature Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub] [Model (Base)]

  • (MatSciBERT) MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction npj Computational Materials 2022
    [Paper] [GitHub] [Model (Base)]

  • (MatBERT) Quantifying the Advantage of Domain-Specific Pre-training on Named Entity Recognition Tasks in Materials Science Patterns 2022
    [Paper] [GitHub]

  • (BatteryBERT) BatteryBERT: A Pretrained Language Model for Battery Database Enhancement Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub] [Model (Base)]

  • (MaterialsBERT) A General-Purpose Material Property Data Extraction Pipeline from Large Polymer Corpora using Natural Language Processing npj Computational Materials 2023
    [Paper] [Model (Base)]

  • (Recycle-BERT) Recycle-BERT: Extracting Knowledge about Plastic Waste Recycling by Natural Language Processing ACS Sustainable Chemistry & Engineering 2023
    [Paper] [GitHub]

  • (CatBERTa) Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models ACS Catalysis 2023
    [Paper] [GitHub]

  • (LLM-Prop) LLM-Prop: Predicting Physical and Electronic Properties of Crystalline Solids from Their Text Descriptions arXiv 2023
    [Paper] [GitHub]

  • (ChemDFM) ChemDFM: Dialogue Foundation Model for Chemistry arXiv 2024
    [Paper] [GitHub] [Model (13B)]

  • (CrystalLLM) Fine-Tuned Language Models Generate Stable Inorganic Materials as Text ICLR 2024
    [Paper] [GitHub]

  • (ChemLLM) ChemLLM: A Chemical Large Language Model arXiv 2024
    [Paper] [Model (7B)]

  • (LlaSMol) LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset COLM 2024
    [Paper] [GitHub] [Model (6.7B, Galactica)] [Model (7B, LLaMA-2)] [Model (7B, Mistral)]

Language + Graph

  • (Text2Mol) Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries EMNLP 2021
    [Paper] [GitHub]

  • (KV-PLM) A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals Nature Communications 2022
    [Paper] [GitHub] [Model (Base)]

  • (MolT5) Translation between Molecules and Natural Language EMNLP 2022
    [Paper] [GitHub] [Model (60M)] [Model (220M)] [Model (770M)]

  • (MoMu) A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language arXiv 2022
    [Paper] [GitHub]

  • (MoleculeSTM) Multi-modal Molecule Structure-text Model for Text-Based Retrieval and Editing Nature Machine Intelligence 2023
    [Paper] [GitHub]

  • (Text+Chem T5) Unifying Molecular and Textual Representations via Multi-task Language Modelling ICML 2023
    [Paper] [GitHub] [Model (60M)] [Model (220M)]

  • (GIMLET) GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning NeurIPS 2023
    [Paper] [GitHub] [Model (60M)]

  • (MolFM) MolFM: A Multimodal Molecular Foundation Model arXiv 2023
    [Paper] [GitHub]

  • (MolCA) MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter EMNLP 2023
    [Paper] [GitHub]

  • (MolLM) MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations Bioinformatics 2024
    [Paper] [GitHub]

  • (InstructMol) InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery arXiv 2023
    [Paper] [GitHub]

  • (3D-MoLM) Towards 3D Molecule-Text Interpretation in Language Models ICLR 2024
    [Paper] [GitHub]

Language + Vision

  • (GIT-Mol) GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text Computers in Biology and Medicine 2024
    [Paper] [GitHub]

Other Modalities (Molecule)

  • (SMILES-BERT) SMILES-BERT: Large Scale Unsupervised Pre-training for Molecular Property Prediction ACM BCB 2019
    [Paper] [GitHub]

  • (MAT) Molecule Attention Transformer arXiv 2020
    [Paper] [GitHub]

  • (ChemBERTa) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction arXiv 2020
    [Paper] [GitHub] [Model (125M)]

  • (MolBERT) Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (rxnfp) Mapping the Space of Chemical Reactions using Attention-Based Neural Networks Nature Machine Intelligence 2021
    [Paper] [GitHub] [Model (Base)]

  • (RXNMapper) Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions Science Advances 2021
    [Paper] [GitHub]

  • (MoLFormer) Large-Scale Chemical Language Representations Capture Molecular Structure and Properties Nature Machine Intelligence 2022
    [Paper] [GitHub] [Model (47M)]

  • (Chemformer) Chemformer: A Pre-trained Transformer for Computational Chemistry Machine Learning: Science and Technology 2022
    [Paper] [GitHub] [Model (45M)] [Model (230M)]

  • (R-MAT) Relative Molecule Self-Attention Transformer Journal of Cheminformatics 2024
    [Paper] [GitHub]

  • (MolGPT) MolGPT: Molecular Generation using a Transformer-Decoder Model Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (T5Chem) Unified Deep Learning Model for Multitask Reaction Predictions with Explanation Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (ChemGPT) Neural Scaling of Deep Chemical Models Nature Machine Intelligence 2023
    [Paper] [Model (4.7M)] [Model (19M)] [Model (1.2B)]

  • (Uni-Mol) Uni-Mol: A Universal 3D Molecular Representation Learning Framework ICLR 2023
    [Paper] [GitHub]

  • (TransPolymer) TransPolymer: A Transformer-Based Language Model for Polymer Property Predictions npj Computational Materials 2023
    [Paper] [GitHub]

  • (polyBERT) polyBERT: A Chemical Language Model to Enable Fully Machine-Driven Ultrafast Polymer Informatics Nature Communications 2023
    [Paper] [GitHub] [Model (86M)]

  • (MFBERT) Large-Scale Distributed Training of Transformers for Chemical Fingerprinting Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (SPMM) Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model Nature Communications 2024
    [Paper] [GitHub]

  • (BARTSmiles) BARTSmiles: Generative Masked Language Models for Molecular Representations Journal of Chemical Information and Modeling 2024
    [Paper] [GitHub] [Model (406M)]

  • (MolGen) Domain-Agnostic Molecular Generation with Self-feedback ICLR 2024
    [Paper] [GitHub] [Model (406M, BART)] [Model (7B, LLaMA)]

  • (SELFormer) SELFormer: Molecular Representation Learning via SELFIES Language Models Machine Learning: Science and Technology 2023
    [Paper] [GitHub] [Model (58M)] [Model (87M)]

  • (PolyNC) PolyNC: A Natural and Chemical Language Model for the Prediction of Unified Polymer Properties Chemical Science 2024
    [Paper] [GitHub] [Model (220M)]

Biology and Medicine

Acknowledgment: We referred to Wang et al.'s survey paper Pre-trained Language Models in Biomedical Domain: A Systematic Survey and He et al.'s survey paper Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions when writing some parts of this section.

Language

  • (BioBERT) BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining Bioinformatics 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (BioELMo) Probing Biomedical Embeddings from Language Models NAACL 2019 Workshop
    [Paper] [GitHub] [Model (93M)]

  • (ClinicalBERT, Alsentzer et al.) Publicly Available Clinical BERT Embeddings NAACL 2019 Workshop
    [Paper] [GitHub] [Model (Base)]

  • (ClinicalBERT, Huang et al.) ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission arXiv 2019
    [Paper] [GitHub] [Model (Base)]

  • (BlueBERT, f.k.a. NCBI-BERT) Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets ACL 2019 Workshop
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (BEHRT) BEHRT: Transformer for Electronic Health Records Scientific Reports 2020
    [Paper] [GitHub]

  • (EhrBERT) Fine-Tuning Bidirectional Encoder Representations from Transformers (BERT)–Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study JMIR Medical Informatics 2019
    [Paper] [GitHub]

  • (Clinical XLNet) Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation EMNLP 2020 Workshop
    [Paper] [GitHub]

  • (ouBioBERT) Pre-training Technique to Localize Medical BERT and Enhance Biomedical BERT arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (COVID-Twitter-BERT) COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter Frontiers in Artificial Intelligence 2023
    [Paper] [GitHub] [Model (Large)]

  • (Med-BERT) Med-BERT: Pretrained Contextualized Embeddings on Large-Scale Structured Electronic Health Records for Disease Prediction npj Digital Medicine 2021
    [Paper] [GitHub]

  • (Bio-ELECTRA) On the Effectiveness of Small, Discriminatively Pre-trained Language Representation Models for Biomedical Text Mining EMNLP 2020 Workshop
    [Paper] [GitHub] [Model (Base)]

  • (BiomedBERT, f.k.a. PubMedBERT) Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing ACM Transactions on Computing for Healthcare 2021
    [Paper] [Model (Base)] [Model (Large)]

  • (MCBERT) Conceptualized Representation Learning for Chinese Biomedical Text Mining arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (BRLTM) Bidirectional Representation Learning from Transformers using Multimodal Electronic Health Record Data to Predict Depression JBHI 2021
    [Paper] [GitHub]

  • (BioRedditBERT) COMETA: A Corpus for Medical Entity Linking in the Social Media EMNLP 2020
    [Paper] [GitHub] [Model (Base)]

  • (BioMegatron) BioMegatron: Larger Biomedical Domain Language Model EMNLP 2020
    [Paper] [GitHub] [Model (345M)]

  • (SapBERT) Self-Alignment Pretraining for Biomedical Entity Representations NAACL 2021
    [Paper] [GitHub] [Model (Base)]

  • (ClinicalTransformer) Clinical Concept Extraction using Transformers JAMIA 2020
    [Paper] [GitHub] [Model (Base, BERT)] [Model (125M, RoBERTa)] [Model (12M, ALBERT)] [Model (Base, ELECTRA)] [Model (Base, XLNet)] [Model (149M, Longformer)] [Model (86M, DeBERTa)]

  • (BioRoBERTa) Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art EMNLP 2020 Workshop
    [Paper] [GitHub] [Model (125M)] [Model (355M)]

  • (RAD-BERT) Highly Accurate Classification of Chest Radiographic Reports using a Deep Learning Natural Language Model Pre-trained on 3.8 Million Text Reports Bioinformatics 2020
    [Paper] [GitHub]

  • (BioMedBERT) BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR COLING 2020
    [Paper] [GitHub]

  • (LBERT) LBERT: Lexically Aware Transformer-Based Bidirectional Encoder Representation Model for Learning Universal Bio-Entity Relations Bioinformatics 2021
    [Paper] [GitHub]

  • (ELECTRAMed) ELECTRAMed: A New Pre-trained Language Representation Model for Biomedical NLP arXiv 2021
    [Paper] [GitHub] [Model (Base)]

  • (KeBioLM) Improving Biomedical Pretrained Language Models with Knowledge NAACL 2021 Workshop
    [Paper] [GitHub]

  • (SciFive) SciFive: A Text-to-Text Transformer Model for Biomedical Literature arXiv 2021
    [Paper] [GitHub] [Model (220M)] [Model (770M)]

  • (BioALBERT) Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT BMC Bioinformatics 2022
    [Paper] [GitHub] [Model (12M)] [Model (18M)]

  • (Clinical-Longformer) Clinical-Longformer and Clinical-BigBird: Transformers for Long Clinical Sequences arXiv 2022
    [Paper] [GitHub] [Model (149M, Longformer)] [Model (Base, BigBird)]

  • (BioBART) BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model ACL 2022 Workshop
    [Paper] [GitHub] [Model (140M)] [Model (406M)]

  • (BioGPT) BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining Briefings in Bioinformatics 2022
    [Paper] [GitHub] [Model (355M)] [Model (1.5B)]

  • (Med-PaLM) Large Language Models Encode Clinical Knowledge Nature 2023
    [Paper]

  • (GatorTron) A Large Language Model for Electronic Health Records npj Digital Medicine 2022
    [Paper] [GitHub] [Model (345M)] [Model (3.9B)] [Model (8.9B)]

  • (ChatDoctor) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) using Medical Domain Knowledge Cureus 2023
    [Paper] [GitHub]

  • (DoctorGLM) DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task arXiv 2023
    [Paper] [GitHub]

  • (BenTsao, f.k.a. HuaTuo) HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge arXiv 2023
    [Paper] [GitHub]

  • (MedAlpaca) MedAlpaca - An Open-Source Collection of Medical Conversational AI Models and Training Data arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (PMC-LLaMA) PMC-LLaMA: Towards Building Open-source Language Models for Medicine JAMIA 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (Med-PaLM 2) Towards Expert-Level Medical Question Answering with Large Language Models arXiv 2023
    [Paper]

  • (HuatuoGPT) HuatuoGPT, towards Taming Language Model to Be a Doctor EMNLP 2023 Findings
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (MedCPT) MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval Bioinformatics 2023
    [Paper] [GitHub] [Model (Base)]

  • (Zhongjing) Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue AAAI 2024
    [Paper] [GitHub] [Model (13B)]

  • (DISC-MedLLM) DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation arXiv 2023
    [Paper] [GitHub] [Model (13B)]

  • (DRG-LLaMA) DRG-LLaMA: Tuning LLaMA Model to Predict Diagnosis-related Group for Hospitalized Patients npj Digital Medicine 2024
    [Paper] [GitHub]

  • (Qilin-Med) Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model arXiv 2023
    [Paper] [GitHub]

  • (AlpaCare) AlpaCare: Instruction-tuned Large Language Models for Medical Application arXiv 2023
    [Paper] [GitHub] [Model (7B, LLaMA)] [Model (7B, LLaMA-2)] [Model (13B, LLaMA)] [Model (13B, LLaMA-2)]

  • (BianQue) BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT arXiv 2023
    [Paper] [GitHub] [Model (6B)]

  • (HuatuoGPT-II) HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (34B)]

  • (Taiyi) Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks JAMIA 2024
    [Paper] [GitHub] [Model (7B)]

  • (MEDITRON) MEDITRON-70B: Scaling Medical Pretraining for Large Language Models arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (70B)]

  • (PLLaMa) PLLaMa: An Open-source Large Language Model for Plant Science arXiv 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (BioMistral) BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains ACL 2024 Findings
    [Paper] [Model (7B)]

  • (Me-LLaMA) Me-LLaMA: Foundation Large Language Models for Medical Applications arXiv 2024
    [Paper] [GitHub]

  • (BiMediX) BiMediX: Bilingual Medical Mixture of Experts LLM arXiv 2024
    [Paper] [GitHub] [Model (8x7B)]

  • (MMedLM) Towards Building Multilingual Language Model for Medicine arXiv 2024
    [Paper] [GitHub] [Model (7B, InternLM)] [Model (1.8B, InternLM2)] [Model (7B, InternLM2)] [Model (8B, LLaMA-3)]

  • (BioMedLM, f.k.a. PubMedGPT) BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text arXiv 2024
    [Paper] [GitHub] [Model (2.7B)]

  • (Hippocrates) Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare arXiv 2024
    [Paper] [Model (7B, LLaMA-2)] [Model (7B, Mistral)]

  • (BMRetriever) BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers arXiv 2024
    [Paper] [GitHub] [Model (410M, Pythia)] [Model (1B, Pythia)] [Model (2B, Gemma)] [Model (7B, Mistral)]

  • (Panacea) Panacea: A Foundation Model for Clinical Trial Search, Summarization, Design, and Recruitment arXiv 2024
    [Paper] [GitHub]

Language + Graph

  • (G-BERT) Pre-training of Graph Augmented Transformers for Medication Recommendation IJCAI 2019
    [Paper] [GitHub]

  • (CODER) CODER: Knowledge Infused Cross-Lingual Medical Term Embedding for Term Normalization JBI 2022
    [Paper] [GitHub] [Model (Base)]

  • (MoP) Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT EMNLP 2021
    [Paper] [GitHub]

  • (BioLinkBERT) LinkBERT: Pretraining Language Models with Document Links ACL 2022
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (DRAGON) Deep Bidirectional Language-Knowledge Graph Pretraining NeurIPS 2022
    [Paper] [GitHub] [Model (360M)]

Language + Vision

  • (ConVIRT) Contrastive Learning of Medical Visual Representations from Paired Images and Text MLHC 2022
    [Paper] [GitHub]

  • (MMBERT) MMBERT: Multimodal BERT Pretraining for Improved Medical VQA ISBI 2021
    [Paper] [GitHub]

  • (MedViLL) Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-training JBHI 2022
    [Paper] [GitHub]

  • (GLoRIA) GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition ICCV 2021
    [Paper] [GitHub]

  • (LoVT) Joint Learning of Localized Representations from Medical Images and Reports ECCV 2022
    [Paper] [GitHub]

  • (BioViL) Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing ECCV 2022
    [Paper] [GitHub]

  • (M3AE) Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-training MICCAI 2022
    [Paper] [GitHub] [Model]

  • (ARL) Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge ACM MM 2022
    [Paper] [GitHub]

  • (CheXzero) Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning Nature Biomedical Engineering 2022
    [Paper] [GitHub] [Model]

  • (MGCA) Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning NeurIPS 2022
    [Paper] [GitHub] [Model]

  • (MedCLIP) MedCLIP: Contrastive Learning from Unpaired Medical Images and Text EMNLP 2022
    [Paper] [GitHub]

  • (BioViL-T) Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing CVPR 2023
    [Paper] [GitHub] [Model]

  • (BiomedCLIP) BiomedCLIP: A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs arXiv 2023
    [Paper] [Model]

  • (PMC-CLIP) PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents MICCAI 2023
    [Paper] [GitHub] [Model]

  • (Xplainer) Xplainer: From X-Ray Observations to Explainable Zero-Shot Diagnosis MICCAI 2023
    [Paper] [GitHub]

  • (RGRG) Interactive and Explainable Region-Guided Radiology Report Generation CVPR 2023
    [Paper] [GitHub] [Model]

  • (BiomedGPT) A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks Nature Medicine 2024
    [Paper] [GitHub] [Model (33M)] [Model (93M)] [Model (182M)]

  • (Med-UniC) Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias NeurIPS 2023
    [Paper] [GitHub]

  • (LLaVA-Med) LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day NeurIPS 2023
    [Paper] [GitHub] [Model (7B)]

  • (MI-Zero) Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images CVPR 2023
    [Paper] [GitHub] [Model]

  • (XrayGPT) XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models ACL 2024 Workshop
    [Paper] [GitHub]

  • (MONET) Transparent Medical Image AI via an Image–Text Foundation Model Grounded in Medical Literature Nature Medicine 2024
    [Paper] [GitHub]

  • (QuiltNet) Quilt-1M: One Million Image-Text Pairs for Histopathology NeurIPS 2023
    [Paper] [GitHub] [Model]

  • (MUMC) Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering MICCAI 2023
    [Paper] [GitHub]

  • (M-FLAG) M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization MICCAI 2023
    [Paper] [GitHub]

  • (PRIOR) PRIOR: Prototype Representation Joint Learning from Medical Images and Reports ICCV 2023
    [Paper] [GitHub]

  • (Med-PaLM M) Towards Generalist Biomedical AI NEJM AI 2024
    [Paper] [GitHub]

  • (CITE) Text-Guided Foundation Model Adaptation for Pathological Image Classification MICCAI 2023
    [Paper] [GitHub]

  • (Med-Flamingo) Med-Flamingo: A Multimodal Medical Few-shot Learner ML4H 2023
    [Paper] [GitHub]

  • (RadFM) Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data arXiv 2023
    [Paper] [GitHub] [Model]

  • (PLIP) A Visual–Language Foundation Model for Pathology Image Analysis using Medical Twitter Nature Medicine 2023
    [Paper] [GitHub] [Model]

  • (MaCo) Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning Nature Communications 2024
    [Paper] [GitHub]

  • (CXR-CLIP) CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training MICCAI 2023
    [Paper] [GitHub]

  • (Qilin-Med-VL) Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare arXiv 2023
    [Paper] [GitHub] [Model]

  • (BioCLIP) BioCLIP: A Vision Foundation Model for the Tree of Life CVPR 2024
    [Paper] [GitHub] [Model]

  • (M3D) M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models arXiv 2024
    [Paper] [GitHub] [Model]

  • (Med-Gemini) Capabilities of Gemini Models in Medicine arXiv 2024
    [Paper]

  • (Med-Gemini-2D/3D/Polygenic) Advancing Multimodal Medical Capabilities of Gemini arXiv 2024
    [Paper]

  • (Mammo-CLIP) Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography MICCAI 2024
    [Paper] [GitHub] [Model]

Other Modalities (Protein)

Other Modalities (DNA)

Other Modalities (RNA)

  • (RNABERT) Informative RNA-base Embedding for Functional RNA Structural Alignment and Clustering by Deep Representation Learning NAR Genomics and Bioinformatics 2022
    [Paper] [GitHub]

  • (RNA-FM) Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions arXiv 2022
    [Paper] [GitHub]

  • (SpliceBERT) Self-Supervised Learning on Millions of Primary RNA Sequences from 72 Vertebrates Improves Sequence-Based RNA Splicing Prediction Briefings in Bioinformatics 2024
    [Paper] [GitHub] [Model (19.4M)]

  • (RNA-MSM) Multiple Sequence-Alignment-Based RNA Language Model and its Application to Structural Inference Nucleic Acids Research 2024
    [Paper] [GitHub]

  • (CodonBERT) CodonBERT: Large Language Models for mRNA Design and Optimization bioRxiv 2023
    [Paper] [GitHub]

  • (UTR-LM) A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions Nature Machine Intelligence 2024
    [Paper] [GitHub]

Other Modalities (Multiomics)

  • (scBERT) scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data Nature Machine Intelligence 2022
    [Paper] [GitHub]

  • (scGPT) scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics using Generative AI Nature Methods 2024
    [Paper] [GitHub]

  • (scFoundation) Large Scale Foundation Model on Single-cell Transcriptomics Nature Methods 2024
    [Paper] [GitHub] [Model (100M)]

  • (Geneformer) Transfer Learning Enables Predictions in Network Biology Nature 2023
    [Paper] [Model (10M)] [Model (40M)]

  • (CellLM) Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning arXiv 2023
    [Paper] [GitHub]

  • (CellPLM) CellPLM: Pre-training of Cell Language Model Beyond Single Cells ICLR 2024
    [Paper] [GitHub] [Model (82M)]

  • (scMulan) scMulan: A Multitask Generative Pre-trained Language Model for Single-Cell Analysis bioRxiv 2024
    [Paper] [GitHub]

Geography, Geology, and Environmental Science

Language

  • (ClimateBERT) ClimateBERT: A Pretrained Language Model for Climate-Related Text arXiv 2021
    [Paper] [GitHub] [Model (82M)]

  • (SpaBERT) SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation EMNLP 2022 Findings
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (MGeo) MGeo: Multi-Modal Geographic Pre-training Method SIGIR 2023
    [Paper] [GitHub]

  • (K2) K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization WSDM 2024
    [Paper] [GitHub] [Model (7B)]

  • (OceanGPT) OceanGPT: A Large Language Model for Ocean Science Tasks ACL 2024
    [Paper] [GitHub] [Model (7B)]

  • (ClimateBERT-NetZero) ClimateBERT-NetZero: Detecting and Assessing Net Zero and Reduction Targets EMNLP 2023
    [Paper] [Model (82M)]

  • (GeoLM) GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding EMNLP 2023
    [Paper] [GitHub]

  • (GeoGalactica) GeoGalactica: A Scientific Large Language Model in Geoscience arXiv 2024
    [Paper] [GitHub] [Model (30B)]

Language + Graph

  • (ERNIE-GeoL) ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps KDD 2022
    [Paper]

  • (PK-Chat) PK-Chat: Pointer Network Guided Knowledge Driven Generative Dialogue Model arXiv 2023
    [Paper] [GitHub]

Language + Vision

  • (UrbanCLIP) UrbanCLIP: Learning Text-Enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web WWW 2024
    [Paper] [GitHub]

Other Modalities (Climate Time Series)

  • (FourCastNet) FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators arXiv 2022
    [Paper] [GitHub]

  • (Pangu-Weather) Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks Nature 2023
    [Paper] [GitHub]

  • (ClimaX) ClimaX: A Foundation Model for Weather and Climate ICML 2023
    [Paper] [GitHub]

  • (FengWu) FengWu: Pushing the Skillful Global Medium-Range Weather Forecast beyond 10 Days Lead arXiv 2023
    [Paper] [GitHub]

  • (W-MAE) W-MAE: Pre-trained Weather Model with Masked Autoencoder for Multi-Variable Weather Forecasting arXiv 2023
    [Paper] [GitHub]

  • (FuXi) FuXi: A Cascade Machine Learning Forecasting System for 15-day Global Weather Forecast npj Climate and Atmospheric Science 2023
    [Paper] [GitHub]

Citation

If you find this repository useful, please cite the following paper:

@inproceedings{zhang2024comprehensive,
  title={A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery},
  author={Zhang, Yu and Chen, Xiusi and Jin, Bowen and Wang, Sheng and Ji, Shuiwang and Wang, Wei and Han, Jiawei},
  booktitle={EMNLP'24},
  pages={8783--8817},
  year={2024}
}
