Genome-Llama-2 is a family of autoregressive large language models with sizes ranging from 119 million to 744 million parameters, specifically trained on extensive multi-species genomic data. This model integrates cutting-edge advancements in natural language processing and deep learning, making it a powerful tool for genomic research. Building on the Llama-2 architecture, Genome-Llama-2 is optimized to approach the performance of DNABERT-2. The family includes models of varying sizes to accommodate different computational needs:
- Base model: 119 million parameters.
- Medium model: 411 million parameters.
- Large model: 744 million parameters.
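For orientation, the snippet below shows how a Llama-2-style backbone at roughly the base-model scale could be instantiated with Hugging Face transformers. The hyperparameters (vocabulary size, hidden size, depth) are illustrative assumptions, not the exact Genome-Llama-2 settings.

```python
# Illustrative sketch: a Llama-2-style causal LM at roughly base-model scale.
# All hyperparameters below are assumptions, not the real Genome-Llama-2 config.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=4096,            # assumed BPE vocabulary over DNA sequences
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```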
By treating genome sequences as text, Genome-Llama-2 can learn intricate patterns and relationships within the data. This makes it suitable for a range of downstream tasks, including Epigenetic Marks Prediction, Covid Variant Classification, Splice Site Prediction, and the other tasks in the GUE (Genome Understanding Evaluation) benchmark.
This repository contains the complete training pipeline for Genome-Llama-2, leveraging PyTorch Lightning to enable efficient training in a distributed environment. Whether you are a researcher or developer, Genome-Llama-2 offers a robust framework for advancing your genomic studies.
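As a rough illustration of the Lightning-based setup, the following sketch wraps a causal language model in a LightningModule with a next-token-prediction training step; the repository's actual module will differ in detail.

```python
# A minimal sketch of a causal-LM pre-training step in PyTorch Lightning.
# This is not the repository's actual module, only an outline of the idea.
import pytorch_lightning as pl
import torch
from transformers import LlamaForCausalLM

class GenomeLlamaPretrainModule(pl.LightningModule):
    def __init__(self, model: LlamaForCausalLM, lr: float = 3e-4):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # `batch` is assumed to hold token ids and an attention mask; using the
        # inputs as labels trains the model on next-token prediction.
        out = self.model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["input_ids"],
        )
        self.log("train_loss", out.loss, prog_bar=True)
        return out.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)
```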
We used the same dataset as DNABERT-2 to pretrain Genome-Llama-2. The training data can be accessed here.
To tokenize the pre-training dataset, use the following command:
python genome_llama2/tokenization/tokenize_data.py --tokenize_config genome_llama2/config/pretrain_config.yaml
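DNABERT-2 tokenizes DNA with byte-pair encoding rather than fixed k-mers, and tokenize_data.py presumably follows a similar approach. The sketch below shows the general idea with the Hugging Face tokenizers library; the file paths, vocabulary size, and special tokens are placeholders, not values taken from pretrain_config.yaml.

```python
# Hypothetical sketch: training a BPE tokenizer on raw DNA sequences,
# in the spirit of DNABERT-2's BPE tokenization.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=4096,  # placeholder vocabulary size
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
# Each line of train.txt is assumed to be one DNA sequence (A/C/G/T characters).
tokenizer.train(files=["train.txt"], trainer=trainer)
tokenizer.save("genome_bpe_tokenizer.json")
```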
To pre-train the model, use the following command:
python genome_llama2/pretrain_model.py --pretrain_config genome_llama2/config/pretrain_config.yaml
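Under the hood, distributed pre-training with PyTorch Lightning typically comes down to a Trainer configured for multiple devices. The following is a minimal sketch, assuming DDP and mixed precision; the actual settings come from pretrain_config.yaml.

```python
# Minimal sketch of a distributed pre-training launch with the Lightning Trainer.
# Device count, precision, and step budget are placeholders.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,               # number of GPUs per node (assumed)
    strategy="ddp",          # distributed data parallel across devices
    precision="16-mixed",    # mixed precision to reduce memory use
    max_steps=100_000,       # placeholder training budget
    gradient_clip_val=1.0,
)
# `module` and `datamodule` stand for the LightningModule sketched above and a
# LightningDataModule over the tokenized corpus:
# trainer.fit(module, datamodule=datamodule)
```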
To fine-tune the model on a specific downstream dataset, use the following command:
python genome_llama2/finetune_model.py --finetune_config genome_llama2/config/finetune_config.yaml
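For GUE-style classification tasks, fine-tuning generally means attaching a classification head to the pre-trained backbone. Below is a minimal sketch with transformers; the checkpoint path, label count, and padding id are placeholders and may not match this repository's fine-tuning code.

```python
# Hypothetical sketch: adapting the pre-trained backbone to a classification
# task (e.g. splice site prediction) via a sequence classification head.
import torch
from transformers import LlamaForSequenceClassification

model = LlamaForSequenceClassification.from_pretrained(
    "path/to/pretrained_genome_llama2",  # placeholder checkpoint directory
    num_labels=3,                        # e.g. donor / acceptor / neither
)
model.config.pad_token_id = 0            # assumed PAD id from the tokenizer

# One fine-tuning step on a dummy, already-tokenized batch:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = {
    "input_ids": torch.randint(1, 4096, (8, 128)),   # dummy tokenized DNA
    "attention_mask": torch.ones(8, 128, dtype=torch.long),
    "labels": torch.randint(0, 3, (8,)),
}
loss = model(**batch).loss
loss.backward()
optimizer.step()
```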
You can edit the pretrain_config.yaml and finetune_config.yaml files to customize tokenization, pre-training, and fine-tuning to your requirements.
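If you prefer to adjust settings programmatically, the configs can also be loaded and rewritten with PyYAML; the keys in this sketch are illustrative and may not match the actual schema of pretrain_config.yaml.

```python
# Sketch of inspecting and overriding the YAML config before a run.
# The keys shown here are hypothetical examples, not the real schema.
import yaml

with open("genome_llama2/config/pretrain_config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["max_steps"] = 50_000        # hypothetical field
cfg["learning_rate"] = 1e-4      # hypothetical field

with open("genome_llama2/config/pretrain_config_custom.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```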