MGM (Microbial General Model) is a large-scale pretrained language model for interpretable microbiome data analysis.

HUST-NingKang-Lab/MGM

MGM

Microbial General Model (MGM) is a large-scale pretrained language model designed for interpretable microbiome data analysis. MGM supports a variety of tasks, including data preparation, model training, and inference, making it a versatile tool for microbiome research.

[Figure: MGM Pipeline]

Installation

From PyPI

Install MGM using pip:

pip install microformer-mgm

From Source

Install MGM from the source code:

python setup.py install

MicroCorpus-260K

The MicroCorpus-260K dataset contains 263,302 microbiome samples sourced from MGnify and is suitable for training your own MGM model. It is available for download on OneDrive. The download contains:

  • MicroCorpus-260K.pkl: Microbiome corpus normalized using the mean and standard deviation across all samples.
  • MicroCorpus-260K_unnorm.pkl: Unnormalized microbiome corpus.
  • mgnify_biomes.csv: Metadata for the samples in the dataset.

Loading MicroCorpus-260K

Load the dataset in Python:

import pickle

# Open in binary mode; the context manager closes the file after loading.
with open('MicroCorpus-260K.pkl', 'rb') as f:
    corpus = pickle.load(f)

corpus[0]  # Access the first sample (dict with input_ids and attention_mask)
abundance = corpus.data  # Access the abundance data
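
Each sample is described above as a dict with input_ids and attention_mask. As a minimal sketch of how such a sample might be inspected, the helper below counts the positions that the mask marks as real tokens. It assumes the common convention that 1 marks a real token and 0 marks padding, and that the mask is one-dimensional; neither is confirmed by this README. The function name count_real_tokens and the toy sample are hypothetical stand-ins, not part of MGM.

```python
def count_real_tokens(sample):
    """Count positions whose attention mask value is non-zero.

    Assumes a 1-D mask where 1 = real token and 0 = padding (an assumption,
    not confirmed for MGM). int() also accepts 0-d tensor or NumPy values.
    """
    return sum(1 for m in sample["attention_mask"] if int(m) != 0)


# Hypothetical stand-in for corpus[0]; real samples come from the pickle file.
toy_sample = {
    "input_ids": [2, 17, 5, 9, 3, 0, 0],
    "attention_mask": [1, 1, 1, 1, 1, 0, 0],
}

n_tokens = count_real_tokens(toy_sample)
print(n_tokens)
```

With a loaded corpus, the same helper could be applied as count_real_tokens(corpus[0]), provided the assumptions above hold.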

Usage

MGM is accessed via a command-line interface (CLI) with various modes. The general syntax is:

mgm <mode> [options]

Below, the modes are grouped into Data Preparation, Model Training, and Inference for better organization.

Data Preparation

construct

Converts abundance data into a microbiome corpus. Abundances are normalized using phylogeny, and the genera in each sample are ranked from highest to lowest abundance.

  • Input: Abundance data in hdf5, csv, or tsv format (features in rows, samples in columns)
  • Output: A .pkl file containing the microbiome corpus
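
To make the ranking step concrete, here is a minimal sketch of ordering the genera of one sample from highest to lowest abundance. It is an illustration of the idea only, not MGM's implementation: it skips the phylogeny-based normalization, and dropping zero-abundance genera and breaking ties by name are assumptions made for this example. The function name rank_genera and the genus labels are hypothetical.

```python
def rank_genera(abundance):
    """Return genus names sorted from highest to lowest abundance.

    Zero-abundance genera are dropped and ties are broken alphabetically;
    both choices are assumptions made for this sketch.
    """
    present = [(genus, value) for genus, value in abundance.items() if value > 0]
    present.sort(key=lambda pair: (-pair[1], pair[0]))
    return [genus for genus, _ in present]


# Hypothetical single-sample abundance profile.
toy_profile = {
    "g__Bacteroides": 0.42,
    "g__Prevotella": 0.0,
    "g__Faecalibacterium": 0.31,
    "g__Blautia": 0.27,
}

ranked = rank_genera(toy_profile)
print(ranked)
```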

Example:

mgm construct -i data/abundance.csv -o data/corpus.pkl

Note: For hdf5 files, use -k to specify the key (default is genus).
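
The expected csv layout (features in rows, samples in columns) can be produced with the standard library alone. The sketch below writes a tiny table in that orientation and reads it back. The header cell name, the genus labels, the sample IDs, and the use of raw counts are all assumptions for illustration; the README does not specify them.

```python
import csv
import os
import tempfile

# Hypothetical sample IDs and genus-level counts.
samples = ["sample_1", "sample_2", "sample_3"]
rows = {
    "g__Bacteroides": [120, 0, 45],
    "g__Faecalibacterium": [30, 80, 0],
    "g__Blautia": [0, 15, 60],
}

out_path = os.path.join(tempfile.mkdtemp(), "abundance.csv")
with open(out_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    # One header row: a feature column name followed by one column per sample.
    writer.writerow(["genus"] + samples)
    # One row per feature (genus), one value per sample.
    for genus, counts in rows.items():
        writer.writerow([genus] + counts)

# Read the file back to confirm the orientation.
with open(out_path, newline="") as fh:
    table = list(csv.reader(fh))
print(table[0])
```

A file written this way could then be passed to mgm construct with -i, as in the example above.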

Model Training

pretrain

Pretrains the MGM model using causal language modeling on a microbiome corpus. Optionally, trains a generator with labeled data.

  • Input:
    • Microbiome corpus (.pkl)
    • Optional: Label file (.csv, two columns: sample ID and label)
  • Output: Pretrained MGM model

Examples:

mgm pretrain -i data/corpus.pkl -o models/pretrained_model
mgm pretrain -i data/corpus.pkl -l data/labels.csv -o models/generator_model --with-label

Note: Use --from-scratch to train from scratch instead of loading pretrained weights. If a label file is provided, the tokenizer and model embedding layer are updated.
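
For readers unfamiliar with causal language modeling, the objective is to predict each token from the tokens that precede it. The sketch below lists the (context, target) pairs implied by one token sequence. It illustrates the general objective only and is not taken from MGM's training code; the function name causal_lm_examples and the token IDs are hypothetical.

```python
def causal_lm_examples(token_ids):
    """List (context, target) pairs for next-token prediction.

    For each position t >= 1, the context is every token before t and the
    target is the token at t.
    """
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]


# Hypothetical token IDs for one ranked sample.
toy_ids = [2, 17, 5, 9]

examples = causal_lm_examples(toy_ids)
for context, target in examples:
    print(context, "->", target)
```

In practice, frameworks compute all of these predictions in one pass by shifting the sequence by one position, rather than materializing each context separately.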

train

Trains a supervised MGM model from scratch using labeled data.

  • Input:
    • Microbiome corpus (.pkl)
    • Label file (.csv, two columns: sample ID and label)
  • Output: Supervised MGM model

Example:

mgm train -i data/corpus.pkl -l data/labels.csv -o models/supervised_model
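
The label file is described as a two-column csv holding a sample ID and a label. A minimal sketch of writing and sanity-checking such a file follows. The header row, the column names, and the label values are assumptions for illustration; the README does not say whether a header is expected.

```python
import csv
import os
import tempfile

# Hypothetical sample IDs and labels.
labels = [("sample_1", "gut"), ("sample_2", "soil"), ("sample_3", "gut")]

label_path = os.path.join(tempfile.mkdtemp(), "labels.csv")
with open(label_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sample_id", "label"])  # header names are an assumption
    writer.writerows(labels)

# Read back and check: exactly two columns per row, no duplicate sample IDs.
with open(label_path, newline="") as fh:
    read_rows = list(csv.reader(fh))

data_rows = read_rows[1:]
two_columns = all(len(row) == 2 for row in read_rows)
unique_ids = len({row[0] for row in data_rows}) == len(data_rows)
print(two_columns, unique_ids)
```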

finetune

Finetunes a pretrained MGM model for a specific task using labeled data.

  • Input:
    • Microbiome corpus (.pkl)
    • Label file (.csv, two columns: sample ID and label)
    • Optional: Pretrained model (defaults to MicroCorpus-260K pretrained model if not specified)
  • Output: Finetuned MGM model

Example:

mgm finetune -i data/corpus.pkl -l data/labels.csv -m models/pretrained_model -o models/finetuned_model
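
When finetuning is driven from a script, the command line can be assembled programmatically. The sketch below builds the argument list using only the flags shown in the example above (-i, -l, -m, -o) and leaves out -m when no pretrained model is given, which per the input list falls back to the MicroCorpus-260K pretrained model. The helper name build_finetune_cmd is hypothetical and not part of MGM.

```python
def build_finetune_cmd(corpus, labels, output, pretrained=None):
    """Assemble an argument list for `mgm finetune`.

    Only flags that appear in the README example are used. When `pretrained`
    is None, -m is omitted so the CLI's documented default applies.
    """
    cmd = ["mgm", "finetune", "-i", corpus, "-l", labels]
    if pretrained is not None:
        cmd += ["-m", pretrained]
    cmd += ["-o", output]
    return cmd


with_model = build_finetune_cmd(
    "data/corpus.pkl", "data/labels.csv", "models/finetuned_model",
    pretrained="models/pretrained_model",
)
default_model = build_finetune_cmd(
    "data/corpus.pkl", "data/labels.csv", "models/finetuned_model",
)
print(" ".join(with_model))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where MGM is installed; the sketch itself does not execute anything.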

Inference

predict

Generates predictions using a finetuned MGM model. Optionally evaluates against ground truth labels.

  • Input:
    • Microbiome corpus (.pkl)
    • Optional: Label file (.csv) for evaluation
    • Supervised MGM model
  • Output: Prediction results (.csv)

Example:

mgm predict -E -i data/corpus.pkl -l data/labels.csv -m models/finetuned_model -o data/predictions.csv

Note: Use -E with a label file to compare predictions with ground truth.
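
The README does not list which metrics -E reports. As an illustration of what comparing predictions with ground truth involves, the sketch below computes plain accuracy over two aligned lists of labels. The function name accuracy and the label values are hypothetical, and accuracy is chosen here only as the simplest example of such a metric.

```python
def accuracy(predicted, truth):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("at least one label is required")
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)


# Hypothetical labels, aligned by sample.
predicted_labels = ["gut", "soil", "gut", "marine"]
true_labels = ["gut", "soil", "marine", "marine"]

score = accuracy(predicted_labels, true_labels)
print(score)
```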

generate

Generates synthetic microbiome data using a pretrained MGM model.

  • Input:
    • Pretrained MGM model
    • Optional: Prompt file (.txt, one label per line) for labeled generation
  • Output: Synthetic genus tensors (.pkl)

Example:

mgm generate -m models/generator_model -p data/prompt.txt -n 100 -o data/synthetic.pkl

Note: Use -n to specify the number of samples to generate.
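
The prompt file is plain text with one label per line. A minimal sketch of writing one and reading it back is shown below. The label strings are hypothetical; in practice they would presumably need to match labels the generator was trained with, which this README does not spell out.

```python
import os
import tempfile

# Hypothetical labels to condition generation on.
prompt_labels = ["gut", "soil", "marine"]

prompt_path = os.path.join(tempfile.mkdtemp(), "prompt.txt")
with open(prompt_path, "w") as fh:
    # One label per line, with a trailing newline after the last label.
    fh.write("\n".join(prompt_labels) + "\n")

# Read back, ignoring blank lines.
with open(prompt_path) as fh:
    read_back = [line.strip() for line in fh if line.strip()]
print(read_back)
```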

reconstruct

Reconstructs abundance data from a ranked corpus, with optional training of a reconstructor model or label decoding.

  • Input:
    • Abundance file (e.g., csv) for training the reconstructor, or a trained model checkpoint.
    • Ranked corpus (.pkl) for reconstruction
    • Optional: Generator model and prompt file (text, one label per line) for labeled data
  • Output:
    • Reconstructed corpus (.pkl)
    • Reconstructor model (if training)
    • Decoded labels (if applicable)

Examples:

mgm reconstruct -a data/abundance.csv -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructor
mgm reconstruct -r data/reconstructor_model.ckpt -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructed
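
The two examples differ in whether an abundance file (-a) is supplied to train a reconstructor or an existing checkpoint (-r) is reused. The sketch below assembles either command from the flags shown above and rejects calls that pass both or neither. The helper name build_reconstruct_cmd is hypothetical, and pairing -g with "-w True" simply mirrors the examples; the meaning of -w is not described in this README.

```python
def build_reconstruct_cmd(corpus, output, abundance=None, checkpoint=None,
                          generator=None):
    """Assemble an argument list for `mgm reconstruct`.

    Exactly one of `abundance` (-a, train a reconstructor) or `checkpoint`
    (-r, reuse a trained one) must be given.
    """
    if (abundance is None) == (checkpoint is None):
        raise ValueError("pass exactly one of abundance or checkpoint")
    cmd = ["mgm", "reconstruct"]
    cmd += ["-a", abundance] if abundance is not None else ["-r", checkpoint]
    cmd += ["-i", corpus]
    if generator is not None:
        # Both README examples pair -g with "-w True"; mirrored here as-is.
        cmd += ["-g", generator, "-w", "True"]
    cmd += ["-o", output]
    return cmd


train_cmd = build_reconstruct_cmd(
    "data/synthetic.pkl", "data/reconstructor",
    abundance="data/abundance.csv", generator="models/generator_model",
)
reuse_cmd = build_reconstruct_cmd(
    "data/synthetic.pkl", "data/reconstructed",
    checkpoint="data/reconstructor_model.ckpt",
    generator="models/generator_model",
)
print(" ".join(train_cmd))
```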

For more details on any mode, run:

mgm <mode> --help

Maintainers

Name | Email | Organization
Haohong Zhang | [email protected] | PhD Student, School of Life Science and Technology, HUST
Zixin Kang | [email protected] | Undergraduate, School of Life Science and Technology, HUST
Kang Ning | [email protected] | Professor, School of Life Science and Technology, HUST
