MGM


Microbial General Model (MGM) is a large-scale pretrained language model for interpretable microbiome data analysis. Its command-line interface covers data preparation, model training, and inference.

Installation

From PyPI

Install MGM using pip:

pip install microformer-mgm

From Source

Install MGM from the source code by running the following from the root of the cloned repository:

python setup.py install

MicroCorpus-260K

The MicroCorpus-260K dataset contains 263,302 microbiome samples sourced from MGnify and can be used to train your own MGM model. It is available for download on OneDrive. The dataset includes:

MicroCorpus-260K.pkl: Normalized microbiome corpus (normalized using the mean and standard deviation across all samples).
MicroCorpus-260K_unnorm.pkl: Unnormalized microbiome corpus.
mgnify_biomes.csv: Metadata for the samples in the dataset.

Loading MicroCorpus-260K

Load the dataset in Python:

import pickle

# The corpus is a pickled Python object; open it in a context manager so the file is closed
with open('MicroCorpus-260K.pkl', 'rb') as f:
    corpus = pickle.load(f)

sample = corpus[0]       # First sample (dict with input_ids and attention_mask)
abundance = corpus.data  # Abundance data

Usage

MGM is accessed via a command-line interface (CLI) with various modes. The general syntax is:

mgm <mode> [options]

Below, the modes are grouped into Data Preparation, Model Training, and Inference for better organization.

Data Preparation

`construct`

Converts abundance data into a microbiome corpus: abundances are normalized using phylogeny, and the genera in each sample are ranked from highest to lowest abundance.
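The ranking step can be pictured with a small Python sketch. This is an illustration only, not MGM's implementation: the function name `rank_genera`, the toy abundances, and the choice to drop zero-abundance genera are all assumptions, and the phylogeny-based normalization is not shown.

```python
def rank_genera(abundances):
    """Return genus names ordered from highest to lowest abundance.

    Genera with zero abundance are dropped (an assumption made for this sketch).
    """
    present = {genus: value for genus, value in abundances.items() if value > 0}
    return sorted(present, key=present.get, reverse=True)


# Hypothetical relative abundances for a single sample
sample = {
    "Bacteroides": 0.42,
    "Prevotella": 0.0,
    "Faecalibacterium": 0.31,
    "Blautia": 0.27,
}

ranked = rank_genera(sample)
print(ranked)
```

A sequence like this is what a language model can consume as "tokens", which is presumably why the corpus stores ranked genera rather than raw counts.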

Input: Abundance data in hdf5, csv, or tsv format (features in rows, samples in columns)
Output: A .pkl file containing the microbiome corpus

Example:

mgm construct -i data/abundance.csv -o data/corpus.pkl

Note: For hdf5 files, use -k to specify the key (default is genus).
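As a rough guide to the expected csv orientation (features in rows, samples in columns), the following Python sketch writes a tiny abundance table. The genus names, sample IDs, counts, and the `genus` label in the top-left header cell are made up for illustration; check `mgm construct --help` for the exact layout MGM expects.

```python
import csv
import io

# Hypothetical sample IDs (columns) and genus counts (rows)
samples = ["sample_1", "sample_2"]
table = {
    "Bacteroides": [120, 30],
    "Faecalibacterium": [45, 80],
    "Blautia": [0, 12],
}


def write_abundance_csv(handle, samples, table):
    """Write one row per genus and one column per sample."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["genus"] + samples)  # header cell name is an assumption
    for genus, counts in table.items():
        writer.writerow([genus] + counts)


buffer = io.StringIO()
write_abundance_csv(buffer, samples, table)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open the target file with `open(path, "w", newline="")` and pass the handle to `write_abundance_csv`.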

Model Training

`pretrain`

Pretrains the MGM model using causal language modeling on a microbiome corpus. Optionally, trains a generator with labeled data.

Input:
- Microbiome corpus (.pkl)
- Optional: Label file (.csv, two columns: sample ID and label)
Output: Pretrained MGM model

Examples:

mgm pretrain -i data/corpus.pkl -o models/pretrained_model
mgm pretrain -i data/corpus.pkl -l data/labels.csv -o models/generator_model --with-label

Note: Use --from-scratch to train from scratch instead of loading pretrained weights. If a label file is provided, the tokenizer and model embedding layer are updated.

`train`

Trains a supervised MGM model from scratch using labeled data.

Input:
- Microbiome corpus (.pkl)
- Label file (.csv, two columns: sample ID and label)
Output: Supervised MGM model

Example:

mgm train -i data/corpus.pkl -l data/labels.csv -o models/supervised_model
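The label file is described only as a two-column csv (sample ID and label). A minimal Python sketch for writing such a file and checking that every sample has a label might look like this. The header names, sample IDs, and labels are assumptions, as is the presence of a header row at all.

```python
import csv
import io

# Hypothetical mapping from sample ID to label
labels = {"sample_1": "gut", "sample_2": "soil"}


def write_labels_csv(handle, labels):
    """Write a two-column csv: sample ID, then label."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample_id", "label"])  # assumed header row
    for sample_id, label in labels.items():
        writer.writerow([sample_id, label])


def missing_labels(sample_ids, labels_text):
    """Return the sample IDs that have no row in the label csv."""
    reader = csv.reader(io.StringIO(labels_text))
    next(reader)  # skip the assumed header row
    labeled = {row[0] for row in reader if row}
    return sorted(set(sample_ids) - labeled)


buffer = io.StringIO()
write_labels_csv(buffer, labels)
labels_text = buffer.getvalue()

# sample_3 is in the abundance table but has no label
print(missing_labels(["sample_1", "sample_3"], labels_text))
```

Running a check like `missing_labels` before training can catch ID mismatches between the abundance table and the label file early.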

`finetune`

Finetunes a pretrained MGM model for a specific task using labeled data.

Input:
- Microbiome corpus (.pkl)
- Label file (.csv, two columns: sample ID and label)
- Optional: Pretrained model (defaults to MicroCorpus-260K pretrained model if not specified)
Output: Finetuned MGM model

Example:

mgm finetune -i data/corpus.pkl -l data/labels.csv -m models/pretrained_model -o models/finetuned_model

Inference

`predict`

Generates predictions using a finetuned MGM model. Optionally evaluates against ground truth labels.

Input:
- Microbiome corpus (.pkl)
- Optional: Label file (.csv) for evaluation
- Supervised MGM model
Output: Prediction results (.csv)

Example:

mgm predict -E -i data/corpus.pkl -l data/labels.csv -m models/finetuned_model -o data/predictions.csv

Note: Use -E with a label file to compare predictions with ground truth.
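The metrics reported by `-E` are not listed here. To make the idea of comparing predictions with ground truth concrete, the Python sketch below computes plain accuracy by hand. The dictionaries stand in for parsed prediction and label files, and their contents are hypothetical; the actual columns of the prediction csv may differ.

```python
def accuracy(predicted, truth):
    """Fraction of samples whose predicted label matches the true label.

    Only sample IDs present in both mappings are scored.
    """
    shared = [sample_id for sample_id in truth if sample_id in predicted]
    if not shared:
        raise ValueError("no overlapping sample IDs")
    correct = sum(predicted[sample_id] == truth[sample_id] for sample_id in shared)
    return correct / len(shared)


# Hypothetical predictions and ground-truth labels keyed by sample ID
predicted = {"sample_1": "gut", "sample_2": "soil", "sample_3": "gut"}
truth = {"sample_1": "gut", "sample_2": "marine", "sample_3": "gut"}

score = accuracy(predicted, truth)
print(f"accuracy: {score:.3f}")
```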

`generate`

Generates synthetic microbiome data using a pretrained MGM model.

Input:
- Pretrained MGM model
- Optional: Prompt file (.txt, one label per line) for labeled generation
Output: Synthetic genus tensors (.pkl)

Example:

mgm generate -m models/generator_model -p data/prompt.txt -n 100 -o data/synthetic.pkl

Note: Use -n to specify the number of samples to generate.
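A prompt file is plain text with one label per line. The Python sketch below writes one; the labels `gut` and `soil` are placeholders, and it is an assumption that prompts should match labels the generator was trained with.

```python
import tempfile
from pathlib import Path


def write_prompt_file(path, labels):
    """Write one label per line, ending with a trailing newline."""
    Path(path).write_text("\n".join(labels) + "\n", encoding="utf-8")


# Hypothetical labels to condition generation on
labels = ["gut", "soil"]

# A temporary directory keeps the sketch self-contained
with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "prompt.txt"
    write_prompt_file(prompt_path, labels)
    written = prompt_path.read_text(encoding="utf-8").splitlines()

print(written)
```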

`reconstruct`

Reconstructs abundance data from a ranked corpus, with optional training of a reconstructor model or label decoding.

Input:
- Abundance file (e.g., csv) for training the reconstructor, or a trained model checkpoint.
- Ranked corpus (.pkl) for reconstruction
- Optional: Generator model and prompt file (text, one label per line) for labeled data
Output:
- Reconstructed corpus (.pkl)
- Reconstructor model (if training)
- Decoded labels (if applicable)

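Examples (the first trains a reconstructor from an abundance file; the second reuses a trained reconstructor checkpoint):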

mgm reconstruct -a data/abundance.csv -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructor
mgm reconstruct -r data/reconstructor_model.ckpt -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructed

For more details on any mode, run:

mgm <mode> --help
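The modes can also be chained from a script. The Python sketch below assembles a `construct`, `finetune`, `predict` sequence using only the flags shown in the examples above; the helper names `build_pipeline` and `run_pipeline`, the `out` directory, and the file names are hypothetical. By default it only prints the commands (a dry run), so it does not require MGM to be installed.

```python
import shlex
import subprocess


def build_pipeline(abundance, labels, workdir):
    """Return the argv lists for construct -> finetune -> predict."""
    corpus = f"{workdir}/corpus.pkl"
    model = f"{workdir}/finetuned_model"
    predictions = f"{workdir}/predictions.csv"
    return [
        ["mgm", "construct", "-i", abundance, "-o", corpus],
        # No -m given: finetune falls back to the default pretrained model
        ["mgm", "finetune", "-i", corpus, "-l", labels, "-o", model],
        ["mgm", "predict", "-E", "-i", corpus, "-l", labels, "-m", model, "-o", predictions],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failing step


commands = build_pipeline("data/abundance.csv", "data/labels.csv", "out")
run_pipeline(commands)
```

For brevity the sketch predicts on the same corpus it was finetuned on; in practice predictions would normally be made on held-out samples with their own corpus file.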

Maintainers

Name	Email	Organization
Haohong Zhang	[email protected]	PhD Student, School of Life Science and Technology, HUST
Zixin Kang	[email protected]	Undergraduate, School of Life Science and Technology, HUST
Kang Ning	[email protected]	Professor, School of Life Science and Technology, HUST
