Microbial General Model (MGM) is a large-scale pretrained language model designed for interpretable microbiome data analysis. MGM supports a variety of tasks, including data preparation, model training, and inference, making it a versatile tool for microbiome research.
Install MGM using pip:
pip install microformer-mgm
Install MGM from the source code:
python setup.py install
The MicroCorpus-260K dataset includes 263,302 microbiome samples sourced from MGnify, ideal for training your own MGM model. It is available for download on OneDrive. The dataset includes:
- `MicroCorpus-260K.pkl`: Microbiome corpus normalized using the mean and standard deviation across all samples.
- `MicroCorpus-260K_unnorm.pkl`: Unnormalized microbiome corpus.
- `mgnify_biomes.csv`: Metadata for the samples in the dataset.
Load the dataset in Python:
from pickle import load
with open('MicroCorpus-260K.pkl', 'rb') as f:  # the file is closed once loading finishes
    corpus = load(f)
corpus[0]  # Access the first sample (dict with input_ids and attention_mask)
abundance = corpus.data  # Access the abundance data
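Each sample is described above as a dict with `input_ids` and `attention_mask`. The sketch below uses a hand-made stand-in sample (the token IDs and sequence length are invented for illustration) to show how such a pair is typically read. It assumes the common convention that a mask value of 1 marks a real token and 0 marks padding; how MGM pads its sequences is not stated here.

```python
# Stand-in for corpus[0]; the values are illustrative, not taken from MicroCorpus-260K.
sample = {
    "input_ids": [2, 417, 88, 1093, 3, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1, 1, 0, 0, 0],
}

def real_tokens(sample):
    """Return the token IDs whose attention-mask entry is 1 (assumed: 1 = real token)."""
    pairs = zip(sample["input_ids"], sample["attention_mask"])
    return [tok for tok, keep in pairs if keep == 1]

tokens = real_tokens(sample)
print(len(tokens), "non-padding tokens:", tokens)
```

If the real corpus stores tensors rather than lists, converting them with `.tolist()` first keeps this helper unchanged.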
MGM is accessed via a command-line interface (CLI) with various modes. The general syntax is:
mgm <mode> [options]
Below, the modes are grouped into Data Preparation, Model Training, and Inference for better organization.
The `construct` mode converts abundance data into a microbiome corpus, normalized using phylogeny and ranked from high to low genus abundance.
- Input: Abundance data in `hdf5`, `csv`, or `tsv` format (features in rows, samples in columns)
- Output: A `.pkl` file containing the microbiome corpus
Example:
mgm construct -i data/abundance.csv -o data/corpus.pkl
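The expected orientation (features in rows, samples in columns) is the transpose of what many tools export. As a minimal sketch, assuming a small table with samples in rows, the standard library is enough to flip it; the genus names, counts, and the `genus` corner label are invented for illustration.

```python
import csv
import io

# Hypothetical table with samples in rows (the orientation many tools export).
samples_in_rows = [
    ["sample_id", "g__Bacteroides", "g__Blautia"],
    ["S1", "120", "30"],
    ["S2", "45", "80"],
]

# zip(*rows) transposes the table so that each genus becomes a row
# and each sample becomes a column.
features_in_rows = [list(column) for column in zip(*samples_in_rows)]
features_in_rows[0][0] = "genus"  # corner label; the expected name is an assumption

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(features_in_rows)
print(buffer.getvalue())
```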
Note: For `hdf5` files, use `-k` to specify the key (default is `genus`).
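To make "ranked from high to low genus abundance" concrete, here is a minimal sketch of ranking one sample. It is not MGM's implementation: the phylogeny-based normalization is skipped, the genus names and values are made up, and dropping zero-abundance genera is an assumption.

```python
def rank_genera(abundance):
    """Order genus names from most to least abundant, skipping absent genera.

    Ties are broken alphabetically so the output is deterministic.
    """
    present = [(name, value) for name, value in abundance.items() if value > 0]
    present.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in present]

# Hypothetical relative abundances for a single sample.
sample = {
    "g__Bacteroides": 0.42,
    "g__Prevotella": 0.0,
    "g__Faecalibacterium": 0.31,
    "g__Blautia": 0.27,
}

ranked = rank_genera(sample)
print(ranked)
```

The resulting ordered list of genus names is the kind of sequence a tokenizer could then map to `input_ids`.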
The `pretrain` mode pretrains the MGM model using causal language modeling on a microbiome corpus. Optionally, it trains a generator with labeled data.
- Input:
  - Microbiome corpus (`.pkl`)
  - Optional: Label file (`.csv`, two columns: sample ID and label)
- Output: Pretrained MGM model
Examples:
mgm pretrain -i data/corpus.pkl -o models/pretrained_model
mgm pretrain -i data/corpus.pkl -l data/labels.csv -o models/generator_model --with-label
Note: Use `--from-scratch` to train from scratch instead of loading pretrained weights. If a label file is provided, the tokenizer and model embedding layer are updated.
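Causal language modeling trains a model to predict each token from the tokens that precede it. The sketch below shows only how one ranked sample would break down into (context, next token) pairs; the token IDs are invented, and the actual tokenizer, special tokens, and training loop used by MGM are not shown.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs: every prefix paired with the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# Hypothetical token IDs for one ranked sample, most abundant genus first.
ranked_ids = [417, 88, 1093, 56]

pairs = next_token_pairs(ranked_ids)
for context, target in pairs:
    print(context, "->", target)
```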
The `train` mode trains a supervised MGM model from scratch using labeled data.
- Input:
  - Microbiome corpus (`.pkl`)
  - Label file (`.csv`, two columns: sample ID and label)
- Output: Supervised MGM model
Example:
mgm train -i data/corpus.pkl -l data/labels.csv -o models/supervised_model
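The label file is described as a two-column CSV of sample ID and label. A minimal sketch of writing one with the standard library follows. The column names, sample IDs, and labels are placeholders, and whether MGM expects a header row is an assumption worth checking with `mgm train --help`.

```python
import csv
import os
import tempfile

# Placeholder sample IDs and labels.
labels = {"S1": "gut", "S2": "soil", "S3": "gut"}

out_dir = tempfile.mkdtemp()
label_path = os.path.join(out_dir, "labels.csv")

with open(label_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["sample_id", "label"])  # header names are an assumption
    writer.writerows(labels.items())

# Read the file back to confirm the two-column layout.
with open(label_path, newline="") as handle:
    rows = list(csv.reader(handle))
print(rows)
```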
The `finetune` mode finetunes a pretrained MGM model for a specific task using labeled data.
- Input:
  - Microbiome corpus (`.pkl`)
  - Label file (`.csv`, two columns: sample ID and label)
  - Optional: Pretrained model (defaults to the MicroCorpus-260K pretrained model if not specified)
- Output: Finetuned MGM model
Example:
mgm finetune -i data/corpus.pkl -l data/labels.csv -m models/pretrained_model -o models/finetuned_model
The `predict` mode generates predictions using a finetuned MGM model. Optionally, it evaluates them against ground truth labels.
- Input:
  - Microbiome corpus (`.pkl`)
  - Optional: Label file (`.csv`) for evaluation
  - Supervised MGM model
- Output: Prediction results (`.csv`)
Example:
mgm predict -E -i data/corpus.pkl -l data/labels.csv -m models/finetuned_model -o data/predictions.csv
Note: Use `-E` with a label file to compare predictions with ground truth.
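Comparing predictions with ground truth comes down to matching the two by sample ID. The sketch below computes plain accuracy on in-memory dicts. It does not reproduce the metrics that `-E` reports or the layout of the prediction file, and the sample IDs and labels are invented.

```python
def accuracy(predicted, truth):
    """Fraction of samples present in both mappings whose labels agree."""
    shared = predicted.keys() & truth.keys()
    if not shared:
        raise ValueError("no sample IDs in common")
    correct = sum(predicted[sid] == truth[sid] for sid in shared)
    return correct / len(shared)

# Hypothetical predictions and ground truth keyed by sample ID.
predicted = {"S1": "gut", "S2": "soil", "S3": "gut", "S4": "marine"}
truth = {"S1": "gut", "S2": "gut", "S3": "gut", "S4": "marine"}

score = accuracy(predicted, truth)
print(f"accuracy = {score:.2f}")
```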
The `generate` mode generates synthetic microbiome data using a pretrained MGM model.
- Input:
  - Pretrained MGM model
  - Optional: Prompt file (`.txt`, one label per line) for labeled generation
- Output: Synthetic genus tensors (`.pkl`)
Example:
mgm generate -m models/generator_model -p data/prompt.txt -n 100 -o data/synthetic.pkl
Note: Use `-n` to specify the number of samples to generate.
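The prompt file holds one label per line. Below is a minimal sketch of writing such a file and reading it back. The labels are placeholders; in practice they would presumably need to match labels the generator was trained with.

```python
import os
import tempfile

prompt_labels = ["gut", "soil", "marine"]  # placeholder labels

prompt_path = os.path.join(tempfile.mkdtemp(), "prompt.txt")
with open(prompt_path, "w") as handle:
    # One label per line, with a trailing newline at the end of the file.
    handle.write("\n".join(prompt_labels) + "\n")

with open(prompt_path) as handle:
    read_back = [line.strip() for line in handle if line.strip()]
print(read_back)
```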
The `reconstruct` mode reconstructs abundance data from a ranked corpus, with optional training of a reconstructor model or label decoding.
- Input:
  - Abundance file (e.g., `csv`) for training the reconstructor, or a trained model checkpoint
  - Ranked corpus (`.pkl`) for reconstruction
  - Optional: Generator model and prompt file (text, one label per line) for labeled data
- Output:
  - Reconstructed corpus (`.pkl`)
  - Reconstructor model (if training)
  - Decoded labels (if applicable)
Examples:
mgm reconstruct -a data/abundance.csv -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructor
mgm reconstruct -r data/reconstructor_model.ckpt -i data/synthetic.pkl -g models/generator_model -w True -o data/reconstructed
For more details on any mode, run:
mgm <mode> --help
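The modes chain naturally: build a corpus, finetune on it, then predict. The sketch below only assembles command lines from the flags shown in the examples above and prints them. The paths are the same illustrative ones, and the helper name `mgm_command` is made up for this example.

```python
import shlex

def mgm_command(mode, **flags):
    """Build an argument list for `mgm <mode>`; keyword names are the short flag letters."""
    args = ["mgm", mode]
    for flag, value in flags.items():
        args += [f"-{flag}", str(value)]
    return args

pipeline = [
    mgm_command("construct", i="data/abundance.csv", o="data/corpus.pkl"),
    mgm_command("finetune", i="data/corpus.pkl", l="data/labels.csv",
                m="models/pretrained_model", o="models/finetuned_model"),
    mgm_command("predict", i="data/corpus.pkl", m="models/finetuned_model",
                o="data/predictions.csv"),
]

for args in pipeline:
    print(shlex.join(args))
    # To execute for real once MGM is installed: subprocess.run(args, check=True)
```

Keeping each step as an argument list rather than a single string avoids shell quoting problems when paths contain spaces.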
Name | Email | Organization |
---|---|---|
Haohong Zhang | [email protected] | PhD Student, School of Life Science and Technology, HUST |
Zixin Kang | [email protected] | Undergraduate, School of Life Science and Technology, HUST |
Kang Ning | [email protected] | Professor, School of Life Science and Technology, HUST |