This is the code for the framework of the paper Learning the syntax of plant assemblages, to be submitted to Nature Plants.
If you use this code for your work and wish to credit the authors, you can cite the paper (it will be submitted to arXiv very soon):
@article{leblanc2025learning,
title = {Learning the syntax of plant assemblages},
author = {Leblanc, César and Bonnet, Pierre and Servajean, Maximilien and Thuiller, Wilfried and Chytrý, Milan and Joly, Alexis},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2025},
}
This framework aims to leverage large language models to learn the "syntax" of plant species co-occurrence patterns. In particular, because Pl@ntBERT captures latent dependencies between species across diverse ecosystems, the framework can be used to identify the habitats of vegetation plots.
- Prerequisites
- Data
- Installation
- Examples
- Libraries
- Roadmap
- Unlicense
- Contributing
- Troubleshooting
- Team
- Structure
Python version 3.8 or higher, pip, Git, CUDA, and Git LFS are required.
On many systems Python comes pre-installed. You can try running the following command to check and see if a correct version is already installed:
python --version
If Python is not already installed, or if the installed version is 3.7 or lower, you will need to install a functional version of Python on your system by following the official documentation, which contains a detailed guide on how to set up Python.
Since Pl@ntBERT requires Python 3.8 or higher, pip should already be included by default (it ships with every Python version from 3.4 onwards). To make sure you have it, you can type:
pip --version
If pip is not installed, you can install it by following the instructions here.
To check whether git is already installed or not, you can run:
git --version
If git is not installed, please install it by following the official instructions here.
To check whether CUDA is already installed or not on your system, you can try running the following command:
nvcc --version
If it is not, make sure to follow the instructions here.
To check whether Git LFS is already installed or not on your system, you can try running the following command:
git-lfs --version
If Git LFS is not installed, please install it by following the official instructions here.
The framework is optimized for data files from the European Vegetation Archive (EVA). These files contain all the information required for the proper functioning of the framework, i.e., for each vegetation plot the full list of vascular plant species, the estimates of cover abundance of each species, the location, and the EUNIS classification. Once the database is downloaded (more information here), make sure you rename the species and header data files as species.csv and header.csv, respectively. Not all columns from the files are needed, but if you decide to remove some of them to save space on your computer, make sure that the values are comma-separated and that you keep at least:
- the columns PlotObservationID, Species and Cover from the species file (vegetation-plot data)
- the columns PlotObservationID, Habitat, Longitude and Latitude from the header file (plot attributes)
You can have other columns, but they will be ignored. Two examples of how your files should look are provided within the Data folder (species_example.csv and header_example.csv).
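To give a concrete idea of the expected layout, minimal species and header files could look like the two snippets below; the plot identifiers, species, cover values, habitat codes, and coordinates are made up and only illustrate the format:

PlotObservationID,Species,Cover
1,Vaccinium myrtillus,80
1,Picea abies,15
2,Quercus robur,60
2,Corylus avellana,25

PlotObservationID,Habitat,Longitude,Latitude
1,T3,10.48,46.52
2,T1,2.35,48.85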
Firstly, Pl@ntBERT can be installed via repository cloning:
git clone https://github.com/cesar-leblanc/plantbert.git Pl@ntBERT
cd Pl@ntBERT
Secondly, make sure that the dependencies listed in the environment.yml and requirements.txt files are installed. One way to do so is to use venv:
python -m venv ~/environments/pl@ntbert
source ~/environments/pl@ntbert/bin/activate
pip install -r requirements.txt
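Alternatively, since the repository also ships an environment.yml file, a Conda environment can be created from it. The environment name used below is only an assumption, so activate whichever name is actually defined in the file:

conda env create -f environment.yml
conda activate plantbert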
Thirdly, make sure you download the pre-trained and fine-tuned models:
git lfs install
git clone https://huggingface.co/CesarLeblanc/bert-base-uncased Models/bert-base-uncased
git clone https://huggingface.co/CesarLeblanc/bert-large-uncased Models/bert-large-uncased
git clone https://huggingface.co/CesarLeblanc/plantbert_fill_mask_model Models/plantbert_fill_mask_model
git clone https://huggingface.co/CesarLeblanc/plantbert_text_classification_model Models/plantbert_text_classification_model
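As a possible alternative to cloning with Git LFS, the same weights can be fetched programmatically with the huggingface_hub package. This is a minimal sketch, assuming the package is installed and the repository identifiers above are unchanged; the pre-trained backbones can be downloaded the same way:

from huggingface_hub import snapshot_download

# Download the fine-tuned models into the Models folder
snapshot_download(repo_id="CesarLeblanc/plantbert_fill_mask_model", local_dir="Models/plantbert_fill_mask_model")
snapshot_download(repo_id="CesarLeblanc/plantbert_text_classification_model", local_dir="Models/plantbert_text_classification_model")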
Starting from this point, all commands have to be launched within the Scripts folder:
cd Scripts
Then, to check that the installation went well, use the following command:
python main.py --pipeline check
If the framework was properly installed, it should output:
Files are all present.
Dependencies are correctly installed.
Environment is properly configured.
Make sure to place your species and header data files inside the Data folder before going further.
To pre-process the data from the European Vegetation Archive and create the fill-mask and text classification datasets:
python main.py --pipeline curation
Some changes can be made from this command to create another dataset. Here is an example to create a dataset with 5 different splits, with blocks of 30 arc-minutes (i.e., 0.5 degrees), and while considering that species and habitat types appearing fewer than 5 times are rare:
python main.py --pipeline curation --k_folds 5 --spacing 0.5 --occurrences 5
To train and evaluate a masked language model on the datasets previously obtained using cross validation, run the following command:
python main.py --pipeline masking
Some changes can be made from this command to evaluate other parameters. Here is an example to train the model with a batch size of 4 and a learning rate of 1e-5 for 10 epochs:
python main.py --pipeline masking --batch_size 4 --learning_rate 1e-5 --epochs 10
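Once a fill-mask model is available (for instance the one published on the Hugging Face Hub), it can also be queried directly with the Transformers library to suggest species that are likely to co-occur with the ones observed. This is only an illustrative sketch: the [MASK] token and the comma-separated, lowercase formatting of species names are assumptions, and the framework's own pipelines remain the reference for pre-processing:

from transformers import pipeline

# Load the published fill-mask model (a path to a locally trained model works as well)
fill_mask = pipeline("fill-mask", model="CesarLeblanc/plantbert_fill_mask_model")

# Ask the model which species are most likely to complete this assemblage
for prediction in fill_mask("vaccinium myrtillus, picea abies, [MASK]"):
    print(prediction["token_str"], prediction["score"])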
To train a habitat type classifier from the labeled dataset previously obtained and save its weights, run the following command:
python main.py --pipeline classification
Some changes can be made from this command to train another classifier. Here is an example to train a large model on a pair of (train, validation) sets while sorting the species in a random order:
python main.py --pipeline classification --model large --method random --folds 2
Before making predictions, make sure you include a new file that describes the vegetation data of your choice in the Datasets folder: vegetation_plots.csv. The file, tab-separated, should contain only one column (if there are other columns they will be ignored):
- Observations (strings): a list of comma-separated names of species, ranked (if possible) in order of abundance
An example of how your file should look is provided within the Datasets folder (vegetation_plots_example.csv).
To predict the missing species and habitat classes of the new samples using previously trained models, make sure the weights of the desired models are stored in the Models folder. You can also use the models already provided (i.e., the first fold of a base model trained on dominance-ordered species sequences with a batch size of 2 and a learning rate of 2e-5 that encodes binomial names as one token) and then run the following command:
python main.py --pipeline inference
Some changes can be made from this command to predict differently. Here is an example to predict the 3 most likely habitat types using the first fold of an already trained base model with a batch size of 2 and a learning rate of 1e-05 on randomly-ordered species sequences:
python main.py --pipeline inference --model_habitat plantbert_text_classification_model_base_random_1_1e-05_0 --predict_species False --k_habitat 3
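If you only need habitat predictions for a handful of plots, the published text classification model can also be queried directly with the Transformers library. This is a hedged sketch rather than the reference implementation: the input formatting and the label names returned are assumptions, and the top_k argument requires a reasonably recent version of Transformers:

from transformers import pipeline

# Load the published habitat classifier (a path to a locally fine-tuned model works as well)
classifier = pipeline("text-classification", model="CesarLeblanc/plantbert_text_classification_model")

# Predict the three most likely habitat types for an illustrative species assemblage
print(classifier("vaccinium myrtillus, picea abies, abies alba", top_k=3))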
To run the full pipeline and perform all tasks at once (i.e., checking if the framework is correctly installed, pre-processing data to create curated datasets, training and evaluating a masked language model, using it to fine-tune a habitat type classifier, and predicting missing species and habitat types of vegetation plots), run the following command:
python main.py --pipeline check curation masking classification inference
This section lists the major frameworks and libraries used to create the models included in the project:
- PyTorch - for tensor computation with strong GPU acceleration
- Scikit-learn - for quantifying the quality of the predictions
- Transformers - for pretrained models to perform training tasks
- Pandas - for fast, flexible, and expressive data structures
- Verde - for processing spatial data and interpolating it
This roadmap outlines the planned features and milestones for the project. Please note that the roadmap is subject to change and may be updated as the project progresses.
- Implement multilingual user support
- English
- French
- Integrate new popular LLMs
- BERT
- RoBERTa
- DistilBERT
- ALBERT
- BioBERT
- Add more habitat typologies
- EUNIS
- NPMS
- Include other data aggregators
- EVA
- TAVA
- Offer several powerful frameworks
- PyTorch
- TensorFlow
- JAX
- Allow data parallel training
- Multithreading
- Multiprocessing
- Supply different classification strategies
- Top-k classification
- Average-k classification
This framework is distributed under the Unlicense, meaning that it is dedicated to the public domain. See UNLICENSE.txt for more information.
If you plan to contribute new features, please first open an issue and discuss the feature with us. See CONTRIBUTING.md for more information.
- an internet connection is necessary for the check task (for GitHub access) and for the curation and inference tasks (for GBIF normalization).
- before using a model for inference, make sure you have trained this exact model (with the required set of parameters) on the required task.
- before curating a dataset, make sure it contains enough vegetation data (the more the better, both for vegetation plots and observations).
Pl@ntBERT is a community-driven project with several skillful engineers and researchers contributing to it.
Pl@ntBERT is currently maintained by César Leblanc with major contributions coming from Alexis Joly, Pierre Bonnet, Maximilien Servajean, and the amazing people from the Pl@ntNet Team in various forms and means.
.
├── .github -> GitHub-specific files
│ ├── ISSUE_TEMPLATE -> Templates for issues
│ │ ├── bug_report.md -> Bug report template
│ │ └── feature_request.md -> Feature request template
│ └── pull_request_template.md -> Pull request template
├── CODE_OF_CONDUCT.md -> Community guidelines
├── CONTRIBUTING.md -> Contribution instructions
├── Data -> Data files
│ ├── eunis_habitats.xlsx -> EUNIS habitat data
│ ├── header_example.csv -> Header example file
│ └── species_example.csv -> Species example data
├── Datasets -> Vegetation datasets
│ └── vegetation_plots_example.csv -> Vegetation plots example data
├── Images -> Image assets
│ └── logo.png -> Project logo
├── Models -> Pre-trained and fine-tuned models
├── README.md -> Project overview
├── SECURITY.md -> Security policy
├── Scripts -> Code scripts for the project
│ ├── __init__.py -> Package initialization
│ ├── cli.py -> Command-line interface
│ ├── data -> Data-related scripts
│ │ ├── __init__.py -> Package initialization
│ │ ├── load_data.py -> Load data scripts
│ │ ├── preprocess_data.py -> Preprocess data scripts
│ │ ├── save_data.py -> Save data scripts
│ │ └── utils_data.py -> Data utilities
│ ├── epoch -> Training and testing scripts
│ │ ├── __init__.py -> Package initialization
│ │ ├── test_epoch.py -> Test models per epoch
│ │ ├── train_epoch.py -> Train models per epoch
│ │ └── utils_epoch.py -> Epoch-related utilities
│ ├── main.py -> Main entry point
│ ├── metrics -> Metric computation
│ │ ├── __init__.py -> Package initialization
│ │ ├── accuracy.py -> Accuracy calculation
│ │ ├── f1.py -> F1-score calculation
│ │ ├── precision.py -> Precision calculation
│ │ └── recall.py -> Recall calculation
│ ├── modeling -> Model-related scripts
│ │ ├── __init__.py -> Package initialization
│ │ ├── load_modeling.py -> Load model scripts
│ │ ├── preprocess_modeling.py -> Preprocess input for models
│ │ ├── save_modeling.py -> Save trained models
│ │ └── utils_modeling.py -> Model utilities
│ ├── pipelines -> Task-specific pipelines
│ │ ├── __init__.py -> Package initialization
│ │ ├── check.py -> Debug pipelines
│ │ ├── classification.py -> Text classification pipeline
│ │ ├── curation.py -> Dataset curation pipeline
│ │ ├── inference.py -> Inference pipeline
│ │ └── masking.py -> Fill-mask pipeline
│ ├── submission_script.sh -> Job submission script
│ └── utils.py -> General utilities
├── UNLICENSE.txt -> Public domain license
├── environment.yml -> Conda environment file
└── requirements.txt -> Python dependencies