Merge pull request #132 from roboflow/release/maestro-1.0.0

florence_2.md, paligemma_2.md, qwen_2_5_vl.md docs added + maestro_qwen2_5_vl_json_extraction cookbook
roboflow · Feb 5, 2025 · 296969f
2 parents aa78e00 + 61f6e5b

Showing 14 changed files with 816 additions and 1,510 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -18,7 +18,7 @@ repos:
       - id: detect-private-key
       - id: pretty-format-json
         args: ['--autofix', '--no-sort-keys', '--indent=4']
-        exclude: ".*\\.ipynb$"
+        exclude: /.*\.ipynb
       - id: end-of-file-fixer
       - id: mixed-line-ending
 
@@ -35,7 +35,7 @@ repos:
       -   id: ruff
           args: [--fix, --exit-non-zero-on-fix]
       -   id: ruff-format
-          types_or: [ python, pyi, jupyter ]
+          types_or: [ python, pyi, jupyter]
 
   -   repo: https://github.com/pre-commit/mirrors-mypy
       rev: 'v1.14.1'
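
Review note on the `exclude` change in the first hunk: a minimal sketch of how the old and new patterns behave, assuming pre-commit treats `exclude` as a Python regular expression applied with `re.search` to repo-relative paths. The sample paths are hypothetical; only the two patterns come from this diff.

```python
import re

# Patterns as YAML would hand them to pre-commit.
old_pattern = re.compile(r".*\.ipynb$")  # from the double-quoted ".*\\.ipynb$"
new_pattern = re.compile(r"/.*\.ipynb")  # from the unquoted /.*\.ipynb

# Hypothetical repo-relative paths, for illustration only.
paths = [
    "demo.ipynb",
    "cookbooks/maestro_florence2_object_detection.ipynb",
    "cookbooks/.ipynb_checkpoints/demo-checkpoint.json",
]


def excluded(pattern, path):
    """Return True when the pattern is found anywhere in the path."""
    return pattern.search(path) is not None


for path in paths:
    print(f"{path}: old={excluded(old_pattern, path)} new={excluded(new_pattern, path)}")
```

Under these assumptions the new pattern no longer excludes a notebook at the repository root (it requires a leading `/`), and without the trailing `$` it also matches paths that merely contain `/.ipynb`, such as checkpoint directories. If the intent was to keep the old behaviour, the quoted form with `$` may be the safer choice.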

diff --git a/README.md b/README.md
@@ -2,6 +2,8 @@
 
   <h1>maestro</h1>
 
+  <h3>VLM fine-tuning for everyone</h3>
+
   <br>
 
   <div>

diff --git a/cookbooks/maestro_florence2_object_detection.ipynb b/cookbooks/maestro_florence2_object_detection.ipynb
diff --git a/cookbooks/maestro_florence2_visual_question_answering.ipynb b/cookbooks/maestro_florence2_visual_question_answering.ipynb
diff --git a/cookbooks/maestro_qwen2_5_vl_json_extraction.ipynb b/cookbooks/maestro_qwen2_5_vl_json_extraction.ipynb
diff --git a/docs/florence-2.md b/docs/florence-2.md
diff --git a/docs/index.md b/docs/index.md
@@ -2,6 +2,8 @@
 
   <h1>maestro</h1>
 
+  <h3>VLM fine-tuning for everyone</h3>
+
   <br>
 
   <div>

diff --git a/docs/metrics.md b/docs/metrics.md
diff --git a/docs/models/florence_2.md b/docs/models/florence_2.md
@@ -0,0 +1,76 @@
+## Overview
+
+Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license. It offers strong zero-shot and fine-tuning capabilities for tasks such as image captioning, object detection, visual grounding, and segmentation. Despite its compact size, training on the extensive FLD-5B dataset (126 million images and 5.4 billion annotations) enables Florence-2 to perform on par with much larger models like Kosmos-2. You can try out the model via HF Spaces, Google Colab, or our interactive playground.
+
+## Install
+
+```bash
+pip install maestro[florence_2]
+```
+
+## Train
+
+The training routines support optimization strategies such as LoRA and freezing the vision encoder. Customize your fine-tuning process via CLI or Python to align with your dataset and task requirements.
+
+### CLI
+
+Kick off training from the command line by running the command below. Be sure to replace the dataset path and adjust the hyperparameters (such as epochs and batch size) to suit your needs.
+
+```bash
+maestro florence_2 train \
+  --dataset "dataset/location" \
+  --epochs 10 \
+  --batch-size 4 \
+  --optimization_strategy "lora" \
+  --metrics "edit_distance"
+```
+
+### Python
+
+For more control, you can fine-tune Florence-2 using the Python API. Create a configuration dictionary with your training parameters and pass it to the train function to integrate the process into your custom workflow.
+
+```python
+from maestro.trainer.models.florence_2.core import train
+
+config = {
+    "dataset": "dataset/location",
+    "epochs": 10,
+    "batch_size": 4,
+    "optimization_strategy": "lora",
+    "metrics": ["edit_distance"],
+}
+
+train(config)
+```
+
+## Load
+
+Load a pre-trained or fine-tuned Florence-2 model along with its processor using the load_model function. Specify your model's path and the desired optimization strategy.
+
+```python
+from maestro.trainer.models.florence_2.checkpoints import (
+    OptimizationStrategy, load_model)
+
+processor, model = load_model(
+    model_id_or_path="model/location",
+    optimization_strategy=OptimizationStrategy.NONE
+)
+```
+
+## Predict
+
+Perform inference with Florence-2 using the predict function. Supply an image and a text prefix to obtain predictions, such as object detection outputs or captions.
+
+```python
+from maestro.trainer.common.datasets import RoboflowJSONLDataset
+from maestro.trainer.models.florence_2.inference import predict
+
+ds = RoboflowJSONLDataset(
+    jsonl_file_path="dataset/location/test/annotations.jsonl",
+    image_directory_path="dataset/location/test",
+)
+
+image, entry = ds[0]
+
+predict(model=model, processor=processor, image=image, prefix=entry["prefix"])
+```
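
Review note on the Predict section: the snippet reads `entry["prefix"]` from `annotations.jsonl`, but the file layout is not shown in this diff. The sketch below uses only the standard library and assumes one JSON object per line with `image`, `prefix`, and `suffix` keys. Only `prefix` is confirmed by the doc; the other keys, the sample values, and the `read_jsonl` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records; keys other than "prefix" are assumptions.
records = [
    {"image": "0001.jpg", "prefix": "What is shown?", "suffix": "a forklift"},
    {"image": "0002.jpg", "prefix": "What is shown?", "suffix": "a pallet"},
]

with tempfile.TemporaryDirectory() as tmp:
    # Write the sample records, one JSON object per line.
    annotations = Path(tmp) / "annotations.jsonl"
    annotations.write_text(
        "\n".join(json.dumps(record) for record in records) + "\n",
        encoding="utf-8",
    )
    entries = read_jsonl(annotations)

for entry in entries:
    print(entry["image"], "->", entry["prefix"])
```

A short sample like this in the docs could help readers check their dataset export before running `predict`.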
diff --git a/docs/models/paligemma_2.md b/docs/models/paligemma_2.md
@@ -0,0 +1,77 @@
+## Overview
+
+PaliGemma 2 is an updated and significantly enhanced version of the original PaliGemma vision-language model (VLM). By combining the efficient SigLIP-So400m vision encoder with the robust Gemma 2 language model, PaliGemma 2 processes images at multiple resolutions and fuses visual and textual inputs to deliver strong performance across diverse tasks such as captioning, visual question answering (VQA), optical character recognition (OCR), object detection, and instance segmentation. Fine-tuning enables users to adapt the model to specific tasks while leveraging its scalable architecture.
+
+## Install
+
+```bash
+pip install maestro[paligemma_2]
+```
+
+## Train
+
+The training routines support various optimization strategies such as LoRA, QLoRA, and freezing the vision encoder. Customize your fine-tuning process via CLI or Python to align with your dataset and task requirements.
+
+### CLI
+
+Kick off training from the command line by running the command below. Be sure to replace the dataset path and adjust the hyperparameters (such as epochs and batch size) to suit your needs.
+
+```bash
+maestro paligemma_2 train \
+  --dataset "dataset/location" \
+  --epochs 10 \
+  --batch-size 4 \
+  --optimization_strategy "qlora" \
+  --metrics "edit_distance"
+```
+
+### Python
+
+For more control, you can fine-tune PaliGemma 2 using the Python API. Create a configuration dictionary with your training parameters and pass it to the train function to integrate the process into your custom workflow.
+
+```python
+from maestro.trainer.models.paligemma_2.core import train
+
+config = {
+    "dataset": "dataset/location",
+    "epochs": 10,
+    "batch_size": 4,
+    "optimization_strategy": "qlora",
+    "metrics": ["edit_distance"],
+}
+
+train(config)
+```
+
+## Load
+
+Load a pre-trained or fine-tuned PaliGemma 2 model along with its processor using the load_model function. Specify your model's path and the desired optimization strategy.
+
+```python
+from maestro.trainer.models.paligemma_2.checkpoints import (
+    OptimizationStrategy, load_model
+)
+
+processor, model = load_model(
+    model_id_or_path="model/location",
+    optimization_strategy=OptimizationStrategy.NONE
+)
+```
+
+## Predict
+
+Perform inference with PaliGemma 2 using the predict function. Supply an image and a text prefix to obtain predictions, such as object detection outputs or captions.
+
+```python
+from maestro.trainer.common.datasets import RoboflowJSONLDataset
+from maestro.trainer.models.paligemma_2.inference import predict
+
+ds = RoboflowJSONLDataset(
+    jsonl_file_path="dataset/location/test/annotations.jsonl",
+    image_directory_path="dataset/location/test",
+)
+
+image, entry = ds[0]
+
+predict(model=model, processor=processor, image=image, prefix=entry["prefix"])
+```
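
Review note on the Train sections: both docs name LoRA as an optimization strategy without saying what it saves. A minimal sketch of the arithmetic, assuming the standard LoRA formulation in which a `d_out x d_in` weight matrix stays frozen and a low-rank update `B @ A` (with `B` of shape `d_out x r` and `A` of shape `r x d_in`) is trained instead. The layer size and rank are illustrative and not taken from maestro or either model.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only.
d_out, d_in, rank = 1024, 1024, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
ratio = lora / full

print(f"full fine-tuning: {full} parameters")
print(f"LoRA (rank {rank}): {lora} parameters ({ratio:.2%} of full)")
```

For this example layer, LoRA trains 16,384 parameters instead of 1,048,576, which is about 1.56% of the full count.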