PaliGemma 2
Overview
PaliGemma 2 is the successor to the original PaliGemma vision-language model (VLM). It pairs the SigLIP-So400m vision encoder with the Gemma 2 language model, accepts images at multiple resolutions, and fuses visual and textual inputs to handle tasks such as captioning, visual question answering (VQA), optical character recognition (OCR), object detection, and instance segmentation. Fine-tuning adapts the model to a specific task and dataset.
Install
Train
The training routines support several optimization strategies, including LoRA, QLoRA, and freezing the vision encoder. Configure fine-tuning through the CLI or the Python API to match your dataset and task.
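To see why LoRA-style strategies lower the cost of fine-tuning, it helps to count trainable parameters. The sketch below is plain arithmetic and not part of maestro. It assumes the standard LoRA formulation, in which a frozen `d x k` weight matrix receives a trainable low-rank update built from a `d x r` and an `r x k` matrix. The function names, layer size, and rank are hypothetical values chosen for illustration.

```python
def full_param_count(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update (d x r plus r x k)."""
    return r * (d + k)


# Hypothetical layer size and rank, for illustration only.
d, k, r = 2048, 2048, 8

full = full_param_count(d, k)
lora = lora_param_count(d, k, r)

print(f"full fine-tuning: {full} trainable parameters")
print(f"LoRA (r={r}): {lora} trainable parameters ({lora / full:.2%} of full)")
```

For this hypothetical layer, the low-rank update trains under 1% of the parameters that full fine-tuning would. QLoRA follows the same idea while additionally storing the frozen base weights in quantized form, which reduces memory further.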
CLI
Kick off training from the command line by running the command below. Be sure to replace the dataset path and adjust the hyperparameters (such as epochs and batch size) to suit your needs.
maestro paligemma_2 train \
  --dataset "dataset/location" \
  --epochs 10 \
  --batch-size 4 \
  --optimization_strategy "qlora" \
  --metrics "edit_distance"
Python
For more control, you can fine-tune PaliGemma 2 using the Python API. Create a configuration dictionary with your training parameters and pass it to the train function to integrate the process into your custom workflow.
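from maestro.trainer.models.paligemma_2.core import train

# Training parameters; the keys correspond to the CLI flags shown above.
config = {
    "dataset": "dataset/location",
    "epochs": 10,
    "batch_size": 4,
    "optimization_strategy": "qlora",
    "metrics": ["edit_distance"],
}

# Start fine-tuning with the configuration above.
train(config)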
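Both training examples request the `edit_distance` metric. As a rough illustration of what such a metric measures, the sketch below computes the Levenshtein distance between a predicted string and a reference string using only the standard library. It is not maestro's implementation, which may differ (for example in how it normalizes the score). The names `edit_distance` and `normalized_edit_distance` are hypothetical helpers introduced for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the longer string's length (one possible choice)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))  # 3
```

A lower score means the prediction is closer to the reference text, which makes this kind of metric a natural fit for OCR-style and captioning outputs.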
Load
Load a pre-trained or fine-tuned PaliGemma 2 model along with its processor using the load_model function. Specify your model's path and the desired optimization strategy.
from maestro.trainer.models.paligemma_2.checkpoints import (
    OptimizationStrategy,
    load_model,
)

processor, model = load_model(
    model_id_or_path="model/location",
    optimization_strategy=OptimizationStrategy.NONE,
)
Predict
Perform inference with PaliGemma 2 using the predict function. Supply an image and a text prefix to obtain predictions, such as object detection outputs or captions.
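from maestro.trainer.common.datasets.jsonl import JSONLDataset
from maestro.trainer.models.paligemma_2.inference import predict

# Load the test split: annotations plus the directory holding the images.
ds = JSONLDataset(
    jsonl_file_path="dataset/location/test/annotations.jsonl",
    image_directory_path="dataset/location/test",
)

# Each item is an (image, entry) pair; entry["prefix"] holds the text prompt.
image, entry = ds[0]

# `model` and `processor` are the objects returned by load_model in the Load section.
predict(model=model, processor=processor, image=image, prefix=entry["prefix"])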
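The example above reads `entry["prefix"]` from an `annotations.jsonl` file. The sketch below shows what such a file might look like and how to read it with the standard library alone. Only the `prefix` key appears in the example above. The `image` and `suffix` keys, the sample values, and the `write_jsonl` and `read_jsonl` helpers are assumptions made for illustration, so check your own dataset for the exact schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: one JSON object per line, one line per image.
records = [
    {"image": "0001.jpg", "prefix": "caption en", "suffix": "a dog on a beach"},
    {"image": "0002.jpg", "prefix": "caption en", "suffix": "two cups on a table"},
]


def write_jsonl(path: Path, rows: list) -> None:
    """Write each record as a single JSON line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> list:
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    annotations_path = Path(tmp) / "annotations.jsonl"
    write_jsonl(annotations_path, records)
    entries = read_jsonl(annotations_path)

print(entries[0]["prefix"])  # caption en
```

Under these assumptions, the `prefix` is the prompt passed to the model and the `suffix` is the expected output, which is the text a metric would compare predictions against.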