
Florence-2

Overview

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license. The model demonstrates strong zero-shot and fine-tuning capabilities across tasks such as captioning, object detection, grounding, and segmentation.

Architecture

The model takes images and task prompts as input, generating the desired results in text format. It uses a DaViT vision encoder to convert images into visual token embeddings. These are then concatenated with BERT-generated text embeddings and processed by a transformer-based multi-modal encoder-decoder to generate the response.

Figure: Overview of the Florence-2 architecture. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

Fine-tuning Examples

Dataset Format

The Florence-2 model expects a specific dataset structure for training and evaluation. The dataset should be organized into train, test, and validation splits (the train/, test/, and valid/ directories shown below), with each split containing image files and an annotations.jsonl file.

dataset/
├── train/
│   ├── 123e4567-e89b-12d3-a456-426614174000.png
│   ├── 987f6543-a21c-43c3-a562-926514273001.png
│   ├── ...
│   ├── annotations.jsonl
├── test/
│   ├── 456b7890-e32d-44f5-b678-724564172002.png
│   ├── 678c1234-e45b-67f6-c789-813264172003.png
│   ├── ...
│   ├── annotations.jsonl
└── valid/
    ├── 789d2345-f67c-89d7-e891-912354172004.png
    ├── 135e6789-d89f-12e3-f012-456464172005.png
    ├── ...
    └── annotations.jsonl
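The layout above can be checked before training with a short script. This is a minimal sketch, not part of maestro: the function name `summarize_dataset` is hypothetical, and it assumes images use the `.png` extension as in the tree. It reports missing splits as `None` rather than failing, since this page does not state whether every split is mandatory.

```python
from pathlib import Path
from typing import Dict, Optional

# Split directory names taken from the tree above.
SPLITS = ("train", "test", "valid")


def summarize_dataset(root: str) -> Dict[str, Optional[int]]:
    """Map each split to its PNG count, or to None if the split directory is absent."""
    summary: Dict[str, Optional[int]] = {}
    for split in SPLITS:
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            summary[split] = None
            continue
        # A split that exists but has no annotations file cannot be used.
        if not (split_dir / "annotations.jsonl").is_file():
            raise FileNotFoundError(f"missing annotations.jsonl in {split_dir}")
        summary[split] = sum(1 for _ in split_dir.glob("*.png"))
    return summary
```

Calling `summarize_dataset('<DATASET_PATH>')` before a run gives a quick view of which splits are present and how many images each contains.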

Depending on the vision task being performed, the structure of the annotations.jsonl file will vary slightly.

Warning

The dataset samples shown below are formatted for improved readability, with each JSON structure spread across multiple lines. In practice, the annotations.jsonl file must contain each JSON object on a single line, without any line breaks between the key-value pairs. Make sure to adhere to this structure to avoid parsing errors during model training.

Object Detection

{
    "image":"123e4567-e89b-12d3-a456-426614174000.png",
    "prefix":"<OD>",
    "suffix":"9 of clubs<loc_138><loc_100><loc_470><loc_448>10 of clubs<loc_388><loc_145><loc_670><loc_453>"
}
{
    "image":"987f6543-a21c-43c3-a562-926514273001.png",
    "prefix":"<OD>",
    "suffix":"5 of clubs<loc_554><loc_2><loc_763><loc_467>6 of clubs<loc_399><loc_79><loc_555><loc_466>"
}
...
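The object detection suffix packs each label together with four `<loc_N>` tokens. The sketch below splits such a string into labels and integer tuples and joins them back. The helper names are hypothetical and not part of maestro. This page does not say what the four numbers mean (coordinate order, scale), so the sketch keeps them as raw integers.

```python
import re
from typing import List, Tuple

Detection = Tuple[str, Tuple[int, ...]]

# One detection = a label followed by exactly four <loc_N> tokens, as in the samples above.
DETECTION_PATTERN = re.compile(r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def parse_od_suffix(suffix: str) -> List[Detection]:
    """Split an <OD> suffix into (label, four location values) pairs."""
    detections = []
    for label, *values in DETECTION_PATTERN.findall(suffix):
        detections.append((label.strip(), tuple(int(v) for v in values)))
    return detections


def format_od_suffix(detections: List[Detection]) -> str:
    """Inverse of parse_od_suffix: rebuild the suffix string."""
    return "".join(
        label + "".join(f"<loc_{v}>" for v in values) for label, values in detections
    )
```

Round-tripping a suffix through `parse_od_suffix` and `format_od_suffix` is a cheap way to catch malformed annotations before training.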

Visual Question Answering (VQA)

{
    "image":"123e4567-e89b-12d3-a456-426614174000.png",
    "prefix":"<VQA> Is the value of Favorable 38 in 2015?",
    "suffix":"Yes"
}
{
    "image":"987f6543-a21c-43c3-a562-926514273001.png",
    "prefix":"<VQA> How many values are below 40 in Unfavorable graph?",
    "suffix":"6"
}
...

Optical Character Recognition (OCR)

{
    "image":"123e4567-e89b-12d3-a456-426614174000.png",
    "prefix":"<OCR>",
    "suffix":"ke begherte Die mi"
}
{
    "image":"987f6543-a21c-43c3-a562-926514273001.png",
    "prefix":"<OCR>",
    "suffix":"mi uort in de middelt"
}
...
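The one-object-per-line requirement from the warning above is easy to meet with the standard `json` module, because `json.dumps` without an `indent` argument never emits line breaks. A minimal sketch follows; the helper names are illustrative and not part of maestro.

```python
import json
from typing import Dict, List


def write_annotations(path: str, records: List[Dict[str, str]]) -> None:
    """Write one compact JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # No indent argument, so each record stays on a single line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_annotations(path: str) -> List[Dict[str, str]]:
    """Read a JSONL file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_annotations("annotations.jsonl", [{"image": "123e4567-e89b-12d3-a456-426614174000.png", "prefix": "<OCR>", "suffix": "ke begherte Die mi"}])` produces a file with exactly one line.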

CLI

Tip

Depending on your GPU, you may need to adjust --batch-size so that training fits within memory limits. GPUs with more memory can use a larger batch size for better throughput.

Tip

Depending on the vision task you are executing, you may need to select different vision metrics. For example, tasks like object detection typically use mean_average_precision, while VQA and OCR tasks use metrics like word_error_rate and character_error_rate.

Tip

You may need to use different learning rates depending on the task. We have found that lower learning rates work better for tasks like OCR or VQA, as these tasks require more precision.
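To make the text metrics mentioned above concrete, here is a sketch of the textbook definitions of word error rate and character error rate: the edit distance between prediction and reference, divided by the reference length. This is not maestro's implementation, which may normalize or tokenize differently. The function names are illustrative.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def word_error_rate(reference: str, prediction: str) -> float:
    """Edit distance over whitespace-separated words, relative to reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / max(len(ref_words), 1)


def character_error_rate(reference: str, prediction: str) -> float:
    """Edit distance over characters, relative to reference length."""
    return edit_distance(reference, prediction) / max(len(reference), 1)
```

A single wrong letter in a five-word reference counts as one wrong word out of five for the word-level metric, but only one wrong character out of the whole string for the character-level metric. That difference is why the two are often tracked together for OCR.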

# Object detection
maestro florence2 train --dataset='<DATASET_PATH>' \
--epochs=10 --batch-size=8 --lr=5e-6 --metrics=mean_average_precision

# VQA or OCR
maestro florence2 train --dataset='<DATASET_PATH>' \
--epochs=10 --batch-size=8 --lr=1e-6 \
--metrics=word_error_rate, character_error_rate

SDK

# Object detection
from maestro.trainer.common import MeanAveragePrecisionMetric
from maestro.trainer.models.florence_2 import train, Configuration

config = Configuration(
    dataset='<DATASET_PATH>',
    epochs=10,
    batch_size=8,
    lr=5e-6,
    metrics=[MeanAveragePrecisionMetric()]
)

train(config)

# VQA or OCR
from maestro.trainer.common import WordErrorRateMetric, CharacterErrorRateMetric
from maestro.trainer.models.florence_2 import train, Configuration

config = Configuration(
    dataset='<DATASET_PATH>',
    epochs=10,
    batch_size=8,
    lr=1e-6,
    metrics=[WordErrorRateMetric(), CharacterErrorRateMetric()]
)

train(config)

API

Configuration for a Florence-2 model.

This class encapsulates all the parameters needed for training a Florence-2 model, including dataset paths, model specifications, training hyperparameters, and output settings.

Attributes:

dataset (str): Path to the dataset used for training.
model_id (str): Identifier for the Florence-2 model.
revision (str): Revision of the model to use.
device (torch.device): Device to use for training.
cache_dir (Optional[str]): Directory to cache the model.
epochs (int): Number of training epochs.
optimizer (Literal['sgd', 'adamw', 'adam']): Optimizer to use for training.
lr (float): Learning rate for the optimizer.
lr_scheduler (Literal['linear', 'cosine', 'polynomial']): Learning rate scheduler.
batch_size (int): Batch size for training.
val_batch_size (Optional[int]): Batch size for validation.
num_workers (int): Number of workers for data loading.
val_num_workers (Optional[int]): Number of workers for validation data loading.
lora_r (int): Rank of the LoRA update matrices.
lora_alpha (int): Scaling factor for the LoRA update.
lora_dropout (float): Dropout probability for LoRA layers.
bias (Literal['none', 'all', 'lora_only']): Which bias to train.
use_rslora (bool): Whether to use RSLoRA.
init_lora_weights (Union[bool, LoraInitLiteral]): How to initialize LoRA weights.
output_dir (str): Directory to save output files.
metrics (List[BaseMetric]): List of metrics to track during training.
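The interaction between lora_r, lora_alpha, and use_rslora can be sketched numerically. The snippet assumes these fields are passed through to the Hugging Face PEFT library, which this page does not state explicitly. In PEFT, standard LoRA multiplies the low-rank update by lora_alpha / r, while rank-stabilized LoRA uses lora_alpha / sqrt(r). The function below is illustrative only.

```python
import math


def lora_scaling(lora_alpha: int, lora_r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update (PEFT convention, assumed here)."""
    if use_rslora:
        # Rank-stabilized LoRA: divide by sqrt(r) so the update does not shrink as fast with rank.
        return lora_alpha / math.sqrt(lora_r)
    # Standard LoRA: divide by r.
    return lora_alpha / lora_r


# Values matching the Configuration defaults in the source listing (lora_r=8, lora_alpha=8).
standard = lora_scaling(lora_alpha=8, lora_r=8, use_rslora=False)
rank_stabilized = lora_scaling(lora_alpha=8, lora_r=8, use_rslora=True)
```

Under this assumption, raising lora_r while keeping lora_alpha fixed reduces the effective scale more slowly with use_rslora=True than without it, which is worth keeping in mind when tuning the learning rate alongside the rank.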

Source code in maestro/trainer/models/florence_2/core.py
@dataclass(frozen=True)
class Configuration:
    """Configuration for a Florence-2 model.

    This class encapsulates all the parameters needed for training a Florence-2 model,
    including dataset paths, model specifications, training hyperparameters, and output
    settings.

    Attributes:
        dataset (str): Path to the dataset used for training.
        model_id (str): Identifier for the Florence-2 model.
        revision (str): Revision of the model to use.
        device (torch.device): Device to use for training.
        cache_dir (Optional[str]): Directory to cache the model.
        epochs (int): Number of training epochs.
        optimizer (Literal["sgd", "adamw", "adam"]): Optimizer to use for training.
        lr (float): Learning rate for the optimizer.
        lr_scheduler (Literal["linear", "cosine", "polynomial"]): Learning rate
            scheduler.
        batch_size (int): Batch size for training.
        val_batch_size (Optional[int]): Batch size for validation.
        num_workers (int): Number of workers for data loading.
        val_num_workers (Optional[int]): Number of workers for validation data loading.
        lora_r (int): Rank of the LoRA update matrices.
        lora_alpha (int): Scaling factor for the LoRA update.
        lora_dropout (float): Dropout probability for LoRA layers.
        bias (Literal["none", "all", "lora_only"]): Which bias to train.
        use_rslora (bool): Whether to use RSLoRA.
        init_lora_weights (Union[bool, LoraInitLiteral]): How to initialize LoRA
            weights.
        output_dir (str): Directory to save output files.
        metrics (List[BaseMetric]): List of metrics to track during training.
    """

    dataset: str
    model_id: str = DEFAULT_FLORENCE2_MODEL_ID
    revision: str = DEFAULT_FLORENCE2_MODEL_REVISION
    device: torch.device = DEVICE
    cache_dir: Optional[str] = None
    epochs: int = 10
    optimizer: Literal["sgd", "adamw", "adam"] = "adamw"
    lr: float = 1e-5
    lr_scheduler: Literal["linear", "cosine", "polynomial"] = "linear"
    batch_size: int = 4
    val_batch_size: Optional[int] = None
    num_workers: int = 0
    val_num_workers: Optional[int] = None
    lora_r: int = 8
    lora_alpha: int = 8
    lora_dropout: float = 0.05
    bias: Literal["none", "all", "lora_only"] = "none"
    use_rslora: bool = True
    init_lora_weights: Union[bool, LoraInitLiteral] = "gaussian"
    output_dir: str = "./training/florence-2"
    metrics: list[BaseMetric] = field(default_factory=list)

Train a Florence-2 model using the provided configuration.

This function sets up the training environment, prepares the model and data loaders, and runs the training loop. It also handles metric tracking and checkpoint saving.

Parameters:

config (Configuration, required): The configuration object containing all necessary parameters for training.

Returns:

None

Raises:

ValueError: If an unsupported optimizer is specified in the configuration.

Source code in maestro/trainer/models/florence_2/core.py
def train(config: Configuration) -> None:
    """Train a Florence-2 model using the provided configuration.

    This function sets up the training environment, prepares the model and data loaders,
    and runs the training loop. It also handles metric tracking and checkpoint saving.

    Args:
        config (Configuration): The configuration object containing all necessary
            parameters for training.

    Returns:
        None

    Raises:
        ValueError: If an unsupported optimizer is specified in the configuration.
    """
    make_it_reproducible(avoid_non_deterministic_algorithms=False)
    run_dir = create_new_run_directory(
        base_output_dir=config.output_dir,
    )
    config = replace(
        config,
        output_dir=run_dir,
    )
    checkpoint_manager = CheckpointManager(run_dir)

    processor, model = load_model(
        model_id_or_path=config.model_id,
        revision=config.revision,
        device=config.device,
        cache_dir=config.cache_dir,
    )
    train_loader, val_loader, test_loader = create_data_loaders(
        dataset_location=config.dataset,
        train_batch_size=config.batch_size,
        processor=processor,
        device=config.device,
        num_workers=config.num_workers,
        test_loaders_workers=config.val_num_workers,
    )
    peft_model = prepare_peft_model(
        model=model,
        r=config.lora_r,
        lora_alpha=config.lora_alpha,
        lora_dropout=config.lora_dropout,
        bias=config.bias,
        use_rslora=config.use_rslora,
        init_lora_weights=config.init_lora_weights,
        revision=config.revision,
    )
    training_metrics_tracker = MetricsTracker.init(metrics=["loss"])
    metrics = ["loss"]
    for metric in config.metrics:
        metrics += metric.describe()
    validation_metrics_tracker = MetricsTracker.init(metrics=metrics)

    run_training_loop(
        processor=processor,
        model=peft_model,
        data_loaders=(train_loader, val_loader),
        config=config,
        training_metrics_tracker=training_metrics_tracker,
        validation_metrics_tracker=validation_metrics_tracker,
        checkpoint_manager=checkpoint_manager,
    )

    save_metric_plots(
        training_tracker=training_metrics_tracker,
        validation_tracker=validation_metrics_tracker,
        output_dir=os.path.join(config.output_dir, "metrics"),
    )
    training_metrics_tracker.as_json(output_dir=os.path.join(config.output_dir, "metrics"), filename="training.json")
    validation_metrics_tracker.as_json(
        output_dir=os.path.join(config.output_dir, "metrics"), filename="validation.json"
    )

    # Log out paths for latest and best checkpoints
    print(f"Latest checkpoint saved at: {checkpoint_manager.latest_checkpoint_dir}")
    print(f"Best checkpoint saved at: {checkpoint_manager.best_checkpoint_dir}")
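Because Configuration is declared with `@dataclass(frozen=True)`, `train` cannot assign to `config.output_dir` directly. It calls `replace` to build a copy that points at the new run directory. The sketch below reproduces that pattern with a hypothetical `ToyConfig` stand-in and a made-up run directory name, since the naming scheme of `create_new_run_directory` is not shown here.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical two-field stand-in for Configuration."""

    dataset: str
    output_dir: str = "./training/florence-2"


config = ToyConfig(dataset="<DATASET_PATH>")

try:
    # Direct assignment is rejected on a frozen dataclass.
    config.output_dir = "./training/florence-2/run-1"
    assignment_failed = False
except FrozenInstanceError:
    assignment_failed = True

# replace() returns a new instance and leaves the original untouched.
run_config = replace(config, output_dir="./training/florence-2/run-1")
```

One consequence of this design is that the Configuration object passed to `train` is never mutated, so the same instance can safely be reused for a later call.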

Evaluate a Florence-2 model using the provided configuration.

This function loads the model and data, runs predictions on the evaluation dataset, computes specified metrics, and saves the results.

Parameters:

config (Configuration, required): The configuration object containing all necessary parameters for evaluation.

Returns:

None

Source code in maestro/trainer/models/florence_2/core.py
def evaluate(config: Configuration) -> None:
    """Evaluate a Florence-2 model using the provided configuration.

    This function loads the model and data, runs predictions on the evaluation dataset,
    computes specified metrics, and saves the results.

    Args:
        config (Configuration): The configuration object containing all necessary
            parameters for evaluation.

    Returns:
        None
    """
    processor, model = load_model(
        model_id_or_path=config.model_id,
        revision=config.revision,
        device=config.device,
        cache_dir=config.cache_dir,
    )
    train_loader, val_loader, test_loader = create_data_loaders(
        dataset_location=config.dataset,
        train_batch_size=config.batch_size,
        processor=processor,
        device=config.device,
        num_workers=config.num_workers,
        test_loaders_workers=config.val_num_workers,
    )
    evaluation_loader = test_loader if test_loader is not None else val_loader

    metrics = []
    for metric in config.metrics:
        metrics += metric.describe()
    evaluation_metrics_tracker = MetricsTracker.init(metrics=metrics)

    # Run inference once for all metrics
    _, expected_answers, generated_answers, images = run_predictions(
        loader=evaluation_loader, processor=processor, model=model
    )

    for metric in config.metrics:
        if isinstance(metric, MeanAveragePrecisionMetric):
            classes = get_unique_detection_classes(train_loader.dataset)
            targets, predictions = process_output_for_detection_metric(
                expected_answers=expected_answers,
                generated_answers=generated_answers,
                images=images,
                classes=classes,
                processor=processor,
            )
            result = metric.compute(targets=targets, predictions=predictions)
            for key, value in result.items():
                evaluation_metrics_tracker.register(
                    metric=key,
                    epoch=1,
                    step=1,
                    value=value,
                )
        else:
            predictions = process_output_for_text_metric(
                generated_answers=generated_answers,
                images=images,
                processor=processor,
            )
            result = metric.compute(targets=expected_answers, predictions=predictions)
            for key, value in result.items():
                evaluation_metrics_tracker.register(
                    metric=key,
                    epoch=1,
                    step=1,
                    value=value,
                )

    evaluation_metrics_tracker.as_json(
        output_dir=os.path.join(config.output_dir, "metrics"), filename="evaluation.json"
    )
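Both `train` and `evaluate` only call two methods on each metric: `describe()`, whose result is appended to a list of tracked names, and `compute(targets=..., predictions=...)`, whose result is iterated with `.items()`. The sketch below shows a hypothetical metric with that shape. The actual BaseMetric definition is not shown on this page, so the signatures are inferred from those calls, and a real metric would presumably subclass BaseMetric.

```python
from typing import Dict, List


class ExactMatchMetric:
    """Hypothetical text metric: fraction of predictions equal to their target."""

    def describe(self) -> List[str]:
        # Names under which results are registered in the metrics tracker.
        return ["exact_match"]

    def compute(self, targets: List[str], predictions: List[str]) -> Dict[str, float]:
        if not targets:
            return {"exact_match": 0.0}
        matches = sum(t.strip() == p.strip() for t, p in zip(targets, predictions))
        return {"exact_match": matches / len(targets)}


metric = ExactMatchMetric()

# Mirrors the name-collection loop in evaluate().
tracked_names: List[str] = []
tracked_names += metric.describe()

result = metric.compute(targets=["Yes", "6"], predictions=["Yes", "5"])
```

In `evaluate`, a metric like this would take the `else` branch, where targets are the expected answers and predictions come from `process_output_for_text_metric`.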