Florence-2

Overview

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license. The model demonstrates strong zero-shot and fine-tuning capabilities across tasks such as captioning, object detection, grounding, and segmentation.


Architecture

The model takes images and task prompts as input, generating the desired results in text format. It uses a DaViT vision encoder to convert images into visual token embeddings. These are then concatenated with BERT-generated text embeddings and processed by a transformer-based multi-modal encoder-decoder to generate the response.
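To make the data flow concrete, the sketch below mimics the concatenation step with plain Python lists standing in for tensors. The token counts and embedding width are made-up illustrative values, not Florence-2's real dimensions, and `fake_embeddings` is a hypothetical stand-in for the encoders.

```python
# Illustrative sizes only: these are assumptions, not Florence-2's actual dimensions.
NUM_VISUAL_TOKENS = 6  # tokens produced by the vision encoder for one image
NUM_TEXT_TOKENS = 3    # tokens in the task prompt
EMBED_DIM = 4          # embedding width shared by both modalities


def fake_embeddings(num_tokens: int, dim: int, fill: float) -> list[list[float]]:
    """Hypothetical stand-in for an encoder: one `dim`-wide vector per token."""
    return [[fill] * dim for _ in range(num_tokens)]


visual_tokens = fake_embeddings(NUM_VISUAL_TOKENS, EMBED_DIM, fill=1.0)
text_tokens = fake_embeddings(NUM_TEXT_TOKENS, EMBED_DIM, fill=2.0)

# Concatenate along the sequence axis; the embedding width is unchanged.
multimodal_sequence = visual_tokens + text_tokens
print(len(multimodal_sequence), len(multimodal_sequence[0]))
```

The combined sequence is what the encoder-decoder would consume; in the real model this happens on tensors rather than lists.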

Figure: Overview of the Florence-2 architecture. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.

Fine-tuning Examples

Dataset Format

The Florence-2 trainer expects a specific dataset structure for training and evaluation. Organize the dataset into train, test, and valid directories, each containing image files and an annotations.jsonl file.

dataset/
├── train/
│   ├── 123e4567-e89b-12d3-a456-426614174000.png
│   ├── 987f6543-a21c-43c3-a562-926514273001.png
│   ├── ...
│   ├── annotations.jsonl
├── test/
│   ├── 456b7890-e32d-44f5-b678-724564172002.png
│   ├── 678c1234-e45b-67f6-c789-813264172003.png
│   ├── ...
│   ├── annotations.jsonl
└── valid/
    ├── 789d2345-f67c-89d7-e891-912354172004.png
    ├── 135e6789-d89f-12e3-f012-456464172005.png
    ├── ...
    └── annotations.jsonl
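Before training, it can help to confirm that a dataset follows this layout. The sketch below is a hypothetical helper, not part of maestro: `check_dataset` treats all three splits as required (which may be stricter than the trainer itself) and assumes every annotation record carries the `image`, `prefix`, and `suffix` keys used in this section.

```python
import json
from pathlib import Path

SPLITS = ("train", "test", "valid")  # directory names taken from the layout above
REQUIRED_KEYS = ("image", "prefix", "suffix")


def check_dataset(root: str) -> list[str]:
    """Return a list of human-readable problems found under `root`."""
    problems = []
    for split in SPLITS:
        split_dir = Path(root) / split
        annotations = split_dir / "annotations.jsonl"
        if not annotations.is_file():
            problems.append(f"{split}: missing annotations.jsonl")
            continue
        with annotations.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # tolerate blank lines
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    problems.append(f"{split}:{line_no}: not valid JSON")
                    continue
                if not isinstance(record, dict):
                    problems.append(f"{split}:{line_no}: expected a JSON object")
                    continue
                for key in REQUIRED_KEYS:
                    if key not in record:
                        problems.append(f"{split}:{line_no}: missing key: {key}")
                image = record.get("image")
                if image and not (split_dir / image).is_file():
                    problems.append(f"{split}:{line_no}: image not found: {image}")
    return problems
```

Calling `check_dataset("dataset")` returns an empty list when every split has an annotations file and every referenced image exists.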

Depending on the vision task being performed, the structure of the annotations.jsonl file will vary slightly.

Warning

The dataset samples shown below are formatted for improved readability, with each JSON structure spread across multiple lines. In practice, the annotations.jsonl file must contain each JSON object on a single line, without any line breaks between the key-value pairs. Make sure to adhere to this structure to avoid parsing errors during model training.
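One way to guarantee the one-object-per-line rule is to generate the file with `json.dumps`, which emits no line breaks unless `indent` is set. The sketch below is illustrative: `to_jsonl` is a hypothetical helper, and the records reuse the file names and OCR strings from the samples in this section.

```python
import json

# Records mirroring the image/prefix/suffix structure used in this section.
records = [
    {
        "image": "123e4567-e89b-12d3-a456-426614174000.png",
        "prefix": "<OCR>",
        "suffix": "ke begherte Die mi",
    },
    {
        "image": "987f6543-a21c-43c3-a562-926514273001.png",
        "prefix": "<OCR>",
        "suffix": "mi uort in de middelt",
    },
]


def to_jsonl(items: list[dict]) -> str:
    """Serialize each record to exactly one line of JSON."""
    # Without `indent`, json.dumps writes a single line; newlines inside
    # string values are escaped as \n rather than written literally.
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


jsonl_text = to_jsonl(records)
print(jsonl_text, end="")
```

The resulting string can be written to a split's annotations file, for example with `Path("dataset/train/annotations.jsonl").write_text(jsonl_text, encoding="utf-8")`.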

Object Detection

{
    "image":"123e4567-e89b-12d3-a456-426614174000.png",
    "prefix":"<OD>",
    "suffix":"9 of clubs<loc_138><loc_100><loc_470><loc_448>10 of clubs<loc_388><loc_145><loc_670><loc_453>"
}
{
    "image":"987f6543-a21c-43c3-a562-926514273001.png",
    "prefix":"<OD>",
    "suffix":"5 of clubs<loc_554><loc_2><loc_763><loc_467>6 of clubs<loc_399><loc_79><loc_555><loc_466>"
}
...

Visual Question Answering (VQA)

{
    "image":"123e4567-e89b-12d3-a456-426614174000.png",
    "prefix":"<VQA> Is the value of Favorable 38 in 2015?",
    "suffix":"Yes"
}
{
    "image":"987f6543-a21c-43c3-a562-926514273001.png",
    "prefix":"<VQA> How many values are below 40 in Unfavorable graph?",
    "suffix":"6"
}
...

Optical Character Recognition (OCR)

{
    "image":"123e4567-e89b-12d3-a456-426614174000.png",
    "prefix":"<OCR>",
    "suffix":"ke begherte Die mi"
}
{
    "image":"987f6543-a21c-43c3-a562-926514273001.png",
    "prefix":"<OCR>",
    "suffix":"mi uort in de middelt"
}
...
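In the object detection samples, each detection in the suffix is a class label followed by four `<loc_N>` tokens. A minimal sketch for splitting such a suffix back into labels and boxes follows. `parse_detections` is a hypothetical helper, not a maestro function, and because this section does not say how the four numbers map to pixel coordinates, the sketch returns them as raw integers.

```python
import re

# One detection = a label followed by four <loc_N> tokens, as in the samples.
DETECTION_PATTERN = re.compile(r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def parse_detections(suffix: str) -> list[tuple[str, tuple[int, ...]]]:
    """Split an object detection suffix into (label, four location values) pairs."""
    detections = []
    for match in DETECTION_PATTERN.finditer(suffix):
        label = match.group(1).strip()
        box = tuple(int(match.group(index)) for index in range(2, 6))
        detections.append((label, box))
    return detections


suffix = "9 of clubs<loc_138><loc_100><loc_470><loc_448>10 of clubs<loc_388><loc_145><loc_670><loc_453>"
for label, box in parse_detections(suffix):
    print(label, box)
```

Running the same parser over every line of an annotations file is a quick way to spot suffixes with a missing or extra location token, since those detections will not match the pattern.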

CLI

Tip

Depending on the GPU you are using, you may need to lower the batch size so that training fits within memory limits. On GPUs with more memory, you can increase the batch size for better throughput.

Tip

Depending on the vision task you are executing, you may need to select different vision metrics. For example, tasks like object detection typically use mean_average_precision, while VQA and OCR tasks use metrics like word_error_rate and character_error_rate.
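For readers unfamiliar with the text metrics, the sketch below shows the conventional definitions of word error rate and character error rate: edit distance divided by the reference length. It is an illustration under that assumption, not maestro's implementation, and the function names are hypothetical.

```python
def edit_distance(reference: list, hypothesis: list) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions cost 1 each."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def character_error_rate(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)


reference = "mi uort in de middelt"
hypothesis = "mi wort in de middel"
print(word_error_rate(reference, hypothesis))       # 2 of the 5 reference words differ
print(character_error_rate(reference, hypothesis))  # 2 character edits over 21 characters
```

Both metrics are error rates, so lower is better, whereas a higher mean average precision is better for detection.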

Tip

You may need to use different learning rates depending on the task. We have found that lower learning rates work better for tasks like OCR or VQA, as these tasks require more precision.

Object Detection

maestro florence2 train --dataset='<DATASET_PATH>' \
--epochs=10 --batch-size=8 --lr=5e-6 --metrics=mean_average_precision

Visual Question Answering (VQA)

maestro florence2 train --dataset='<DATASET_PATH>' \
--epochs=10 --batch-size=8 --lr=1e-6 \
--metrics=word_error_rate, character_error_rate

Optical Character Recognition (OCR)

maestro florence2 train --dataset='<DATASET_PATH>' \
--epochs=10 --batch-size=8 --lr=1e-6 \
--metrics=word_error_rate, character_error_rate

SDK

Object Detection

from maestro.trainer.common import MeanAveragePrecisionMetric
from maestro.trainer.models.florence_2 import train, Configuration

# Detection quality is tracked with mean average precision.
config = Configuration(
    dataset='<DATASET_PATH>',
    epochs=10,
    batch_size=8,
    lr=5e-6,
    metrics=[MeanAveragePrecisionMetric()]
)

train(config)

Visual Question Answering (VQA)

from maestro.trainer.common import WordErrorRateMetric, CharacterErrorRateMetric
from maestro.trainer.models.florence_2 import train, Configuration

# Text outputs are tracked with word and character error rates.
config = Configuration(
    dataset='<DATASET_PATH>',
    epochs=10,
    batch_size=8,
    lr=1e-6,
    metrics=[WordErrorRateMetric(), CharacterErrorRateMetric()]
)

train(config)

Optical Character Recognition (OCR)

from maestro.trainer.common import WordErrorRateMetric, CharacterErrorRateMetric
from maestro.trainer.models.florence_2 import train, Configuration

# Text outputs are tracked with word and character error rates.
config = Configuration(
    dataset='<DATASET_PATH>',
    epochs=10,
    batch_size=8,
    lr=1e-6,
    metrics=[WordErrorRateMetric(), CharacterErrorRateMetric()]
)

train(config)

API

Configuration

Configuration for a Florence-2 model.

This class encapsulates all the parameters needed for training a Florence-2 model, including dataset paths, model specifications, training hyperparameters, and output settings.

Attributes:

Name	Type	Description
`dataset`	`str`	Path to the dataset used for training.
`model_id`	`str`	Identifier for the Florence-2 model.
`revision`	`str`	Revision of the model to use.
`device`	`torch.device`	Device to use for training.
`cache_dir`	`Optional[str]`	Directory to cache the model.
`epochs`	`int`	Number of training epochs.
`optimizer`	`Literal['sgd', 'adamw', 'adam']`	Optimizer to use for training.
`lr`	`float`	Learning rate for the optimizer.
`lr_scheduler`	`Literal['linear', 'cosine', 'polynomial']`	Learning rate scheduler.
`batch_size`	`int`	Batch size for training.
`val_batch_size`	`Optional[int]`	Batch size for validation.
`num_workers`	`int`	Number of workers for data loading.
`val_num_workers`	`Optional[int]`	Number of workers for validation data loading.
`lora_r`	`int`	Rank of the LoRA update matrices.
`lora_alpha`	`int`	Scaling factor for the LoRA update.
`lora_dropout`	`float`	Dropout probability for LoRA layers.
`bias`	`Literal['none', 'all', 'lora_only']`	Which bias to train.
`use_rslora`	`bool`	Whether to use RSLoRA.
`init_lora_weights`	`Union[bool, LoraInitLiteral]`	How to initialize LoRA weights.
`output_dir`	`str`	Directory to save output files.
`metrics`	`List[BaseMetric]`	List of metrics to track during training.
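The interaction between `lora_r`, `lora_alpha`, and `use_rslora` is easiest to see through the scaling factor applied to the low-rank update. The sketch below follows the conventional definitions (alpha / r for standard LoRA, alpha / sqrt(r) for rank-stabilized LoRA) as used in Hugging Face PEFT. It assumes maestro passes these values through unchanged, which this page does not show, and `lora_scale` is a hypothetical helper.

```python
import math


def lora_scale(lora_alpha: int, lora_r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update before it is added to the frozen weight."""
    if use_rslora:
        return lora_alpha / math.sqrt(lora_r)  # rank-stabilized LoRA
    return lora_alpha / lora_r                 # standard LoRA


# With lora_r=8 and lora_alpha=8:
print(lora_scale(8, 8, use_rslora=False))  # 1.0
print(lora_scale(8, 8, use_rslora=True))   # 8 / sqrt(8), roughly 2.83
```

Under these definitions, raising `lora_r` at a fixed `lora_alpha` shrinks the standard LoRA update faster than the rank-stabilized one, which is worth keeping in mind when tuning `lr` alongside the rank.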

Source code in maestro/trainer/models/florence_2/core.py

@dataclass(frozen=True)
class Configuration:
    """Configuration for a Florence-2 model.

    This class encapsulates all the parameters needed for training a Florence-2 model,
    including dataset paths, model specifications, training hyperparameters, and output
    settings.

    Attributes:
        dataset (str): Path to the dataset used for training.
        model_id (str): Identifier for the Florence-2 model.
        revision (str): Revision of the model to use.
        device (torch.device): Device to use for training.
        cache_dir (Optional[str]): Directory to cache the model.
        epochs (int): Number of training epochs.
        optimizer (Literal["sgd", "adamw", "adam"]): Optimizer to use for training.
        lr (float): Learning rate for the optimizer.
        lr_scheduler (Literal["linear", "cosine", "polynomial"]): Learning rate
            scheduler.
        batch_size (int): Batch size for training.
        val_batch_size (Optional[int]): Batch size for validation.
        num_workers (int): Number of workers for data loading.
        val_num_workers (Optional[int]): Number of workers for validation data loading.
        lora_r (int): Rank of the LoRA update matrices.
        lora_alpha (int): Scaling factor for the LoRA update.
        lora_dropout (float): Dropout probability for LoRA layers.
        bias (Literal["none", "all", "lora_only"]): Which bias to train.
        use_rslora (bool): Whether to use RSLoRA.
        init_lora_weights (Union[bool, LoraInitLiteral]): How to initialize LoRA
            weights.
        output_dir (str): Directory to save output files.
        metrics (List[BaseMetric]): List of metrics to track during training.
    """

    dataset: str
    model_id: str = DEFAULT_FLORENCE2_MODEL_ID
    revision: str = DEFAULT_FLORENCE2_MODEL_REVISION
    device: torch.device = DEVICE
    cache_dir: Optional[str] = None
    epochs: int = 10
    optimizer: Literal["sgd", "adamw", "adam"] = "adamw"
    lr: float = 1e-5
    lr_scheduler: Literal["linear", "cosine", "polynomial"] = "linear"
    batch_size: int = 4
    val_batch_size: Optional[int] = None
    num_workers: int = 0
    val_num_workers: Optional[int] = None
    lora_r: int = 8
    lora_alpha: int = 8
    lora_dropout: float = 0.05
    bias: Literal["none", "all", "lora_only"] = "none"
    use_rslora: bool = True
    init_lora_weights: Union[bool, LoraInitLiteral] = "gaussian"
    output_dir: str = "./training/florence-2"
    metrics: list[BaseMetric] = field(default_factory=list)
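Because `Configuration` is declared with `frozen=True`, its fields cannot be reassigned after construction, which is why `train` calls `replace(config, output_dir=run_dir)` rather than mutating the object. The sketch below demonstrates that behavior on `MiniConfig`, a hypothetical stand-in with three of the fields. The run directory name is made up for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical stand-in with three of Configuration's fields."""

    dataset: str
    output_dir: str = "./training/florence-2"
    metrics: list = field(default_factory=list)


config = MiniConfig(dataset="<DATASET_PATH>")

try:
    config.output_dir = "./elsewhere"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("assignment rejected")

# replace() builds a new instance with the selected fields overridden.
run_config = replace(config, output_dir="./training/florence-2/run-1")  # made-up run directory
print(config.output_dir)
print(run_config.output_dir)
```

The original object is left untouched, so a single configuration can safely be reused across several training or evaluation calls.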

train

Train a Florence-2 model using the provided configuration.

This function sets up the training environment, prepares the model and data loaders, and runs the training loop. It also handles metric tracking and checkpoint saving.

Parameters:

Name	Type	Description	Default
`config`	`Configuration`	The configuration object containing all necessary parameters for training.	required

Returns:

Type	Description
`None`	None

Raises:

Type	Description
`ValueError`	If an unsupported optimizer is specified in the configuration.

Source code in maestro/trainer/models/florence_2/core.py

def train(config: Configuration) -> None:
    """Train a Florence-2 model using the provided configuration.

    This function sets up the training environment, prepares the model and data loaders,
    and runs the training loop. It also handles metric tracking and checkpoint saving.

    Args:
        config (Configuration): The configuration object containing all necessary
            parameters for training.

    Returns:
        None

    Raises:
        ValueError: If an unsupported optimizer is specified in the configuration.
    """
    make_it_reproducible(avoid_non_deterministic_algorithms=False)
    run_dir = create_new_run_directory(
        base_output_dir=config.output_dir,
    )
    config = replace(
        config,
        output_dir=run_dir,
    )
    checkpoint_manager = CheckpointManager(run_dir)

    processor, model = load_model(
        model_id_or_path=config.model_id,
        revision=config.revision,
        device=config.device,
        cache_dir=config.cache_dir,
    )
    train_loader, val_loader, test_loader = create_data_loaders(
        dataset_location=config.dataset,
        train_batch_size=config.batch_size,
        processor=processor,
        device=config.device,
        num_workers=config.num_workers,
        test_loaders_workers=config.val_num_workers,
    )
    peft_model = prepare_peft_model(
        model=model,
        r=config.lora_r,
        lora_alpha=config.lora_alpha,
        lora_dropout=config.lora_dropout,
        bias=config.bias,
        use_rslora=config.use_rslora,
        init_lora_weights=config.init_lora_weights,
        revision=config.revision,
    )
    training_metrics_tracker = MetricsTracker.init(metrics=["loss"])
    metrics = ["loss"]
    for metric in config.metrics:
        metrics += metric.describe()
    validation_metrics_tracker = MetricsTracker.init(metrics=metrics)

    run_training_loop(
        processor=processor,
        model=peft_model,
        data_loaders=(train_loader, val_loader),
        config=config,
        training_metrics_tracker=training_metrics_tracker,
        validation_metrics_tracker=validation_metrics_tracker,
        checkpoint_manager=checkpoint_manager,
    )

    save_metric_plots(
        training_tracker=training_metrics_tracker,
        validation_tracker=validation_metrics_tracker,
        output_dir=os.path.join(config.output_dir, "metrics"),
    )
    training_metrics_tracker.as_json(output_dir=os.path.join(config.output_dir, "metrics"), filename="training.json")
    validation_metrics_tracker.as_json(
        output_dir=os.path.join(config.output_dir, "metrics"), filename="validation.json"
    )

    # Log out paths for latest and best checkpoints
    print(f"Latest checkpoint saved at: {checkpoint_manager.latest_checkpoint_dir}")
    print(f"Best checkpoint saved at: {checkpoint_manager.best_checkpoint_dir}")

evaluate

Evaluate a Florence-2 model using the provided configuration.

This function loads the model and data, runs predictions on the evaluation dataset, computes specified metrics, and saves the results.

Parameters:

Name	Type	Description	Default
`config`	`Configuration`	The configuration object containing all necessary parameters for evaluation.	required

Returns:

Type	Description
`None`	None

Source code in maestro/trainer/models/florence_2/core.py

def evaluate(config: Configuration) -> None:
    """Evaluate a Florence-2 model using the provided configuration.

    This function loads the model and data, runs predictions on the evaluation dataset,
    computes specified metrics, and saves the results.

    Args:
        config (Configuration): The configuration object containing all necessary
            parameters for evaluation.

    Returns:
        None
    """
    processor, model = load_model(
        model_id_or_path=config.model_id,
        revision=config.revision,
        device=config.device,
        cache_dir=config.cache_dir,
    )
    train_loader, val_loader, test_loader = create_data_loaders(
        dataset_location=config.dataset,
        train_batch_size=config.batch_size,
        processor=processor,
        device=config.device,
        num_workers=config.num_workers,
        test_loaders_workers=config.val_num_workers,
    )
    evaluation_loader = test_loader if test_loader is not None else val_loader

    metrics = []
    for metric in config.metrics:
        metrics += metric.describe()
    evaluation_metrics_tracker = MetricsTracker.init(metrics=metrics)

    # Run inference once for all metrics
    _, expected_answers, generated_answers, images = run_predictions(
        loader=evaluation_loader, processor=processor, model=model
    )

    for metric in config.metrics:
        if isinstance(metric, MeanAveragePrecisionMetric):
            classes = get_unique_detection_classes(train_loader.dataset)
            targets, predictions = process_output_for_detection_metric(
                expected_answers=expected_answers,
                generated_answers=generated_answers,
                images=images,
                classes=classes,
                processor=processor,
            )
            result = metric.compute(targets=targets, predictions=predictions)
            for key, value in result.items():
                evaluation_metrics_tracker.register(
                    metric=key,
                    epoch=1,
                    step=1,
                    value=value,
                )
        else:
            predictions = process_output_for_text_metric(
                generated_answers=generated_answers,
                images=images,
                processor=processor,
            )
            result = metric.compute(targets=expected_answers, predictions=predictions)
            for key, value in result.items():
                evaluation_metrics_tracker.register(
                    metric=key,
                    epoch=1,
                    step=1,
                    value=value,
                )

    evaluation_metrics_tracker.as_json(
        output_dir=os.path.join(config.output_dir, "metrics"), filename="evaluation.json"
    )
