Florence-2
Overview¶
Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license. The model demonstrates strong zero-shot and fine-tuning capabilities across tasks such as captioning, object detection, grounding, and segmentation.
Florence-2: Fine-tune Microsoft’s Multimodal Model.
Architecture¶
The model takes images and task prompts as input, generating the desired results in text format. It uses a DaViT vision encoder to convert images into visual token embeddings. These are then concatenated with BERT-generated text embeddings and processed by a transformer-based multi-modal encoder-decoder to generate the response.
Overview of Florence-2 architecture. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks.
Fine-tuning Examples¶
Dataset Format¶
The Florence-2 model expects a specific dataset structure for training and evaluation. The dataset should be organized into train, test, and valid splits, with each split containing image files and an annotations.jsonl file.
dataset/
├── train/
│   ├── 123e4567-e89b-12d3-a456-426614174000.png
│   ├── 987f6543-a21c-43c3-a562-926514273001.png
│   ├── ...
│   └── annotations.jsonl
├── test/
│   ├── 456b7890-e32d-44f5-b678-724564172002.png
│   ├── 678c1234-e45b-67f6-c789-813264172003.png
│   ├── ...
│   └── annotations.jsonl
└── valid/
    ├── 789d2345-f67c-89d7-e891-912354172004.png
    ├── 135e6789-d89f-12e3-f012-456464172005.png
    ├── ...
    └── annotations.jsonl
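Before training, it can help to confirm that a dataset folder matches the layout above. The following is a minimal sketch using only the Python standard library; `check_dataset` is a hypothetical helper, not part of maestro, and it assumes each record stores the image file name under an `image` key as in the samples in this section.

```python
import json
from pathlib import Path

# Split folder names taken from the directory tree above.
SPLITS = ("train", "test", "valid")


def check_dataset(root):
    """Return a list of human-readable problems found under `root`."""
    problems = []
    for split in SPLITS:
        split_dir = Path(root) / split
        annotations = split_dir / "annotations.jsonl"
        if not annotations.is_file():
            problems.append(f"{split}: missing annotations.jsonl")
            continue
        with annotations.open(encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # tolerate blank lines, e.g. at the end of the file
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    problems.append(f"{split}: line {number} is not valid JSON")
                    continue
                image = record.get("image") if isinstance(record, dict) else None
                if not image or not (split_dir / image).is_file():
                    problems.append(f"{split}: line {number} references a missing image")
    return problems
```

An empty list means every split has an annotations.jsonl file and every annotated image exists next to it.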
Depending on the vision task being performed, the structure of the annotations.jsonl file will vary slightly.
Warning
The dataset samples shown below are formatted for improved readability, with each
JSON structure spread across multiple lines. In practice, the annotations.jsonl
file must contain each JSON object on a single line, without any line breaks
between the key-value pairs. Make sure to adhere to this structure to avoid parsing
errors during model training.
{
  "image": "123e4567-e89b-12d3-a456-426614174000.png",
  "prefix": "<OD>",
  "suffix": "9 of clubs<loc_138><loc_100><loc_470><loc_448>10 of clubs<loc_388><loc_145><loc_670><loc_453>"
}
{
  "image": "987f6543-a21c-43c3-a562-926514273001.png",
  "prefix": "<OD>",
  "suffix": "5 of clubs<loc_554><loc_2><loc_763><loc_467>6 of clubs<loc_399><loc_79><loc_555><loc_466>"
}
...
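In the object detection samples above, the suffix interleaves each class name with four `<loc_N>` tokens. The sketch below shows one way to build and read such records. It assumes the four tokens are x_min, y_min, x_max, y_max, each quantized into 1000 bins (0 to 999) relative to the image width and height; that convention is not stated on this page, so verify it against your data. All helper names are hypothetical and not part of maestro.

```python
import json
import re

# Assumption: coordinates are quantized into 1000 bins (<loc_0> .. <loc_999>).
BINS = 1000

# One detection: a label followed by exactly four location tokens.
DETECTION = re.compile(r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def box_to_tokens(box, width, height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) into four <loc_N> tokens."""
    x_min, y_min, x_max, y_max = box
    pairs = ((x_min, width), (y_min, height), (x_max, width), (y_max, height))
    bins = (min(BINS - 1, max(0, int(value * BINS / size))) for value, size in pairs)
    return "".join(f"<loc_{b}>" for b in bins)


def make_record(image, detections, width, height):
    """Build one annotation dict from (label, box) pairs."""
    suffix = "".join(label + box_to_tokens(box, width, height) for label, box in detections)
    return {"image": image, "prefix": "<OD>", "suffix": suffix}


def to_jsonl_line(record):
    """Serialize a record on a single line, as annotations.jsonl requires."""
    return json.dumps(record, ensure_ascii=False)


def parse_suffix(suffix):
    """Split a suffix back into (label, (bin, bin, bin, bin)) pairs."""
    return [
        (match.group(1), tuple(int(match.group(i)) for i in range(2, 6)))
        for match in DETECTION.finditer(suffix)
    ]
```

Writing each `to_jsonl_line(record)` result followed by a newline produces a file with one JSON object per line, which is the form described in the warning above.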
CLI¶
Tip
Depending on the GPU you are using, you may need to adjust the batch-size to ensure that your model trains within memory limits. For larger GPUs with more memory, you can increase the batch size for better performance.
Tip
Depending on the vision task you are executing, you may need to select different vision metrics. For example, tasks like object detection typically use mean_average_precision, while VQA and OCR tasks use metrics like word_error_rate and character_error_rate.
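For readers unfamiliar with the text metrics named above, the sketch below computes word and character error rates in the usual way: the edit distance between prediction and reference, divided by the reference length. It is an illustration only; maestro's own metric classes may normalize or tokenize text differently, and the function names here are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ref_item
                    current[j - 1] + 1,      # insert hyp_item
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

Both rates are 0.0 for a perfect prediction and can exceed 1.0 when the prediction is much longer than the reference.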
Tip
You may need to use different learning rates depending on the task. We have found that lower learning rates work better for tasks like OCR or VQA, as these tasks require more precision.
SDK¶
from maestro.trainer.common import WordErrorRateMetric, CharacterErrorRateMetric
from maestro.trainer.models.florence_2 import train, Configuration

config = Configuration(
    dataset='<DATASET_PATH>',
    epochs=10,
    batch_size=8,
    lr=1e-6,
    metrics=[WordErrorRateMetric(), CharacterErrorRateMetric()]
)

train(config)
API¶
Configuration for a Florence-2 model.
This class encapsulates all the parameters needed for training a Florence-2 model, including dataset paths, model specifications, training hyperparameters, and output settings.
Attributes:
Name | Type | Description |
---|---|---|
dataset | str | Path to the dataset used for training. |
model_id | str | Identifier for the Florence-2 model. |
revision | str | Revision of the model to use. |
device | device | Device to use for training. |
cache_dir | Optional[str] | Directory to cache the model. |
epochs | int | Number of training epochs. |
optimizer | Literal['sgd', 'adamw', 'adam'] | Optimizer to use for training. |
lr | float | Learning rate for the optimizer. |
lr_scheduler | Literal['linear', 'cosine', 'polynomial'] | Learning rate scheduler. |
batch_size | int | Batch size for training. |
val_batch_size | Optional[int] | Batch size for validation. |
num_workers | int | Number of workers for data loading. |
val_num_workers | Optional[int] | Number of workers for validation data loading. |
lora_r | int | Rank of the LoRA update matrices. |
lora_alpha | int | Scaling factor for the LoRA update. |
lora_dropout | float | Dropout probability for LoRA layers. |
bias | Literal['none', 'all', 'lora_only'] | Which bias to train. |
use_rslora | bool | Whether to use RSLoRA. |
init_lora_weights | Union[bool, LoraInitLiteral] | How to initialize LoRA weights. |
output_dir | str | Directory to save output files. |
metrics | List[BaseMetric] | List of metrics to track during training. |
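As a rough illustration of how the attributes above fit together, the sketch below collects a set of overrides and passes them to Configuration. Only attribute names listed in the table are used. Every value is an illustrative assumption rather than a recommended or default setting, `run_training` is a hypothetical wrapper, and passing the attributes as keyword arguments follows the SDK example rather than a documented signature.

```python
# Overrides limited to attributes listed in the table above.
# All values are illustrative assumptions, not library defaults.
config_kwargs = dict(
    dataset="<DATASET_PATH>",
    epochs=10,
    optimizer="adamw",        # one of 'sgd', 'adamw', 'adam'
    lr=1e-6,
    lr_scheduler="cosine",    # one of 'linear', 'cosine', 'polynomial'
    batch_size=8,
    val_batch_size=8,
    lora_r=8,
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",              # one of 'none', 'all', 'lora_only'
    output_dir="./training/florence-2",
)


def run_training(kwargs):
    """Hypothetical wrapper: build a Configuration and start training."""
    # Imported lazily so the overrides can be inspected without maestro installed.
    from maestro.trainer.common import CharacterErrorRateMetric, WordErrorRateMetric
    from maestro.trainer.models.florence_2 import Configuration, train

    metrics = [WordErrorRateMetric(), CharacterErrorRateMetric()]
    train(Configuration(metrics=metrics, **kwargs))
```

Calling `run_training(config_kwargs)` requires maestro to be installed and `<DATASET_PATH>` to be replaced with a real dataset directory.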
Source code in maestro/trainer/models/florence_2/core.py
Train a Florence-2 model using the provided configuration.
This function sets up the training environment, prepares the model and data loaders, and runs the training loop. It also handles metric tracking and checkpoint saving.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
 | Configuration | The configuration object containing all necessary parameters for training. | required |
Returns:
Type | Description |
---|---|
None | None |
Raises:
Type | Description |
---|---|
ValueError | If an unsupported optimizer is specified in the configuration. |
Source code in maestro/trainer/models/florence_2/core.py
Evaluate a Florence-2 model using the provided configuration.
This function loads the model and data, runs predictions on the evaluation dataset, computes specified metrics, and saves the results.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
 | Configuration | The configuration object containing all necessary parameters for evaluation. | required |
Returns:
Type | Description |
---|---|
None | None |