imgdiff_difference_caption_generator_mapper

Generates difference captions for bounding box regions in two images.

This operator processes pairs of images and generates captions for the differences in their bounding box regions. It uses a multi-step process:

1. Describes the content of each bounding box region using a Hugging Face model.
2. Crops the bounding box regions from both images.
3. Checks whether the cropped regions match the generated captions.
4. Determines whether the two captions differ.
5. Marks the difference area with a red box.
6. Generates difference captions for the marked areas.

The key metric is the similarity score between the two captions, computed with a CLIP model.
If no valid bounding boxes or differences are found, the operator returns empty captions and zeroed bounding boxes.
The operator uses 'cuda' as the accelerator if any of the fused operations supports it.
Temporary images are cached during processing and cleared afterward.
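
The control flow of the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the operator's actual implementation: the model calls are replaced by injected callables (`describe`, `matches`, `similarity`, `describe_difference`), all of which are hypothetical names, as are the `(x, y, width, height)` box layout, the `max_similarity` threshold, and the shape of the returned dict. For simplicity the sketch crops each region before describing it, and it leaves the red-box marking to the `describe_difference` callable.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)
Image = List[List[int]]           # toy stand-in: row-major grid of pixel values


def crop(image: Image, bbox: BBox) -> Image:
    """Cut the bounding box region out of a row-major pixel grid."""
    x, y, w, h = bbox
    return [row[x:x + w] for row in image[y:y + h]]


def difference_captions(
    image_a: Image,
    image_b: Image,
    bboxes: Sequence[BBox],
    describe: Callable[[Image], str],             # stand-in for the captioning model
    matches: Callable[[Image, str], bool],        # stand-in for image-text matching
    similarity: Callable[[str, str], float],      # stand-in for CLIP text similarity
    describe_difference: Callable[[Image, Image, BBox], str],
    max_similarity: float = 0.9,                  # hypothetical threshold
) -> Dict[str, list]:
    captions: List[str] = []
    kept: List[BBox] = []
    for bbox in bboxes:
        region_a, region_b = crop(image_a, bbox), crop(image_b, bbox)
        cap_a, cap_b = describe(region_a), describe(region_b)
        # Drop boxes whose crops do not match their own captions.
        if not (matches(region_a, cap_a) and matches(region_b, cap_b)):
            continue
        # Captions that are too similar are treated as "no difference".
        if similarity(cap_a, cap_b) >= max_similarity:
            continue
        # The real operator marks the area with a red box before captioning it;
        # here that work is delegated to the injected callable.
        captions.append(describe_difference(image_a, image_b, bbox))
        kept.append(bbox)
    if not kept:
        # Fallback described above: empty caption and a zeroed bounding box.
        return {"captions": [""], "bboxes": [(0, 0, 0, 0)]}
    return {"captions": captions, "bboxes": kept}
```

Injecting the model calls as callables keeps the sketch runnable without downloading any weights, and makes the fallback path easy to exercise with stub functions.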

Type: mapper

Tags: gpu

🔧 Parameter Configuration

name	type	default	description
`mllm_mapper_args`	typing.Optional[typing.Dict]	`{}`	Arguments for the multimodal language model mapper, which generates captions for the bounding box regions. With the default empty dict, these fixed values are used: max_new_tokens=256, temperature=0.2, top_p=None, num_beams=1, hf_model="llava-hf/llava-v1.6-vicuna-7b-hf".
`image_text_matching_filter_args`	typing.Optional[typing.Dict]	`{}`	Arguments for the image-text matching filter, which checks cropped regions against their generated captions. With the default empty dict, these fixed values are used: min_score=0.1, max_score=1.0, hf_blip="Salesforce/blip-itm-base-coco", num_proc=1.
`text_pair_similarity_filter_args`	typing.Optional[typing.Dict]	`{}`	Arguments for the text pair similarity filter, which compares the two captions of each pair. With the default empty dict, these fixed values are used: min_score=0.1, max_score=1.0, hf_clip="openai/clip-vit-base-patch32", text_key_second="target_text", num_proc=1.
`args`		`''`
`kwargs`		`''`
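
As a rough illustration of how the three argument dicts relate to the fixed values listed in the table, the sketch below overlays user-supplied keys on those values. It assumes a per-key merge (keys the user omits fall back to the listed values); the table only states what happens for an empty dict, so that merge behavior, the `FIXED_ARGS` name, the `resolve_args` helper, and the `op_config` layout are all hypothetical and not taken from the operator's code.

```python
from typing import Any, Dict, Optional

# Fixed values copied from the parameter table.
FIXED_ARGS: Dict[str, Dict[str, Any]] = {
    "mllm_mapper_args": {
        "max_new_tokens": 256,
        "temperature": 0.2,
        "top_p": None,
        "num_beams": 1,
        "hf_model": "llava-hf/llava-v1.6-vicuna-7b-hf",
    },
    "image_text_matching_filter_args": {
        "min_score": 0.1,
        "max_score": 1.0,
        "hf_blip": "Salesforce/blip-itm-base-coco",
        "num_proc": 1,
    },
    "text_pair_similarity_filter_args": {
        "min_score": 0.1,
        "max_score": 1.0,
        "hf_clip": "openai/clip-vit-base-patch32",
        "text_key_second": "target_text",
        "num_proc": 1,
    },
}


def resolve_args(name: str, user_args: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Overlay user-supplied keys on the listed values (assumed per-key merge)."""
    return {**FIXED_ARGS[name], **(user_args or {})}


# Hypothetical operator entry that overrides only two settings.
op_config = {
    "imgdiff_difference_caption_generator_mapper": {
        "mllm_mapper_args": {"max_new_tokens": 128},
        "text_pair_similarity_filter_args": {"min_score": 0.2},
    }
}

# Resolve all three argument dicts, including the one the entry leaves out.
user_entry = op_config["imgdiff_difference_caption_generator_mapper"]
resolved = {name: resolve_args(name, user_entry.get(name)) for name in FIXED_ARGS}
```

Under this assumption, `resolved["mllm_mapper_args"]` keeps the listed `temperature` and `hf_model` while using the overridden `max_new_tokens`, and `image_text_matching_filter_args` resolves entirely to the listed values.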