CN-122023980-A - Method and device for enhancing fine granularity perception of visual language model
Abstract
The invention discloses a method and a device for enhancing the fine-grained perception of a visual language model. The method obtains an image and the normalized coordinates of candidate regions; extracts global semantic features and fine-grained visual features in parallel with a main visual encoder and an auxiliary visual encoder; within a mixed fine-grained region encoder, applies a feature pyramid to transform the main features into multiple scales and splices the result with the auxiliary features; generates a geometric position embedding vector from the coordinates via linear projection and fuses it element by element with the spliced features to obtain mixed fine-grained region features; and finally projects these features into region index Tokens that are input to the visual language model to generate a text response containing region reference labels. Through dual-stream feature complementation and explicit geometric injection, the invention addresses the low localization precision of VLMs, achieving accurate target localization and region understanding while preserving the general semantic capability of the model.
Inventors
- LIU PENG
- ZHAO TIANCHENG
- ZHANG QIANQIAN
- LI KUISONG
- LI CHENG
Assignees
- 穹界智能科技(杭州)有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-29
Claims (9)
- 1. A method for enhancing the fine-grained perception of a visual language model, comprising the steps of: S1, acquiring an image to be processed and a plurality of pieces of candidate region information corresponding to the image, wherein the candidate region information comprises normalized spatial coordinates of potential target objects in the image; S2, extracting a global semantic feature map from the image through a main visual encoder and, in parallel, extracting fine-grained visual features through an auxiliary visual encoder; S3, generating mixed fine-grained region features through a mixed fine-grained region encoder, specifically comprising: S31, performing scale transformation on the global semantic feature map output by the main visual encoder using a simple feature pyramid module to generate multiple layers of main visual features; S32, extracting first region features from the main visual features based on the plurality of pieces of candidate region information and, in parallel, extracting second region features from the fine-grained visual features based on the same candidate region information; S33, splicing the first region features and the second region features; S34, generating a geometric position embedding vector from the normalized spatial coordinates of the candidate region information using a linear projection layer, and fusing it element by element with the spliced features to obtain the mixed fine-grained region features; S4, mapping the mixed fine-grained region features to the embedding space of the visual language model through a feature projection layer to generate region index Tokens; S5, converting the global semantic feature map into image Tokens through a visual-language connector, converting a text instruction into text Tokens through a text tokenizer, and then inputting the region index Tokens, the image Tokens and the text Tokens together into the visual language model to generate a text response containing region reference labels, wherein each region reference label is associated with a region index Token to indicate the position of a target object in the image.
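As a minimal illustration of the normalized spatial coordinates required by step S1, the following sketch maps a pixel-space bounding box to the [0, 1] range (the function name and box convention are illustrative assumptions, not identifiers from the patent):

```python
def normalize_box(box, img_w, img_h):
    """Map a pixel-space box [x1, y1, x2, y2] to normalized coordinates
    in [0, 1], as required for the candidate region information in S1."""
    x1, y1, x2, y2 = box
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
```

For example, `normalize_box([160, 120, 480, 360], 640, 480)` yields `[0.25, 0.25, 0.75, 0.75]`; normalizing makes the coordinates independent of the input resolution before they are embedded in step S34.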
- 2. The method for enhancing fine-grained perception of a visual language model according to claim 1, wherein in step S31, performing the multi-scale transformation on the global semantic feature map using the simple feature pyramid module specifically comprises: processing the global semantic feature map with a set of convolution and deconvolution layers of different strides to construct four layers of main visual features covering a retained-resolution scale, downsampled scales, and an upsampled scale; and in step S32, extracting the first region features from the main visual features based on the plurality of pieces of candidate region information specifically comprises: for each candidate region, performing a region-of-interest alignment operation on each of the four layers of main visual features to obtain feature blocks at four scales; and cascading the feature blocks of the four scales along the channel dimension to serve as the first region features.
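The shape bookkeeping of claim 2 can be sketched as follows. The scale factors (one upsampled level, the retained resolution, two downsampled levels) and the 7x7 RoI-align output size are illustrative assumptions; the patent fixes only that there are four levels cascaded on the channel dimension:

```python
def pyramid_shapes(c, h, w, factors=(2.0, 1.0, 0.5, 0.25)):
    """Shapes of the four main visual feature levels derived from one
    (c, h, w) global semantic feature map by the simple feature pyramid:
    an upsampled level, the retained resolution, and two downsampled
    levels (the exact factors here are an assumption)."""
    return [(c, int(h * f), int(w * f)) for f in factors]

def first_region_feature_shape(c, pool=7, n_levels=4):
    """Shape after RoI-aligning each level to pool x pool and cascading
    the four feature blocks along the channel dimension (claim 2)."""
    return (n_levels * c, pool, pool)
```

With a 32x32, 256-channel feature map, the four levels have spatial sizes 64, 32, 16 and 8, and each candidate region yields a (1024, 7, 7) first region feature.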
- 3. The method of claim 1, wherein the auxiliary visual encoder is a high-resolution vision transformer or a convolutional neural network; and in step S32, extracting the corresponding second region features specifically comprises: upsampling the fine-grained visual features output in step S2 to a uniform size and splicing them to form a combined feature map; and performing the region-of-interest alignment operation on the combined feature map to extract the second region features.
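The resize-and-splice step of claim 3 can be sketched in pure Python. Nearest-neighbour interpolation is an assumption (the claim does not fix the upsampling method), and feature maps are represented as nested lists of channels and rows for readability:

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour upsampling of one 2-D channel (a list of rows),
    standing in for the resize-to-uniform-size step of claim 3."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[r * h // out_h][c * w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def combine(feature_maps, out_h, out_w):
    """Upsample every channel of every fine-grained feature map to a
    common size and splice them along the channel dimension, forming
    the combined feature map on which RoI alignment is performed."""
    return [upsample_nearest(ch, out_h, out_w)
            for fm in feature_maps for ch in fm]
```

Combining a 2x2 single-channel map with a 1x1 single-channel map at a 4x4 target size produces a two-channel 4x4 combined map.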
- 4. The method for enhancing fine-grained perception of a visual language model according to claim 1, wherein in step S34, generating the geometric position embedding vector specifically comprises: acquiring the normalized spatial coordinates [x1, y1, x2, y2]; mapping each coordinate to an initial position vector using sine and cosine functions over a series of preset frequencies, with the formulas PE(u, 2i) = sin(u / T^(2i/d')) and PE(u, 2i+1) = cos(u / T^(2i/d')), wherein u is a coordinate value, i is a dimension index, d' is the length of a single coordinate vector, and T is a temperature coefficient; and splicing the initial position vectors of the four coordinates and mapping the result, through a learnable linear projection layer, to the same dimension as the spliced features to obtain the geometric position embedding vector.
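The sinusoidal formulas of claim 4 can be implemented directly; the sketch below covers everything up to the learnable linear projection, which is omitted. The default d' and T values are assumptions (the claim only calls T a "temperature coefficient"):

```python
import math

def coord_embedding(u, d_prime, T=10000.0):
    """Sinusoidal embedding of one normalized coordinate u in [0, 1]:
    PE(u, 2i) = sin(u / T^(2i/d')), PE(u, 2i+1) = cos(u / T^(2i/d'))."""
    pe = []
    for i in range(d_prime // 2):
        angle = u / (T ** (2 * i / d_prime))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

def box_embedding(box, d_prime=64, T=10000.0):
    """Splice the initial position vectors of the four coordinates
    [x1, y1, x2, y2]; a learnable linear projection (not shown) would
    then map this 4*d' vector to the spliced-feature dimension."""
    vec = []
    for u in box:
        vec.extend(coord_embedding(u, d_prime, T))
    return vec
```

Each coordinate yields a d'-dimensional vector, so the spliced result has length 4·d' before projection.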
- 5. The method of enhancing fine-grained perception of a visual language model as claimed in claim 1, wherein in step S5 the text instruction comprises a region placeholder, and the text response containing the region reference label is specifically a sequence in the format "< group > object name > < object > region index Token >".
- 6. The method of enhancing fine-grained perception of a visual language model according to claim 1, further comprising a model training step comprising: a first-stage alignment training, in which the parameters of the main visual encoder, the auxiliary visual encoder and the visual language model are frozen, and only the parameters of the simple feature pyramid module, the linear projection layer and the feature projection layer are trained, so as to align the region features with the language embedding space; and a second-stage fine-tuning training, in which the auxiliary visual encoder and the visual language model are unfrozen, and instruction fine-tuning is performed using mixed data comprising positive samples and negative samples; wherein a negative sample is an image-text pair that does not contain the target object, and its corresponding training label is structured as a text description that does not contain the region reference label.
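The freeze/unfreeze schedule of claim 6 can be summarized as a table of which modules receive gradient updates in each stage. The module labels below are illustrative, not identifiers from the patent:

```python
def trainable_modules(stage):
    """Per-stage trainability map for the two-stage training of claim 6."""
    modules = ["main_encoder", "aux_encoder", "sfp",
               "geo_projection", "feature_projection", "vlm"]
    if stage == 1:
        # Stage 1 alignment: both encoders and the VLM stay frozen; only
        # the newly added plug-in modules learn to align region features
        # with the language embedding space.
        train = {"sfp", "geo_projection", "feature_projection"}
    else:
        # Stage 2 instruction fine-tuning: the auxiliary encoder and the
        # VLM are unfrozen; the main encoder remains frozen.
        train = {"sfp", "geo_projection", "feature_projection",
                 "aux_encoder", "vlm"}
    return {m: (m in train) for m in modules}
```

Keeping the main encoder and VLM frozen in stage 1 is what lets the scheme reuse pre-trained semantic knowledge instead of retraining it, which is the cost and forgetting problem the Background attributes to full joint training.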
- 7. The method of claim 6, wherein the proportion of negative samples in the mixed data is configured as a preset ratio, and the mixed data further comprises general visual question-answering task data.
- 8. An apparatus for enhancing fine-grained perception of a visual language model, comprising: a main visual encoder and an auxiliary visual encoder, configured to receive an image to be processed and to output a global semantic feature map and fine-grained visual features, respectively; a mixed fine-grained region encoder, whose input is connected to the main visual encoder and the auxiliary visual encoder, configured to receive the global semantic feature map, the fine-grained visual features and a plurality of pieces of candidate region information, and to output mixed fine-grained region features corresponding to the candidate regions; a feature projection layer, namely a region-language connector, whose input is connected to the mixed fine-grained region encoder, configured to map the mixed fine-grained region features into region index Tokens; and a visual language model, whose inputs receive the region index Tokens, the image Tokens derived from the global semantic feature map, and the text Tokens derived from the text instruction, and which outputs a text response containing the region reference label.
- 9. The apparatus for enhancing fine-grained perception of a visual language model as claimed in claim 8, wherein the mixed fine-grained region encoder comprises: a simple feature pyramid module, configured to perform multi-scale transformation on the global semantic feature map to generate multi-layer main visual features; a region feature extraction module, configured to extract first region features and second region features from the multi-layer main visual features and the fine-grained visual features, respectively, based on the plurality of pieces of candidate region information; and a feature fusion module, configured to splice the first region features and the second region features, to generate a geometric position embedding vector from the spatial coordinates of the candidate region information using a linear projection layer, and to fuse the geometric position embedding vector element by element with the spliced features to generate the mixed fine-grained region features.
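The feature fusion module of claim 9 reduces to a concatenation followed by an element-wise addition. The sketch below works on flat vectors for clarity, and assumes the geometric embedding has already been projected to the spliced length (function name is illustrative):

```python
def fuse_region_features(first, second, geo_embed):
    """Splice the first and second region features, then fuse the
    geometric position embedding element by element (claim 9)."""
    spliced = first + second  # concatenation along the feature dimension
    assert len(spliced) == len(geo_embed), "projection must match spliced length"
    return [s + g for s, g in zip(spliced, geo_embed)]
```

Element-wise addition, rather than another concatenation, keeps the output dimension equal to the spliced feature dimension, so the downstream feature projection layer sees a fixed-size input per region.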
Description
Method and device for enhancing fine granularity perception of visual language model

Technical Field

The invention relates to the intersection of artificial intelligence and computer vision, and in particular to multi-modal large model technology. Specifically, the present invention relates to a method and apparatus for enhancing the fine-grained perception capability of Visual Language Models (VLMs), suitable for computer vision tasks such as object localization, region-grounded understanding, and visual region reasoning.

Background

In recent years, Visual Language Models (VLMs) have excelled at high-level semantic understanding tasks such as visual question answering (VQA) and image captioning by mapping visual features into the embedding space of a language model. However, existing general-purpose VLMs still show significant shortcomings on fine-grained perception tasks that require accurate spatial localization (e.g., object detection, visual grounding).

First, there is an architectural mismatch. Mainstream VLMs (e.g., LLaVA, Qwen-VL) are built on sequential language generation and output discrete text Tokens. When a model is required to directly regress an accurate floating-point coordinate sequence (e.g., [0.45, 0.32, ...]), it is highly prone to error: a prediction error in a single Token can invalidate the entire bounding box, and errors accumulate easily in multi-target scenes, resulting in low recall.

Second, existing improvement schemes have the following defects: 1. Coordinate quantization schemes (e.g., Pix2Seq) discretize coordinates into words in the vocabulary; this suffers from quantization error on high-resolution images and struggles with complex scenes where multiple instances overlap. 2. External prediction head schemes add an independent localization module (e.g., a DETR head) after the VLM output layer.
This not only increases inference latency but also requires designing complex task-specific loss functions, breaking the unified generation paradigm of the LLM. 3. Full-parameter joint training schemes perform end-to-end joint training of a detection model and the VLM; this fails to reuse the rich semantic understanding and world knowledge that pre-trained VLMs already possess, is extremely costly to train, and easily causes the model to forget its original general capabilities (catastrophic forgetting).

As a result, the localization performance of existing VLMs on standard benchmarks such as COCO is far below that of dedicated detection models (the recall of some models at the 72B parameter scale is even below 40%). How to improve the fine-grained perception capability of VLMs and achieve accurate target localization and region understanding, without losing the original high-level scene understanding and general reasoning capability, is the technical problem to be solved.

Disclosure of Invention

The invention addresses the above problems in the prior art and provides a method and a device for enhancing the fine-grained perception of a visual language model. By introducing a plug-in mixed fine-grained region encoder (HFRE) and a decoupled region reference generation mechanism, the invention significantly improves localization precision for small and dense targets while retaining the general semantic capability of the VLM.
To solve the above technical problems, the invention mainly adopts the following technical scheme. The method for enhancing the fine-grained perception of a visual language model comprises the following steps: S1, acquiring an image to be processed and a plurality of pieces of candidate region information corresponding to the image, wherein the candidate region information comprises normalized spatial coordinates of potential target objects in the image; S2, extracting a global semantic feature map from the image through a main visual encoder and, in parallel, extracting fine-grained visual features through an auxiliary visual encoder; S3, generating mixed fine-grained region features through a mixed fine-grained region encoder, specifically comprising: S31, performing scale transformation on the global semantic feature map output by the main visual encoder using a simple feature pyramid module to generate multiple layers of main visual features; S32, extracting first region features from the main visual features based on the plurality of pieces of candidate region information and, in parallel, extracting second region features from the fine-grained visual features; S33, splicing the first region features and the second region features; S34, generating a geometric position embedded vec