CN-122024243-A - Pointing object text positioning method based on iterative semantic visual association

CN122024243A

Abstract

The invention discloses a pointing object text positioning method based on iterative semantic visual association. The method acquires a document image containing a pointing object together with a natural language instruction, extracts visual and language features, predicts the bounding box of the pointing object, and generates a pointing mask from which pointing context features are pooled. The pointing context features are concatenated with the instruction semantics to generate channel scaling coefficients, which modulate the visual features spatially and channel-wise; FiLM parameters generated from the instruction semantics further modulate the enhanced visual features. An initial target bounding box is predicted from cross-modal attention fusion features, after which an iterative refinement process begins: a target mask is generated from the current prediction box, a semantic-visual consistency score is computed, and the fusion features are modulated with geometric differences and semantic information to predict a better bounding box. Iteration continues until a termination condition is met, and the highest-scoring result is output. The invention enhances fine-grained spatial semantic understanding and high-precision iterative positioning capability.
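The iterative refinement described in the abstract can be sketched as a simple best-score loop. This is a minimal illustration with hypothetical `score_fn`/`refine_fn` stand-ins for the network components, not the patented implementation:

```python
def iterative_refine(initial_box, score_fn, refine_fn,
                     max_iters=5, min_improvement=1e-3):
    """Keep the box with the highest semantic-visual consistency score.

    score_fn(box) -> float and refine_fn(box) -> box are stand-ins
    (hypothetical names) for the scoring head and the modulated
    regression step described in the abstract.
    """
    best_box, best_score = initial_box, score_fn(initial_box)
    box, prev_score = initial_box, best_score
    for _ in range(max_iters):
        box = refine_fn(box)
        score = score_fn(box)
        if score > best_score:
            best_box, best_score = box, score
        # terminate when the improvement over the previous iteration is small
        if score - prev_score <= min_improvement:
            break
        prev_score = score
    return best_box, best_score
```

The loop mirrors the two termination conditions of the method: a maximum iteration count and a minimum per-iteration score improvement.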

Inventors

  • SONG HAOXUAN
  • ZHENG XIAOLONG
  • SHAO LIHUAN

Assignees

  • Hangzhou Dianzi University (杭州电子科技大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-10

Claims (10)

  1. A pointing object text positioning method based on iterative semantic visual association, characterized by comprising the following steps: S1, acquiring a document image containing a pointing object and a natural language instruction describing a target text area; S2, extracting visual features of the document image and language features of the natural language instruction, and extracting an instruction semantic vector; S3, predicting a bounding box of the pointing object based on the visual features and generating a pointing mask from the bounding box, so as to perform weighted pooling on the visual features and extract pointing context features; S4, concatenating the pointing context features with the instruction semantic vector, inputting the result into a two-layer perceptron network to generate channel scaling coefficients, and performing spatial modulation and channel modulation on the visual features using the pointing mask and the channel scaling coefficients to obtain enhanced visual features; S5, generating FiLM modulation parameters based on the instruction semantic vector and performing channel modulation on the enhanced visual features, taking the modulated visual features as keys and values and the language features as queries, performing cross-modal attention fusion, and predicting an initial target bounding box; S6, performing an iterative refinement process taking the initial target bounding box as a starting point, wherein the iterative refinement process comprises: generating a target mask from the current target bounding box and calculating a semantic-visual consistency score in combination with the current fusion features; when the score improvement exceeds a preset threshold, generating channel modulation parameters from the geometric feature differences between the current target bounding box and the pointing object bounding box, the semantic information of the spatial relation class and the target object type, and the instruction semantic vector; predicting the target bounding box of the next iteration by regression after the current fusion features are enhanced; and repeating the iteration until a termination condition is met, outputting the target bounding box with the highest semantic-visual consistency score as the final positioning result.
  2. The pointing object text positioning method based on iterative semantic visual association according to claim 1, wherein generating the pointing mask in step S3 comprises: calculating the diagonal length of the bounding box of the pointing object, expanding the bounding box using a preset ratio of the diagonal length as the expansion radius, and generating the expanded pointing mask.
  3. The pointing object text positioning method based on iterative semantic visual association according to claim 2, wherein the preset ratio is 0.3.
  4. The pointing object text positioning method based on iterative semantic visual association according to claim 1, wherein in step S4 the modulation of the visual features with the pointing mask and the channel scaling coefficients is achieved by the following formula: F_enh = F ⊙ (1 + α · B(M)) ⊙ (1 + β · B(s)), wherein F is the visual feature, M is the pointing mask, s is the channel scaling coefficient, B(·) denotes the broadcast operation, α and β are learnable variables, ⊙ denotes element-wise multiplication, and F_enh is the enhanced visual feature.
  5. The pointing object text positioning method based on iterative semantic visual association according to claim 1, wherein in step S2 extracting the instruction semantic vector from the language features comprises extracting the vector corresponding to the CLS token output by the language encoder and projecting it through a linear projection layer to obtain the instruction semantic vector.
  6. The pointing object text positioning method based on iterative semantic visual association according to claim 1, wherein the spatial relation class and the target object type are obtained by inputting the instruction semantic vector into a spatial relation classifier and an object type classifier to obtain the probability distribution of the spatial relation class and the probability distribution of the target object type; the spatial relation class comprises current, previous and next, and the target object type comprises word, line and paragraph.
  7. The pointing object text positioning method based on iterative semantic visual association according to claim 1, wherein in step S6 the semantic-visual consistency score is calculated based on the target mask and the fusion features of the current iteration, comprising: performing mask-weighted pooling on the fusion features of the current iteration using the target mask to obtain local features; performing global pooling on the fusion features to obtain global features; and concatenating the local features, the global features and the instruction semantic vector, inputting the result into a multi-layer perceptron composed of fully connected layers, and applying a Sigmoid activation function to obtain the semantic-visual consistency score.
  8. The pointing object text positioning method based on iterative semantic visual association according to claim 1, wherein in step S6 generating the channel modulation parameters for channel modulation comprises: calculating the difference between the center coordinates of the current target bounding box and the pointing object bounding box, and the logarithmic ratio of their widths and heights, to form a geometric descriptor; concatenating the geometric descriptor, the probability distribution of the spatial relation class, the probability distribution of the target object type and the instruction semantic vector; and obtaining, through nonlinear mapping, FiLM parameters for channel scaling and channel bias as the channel modulation parameters, wherein the channel scaling parameters are processed by a hyperbolic tangent function.
  9. The pointing object text positioning method based on iterative semantic visual association according to claim 1, wherein the iterative refinement process terminates when the number of iterations reaches a preset maximum, or when the improvement of the semantic-visual consistency score over two consecutive iterations does not exceed the preset threshold.
  10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the pointing object text positioning method based on iterative semantic visual association according to any one of claims 1 to 9.
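Claims 2 and 3 expand the pointing-object box by 0.3 of its diagonal length before rasterizing the mask. A minimal sketch of that expansion (coordinates in pixels; the function name and rasterization details are my assumptions):

```python
import math
import numpy as np

def expanded_pointing_mask(box, image_hw, ratio=0.3):
    """Expand box (x1, y1, x2, y2) by ratio * diagonal, rasterize a binary mask."""
    x1, y1, x2, y2 = box
    r = ratio * math.hypot(x2 - x1, y2 - y1)  # expansion radius from diagonal
    h, w = image_hw
    ex1, ey1 = max(0, int(x1 - r)), max(0, int(y1 - r))
    ex2 = min(w, int(math.ceil(x2 + r)))
    ey2 = min(h, int(math.ceil(y2 + r)))
    mask = np.zeros((h, w), dtype=np.float32)
    mask[ey1:ey2, ex1:ex2] = 1.0  # expanded rectangular region
    return mask
```

Tying the radius to the diagonal makes the expanded context scale with the size of the pointing object rather than using a fixed pixel margin.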
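Claim 4 combines a spatial term from the pointing mask with a per-channel scale. Since the claim's original formula image is not reproduced in this text, the sketch below shows one common FiLM-style "1 + gain" form as an assumption, with fixed values standing in for the learnable variables:

```python
import numpy as np

def modulate(features, mask, channel_scale, alpha=1.0, beta=1.0):
    """features: (C, H, W); mask: (H, W); channel_scale: (C,).

    Spatial modulation broadcasts the mask over channels; channel
    modulation broadcasts the scale over space. alpha and beta play
    the role of the claim's learnable variables (illustrative values).
    """
    spatial = 1.0 + alpha * mask[None, :, :]             # (1, H, W) broadcast
    channel = 1.0 + beta * channel_scale[:, None, None]  # (C, 1, 1) broadcast
    return features * spatial * channel
```

The "1 +" residual form leaves unmasked, unscaled features unchanged, which keeps the modulation a refinement rather than a gating that can zero out the input.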
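Claim 7's semantic-visual consistency score pools the fused features locally (mask-weighted) and globally, concatenates them with the instruction vector, and maps the result through a perceptron with a Sigmoid output. A sketch with explicit weight arguments (the single-hidden-layer shape is an assumption; the claim only specifies fully connected layers plus Sigmoid):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def consistency_score(fused, target_mask, instr_vec, w1, b1, w2, b2):
    """fused: (C, H, W); target_mask: (H, W); instr_vec: (D,)."""
    wsum = target_mask.sum()
    # mask-weighted pooling over the target region -> local features (C,)
    local = (fused * target_mask[None]).sum(axis=(1, 2)) / max(wsum, 1e-6)
    # global average pooling -> global features (C,)
    global_ = fused.mean(axis=(1, 2))
    x = np.concatenate([local, global_, instr_vec])
    h = np.maximum(w1 @ x + b1, 0.0)  # ReLU hidden layer
    return float(sigmoid(w2 @ h + b2))
```

Contrasting local against global statistics lets the score penalize boxes whose contents look no more instruction-consistent than the page as a whole.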
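Claim 8's geometric descriptor is the center offset between the current target box and the pointing-object box plus logarithmic width and height ratios. A minimal sketch (function name is mine):

```python
import math

def geometric_descriptor(target_box, pointer_box):
    """Boxes are (x1, y1, x2, y2); returns (dx, dy, log w-ratio, log h-ratio)."""
    def center_wh(b):
        x1, y1, x2, y2 = b
        return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1
    tcx, tcy, tw, th = center_wh(target_box)
    pcx, pcy, pw, ph = center_wh(pointer_box)
    return (tcx - pcx, tcy - pcy, math.log(tw / pw), math.log(th / ph))
```

Log ratios make the size terms symmetric (doubling and halving give ±log 2), a common choice in bounding-box regression parameterizations.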

Description

Pointing object text positioning method based on iterative semantic visual association

Technical Field

The invention belongs to the technical field of document image understanding and interactive text positioning, and particularly relates to a pointing object text positioning method based on iterative semantic visual association.

Background

With the development of computer vision technology, text detection and recognition in images has become an important research direction. Visual text positioning (Visual Grounding) aims at locating a target region in an image according to a natural language description, and is widely applied in fields such as human-computer interaction. Existing methods fall mainly into two-stage and one-stage approaches: two-stage methods first generate candidate regions and then match them, which is computationally expensive and sensitive to missed detections; one-stage methods predict the bounding box directly on fused features, which is more efficient but has limited capacity for modeling complex spatial relationships.

However, the above methods are mainly directed at objects in natural scenes that are sparse and visually distinctive (such as vehicles and pedestrians). When applied to the specific task of pointer-based interactive text positioning (Pointer-based Document Text Grounding, PDTG), they have significant disadvantages:

1. Texts are dense and similar. Words, lines and paragraphs in a document image are densely arranged with highly similar visual features, so adjacent text units are difficult to distinguish and the positioning box easily covers a wrong region.

2. PDTG instructions often contain complex spatial relationships referring to the pointing position, such as "the previous word" or "the next paragraph", and existing methods lack effective modeling of the hierarchical document structure and relative positional relationships.

3. The positioning accuracy requirement is extremely high. PDTG requires bounding box accuracy approaching the character level; a box that slightly includes adjacent characters or lines is considered an error. The IoU threshold commonly used in existing methods (e.g., 0.5) cannot satisfy such high-precision requirements, and these methods lack a gradual optimization mechanism.

In addition, although conventional Optical Character Recognition (OCR) technology can detect all texts, it cannot perform targeted positioning by combining the position of the pointing object with the language instruction, and is therefore not suitable for PDTG tasks.

In view of the foregoing, a new method is needed that fuses pointing information and language instructions to achieve fine-grained spatial semantic understanding and high-precision iterative positioning in dense document images, so as to solve the above bottlenecks in the pointing object interactive text positioning task.

Disclosure of Invention

The invention provides a pointing object text positioning method based on iterative semantic visual association, which aims to solve the problems that existing visual text positioning methods are difficult to deeply fuse pointing information and language instructions in dense document images, lack fine-grained spatial semantic understanding capability, and have insufficient positioning precision.
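The background argues that an IoU threshold of 0.5 is too loose for character-level document grounding. For reference, intersection-over-union between two axis-aligned boxes:

```python
def iou(a, b):
    """Intersection-over-union of boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```

A box shifted by a full character width can still exceed IoU 0.5 against the ground truth, which illustrates why the method relies on an iterative score-driven refinement instead of a single thresholded prediction.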
The invention provides a pointing object text positioning method based on iterative semantic visual association, comprising the following steps:

S1, acquiring a document image containing a pointing object and a natural language instruction describing a target text area;

S2, extracting visual features of the document image through a visual encoder, extracting language features of the natural language instruction through a language encoder, and extracting an instruction semantic vector from the language features;

S3, predicting a bounding box of the pointing object through a regression prediction head based on the visual features, generating a pointing mask based on the bounding box of the pointing object, performing weighted pooling on the visual features using the pointing mask, and extracting pointing context features;

S4, concatenating the pointing context features with the instruction semantic vector, inputting the result into a two-layer perceptron network, and processing it through a Sigmoid activation function to generate channel scaling coefficients;

S5, generating a channel scaling parameter and a channel bias parameter as FiLM modulation parameters through two independent learnable mapping functions based on the instruction semantic vector, and performing channel modulation on the enhanced visual features using the FiLM modulation parameters; taking the modulated visual features as keys and values and the language features as queries, performing cross-modal attention calculation to obtain fusion features;

and S6, performing an iterative refinement process