
CN-121661652-B - Visual positioning method, system and readable storage medium based on chain of thought

CN121661652B

Abstract

The application relates to the technical field of image processing, and in particular to a visual positioning method, system and readable storage medium based on a chain of thought. The method comprises: receiving an input target image and a query text, and preprocessing the target image to obtain a normalized image; inputting the query text into a large language model to obtain a text chain of thought, and extracting a first semantic subject from it; inputting the first semantic subject and the normalized image into a target detection model and a multimodal large model simultaneously to obtain a candidate box list, a global description and a subject region description; constructing an image chain of thought from the candidate box list, the global description and the subject region description; integrating the text chain of thought, the image chain of thought, the normalized image, the query text and a rule prompt according to a standard chain-of-thought construction strategy to build a standard chain of thought; and inputting the standard chain of thought into the multimodal large model to obtain a first visual positioning result. The method improves positioning accuracy while strengthening the handling of complex reasoning tasks.
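The preprocessing step above is detailed in claim 2 as a dual-path normalization: one path scales the image to a fixed resolution, the other preserves aspect ratio and pads to a square with gray. The following is a minimal sketch of that idea, not code from the patent; the 448-pixel target size, the gray value 128, and the top-left placement of the content (padding at bottom/right) are all assumptions, and nearest-neighbour resizing is used only to keep the example dependency-light.

```python
import numpy as np

def nn_resize(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize via integer index arrays (no external deps)."""
    h, w = img.shape[:2]
    rows = (np.arange(out_h) * h) // out_h
    cols = (np.arange(out_w) * w) // out_w
    return img[rows][:, cols]

def preprocess_dual_path(img: np.ndarray, fixed_res: int = 448, gray: int = 128):
    """Dual-path normalization sketch (fixed_res and gray are assumed values).

    Path 1: scale the whole image directly to fixed_res x fixed_res.
    Path 2: scale the long side to fixed_res, scale the short side by the
    same ratio, then pad to a square with a gray value.
    """
    # Path 1: direct scaling to the fixed resolution.
    first = nn_resize(img, fixed_res, fixed_res)

    # Path 2: aspect-preserving scaling plus gray padding.
    h, w = img.shape[:2]
    scale = fixed_res / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    resized = nn_resize(img, new_h, new_w)
    second = np.full((fixed_res, fixed_res, img.shape[2]), gray, dtype=img.dtype)
    # Content placed top-left; padding position is an assumption, not stated
    # in the claims.
    second[:new_h, :new_w] = resized
    return first, second
```

For a 100x200 input, path 1 yields a 448x448 image directly, while path 2 scales it to 224x448 and fills the remaining rows with gray.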

Inventors

  • KANG YASHU
  • XU MENG
  • YU JINMING
  • ZHU FANFAN
  • ZHONG CHAOWEN
  • FU CHENQIN
  • YU JIAYI

Assignees

  • Zhejiang Supcon Information Industry Co., Ltd. (浙江中控信息产业股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-02-04

Claims (10)

  1. A visual positioning method based on a chain of thought, comprising: S1, receiving a target image input by a user and a query text corresponding to the target image, and preprocessing the target image to obtain a normalized image; S2, inputting the query text into a preset large language model for structured analysis to obtain a corresponding text chain of thought, and extracting a first semantic subject from the text chain of thought; S3, inputting the first semantic subject and the normalized image into a preset target detection model to obtain a candidate box list for the normalized image based on the first semantic subject, and inputting the first semantic subject and the normalized image into a preset multimodal large model to obtain a global description and a subject region description, wherein the candidate box list comprises the coordinates, category and confidence score corresponding to each candidate bounding box in the image, the global description is a natural-language description of the overall scene of the normalized image and its spatial relations, and the subject region description is a natural-language description of the region most relevant to the first semantic subject; S4, constructing an image chain of thought based on the candidate box list, the global description and the subject region description; S5, structurally integrating the text chain of thought, the image chain of thought, the normalized image, the query text and a preset rule prompt according to a preset standard chain-of-thought construction strategy to construct a standard chain of thought, and inputting the standard chain of thought into the multimodal large model to obtain a first visual positioning result.
  2. The visual positioning method based on a chain of thought according to claim 1, wherein the normalized image comprises a first normalized image scaled to a fixed resolution and a second normalized image scaled with its aspect ratio preserved; the preprocessing of the target image then comprises: applying a dual-path preprocessing strategy, in which the target image is scaled to a preset fixed resolution to obtain the first normalized image, and the long side of the target image is adjusted to a preset fixed number of pixels, the short side is scaled by the same fixed ratio, and the image is then padded to a square with a gray value to obtain the second normalized image.
  3. The visual positioning method based on a chain of thought according to claim 1, wherein the first visual positioning result comprises a plurality of target detection boxes, each target detection box having corresponding coordinates, a category, a confidence score and a semantic interpretation; after step S5, the method further comprises: S61, sorting the target detection boxes by their confidence scores, and screening out a preset number of target detection boxes from high to low; S62, constructing iterative prompts based on all the screened target detection boxes, the coordinates, category, confidence score and semantic interpretation corresponding to each target detection box, and a preset prompt construction strategy; S63, inputting the iterative prompts and the target image into the multimodal large model to obtain an iteratively enhanced first visual positioning result, and repeating steps S61 to S63 until the IoU between the target detection boxes output in two consecutive rounds exceeds 0.9, then stopping the iteration to obtain a second visual positioning result, the second visual positioning result being the first visual positioning result after the final round of iterative enhancement.
  4. The visual positioning method based on a chain of thought according to claim 3, wherein S2 comprises: constructing a structured prompt corresponding to the query text based on the query text and a preset structured-prompt construction strategy; inputting the query text and the structured prompt into a preset large language model to generate a second semantic subject, a global summary and Chinese-English bilingual expressions corresponding to the query text, and constructing a text chain of thought based on the second semantic subject, the global summary and the Chinese-English bilingual expressions; and acquiring the first semantic subject from the text chain of thought, the first semantic subject being identical to the second semantic subject.
  5. The visual positioning method based on a chain of thought according to claim 1, wherein S3 comprises: inputting the first semantic subject and the normalized image into a preset target detection model to perform target detection on the normalized image based on the first semantic subject, so as to mark a plurality of candidate bounding boxes on the normalized image and obtain the coordinates, category and confidence score corresponding to each candidate bounding box; constructing a candidate box list from the coordinates, categories and confidence scores of all candidate bounding boxes; and inputting the first semantic subject and the normalized image into a preset multimodal large model to describe the normalized image globally in natural language, obtaining the global description corresponding to the normalized image, and to describe in natural language the region where the first semantic subject is located, obtaining the subject region description corresponding to the normalized image.
  6. The visual positioning method based on a chain of thought according to claim 3, wherein after step S63 the method further comprises: S7, performing semantic consistency verification on each target detection box in the second visual positioning result based on the query text, to judge whether each target detection box in the second visual positioning result is consistent with the description in the query text, wherein the semantic consistency verification computes the similarity between each target detection box and the query text based on the query text and the category and semantic interpretation corresponding to each target detection box; and if the similarity between any target detection box and the query text is greater than a preset similarity threshold, outputting the second visual positioning result.
  7. The visual positioning method based on a chain of thought according to claim 6, wherein S7 further comprises: if the similarity between every target detection box and the query text is smaller than the preset similarity threshold, inputting the candidate box list and all target detection boxes in the first and second visual positioning results into a preset multimodal large model to obtain a third visual positioning result.
  8. The visual positioning method based on a chain of thought according to claim 5, wherein the global description comprises an image domain, a scene type and an image description; describing the normalized image globally in natural language to obtain the global description then comprises: describing the normalized image globally in natural language to obtain the image description corresponding to the normalized image; determining the image domain corresponding to the normalized image based on the first semantic subject, the image description and a preset domain classification strategy; and determining the scene type corresponding to the normalized image based on the first semantic subject, the image description, the image domain and a preset scene classification strategy.
  9. A visual positioning system based on a chain of thought, comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the visual positioning method based on a chain of thought as claimed in any one of claims 1 to 8.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the visual positioning method based on a chain of thought as claimed in any one of claims 1 to 8.
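The confidence screening of S61 and the stopping criterion of S63 (stop when boxes output in two consecutive rounds overlap at IoU above 0.9) can be sketched as follows. This is not part of the patent text; it assumes boxes are given as (x1, y1, x2, y2) corner coordinates and that consecutive rounds output their boxes in matching order, neither of which the claims specify.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def top_k_by_confidence(detections, k):
    """S61 sketch: keep the k highest-confidence detections."""
    return sorted(detections, key=lambda d: d["confidence"], reverse=True)[:k]

def converged(prev_boxes, curr_boxes, thresh=0.9):
    """S63 stop test sketch: every box pair from two consecutive rounds
    must overlap at IoU greater than thresh."""
    if len(prev_boxes) != len(curr_boxes):
        return False
    return all(iou(p, c) > thresh for p, c in zip(prev_boxes, curr_boxes))
```

Under this reading, iteration continues while the model's output still shifts between rounds, and halts as soon as two consecutive rounds agree almost exactly, yielding the second visual positioning result.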

Description

Visual positioning method, system and readable storage medium based on chain of thought

Technical Field

The application relates to the technical field of image processing, in particular to a visual positioning method, system and readable storage medium based on a chain of thought.

Background

In recent years, with the rapid development of artificial intelligence technology, multimodal machine learning has made remarkable progress in fields such as visual understanding, human-machine interaction and intelligent decision making. Visual positioning (Referring Expression Comprehension, REC) is a critical task connecting natural language with visual space, aimed at precisely locating the corresponding object region in an image from a natural-language description, usually output in the form of a bounding box. The task is widely applied in scenarios such as image editing, robot navigation, assistive vision and intelligent monitoring, and is one of the core technologies for realizing language-guided vision. Traditional visual positioning approaches are mainly based on two-stage detection frameworks, such as Faster R-CNN combined with an attention mechanism, completing positioning through candidate region generation and cross-modal matching. However, such approaches rely on large amounts of annotated data and have certain limitations in complex semantic understanding. With the development of deep learning, and especially the rise of large-scale pre-trained models, multimodal large models (Multimodal Large Models, MLLMs) such as CLIP, BLIP, Qwen-VL, InternVL and Florence have shown strong cross-modal alignment and semantic understanding capability, and have gradually become the main technical route for visual positioning tasks.
The current mainstream multimodal visual positioning approach generally adopts an end-to-end single-stage reasoning architecture: the image and the text description are input into the model at the same time and the target bounding box is output directly. That is, the image is encoded into visual tokens and fused with the text tokens within the same Transformer architecture, cross-modal alignment is achieved using a cross-attention mechanism, and finally the bounding box coordinates are regressed by a decoder. In addition, some models oriented specifically to open-vocabulary detection embed language into the query vectors guiding a DETR-style detection head, and can thereby effectively identify the entities in a text description and generate corresponding bounding boxes through a semantics-driven detection mechanism. Although the above models perform well on standard test sets, their accuracy remains significantly deficient when handling high-order reasoning tasks such as complex spatial relationships, dynamic behavior descriptions, or multi-target reference resolution. According to public evaluation results, the average ACC@0.5 of current multimodal large models on the MARS2 dataset is only 0.42-0.45, far below the precision level of traditional visual tasks such as image classification or target detection. This suggests that existing approaches still face significant challenges in complex semantic understanding, context modeling, and reasoning robustness.
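The ACC@0.5 figure quoted above counts a prediction as correct when its IoU with the ground-truth box reaches 0.5. The following is a minimal sketch of that metric, not part of the patent; it assumes one predicted box per sample, paired in order with its ground-truth box, both given as (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def acc_at_iou(predictions, ground_truths, thresh=0.5):
    """ACC@thresh: fraction of samples whose predicted box overlaps the
    ground-truth box at IoU >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```

At this threshold, a box shifted by half its width already fails (IoU of about 0.33), which is why the 0.42-0.45 scores cited for complex reasoning benchmarks indicate that most predictions miss their targets substantially.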
Disclosure of the Invention

(I) Technical problem to be solved

In view of the above drawbacks and deficiencies of the prior art, the present application provides a visual positioning method, system and readable storage medium based on a chain of thought, which solve the technical problem that accuracy remains significantly deficient when current target detection approaches handle high-order reasoning tasks such as complex spatial relationships, dynamic behavior descriptions, or multi-target reference resolution.

(II) Technical scheme

In order to achieve the above purpose, the main technical scheme adopted by the application comprises the following. In a first aspect, an embodiment of the present application provides a visual positioning method based on a chain of thought, comprising: S1, receiving a target image input by a user and a query text corresponding to the target image, and preprocessing the target image to obtain a normalized image; S2, inputting the query text into a preset large language model for structured analysis to obtain a corresponding text chain of thought, and extracting a first semantic subject from the text chain of thought; S3, inputting the first semantic subject and the normalized image into a preset target detection model to obtain a candidate box list for the normalized image based on the first semantic subject; the candidate box list comprises the coordinates, category and confidence score corresponding to each candidate bounding box in the image, the global description is a natural-language description of the overall scene of the