CN-121564448-B - Parallel vision detection method and device based on large language model and electronic equipment
Abstract
The invention discloses a parallel visual detection method, a device and electronic equipment based on a large language model, relating to the technical field of image detection. The method comprises: obtaining a visual feature sequence representing an image to be detected; constructing learnable visual query vectors; splicing the visual feature sequence with the visual query vectors and inputting them, together with a natural language input sequence of the target to be detected, into a large language model; updating the visual query vectors and generating, in an autoregressive mode, a category token of the target to be detected and a semantic feature representation describing the category token; fusing each visual query vector with the semantic feature representation via a detection head to obtain target features fused with semantic information; and processing the target features through prediction branches to output a plurality of detection results in parallel. With this method, category identification and position localization of a plurality of targets can be completed in parallel, improving both overall detection efficiency and detection performance.
Inventors
- SUN JUN
- DING HAISONG
- JIA WEI
- LIU HAIFENG
Assignees
- 合肥中科类脑智能技术有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260123
Claims (8)
- 1. A parallel visual detection method based on a large language model, comprising: performing visual encoding on an image to be detected to obtain a visual feature sequence representing the image to be detected; constructing a group of learnable visual query vectors for characterizing, in parallel, spatial information of targets to be detected in the image to be detected; splicing the visual feature sequence with the visual query vectors, inputting them into a pre-trained large language model together with a natural language input sequence of the target to be detected, updating the visual query vectors, and generating, in an autoregressive mode, a category token of the target to be detected and a semantic feature representation describing the category token, wherein generating the semantic feature representation describing the category token comprises: introducing a predefined placeholder token after the category token is generated and extracting hidden-layer features of the placeholder token in the large language model as the semantic feature representation describing the category token, or extracting hidden-layer features, in the large language model, of the category word token corresponding to the generated category token as a semantic feature representation independent of the category token decoding result; fusing each visual query vector with the semantic feature representation based on a detection head in the large language model to obtain target features fused with semantic information, wherein the semantic feature representation participates in feature interaction with the visual query vectors as key input and value input, so as to impose a category semantic constraint on the visual query vectors based on the semantic feature representation; and processing the target features based on prediction branches in the large language model and outputting a plurality of detection results in parallel,
wherein each detection result at least comprises probability information of the category to which the target to be detected belongs and target bounding box coordinates, and the target bounding box coordinates are predicted by continuous-value regression through an independent regression branch.
- 2. The large language model based parallel visual detection method of claim 1, wherein splicing the visual feature sequence with the visual query vectors comprises: splicing the visual query vectors, as a prefix sequence or a suffix sequence, with the visual feature sequence.
- 3. The large language model based parallel visual detection method of claim 1, wherein, before fusing each of the visual query vectors with the semantic feature representation, the method further comprises: performing parallel self-attention processing on the updated visual query vectors.
- 4. The large language model based parallel visual detection method of claim 3, wherein fusing each of the visual query vectors with the semantic feature representation comprises: taking the visual query vectors after interaction as query terms and the semantic feature representation as key input and value input, and performing cross-attention computation to obtain the target features.
- 5. The large language model based parallel visual detection method of claim 3, wherein fusing each of the visual query vectors with the semantic feature representation further comprises: performing feature splicing, element-wise addition, or fusion through a gating mechanism on the visual query vectors and the semantic feature representation.
- 6. The large language model based parallel visual detection method of claim 1, further comprising, during training of the large language model: supervising, with a cross-entropy loss, the intermediate tokens and the category tokens generated by the large language model, wherein the intermediate tokens are the generated tokens in the feature sequence output by the large language model other than the visual query vectors, the category tokens, and the semantic feature representations corresponding to the category tokens; and introducing a detection loss through the detection results output by the detection head, and performing joint back-propagation optimization on the updating of the visual query vectors and on the model parameters related to the semantic feature representation, wherein the detection loss comprises a classification loss and a regression loss after sample assignment based on a matching strategy, the classification loss constraining the target category prediction results and the regression loss constraining the target bounding box coordinate prediction results.
- 7. A parallel visual detection device based on a large language model, comprising: a visual encoding module for visually encoding an image to be detected to obtain a visual feature sequence representing the image to be detected; a visual query vector construction module for constructing a group of learnable visual query vectors to characterize, in parallel, spatial information of targets to be detected in the image to be detected; and a detection module for splicing the visual feature sequence with the visual query vectors, inputting them into a pre-trained large language model together with a natural language input sequence of the target to be detected, updating the visual query vectors, and generating, in an autoregressive mode, a category token of the target to be detected and a semantic feature representation describing the category token, wherein generating the semantic feature representation describing the category token comprises introducing a predefined placeholder token after the category token is generated and extracting hidden-layer features of the placeholder token in the large language model as the semantic feature representation describing the category token, or extracting hidden-layer features, in the large language model, of the category word token corresponding to the generated category token as a semantic feature representation independent of the category token decoding result; the detection module being further configured to fuse each visual query vector with the semantic feature representation based on a detection head in the large language model to obtain target features fused with semantic information, and to process the target features based on prediction branches in the large language model and output a plurality of detection results in parallel, wherein each detection result at least comprises probability information of the category to which the target to be detected belongs and target bounding box coordinates, and the target bounding box coordinates are predicted by continuous-value regression through an independent regression branch.
- 8. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the large language model based parallel visual detection method of any one of claims 1 to 6.
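The fusion described in claims 1 and 4 — visual query vectors attending to the semantic feature representation as key and value, followed by parallel classification and continuous box-regression branches — can be sketched as follows. This is a minimal illustrative numpy sketch, not the patented embodiment; all dimensions, class counts, and weight matrices are assumed for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, semantic_feats):
    # queries: (Q, d) updated visual query vectors (the "query" term)
    # semantic_feats: (S, d) hidden features of the category/placeholder
    # tokens, serving as both key input and value input
    scores = queries @ semantic_feats.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)   # (Q, S) attention over semantics
    return weights @ semantic_feats      # (Q, d) target features fused
                                         # with semantic information

rng = np.random.default_rng(0)
Q, S, d = 5, 3, 16                       # assumed sizes for illustration
queries = rng.standard_normal((Q, d))
semantic = rng.standard_normal((S, d))
fused = cross_attention(queries, semantic)

# parallel prediction branches: class logits and continuous box coordinates
W_cls = rng.standard_normal((d, 10))     # assumed 10 categories
W_box = rng.standard_normal((d, 4))
cls_logits = fused @ W_cls                     # (Q, 10) per-query class scores
boxes = 1 / (1 + np.exp(-(fused @ W_box)))     # (Q, 4) continuous values in (0, 1)
```

All Q queries are processed in one matrix multiplication, which is what allows the detection results to be emitted in parallel rather than token by token.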
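The joint training objective in claim 6 — cross-entropy supervision on generated tokens plus a detection loss combining classification and box-regression terms — can be illustrated with the following sketch. It is a simplified assumption-laden example: the matching strategy is reduced to an assumed identity assignment, and L1 is used as a stand-in regression loss (the claim does not name a specific regression loss).

```python
import numpy as np

def cross_entropy(logits, targets):
    # logits: (N, V) unnormalized scores; targets: (N,) integer class/token ids
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)

# token supervision: cross-entropy on intermediate and category tokens
token_logits = rng.standard_normal((6, 100))     # assumed vocab of 100
token_targets = rng.integers(0, 100, size=6)
token_loss = cross_entropy(token_logits, token_targets)

# detection loss after (assumed identity) sample matching
cls_logits = rng.standard_normal((5, 10))        # per-query class logits
cls_targets = rng.integers(0, 10, size=5)        # matched ground-truth classes
pred_boxes = rng.random((5, 4))
gt_boxes = rng.random((5, 4))                    # matched ground-truth boxes
cls_loss = cross_entropy(cls_logits, cls_targets)    # classification term
reg_loss = np.abs(pred_boxes - gt_boxes).mean()      # L1 regression term

# joint objective back-propagated through query updates and semantic features
total_loss = token_loss + cls_loss + reg_loss
```

In practice a Hungarian-style matcher would assign predictions to ground truths before the classification and regression terms are computed; the identity assignment above only keeps the sketch short.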
Description
Parallel vision detection method and device based on large language model and electronic equipment
Technical Field
The present invention relates to the field of image detection technologies, and in particular to a parallel visual detection method and apparatus based on a large language model, and an electronic device.
Background
With the development of multi-modal artificial intelligence technology, Visual Language Models (VLMs), formed by combining a Large Language Model (LLM) with a visual encoder, are widely applied to tasks such as image understanding, visual question answering, and cross-modal retrieval. On this basis, some research has attempted to introduce the target detection task into the VLM framework: by designing natural language prompts, the model is guided to output target categories and target positions in the image in the form of a text sequence, achieving open-vocabulary detection and zero-shot transfer capability to a certain extent. Such methods generally use a visual encoder to extract features from the input image, input the obtained visual features and text prompts into a large language model, have the model generate text tokens describing the detection results one by one in an autoregressive mode, and parse the generated text into corresponding detection boxes. However, this target detection method based on the language generation paradigm essentially relies on the autoregressive generation mechanism inherent to large language models: the detection results must be output token by token in sequence, making it difficult to predict a plurality of targets in parallel.
When the number of targets in the scene to be detected is large, a large number of tokens must be generated during inference, significantly increasing the overall inference latency and making it difficult to meet application scenarios with high real-time requirements. Meanwhile, because the output space of a large language model is a discrete vocabulary, continuous target bounding box coordinates usually have to be represented through discrete encoding or by introducing special coordinate tokens. Such a representation inevitably introduces quantization error, which limits the detection results in high-precision localization tasks; localization accuracy is particularly hard to improve further in application scenarios with higher requirements on bounding box overlap.
Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art. Therefore, the invention aims to provide a parallel visual detection method, a device, and electronic equipment based on a large language model, so as to improve the parallel inference efficiency of multi-target detection and the target localization accuracy while maintaining the language semantic alignment capability.
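The quantization error introduced by representing continuous coordinates with a discrete coordinate-token vocabulary, as discussed above, can be shown with a minimal numeric sketch (the bin count of 1000 is an assumed example, not taken from the patent):

```python
num_bins = 1000
x = 0.34567                          # a normalized box coordinate in [0, 1]
token = round(x * (num_bins - 1))    # encode the coordinate as a discrete token id
x_decoded = token / (num_bins - 1)   # decode the token back to a coordinate
quant_error = abs(x - x_decoded)     # nonzero in general: bounded only by
                                     # half the bin width, ~0.0005 here

# a continuous regression branch predicts x directly as a real value,
# so no encode/decode rounding step (and hence no quantization error) occurs
```

This per-coordinate error compounds across the four box coordinates, which is why bounding-box overlap metrics with strict thresholds suffer under discrete coordinate tokenization.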
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a parallel visual detection method based on a large language model, including: performing visual encoding on an image to be detected to obtain a visual feature sequence representing the image to be detected; constructing a group of learnable visual query vectors for characterizing, in parallel, spatial information of targets to be detected in the image to be detected; splicing the visual feature sequence with the visual query vectors, inputting them into a pre-trained large language model together with a natural language input sequence of the target to be detected, updating the visual query vectors, and generating, in an autoregressive mode, a category token of the target to be detected and a semantic feature representation describing the category token; fusing each visual query vector with the semantic feature representation based on a detection head in the large language model to obtain target features fused with semantic information; and processing the target features based on prediction branches in the large language model and outputting a plurality of detection results in parallel, wherein each detection result at least comprises probability information of the category to which the target to be detected belongs and target bounding box coordinates. In addition, the method of the above embodiment of the present invention may further have the following additional technical features. According to one embodiment of the invention, generating a semantic feature representation describing the category token comprises: after the category token is generated, introducing a predefined placeholder token, and extracting hidden-layer features of the placeholder token in the large language model as the semantic feature representation describing the category token. According to one embodiment of the invention, generating a se