
CN-121999493-A - Multi-modal spatial perception method, device and equipment based on prompt word expansion, and storage medium

CN121999493A

Abstract

The application discloses a multi-modal spatial perception method, apparatus, device, and storage medium based on prompt word expansion, relating to the technical field of computer vision. The method comprises: receiving an original image and an original prompt word, performing semantic expansion on the original prompt word, and generating an enhanced prompt word set; performing visual detection on the original image using the enhanced prompt word set to generate a plurality of candidate detection boxes; generating a weighted image from the candidate detection boxes and the original image; and performing semantic recheck on the candidate detection boxes according to the weighted image to obtain target detection boxes. The application can improve semantic understanding capability and thereby improve the accuracy of target detection.

Inventors

  • LIU LEI
  • SHI TUO
  • YUAN XIAOYI
  • XU MING

Assignees

  • 深圳力维智联技术有限公司 (Shenzhen ZNV Technology Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2025-12-25

Claims (10)

  1. A multi-modal spatial perception method based on prompt word expansion, characterized by comprising the following steps: receiving an original image and an original prompt word, performing semantic expansion on the original prompt word, and generating an enhanced prompt word set; performing visual detection on the original image using the enhanced prompt word set to generate a plurality of candidate detection boxes; generating a weighted image from the candidate detection boxes and the original image; and performing semantic recheck on the candidate detection boxes according to the weighted image to obtain target detection boxes.
  2. The method of claim 1, wherein the step of performing semantic expansion on the original prompt word to generate the enhanced prompt word set comprises: extracting a core entity from the original prompt word; generating a candidate expansion word set based on the core entity; and performing semantic-similarity screening on the candidate expansion word set based on the core entity to obtain the enhanced prompt word set.
  3. The method of claim 2, wherein the step of performing semantic-similarity screening on the candidate expansion word set based on the core entity to obtain the enhanced prompt word set comprises: obtaining a core-entity text feature vector from the core entity, and obtaining an expansion-word text feature vector for each word in the candidate expansion word set; calculating the cosine similarity between the core-entity text feature vector and each expansion-word text feature vector; and screening a preset number of expansion words from the candidate expansion word set according to the cosine similarity, combining them with the original prompt word to obtain the enhanced prompt word set.
  4. The method of claim 1, wherein the step of performing visual detection on the original image using the enhanced prompt word set to generate a plurality of candidate detection boxes comprises: encoding the enhanced prompt word set into text feature vectors; extracting visual features of the original image; and calculating the cross-modal similarity between the text feature vectors and the visual features, and generating a plurality of candidate detection boxes according to the cross-modal similarity.
  5. The method of claim 1, wherein the step of generating a weighted image from the candidate detection boxes and the original image comprises: generating a soft mask map based on the original image and the position information of the candidate detection boxes; and performing weighted fusion of the soft mask map and the original image to generate the weighted image.
  6. The method of claim 5, wherein generating a soft mask map based on the original image and the position information of the candidate detection boxes comprises: calculating a plurality of pixel weight values based on the candidate detection boxes; initializing a background mask of the same size as the original image; and superposing the pixel weight values on the background mask to generate the soft mask map.
  7. The method of claim 1, wherein the step of performing semantic recheck on the candidate detection boxes according to the weighted image to obtain target detection boxes comprises: inputting the weighted image and the original prompt word into a preset multi-modal large model to obtain recheck detection boxes output by the preset multi-modal large model; calculating the degree of overlap between the recheck detection boxes and the candidate detection boxes; and screening the candidate detection boxes according to the degree of overlap to obtain the target detection boxes.
  8. A multi-modal spatial perception apparatus based on prompt word expansion, the apparatus comprising: an expansion module for receiving an original image and an original prompt word, performing semantic expansion on the original prompt word, and generating an enhanced prompt word set; a detection module for performing visual detection on the original image using the enhanced prompt word set to generate a plurality of candidate detection boxes; a generation module for generating a weighted image from the candidate detection boxes and the original image; and a recheck module for performing semantic recheck on the candidate detection boxes according to the weighted image to obtain target detection boxes.
  9. A multi-modal spatial perception device based on prompt word expansion, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the multi-modal spatial perception method based on prompt word expansion of any one of claims 1 to 7.
  10. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the multi-modal spatial perception method based on prompt word expansion of any one of claims 1 to 7.
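Read as an algorithm, the four steps of claim 1 can be sketched as a minimal pipeline. The `expand`, `detect`, `reweight`, and `recheck` callables below are hypothetical placeholders standing in for the prompt-expansion model, the open-vocabulary detector, the soft-mask fusion, and the multi-modal large model reviewer; the patent does not specify concrete models, and the IoU threshold is an assumed value.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def perceive(image, prompt, expand, detect, reweight, recheck, iou_thresh=0.5):
    """Sketch of the four steps of claim 1; all injected callables
    are placeholders, not components named in the source."""
    prompts = [prompt] + expand(prompt)       # step 1: enhanced prompt set
    candidates = detect(image, prompts)       # step 2: candidate boxes
    weighted = reweight(image, candidates)    # step 3: weighted image
    reviewed = recheck(weighted, prompt)      # step 4: recheck boxes
    # keep candidates whose overlap with a recheck box passes the threshold
    return [c for c in candidates
            if any(iou(c, r) >= iou_thresh for r in reviewed)]
```

The overlap-based screening at the end corresponds to claim 7: candidate boxes survive only if the multi-modal reviewer independently localizes a sufficiently overlapping region.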

Description

Multi-modal spatial perception method, device and equipment based on prompt word expansion, and storage medium

Technical Field

The present application relates to the field of computer vision, and in particular to a multi-modal spatial perception method, apparatus, device, and storage medium based on prompt word expansion.

Background

With the rapid development of artificial intelligence and multi-modal technology, joint vision-language understanding tasks are widely applied in fields such as open-scene recognition, intelligent security, autonomous driving, and human-machine interaction. Current mainstream detection methods achieve cross-modal alignment from language to vision and have strong spatial localization capability in open-set object detection. However, because they rely on fixed text embeddings and a static semantic matching mechanism, their understanding of complex semantics, polysemous references, or hypernym-hyponym relationships is weak, and semantic misjudgments or omissions easily occur, so the accuracy of target detection is low. How to improve the accuracy of target detection therefore remains a problem to be solved.

The foregoing is provided merely to facilitate understanding of the technical solutions of the present application and is not an admission that it constitutes prior art.

Disclosure of Invention

The main purpose of the present application is to provide a multi-modal spatial perception method, apparatus, device, and storage medium based on prompt word expansion, aiming to solve the technical problem of how to improve the accuracy of target detection.
To achieve the above purpose, the present application provides a multi-modal spatial perception method based on prompt word expansion, the method comprising: receiving an original image and an original prompt word, performing semantic expansion on the original prompt word, and generating an enhanced prompt word set; performing visual detection on the original image using the enhanced prompt word set to generate a plurality of candidate detection boxes; generating a weighted image from the candidate detection boxes and the original image; and performing semantic recheck on the candidate detection boxes according to the weighted image to obtain target detection boxes.

In one embodiment, the step of performing semantic expansion on the original prompt word to generate the enhanced prompt word set includes: extracting a core entity from the original prompt word; generating a candidate expansion word set based on the core entity; and performing semantic-similarity screening on the candidate expansion word set based on the core entity to obtain the enhanced prompt word set.

In one embodiment, the step of performing semantic-similarity screening on the candidate expansion word set based on the core entity to obtain the enhanced prompt word set includes: obtaining a core-entity text feature vector from the core entity, and obtaining an expansion-word text feature vector for each word in the candidate expansion word set; calculating the cosine similarity between the core-entity text feature vector and each expansion-word text feature vector; and screening a preset number of expansion words from the candidate expansion word set according to the cosine similarity, combining them with the original prompt word to obtain the enhanced prompt word set.
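The cosine-similarity screening step described above can be sketched as follows, assuming the core entity and the candidate expansion words have already been embedded as vectors (the source does not specify the embedding model, and the function name and signature are illustrative):

```python
import numpy as np

def screen_expansions(core_vec, cand_vecs, cand_words, k):
    """Keep the k expansion words whose embeddings are most
    cosine-similar to the core-entity embedding (claim 3).

    core_vec:   (d,) embedding of the core entity
    cand_vecs:  (n, d) embeddings of the candidate expansion words
    cand_words: list of n candidate words
    """
    core = core_vec / np.linalg.norm(core_vec)
    cands = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = cands @ core                   # cosine similarity per candidate
    top = np.argsort(sims)[::-1][:k]      # indices of the k most similar
    return [cand_words[i] for i in top]
```

Per the claim, the screened words would then be combined with the original prompt word to form the enhanced prompt word set.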
In one embodiment, the step of performing visual detection on the original image using the enhanced prompt word set to generate a plurality of candidate detection boxes includes: encoding the enhanced prompt word set into text feature vectors; extracting visual features of the original image; and calculating the cross-modal similarity between the text feature vectors and the visual features, and generating a plurality of candidate detection boxes according to the cross-modal similarity.

In one embodiment, the step of generating a weighted image from the candidate detection boxes and the original image includes: generating a soft mask map based on the original image and the position information of the candidate detection boxes; and performing weighted fusion of the soft mask map and the original image to generate the weighted image.

In one embodiment, the step of generating a soft mask map based on the original image and the position information of the candidate detection boxes includes: calculating a plurality of pixel weight values based on the candidate detection boxes; initializing a background mask of the same size as the original image; and superposing the pixel weight values on the background mask to generate the soft mask map.

In one embodiment, the step of performing semantic recheck on the candidate detection boxes according to the weighted image to obtain target detection boxes includes: inputting the weighted image and the original prompt word into a preset multi-modal large model to obtain recheck detection boxes output by the preset multi-modal large model.
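The soft-mask and weighted-fusion steps (claims 5 and 6) can be sketched as below. The source does not give the pixel weight values or the fusion rule, so the flat foreground/background weights `fg`/`bg` and the blending coefficient `alpha` are assumptions for illustration; an actual implementation might use a smoother (e.g. distance-based) weighting inside each box.

```python
import numpy as np

def soft_mask(shape, boxes, fg=1.0, bg=0.2):
    """Background mask initialised to bg over the whole image, with
    each candidate box region raised to fg (claim 6). fg/bg are
    illustrative values, not given in the source."""
    mask = np.full(shape, bg, dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = fg
    return mask

def weighted_image(image, boxes, alpha=0.7):
    """Weighted fusion of the soft mask and an H x W x C original
    image (claim 5); alpha is an assumed fusion weight."""
    mask = soft_mask(image.shape[:2], boxes)
    return image * (alpha * mask + (1 - alpha))[..., None]
```

The effect is to dim pixels outside the candidate boxes while leaving the box interiors near full intensity, so the downstream multi-modal model's attention is steered toward the candidate regions during recheck.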