CN-120910296-B - Image retrieval method, system, equipment and medium for enhancing fine-grained object retrieval performance


Abstract

The method processes an original image with a multi-scale image cropping technique to generate a target image and a set of image slices. These images are input into an image encoder to build an image vector library. In parallel, a multimodal large model performs semantic analysis on the image and its slices to generate text descriptions, from which a text vector library is built. In the retrieval stage, visual similarity matching is performed against the image vector library and text similarity matching against the text vector library; the results are merged into a candidate set, and the final retrieval result is obtained by weighted fusion and ranking. The method and device improve the accuracy of fine-grained object retrieval: combining visual and text information broadens retrieval coverage and improves the relevance of the results, so that a user can find the required information more quickly and accurately.

Inventors

LI HUI
FEI YICHAO
ZHU XINYU
WANG HAI
WEI HONGDAO
QI BAOJIN
CHEN SHOUFENG
WANG LEHAO

Assignees

China Tower Corporation Limited (中国铁塔股份有限公司)

Dates

Publication Date: 20260512
Application Date: 20250728

Claims (10)

1. An image retrieval method for enhancing retrieval performance of fine-grained objects, characterized in that the method comprises the following steps: processing an original image based on a multi-scale image cropping technique to obtain a target image and a target image slice set; inputting the target image and the target image slice set into an image encoder, and establishing an image vector library; performing semantic analysis on the target image and the target image slice set using a multimodal large model to generate corresponding text descriptions, vector-encoding the text descriptions, and establishing a text vector library; performing visual similarity matching between a query vector and the image vector library and semantic similarity matching between the query vector and the text vector library to obtain a visual candidate image set and a semantic candidate image set, respectively, and generating a visual similarity score and a semantic similarity score; and performing weighted fusion of the visual similarity score and the semantic similarity score to obtain a final similarity score, and ranking the visual candidate image set and the semantic candidate image set according to the final similarity score to obtain a final retrieval result.
2. The method of claim 1, wherein processing the original image based on the multi-scale image cropping technique to obtain the target image and the target image slice set specifically comprises: resizing the original image to a unified size to obtain the target image; identifying the target image using an image segmentation model to obtain each target object and the target position information corresponding to each target object; performing multi-scale cropping on the target image according to preset size parameters to obtain an initial image slice set; and determining object inclusion relations among the initial image slices based on the target objects and their target position information, and de-duplicating the initial image slice set according to the object inclusion relations to obtain the target image slice set.
3. The method of claim 2, wherein inputting the target image and the target image slice set into the image encoder and establishing the image vector library specifically comprises: inputting the target image and the target image slice set into the image encoder, and extracting corresponding visual features; and fusing the extracted visual features, and establishing the image vector library.
4. The method of claim 3, wherein performing visual similarity matching between the query vector and the image vector library to obtain the visual candidate image set and generating the visual similarity score specifically comprises: searching the image vector library using the query vector to obtain an initial candidate image set; and calculating the visual similarity score between each candidate image in the initial candidate image set and the query vector, and determining the visual candidate image set according to the visual similarity scores.
5. The method of claim 4, wherein performing semantic similarity matching between the query vector and the text vector library to obtain the semantic candidate image set and generating the semantic similarity score specifically comprises: searching the text vector library using the query vector to obtain a first-class initial candidate text description set; determining an actual similarity between the query vector and each text description in the text vector library using a text similarity algorithm, and determining a second-class initial candidate text description set according to the actual similarities; de-duplicating the first-class initial candidate text description set and the second-class initial candidate text description set to obtain an actual recall result; and calculating a semantic similarity score for the actual recall result using a text rerank model, and determining the semantic candidate image set according to the semantic similarity score.
6. The method of claim 5, wherein performing weighted fusion of the visual similarity score and the semantic similarity score to obtain the final similarity score, and ranking the visual candidate image set and the semantic candidate image set according to the final similarity score to obtain the final retrieval result, comprises: acquiring the image reciprocal rank of each final candidate image in the visual candidate image set and the semantic candidate image set, and its text reciprocal rank in the semantic candidate image set; and determining the final similarity score according to the image reciprocal rank and the text reciprocal rank.
7. An image retrieval system for enhancing retrieval performance of fine-grained objects, characterized in that the system comprises: a multi-scale image cropping module, configured to process an original image based on a multi-scale image cropping technique to obtain a target image and a target image slice set; an image feature extraction module, configured to input the target image and the target image slice set into an image encoder and establish an image vector library; an image semantic description generation module, configured to perform semantic analysis on the target image and the target image slice set using a multimodal large model to generate corresponding text descriptions, vector-encode the text descriptions, and establish a text vector library; a primary retrieval module, configured to perform visual similarity matching between a query vector and the image vector library and semantic similarity matching between the query vector and the text vector library to obtain a visual candidate image set and a semantic candidate image set, respectively, and to generate a visual similarity score and a semantic similarity score; and a secondary retrieval module, configured to perform weighted fusion of the visual similarity score and the semantic similarity score to obtain a final similarity score, and to rank the visual candidate image set and the semantic candidate image set according to the final similarity score to obtain a final retrieval result.
8. An electronic device, comprising at least one processor and at least one memory electrically connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image retrieval method for enhancing retrieval performance of fine-grained objects according to any one of claims 1-6.
9. A computer storage medium, characterized in that the computer storage medium has a computer program stored therein; and the computer program, when executed by a processor, implements the image retrieval method for enhancing retrieval performance of fine-grained objects as claimed in any one of claims 1-6.
10. A computer program product, characterized in that the computer program product is stored in at least one storage medium; and the computer program product comprises instructions for causing at least one electronic device to perform the image retrieval method for enhancing retrieval performance of fine-grained objects as claimed in any one of claims 1-6.
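
As an illustration of the rank-based fusion described in claims 1 and 6, the following Python sketch combines reciprocal ranks from the two candidate lists. It is a minimal sketch under stated assumptions, not the patented implementation: the function names, the `visual_weight` parameter and its default of 0.5, the reading that the image reciprocal rank comes from the visual list and the text reciprocal rank from the semantic list, and the convention that an image missing from a list contributes 0 are all hypothetical. The claims only state that the final similarity score is determined from the image reciprocal rank and the text reciprocal rank.

```python
from typing import Dict, List, Tuple


def reciprocal_ranks(ranked_ids: List[str]) -> Dict[str, float]:
    """Map each image id to 1 / rank (ranks start at 1).

    If an id appears more than once (for example, several slices of the
    same image), only its best (first) position is kept.
    """
    ranks: Dict[str, float] = {}
    for rank, image_id in enumerate(ranked_ids, start=1):
        ranks.setdefault(image_id, 1.0 / rank)
    return ranks


def fuse_by_reciprocal_rank(
    visual_ranked: List[str],
    semantic_ranked: List[str],
    visual_weight: float = 0.5,  # assumed weight; not specified in the claims
) -> List[Tuple[str, float]]:
    """Weighted fusion of image and text reciprocal ranks.

    Candidates are the union of both lists; an image absent from one list
    contributes 0 for that list (an assumption made for this sketch).
    """
    image_rr = reciprocal_ranks(visual_ranked)
    text_rr = reciprocal_ranks(semantic_ranked)
    candidates = set(image_rr) | set(text_rr)
    scores = {
        image_id: visual_weight * image_rr.get(image_id, 0.0)
        + (1.0 - visual_weight) * text_rr.get(image_id, 0.0)
        for image_id in candidates
    }
    # Highest score first; ties are broken by id so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    visual = ["img_a", "img_b", "img_c"]   # best visual match first
    semantic = ["img_b", "img_d"]          # best semantic match first
    for image_id, score in fuse_by_reciprocal_rank(visual, semantic):
        print(image_id, round(score, 4))
```

In this sketch an image that ranks well in both lists (here `img_b`) rises above an image that ranks first in only one list, which is the usual motivation for fusing by rank rather than by raw score when the two score scales differ.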

Description

Image retrieval method, system, equipment and medium for enhancing fine-grained object retrieval performance

Technical Field

The disclosure belongs to the technical field of image retrieval, and particularly relates to an image retrieval method, system, device and medium for enhancing retrieval performance of fine-grained objects.

Background

With the continuous progress of computer technology, the management and application of image data in search systems has become increasingly important. Introducing image information into question-answering systems and computer vision tasks significantly improves their accuracy and level of intelligence. However, some key challenges remain in current image retrieval practice. Conventional image retrieval methods rely mainly on embedding vectors generated by an image encoding model and match by searching images with an image, so they struggle to support searching images with text, that is, retrieving related images from a text description. In addition, because image sizes vary and a target object may occupy only a small part of an image, the object contributes only a small share of the information in the whole image, which degrades whole-image similarity matching and easily causes related images to be missed. Therefore, proposing an image retrieval method that enhances the retrieval performance of fine-grained objects, so as to improve the accuracy and semantic understanding capability of image retrieval, is a feasible and preferable approach.

Disclosure of Invention

In order to solve the above problems, the present disclosure provides an image retrieval method, system, device and medium for enhancing the retrieval performance of fine-grained objects, which achieves multi-level understanding of image content by introducing an object-based multi-scale slicing mechanism and generating image semantic descriptions with the PaliGemma model.

In a first aspect, the present disclosure provides an image retrieval method for enhancing retrieval performance of fine-grained objects, the method comprising: processing an original image based on a multi-scale image cropping technique to obtain a target image and a target image slice set; inputting the target image and the target image slice set into an image encoder, and establishing an image vector library; performing semantic analysis on the target image and the target image slice set using a multimodal large model to generate corresponding text descriptions, vector-encoding the text descriptions, and establishing a text vector library; performing visual similarity matching between a query vector and the image vector library and semantic similarity matching between the query vector and the text vector library to obtain a visual candidate image set and a semantic candidate image set, respectively, and generating a visual similarity score and a semantic similarity score; and performing weighted fusion of the visual similarity score and the semantic similarity score to obtain a final similarity score, and ranking the visual candidate image set and the semantic candidate image set according to the final similarity score to obtain a final retrieval result.
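
The first step of the method, multi-scale slicing followed by de-duplication based on which objects each slice contains (detailed in claim 2), can be sketched in Python on bounding boxes alone. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function names, the square sliding windows, the `stride_ratio` parameter, and the specific de-duplication rule (discard slices that fully contain no object, and among slices containing exactly the same set of objects keep the smallest) are hypothetical. The object boxes are assumed to come from an image segmentation model that is not shown here.

```python
from typing import Dict, FrozenSet, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def multi_scale_windows(
    width: int, height: int, sizes: List[int], stride_ratio: float = 0.5
) -> List[Box]:
    """Slide square windows of each preset size over the resized image."""
    windows: List[Box] = []
    for size in sizes:
        if size > width or size > height:
            continue  # this scale does not fit inside the image
        stride = max(1, int(size * stride_ratio))
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                windows.append((left, top, left + size, top + size))
    return windows


def contained_objects(window: Box, objects: List[Box]) -> FrozenSet[int]:
    """Indices of the object boxes that lie fully inside the window."""
    left, top, right, bottom = window
    return frozenset(
        index
        for index, (o_left, o_top, o_right, o_bottom) in enumerate(objects)
        if o_left >= left and o_top >= top
        and o_right <= right and o_bottom <= bottom
    )


def area(box: Box) -> int:
    return (box[2] - box[0]) * (box[3] - box[1])


def deduplicate_slices(windows: List[Box], objects: List[Box]) -> List[Box]:
    """Keep one slice per distinct set of contained objects.

    Assumed rule: empty slices are dropped, and the smallest slice wins
    when several slices contain exactly the same objects.
    """
    best: Dict[FrozenSet[int], Box] = {}
    for window in windows:
        key = contained_objects(window, objects)
        if not key:
            continue
        kept = best.get(key)
        if kept is None or area(window) < area(kept):
            best[key] = window
    return list(best.values())


if __name__ == "__main__":
    # Hypothetical object boxes, as a segmentation model might return them.
    objects = [(0, 0, 2, 2), (5, 5, 7, 7)]
    windows = multi_scale_windows(8, 8, sizes=[4, 8])
    print(deduplicate_slices(windows, objects))
```

Keeping the smallest slice per object set is one way to make each object occupy a larger share of its slice, which is the stated motivation for slicing; other rules (for example, keeping the slice with the most context) would fit the same description.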

Further, processing the original image based on the multi-scale image cropping technique to obtain the target image and the target image slice set specifically comprises: resizing the original image to a unified size to obtain the target image; identifying the target image using an image segmentation model to obtain each target object and the target position information corresponding to each target object; performing multi-scale cropping on the target image according to preset size parameters to obtain an initial image slice set; and determining object inclusion relations among the initial image slices based on the target objects and their target position information, and de-duplicating the initial image slice set according to the object inclusion relations to obtain the target image slice set.

Further, inputting the target image and the target image slice set into the image encoder and establishing the image vector library specifically comprises: inputting the target image and the target image slice set into the image encoder, and extracting corresponding visual features; and fusing the extracted visual features, and establishing the image vector library.

Further, performing visual similarity matching between the query vector and the image vector library to obtain a visual candidate image set, and generating a visual sim