
CN-122019824-A - Image-text cross-modal retrieval method and system based on high-dimensional sphere embedding

CN 122019824 A

Abstract

The invention discloses an image-text cross-modal retrieval method and system based on high-dimensional sphere embedding. The method comprises: based on a target image and a target text, respectively performing feature extraction on the target image and the target text through a backbone network and a word embedding method to obtain visual features and text features; based on a sphere encoder, performing sphere-embedding center calculation and semantic uncertainty modeling to obtain a visual sphere center vector, a visual uncertainty radius, a text sphere center vector and a text uncertainty radius; and performing similarity learning on the sphere center vectors and radii by a sphere-embedded Monte Carlo sampling method to realize image-text cross-modal retrieval. According to the invention, cross-modal alignment is enhanced through the semantic uncertainty and diversity between visual and textual content, so that the precision of image-text cross-modal retrieval is improved. The method and system can be widely applied in the technical field of image-text retrieval.

Inventors

  • YUAN YONGZE
  • LIANG GUANCHAO
  • QIN XUEYANG
  • PENG JUNWEI
  • HAN ZHONGYUAN
  • LI JIASHAN
  • LI JIAOLING
  • NONG YONGSHUN
  • YANG KAIKANG
  • GUO YULONG
  • HUANG JIANFEI
  • WANG SHUNLI
  • LI HAOCHENG

Assignees

  • Guangdong Shunce Engineering Management Co., Ltd. (广东顺策工程管理股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-02-05

Claims (10)

  1. An image-text cross-modal retrieval method based on high-dimensional sphere embedding, characterized by comprising the following steps: based on a target image and a target text, respectively performing feature extraction on the target image and the target text through a backbone network and a word embedding method to obtain visual features and text features; based on a sphere encoder, respectively performing sphere-embedding center calculation and semantic uncertainty modeling on the visual features and the text features to obtain a visual sphere center vector, a visual uncertainty radius, a text sphere center vector and a text uncertainty radius; and performing similarity learning on the visual sphere center vector, the visual uncertainty radius, the text sphere center vector and the text uncertainty radius by a sphere-embedded Monte Carlo sampling method, so as to realize cross-modal retrieval of images and texts.
  2. The image-text cross-modal retrieval method based on high-dimensional sphere embedding according to claim 1, wherein the step of respectively performing sphere-embedding center calculation and semantic uncertainty modeling on the visual features and the text features based on the sphere encoder to obtain a visual sphere center vector, a visual uncertainty radius, a text sphere center vector and a text uncertainty radius specifically comprises: based on a visual sphere encoder, performing sphere-embedding center calculation and semantic uncertainty modeling on the visual features through an attention-guided entropy mechanism to obtain the visual sphere center vector and the visual uncertainty radius; and based on a text sphere encoder, performing sphere-embedding center calculation and semantic uncertainty modeling on the text features through an attention-guided entropy mechanism to obtain the text sphere center vector and the text uncertainty radius.
  3. The image-text cross-modal retrieval method based on high-dimensional sphere embedding according to claim 2, wherein the step of performing sphere-embedding center calculation and semantic uncertainty modeling on the visual features through the attention-guided entropy mechanism based on the visual sphere encoder to obtain the visual sphere center vector and the visual uncertainty radius comprises the following steps: sequentially performing fully connected transformation and global average pooling on the visual features to extract their global semantic information and construct a first component; performing dimension transformation on the deep features among the visual features to obtain dimension-transformed deep features; after applying hyperbolic tangent activation to the dimension-transformed deep features, generating an attention vector through a fully connected layer and normalizing it with a Softmax function to obtain the attention weights of the visual category features; multiplying the dimension-transformed deep features by the attention weights and activating the result with a Sigmoid function to generate a second component; combining the first component and the second component to construct the visual sphere center vector; and measuring semantic uncertainty according to the information entropy of the distribution of the attention weights to obtain the visual uncertainty radius (a minimal sketch of this encoder follows the claims).
  4. The image-text cross-modal retrieval method based on high-dimensional sphere embedding according to claim 3, wherein the center calculation for sphere embedding of the visual features by the attention-guided entropy mechanism is specifically expressed as: c_1 = GAP(FC(F_v)); a = Softmax(FC_a(tanh(T(F_v)))); c_2 = Sigmoid(T(F_v) ⊙ a); μ_v = [c_1; c_2]; where c_1 denotes the first component, c_2 denotes the second component, FC(·) and FC_a(·) denote fully connected transforms, GAP(·) denotes global average pooling, F_v denotes the visual features, a denotes the attention vector, Softmax(·) denotes the Softmax function, tanh(·) denotes the hyperbolic tangent activation function, T(·) denotes the dimension-transform function, ⊙ denotes element-wise multiplication, and μ_v denotes the visual sphere center vector.
  5. The image-text cross-modal retrieval method based on high-dimensional sphere embedding according to claim 4, wherein the semantic uncertainty measurement according to the information entropy of the distribution of the attention weights is specifically expressed as: r_v = γ_v · (−Σ_i a_i log a_i); where r_v denotes the visual uncertainty radius, γ_v denotes a hyper-parameter adjusting the intensity of the visual semantic uncertainty, and a_i denotes the attention weight in the attention vector corresponding to the visual features.
  6. The image-text cross-modal retrieval method based on high-dimensional sphere embedding according to claim 5, wherein the step of performing sphere-embedding center calculation and semantic uncertainty modeling on the text features through the attention-guided entropy mechanism based on the text sphere encoder to obtain the text sphere center vector and the text uncertainty radius comprises the following steps: performing context modeling on the text features through a bidirectional GRU network to capture bidirectional semantic information; extracting attention features of the text features through two fully connected layers and a tanh activation function to generate word-level attention distribution weights; multiplying the text features by the word-level attention distribution weights and generating weighted features through a Sigmoid activation function; combining the bidirectional semantic information with the weighted features to construct the text sphere center vector; and measuring semantic uncertainty according to the information entropy of the word-level attention distribution weights to obtain the text uncertainty radius.
  7. The image-text cross-modal retrieval method based on high-dimensional sphere embedding according to claim 6, wherein the step of performing similarity learning on the visual sphere center vector, the visual uncertainty radius, the text sphere center vector and the text uncertainty radius by the sphere-embedded Monte Carlo sampling method comprises the following steps: determining sphere center coordinates according to the visual sphere center vector and the text sphere center vector, determining radii according to the visual uncertainty radius and the text uncertainty radius, setting the batch size, the embedding dimension and the number of samples, initializing an empty list, and setting a loop counter to 1; judging whether the loop counter is smaller than or equal to the batch size, and if so, extracting the center coordinates of the i-th sample from the sphere center coordinates and extracting the radius value corresponding to the i-th sample from the radii; sampling N d-dimensional direction vectors from a standard normal distribution to construct a direction matrix, and calculating the L2 norm of each row vector; dividing each row vector of the direction matrix by its L2 norm to obtain a normalized direction matrix; sampling N scalar values from a uniform distribution to obtain a uniform sampling vector, and multiplying each element of the uniform sampling vector by the corresponding radius value to obtain a scaled radius vector; multiplying the scaled radius vector element-wise with the normalized direction matrix to obtain offset vectors relative to the sphere center; adding the offset vectors and the sphere center coordinates element-wise to obtain the final sampling point set for the sample, adding the sampling point set to the list, and incrementing the loop counter by 1 until the loop counter is larger than the batch size; and stacking all sampling points in the list along the batch dimension to construct the final sampling point tensor, thereby realizing cross-modal retrieval of images and texts (a sketch of this sampling procedure follows the claims).
  8. The image-text cross-modal retrieval method based on high-dimensional sphere embedding according to claim 7, wherein the loss function of the sphere-embedded Monte Carlo sampling method comprises a matching loss and a maximum penetration depth loss (a sketch of both losses follows the claims), the expression of the matching loss being specifically: S(I, T) = max_{i,j} σ(−τ · ‖u_i − v_j‖_2 + b); L_match = −(1/B) Σ log p(T | I); where L_match denotes the matching loss function, S(I, T) denotes the maximum sigmoid similarity over all sampled point pairs, u_i and v_j respectively denote sampled points of the image sphere I and the text sphere T, σ(·) denotes the sigmoid function, τ and b denote learnable parameters, ‖·‖_2 denotes the L2 norm, N denotes the number of sampled points per sphere, B denotes the number of matched image-text feature pairs, I and T denote the image and text features in use, and p(T | I) denotes the conditional probability.
  9. The image-text cross-modal retrieval method based on high-dimensional sphere embedding according to claim 8, wherein the expression of the maximum penetration depth loss is specifically: d(A → B) = max_{1≤i≤N} (r_B − ‖x_i^A − μ_B‖_2); L_pd = max(0, λ − d(A → B)); where L_pd denotes the maximum penetration depth loss function, x_i^A denotes the i-th of the N sampling features drawn from the sphere of modality A, μ_B and r_B respectively denote the center and radius of the example sphere corresponding to modality B, d(A → B) denotes the measured penetration depth from modality A to modality B, λ denotes a hyper-parameter, and N denotes the number of sampled points per sphere.
  10. An image-text cross-modal retrieval system based on high-dimensional sphere embedding, characterized by comprising the following modules: a first module, configured to respectively perform feature extraction on a target image and a target text through a backbone network and a word embedding method to obtain visual features and text features; a second module, configured to respectively perform sphere-embedding center calculation and semantic uncertainty modeling on the visual features and the text features based on a sphere encoder to obtain a visual sphere center vector, a visual uncertainty radius, a text sphere center vector and a text uncertainty radius; and a third module, configured to perform similarity learning on the visual sphere center vector, the visual uncertainty radius, the text sphere center vector and the text uncertainty radius by a sphere-embedded Monte Carlo sampling method, so as to realize cross-modal retrieval of images and texts.
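
What follows is a minimal PyTorch sketch of the visual sphere encoder of claims 3-5; per claim 6, a text-side encoder would add a bidirectional GRU for context modeling. The layer sizes, the summation used to combine the two center components, and all names (VisualSphereEncoder, gamma, and so on) are illustrative assumptions, not values fixed by the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSphereEncoder(nn.Module):
    """Maps region-level visual features to a sphere (center, radius).

    Follows claims 3-5: a global component from a fully connected
    transform plus average pooling, an attention-weighted component
    gated by Sigmoid, and an uncertainty radius from the entropy of
    the attention weights.
    """

    def __init__(self, in_dim: int, embed_dim: int, gamma: float = 1.0):
        super().__init__()
        self.fc_global = nn.Linear(in_dim, embed_dim)     # fully connected transform FC
        self.fc_transform = nn.Linear(in_dim, embed_dim)  # dimension transform T of deep features
        self.fc_attn = nn.Linear(embed_dim, 1)            # attention scoring layer FC_a
        self.gamma = gamma                                # uncertainty intensity hyper-parameter

    def forward(self, feats: torch.Tensor):
        # feats: (batch, num_regions, in_dim)
        # First component: FC transform, then global average pooling over regions.
        c1 = self.fc_global(feats).mean(dim=1)                    # (batch, embed_dim)

        # Dimension-transform the deep features, score them with tanh + FC,
        # then normalize the scores with Softmax to get attention weights.
        h = self.fc_transform(feats)                              # (batch, regions, embed_dim)
        scores = self.fc_attn(torch.tanh(h)).squeeze(-1)          # (batch, regions)
        attn = F.softmax(scores, dim=-1)                          # attention weights a

        # Second component: attention-weighted features gated by Sigmoid.
        c2 = torch.sigmoid((h * attn.unsqueeze(-1)).sum(dim=1))   # (batch, embed_dim)

        # Sphere center: combine the two components (a sum here;
        # concatenation would also satisfy the claim's "combining").
        center = c1 + c2

        # Uncertainty radius: entropy of the attention distribution, scaled by gamma.
        entropy = -(attn * (attn + 1e-12).log()).sum(dim=-1)      # (batch,)
        radius = self.gamma * entropy
        return center, radius

Calling encoder = VisualSphereEncoder(in_dim=2048, embed_dim=512) and then encoder(torch.randn(8, 36, 2048)) would yield one (center, radius) sphere per image in the batch.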
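
Claim 7 spells out its Monte Carlo sampling loop precisely enough for a direct sketch. The function name and tensor shapes below are hypothetical; the procedure itself (Gaussian directions normalized to the unit sphere, uniform radial scaling, per-sample stacking) follows the claim step by step.

import torch

def sample_in_spheres(centers: torch.Tensor, radii: torch.Tensor, n_samples: int):
    """Draw points inside each embedding sphere, per claim 7.

    centers: (batch, d) sphere center coordinates; radii: (batch,)
    uncertainty radii. Returns a (batch, n_samples, d) tensor.
    """
    batch, d = centers.shape
    points = []                                       # the initialized empty list
    for i in range(batch):                            # loop counter of claim 7
        dirs = torch.randn(n_samples, d)              # N d-dimensional Gaussian directions
        dirs = dirs / dirs.norm(dim=1, keepdim=True)  # divide each row by its L2 norm
        u = torch.rand(n_samples)                     # N scalars from the uniform distribution
        # Note: scaling by u * r concentrates points near the center;
        # u ** (1.0 / d) would give a volume-uniform sample. The claim
        # states u * r, so that is what is implemented here.
        scaled_r = u * radii[i]                       # scaled radius vector
        offsets = scaled_r.unsqueeze(1) * dirs        # offsets relative to the sphere center
        points.append(centers[i] + offsets)           # final sampling points of sample i
    return torch.stack(points, dim=0)                 # stack along the batch dimension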
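
Claims 8 and 9 survive here only with their formula images dropped, so the following sketch reconstructs one plausible reading: a sigmoid similarity taken as the maximum over sampled point pairs with learnable scale and bias, a batch-softmax conditional probability for the matching loss, and a hinge on the maximum penetration depth. The exact formulas, the direction of the margin, and all names (matching_loss, penetration_depth_loss, tau, b, lam) are assumptions, not the patent's definitive losses.

import torch
import torch.nn.functional as F

def matching_loss(img_pts, txt_pts, tau, b):
    """Assumed reading of the matching loss of claim 8.

    img_pts, txt_pts: (batch, n, d) points sampled from the image and
    text spheres; tau, b: learnable scale and bias. The pair similarity
    is the maximum sigmoid similarity over sampled point pairs, and a
    softmax over the batch gives the conditional probability p(T | I).
    """
    batch = img_pts.shape[0]
    sims = torch.empty(batch, batch)
    for i in range(batch):
        for j in range(batch):
            # L2 distances between all point pairs of sphere i and sphere j.
            dists = torch.cdist(img_pts[i], txt_pts[j])         # (n, n)
            sims[i, j] = torch.sigmoid(-tau * dists + b).max()  # max sigmoid similarity
    log_p = F.log_softmax(sims, dim=1)     # log p(T | I) over the batch
    return -log_p.diagonal().mean()        # negative log-likelihood of matched pairs

def penetration_depth_loss(pts_a, center_b, radius_b, lam):
    """Assumed reading of the maximum penetration depth loss of claim 9.

    pts_a: (n, d) points sampled from the modality-A sphere; center_b,
    radius_b: the modality-B example sphere; lam: margin hyper-parameter.
    Penetration depth measures how far A's samples reach inside B.
    """
    depth = (radius_b - (pts_a - center_b).norm(dim=1)).max()  # max penetration depth d(A -> B)
    return F.relu(lam - depth)  # penalize matched spheres that barely overlap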

Description

Image-text cross-modal retrieval method and system based on high-dimensional sphere embedding

Technical Field

The invention relates to the technical field of image-text retrieval, in particular to an image-text cross-modal retrieval method and system based on high-dimensional sphere embedding.

Background

In the field of intelligent supervision, image-text cross-modal retrieval is a core task. It aims to measure the semantic relevance between images and text descriptions and is widely applied in supervision links such as retrieval and comparison of construction-site images against building-engineering field regulations and industry standards, and retrieval of historical supervision pictures based on text descriptions. Beyond image-text retrieval for intelligent supervision, the task also supports application scenarios such as visual-language navigation and visual question answering. However, bridging the heterogeneous gap between the visual and text modalities is inherently challenging: images often contain rich but ambiguous signals (e.g., clutter, object occlusion, diverse object configurations), while text descriptions tend to be compact, discrete, and semantically selective. Such modal differences are particularly prominent in one-to-many matching scenarios, where a single image may correspond to multiple valid text interpretations that differ in specificity and level of abstraction, and vice versa.

Despite significant advances in the field of image-text retrieval, a key challenge remains: how to construct a unified embedding space that efficiently reconciles the semantic gap between low-level, entangled visual content and high-level, sparse text semantics, especially in one-to-many correspondence settings. This difficulty stems from the inherent asymmetry of multi-modal data: the visual signal is continuous, spatially dense and view-dependent, while language is symbolic, compositional and discrete in nature. In addition, visual data encapsulates detail-rich, context-dependent information (e.g., texture, shape, and spatial relationships) that does not necessarily correspond directly to the more abstract, more stereotyped nature of textual representations. Therefore, without fine-grained modeling of these differences, the embedding space may fail to properly align the rich and diverse information in images with the often sparse and abstract content in text, resulting in incomplete or inaccurate semantic mappings.

In practice, one-to-many correspondence reflects the semantic uncertainty between images and text, which arises from inherent differences in human understanding and interpretation. Different individuals may perceive the same image in multiple ways, just as the same piece of text may convey multiple meanings in different contexts. This variability is further exacerbated by the subjectivity of language and the contextual dependence of visual content. Modeling such semantic uncertainty is therefore critical, because it enables the system to account for the multiple possibilities of image-to-text alignment. More importantly, explicitly integrating uncertainty into representation learning alleviates the limitations of deterministic embeddings and allows a retrieval model to better capture fine-grained semantics. The related art has attempted probabilistic and geometry-based embedding methods.
For example, Gaussian probability embedding models modal ambiguity by learning means and variances, but it remains intrinsically dependent on the estimation and alignment of cross-modal directional cues. Such alignment is very sensitive to rotations, object layout and viewing-angle variations, and its performance is therefore limited under real-world visual variation. Furthermore, geometric representations such as rectangular boxes and sector areas attempt to provide modality-oriented uncertainty modeling. However, these methods still rely fundamentally on matching directional information between heterogeneous modalities. When the visual input exhibits rotation changes, occlusion, or incomplete matches, this dependence easily leads to misalignments in the embedding space, thereby degrading retrieval performance.

Disclosure of Invention

In order to solve the above technical problems, the invention aims to provide an image-text cross-modal retrieval method and system based on high-dimensional sphere embedding, which enhance cross-modal alignment through the semantic uncertainty and diversity between visual and textual content, so that the precision of image-text cross-modal retrieval is improved. The first technical scheme adopted by the invention is an image-text cross-modal retrieval method based on high-dimensional sphere embedding, comprising the following steps: based on a target image and a target text, respectively performing feature extraction on the target image and the target text through a backbone network and a word embedding method to obtain visual features and text features;