
CN-116704300-B - Cross-modal feature fusion model training method, fusion method, device and equipment


Abstract

The disclosure relates to a cross-modal feature fusion model training method, a fusion method, a device and equipment. A cross-modal feature fusion model to be trained is trained to obtain the cross-modal feature fusion model, so that when image-text pairs are aligned and fused based on the cross-modal feature fusion model, both the accuracy and the inference speed of image-text alignment and fusion are improved; in other words, both the accuracy and the inference speed of vision-and-language tasks are improved.

Inventors

  • CHEN SU
  • ZHAO NING
  • ZHAO YUEKAI
  • WANG DONGAN
  • LV QING
  • ZHOU YI
  • SHI GUANG
  • WU ZHIMIN

Assignees

  • National Computer Network and Information Security Management Center (国家计算机网络与信息安全管理中心)

Dates

Publication Date
2026-05-12
Application Date
2023-05-16

Claims (11)

  1. A cross-modal feature fusion model training method, the method comprising: acquiring a plurality of image-text pairs in a target data set, wherein each image-text pair comprises an image and a text; for each image-text pair, encoding a first vector of the image and a second vector of the text through a cross-modal feature fusion model to be trained, masking the image and the text according to a masking strategy, and encoding a third vector of the masked image and a fourth vector of the masked text; encoding a fifth vector from the first vector and the fourth vector, and encoding a sixth vector from the third vector and the second vector; calculating a loss value according to the first vector, the second vector, the third vector, the fourth vector, the fifth vector, the sixth vector and a preset loss function, wherein the preset loss function is obtained by adding an image fusion loss function, a text fusion loss function, a similarity loss function and a reconstruction loss function; and training the cross-modal feature fusion model to be trained according to the loss value; wherein the image fusion loss function is the sum of an image-to-text loss function and a text-to-image loss function, each computed with a temperature scalar and a dot-product similarity function that measures the similarity between different modalities, taken at the segmentation mark of the fifth vector, i.e. the fused encoding of the image and the masked text; the text fusion loss function is the sum of a text-to-image loss function and an image-to-text loss function, each likewise computed with a temperature scalar and a dot-product similarity function, taken at the segmentation mark of the sixth vector, i.e. the fused encoding of the masked image and the text; the similarity loss function is computed with a temperature scalar, a fixed value and a sim function that measures the similarity of two vectors; and the reconstruction loss function is computed from the predicted probability distribution at the masked positions of the C-th sentence in the target data set, weighted by a hyperparameter (see the loss sketch following the claims).
  2. The method of claim 1, wherein acquiring a plurality of image-text pairs in the target data set comprises: acquiring an original data set; dividing the original data set to obtain a plurality of divided data sets; determining the target data set based on the plurality of divided data sets; and acquiring a plurality of image-text pairs in the target data set.
  3. The method of claim 1, wherein encoding the first vector of the image and the second vector of the text through the cross-modal feature fusion model to be trained comprises: encoding the image and the text respectively by a single-modal encoder in the cross-modal feature fusion model to be trained, to obtain the first vector of the image and the second vector of the text.
  4. The method of claim 1, wherein masking the image and the text according to the masking strategy and encoding the third vector of the masked image and the fourth vector of the masked text comprises: dividing the image to obtain a plurality of image blocks of identical size; masking a first preset number of image blocks in the image according to the masking strategy to obtain a masked image; encoding the masked image to obtain the third vector; masking a second preset number of words in the text to obtain a masked text, wherein the masking strategy comprises that the ratio of the second preset number to the number of words of the text is smaller than a first preset value; and encoding the masked text to obtain the fourth vector.
  5. The method of claim 1, wherein the masking strategy comprises at least a first stage, a second stage and a third stage; in the first stage, any single image block in the image is masked and the cross-modal feature fusion model to be trained is trained, wherein the area ratio of that image block to the image is smaller than or equal to a second preset value; in the second stage, a third preset number of image blocks in the image are masked and the model is trained, wherein the area ratio of the third preset number of image blocks to the image is larger than the second preset value and smaller than a third preset value; and in the third stage, a fourth preset number of image blocks in the image are masked and the model is trained, wherein the area ratio of the fourth preset number of image blocks to the image is equal to the third preset value (see the masking schedule sketch following the claims).
  6. The method of claim 1, wherein encoding the fifth vector from the first vector and the fourth vector comprises: encoding the first vector by a first multi-modal encoder in the cross-modal feature fusion model to be trained, and encoding the fourth vector by a second multi-modal encoder in the cross-modal feature fusion model to be trained, to obtain the fifth vector; and, correspondingly, encoding the sixth vector from the third vector and the second vector comprises: encoding the third vector by the first multi-modal encoder in the cross-modal feature fusion model to be trained, and encoding the second vector by the second multi-modal encoder in the cross-modal feature fusion model to be trained, to obtain the sixth vector.
  7. The method of claim 6, wherein the first multi-modal encoder comprises a first self-attention module, a cross-attention module and a first fully connected layer module; the second multi-modal encoder comprises a second self-attention module and a second fully connected layer module; and the parameter types and parameter counts of the first multi-modal encoder and the second multi-modal encoder are the same (see the encoder sketch following the claims).
  8. A fusion method, the method comprising: acquiring a plurality of image-text pairs, wherein each image-text pair comprises an image and a text; inputting the plurality of image-text pairs into a cross-modal feature fusion model, so that the cross-modal feature fusion model outputs a first vector of the image and a second vector of the text, wherein the cross-modal feature fusion model is obtained by training according to the training method of any one of claims 1-7; and determining that the image and the text are fused when the cosine similarity of the first vector and the second vector is larger than a fourth preset value.
  9. A cross-modal feature fusion model training apparatus, the apparatus comprising: a first acquisition module for acquiring a plurality of image-text pairs in a target data set, each image-text pair comprising an image and a text; a first encoding module for, for each image-text pair, encoding a first vector of the image and a second vector of the text through a cross-modal feature fusion model to be trained, masking the image and the text according to a masking strategy, and encoding a third vector of the masked image and a fourth vector of the masked text; a second encoding module for encoding a fifth vector according to the first vector and the fourth vector, and encoding a sixth vector according to the third vector and the second vector; a calculation module for calculating a loss value according to the first vector, the second vector, the third vector, the fourth vector, the fifth vector, the sixth vector and a preset loss function, wherein the preset loss function is obtained by adding an image fusion loss function, a text fusion loss function, a similarity loss function and a reconstruction loss function; and a training module for training the cross-modal feature fusion model to be trained according to the loss value; wherein the image fusion loss function, the text fusion loss function, the similarity loss function and the reconstruction loss function are defined as in claim 1.
  10. A fusion device, the device comprising: a second acquisition module for acquiring a plurality of image-text pairs, wherein each image-text pair comprises an image and a text; an output module configured to input the plurality of image-text pairs into a cross-modal feature fusion model, so that the cross-modal feature fusion model outputs a first vector of the image and a second vector of the text, where the cross-modal feature fusion model is obtained by training according to the training method of any one of claims 1-7; and a determining module for determining that the image and the text are fused when the cosine similarity of the first vector and the second vector is larger than a fourth preset value.
  11. An electronic device, comprising: a memory; a processor; and a computer program, wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-8.
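The claim text above preserves only the variable definitions of the four loss terms (temperature scalar, dot-product similarity function, segmentation marks, fixed value, hyperparameter), not the formulas themselves. The sketch below is therefore a minimal reading of the composition under assumed standard forms: InfoNCE-style contrastive losses for the image and text fusion terms, a cosine term for the similarity loss, and masked-token cross-entropy for the reconstruction loss. Every name here (info_nce, combined_loss, tau, beta) is illustrative, not from the patent.

```python
# Hedged sketch of the four-part training loss of claims 1 and 9.
# Standard forms are assumed; these are not the patent's exact formulas.
import torch
import torch.nn.functional as F


def info_nce(anchor: torch.Tensor, candidates: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Contrastive loss with dot-product similarity and temperature scalar tau."""
    logits = anchor @ candidates.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)                     # matched pairs lie on the diagonal


def combined_loss(v1, v2, v5, v6, mlm_logits, mlm_labels, tau=0.07, beta=1.0):
    # Image fusion loss: image-to-(masked-)text plus its reverse, taken at the
    # segmentation token of the fifth vector (image fused with masked text).
    l_img = info_nce(v1, v5, tau) + info_nce(v5, v1, tau)
    # Text fusion loss: (masked-)image-to-text plus its reverse on the sixth vector.
    l_txt = info_nce(v2, v6, tau) + info_nce(v6, v2, tau)
    # Similarity loss: pull the two fused representations together (cosine form assumed).
    l_sim = 1.0 - F.cosine_similarity(v5, v6, dim=-1).mean()
    # Reconstruction loss: cross-entropy of the predicted distribution at masked
    # token positions, weighted by a hyperparameter beta.
    l_rec = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)
    return l_img + l_txt + l_sim + beta * l_rec
```

Here v1 and v2 would be the single-modal image and text embeddings and v5, v6 the segmentation-token outputs of the two fused encodings, each of shape (batch, dim); the granted formulas also involve the third and fourth vectors, and their exact placement is an open assumption here.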
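Claims 4 and 5 stage the image masking from a single block up to a fixed fraction of the image area. A minimal sketch of such a schedule follows, with the caveat that the concrete ratios (0.4, 0.6) are invented placeholders for the patent's second and third preset values.

```python
# Illustrative three-stage masking schedule per claims 4-5; the ratios are
# placeholder assumptions, not values from the patent.
import random


def blocks_to_mask(num_blocks: int, stage: int) -> list[int]:
    """Pick which of the equal-sized image blocks to mask in a given training stage."""
    if stage == 1:
        # Stage 1: a single arbitrary block (area ratio <= second preset value).
        return [random.randrange(num_blocks)]
    if stage == 2:
        # Stage 2: between the second and third preset values (0.4 assumed).
        k = max(1, int(0.4 * num_blocks))
    else:
        # Stage 3: exactly the third preset value (0.6 assumed).
        k = max(1, int(0.6 * num_blocks))
    return random.sample(range(num_blocks), k)
```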
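Claims 6 and 7 pair a first multi-modal encoder (self-attention, cross-attention, fully connected layer) with a second (self-attention, fully connected layer). The following is a structural sketch in PyTorch with placeholder dimensions; it does not enforce claim 7's condition that the two encoders share parameter types and counts.

```python
# Hedged structural sketch of the two multi-modal encoders of claims 6-7.
import torch
import torch.nn as nn


class FirstMultiModalEncoder(nn.Module):
    """Self-attention, cross-attention to the other modality, fully connected block."""
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn(x, context, context)[0]  # attend to the other modality
        return x + self.fc(x)


class SecondMultiModalEncoder(nn.Module):
    """Self-attention and fully connected block only."""
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.self_attn(x, x, x)[0]
        return x + self.fc(x)
```

In the claim-6 arrangement, the first encoder would consume the first (or third) vector while cross-attending to the second encoder's stream over the fourth (or second) vector, yielding the fifth (or sixth) vector.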

Description

Cross-modal feature fusion model training method, fusion method, device and equipment

Technical Field

The disclosure relates to the technical field of computers, and in particular to a cross-modal feature fusion model training method, a fusion method, a cross-modal feature fusion device and cross-modal feature fusion equipment.

Background

Vision-and-language (VL) tasks test a system's ability to understand and reason about visual-world semantics with the help of natural language. Vision is part of how humans perceive and language is part of how humans communicate, so using the information of the two modalities, vision and language, reasonably and efficiently is an important factor in model optimization. In the prior art, dual encoders and fusion encoders are generally used to process vision-and-language tasks. However, the shallow interaction between images and texts in a dual encoder leads to low accuracy on vision-and-language tasks, while a fusion encoder needs to jointly encode all image-text pairs, so its inference on vision-and-language tasks is slow.

Disclosure of Invention

To solve, or at least partially solve, the above technical problems, the disclosure provides a cross-modal feature fusion model training method, a fusion method, a device and equipment, so as to improve both the accuracy of vision-and-language tasks and the speed of their inference.

In a first aspect, an embodiment of the present disclosure provides a cross-modal feature fusion model training method, including: acquiring a plurality of image-text pairs in a target data set, wherein each image-text pair comprises an image and a text; for each image-text pair, encoding a first vector of the image and a second vector of the text through a cross-modal feature fusion model to be trained, masking the image and the text according to a masking strategy, and encoding a third vector of the masked image and a fourth vector of the masked text; encoding a fifth vector from the first vector and the fourth vector, and encoding a sixth vector from the third vector and the second vector; calculating a loss value according to the first vector, the second vector, the third vector, the fourth vector, the fifth vector, the sixth vector and a preset loss function; and training the cross-modal feature fusion model to be trained according to the loss value.

In a second aspect, embodiments of the present disclosure provide a fusion method, including: acquiring a plurality of image-text pairs, wherein each image-text pair comprises an image and a text; inputting the plurality of image-text pairs into a cross-modal feature fusion model, so that the cross-modal feature fusion model outputs a first vector of the image and a second vector of the text, wherein the cross-modal feature fusion model is obtained by training with the training method of the first aspect; and determining that the image and the text are fused when the cosine similarity of the first vector and the second vector is larger than a fourth preset value.
In a third aspect, an embodiment of the present disclosure provides a cross-modal feature fusion model training apparatus, including: a first acquisition module for acquiring a plurality of image-text pairs in a target data set, each image-text pair comprising an image and a text; a first encoding module for, for each image-text pair, encoding a first vector of the image and a second vector of the text through a cross-modal feature fusion model to be trained, masking the image and the text according to a masking strategy, and encoding a third vector of the masked image and a fourth vector of the masked text; a second encoding module for encoding a fifth vector according to the first vector and the fourth vector, and encoding a sixth vector according to the third vector and the second vector; a calculation module for calculating a loss value according to the first vector, the second vector, the third vector, the fourth vector, the fifth vector, the sixth vector and a preset loss function; and a training module for training the cross-modal feature fusion model to be trained according to the loss value.

In a fourth aspect, embodiments of the present disclosure provide a fusion device, including: a second acquisition module for acquiring a plurality of image-text pairs, wherein each image-text pair comprises an image and a text; an output module for inputting the plurality of image-text pairs into a cross-modal feature fusion model, so that the cross-modal feature fusion model outputs a first vector of the image and a second vector of the text, the cross-modal feature fusion model being obtained by training with the training method of the first aspect; and a determining module for determining that the image and the text are fused when the cosine similarity of the first vector and the second vector is larger than a fourth preset value.
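To make the second-aspect fusion decision concrete, here is a minimal sketch, assuming the trained model exposes encode_image and encode_text methods (invented names) and using 0.5 as a stand-in for the fourth preset value.

```python
# Hedged sketch of the fusion decision: cosine similarity of the two output
# vectors against a preset threshold. Method names and the threshold are assumptions.
import torch
import torch.nn.functional as F


@torch.no_grad()
def is_fused(model, image: torch.Tensor, text_ids: torch.Tensor,
             threshold: float = 0.5) -> bool:
    v1 = model.encode_image(image)      # first vector of the image
    v2 = model.encode_text(text_ids)    # second vector of the text
    sim = F.cosine_similarity(v1, v2, dim=-1)
    return bool((sim > threshold).all())
```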