CN-121502676-B - Semantic enhancement collaborative alignment fusion expression method based on multi-modal data
Abstract
The invention discloses a semantic enhancement collaborative alignment fusion expression method based on multi-modal data. The method extracts and compresses the text semantic embedding, the image semantic embedding, and the collaborative embedding representation of an item's multi-modal data; generates a joint latent representation through a linear projection layer; obtains a modal-shared code sequence through modal-shared residual quantization; performs modal separation by inputting the last shared layer of the modal-shared residual quantization into a modal-specific quantizer; obtains a text-specific code sequence and an image-specific code sequence through modality-specific residual quantization; constructs a quantized text embedding by combining the modal-shared code sequence with the text-specific code sequence; and constructs a quantized image embedding by combining the modal-shared code sequence with the image-specific code sequence. The invention realizes the collaborative enhancement of semantic understanding and structural perception, thereby providing generative recommendation with a unified representation basis that is semantically rich, structurally perceptive, and capable of guiding generation.
Inventors
- Shi Lei
- Zhao Yuqiu
- Zhou Fei
- Qin Guanyu
- Long Long
- Liu Wen
- Ji Yonghui
- Lai Junli
Assignees
- Communication University of China (中国传媒大学)
- Guangxi Zhuang Autonomous Region Information Center (Guangxi Zhuang Autonomous Region Big Data Research Institute)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-11-19
Claims (10)
- 1. A semantic enhancement collaborative alignment fusion expression method based on multi-modal data, characterized by comprising the following steps:
  S1, respectively extracting, from the multi-modal data of an item, a text semantic embedding $E_t$, an image semantic embedding $E_v$ and a collaborative embedding $E_c$, the collaborative embedding $E_c$ including associated information, which comprises user-item interaction information, or/and text-image collaborative attributes, and compressing the text semantic embedding $E_t$ and the image semantic embedding $E_v$ respectively into latent semantic embeddings $Z_t$ and $Z_v$ of the same dimension;
  S2, generating a joint latent representation $Z$ from the latent semantic embeddings $Z_t$ and $Z_v$ through a linear projection layer, hierarchically quantizing the joint latent representation $Z$ through an $L$-layer shared codebook to generate a codeword sequence, and obtaining a modal-shared code sequence $S_{\text{share}}$ through modal-shared residual quantization;
  S3, inputting the last shared layer of the modal-shared residual quantization into a modal-specific quantizer for modal separation, then hierarchically quantizing each branch through layers of modality-specific codebooks to generate codeword sequences, and obtaining a text-specific code sequence $S_t$ and an image-specific code sequence $S_v$ through modality-specific residual quantization;
  S4, jointly constructing a quantized text embedding $\hat{Z}_t$ from the modal-shared code sequence $S_{\text{share}}$ and the text-specific code sequence $S_t$, and jointly constructing a quantized image embedding $\hat{Z}_v$ from the modal-shared code sequence $S_{\text{share}}$ and the image-specific code sequence $S_v$;
  S5, outputting the quantized text embedding $\hat{Z}_t$ and the quantized image embedding $\hat{Z}_v$ associated with the item.
- 2. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 1, wherein in step S4 the shared codewords and text-specific codewords of the quantized text embedding $\hat{Z}_t$ are extracted to form a text token sequence serving as the text identifier, and the shared codewords and image-specific codewords of the quantized image embedding $\hat{Z}_v$ are extracted to form an image token sequence serving as the image identifier.
- 3. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 1, wherein the collaborative embedding $E_c$ also records the collaboratively perceived alignment information between the quantized text embedding $\hat{Z}_t$ and the quantized image embedding $\hat{Z}_v$.
- 4. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 1, wherein in step S1 the text semantic embedding $E_t$ and the image semantic embedding $E_v$ are respectively extracted from the multi-modal data of the item by pre-trained models, a text encoder compresses the text semantic embedding $E_t$ into the latent semantic embedding $Z_t$, and an image encoder compresses the image semantic embedding $E_v$ into the latent semantic embedding $Z_v$.
- 5. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 1, wherein in step S2 the latent semantic embeddings $Z_t$ and $Z_v$ are first concatenated and then passed through a linear projection layer to generate the joint latent representation $Z$; each level $l$ of the $L$-layer shared codebook is provided with a shared codebook $\mathcal{C}^{(l)} = \{e_k^{(l)}\}_{k=1}^{K}$, $l = 1, \dots, L$, and the modal-shared residual quantization is expressed as $c_l = \arg\min_k \lVert r_{l-1} - e_k^{(l)} \rVert_2$ and $r_l = r_{l-1} - e_{c_l}^{(l)}$, with $r_0 = Z$, wherein $\arg\min_k$ indicates selection by smallest Euclidean distance, $e_k^{(l)}$ represents the $k$-th codeword of the $l$-th level of the shared codebook, $r_l$ is the shared semantic residual of the $l$-th level, $e_{c_l}^{(l)}$ is the selected codeword, and the modal-shared code sequence is $S_{\text{share}} = (c_1, c_2, \dots, c_L)$ (a runnable sketch of this two-stage quantization is given after the claims).
- 6. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 5, wherein in step S3 a modal-specific quantizer performs modal separation on the residual $r_L$ of the last shared layer of the modal-shared quantization, obtaining text semantics $h_t$ for text-modality quantization and image semantics $h_v$ for image-modality quantization.
- 7. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 6, wherein in step S3 the text semantics $h_t$ are quantized into a codeword sequence through $L_t$ layers of text-specific codebooks, the $l$-th level text-specific codebook corresponding to the text modality codebook $\mathcal{C}_t^{(l)} = \{e_k^{t,(l)}\}_{k=1}^{K}$, and the modality-specific residual quantization is expressed as $c_l^t = \arg\min_k \lVert r_{l-1}^t - e_k^{t,(l)} \rVert_2$ and $r_l^t = r_{l-1}^t - e_{c_l^t}^{t,(l)}$, with $r_0^t = h_t$, wherein $e_k^{t,(l)}$ represents the $k$-th codeword of the $l$-th level of the text-specific codebook, $r_l^t$ is the text semantic residual of the $l$-th level, $e_{c_l^t}^{t,(l)}$ is the selected codeword, and the text-specific code sequence is $S_t = (c_1^t, \dots, c_{L_t}^t)$.
- 8. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 7, wherein in step S3 the image semantics $h_v$ are quantized into a codeword sequence through $L_v$ layers of image-specific codebooks, and the modality-specific residual quantization is expressed as $c_l^v = \arg\min_k \lVert r_{l-1}^v - e_k^{v,(l)} \rVert_2$ and $r_l^v = r_{l-1}^v - e_{c_l^v}^{v,(l)}$, with $r_0^v = h_v$, wherein $e_k^{v,(l)}$ represents the $k$-th codeword of the $l$-th level of the image-specific codebook, $r_l^v$ is the image semantic residual of the $l$-th level, $e_{c_l^v}^{v,(l)}$ is the selected codeword, and the image-specific code sequence is $S_v = (c_1^v, \dots, c_{L_v}^v)$.
- 9. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 8, wherein the quantized text embedding $\hat{Z}_t$ is used to reconstruct the original input features, encoded as $\hat{E}_t$, and the quantized image embedding $\hat{Z}_v$ is used to reconstruct the original input features, encoded as $\hat{E}_v$; the residual quantization of step S2 and step S3 constructs the following loss function: $\mathcal{L}_{\mathrm{RQ}} = \mathcal{L}_{\mathrm{share}} + \mathcal{L}_{\mathrm{text}} + \mathcal{L}_{\mathrm{img}}$, each branch taking the form $\sum_{l} \big( \lVert \mathrm{sg}[r_{l-1}] - e_{c_l}^{(l)} \rVert_2^2 + \beta \lVert r_{l-1} - \mathrm{sg}[e_{c_l}^{(l)}] \rVert_2^2 \big)$ over its own residuals and codebooks, wherein $\mathrm{sg}[\cdot]$ indicates the stop-gradient operation, $\mathcal{L}_{\mathrm{share}}$ is the modal-shared residual quantization loss, $\mathcal{L}_{\mathrm{text}}$ is the text-modality residual quantization loss, $\mathcal{L}_{\mathrm{img}}$ is the image-modality residual quantization loss, and $\beta$ is a coefficient balancing the optimization strength of the codebook embeddings and the encoder (a sketch of this loss is given after the claims).
- 10. The semantic enhancement collaborative alignment fusion expression method based on multi-modal data according to claim 9, wherein the total loss function of step S2 to step S5 is expressed as $\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{RQ}}$, wherein $\mathcal{L}$ denotes the total loss, $\mathcal{L}_{\mathrm{recon}}$ denotes the reconstruction loss of the quantized text embedding and the quantized image embedding during reconstruction, and $\mathcal{L}_{\mathrm{RQ}}$ denotes minimizing the loss between the residual vectors and their corresponding codebook embeddings.
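The two-stage quantization of claims 5-8 follows the standard residual quantization recursion. Below is a minimal runnable sketch in PyTorch, assuming hypothetical sizes (`B`, `d`, `K` and the level counts) and using identity routing in place of the modal-specific separation network of claim 6, which the patent does not pin down here:

```python
import torch

def residual_quantize(z, codebooks):
    """Hierarchical residual quantization (the recursion of claims 5, 7 and 8).

    z: (B, d) initial residual r_0; codebooks: list of (K, d) tensors, one per level.
    Returns the code indices (B, L), the summed codeword embedding and the final residual.
    """
    residual = z
    codes = []
    quantized = torch.zeros_like(z)
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # Euclidean distances to all K codewords
        idx = dists.argmin(dim=-1)          # c_l = argmin_k ||r_{l-1} - e_k^{(l)}||_2
        e = cb[idx]                         # selected codewords e_{c_l}^{(l)}
        codes.append(idx)
        quantized = quantized + e
        residual = residual - e             # r_l = r_{l-1} - e_{c_l}^{(l)}
    return torch.stack(codes, dim=-1), quantized, residual

# Hypothetical sizes: batch, latent dimension, codebook size, level counts.
B, d, K = 4, 64, 256
Z = torch.randn(B, d)                                 # joint latent representation (S2)
shared_cbs = [torch.randn(K, d) for _ in range(3)]    # L = 3 shared levels
text_cbs = [torch.randn(K, d) for _ in range(2)]      # text-specific levels
img_cbs = [torch.randn(K, d) for _ in range(2)]       # image-specific levels

# S2: modal-shared residual quantization.
S_share, Q_share, r_last = residual_quantize(Z, shared_cbs)

# S3: the last shared residual is separated per modality; claim 6 implies a
# learned separation network, so identity routing here is only a placeholder.
S_text, Q_text, _ = residual_quantize(r_last, text_cbs)
S_img, Q_img, _ = residual_quantize(r_last, img_cbs)

# S4: quantized embeddings combine shared and modality-specific codewords.
Z_text_hat = Q_share + Q_text
Z_img_hat = Q_share + Q_img
```

The shared code sequence is reused by both modalities while each branch appends its own specific codes, which is what lets the quantized text and image embeddings share structure yet retain modality-specific detail.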
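Continuing the sketch above, a minimal form of the stop-gradient quantization loss of claim 9 and the total loss of claim 10 might look as follows; `decoder_t`, `decoder_v` and the reconstruction targets `E_t`, `E_v` are hypothetical stand-ins for the patent's encoders and original input features:

```python
import torch
import torch.nn.functional as F

def rq_loss(r0, codebooks, beta=0.25):
    """Per-branch quantization loss of claim 9:
    sum_l ||sg[r_{l-1}] - e_{c_l}||^2 + beta * ||r_{l-1} - sg[e_{c_l}]||^2.
    """
    residual = r0
    loss = r0.new_zeros(())
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)
        e = cb[idx]
        loss = loss + F.mse_loss(residual.detach(), e)         # codebook term (sg on residual)
        loss = loss + beta * F.mse_loss(residual, e.detach())  # commitment term (sg on codeword)
        residual = residual - e
    return loss

# Claim 9: one quantization loss per branch (shared, text-specific, image-specific).
L_rq = rq_loss(Z, shared_cbs) + rq_loss(r_last, text_cbs) + rq_loss(r_last, img_cbs)

# Claim 10: total loss = reconstruction loss + quantization loss. The decoders
# and targets are hypothetical, so the reconstruction term is shown as a comment:
# L_recon = F.mse_loss(decoder_t(Z_text_hat), E_t) + F.mse_loss(decoder_v(Z_img_hat), E_v)
# L_total = L_recon + L_rq
```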
Description
Semantic enhancement collaborative alignment fusion expression method based on multi-modal data

Technical Field

The invention relates to the field of semantic enhancement collaborative processing of multi-modal data, in particular to a semantic enhancement collaborative alignment fusion expression method based on multi-modal data.

Background

Recommendation systems play a vital role in surfacing personalized content in different scenarios, such as video platforms, e-commerce shopping, and movie recommendation. With the rapid development of large language models and recommendation system technology, generative recommendation, as a new-generation recommendation paradigm, is gradually evolving from the traditional 'retrieval-ranking' mode to an 'understanding-generation' mode. Compared with traditional methods that match only on user-item interaction history, generative recommendation can dynamically generate personalized recommendation results through a constrained generation mechanism, improving the quality of content distribution in cold-start scenarios and markedly improving user experience and system flexibility. How to encode items (including articles, video entities, objects, etc.) with multi-modal information is the core of the 'understanding-generation' mode, and helps overcome the representation limitations of the discriminative methods of the traditional 'retrieval-ranking' mode.

At present, many problems remain to be solved in the depth of semantic understanding, the capability of multi-source information fusion, and the utilization of collaborative signals. First, multi-modal information in real scenarios (such as text data, image data, user preferences, and item characteristics) is often presented in combined forms such as image-text pairs, yet the semantic identifier structure of current generative recommendation models depends on single-modal information, and some research processes multi-modal information only by simple concatenation or independent encoding; the lack of deep fusion and alignment of cross-modal semantics prevents the model from learning a unified semantic representation that exploits complementary information. Second, existing research focuses only on single-modal semantic attributes and ignores the collaborative signals in user interaction history, so the interaction patterns between semantic understanding and collaborative behavior cannot be preserved; existing methods generally model collaborative signals and semantic content independently and cannot construct them jointly at the semantic enhancement level, so the generation process lacks deep joint guidance from multi-modal semantics. In summary, against the dual background of the explosive growth of multi-modal content and the rapid development of generative artificial intelligence, a method that deeply fuses multi-modal semantics with collaborative signals to construct a novel multi-modal generative semantic enhancement collaborative alignment fusion expression is needed to break through the technical bottleneck of modality splitting and signal isolation.
Disclosure of Invention

The invention aims to provide a semantic enhancement collaborative alignment fusion expression method based on multi-modal data, which breaks through the technical bottleneck of modality splitting and signal isolation and realizes the collaborative enhancement of semantic understanding and structural perception, thereby providing generative recommendation with a unified representation basis that is semantically rich, structurally perceptive, and capable of guiding generation. The aim of the invention is achieved by the following technical scheme: a semantic enhancement collaborative alignment fusion expression method based on multi-modal data comprises the following steps:

S1, respectively extracting, from the multi-modal data of an item, a text semantic embedding $E_t$, an image semantic embedding $E_v$ and a collaborative embedding representation $E_c$; the collaborative embedding $E_c$ includes associated information or/and text-image collaborative attributes, the associated information being information associated with the text or/and the image; the text semantic embedding $E_t$ and the image semantic embedding $E_v$ are respectively compressed into latent semantic embeddings $Z_t$ and $Z_v$ of the same dimension;

S2, generating a joint latent representation $Z$ from the latent semantic embeddings $Z_t$ and $Z_v$ through a linear projection layer, hierarchically quantizing the joint latent representation $Z$ through an $L$-layer shared codebook to generate a codeword sequence, and obtaining a modal-shared code sequence $S_{\text{share}}$ through modal-shared residual quantization;

S3, inputting the last shared layer of the modal-shared residual quantization process