CN-121980089-A - Knowledge enhancement-based multi-modal interactive guidance recommendation method and system
Abstract
The invention provides a knowledge-enhancement-based multi-modal interactive guidance recommendation method and system, belonging to the technical field of user interaction data processing. The method comprises: obtaining a multi-modal dialog context containing text and images; performing feature extraction to generate an integrated context representation; retrieving a plurality of similar samples with the integrated context representation based on a prototype contrastive learning strategy; generating a target text response from the response fragments in the similar samples and the integrated context representation; and determining a target image response through an attention mechanism based on the target text response and the images in the similar samples. The accuracy and consistency of multi-modal interactive guidance recommendation are improved through the cooperation of an end-to-end collaborative framework, visual coding based on supervised contrastive learning, and multi-modal representation learning based on prototype regularization.
Inventors
- MENG LEI
- YANG CHENGYE
- QI ZHUANG
- LI ZIXUAN
- MENG XIANGXU
Assignees
- Shandong University (山东大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-08
Claims (10)
- 1. A knowledge-enhancement-based multi-modal interactive guidance recommendation method, comprising: acquiring a multi-modal dialog context including a text context and an image context; extracting features from the text context and from the image context to generate a comprehensive context representation of the current dialog; retrieving, based on a prototype contrastive learning strategy, a plurality of similar samples from a preset knowledge base using the comprehensive context representation, wherein the comprehensive context representations are clustered, at least one cluster center is taken as a prototype representation, and the retrieval process is optimized by comparing the similarity between the comprehensive context representation and the prototype representations; generating a target text response for the multi-modal dialog context according to the response fragments in the plurality of similar samples and the comprehensive context representation; determining a target image response through an attention mechanism based on the target text response and the images in the plurality of similar samples; and outputting a guidance recommendation result containing the target text response and the target image response.
- 2. The knowledge-enhanced multi-modal interactive guidance recommendation method as set forth in claim 1, wherein the method is implemented based on an IEPMCR architecture, the IEPMCR architecture adopts an end-to-end joint fine-tuning paradigm and specifically comprises an end-to-end collaborative framework module, a visual coding module based on supervised contrastive learning, and a multi-modal representation learning module based on prototype regularization.
- 3. The method of claim 2, wherein image context processing is performed by the end-to-end collaborative framework module by: embedding two augmented samples of the same input image in a mini-batch into a query feature vector and a key feature vector using an image encoder and a momentum encoder, respectively; maintaining the key feature vectors through iterative dequeue and enqueue operations; and strengthening the similarity relationships between dialog contexts using a triplet ranking loss function.
- 4. The method of claim 3, wherein the end-to-end collaborative framework module employs the multi-modal pre-trained large model RERG, wherein in response generation a GRU network serves as both encoder and decoder, and the corresponding hidden state is mapped to the vocabulary space via the decoder's embedding matrix.
- 5. The knowledge-enhanced multi-modal interactive guidance recommendation method of claim 2, wherein the multi-modal representation learning module learns a generic feature representation by clustering and selecting similar samples, comprising: applying k-means clustering to the multi-modal representation set and taking the center of each cluster as a prototype representation; determining the cluster corresponding to the current context representation and obtaining its prototype representation; and then performing prototype regularization using a contrastive learning algorithm.
- 6. The knowledge-enhanced multi-modal interactive guidance recommendation method of claim 2, wherein the visual coding module selects positive and negative samples using category labels as additional supervision information: samples with the same category label are first selected to form a positive sample set, and then, for each positive sample, negative samples are selected from different categories.
- 7. The knowledge-enhanced multi-modal interactive guidance recommendation method of claim 2, wherein the IEPMCR architecture obtains a final model by training with multi-modal representation fusion and multi-level contrastive retrieval.
- 8. A knowledge-enhancement-based multi-modal interactive guidance recommendation system, characterized in that the system adopts an IEPMCR architecture to implement the multi-modal interactive guidance recommendation method as set forth in any one of claims 1-7, wherein the IEPMCR architecture comprises an end-to-end collaborative framework module, a visual coding module based on supervised contrastive learning, and a multi-modal representation learning module based on prototype regularization.
- 9. A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the steps of the knowledge-enhancement-based multi-modal interactive guidance recommendation method according to any one of claims 1-7.
- 10. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the knowledge-enhancement-based multi-modal interactive guidance recommendation method according to any one of claims 1-7 when executing the program.
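The prototype-based retrieval recited in claims 1 and 5 (k-means cluster centers used as prototype representations, with the current context routed to its most similar prototype before similar samples are selected) can be sketched as follows. This is a minimal illustrative sketch in plain Python, not the patented implementation; the function names (`kmeans_prototypes`, `retrieve_similar`), the cosine-similarity measure, and the fixed iteration count are assumptions for illustration.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans_prototypes(reps, k, iters=20, seed=0):
    """Cluster context representations with k-means; return the cluster
    centers (the prototype representations) and each sample's cluster id."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(reps, k)]
    assign = [0] * len(reps)
    for _ in range(iters):
        # Assign each representation to its most similar prototype.
        assign = [max(range(k), key=lambda c: cosine(v, centers[c])) for v in reps]
        # Recompute each prototype as the mean of its cluster members.
        for c in range(k):
            members = [reps[i] for i in range(len(reps)) if assign[i] == c]
            if members:
                dim = len(members[0])
                centers[c] = [sum(m[d] for m in members) / len(members) for d in range(dim)]
    return centers, assign

def retrieve_similar(context_rep, reps, centers, assign, top_n=2):
    """Route the current context to its nearest prototype, then return the
    indices of the top-n most similar samples within that cluster."""
    c = max(range(len(centers)), key=lambda i: cosine(context_rep, centers[i]))
    members = [i for i in range(len(reps)) if assign[i] == c]
    return sorted(members, key=lambda i: -cosine(context_rep, reps[i]))[:top_n]
```

With `k = 1` this degenerates to plain nearest-neighbour retrieval over the whole knowledge base; the prototype step matters when the base is large, since only the members of one cluster are scanned per query.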
Description
Knowledge Enhancement-Based Multi-Modal Interactive Guidance Recommendation Method and System

Technical Field
The invention belongs to the technical field of user interaction data processing, and particularly relates to a knowledge-enhancement-based multi-modal interactive guidance recommendation method and system.

Background
The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

A multi-modal dialog system enhances user interaction by integrating data of multiple modalities, such as text and images, so that the user can interact with the machine in a more natural and intuitive manner. Depending on the system architecture, multi-modal dialog systems can be divided into two broad categories: modular architectures and end-to-end architectures. A modular architecture achieves task decoupling and flexible expansion by dividing the system into several independent modules, but it still faces challenges in multi-modal information fusion and consistency; especially in complex interaction scenarios, coordination and integration between modules often become the bottleneck. An end-to-end architecture generates output directly from input through a deep learning model, avoiding complex inter-module communication; it has made progress in streamlining system design and implementation, yet it still faces challenges in fusing multi-modal information and effectively acquiring related information.

In the prior art, multi-modal dialog systems commonly suffer from the following problems. First, information from different modalities may introduce ambiguity, affecting the system's ability to generate accurate responses; existing methods, although able to rely on end-to-end modeling to learn multi-modal representations and generate responses, often suffer from conservative output, weak modality cooperativity, and limited interpretability.
Second, the prior art has difficulty understanding and integrating context information, particularly historical data, and the retrieval process and the generation process lack an effective linking mechanism, so the model has limited interpretability and it is difficult to ensure consistency between retrieval and generation in complex interaction scenarios.

Disclosure of Invention
To overcome the defects of the prior art, the invention provides a knowledge-enhancement-based multi-modal interactive guidance recommendation method and system, which improve the accuracy and consistency of multi-modal interactive guidance recommendation through the cooperation of an end-to-end collaborative framework, visual coding based on supervised contrastive learning, and multi-modal representation learning based on prototype regularization.

To achieve the above object, one or more embodiments of the present invention provide the following technical solutions. The invention provides a knowledge-enhancement-based multi-modal interactive guidance recommendation method.
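The visual coding based on supervised contrastive learning mentioned above uses category labels as additional supervision: samples sharing the anchor's label form the positive set, and negatives are drawn from the remaining categories. Below is a minimal plain-Python sketch of that label-supervised pair selection, together with a SupCon-style contrastive loss for a single anchor. The function names, the L2 normalization, and the temperature hyperparameter are standard contrastive-learning assumptions for illustration, not details specified by the patent.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(v):
    """L2-normalize a vector (assumes a non-zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def select_pairs(labels, anchor):
    """Category labels act as extra supervision: samples sharing the anchor's
    label are positives; all other categories supply the negatives."""
    positives = [i for i, y in enumerate(labels) if y == labels[anchor] and i != anchor]
    negatives = [i for i, y in enumerate(labels) if y != labels[anchor]]
    return positives, negatives

def sup_con_loss(embeddings, labels, anchor, temperature=0.1):
    """Supervised contrastive loss for one anchor: average over positives of
    -log( exp(sim_pos / t) / sum over all non-anchor samples of exp(sim / t) )."""
    embs = [_norm(e) for e in embeddings]
    positives, negatives = select_pairs(labels, anchor)
    others = positives + negatives
    logits = {i: _dot(embs[anchor], embs[i]) / temperature for i in others}
    denom = sum(math.exp(v) for v in logits.values())
    return -sum(math.log(math.exp(logits[p]) / denom) for p in positives) / len(positives)
```

When an anchor's embedding is close to its same-label positives and far from other-category negatives, the loss is near zero; mislabeled or poorly separated embeddings drive it up, which is what pushes the image encoder toward label-consistent visual features.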
A knowledge-enhancement-based multi-modal interactive guidance recommendation method comprises: acquiring a multi-modal dialog context including a text context and an image context; extracting features from the text context and from the image context to generate a comprehensive context representation of the current dialog; retrieving, based on a prototype contrastive learning strategy, a plurality of similar samples from a preset knowledge base using the comprehensive context representation, wherein the comprehensive context representations are clustered, at least one cluster center is taken as a prototype representation, and the retrieval process is optimized by comparing the similarity between the comprehensive context representation and the prototype representations; generating a target text response for the multi-modal dialog context according to the response fragments in the plurality of similar samples and the comprehensive context representation; determining a target image response through an attention mechanism based on the target text response and the images in the plurality of similar samples; and outputting a guidance recommendation result containing the target text response and the target image response. Further, the multi-modal interactive guidance recommendation method is realized based on an IEPMCR architecture, wherein the IEPMCR architecture adopts an end-to-end joint fine-tuning paradigm and specifically comprises an end-to-end collaborative framework module, a visual coding module based on supervised contrastive learning, and a multi-modal representation learning module based on prototype regularization. Further, the image context processing is performed by the end-to-end collaborative framework module, and comprises the steps of embedding two augmented samples of the same input image in a mini-batch into query and key feature vec