CN-121981248-A - Question answering method and device based on multimodal knowledge graph and knowledge distillation
Abstract
The invention discloses a question-answering method and device based on distillation-type retrieval-augmented generation enhanced by a multimodal knowledge graph. The method comprises: constructing a heterogeneous-information knowledge base that uniformly represents and associates text and images; receiving a multimodal query from a user, encoding it into a unified multimodal query vector, and retrieving the most relevant knowledge context from the multimodal knowledge graph by combining semantic retrieval and structured retrieval; efficiently transferring the capability of a large multimodal first language model to a lightweight second language model; and, using the trained second language model, taking the user query, the retrieved knowledge context and the trained soft prompt vector as input to form a complete multimodal answer, which is presented to the user. The method greatly enriches the interaction dimensions and information-carrying capacity of the question-answering system, can handle more natural and complex user queries, and provides more intuitive, more informative answers.
Inventors
- ZHAO MIN
- HUO MEIRU
- WAN XUEFENG
- DONG CHENNI
- SUN LIYAN
- HAN CHAO
- DANG XIAOYAN
- ZHANG JIANLIANG
Assignees
- 国网山西省电力有限公司信息通信分公司 (State Grid Shanxi Electric Power Co., Ltd., Information and Communication Branch)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-11-26
Claims (10)
- 1. A distillation-type retrieval-augmented generation question-answering method enhanced by a multimodal knowledge graph, characterized by comprising the following steps: Step 1, constructing a heterogeneous-information knowledge base that uniformly represents and associates text and images, wherein all entity nodes and media contents in the multimodal knowledge graph generate unified multimodal embedding vectors in the same semantic space through a pre-trained multimodal alignment model; Step 2, receiving a multimodal query from a user, encoding it into a unified multimodal query vector, and retrieving the most relevant multimodal knowledge context from the multimodal knowledge graph by combining semantic retrieval and structured retrieval; Step 3, efficiently transferring the capability of a large multimodal first language model to a lightweight second language model, wherein the transfer is realized by minimizing a joint loss function comprising an output-layer distillation loss, a hidden-layer alignment loss and a contrastive learning loss, thereby completing the training of the second language model; and Step 4, inputting the user's multimodal query, the retrieved knowledge context and the soft prompt vector trained under the output-layer distillation loss into the trained second language model, parsing the sequence of text and special placeholders generated by the second language model, and replacing the special placeholders with relevant images selected from the retrieved knowledge context, thereby forming a complete multimodal answer that is presented to the user.
- 2. The method according to claim 1, wherein in step 1 the construction of the multimodal knowledge graph comprises: generating, for each entity node in the multimodal knowledge graph, a unified multimodal embedding vector in the same semantic space using a pre-trained multimodal alignment model; for the entity node of each knowledge node, computing the unified multimodal embedding vector by passing the text embedding produced by a text encoder and the associated image embedding produced by an image encoder through a fusion function, wherein the fusion function is a weighted average used to balance the importance of textual and visual information; and extending the entity nodes and edges of the multimodal knowledge graph by adding a multimodal embedding vector field to each entity node to store its unified multimodal embedding vector, introducing media nodes among the knowledge nodes to store their metadata and unified multimodal embedding vectors, and defining new relation edges to connect the entity nodes and the media nodes.
- 3. The method according to claim 1, wherein step 2 specifically comprises: encoding the text query and the image query contained in the received multimodal user query into a unified multimodal query vector using a multimodal fusion encoder; performing semantic retrieval by computing the similarity between the unified multimodal query vector and the unified multimodal embedding vectors of all knowledge nodes in the multimodal knowledge graph; performing structured retrieval by carrying out a graph traversal of the multimodal knowledge graph starting from the entity nodes with the highest semantic-retrieval similarity scores, exploring associated attributes, relations and other entities to form a structured knowledge subgraph; and fusing the semantic-retrieval and structured-retrieval scores by weighted combination to obtain a final ranking score, and selecting a preset number of knowledge fragments with the highest ranking scores as the knowledge context.
- 4. The method of claim 1, wherein in step 3 the joint loss function specifically comprises: the output-layer distillation loss, used to fit the output probability distribution of the second language model to that of the first language model, the vocabulary of the distribution including text tokens and a special placeholder; the hidden-layer alignment loss, used to align the intermediate-layer representations of the first and second language models when processing multimodal inputs; and the contrastive learning loss, used to enhance the second language model's discrimination of multimodal features by maximizing the similarity between a second-language-model representation and its corresponding positive sample from the first language model while minimizing its similarity to other negative samples from the first language model.
- 5. The method according to claim 4, wherein the output-layer distillation loss is a KL-divergence loss, the hidden-layer alignment loss is a mean squared error (MSE) loss, and the contrastive learning loss is computed with a similarity function scaled by a temperature hyperparameter.
- 6. The method of claim 1, wherein the soft prompt vector is adaptively adjusted and optimized during the training phase as an optimizable parameter of the second language model, alongside the minimization of the joint loss function, and is updated by gradients back-propagated from the joint loss function so as to learn how to guide the second language model.
- 7. The method of claim 1, wherein the multimodal response generation comprises: in the inference stage, providing the soft prompt vector together with the user's multimodal query and the retrieved multimodal context as input to the second language model to generate a sequence containing text and the special placeholder; parsing the sequence; and, when the special placeholder is detected, selecting the most relevant image from the retrieved knowledge context and inserting its URL or actual data into the final response, thereby forming a complete multimodal answer.
- 8. The method according to claim 1, wherein in step 1 the construction of the multimodal knowledge graph further comprises constructing a dynamic temporal multimodal knowledge graph by adding timestamps and/or validity attributes to the nodes and relations of the multimodal knowledge graph; in step 2, encoding the user's text query, image query and any time constraints contained therein into a unified multimodal temporal query vector using an enhanced multimodal fusion encoder; performing temporally aware semantic retrieval by computing the similarity between the multimodal temporal query vector and the unified multimodal temporal embedding vectors of all knowledge nodes in the dynamic temporal multimodal knowledge graph, wherein the semantic retrieval prioritizes knowledge nodes whose time attributes match the temporal intent of the query; and performing a temporally aware graph traversal of the dynamic temporal multimodal knowledge graph starting from the entity nodes with the highest semantic-retrieval scores, so as to discover associated knowledge valid within the query time range, thereby obtaining a temporally relevant knowledge context.
- 9. The method of claim 8, wherein in step 3 the transfer process further comprises cross-modal knowledge distillation, specifically Chain-of-Thought distillation of the first language model, so that the second language model can learn and express the reasoning process for complex knowledge, including temporal reasoning, thereby providing more interpretable and logically grounded multimodal responses; the chain of thought details the intermediate steps and logical basis by which the first language model proceeds from understanding a question to deriving an answer.
- 10. A distillation-type retrieval-augmented generation question-answering device enhanced by a multimodal knowledge graph, characterized by comprising: a multimodal knowledge graph construction unit for constructing a heterogeneous-information knowledge base that uniformly represents and associates text and images, wherein all entity nodes and media contents in the multimodal knowledge graph generate unified multimodal embedding vectors in the same semantic space through a pre-trained multimodal alignment model; a multimodal hybrid retrieval unit for receiving a multimodal query from a user, encoding it into a unified multimodal query vector, and retrieving the most relevant multimodal knowledge context from the multimodal knowledge graph by combining semantic retrieval and structured retrieval; a cross-modal knowledge distillation unit for efficiently transferring the capability of a large multimodal first language model to a lightweight second language model, the transfer being realized by minimizing a joint loss function comprising an output-layer distillation loss, a hidden-layer alignment loss and a contrastive learning loss, thereby completing the training of the second language model; and a multimodal response generation unit for inputting the user's multimodal query, the retrieved knowledge context and the soft prompt vector trained under the output-layer distillation loss into the trained second language model, parsing the sequence of text and special placeholders generated by the second language model, and replacing the special placeholders with relevant images selected from the retrieved knowledge context, thereby forming a complete multimodal answer that is presented to the user.
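The hybrid retrieval of claim 3 (semantic similarity scores fused with structured-retrieval scores by a weighted combination) can be sketched as follows. This is a minimal illustration, not the patent's implementation: cosine similarity for the semantic score, a hypothetical per-node score from graph traversal, and the weight `alpha` are all illustrative assumptions.

```python
import numpy as np

def cosine_sim(query_vec, node_matrix):
    # Semantic retrieval: cosine similarity between the unified multimodal
    # query vector and each knowledge node's unified embedding vector.
    q = query_vec / np.linalg.norm(query_vec)
    m = node_matrix / np.linalg.norm(node_matrix, axis=1, keepdims=True)
    return m @ q

def fuse_scores(semantic, structured, alpha=0.7):
    # Weighted fusion of semantic and structured retrieval scores (claim 3);
    # alpha is an illustrative weighting hyperparameter.
    return alpha * semantic + (1.0 - alpha) * structured

def top_k(scores, k=2):
    # Indices of the k highest-ranked knowledge fragments (the context).
    return np.argsort(scores)[::-1][:k]

# Toy example: 4 knowledge nodes with 3-dimensional unified embeddings.
query = np.array([1.0, 0.0, 0.0])
nodes = np.array([[1.0, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.9, 0.0, 0.4],
                  [0.0, 0.0, 1.0]])
semantic = cosine_sim(query, nodes)
structured = np.array([0.2, 0.9, 0.1, 0.0])  # hypothetical traversal scores
final = fuse_scores(semantic, structured)
context_ids = top_k(final)
```

In this toy setup, nodes 0 and 2 win: their semantic scores dominate even though node 1 has the highest structured score, reflecting how `alpha` trades off the two retrieval channels.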
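The joint loss of claims 4 and 5 combines a KL-divergence output-layer term, an MSE hidden-layer alignment term, and a temperature-scaled contrastive term. The NumPy sketch below illustrates one plausible reading; the weights `w_kl`, `w_mse`, `w_con`, the softmax temperature, and the InfoNCE form of the contrastive loss are illustrative assumptions not specified by the patent.

```python
import numpy as np

def softmax(logits, t=1.0):
    z = logits / t
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(teacher_logits, student_logits, t=2.0):
    # Output-layer distillation loss (claim 5): KL(teacher || student) over a
    # vocabulary that includes text tokens and the special placeholder.
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return float(np.sum(p * np.log(p / q)))

def mse_alignment(teacher_hidden, student_hidden):
    # Hidden-layer alignment loss (claim 5): MSE between intermediate
    # representations of the first and second language models.
    return float(np.mean((teacher_hidden - student_hidden) ** 2))

def contrastive_loss(student_repr, teacher_pos, teacher_negs, tau=0.1):
    # Contrastive learning loss (claims 4-5), sketched as InfoNCE with
    # temperature tau: pull the student representation toward its positive
    # teacher sample, push it away from negative teacher samples.
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(sim(student_repr, teacher_pos) / tau)
    negs = sum(np.exp(sim(student_repr, n) / tau) for n in teacher_negs)
    return float(-np.log(pos / (pos + negs)))

def joint_loss(l_kl, l_mse, l_con, w_kl=1.0, w_mse=0.5, w_con=0.5):
    # Weighted sum minimized during distillation (claim 1, step 3).
    return w_kl * l_kl + w_mse * l_mse + w_con * l_con

# Toy example with made-up teacher/student tensors.
l_total = joint_loss(
    kl_divergence(np.array([2.0, 1.0, 0.1]), np.array([1.8, 1.1, 0.2])),
    mse_alignment(np.array([0.5, -0.2]), np.array([0.4, -0.1])),
    contrastive_loss(np.array([1.0, 0.2]), np.array([0.9, 0.3]),
                     [np.array([-1.0, 0.5])]),
)
```

In practice each term would be computed per batch inside an autodiff framework so gradients reach both the student weights and the soft prompt vector of claim 6.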
Description
Technical Field

The invention belongs to the field of intelligent question answering, and particularly relates to a question-answering method and device based on a multimodal knowledge graph and knowledge distillation.

Background

Question-answering systems have made significant progress in processing plain-text information, but in the face of growing user demand for handling queries that contain multimodal information such as images and audio, existing systems face a number of challenges. The invention provides a multimodal knowledge graph enhanced distillation RAG (MM-KG-DRAG) framework, which extends the original KG-DRAG framework from the pure-text domain to the multimodal domain, so that it can understand queries containing information such as images and audio and generate rich responses containing various media formats. In particular, existing question-answering systems face the following core challenges when handling multimodal information. Unified knowledge representation: how to simultaneously represent and correlate heterogeneous data such as text, images and audio in a single knowledge graph is a key difficulty in constructing a multimodal knowledge base. Joint multimodal understanding: how the system effectively understands a query that mixes text and images, e.g., when a user uploads a picture of a building and asks "Who is the designer of this building?" Traditional single-modality understanding methods have difficulty handling such compound queries. Multimodal generation capability: how to enable a lightweight question-answering model not only to generate text but also to appropriately embed images or other media content in its answers, thereby providing more intuitive, more informative responses.
In order to overcome these challenges, the invention provides a comprehensive technical scheme aimed at improving the ability of a question-answering system to process and generate multimodal information.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a distillation-type retrieval-augmented generation question-answering method enhanced by a multimodal knowledge graph, comprising the following steps. Step 1: construct a heterogeneous-information knowledge base that uniformly represents and associates text and images, wherein all entity nodes and media contents in the multimodal knowledge graph generate unified multimodal embedding vectors in the same semantic space through a pre-trained multimodal alignment model. Step 2: receive a multimodal query from a user, encode it into a unified multimodal query vector, and retrieve the most relevant multimodal knowledge context from the multimodal knowledge graph by combining semantic retrieval and structured retrieval. Step 3: efficiently transfer the capability of a large multimodal first language model to a lightweight second language model, the transfer being realized by minimizing a joint loss function comprising an output-layer distillation loss, a hidden-layer alignment loss and a contrastive learning loss, thereby completing the training of the second language model. Step 4: input the user's multimodal query, the retrieved knowledge context and the soft prompt vector trained under the output-layer distillation loss into the trained second language model, parse the sequence of text and special placeholders generated by the second language model, and replace the special placeholders with relevant images selected from the retrieved knowledge context, thereby forming a complete multimodal answer that is presented to the user.
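The response assembly of Step 4 (parsing the generated sequence and swapping each special placeholder for a retrieved image) might be sketched as below. The placeholder token `<IMG>`, the `(url, score)` tuple shape, and selection of images in descending relevance order are illustrative assumptions, not details fixed by the patent.

```python
def assemble_response(generated, retrieved_images, placeholder="<IMG>"):
    """Replace each special placeholder emitted by the second language model
    with the URL of the most relevant retrieved image (Step 4).

    generated        -- token sequence produced by the lightweight model
    retrieved_images -- list of (url, relevance_score) from the knowledge context
    """
    # Rank retrieved images by relevance so the best match is inserted first.
    ranked = sorted(retrieved_images, key=lambda x: x[1], reverse=True)
    parts = []
    img_idx = 0
    for token in generated:
        if token == placeholder and img_idx < len(ranked):
            parts.append(f"[image: {ranked[img_idx][0]}]")
            img_idx += 1
        else:
            # Plain text token, or a placeholder with no image left to insert.
            parts.append(token)
    return " ".join(parts)

# Toy example: the model emits text tokens plus one placeholder.
tokens = ["The", "tower", "looks", "like", "this:", "<IMG>"]
images = [("http://example.com/a.jpg", 0.4), ("http://example.com/b.jpg", 0.9)]
answer = assemble_response(tokens, images)
```

A production system would insert actual image data or renderable markup rather than a bracketed URL, but the parse-and-substitute control flow is the same.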
In particular, in step 1 the construction of the multimodal knowledge graph includes generating, for each entity node in the multimodal knowledge graph, a unified multimodal embedding vector in the same semantic space using a pre-trained multimodal alignment model. For the entity node of each knowledge node, the unified multimodal embedding vector is computed by passing the text embedding produced by a text encoder and the associated image embedding produced by an image encoder through a fusion function, wherein the fusion function is a weighted average used to balance the importance of textual and visual information. The entity nodes and edges of the multimodal knowledge graph are extended: a multimodal embedding vector field is added to each entity node to store its unified multimodal embedding vector, media nodes are introduced among the knowledge nodes to store their metadata and unified multimodal embedding vectors, and new relation edges are defined to connect the entity nodes and the media nodes.
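The fusion function described above, a weighted average of text and image embeddings, admits a direct sketch. The weight `w_text` and the L2 re-normalization onto the unit sphere are illustrative choices; the patent specifies only that the fusion is a weighted average balancing textual and visual information.

```python
import numpy as np

def fuse_embeddings(text_emb, image_emb, w_text=0.6):
    # Weighted average of the text-encoder and image-encoder embeddings,
    # balancing textual and visual importance (step 1 / claim 2).
    fused = w_text * text_emb + (1.0 - w_text) * image_emb
    # Re-normalize so fused vectors lie on the unit sphere of the shared
    # semantic space (an illustrative convention, not from the patent).
    return fused / np.linalg.norm(fused)

# Toy example: orthogonal 3-dimensional text and image embeddings.
text_emb = np.array([1.0, 0.0, 0.0])
image_emb = np.array([0.0, 1.0, 0.0])
node_vec = fuse_embeddings(text_emb, image_emb)
```

With `w_text=0.6` the fused vector leans toward the text embedding, which is the kind of importance balancing the fusion function is meant to provide.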