CN-121980050-A - Information generation method and device based on large language model and cross-modal retrieval
Abstract
The application relates to the field of information technology and discloses an information generation method based on a large language model and cross-modal retrieval, comprising the steps of: obtaining query input information; encoding the query input information to obtain a query vector; retrieving a plurality of pieces of image information and a plurality of pieces of text information based on the query vector; generating an attention query vector based on the query vector and the words generated so far; determining attention weights from the attention query vector and each piece of text information, and performing a weighted summation over the pieces of text information according to their corresponding attention weights to obtain a text content representation; determining attention weights from the attention query vector and each piece of image information, and performing a weighted summation over the pieces of image information according to their corresponding attention weights to obtain an image content representation; fusing the image content representation and the text content representation to obtain a modal fusion representation; and generating an answer according to the modal fusion representation. In this manner, answer accuracy is improved.
Inventors
- ZHANG Lixin
- HAN Fengzhe
- DANG Guangyue
- CHEN Zongxian
Assignees
- 深圳市前海研祥亚太电子装备技术有限公司 (Shenzhen Qianhai Yanxiang Asia-Pacific Electronic Equipment Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-25
Claims (10)
- 1. An information generation method based on a large language model and cross-modal retrieval, characterized by comprising the following steps: acquiring query input information; encoding the query input information to obtain a query vector; respectively performing image retrieval and text retrieval based on the query vector to obtain a plurality of pieces of image information and a plurality of pieces of text information that are strongly related to the query vector; and an answer generation step comprising: for the word to be generated, generating an attention query vector based on the query input information if the word to be generated is the first word, and generating an attention query vector based on the query input information and the words already generated if the word to be generated is a subsequent word; calculating the similarity between the attention query vector and each piece of text information to obtain the attention weight corresponding to each piece of text information; performing a weighted summation over the pieces of text information according to their corresponding attention weights to obtain a text content representation; calculating the similarity between the attention query vector and each piece of image information to obtain the attention weight corresponding to each piece of image information; performing a weighted summation over the pieces of image information according to their corresponding attention weights to obtain an image content representation; fusing the image content representation and the text content representation to obtain a modal fusion representation; generating the word to be generated according to the modal fusion representation; and repeatedly executing the answer generation step until an answer is obtained.
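The answer generation step of claim 1 can be sketched as follows. This is a minimal illustrative sketch, not the patent's exact formulation: the dot-product similarity, the softmax over the weights, and the equal-weight fusion are all assumptions the claim does not fix.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(attn_query, items):
    """Similarity -> attention weights -> weighted summation (claim 1)."""
    weights = softmax(items @ attn_query)  # one weight per retrieved item
    return weights @ items                 # weighted sum of the items

def generation_step(attn_query, text_vecs, image_vecs):
    text_repr = attend(attn_query, text_vecs)    # text content representation
    image_repr = attend(attn_query, image_vecs)  # image content representation
    return 0.5 * (text_repr + image_repr)        # modal fusion representation

rng = np.random.default_rng(0)
attn_query = rng.normal(size=8)        # built from query + generated words
text_vecs = rng.normal(size=(5, 8))    # 5 retrieved text segments
image_vecs = rng.normal(size=(3, 8))   # 3 retrieved images
fused = generation_step(attn_query, text_vecs, image_vecs)
assert fused.shape == (8,)
```

In the full method this step runs once per word to be generated, with the attention query vector recomputed from the words generated so far.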
- 2. The method of claim 1, wherein the calculating the similarity between the attention query vector and each piece of text information to obtain the attention weight corresponding to each piece of text information, and the performing a weighted summation over the pieces of text information according to their corresponding attention weights to obtain a text content representation, further comprises: generating a text context key and a text segment representation corresponding to each piece of text information based on that piece of text information; calculating the attention weight corresponding to each piece of text information according to its text context key and the attention query vector; and performing a weighted summation over the text segment representations corresponding to the pieces of text information according to their corresponding attention weights to obtain the text content representation; wherein the calculating the similarity between the attention query vector and each piece of image information to obtain the attention weight corresponding to each piece of image information, and the performing a weighted summation over the pieces of image information according to their corresponding attention weights to obtain an image content representation, further comprises: generating an image context key and an image feature representation corresponding to each piece of image information based on that piece of image information; calculating the attention weight corresponding to each piece of image information according to its image context key and the attention query vector; and performing a weighted summation over the image feature representations corresponding to the pieces of image information according to their corresponding attention weights to obtain the image content representation; and wherein the fusing the image content representation and the text content representation to obtain a modal fusion representation further comprises: respectively determining weights corresponding to the image content representation and the text content representation; and fusing the image content representation and the text content representation according to their corresponding weights to obtain the modal fusion representation.
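Claim 2 refines the attention: each retrieved item is split into a context key (used only for scoring against the attention query vector) and a content representation (used only in the weighted sum), and the two modal representations are fused with per-modality weights. A hedged sketch of that key/value pattern; the exponential scoring and the particular fusion weights are illustrative assumptions:

```python
import numpy as np

def kv_attention(attn_query, keys, values):
    """Score against context keys, then weight-sum the content values (claim 2)."""
    scores = keys @ attn_query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def fuse(image_repr, text_repr, w_image, w_text):
    """Weighted modal fusion; the weights may be fixed or determined per query."""
    return w_image * image_repr + w_text * text_repr

rng = np.random.default_rng(1)
q = rng.normal(size=4)
text_keys, text_vals = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
img_keys, img_vals = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
fused = fuse(kv_attention(q, img_keys, img_vals),
             kv_attention(q, text_keys, text_vals),
             w_image=0.4, w_text=0.6)
assert fused.shape == (4,)
```

Separating keys from values lets the model score relevance on one projection of an item while summing over a richer content projection.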
- 3. The method of claim 1, wherein the respectively performing image retrieval and text retrieval based on the query vector to obtain a plurality of pieces of image information and a plurality of pieces of text information that are strongly related to the query vector further comprises: when the query input information is text, inputting the query vector into a text encoder to obtain a multimodal query vector, and when the query input information is an image, inputting the query vector into an image encoder to obtain a multimodal query vector, wherein the text encoder and the image encoder map their inputs into a vector space of the same dimension; calculating a first similarity between the multimodal query vector and each text vector in a text library, wherein each text vector in the text library is obtained by inputting the corresponding text segment into the text encoder; determining the K text segments with the highest first similarity as the plurality of pieces of text information strongly related to the query vector; calculating a second similarity between the multimodal query vector and each image vector in an image library, wherein each image vector in the image library is obtained by inputting the corresponding image into the image encoder; and determining the M images with the highest second similarity as the plurality of pieces of image information strongly related to the query vector.
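Because both encoders map into a vector space of the same dimension, the same top-K routine can serve both libraries. A minimal sketch, assuming cosine similarity (the claim only says "similarity") and pre-computed library vectors:

```python
import numpy as np

def top_k_ids(query_vec, library, k):
    """Cosine similarity against every library vector; return the k best indices."""
    q = query_vec / np.linalg.norm(query_vec)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q
    return np.argsort(-sims)[:k].tolist()

rng = np.random.default_rng(2)
multimodal_query = rng.normal(size=16)      # output of the text or image encoder
text_library = rng.normal(size=(100, 16))   # text encoder outputs, one per segment
image_library = rng.normal(size=(40, 16))   # image encoder outputs, one per image
text_ids = top_k_ids(multimodal_query, text_library, k=5)    # K = 5
image_ids = top_k_ids(multimodal_query, image_library, k=3)  # M = 3
assert len(text_ids) == 5 and len(image_ids) == 3
```

A production system would typically replace the brute-force scan with an approximate nearest-neighbor index, but the claim does not require one.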
- 4. The method according to claim 3, characterized in that the method further comprises: extracting keywords from the query input information; respectively calculating a keyword matching score for each text segment in the text library and each image in the image library; respectively determining the K text segments and the M images with the highest keyword matching scores as a plurality of key text segments and a plurality of key images; wherein the determining the K text segments with the highest first similarity as the plurality of pieces of text information strongly related to the query vector further comprises: determining the K text segments with the highest first similarity as a plurality of related text segments; and fusing and deduplicating the related text segments with the key text segments to obtain the plurality of pieces of text information strongly related to the query vector; and wherein the determining the M images with the highest second similarity as the plurality of pieces of image information strongly related to the query vector further comprises: determining the M images with the highest second similarity as a plurality of related images; and fusing and deduplicating the related images with the key images to obtain the plurality of pieces of image information strongly related to the query vector.
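The fuse-and-deduplicate step of claim 4 amounts to merging the vector-similarity hits with the keyword hits while dropping repeats. A sketch, assuming items are identified by hashable IDs and that first occurrence wins (the claim fixes neither):

```python
def fuse_and_dedup(similarity_hits, keyword_hits):
    """Merge two hit lists, keeping the first occurrence of each item (claim 4)."""
    seen, merged = set(), []
    for item in list(similarity_hits) + list(keyword_hits):
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged

merged = fuse_and_dedup(["doc3", "doc1", "doc5"], ["doc5", "doc2"])
print(merged)  # ['doc3', 'doc1', 'doc5', 'doc2']
```

The same routine serves both modalities: text segments against key text segments, images against key images.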
- 5. The method of claim 1, wherein the respectively performing image retrieval and text retrieval based on the query vector to obtain a plurality of pieces of image information and a plurality of pieces of text information that are strongly related to the query vector further comprises: inputting the query vector into a retrieval model, and respectively performing image retrieval and text retrieval by the retrieval model based on the query vector to obtain the plurality of pieces of image information and the plurality of pieces of text information strongly related to the query vector; wherein the answer generation step further comprises: inputting the query vector, the plurality of pieces of image information and the plurality of pieces of text information into an answer generation model, and executing the answer generation step by the answer generation model; and wherein the retrieval model and the answer generation model are obtained through training by the following steps: acquiring a plurality of question-answer samples in a query sample set, wherein each question-answer sample comprises a query sample labeled with a correct answer; for each question-answer sample: encoding the query sample to obtain a query sample vector, inputting the query sample vector into a retriever, and outputting, by the retriever, a plurality of pieces of retrieval information and an association probability corresponding to each piece of retrieval information; inputting the query sample and each piece of retrieval information into a generator, and outputting, by the generator, the probability of generating the correct answer corresponding to each piece of retrieval information; multiplying the association probability corresponding to each piece of retrieval information by the corresponding probability of generating the correct answer, and summing the products to obtain a total probability corresponding to the question-answer sample; calculating a generation loss according to the total probability of each question-answer sample; and alternately optimizing the retriever and the generator based on the generation loss until the generation loss converges, and respectively determining the retriever and the generator after the last optimization as the retrieval model and the answer generation model.
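The training objective in claim 5 marginalizes over the retrieved pieces: for each piece of retrieval information, the retriever's association probability is multiplied by the generator's probability of producing the correct answer, and the products are summed into a total probability. The claim only says a loss is calculated from the totals; a common choice is a negative log-likelihood, sketched here under that assumption:

```python
import math

def total_probability(assoc_probs, answer_probs):
    """Sum over pieces of p(retrieve z | query) * p(correct answer | query, z)."""
    return sum(pz * py for pz, py in zip(assoc_probs, answer_probs))

def generation_loss(samples):
    """Mean negative log of the per-sample total probability (assumed NLL form)."""
    return -sum(math.log(total_probability(pz, py))
                for pz, py in samples) / len(samples)

# One question-answer sample with three retrieved pieces:
samples = [([0.5, 0.3, 0.2],   # retriever association probabilities
            [0.9, 0.4, 0.1])]  # generator's correct-answer probabilities
loss = generation_loss(samples)  # -log(0.45 + 0.12 + 0.02)
```

Alternating optimization would then hold one of the two models fixed while taking gradient steps on the other, repeating until the loss converges.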
- 6. The method of claim 1, wherein the obtaining query input information and encoding the query input information to obtain a query vector further comprises: acquiring query input information input by a user; determining a keyword label of the user; and encoding the keyword label together with the query input information to obtain the query vector.
- 7. The method of claim 6, wherein the query vector comprises a plurality of vector dimensions, and the respectively performing image retrieval and text retrieval based on the query vector to obtain a plurality of pieces of image information and a plurality of pieces of text information that are strongly related to the query vector further comprises: determining, from the plurality of vector dimensions, a vector dimension to be weighted based on historical query answers of the user; multiplying the vector dimension to be weighted in the query vector by a weighting coefficient to obtain a weighted query vector; and respectively performing image retrieval and text retrieval based on the weighted query vector to obtain the plurality of pieces of image information and the plurality of pieces of text information related to the query vector.
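The dimension-weighting of claim 7 is a simple element-wise scaling before retrieval. A sketch; which dimensions are selected from the user's history, and the coefficient value, are left open by the claim and chosen here only for illustration:

```python
import numpy as np

def weight_dimensions(query_vec, dims_to_weight, coeff):
    """Scale the selected vector dimensions by a weighting coefficient (claim 7)."""
    weighted = np.asarray(query_vec, dtype=float).copy()
    weighted[dims_to_weight] *= coeff
    return weighted

q = np.array([1.0, 2.0, 3.0, 4.0])
wq = weight_dimensions(q, dims_to_weight=[1, 3], coeff=1.5)
print(wq.tolist())  # [1.0, 3.0, 3.0, 6.0]
```

Retrieval then proceeds exactly as in claim 3, but with the weighted query vector in place of the original.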
- 8. The method of claim 6, wherein the respectively performing image retrieval and text retrieval based on the query vector to obtain a plurality of pieces of image information and a plurality of pieces of text information that are strongly related to the query vector further comprises: extracting a bias vector of the user according to the historical query answers of the user; adding the query vector and the bias vector to obtain a final query vector; and respectively performing image retrieval and text retrieval based on the final query vector to obtain the plurality of pieces of image information and the plurality of pieces of text information related to the query vector.
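Claim 8's personalization reduces to one vector addition before retrieval. How the bias vector is extracted from the user's historical query answers is not fixed by the claim, so the scaled mean-of-history used below is purely an illustrative assumption:

```python
import numpy as np

def final_query_vector(query_vec, history_vecs, strength=0.2):
    """Add a user bias vector to the query vector (claim 8).

    The bias here is a scaled mean of historical answer vectors; this
    extraction rule is an assumption, not the patent's method.
    """
    bias = strength * np.mean(history_vecs, axis=0)
    return np.asarray(query_vec) + bias

q = np.zeros(3)
history = np.array([[1.0, 0.0, 1.0],
                    [3.0, 0.0, 1.0]])  # vectors of past query answers
fq = final_query_vector(q, history)
print(fq.tolist())  # [0.4, 0.0, 0.2]
```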
- 9. An electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the information generation method based on a large language model and cross-modal retrieval of any one of claims 1-8.
- 10. A computer-readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the information generation method based on a large language model and cross-modal retrieval of any one of claims 1 to 8.
Description
Information generation method and device based on large language model and cross-modal retrieval

Technical Field

The embodiments of the application relate to the technical field of information, in particular to an information generation method, device, and storage medium based on a large language model and cross-modal retrieval.

Background

With the wide application of large language models (Large Language Model, abbreviated as LLM), how to make an LLM use the private knowledge inside an enterprise to accurately answer questions has become an important issue. An LLM often lacks expertise or up-to-date information in a particular field due to limitations in the range and timeliness of its training data, and applying it directly to enterprise knowledge base question answering is prone to hallucinations, i.e., giving incorrect or irrelevant answers. To solve this problem, the prior art proposes the Retrieval-Augmented Generation (abbreviated as RAG) method: before an answer is generated, related content is retrieved from an external data source such as an enterprise knowledge base and provided as a prompt for the LLM's reference, so that the LLM answers the question based on the latest and most relevant information; this reduces the phenomenon of the LLM giving irrelevant answers or fabricating content, and improves the accuracy and reliability of answers. However, when a question involves non-text knowledge, for example when the LLM must refer to a product schematic to answer, even if the product schematic is manually converted into text and incorporated into the knowledge base, the rich visual information it contains is often lost and cannot be fully utilized, so the accuracy of the answers is low.
Disclosure of Invention

In view of the above problems, the embodiments of the present application provide an information generation method, device, and storage medium based on a large language model and cross-modal retrieval, which improve the accuracy of answers. According to one aspect of the embodiments of the application, an information generation method based on a large language model and cross-modal retrieval is provided, comprising: obtaining query input information; encoding the query input information to obtain a query vector; respectively performing image retrieval and text retrieval based on the query vector to obtain a plurality of pieces of image information and a plurality of pieces of text information that are strongly related to the query vector; for the word to be generated, generating an attention query vector based on the query input information if the word to be generated is the first word, and generating an attention query vector based on the query input information and the words already generated if the word to be generated is a subsequent word; calculating the similarity between the attention query vector and each piece of text information to obtain the attention weight corresponding to each piece of text information; performing a weighted summation over the pieces of text information according to their corresponding attention weights to obtain a text content representation; calculating the similarity between the attention query vector and each piece of image information to obtain the attention weight corresponding to each piece of image information; performing a weighted summation over the pieces of image information according to their corresponding attention weights to obtain an image content representation; fusing the image content representation and the text content representation to obtain a modal fusion representation; generating the word to be generated according to the modal fusion representation; and repeating the generation step until an answer is obtained.
In an alternative implementation, the calculating the similarity between the attention query vector and each piece of text information to obtain the attention weight corresponding to each piece of text information, and the performing a weighted summation over the pieces of text information according to their corresponding attention weights to obtain a text content representation, further comprises: generating a text context key and a text segment representation corresponding to each piece of text information based on that piece of text information; calculating the attention weight corresponding to each piece of text information according to its text context key and the attention query vector; and performing a weighted summation over the text segment representations corresponding to the pieces of text information according to their corresponding attention weights to obtain the text content representation. The calculating the similarity between the attention query vector and each piece of image information to obtain the attention weight corresponding to each piece of image information, and the performing a weighted summation over each piece of