KR-20260065811-A - Faithful generation of output text for multimodal applications

Abstract

Systems and techniques are described for generating and using a unimodal or multimodal generative model that mitigates hallucinations. For example, a computing device may encode input data to generate an encoded representation of the input data. The computing device may obtain intermediate data including a plurality of partial sentences associated with the input data and, based on the intermediate data, may generate at least one complete sentence associated with the input data. The computing device may encode the at least one complete sentence to generate at least one encoded representation of the at least one complete sentence. The computing device may generate a fidelity score based on a comparison of the encoded representation of the input data and the at least one encoded representation of the at least one complete sentence. Based on the fidelity score, the computing device may re-rank the plurality of partial sentences of the intermediate data to generate re-ranked data.

Inventors

스리다르 아르빈드 크리쉬나
마흐푸즈 레하나
비제르 에릭
궈 인이

Assignees

퀄컴 인코포레이티드

Dates

Publication Date: 20260511
Application Date: 20240829
Priority Date: 20240228

Claims (20)

A device for generating output text from input data, the device comprising: one or more memories configured to store the input data; and one or more processors coupled to the one or more memories, the one or more processors being configured to: encode the input data to generate an encoded representation of the input data; obtain intermediate data including a plurality of partial sentences associated with the input data; generate, based on the intermediate data, at least one complete sentence associated with the input data; encode the at least one complete sentence to generate at least one encoded representation of the at least one complete sentence; generate a fidelity score based on a comparison of the encoded representation of the input data and the at least one encoded representation of the at least one complete sentence; and re-rank the plurality of partial sentences of the intermediate data based on the fidelity score to generate re-ranked data.
The device of claim 1, wherein the input data comprises at least one of audio data, text data, image data, or video data.
The device of claim 2, wherein the input data comprises two or more of the audio data, the text data, the image data, and the video data.
The device of claim 1, wherein the intermediate data comprises intermediate beams generated using a beam search technique.
The device of claim 1, wherein the one or more processors are configured to generate the at least one complete sentence based on the intermediate data using a greedy search technique.
The device of claim 1, wherein the one or more processors are configured to re-rank the plurality of partial sentences of the intermediate data based on the fidelity score and a model confidence to generate the re-ranked data.
The device of claim 6, wherein the one or more processors are configured to: determine a beam score based on a probability of a next word for each of the plurality of partial sentences, the model confidence, and the fidelity score; determine a cumulative probability based on the beam score; and re-rank the plurality of partial sentences of the intermediate data based on the cumulative probability.
The device of claim 7, wherein the one or more processors are configured to determine the model confidence based on entropy values and kurtosis values.
The device of claim 1, wherein the input data includes video data, and wherein the one or more processors are configured to: downsample a plurality of frames of the video data; and generate a fused representation of the video data by fusing encoded representations of the plurality of frames of the video data, wherein the encoded representation of the input data includes the fused representation of the video data.
The device of claim 1, wherein the input data includes at least a first type of input data and a second type of input data; wherein, to encode the input data to generate the encoded representation of the input data, the one or more processors are configured to: encode the first type of input data to generate an encoded representation of the first type of input data; and encode the second type of input data to generate an encoded representation of the second type of input data; and wherein the one or more processors are further configured to generate a combined representation of the first type of input data and the second type of input data based on the encoded representation of the first type of input data and the encoded representation of the second type of input data.
The device of claim 10, wherein, to generate the combined representation of the first type of input data and the second type of input data, the one or more processors are configured to determine a weighted average of the encoded representation of the first type of input data and the encoded representation of the second type of input data.
The device of claim 10, wherein the one or more processors are configured to normalize the combined representation of the first type of input data and the second type of input data.
The device of claim 10, wherein the first type of input data and the second type of input data include two or more of audio data, text data, image data, and video data.
The device of claim 10, wherein the one or more processors are configured to generate the fidelity score based on a comparison of the combined representation and the at least one encoded representation of the at least one complete sentence.
The device of claim 1, wherein the one or more processors are configured to generate output text associated with the input data based on the re-ranked data.
The device of claim 1, further comprising at least one of an image sensor or a microphone configured to capture at least a portion of the input data.
The device of claim 1, wherein the one or more processors are configured to generate the intermediate data using at least one neural network model.
The device of claim 17, wherein the at least one neural network model includes a transformer neural network model.
A method for generating output text from input data, the method comprising: encoding the input data to generate an encoded representation of the input data; obtaining intermediate data including a plurality of partial sentences associated with the input data; generating, based on the intermediate data, at least one complete sentence associated with the input data; encoding the at least one complete sentence to generate at least one encoded representation of the at least one complete sentence; generating, using a fidelity guidance engine, a fidelity score based on a comparison of the encoded representation of the input data and the at least one encoded representation of the at least one complete sentence; and re-ranking the plurality of partial sentences of the intermediate data based on the fidelity score to generate re-ranked data.
The method of claim 19, wherein the intermediate data includes intermediate beams generated using a beam search technique.
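
The claims above recite a beam score that depends on the next-word probability, a model confidence derived from entropy and kurtosis values, and the fidelity score, but this text gives no formulas. The following Python sketch is therefore only one hypothetical reading: the names `model_confidence`, `beam_score`, and `rerank`, the equal weighting of the entropy and kurtosis terms, the confidence-weighted interpolation, and the use of log-domain scores are all assumptions made for illustration, not the claimed implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-word distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kurtosis(probs: Sequence[float]) -> float:
    """Pearson kurtosis of the probability values; 0.0 if they are all equal."""
    n = len(probs)
    mean = sum(probs) / n
    var = sum((p - mean) ** 2 for p in probs) / n
    if var < 1e-12:
        return 0.0
    fourth = sum((p - mean) ** 4 for p in probs) / n
    return fourth / (var * var)


def model_confidence(probs: Sequence[float]) -> float:
    """ASSUMPTION: average of a normalized-entropy term and a squashed kurtosis term."""
    n = len(probs)
    if n < 2:
        return 1.0
    peakedness = max(0.0, 1.0 - entropy(probs) / math.log(n))
    k = kurtosis(probs)
    return 0.5 * peakedness + 0.5 * k / (1.0 + k)


@dataclass
class Candidate:
    text: str
    cumulative: float      # cumulative log-score before the newest word
    next_word_prob: float  # model probability of the newest word
    fidelity: float        # fidelity score, assumed to lie in (0, 1]


def beam_score(next_word_prob: float, confidence: float, fidelity: float,
               eps: float = 1e-9) -> float:
    """ASSUMPTION: trust the model when it is confident, the fidelity score otherwise."""
    return (confidence * math.log(max(next_word_prob, eps))
            + (1.0 - confidence) * math.log(max(fidelity, eps)))


def rerank(candidates: Sequence[Candidate],
           next_word_probs: Sequence[float]) -> List[Candidate]:
    """Add each beam score to the cumulative log-score, then sort best first."""
    confidence = model_confidence(next_word_probs)
    rescored = [
        Candidate(c.text,
                  c.cumulative + beam_score(c.next_word_prob, confidence, c.fidelity),
                  c.next_word_prob,
                  c.fidelity)
        for c in candidates
    ]
    return sorted(rescored, key=lambda c: c.cumulative, reverse=True)


# Toy numbers (hypothetical): the model slightly prefers "cuts", but the
# fidelity score favors "speaks".
next_word_distribution = [0.5, 0.4, 0.1]
beams = [
    Candidate("A person walks through the leaves and cuts", 0.0, 0.5, 0.3),
    Candidate("A person walks through the leaves and speaks", 0.0, 0.4, 0.9),
]
ranked = rerank(beams, next_word_distribution)
```

With these toy numbers the next-word distribution is fairly flat, so the assumed confidence is low and the fidelity term dominates: the partial sentence ending in "speaks" moves ahead of the one ending in "cuts", even though the model assigned "cuts" the higher next-word probability.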

Description

Faithful generation of output text for multimodal applications
The present disclosure generally relates to generative models. For example, aspects of the present disclosure relate to systems and techniques for generating and using a unimodal or multimodal generative model that mitigates hallucination, that is, cases in which a generative model treats a false fact associated with the input data as true and generates text based on that false fact.
Machine learning models (e.g., deep learning models such as neural networks) can be used to perform a variety of tasks, including, among other tasks, depth estimation, detection and/or recognition (e.g., scene or object detection and/or recognition, speech recognition), pose estimation, image reconstruction, classification, three-dimensional (3D) modeling, dense regression tasks, data compression and/or decompression, and image processing. Machine learning models can be versatile and can achieve high-quality results in various tasks.
However, multimodal generative models tend to generate output text that is not faithful to the input context. In audio captioning, for instance, a generative machine learning model receives audio as input and generates a relevant caption word by word. For example, an audio signal might include the sound of a person walking on fallen leaves, then walking slowly on a sidewalk and speaking. A multimodal generative model might generate a caption for this audio that contains a hallucination, such as: "A person walks through the leaves, stops, and cuts a bush." The hallucination is the part of the caption in which the person cuts a bush, because that action is not actually represented in the audio signal.
Systems and techniques for generating output text based on input content that may be unimodal or multimodal are described herein.
For example, the systems and techniques may use a multimodal faithful decoder as guidance to mitigate hallucinations in captions generated from multimodal or unimodal inputs.
According to some embodiments, an apparatus for generating output text from input data is provided. The apparatus comprises one or more memories configured to store the input data and one or more processors coupled to the one or more memories, wherein the one or more processors are configured to: encode the input data to generate an encoded representation of the input data; obtain intermediate data comprising a plurality of partial sentences associated with the input data; generate at least one complete sentence associated with the input data based on the intermediate data; encode the at least one complete sentence to generate at least one encoded representation of the at least one complete sentence; generate a fidelity score based on a comparison of the encoded representation of the input data and the at least one encoded representation of the at least one complete sentence; and re-rank the plurality of partial sentences of the intermediate data based on the fidelity score to generate re-ranked data.
In some embodiments, a method for generating output text from input data is provided. The method comprises: encoding the input data to generate an encoded representation of the input data; obtaining intermediate data including a plurality of partial sentences associated with the input data; generating at least one complete sentence associated with the input data based on the intermediate data; encoding the at least one complete sentence to generate at least one encoded representation of the at least one complete sentence; generating, through a fidelity guidance engine, a fidelity score based on a comparison of the encoded representation of the input data and the at least one encoded representation of the at least one complete sentence; and re-ranking the plurality of partial sentences of the intermediate data based on the fidelity score to generate re-ranked data.
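
The encode, complete, compare, and re-rank steps summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the comparison is taken to be cosine similarity in a shared embedding space, which this text does not specify, and `toy_encode_text`, `toy_complete`, the keyword vocabulary, and the embedding values are hypothetical stand-ins for real encoders and for greedy completion of each partial sentence. The weighted average and normalization in `fuse` follow the combined representation described in the claims.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)


def fuse(embeddings: Sequence[Vector], weights: Sequence[float]) -> List[float]:
    """Weighted average of per-modality embeddings, followed by normalization."""
    total = sum(weights)
    dim = len(embeddings[0])
    averaged = [
        sum(w * e[i] for e, w in zip(embeddings, weights)) / total
        for i in range(dim)
    ]
    return l2_normalize(averaged)


def cosine(a: Vector, b: Vector) -> float:
    """ASSUMPTION: the fidelity comparison is cosine similarity."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def rerank_by_fidelity(
    input_embedding: Vector,
    partial_sentences: Sequence[str],
    complete: Callable[[str], str],
    encode_text: Callable[[str], Vector],
) -> List[Tuple[str, float]]:
    """Complete each partial sentence, score it against the input, sort best first."""
    scored = []
    for partial in partial_sentences:
        sentence = complete(partial)
        scored.append((partial, cosine(input_embedding, encode_text(sentence))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# --- Hypothetical toy stand-ins for the encoders and the greedy completion ---
KEYWORDS = ("walk", "leaves", "speak", "cut")

def toy_encode_text(sentence: str) -> List[float]:
    lowered = sentence.lower()
    return [float(keyword in lowered) for keyword in KEYWORDS]

COMPLETIONS = {
    "A person walks through the leaves, stops, and":
        "A person walks through the leaves, stops, and cuts a bush.",
    "A person walks through the leaves, then":
        "A person walks through the leaves, then speaks.",
}

def toy_complete(partial: str) -> str:
    return COMPLETIONS[partial]

# Toy audio and video embeddings: walking, leaves, and speech are present,
# cutting is not.
audio_embedding = [1.0, 1.0, 0.8, 0.0]
video_embedding = [1.0, 0.6, 1.0, 0.0]
fused_input = fuse([audio_embedding, video_embedding], [0.5, 0.5])

ranked = rerank_by_fidelity(fused_input, list(COMPLETIONS), toy_complete, toy_encode_text)
```

In this toy setup the partial sentence whose completion mentions cutting a bush scores lower, because nothing in the fused input embedding corresponds to cutting, so it is ranked below the partial sentence whose completion mentions speaking.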
In some embodiments, a non-transitory computer-readable medium storing instructions is provided. When executed by one or more processors, the instructions cause the one or more processors to: encode input data to generate an encoded representation of the input data; obtain intermediate data comprising a plurality of partial sentences associated with the input data; generate at least one complete sentence associated with the input data based on the intermediate data; encode the at least one complete sentence to generate at least one encoded representation of the at least one complete sentence; generate a fidelity score based on a comparison of the encoded representation of the input data and the at least one encoded representation of the at least one complete sentence; and re-rank the plurality of partial sentences of the intermediate data based on the fidelity score to generate re-ranked data.
In some embodiments, an apparatus is provided that comprises: means for encoding input data to generate an encoded representation of the input data; means for obtaining intermediate data comprising a plurality of partial sen