KR-20260062667-A - Method And Device for Video Description Generation based on Video-Language Model
Abstract
A method and apparatus for generating video descriptions based on a video-language model are disclosed. According to one aspect of the present disclosure, a computer-implemented method for generating a description of a video comprises: identifying, from an input video, a target video clip for which a description is to be generated and a target audio clip included in the target video clip; inputting the target audio clip into a speech recognition model to obtain an audio-guided representation including an acoustic feature vector extracted from the target audio clip or a target speech transcription generated from the acoustic feature vector; generating an augmented prompt based on the audio-guided representation; and inputting the augmented prompt into a video-language model to generate a video description of the target video clip.
Inventors
- 마춘페이
- 최준향
- 박정환
- 이병원
Assignees
- 에스케이텔레콤 주식회사
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-10-29
Claims (11)
- A computer-implemented method for generating a description of a video, the method comprising: identifying, from an input video, a target video clip for which a description is to be generated and a target audio clip included in the target video clip; inputting the target audio clip into a speech recognition model to obtain an audio-guided representation including one or more acoustic feature vectors extracted from the target audio clip or a target speech transcription generated from the one or more acoustic feature vectors; generating an augmented prompt based on the audio-guided representation; and inputting the augmented prompt into a video-language model to generate a video description of the target video clip.
- The computer-implemented method of claim 1, wherein generating the augmented prompt comprises generating an augmented text prompt by inserting the target speech transcription into a text prompt that directs generation of the video description.
- In paragraph 2, The process further includes identifying one or more reference audio clips that are temporally adjacent to the target audio clip, and A computer-implemented method wherein the augmented text prompt comprises one or more reference speech transcriptions corresponding to each of the one or more reference audio clips.
- In paragraph 3, A computer implementation method wherein the augmented text prompt further comprises confidence indicating the degree to which each of the one or more reference audio clips is associated with the target video clip.
- In paragraph 4, The above reliability for each reference voice transcription is, A computer-implemented method determined based on at least one of the time interval between each reference audio clip and the target audio clip, the similarity between the text feature vector of each reference speech transcription and the text feature vector of the target speech transcription, and the similarity between the visual feature vector of a reference video clip including each reference audio clip and the visual feature vector of the target video clip.
- The computer-implemented method of claim 1, wherein generating the augmented prompt comprises generating one or more soft prompt embeddings based on the one or more acoustic feature vectors, and wherein the one or more soft prompt embeddings are concatenated with one or more text prompt embeddings encoded from a text prompt directing generation of the video description, one or more visual prompt embeddings encoded from the target video clip, and one or more audio prompt embeddings encoded from the target audio clip, and provided to a backbone network of the video-language model.
- The computer-implemented method of claim 6, further comprising identifying one or more reference audio clips that are temporally adjacent to the target audio clip, wherein generating the augmented prompt further comprises generating the one or more soft prompt embeddings based on reference acoustic feature vectors extracted from each of the one or more reference audio clips.
- The computer-implemented method of claim 7, wherein generating the augmented prompt further comprises generating an augmented text prompt by inserting, into the text prompt, a confidence indicating a degree to which each of the one or more reference audio clips is related to the target video clip.
- The computer-implemented method of claim 8, wherein the confidence for each reference acoustic feature vector is determined based on at least one of: a time interval between the corresponding reference audio clip and the target audio clip; a similarity between the reference acoustic feature vector and the acoustic feature vector of the target audio clip; and a similarity between a visual feature vector of a reference video clip including the corresponding reference audio clip and a visual feature vector of the target video clip.
- An apparatus comprising: a memory configured to store instructions; and at least one processor, wherein the at least one processor executes the instructions to: identify, from an input video, a target video clip for which a description is to be generated and a target audio clip included in the target video clip; input the target audio clip into a speech recognition model to obtain an audio-guided representation including an acoustic feature vector extracted from the target audio clip or a target speech transcription generated from the acoustic feature vector; generate an augmented prompt based on the audio-guided representation; and input the augmented prompt into a video-language model to generate a video description of the target video clip.
- A computer program stored on a computer-readable recording medium to execute the processes included in the method according to any one of claims 1 through 9.
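The pipeline recited in claim 1 can be summarized in the following minimal sketch. It is illustrative only and not part of the claims; `asr_model` and `vl_model` are hypothetical stand-ins for any speech recognition model and any video-language model, and the prompt wording is an assumption.

```python
# Illustrative sketch of the claimed pipeline (claim 1). All names are
# hypothetical placeholders, not APIs defined by the patent.
from dataclasses import dataclass


@dataclass
class Clip:
    video_frames: list    # decoded frames of the target video clip
    audio_waveform: list  # audio samples of the target audio clip
    start: float          # start time (seconds) within the input video
    end: float            # end time (seconds) within the input video


def generate_description(clip: Clip, asr_model, vl_model) -> str:
    # 1) Obtain the audio-guided representation: acoustic feature vectors
    #    and/or a speech transcription generated from them.
    acoustic_features = asr_model.extract_features(clip.audio_waveform)
    transcription = asr_model.decode(acoustic_features)

    # 2) Generate an augmented prompt by inserting the transcription into
    #    a text prompt that directs the description task.
    prompt = (
        'Describe this video clip. '
        f'The speech heard in the clip is: "{transcription}"'
    )

    # 3) Input the augmented prompt and the visual content into the
    #    video-language model to obtain the video description.
    return vl_model.generate(frames=clip.video_frames, prompt=prompt)
```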
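Claims 2 through 5 describe augmenting the text prompt with the target transcription, transcriptions of temporally adjacent reference clips, and a confidence score derived from temporal distance and feature similarities. The sketch below shows one possible way to combine those signals; the equal weighting and the exponential time decay are assumptions, not taken from the patent.

```python
import math


def reference_confidence(dt: float,
                         text_sim: float,
                         visual_sim: float,
                         tau: float = 30.0) -> float:
    """Hypothetical confidence for a reference clip (claims 4-5): combines
    the time gap to the target clip, the similarity of transcription text
    features, and the similarity of visual features. The averaging and the
    decay constant are illustrative choices."""
    temporal = math.exp(-abs(dt) / tau)  # closer in time -> higher score
    return (temporal + text_sim + visual_sim) / 3.0


def build_augmented_text_prompt(target_transcription: str,
                                references: list[tuple[str, float]]) -> str:
    """Insert the target transcription and confidence-annotated reference
    transcriptions into a text prompt (claims 2-4)."""
    lines = ['Describe this video clip.',
             f'Speech in the clip: "{target_transcription}"']
    for ref_text, conf in references:
        lines.append(f'Nearby speech (confidence {conf:.2f}): "{ref_text}"')
    return "\n".join(lines)
```

Under these assumptions, a reference clip that is close in time and similar in both transcription and visual content receives a high confidence, which the prompt then exposes to the video-language model alongside its transcription.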
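Claims 6 and 7 describe the soft-prompt path, in which acoustic feature vectors are mapped to soft prompt embeddings and concatenated with the text, visual, and audio prompt embeddings before being provided to the backbone network. A minimal PyTorch sketch follows; the linear projection and the embedding dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SoftPromptAugmenter(nn.Module):
    """Illustrative module for claims 6-7: projects acoustic feature vectors
    to soft prompt embeddings and concatenates them with the text, visual,
    and audio prompt embeddings expected by the backbone network."""

    def __init__(self, acoustic_dim: int = 512, model_dim: int = 1024):
        super().__init__()
        # Learnable projection from the acoustic feature space into the
        # backbone's embedding space, producing the soft prompt embeddings.
        self.proj = nn.Linear(acoustic_dim, model_dim)

    def forward(self,
                text_emb: torch.Tensor,       # (n_text, model_dim)
                visual_emb: torch.Tensor,     # (n_frames, model_dim)
                audio_emb: torch.Tensor,      # (n_audio, model_dim)
                acoustic_feats: torch.Tensor  # (n_acoustic, acoustic_dim)
                ) -> torch.Tensor:
        soft_prompts = self.proj(acoustic_feats)  # (n_acoustic, model_dim)
        # Concatenate along the sequence dimension; the result is what would
        # be handed to the backbone network of the video-language model.
        return torch.cat([text_emb, visual_emb, audio_emb, soft_prompts],
                         dim=0)
```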
Description
Method and Device for Video Description Generation based on Video-Language Model

The present disclosure relates to a method and apparatus for generating video descriptions based on a video-language model.

The following description merely provides background information related to the present embodiment and does not constitute prior art. Over the past few years, as Artificial Intelligence (AI) technology, particularly multi-modality AI, has advanced rapidly, its importance in the field of video description generation has increased significantly. Multi-modality AI can integrally process not only visual video data but also data of other modalities, such as audio and text. Compared to traditional methods that rely on a single modality, this approach enables more accurate and in-depth analysis and therefore has the advantage of generating more meaningful and comprehensive video descriptions. The technology is drawing attention for its ability to efficiently process vast amounts of video content and concisely convey key information to viewers.

Conventional video description generation relies on multiple AI models that process each modality independently. Typically, a video is divided into clips, and the visual and auditory information of each clip is processed independently by separate models. This approach has the limitation that it fails to adequately consider the interactions between the various modalities within the video. In particular, processing visual and auditory information separately increases the likelihood of missing semantic interactions occurring in important scenes or dialogue. For instance, if a specific scene in a video contains visually significant elements but the corresponding audio data is not considered alongside it, the generated description may fail to fully capture the meaning of that scene.

Meanwhile, video processing models can generally generate accurate descriptions only for short video clips of approximately 15 seconds. If the input video is too long or the model's capacity is limited, long-term information within the input video may not be properly captured, potentially leading to hallucinations, i.e., plausible descriptions that differ from the actual content of the video. One way to prevent this is to divide the video into short intervals and process it clip by clip; however, this disrupts the flow of continuous scenes, makes it difficult to understand the contextual relationships between clips, and risks losing important contextual information.

FIG. 1 is a block diagram schematically showing a video description device according to one embodiment of the present disclosure. FIG. 2 is an illustrative diagram referenced to explain the operation of a preprocessing module according to one embodiment of the present disclosure. FIG. 3 is an exemplary diagram schematically illustrating the network architecture of a video-language model according to one embodiment of the present disclosure. FIGS. 4a to 4c are illustrative diagrams showing various examples of augmenting a video-language model's prompt based on speech transcription according to one embodiment of the present disclosure. FIGS. 5a to 5c are illustrative diagrams showing various examples of augmenting a video-language model's prompt based on acoustic feature vectors according to one embodiment of the present disclosure. FIG. 6 is a flowchart illustrating a method for generating a video description according to one embodiment of the present disclosure. FIG. 7 is a schematic block diagram of an exemplary computing device that can be used to implement the devices and methods described in the present disclosure.

Some embodiments of the present disclosure are described in detail below with reference to exemplary drawings. In assigning reference numerals to the components of each drawing, the same components are given the same reference numerals whenever possible, even if they appear in different drawings. Furthermore, in describing the present disclosure, detailed descriptions of related known components or functions are omitted where they could obscure the essence of the present disclosure. In describing the components of the embodiments according to the present disclosure, designations such as first, second, i), ii), a), and b) may be used. These designations are intended only to distinguish one component from another, and the nature, order, or sequence of the components is not limited by them. When a part of the specification is described as 'comprising' or 'having' a component, this means that, unless explicitly stated otherwise, it may include additional components rather than excluding other components. The detailed description set forth below, together with the accompanying drawings, is intended to describe exemplary embodiments of the present disclosure.