KR-20260062503-A - Method and Device for Video Description Generation Using Video Sampling
Abstract
A method and apparatus for generating a video description using video sampling are disclosed. According to one aspect of the present disclosure, a computer-implemented method for generating a description of a video comprises: selecting one or more key frames from the video based on at least one of a pixel-level change between a plurality of frames included in the video and visual content included in each of the plurality of frames; grouping the selected key frames into one or more segments based on a change in the visual content between adjacent key frames; extracting, for each of the one or more segments, one or more representative frames representing the segment; and generating a video description based on the extracted one or more representative frames.
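The key-frame selection step described above can be illustrated with a color-histogram difference, one of the pixel-level change signals listed in the claims. The sketch below is a minimal illustration under assumptions of our own, not the patented implementation: the L1 histogram distance and the threshold value are example choices.

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Per-channel intensity histogram of an HxWxC uint8 frame,
    normalized to sum to 1."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(frame.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def select_key_frames(frames, threshold=0.2):
    """Keep a frame as a key frame whenever its histogram differs from
    the last kept frame by more than `threshold` (L1 distance).
    The threshold is an illustrative assumption."""
    keys = [0]  # the first frame is always kept
    ref = color_histogram(frames[0])
    for i in range(1, len(frames)):
        h = color_histogram(frames[i])
        if np.abs(h - ref).sum() > threshold:
            keys.append(i)
            ref = h
    return keys
```

The same loop structure accommodates the other signals named in claim 2 (pixel-wise difference, edge change, optical flow, brightness change, keypoint matching) by swapping the distance function.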
Inventors
- 마춘페이
- 최준향
- 박정환
- 이병원
Assignees
- SK Telecom Co., Ltd.
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2024-10-29
Claims (9)
- A computer-implemented method for generating a description of a video, comprising: selecting one or more key frames from the video based on at least one of a pixel-level change between a plurality of frames included in the video and visual content included in each of the plurality of frames; grouping the selected key frames into one or more segments based on a change in the visual content between adjacent key frames; extracting, for each of the one or more segments, one or more representative frames representing the segment; and generating a video description based on the extracted one or more representative frames.
- The method of claim 1, wherein the pixel-level change is analyzed based on at least one of a color histogram change, a pixel-wise difference, an edge change, optical flow, a brightness change, and keypoint matching.
- The method of claim 1, wherein the visual content is identified based on a processing result of a perception task performed on each frame, and the perception task includes at least one of an object detection task, a segmentation task, and a classification task.
- The method of claim 1, wherein the grouping comprises assigning adjacent key frames to the same segment or to different segments based on whether there is a change in the set of semantic labels extracted from each of the adjacent key frames.
- The method of claim 1, wherein the grouping comprises assigning adjacent key frames to the same segment or to different segments based on whether there is a change in the number of objects (instance count) detected in each of the adjacent key frames.
- The method of claim 1, wherein the grouping comprises assigning adjacent key frames to the same segment or to different segments based on the amount of change in the size of an object detected in each of the adjacent key frames or in the size of the region surrounding the object.
- The method of claim 1, further comprising, prior to the grouping, excluding from the key frames any frame in which the area where motion blur occurs is greater than or equal to a preset threshold.
- A device comprising: a memory storing instructions; and at least one processor, wherein the at least one processor, by executing the instructions, selects one or more key frames from a video based on at least one of a pixel-level change between a plurality of frames included in the video and visual content included in each of the plurality of frames, groups the selected key frames into one or more segments based on a change in the visual content between adjacent key frames, extracts, for each of the one or more segments, one or more representative frames representing the segment, and generates a video description based on the extracted one or more representative frames.
- A computer program stored on a computer-readable recording medium for executing the processes included in the method according to any one of claims 1 to 7.
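The grouping and blur-filtering steps claimed above admit a similarly small sketch. Below, adjacent key frames are grouped by whether their sets of semantic labels change (as in claim 4), and a frame's sharpness is scored with the variance of a 4-neighbour Laplacian, a common blur heuristic used here as a stand-in for whatever blur measure the disclosure actually employs (claim 7). Function names and the structure of the inputs are illustrative assumptions.

```python
import numpy as np

def variance_of_laplacian(gray):
    """Sharpness score of a 2-D grayscale array: variance of a
    4-neighbour Laplacian. Low values suggest motion blur, so a frame
    scoring below a preset threshold could be excluded (claim 7)."""
    g = gray.astype(float)
    lap = (-4 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return lap.var()

def group_key_frames(label_sets):
    """Group consecutive key frames into segments: adjacent key frames
    stay in the same segment while their sets of semantic labels are
    unchanged, and a new segment starts when the set changes.
    `label_sets` is a non-empty list of per-key-frame label sets."""
    segments, current = [], [0]
    for i in range(1, len(label_sets)):
        if label_sets[i] == label_sets[i - 1]:
            current.append(i)
        else:
            segments.append(current)
            current = [i]
    segments.append(current)
    return segments
```

The same comparison slot can instead test the instance count (claim 5) or the change in object or bounding-region size (claim 6) without altering the segmentation loop.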
Description
The present disclosure relates to a method and apparatus for generating a video description using video sampling. The following description merely provides background information related to the present embodiment and does not constitute prior art.

Video data is a complex form of media in which various kinds of information, such as images, sound, and motion, are compressed, and it presents significant volume and complexity in data processing and analysis. Technologies for efficiently analyzing and summarizing such video data are becoming an increasingly important field of research and development.

Existing video description systems divide a video into multiple clips and generate a text description for each clip. Such systems aim to summarize the key scenes in a video and automatically describe them in text form. However, because of their structural characteristic of having to process every single frame of a video, existing systems require massive computational resources. Since the entire video is processed frame by frame and detailed analysis is performed on each clip, processing time and the consumption of computational resources increase exponentially as the video becomes longer or more complex. In particular, this approach can pose significant difficulties for practical application when processing video in real time or analyzing large-scale video datasets.

Furthermore, existing methods treat all clips identically, so even unnecessary information within the video must be analyzed. This wastes processing time and computational resources and places a heavy burden on system performance, especially in real-time processing or when handling large-scale video datasets. Therefore, a new technical approach is required for the efficient analysis and summarization of video data.

FIG. 1 is a block diagram schematically showing a video description device according to one embodiment of the present disclosure. FIG. 2 is an illustrative diagram referenced to explain the operation of a video description device according to one embodiment of the present disclosure. FIGS. 3a and 3b are illustrative diagrams referenced to explain the operation of selecting key frames of a video according to various embodiments of the present disclosure. FIG. 4 is a flowchart illustrating a motion-blur removal operation according to one embodiment of the present disclosure. FIGS. 5a to 5c are illustrative diagrams referenced to explain the operation of grouping frames based on visual content according to various embodiments of the present disclosure. FIG. 6 is a flowchart illustrating a method for generating a video description according to one embodiment of the present disclosure. FIG. 7 is a schematic block diagram of an exemplary computing device that can be used to implement the devices and methods described in the present disclosure.

Some embodiments of the present disclosure are described in detail below with reference to exemplary drawings. It should be noted that, in assigning reference numerals to the components of each drawing, the same components are given the same reference numerals whenever possible, even when they are shown in different drawings. Furthermore, in describing the present disclosure, if it is determined that a detailed description of related known components or functions could obscure the essence of the present disclosure, such detailed description is omitted. In describing the components of the embodiments according to the present disclosure, symbols such as first, second, i), ii), a), b), etc., may be used. These symbols are intended only to distinguish one component from another, and the essence, order, or sequence of the components is not limited by them.
When a part of the specification is described as 'comprising' or 'having' a component, this means that, unless explicitly stated otherwise, other components are not excluded and may additionally be included. The detailed description set forth below, together with the accompanying drawings, is intended to describe exemplary embodiments of the present disclosure and is not intended to represent the only embodiments in which the present disclosure can be practiced.

FIG. 1 is a block diagram schematically illustrating a video description device according to one embodiment of the present disclosure. FIG. 2 is an illustrative diagram referenced to explain the operation of a video description device according to one embodiment of the present disclosure. A video description device (10) according to one embodiment of the present disclosure may include all or part of a sampling module (100), a grouping module (120), and a video interpretation module (140). Not all blocks illustrated in FIG. 1 are essential components; in other embodiments, some blocks may be added, changed, or deleted. Meanwhile, the components illustrated in FIG. 1