KR-102964737-B1 - VIDEO SUMMARIZATION METHOD AND DEVICE BASED ON MULTI-SCALE FEATURE

KR 102964737 B1

Abstract

The present invention relates to a multi-scale feature-based video summarization method. The method comprises the steps of: receiving a video including a plurality of frames; providing the plurality of frames to a backbone network to generate a first feature map, a second feature map, a third feature map, and a fourth feature map corresponding to each frame included in the plurality of frames; providing the generated second, third, and fourth feature maps to a refinement block to generate a refined second feature map, a refined third feature map, and a refined fourth feature map; performing feature fusion based on the first feature map, the refined second feature map, the refined third feature map, and the refined fourth feature map to generate final features; providing the generated final features to a regression block to calculate an importance score corresponding to each frame; and performing a video summary of the received video using the calculated importance score corresponding to each frame.
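The fusion cascade described above (coarse-to-fine fusion of the four feature maps, then flattening) can be sketched in a few lines of NumPy. Everything concrete here is an assumption for illustration: the channel count, the spatial sizes, and the fusion operator (2x upsampling plus addition) are not specified at this point in the document.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x spatial upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(coarse, fine):
    # Illustrative fusion: upsample the coarser map and add it to the finer one.
    return upsample2x(coarse) + fine

# Four per-frame feature maps at decreasing spatial resolution (assumed shapes).
f1 = np.random.rand(8, 32, 32)   # first (finest) feature map
f2 = np.random.rand(8, 16, 16)   # refined second feature map
f3 = np.random.rand(8, 8, 8)     # refined third feature map
f4 = np.random.rand(8, 4, 4)     # refined fourth feature map

fusion1 = fuse(f4, f3)        # first fusion value:  refined f4 with refined f3
fusion2 = fuse(fusion1, f2)   # second fusion value: fusion1 with refined f2
fusion3 = fuse(fusion2, f1)   # third fusion value:  fusion2 with f1
# Final features: flatten the third fusion value together with f4.
final = np.concatenate([fusion3.ravel(), f4.ravel()])
print(final.shape)
```

With these assumed shapes the third fusion value is (8, 32, 32), so flattening it together with the fourth feature map yields a final feature vector of length 8·32·32 + 8·4·4 = 8320.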

Inventors

  • 백성욱
  • 이미영
  • 하비브 칸
  • 사미 울라 칸
  • 줄피카 아마드 칸
  • 윤상일

Assignees

  • 세종대학교산학협력단 (Industry-Academia Cooperation Foundation of Sejong University)

Dates

Publication Date
2026-05-13
Application Date
2023-07-04

Claims (20)

  1. A multi-scale feature-based video summarization method performed by at least one processor, comprising:
     receiving a video including a plurality of frames;
     providing the plurality of frames to a backbone network to generate a first feature map, a second feature map, a third feature map, and a fourth feature map corresponding to each frame included in the plurality of frames;
     providing the generated second, third, and fourth feature maps to a refinement block to generate a refined second feature map, a refined third feature map, and a refined fourth feature map;
     generating final features by performing feature fusion based on the first feature map, the refined second feature map, the refined third feature map, and the refined fourth feature map;
     providing the generated final features to a regression block to calculate an importance score corresponding to each frame; and
     performing a video summary of the received video using the calculated importance score corresponding to each frame,
     wherein generating the first, second, third, and fourth feature maps comprises:
     providing the plurality of frames to a first stage of the backbone network to generate the first feature map corresponding to each frame;
     providing the first feature map to a second stage of the backbone network to generate the second feature map;
     providing the first feature map and the second feature map to a third stage of the backbone network to generate the third feature map; and
     providing the first feature map, the second feature map, and the third feature map to a fourth stage of the backbone network to generate the fourth feature map,
     and wherein generating the final features by performing the feature fusion comprises:
     performing feature fusion on the refined third feature map and the refined fourth feature map to generate a first fusion value;
     performing feature fusion on the first fusion value and the refined second feature map to generate a second fusion value;
     performing feature fusion on the second fusion value and the first feature map to generate a third fusion value; and
     flattening the third fusion value and the fourth feature map to generate the final features.
  2. The method of claim 1, wherein the backbone network includes a ViT (vision transformer)-based backbone network.
  3. (canceled)
  4. The method of claim 1, wherein the refinement block includes a pyramidal refinement block.
  5. The method of claim 1, wherein generating the refined second, third, and fourth feature maps comprises: providing the second feature map to three DASPP (dense atrous spatial pyramid pooling) modules and one MHSA (multi-headed self-attention) module of a second stage of the refinement block to generate the refined second feature map; providing the third feature map to two DASPP modules and one MHSA module of a third stage of the refinement block to generate the refined third feature map; and providing the fourth feature map to one DASPP module and one MHSA module of a fourth stage of the refinement block to generate the refined fourth feature map.
  6. (canceled)
  7. The method of claim 1, wherein the regression block comprises a plurality of layers for convolution, feature normalization, and deconvolution.
  8. The method of claim 1, wherein performing the video summary of the received video using the calculated importance score corresponding to each frame comprises: calculating a summary score corresponding to each shot constituting the video using the importance score corresponding to each frame; and performing the video summary of the video based on the summary score corresponding to each shot.
  9. The method of claim 8, wherein calculating the summary score corresponding to each shot comprises calculating the summary score corresponding to each shot based on the average of the importance scores of the frames included in that shot.
  10. The method of claim 8, wherein performing the video summary comprises extracting at least some shots among the plurality of shots constituting the video as keyshots such that the sum of the summary scores is maximized while the summary has a length within a predetermined ratio.
  11. A computer program, stored on a computer-readable recording medium, for executing the method of any one of claims 1, 2, 4, 5, and 7 through 10 on a computer.
  12. A computing device comprising: a communication module; a memory; and at least one processor connected to the memory and configured to execute at least one computer-readable program contained in the memory, wherein the at least one program comprises instructions for:
     receiving a video including a plurality of frames;
     providing the plurality of frames to a backbone network to generate a first feature map, a second feature map, a third feature map, and a fourth feature map corresponding to each frame included in the plurality of frames;
     providing the generated second, third, and fourth feature maps to a refinement block to generate a refined second feature map, a refined third feature map, and a refined fourth feature map;
     generating final features by performing feature fusion based on the first feature map, the refined second feature map, the refined third feature map, and the refined fourth feature map;
     providing the generated final features to a regression block to calculate an importance score corresponding to each frame; and
     performing a video summary of the received video using the calculated importance score corresponding to each frame,
     wherein the instructions further comprise instructions for:
     providing the plurality of frames to a first stage of the backbone network to generate the first feature map corresponding to each frame;
     providing the first feature map to a second stage of the backbone network to generate the second feature map;
     providing the first feature map and the second feature map to a third stage of the backbone network to generate the third feature map;
     providing the first feature map, the second feature map, and the third feature map to a fourth stage of the backbone network to generate the fourth feature map;
     performing feature fusion on the refined third feature map and the refined fourth feature map to generate a first fusion value;
     performing feature fusion on the first fusion value and the refined second feature map to generate a second fusion value;
     performing feature fusion on the second fusion value and the first feature map to generate a third fusion value; and
     flattening the third fusion value and the fourth feature map to generate the final features.
  13. The computing device of claim 12, wherein the backbone network includes a ViT-based backbone network.
  14. (canceled)
  15. The computing device of claim 12, wherein the refinement block includes a pyramidal refinement block.
  16. The computing device of claim 12, wherein the at least one program further comprises instructions for: providing the second feature map to three DASPP modules and one MHSA module of a second stage of the refinement block to generate the refined second feature map; providing the third feature map to two DASPP modules and one MHSA module of a third stage of the refinement block to generate the refined third feature map; and providing the fourth feature map to one DASPP module and one MHSA module of a fourth stage of the refinement block to generate the refined fourth feature map.
  17. (canceled)
  18. The computing device of claim 12, wherein the regression block comprises a plurality of layers for convolution, feature normalization, and deconvolution.
  19. The computing device of claim 12, wherein the at least one program further comprises instructions for: calculating a summary score corresponding to each shot constituting the video using the importance score corresponding to each frame; and performing the video summary of the video based on the summary score corresponding to each shot.
  20. The computing device of claim 19, wherein the at least one program further comprises instructions for calculating the summary score corresponding to each shot based on the average of the importance scores of the frames included in that shot.
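Claims 5 and 16 build the refinement stages from DASPP (dense atrous spatial pyramid pooling) modules. The core mechanism of atrous pooling is dilated sampling: the same kernel covers a wider receptive field as the dilation rate grows. The 1-D sketch below only illustrates that dilation arithmetic; the actual modules operate on 2-D feature maps with several parallel rates, none of which are specified here.

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    # 1-D dilated ("atrous") convolution: kernel taps are spaced `rate`
    # samples apart, enlarging the receptive field without extra parameters.
    k = len(kernel)
    span = (k - 1) * rate + 1          # receptive field covered by the kernel
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(10, dtype=float)
kernel = [1.0, 1.0, 1.0]
dense = atrous_conv1d(x, kernel, rate=1)    # ordinary 3-tap sums
dilated = atrous_conv1d(x, kernel, rate=2)  # same kernel, receptive field 5
print(dense, dilated)
```

At rate 1 the kernel spans 3 samples; at rate 2 the same 3-tap kernel spans 5, which is how stacked DASPP modules aggregate progressively larger context.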
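Claims 7 and 18 describe the regression block as layers for convolution, feature normalization, and deconvolution that map the final features to per-frame importance scores. A toy 1-D sketch of that layer sequence follows; the kernel, the normalization scheme, and the nearest-neighbour "deconvolution" are illustrative assumptions, not the patent's actual layers.

```python
import numpy as np

def conv1d_valid(x, w):
    # Valid-mode 1-D convolution (cross-correlation) over the time axis.
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

def normalize(x, eps=1e-6):
    # Zero-mean, unit-variance feature normalization.
    return (x - x.mean()) / (x.std() + eps)

def deconv1d(x, factor):
    # Crude stand-in for deconvolution: repeat each value `factor` times
    # to restore temporal resolution lost to the convolution stage.
    return np.repeat(x, factor)

t = np.linspace(0.0, 1.0, 16)                       # stand-in feature sequence
h = conv1d_valid(t, np.array([0.25, 0.5, 0.25]))    # convolution
h = normalize(h)                                    # feature normalization
scores = deconv1d(h, 2)                             # "deconvolution"
print(scores.shape)
```

The point of the deconvolution stage is that the output length returns toward one score per frame after the convolution has shortened the sequence.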
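Claims 9 and 20 define a shot's summary score as the average of the importance scores of the frames it contains. With hypothetical frame scores and shot boundaries:

```python
import numpy as np

# Per-frame importance scores (illustrative values) and shot boundaries
# given as [start, end) frame-index pairs.
frame_scores = np.array([0.25, 0.75, 1.0, 0.5, 0.0, 0.5])
shots = [(0, 2), (2, 4), (4, 6)]

# Each shot's summary score is the mean of its frames' importance scores.
shot_scores = [float(frame_scores[s:e].mean()) for s, e in shots]
print(shot_scores)  # → [0.5, 0.75, 0.25]
```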
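Claim 10 selects keyshots so that the sum of summary scores is maximized while the summary length stays within a predetermined ratio of the video. That constraint is the classic 0/1 knapsack formulation common in keyshot summarization; the dynamic-programming sketch below, including the 15% budget, is an assumption about how such a constraint is typically solved, not the claim's mandated procedure.

```python
def select_keyshots(scores, lengths, budget):
    # 0/1 knapsack: pick shots maximizing total summary score while their
    # combined length (in frames) stays within `budget`.
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]          # skip shot i-1
            if lengths[i - 1] <= b:              # or take it, if it fits
                cand = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][b]:
                    best[i][b] = cand
    # Backtrack to recover the chosen shot indices.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen)

shot_scores = [0.5, 0.75, 0.25, 0.9]
shot_lengths = [30, 20, 25, 40]       # frames per shot (illustrative)
budget = int(0.15 * 400)              # e.g. 15% of a 400-frame video = 60
keyshots = select_keyshots(shot_scores, shot_lengths, budget)
print(keyshots)  # → [1, 3]
```

Here shots 1 and 3 (total length 60, total score 1.65) beat every other combination that fits the 60-frame budget.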

Description

Video Summarization Method and Device Based on Multi-Scale Features

The present invention relates to a multi-scale feature-based video summarization method and apparatus and, more specifically, to a multi-scale feature-based video summarization method and apparatus that generate a video summary by extracting the optimal frames representing the video.

Due to the recent rapid expansion of security surveillance cameras, a massive amount of video content is generated daily. For example, numerous surveillance cameras are installed in airports, transportation systems, hospitals, and other public places, and administrators of video monitoring systems continuously monitor the footage captured by these cameras. However, because the video captured by surveillance cameras contains a large volume of visual data that is difficult to observe in real time, monitoring all of the footage is a challenge for administrators.

Embodiments of the present invention will be described with reference to the accompanying drawings, in which similar reference numerals indicate similar elements, but are not limited thereto.

FIG. 1 is an exemplary drawing showing the structure of an artificial neural network model according to one embodiment of the present invention. FIG. 2 is an exemplary drawing showing the structure of a regression block according to one embodiment of the present invention. FIG. 3 is a diagram showing an example of a multi-scale feature-based video summarization method according to an embodiment of the present invention. FIG. 4 is a block diagram showing the internal configuration of a computing device according to one embodiment of the present invention.

Hereinafter, specific details for implementing the present invention will be described with reference to the attached drawings.
However, in the following description, detailed descriptions of widely known functions or configurations are omitted where they would unnecessarily obscure the essence of the present invention. In the attached drawings, identical or corresponding components are assigned the same reference numerals, and their descriptions may be omitted in the following embodiments. However, even if the description of a component is omitted, this does not mean that the component is excluded from any embodiment.

The advantages and features of the disclosed embodiments, and the methods for achieving them, will become clear from the embodiments described below in conjunction with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms; these embodiments are provided merely to make the disclosure complete and to fully inform those skilled in the art of the scope of the invention.

The terms used in this specification are briefly explained below, and the disclosed embodiments are then described in detail. The terms have been selected from those in general use where possible, taking their functions in the present invention into account; however, they may vary depending on the intent of those skilled in the relevant field, case law, the emergence of new technologies, and so on. In specific cases, terms may be arbitrarily selected by the applicant, in which case their meanings are described in detail in the relevant description of the invention. Therefore, the terms used in this invention should be defined based not merely on their names but on their meanings and the overall content of the present invention.

In this specification, singular expressions include plural expressions unless the context clearly specifies otherwise, and plural expressions likewise include singular expressions. Throughout the specification, when a part is described as including a certain component, this means that, unless specifically stated otherwise, it may include additional components rather than excluding them. In the present invention, terms such as "comprising" and "including" may indicate the presence of features, steps, operations, elements, and/or components, but do not exclude the addition of one or more other features, steps, operations, elements, components, and/or combinations thereof. In the present invention, where a specific component is described as being "coupled," "combined," "connected" to, or "reacting" with any other component, the specific component may be directly coupled, combined, and/or connected to, or react with, the other component, but is not limited thereto; for example, one or more intermediate components may exist between the specific component and the other component. Additionally, in the present invention, "and/or" may include each of the one or more listed items or a combination of at least some o