
CN-122024146-A - Long video understanding method, long video understanding device and storage medium

CN122024146A

Abstract

The application relates to a long video understanding method, apparatus, and storage medium. The long video understanding method comprises: obtaining a long video and text prompts corresponding to an understanding task; extracting original video features from the long video and determining an original index according to frame numbers; generating a whitening feature similarity matrix corresponding to the original video features and determining a causal score of each frame based on the whitening feature similarity matrix; mapping the original index according to the causal score of each frame to obtain a new index corresponding to the original index, and reordering the original video features based on the new index to obtain context-consistent video features; and encoding the text prompts into text features, splicing the text features with the context-consistent video features to obtain a splicing result, and inputting the splicing result into a multimodal large model to obtain an understanding result matched with the understanding task of the long video. The application addresses the problem of low understanding accuracy for long videos.

Inventors

  • Guo Fangtai
  • Zhang Gege
  • Zheng Shiwei
  • Meng Zhangyuan
  • Liu Yi

Assignees

  • Zhejiang Lab (之江实验室)

Dates

Publication Date
2026-05-12
Application Date
2026-04-15

Claims (10)

  1. A long video understanding method, comprising: acquiring a long video and text prompts corresponding to an understanding task of the long video; extracting original video features from the long video, and determining an original index according to frame numbers in the original video features; generating a whitening feature similarity matrix corresponding to the original video features, and determining a causal score of each frame in the original video features based on the whitening feature similarity matrix; mapping the original index according to the causal score of each frame to obtain a new index corresponding to the original index, and reordering the original video features based on the new index to obtain context-consistent video features; and encoding the text prompts into text features, splicing the text features with the context-consistent video features to obtain a splicing result, and inputting the splicing result into a multimodal large model to obtain an understanding result matched with the understanding task of the long video.
  2. The long video understanding method according to claim 1, wherein generating the whitening feature similarity matrix corresponding to the original video features comprises: splitting the original video features into a plurality of groups according to a space dimension, and performing dimension rearrangement on the grouped original video features to obtain a plurality of groups of rearranged video features; performing a whitening transformation operation on each group of rearranged video features to obtain a single group of video whitening features; integrating the single groups of video whitening features and then performing dimension rearrangement to obtain an overall video whitening feature; and generating the whitening feature similarity matrix based on the overall video whitening feature.
  3. The long video understanding method according to claim 2, wherein performing the whitening transformation operation on each group of rearranged video features to obtain a single group of video whitening features comprises: determining a whitening transformation kernel matrix corresponding to each group of rearranged video features; and determining the single group of video whitening features corresponding to the rearranged video features based on the whitening transformation kernel matrix.
  4. The long video understanding method according to claim 3, wherein determining the whitening transformation kernel matrix corresponding to each group of rearranged video features comprises: calculating a feature mean for each group of rearranged video features; obtaining a feature covariance matrix based on the feature mean and the rearranged video features; performing singular value decomposition on the feature covariance matrix to obtain an orthogonal matrix and a diagonal matrix; and obtaining the whitening transformation kernel matrix based on the orthogonal matrix and the diagonal matrix.
  5. The long video understanding method according to claim 4, wherein obtaining the whitening transformation kernel matrix based on the orthogonal matrix and the diagonal matrix comprises: obtaining an inverse square-root diagonal matrix based on the diagonal matrix; and obtaining the whitening transformation kernel matrix based on the inverse square-root diagonal matrix, the orthogonal matrix, and a preset whitening feature length.
  6. The long video understanding method according to claim 1, wherein determining the causal score of each frame in the original video features based on the whitening feature similarity matrix comprises: obtaining a timing score of each frame of the original video based on the row index of the whitening feature similarity matrix; calculating a context score and a smoothness score of each frame in the original video based on the whitening feature similarity matrix; and obtaining the causal score based on the timing score, the context score, and the smoothness score.
  7. The long video understanding method according to claim 6, wherein calculating the context score and the smoothness score of each frame in the original video based on the whitening feature similarity matrix comprises: calculating the mean of all elements in each row of the whitening feature similarity matrix to obtain the context score of each frame; and obtaining the smoothness score of each frame based on local elements within a preset window in each row of the whitening feature similarity matrix.
  8. The long video understanding method according to claim 7, wherein obtaining the smoothness score of each frame based on local elements within the preset window in each row of the whitening feature similarity matrix comprises: calculating the variance of the local elements within the preset window in each row of the whitening feature similarity matrix to obtain a local element variance; and obtaining the smoothness score based on the local element variance.
  9. A long video understanding apparatus, comprising: an acquisition module, configured to acquire a long video and text prompts corresponding to an understanding task of the long video; a feature extraction module, configured to extract original video features from the long video and determine an original index according to frame numbers in the original video features; a reordering module, configured to generate a whitening feature similarity matrix corresponding to the original video features, determine a causal score of each frame in the original video features based on the whitening feature similarity matrix, map the original index according to the causal score of each frame to obtain a new index corresponding to the original index, and reorder the original video features based on the new index to obtain context-consistent video features; and a reasoning module, configured to encode the text prompts into text features, splice the text features with the context-consistent video features to obtain a splicing result, and input the splicing result into a multimodal large model to obtain an understanding result matched with the understanding task of the long video.
  10. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the long video understanding method of any one of claims 1 to 8 when executed.
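The whitening transformation of claims 4 and 5 can be read as a standard ZCA-style whitening: the covariance of the mean-centered group features is decomposed by SVD into an orthogonal matrix and a diagonal matrix, and the kernel is built from the inverse square root of the diagonal. A minimal numpy sketch under that reading follows; the `eps` stabiliser is an assumption, and the preset whitening feature length mentioned in claim 5 is omitted here since the claims do not specify how it enters the formula:

```python
import numpy as np

def whitening_kernel(features, eps=1e-5):
    """ZCA-style whitening kernel for one group of features.

    features: (n_frames, dim) array for a single feature group.
    Returns (kernel, mean) so that (features - mean) @ kernel is whitened.
    eps is a hypothetical stabiliser, not specified in the claims.
    """
    mean = features.mean(axis=0, keepdims=True)        # feature mean (claim 4)
    centered = features - mean
    cov = centered.T @ centered / (len(features) - 1)  # feature covariance matrix
    u, s, _ = np.linalg.svd(cov)                       # orthogonal + diagonal parts
    inv_sqrt = np.diag(1.0 / np.sqrt(s + eps))         # inverse square-root diagonal
    kernel = u @ inv_sqrt @ u.T                        # whitening transformation kernel
    return kernel, mean

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 8))  # correlated toy features
k, m = whitening_kernel(x)
white = (x - m) @ k
cov_w = white.T @ white / (len(white) - 1)
print(np.allclose(cov_w, np.eye(8), atol=0.05))         # covariance ≈ identity
```

The check at the end confirms the defining property of whitening: after applying the kernel, the group's feature covariance is close to the identity matrix.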

Description

Long video understanding method, long video understanding device and storage medium

Technical Field

The present application relates to the field of computer vision, and in particular to a long video understanding method, apparatus, and storage medium.

Background

Long video understanding is a core research direction at the intersection of computer vision and multimodal artificial intelligence, and is a key technical support for practical applications such as intelligent security monitoring, intelligent robots, and autonomous driving. In the related art, processing methods that feed the original temporal features of a video directly into a multimodal large model yield low accuracy for long video understanding. At present, no effective solution has been proposed for this problem.

Disclosure of Invention

The embodiments of the present application provide a long video understanding method, a long video understanding apparatus, and a storage medium, which at least solve the problem of low long video understanding accuracy in the related art.
In a first aspect, an embodiment of the present application provides a long video understanding method, comprising: acquiring a long video and text prompts corresponding to an understanding task of the long video; extracting original video features from the long video, and determining an original index according to frame numbers in the original video features; generating a whitening feature similarity matrix corresponding to the original video features, and determining a causal score of each frame in the original video features based on the whitening feature similarity matrix; mapping the original index according to the causal score of each frame to obtain a new index corresponding to the original index, and reordering the original video features based on the new index to obtain context-consistent video features; and encoding the text prompts into text features, splicing the text features with the context-consistent video features to obtain a splicing result, and inputting the splicing result into a multimodal large model to obtain an understanding result matched with the understanding task of the long video.

In some embodiments, generating the whitening feature similarity matrix corresponding to the original video features includes: splitting the original video features into a plurality of groups according to a space dimension, and performing dimension rearrangement on the grouped original video features to obtain a plurality of groups of rearranged video features; performing a whitening transformation operation on each group of rearranged video features to obtain a single group of video whitening features; integrating the single groups of video whitening features and then performing dimension rearrangement to obtain an overall video whitening feature; and generating the whitening feature similarity matrix based on the overall video whitening feature.
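The group-whitening pipeline restated above (split by the space dimension, whiten each group, integrate the groups back, then compare frames) can be sketched as follows. The even split along the feature axis and the use of cosine similarity for the matrix are assumptions, since the text names the steps but not the exact formulas:

```python
import numpy as np

def whitened_similarity(features, n_groups=4, eps=1e-5):
    """Sketch of the whitening feature similarity matrix.

    features: (n_frames, dim); dim must be divisible by n_groups.
    Each group is whitened independently, the groups are re-integrated,
    and frame-to-frame cosine similarity forms the matrix.
    """
    n, d = features.shape
    groups = features.reshape(n, n_groups, d // n_groups)  # split by space dimension
    white = np.empty_like(groups)
    for g in range(n_groups):                              # whiten each group separately
        x = groups[:, g, :]
        c = x - x.mean(axis=0)
        cov = c.T @ c / (n - 1)
        u, s, _ = np.linalg.svd(cov)
        white[:, g, :] = c @ (u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T)
    flat = white.reshape(n, d)                             # integrate + rearrange back
    norm = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-12)
    return norm @ norm.T                                   # (n_frames, n_frames) matrix

sim = whitened_similarity(np.random.default_rng(1).normal(size=(16, 32)))
print(sim.shape)  # one row and one column per frame
```

Whitening before the similarity computation decorrelates the feature channels, so dominant channels do not swamp the per-frame comparisons.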
In some embodiments, performing the whitening transformation operation on each group of rearranged video features to obtain a single group of video whitening features includes: determining a whitening transformation kernel matrix corresponding to each group of rearranged video features; and determining the single group of video whitening features corresponding to the rearranged video features based on the whitening transformation kernel matrix.

In some embodiments, determining the whitening transformation kernel matrix corresponding to each group of rearranged video features includes: calculating a feature mean for each group of rearranged video features; obtaining a feature covariance matrix based on the feature mean and the rearranged video features; performing singular value decomposition on the feature covariance matrix to obtain an orthogonal matrix and a diagonal matrix; and obtaining the whitening transformation kernel matrix based on the orthogonal matrix and the diagonal matrix.

In some embodiments, obtaining the whitening transformation kernel matrix based on the orthogonal matrix and the diagonal matrix includes: obtaining an inverse square-root diagonal matrix based on the diagonal matrix; and obtaining the whitening transformation kernel matrix based on the inverse square-root diagonal matrix, the orthogonal matrix, and a preset whitening feature length.

In some embodiments, determining the causal score of each frame in the original video features based on the whitening feature similarity matrix comprises: obtaining a timing score of each frame of the original video based on the row index of the whitening feature similarity matrix; calculating a context score and a smoothness score of each frame in the original video based on the whitening feature similarity matrix; and obtaining the causal score based on the timing score, the context score, and the smoothness score.
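The causal-score reordering described by claims 6 to 8 (a timing score from the row index, a context score as the row mean of the similarity matrix, and a smoothness score from the variance of local elements in a preset window) might be sketched as below. The equal weighting of the three scores, the linear timing score, and the 1/(1 + variance) smoothness form are assumptions; the patent names the components but not how they are combined:

```python
import numpy as np

def causal_reorder(features, sim, window=3, weights=(1.0, 1.0, 1.0)):
    """Sketch of causal-score frame reordering.

    sim: (n, n) whitening feature similarity matrix.
    Returns the new index (original -> reordered) and reordered features.
    """
    n = len(sim)
    timing = 1.0 - np.arange(n) / max(n - 1, 1)    # timing score from row index
    context = sim.mean(axis=1)                     # row mean of all elements
    smooth = np.empty(n)
    for i in range(n):                             # variance inside a preset window
        lo, hi = max(0, i - window), min(n, i + window + 1)
        smooth[i] = 1.0 / (1.0 + sim[i, lo:hi].var())
    score = (weights[0] * timing + weights[1] * context
             + weights[2] * smooth)                # combined causal score
    new_index = np.argsort(-score, kind="stable")  # high-score frames first
    return new_index, features[new_index]

feats = np.random.default_rng(2).normal(size=(8, 4))
sim = np.eye(8)                                    # toy similarity matrix
idx, reordered = causal_reorder(feats, sim)
print(idx.tolist())  # a permutation of the original frame indices
```

The new index is then the mapping used to reorder the original video features into the context-consistent order before splicing with the text features.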