
CN-121982604-A - Video recognition method and device, electronic device, storage medium and program product

CN 121982604 A

Abstract

The present disclosure provides a video recognition method and apparatus, an electronic device, a storage medium, and a program product. The video recognition method comprises: obtaining spatial features of a current frame by performing spatial feature extraction on the current frame included in a video; based on the spatial features of the current frame, obtaining fast temporal features of the current frame through a fast branch of a fast-slow RNN formed by combining a fast-slow network with a recurrent neural network (RNN), and obtaining slow temporal features of the current frame through a slow branch of the fast-slow RNN; and obtaining an action recognition result of the current frame based on the fast temporal features and the slow temporal features of the current frame.

Inventors

  • Wang Jiayang
  • Guo Zidong
  • Xu Han
  • Yang Ran
  • Li Dongyu
  • Chi Daxuan
  • Tian Fanlun

Assignees

  • Samsung (China) Semiconductor Co., Ltd.
  • Samsung Electronics Co., Ltd.

Dates

Publication Date
2026-05-05
Application Date
2026-01-16

Claims (12)

  1. A video recognition method, comprising: obtaining spatial features of a current frame by performing spatial feature extraction on the current frame included in a video; based on the spatial features of the current frame, obtaining fast temporal features of the current frame through a fast branch of a fast-slow RNN formed by combining a fast-slow network with a recurrent neural network (RNN), and obtaining slow temporal features of the current frame through a slow branch of the fast-slow RNN; and obtaining an action recognition result of the current frame based on the fast temporal features and the slow temporal features of the current frame.
  2. The video recognition method of claim 1, wherein the step of obtaining the fast temporal features of the current frame through the fast branch of the fast-slow RNN and obtaining the slow temporal features of the current frame through the slow branch of the fast-slow RNN based on the spatial features of the current frame comprises: based on temporal features of a previous frame that precedes the current frame in time and the spatial features of the current frame, obtaining the fast temporal features of the current frame using a fast RNN calculation unit of the fast branch, and obtaining the slow temporal features of the current frame using a slow RNN calculation unit of the slow branch.
  3. The video recognition method of claim 2, wherein a plurality of first frames including the current frame, on which the fast RNN calculation unit performs calculation processing, have a first frame rate, and a plurality of second frames, on which the slow RNN calculation unit performs calculation processing, have a second frame rate, wherein the first frame rate is greater than the second frame rate and the plurality of second frames are included in the plurality of first frames.
  4. The video recognition method of claim 3, wherein the step of obtaining the fast temporal features of the current frame using the fast RNN calculation unit of the fast branch based on the temporal features of the previous frame and the spatial features of the current frame comprises: obtaining the fast temporal features of the current frame by performing calculation processing with the fast RNN calculation unit based on the fast and slow temporal features of the previous frame and the spatial features of the current frame.
  5. The video recognition method of claim 3, wherein the step of obtaining the slow temporal features of the current frame using the slow RNN calculation unit of the slow branch based on the temporal features of the previous frame and the spatial features of the current frame comprises: determining whether the current frame is included in the plurality of second frames; in response to determining that the current frame is included in the plurality of second frames, obtaining the slow temporal features of the current frame by performing calculation processing with the slow RNN calculation unit based on the slow temporal features of the previous frame and the spatial features of the current frame; and in response to determining that the current frame is not included in the plurality of second frames, determining the slow temporal features of the previous frame as the slow temporal features of the current frame.
  6. The video recognition method of claim 5, wherein determining whether the current frame is included in the plurality of second frames comprises: determining a step length of the fast-slow RNN, related to the number of frames skipped by the calculation processing of the slow RNN calculation unit, based on the first frame rate and the second frame rate; and if the frame number of the current frame is divisible by the step length, determining that the current frame is included in the plurality of second frames.
  7. The video recognition method of claim 1, wherein the step of obtaining the action recognition result of the current frame based on the fast temporal features and the slow temporal features of the current frame comprises: obtaining fused temporal features of the current frame by performing feature fusion on the fast temporal features and the slow temporal features of the current frame; and performing action recognition of the current frame based on the fused temporal features of the current frame to obtain the action recognition result of the current frame.
  8. The video recognition method of claim 7, wherein the step of performing action recognition of the current frame based on the fused temporal features of the current frame to obtain the action recognition result of the current frame comprises: updating the fast temporal features and the slow temporal features of the current frame based on the fused temporal features of the current frame; obtaining a fast action prediction result of the current frame by performing action prediction based on the updated fast temporal features of the current frame, and obtaining a slow action prediction result of the current frame by performing action prediction based on the updated slow temporal features of the current frame; and obtaining the action recognition result of the current frame by performing feature fusion and screening on the fast action prediction result and the slow action prediction result of the current frame.
  9. A video recognition apparatus, comprising: a spatial feature acquisition module configured to obtain spatial features of a current frame by performing spatial feature extraction on the current frame included in a video; a temporal feature acquisition module configured to, based on the spatial features of the current frame, obtain fast temporal features of the current frame through a fast branch of a fast-slow RNN formed by combining a fast-slow network with a recurrent neural network (RNN) and obtain slow temporal features of the current frame through a slow branch of the fast-slow RNN; and an action recognition module configured to obtain an action recognition result of the current frame based on the fast temporal features and the slow temporal features of the current frame.
  10. An electronic device, comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the video recognition method of any one of claims 1 to 8.
  11. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the video recognition method of any one of claims 1 to 8.
  12. A computer program product comprising computer-executable instructions which, when executed by at least one processor, implement the video recognition method of any one of claims 1 to 8.
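The frame-skipping logic of claims 3, 5 and 6 can be sketched as follows. This is an illustrative reading, not code from the patent; the function names are hypothetical, and the sketch assumes the first frame rate is an integer multiple of the second.

```python
def slow_rnn_step_length(fast_frame_rate: int, slow_frame_rate: int) -> int:
    """Step length = number of fast-branch frames per slow-branch frame,
    determined from the first (fast) and second (slow) frame rates."""
    if slow_frame_rate <= 0 or fast_frame_rate % slow_frame_rate != 0:
        raise ValueError("fast frame rate assumed to be a multiple of slow")
    return fast_frame_rate // slow_frame_rate


def slow_branch_active(frame_number: int, step: int) -> bool:
    """Per claim 6: the slow RNN calculation unit computes only when the
    frame number is divisible by the step length; on skipped frames the
    previous slow temporal features are carried over (claim 5)."""
    return frame_number % step == 0


# Example: fast branch at 30 fps, slow branch at 6 fps -> step length 5.
step = slow_rnn_step_length(30, 6)
active_frames = [t for t in range(12) if slow_branch_active(t, step)]
```

With these rates the slow unit would compute on frames 0, 5 and 10 of the first twelve, carrying its state forward on the rest.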

Description

Video recognition method and device, electronic device, storage medium and program product

Technical Field

The present disclosure relates to the field of computer vision, and more particularly, to a video recognition method and apparatus, an electronic device, a storage medium, and a program product for action recognition.

Background

Online action detection (OAD) is a challenging task in video understanding within the field of computer vision, with the aim of correctly recognizing ongoing behavior from a video stream. With the popularization of interactive equipment, online action recognition technology has great application value in video surveillance, intelligent cockpits, human-computer interaction, and other applications. The main challenge of online action recognition is to recognize the current action using only historical information as each video frame arrives, which requires learning long-range temporal dependencies.

In the related art, several OAD schemes have been proposed, such as methods based on a three-dimensional convolutional neural network (3D-CNN), methods based on a Transformer network, and methods based on a recurrent neural network (RNN); however, each has its own problems in performing video recognition. The 3D-CNN uses overlapping video segments as input in a sliding-window manner and classifies the action types in each input segment to obtain the recognition result, so it suffers from redundant computation and cannot guarantee the real-time performance required by the online action recognition task. The RNN-based video recognition method has a disadvantage in model accuracy because such models are difficult to train.
The Transformer-based video recognition method struggles to guarantee real-time inference because of the high computational complexity of the attention mechanism. Therefore, a video recognition technology is required that improves the accuracy of action recognition while satisfying the real-time requirements of inference.

Disclosure of Invention

To address at least the problems and/or disadvantages described above, embodiments of the present disclosure provide a video recognition method and apparatus, an electronic device, a storage medium, and a program product for action recognition. According to one aspect of the embodiments of the disclosure, a video recognition method is provided, which comprises: obtaining spatial features of a current frame by performing spatial feature extraction on the current frame included in a video; based on the spatial features of the current frame, obtaining fast temporal features of the current frame through a fast branch of a fast-slow RNN formed by combining a fast-slow network with a recurrent neural network (RNN), and obtaining slow temporal features of the current frame through a slow branch of the fast-slow RNN; and obtaining an action recognition result of the current frame based on the fast temporal features and the slow temporal features of the current frame.
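As a rough illustration of the per-frame pipeline just described, the following sketch pairs a fast RNN unit that updates on every frame with a slow unit that updates only every few frames, then fuses the two hidden states for recognition. The tanh cell, feature dimension, class count, skip step, and random weights are all assumptions for illustration; the disclosure does not fix these details.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # spatial-feature dimension (assumed)

# Randomly initialised weights stand in for trained parameters.
W_fast = rng.standard_normal((D, 2 * D)) * 0.1  # fast RNN calculation unit
W_slow = rng.standard_normal((D, 2 * D)) * 0.1  # slow RNN calculation unit
W_fuse = rng.standard_normal((3, 2 * D)) * 0.1  # 3 action classes (assumed)


def rnn_cell(h_prev: np.ndarray, x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """A simple tanh RNN update on the concatenation [h_prev; x]."""
    return np.tanh(W @ np.concatenate([h_prev, x]))


h_fast = np.zeros(D)
h_slow = np.zeros(D)
step = 4  # slow branch computes on every 4th frame (assumed)

for t in range(12):
    x = rng.standard_normal(D)  # stands in for the extracted spatial feature
    h_fast = rnn_cell(h_fast, x, W_fast)  # fast branch: every frame
    if t % step == 0:                     # slow branch: every `step` frames,
        h_slow = rnn_cell(h_slow, x, W_slow)
    # otherwise the previous slow state is simply carried over.
    fused = np.concatenate([h_fast, h_slow])  # feature fusion by concatenation
    logits = W_fuse @ fused
    action = int(np.argmax(logits))  # per-frame action recognition result
```

Concatenation is only one simple fusion choice; the disclosure's later embodiments describe fusing and then screening separate fast and slow predictions, which this sketch does not attempt to reproduce.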
Optionally, the step of obtaining the fast temporal features of the current frame through the fast branch of the fast-slow RNN and obtaining the slow temporal features of the current frame through the slow branch of the fast-slow RNN includes: based on temporal features of a previous frame that precedes the current frame in time and the spatial features of the current frame, obtaining the fast temporal features of the current frame using a fast RNN calculation unit of the fast branch, and obtaining the slow temporal features of the current frame using a slow RNN calculation unit of the slow branch. Optionally, a plurality of first frames including the current frame, on which the fast RNN calculation unit performs calculation processing, have a first frame rate, and a plurality of second frames, on which the slow RNN calculation unit performs calculation processing, have a second frame rate, wherein the first frame rate is greater than the second frame rate and the plurality of second frames are included in the plurality of first frames. Optionally, the step of obtaining the fast temporal features of the current frame using the fast RNN calculation unit of the fast branch based on the temporal features of the previous frame and the spatial features of the current frame includes obtaining the fast temporal features of the current frame by performing calculation processing with the fast RNN calculation unit based on the fast and slow temporal features