CN-116189055-B - Video recognition model training method, video recognition method, device and storage medium
Abstract
The invention relates to the technical field of computer vision and provides a video recognition model training method, a video recognition method, a device, and a storage medium. The training method comprises: inputting a sample video into a video extraction model to obtain a first video feature output by the video extraction model; inputting a sample sound into a sound extraction model to obtain a first sound feature output by the sound extraction model; performing feature matching between the first sound feature and the first video feature; and performing model training based on the matched features to obtain a video recognition model. By matching the first sound feature with the first video feature and then training on the matched features, the invention improves the model training effect, increases the robustness of the video recognition model, and ultimately improves the accuracy of video recognition.
Inventors
- ZHU YICHEN
Assignees
- 美的集团(上海)有限公司
- 美的集团股份有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20230227
Claims (7)
- 1. A video recognition model training method, comprising: acquiring a sample video and a sample sound corresponding to the sample video; inputting the sample video into a video extraction model to obtain a first video feature output by the video extraction model; inputting the sample sound into a sound extraction model to obtain a first sound feature output by the sound extraction model; and performing feature matching between the first sound feature and the first video feature, and performing model training based on the matched features to obtain a video recognition model; wherein performing feature matching between the first sound feature and the first video feature comprises: performing dimension alignment on the first sound feature and the first video feature based on global average pooling to obtain a second sound feature and a second video feature; performing spectral transformation analysis on the second sound feature to obtain a sound signal channel, the sound signal channel comprising a high-frequency signal channel and a low-frequency signal channel; performing spectral transformation analysis on the second video feature to obtain a video signal channel, the video signal channel comprising a high-frequency signal channel and a low-frequency signal channel; and aligning the high-frequency and low-frequency signals of the sound signal channel with the high-frequency and low-frequency signals of the video signal channel, respectively, based on knowledge distillation, so that the second video feature is enhanced by the second sound feature.
- 2. The method according to claim 1, wherein the knowledge distillation is optimized based on a mean absolute error loss function.
- 3. The method according to claim 1, wherein the video recognition model is used to perform video recognition on a video to be recognized to obtain a video recognition result, and the video recognition result is obtained based on either a classification layer or a regression layer of the video recognition model.
- 4. The method according to any one of claims 1 to 3, wherein the video recognition model is used for at least one of: repetitive motion counting, action recognition, and video segmentation.
- 5. A video recognition method, comprising: acquiring a video to be recognized; and inputting the video to be recognized into a video recognition model to obtain a video recognition result output by the video recognition model; wherein the video recognition model is trained by the video recognition model training method according to any one of claims 1 to 4.
- 6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the video recognition model training method of any one of claims 1 to 4 or the video recognition method of claim 5.
- 7. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video recognition model training method of any one of claims 1 to 4 or the video recognition method of claim 5.
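The training pipeline of claim 1 can be sketched end to end. The NumPy sketch below is an illustrative interpretation only, not the patented implementation: the video and sound extraction models are replaced by random linear projections, the "feature matching" step is simplified to a fusion of the two feature streams, and all shapes (32 time steps, 64 feature channels, 10 output classes) are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two extraction models in claim 1; in
# practice these would be trained networks (e.g. a video backbone and an
# audio encoder). All shapes here are assumed for illustration.
def video_extraction_model(video):               # (T, H, W, C) -> (T, D)
    pooled = video.mean(axis=(1, 2))             # pool spatial dimensions
    return pooled @ rng.standard_normal((video.shape[-1], 64))

def sound_extraction_model(sound):               # (T, F) -> (T, D)
    return sound @ rng.standard_normal((sound.shape[-1], 64))

sample_video = rng.standard_normal((32, 8, 8, 3))   # 32 frames of 8x8 RGB
sample_sound = rng.standard_normal((32, 16))        # time-aligned audio frames

first_video_feature = video_extraction_model(sample_video)   # (32, 64)
first_sound_feature = sound_extraction_model(sample_sound)   # (32, 64)

# Feature matching, simplified to an additive fusion of the matched
# streams; training would then proceed on the fused features, with a
# classification head standing in for the recognition model's output.
matched = first_video_feature + first_sound_feature          # (32, 64)
logits = matched.mean(axis=0) @ rng.standard_normal((64, 10))

assert logits.shape == (10,)
```

A real implementation would train the extraction models and the head jointly with backpropagation; the sketch only fixes the data flow between the claimed steps.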
Description
Video recognition model training method, video recognition method, device and storage medium

Technical Field

The present invention relates to the field of computer vision, and in particular to a video recognition model training method, a video recognition method, a device, and a storage medium.

Background

With the rapid development of computer vision technology, its range of application is growing ever wider. A video recognition task requires performing recognition on a video to be recognized in order to obtain a recognition result. For example, repetitive motion counting is a long-standing problem in computer vision with broad applications, such as the analysis of sports videos and the counting of exercise motions, and is an important branch of video understanding. Conventional approaches count repeated actions using methods such as Fourier analysis, wavelet transformation, and convolutional neural networks, but the resulting counting and recognition accuracy remains low.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems existing in the prior art. The invention therefore provides a video recognition model training method in which the first sound feature is matched with the first video feature and model training is then performed on the matched features, thereby improving the training effect, increasing the robustness of the video recognition model, and ultimately improving the accuracy of video recognition. The invention further provides a video recognition method, an electronic device, and a storage medium.
According to an embodiment of the first aspect of the present invention, a video recognition model training method includes: acquiring a sample video and a sample sound corresponding to the sample video; inputting the sample video into a video extraction model to obtain a first video feature output by the video extraction model; inputting the sample sound into a sound extraction model to obtain a first sound feature output by the sound extraction model; and performing feature matching between the first sound feature and the first video feature, and performing model training based on the matched features to obtain a video recognition model. In this method, the sample video and sample sound are passed through their respective extraction models to obtain the first video feature and the first sound feature; the two features are then matched, and training proceeds on the matched features, which improves the training effect, increases the robustness of the video recognition model, and ultimately improves the accuracy of video recognition. According to one embodiment of the present invention, performing feature matching between the first sound feature and the first video feature includes: performing dimension alignment on the first sound feature and the first video feature based on global average pooling to obtain a second sound feature and a second video feature.
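The dimension alignment step above can be illustrated concretely. The sketch is a plausible reading under assumed shapes, not the patented implementation: a video feature map of shape (T, H, W, C) and a sound spectrogram feature of shape (T, F, C) are each reduced by global average pooling over their non-time, non-channel axes, leaving both modalities with the same (T, C) shape.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed shapes for illustration: a video feature map (T, H, W, C)
# and a sound spectrogram feature (T, F, C) from the extraction models.
first_video_feature = rng.standard_normal((16, 7, 7, 128))
first_sound_feature = rng.standard_normal((16, 40, 128))

# Global average pooling over the spatial axes (video) and the
# frequency axis (sound) collapses both features to the same (T, C)
# shape, i.e. the claimed "dimension alignment".
second_video_feature = first_video_feature.mean(axis=(1, 2))  # (16, 128)
second_sound_feature = first_sound_feature.mean(axis=1)       # (16, 128)

assert second_video_feature.shape == second_sound_feature.shape == (16, 128)
```

Once the two features share a shape, the later per-channel alignment losses can be computed element-wise without any further projection.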
According to an embodiment of the present invention, after the dimension alignment based on global average pooling to obtain the second sound feature and the second video feature, the method further includes: performing spectral transformation analysis on the second sound feature to obtain a sound signal channel; performing spectral transformation analysis on the second video feature to obtain a video signal channel; and aligning the sound signal channel with the video signal channel in the time dimension based on knowledge distillation, so that the second video feature is enhanced by the second sound feature. According to one embodiment of the present invention, the sound signal channel includes a high-frequency signal channel and a low-frequency signal channel, and the video signal channel includes a high-frequency signal channel and a low-frequency signal channel; aligning the sound signal channel with the video signal channel in the time dimension based on knowledge distillation includes: aligning the high-frequency and low-frequency signals of the sound signal channel with the high-frequency and low-frequency signals of the video signal channel, respectively, based on knowledge distillation. According to one embodiment of the invention, the knowledge distillation is optimized based on a mean absolute error loss function. According to one embodiment of the present invention, the video recog
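The spectral split and distillation alignment described above can be sketched as follows. This is an assumed interpretation: the "spectral transformation" is taken to be an FFT along the time axis with a hypothetical cutoff bin separating the low- and high-frequency channels, and the distillation objective is the mean absolute error between corresponding channels, as in claim 2.

```python
import numpy as np

def split_frequency_channels(feature, cutoff):
    """FFT a (T, C) feature along time and split it into low- and
    high-frequency signal channels at an assumed cutoff bin."""
    spectrum = np.fft.rfft(feature, axis=0)
    low, high = spectrum.copy(), spectrum.copy()
    low[cutoff:] = 0      # keep only bins below the cutoff
    high[:cutoff] = 0     # keep only bins at or above the cutoff
    n = feature.shape[0]
    return np.fft.irfft(low, n=n, axis=0), np.fft.irfft(high, n=n, axis=0)

def mae_distillation_loss(student, teacher):
    """Mean absolute error between matched channels, standing in for
    the claimed distillation objective (claim 2)."""
    return np.abs(student - teacher).mean()

rng = np.random.default_rng(2)
second_video = rng.standard_normal((16, 8))   # assumed (T, C) shapes
second_sound = rng.standard_normal((16, 8))

v_low, v_high = split_frequency_channels(second_video, cutoff=3)
s_low, s_high = split_frequency_channels(second_sound, cutoff=3)

# Align high with high and low with low, respectively, per claim 1.
loss = (mae_distillation_loss(s_low, v_low)
        + mae_distillation_loss(s_high, v_high))
assert loss >= 0.0
```

Because the two sub-bands partition the spectrum, the low- and high-frequency reconstructions sum back to the original feature, so the split loses no information while letting slow trends and fast repetitions be aligned separately.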