EP-4239585-B1 - VIDEO LOOP RECOGNITION METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM
Inventors
- GUO, HUI
Dates
- Publication Date: 2026-05-06
- Application Date: 2022-05-12
Claims (15)
- A video loop recognition method, the method being performed by a computer device, the method comprising: acquiring (S101) a target video clip pair of a to-be-recognized video by dividing the to-be-recognized video into a plurality of video clips, the target video clip pair comprising any two of the plurality of video clips, and determining a first target encoding feature of the target video clip pair under first modal information and a second target encoding feature of the target video clip pair under second modal information; the first modal information corresponding to the first target encoding feature being different from the second modal information corresponding to the second target encoding feature, each of the first modal information and the second modal information being selected from video modal information, audio modal information, speech text modal information, video title modal information, and cover modal information; acquiring (S102) a target multi-modal neural network model (target network model) for performing loop recognition on the to-be-recognized video; the target network model comprising a first target sequence model corresponding to the first modal information and a second target sequence model corresponding to the second modal information; inputting (S103) the first target encoding feature to the first target sequence model to make the first target sequence model output a first target similarity result of the target video clip pair, the first target similarity result reflecting a degree of similarity between two video clips in the target video clip pair based on the first modal information; inputting (S104) the second target encoding feature to the second target sequence model to make the second target sequence model output a second target similarity result of the target video clip pair, the second target similarity result reflecting a degree of similarity between the two video clips in the target video clip pair based on the second modal information; and obtaining (S105) a loop comparison result of the target video clip pair based on a comparison of the first target similarity result with the second target similarity result; the loop comparison result being used for indicating a video type of the to-be-recognized video. (Illustrative, non-limiting code sketches of these steps follow the claims.)
- The method according to claim 1, wherein the first target sequence model comprises a first sequence representation learning layer and a first similarity measurement layer; the second target sequence model comprises a second sequence representation learning layer and a second similarity measurement layer, and the inputting (S103) the first target encoding feature to the first target sequence model to make the first target sequence model output a first target similarity result of the target video clip pair comprises: inputting the first target encoding feature to the first target sequence model, performing sequence feature learning on the first target encoding feature through the first sequence representation learning layer, inputting a first target learning feature obtained from the sequence feature learning to the first similarity measurement layer, and outputting, by the first similarity measurement layer, the first target similarity result of the target video clip pair; and the inputting (S104) the second target encoding feature to the second target sequence model to make the second target sequence model output a second target similarity result of the target video clip pair comprises: inputting the second target encoding feature to the second target sequence model, performing sequence feature learning on the second target encoding feature through the second sequence representation learning layer, inputting a second target learning feature obtained from the sequence feature learning to the second similarity measurement layer, and outputting, by the second similarity measurement layer, the second target similarity result of the target video clip pair.
- The method according to claim 1, wherein the acquiring (S101) a target video clip pair of a to-be-recognized video, and determining a first target encoding feature and a second target encoding feature of the target video clip pair comprises: determining, when the to-be-recognized video is acquired, a video duration of the to-be-recognized video, and segmenting the to-be-recognized video based on the video duration to obtain N video clips; N being a positive integer; acquiring a video clip P_i and a video clip P_j from the N video clips, and taking the video clip P_i and the video clip P_j as the target video clip pair of the to-be-recognized video; i and j being positive integers less than or equal to N, and i being not equal to j; performing first feature extraction on each video clip in the target video clip pair to obtain the first target encoding feature of the target video clip pair; and performing second feature extraction on each video clip in the target video clip pair to obtain the second target encoding feature of the target video clip pair (see the segmentation and pairing sketch following the claims).
- The method according to claim 3, wherein the first modal information is video modal information; and the performing first feature extraction on each video clip in the target video clip pair to obtain the first target encoding feature of the target video clip pair comprises: taking a video frame corresponding to each video clip in the target video clip pair as a to-be-processed video frame, and determining a frame extraction parameter based on a frame rate of the to-be-processed video frame; performing frame extraction processing on the to-be-processed video frame based on the frame extraction parameter to obtain a to-be-encoded video frame correlated with the to-be-processed video frame; acquiring a video encoding model correlated with the video modal information, inputting the to-be-encoded video frame to the video encoding model, and encoding the to-be-encoded video frame through the video encoding model to obtain a video encoding feature corresponding to the to-be-encoded video frame; and obtaining the first target encoding feature of the target video clip pair based on the video encoding feature corresponding to the to-be-encoded video frame; the first target encoding feature comprising a video encoding feature S_i corresponding to the video clip P_i and a video encoding feature S_j corresponding to the video clip P_j (see the frame-sampling sketch following the claims).
- The method according to claim 3, wherein the second modal information is audio modal information; and the performing second feature extraction on each video clip in the target video clip pair to obtain the second target encoding feature of the target video clip pair comprises: taking an audio frame corresponding to each video clip in the target video clip pair as a to-be-processed audio frame, and performing audio preparation processing on the to-be-processed audio frame to obtain a to-be-encoded audio frame correlated with the to-be-processed audio frame; acquiring an audio encoding model correlated with the audio modal information, inputting the to-be-encoded audio frame to the audio encoding model, and encoding the to-be-encoded audio frame through the audio encoding model to obtain an audio encoding feature corresponding to the to-be-encoded audio frame; and obtaining the second target encoding feature of the target video clip pair based on the audio encoding feature corresponding to the to-be-encoded audio frame; the second target encoding feature comprising an audio encoding feature Y_i corresponding to the video clip P_i and an audio encoding feature Y_j corresponding to the video clip P_j.
- The method according to claim 2, wherein the target video clip pair comprises a video clip P_i and a video clip P_j; i and j being positive integers less than or equal to N, and i being not equal to j; N being a total quantity of the video clips in the to-be-recognized video; the first sequence representation learning layer comprises a first network layer correlated with the video clip P_i and a second network layer correlated with the video clip P_j, and the first network layer and the second network layer have a same network structure; and the inputting the first target encoding feature to the first target sequence model, performing sequence feature learning on the first target encoding feature through the first sequence representation learning layer, inputting a first target learning feature obtained from the sequence feature learning to the first similarity measurement layer, and outputting, by the first similarity measurement layer, the first target similarity result of the target video clip pair comprises: inputting the first target encoding feature to the first target sequence model; the first target encoding feature comprising a video encoding feature S_i and a video encoding feature S_j; the video encoding feature S_i being an encoding feature of the video clip P_i under the first modal information; the video encoding feature S_j being an encoding feature of the video clip P_j under the first modal information; performing sequence feature learning on the video encoding feature S_i through the first network layer in the first sequence representation learning layer to obtain a learning feature X_i corresponding to the video encoding feature S_i; performing sequence feature learning on the video encoding feature S_j through the second network layer in the first sequence representation learning layer to obtain a learning feature X_j corresponding to the video encoding feature S_j; taking the learning feature X_i and the learning feature X_j as first target learning features, inputting the first target learning features to the first similarity measurement layer, and outputting, by the first similarity measurement layer, a similarity between the first target learning features; and determining the first target similarity result of the target video clip pair based on the similarity between the first target learning features (an illustrative sketch of claims 6 to 8 follows the claims).
- The method according to claim 6, wherein the first network layer comprises a first sub-network layer, a second sub-network layer, a third sub-network layer, and a fourth sub-network layer; and the performing sequence feature learning on the video encoding feature S_i through the first network layer in the first sequence representation learning layer to obtain a learning feature X_i corresponding to the video encoding feature S_i comprises: performing feature conversion on the video encoding feature S_i through the first sub-network layer in the first network layer to obtain a first conversion feature corresponding to the video encoding feature S_i; inputting the first conversion feature to the second sub-network layer, and performing feature conversion on the first conversion feature through the second sub-network layer to obtain a second conversion feature corresponding to the first conversion feature; inputting the second conversion feature to the third sub-network layer, and performing feature conversion on the second conversion feature through the third sub-network layer to obtain a third conversion feature corresponding to the second conversion feature; and inputting the third conversion feature to the fourth sub-network layer, and maximally pooling the third conversion feature through a maximum pooling layer in the fourth sub-network layer to obtain the learning feature X_i corresponding to the video encoding feature S_i.
- The method according to claim 7, wherein the first sub-network layer comprises a first convolutional layer, a second convolutional layer, and a dilated convolution layer; and the performing feature conversion on the video encoding feature S_i through the first sub-network layer in the first network layer to obtain a first conversion feature corresponding to the video encoding feature S_i comprises: convolving, when the video encoding feature S_i is inputted to the first sub-network layer in the first network layer, the video encoding feature S_i through the dilated convolution layer to obtain a first convolution feature corresponding to the video encoding feature S_i; convolving the video encoding feature S_i through the first convolutional layer to obtain a second convolution feature corresponding to the video encoding feature S_i, inputting the second convolution feature to the second convolutional layer, and convolving the second convolution feature through the second convolutional layer to obtain a third convolution feature; and concatenating the first convolution feature and the third convolution feature to obtain the first conversion feature corresponding to the video encoding feature S_i.
- The method according to claim 1, wherein the obtaining (S105) a loop comparison result of the target video clip pair based on a comparison of the first target similarity result with the second target similarity result comprises: comparing the first target similarity result with the second target similarity result; obtaining a loop video result of the target video clip pair in a case that the first target similarity result indicates that video clips in the target video clip pair are similar under the first modal information and the second target similarity result indicates that video clips in the target video clip pair are similar under the second modal information; obtaining a non-loop video result of the target video clip pair in a case that the first target similarity result indicates that video clips in the target video clip pair are not similar under the first modal information or the second target similarity result indicates that video clips in the target video clip pair are not similar under the second modal information; and taking the loop video result or the non-loop video result as the loop comparison result of the target video clip pair.
- The method according to claim 1, further comprising: determining the video type of the to-be-recognized video to be a loop video type in a case that the loop comparison result is a loop video result; and generating loop prompt information based on the loop video type, and returning the loop prompt information to a user terminal; the user terminal being a transmitter of the to-be-recognized video.
- The method according to claim 2, further comprising: acquiring a sample video clip pair for training an initial network model and a sample label of the sample video clip pair; the initial network model comprising a first initial sequence model and a second initial sequence model; the first initial sequence model comprising the first sequence representation learning layer and the first similarity measurement layer; the second initial sequence model comprising the second sequence representation learning layer and the second similarity measurement layer; acquiring a first sample encoding feature of the sample video clip pair under the first modal information and a second sample encoding feature of the sample video clip pair under the second modal information; inputting the first sample encoding feature to the first initial sequence model, and outputting, by the first initial sequence model, a first predicted similarity result of the sample video clip pair after processing by the first sequence representation learning layer and the first similarity measurement layer in the first initial sequence model; inputting the second sample encoding feature to the second initial sequence model, and outputting, by the second initial sequence model, a second predicted similarity result of the sample video clip pair after processing by the second sequence representation learning layer and the second similarity measurement layer in the second initial sequence model; obtaining a prediction label corresponding to a prediction loop result of the sample video clip pair based on a comparison of the first predicted similarity result with the second predicted similarity result; and iteratively training the initial network model based on the prediction label and the sample label, and taking an iteratively trained initial network model as the target network model for performing loop recognition on the to-be-recognized video.
- The method according to claim 11, wherein the sample video clip pairs comprise a positive sample video clip pair and a negative sample video clip pair; the positive sample video clip pair being a video clip pair carrying a first sample label; the negative sample video clip pair being a video clip pair carrying a second sample label; the first sample label and the second sample label belonging to the sample labels; the prediction loop result comprises a first prediction loop result of the positive sample video clip pair and a second prediction loop result of the negative sample video clip pair; the prediction labels comprise a first prediction label corresponding to the first prediction loop result and a second prediction label corresponding to the second prediction loop result; and the iteratively training the initial network model based on the prediction label and the sample label, and taking an iteratively trained initial network model as the target network model for performing loop recognition on the to-be-recognized video comprises: determining, based on a sample proportion between the positive sample video clip pair and the negative sample video clip pair indicated by the sample labels, a loss weight parameter correlated with a model loss function of the initial network model; obtaining a positive sample loss of the positive sample video clip pair based on the first prediction label and the first sample label, and obtaining a negative sample loss of the negative sample video clip pair based on the second prediction label and the second sample label; obtaining a model loss corresponding to the model loss function based on the positive sample loss, the negative sample loss, and the loss weight parameter, and iteratively training the initial network model based on the model loss to obtain a model training result; and taking, in a case that the model training result indicates that the iteratively trained initial network model satisfies a model convergence condition, the initial network model satisfying the model convergence condition as the target network model for performing loop recognition on the to-be-recognized video (see the weighted-loss sketch following the claims).
- The method according to claim 12, further comprising: adjusting, in a case that the model training result indicates that the iteratively trained initial network model does not satisfy the model convergence condition, a model parameter of the initial network model based on the model loss function; and taking the initial network model after the adjustment of the model parameter as a transition network model, iteratively training the transition network model until an iteratively trained transition network model satisfies the model convergence condition, and taking the transition network model satisfying the model convergence condition as the target network model for performing loop recognition on the to-be-recognized video.
- A computer device, comprising a processor and a memory, the processor being connected to the memory; the memory being configured to store a computer program, and the processor being configured to invoke the computer program to cause the computer device to perform the method according to any one of claims 1 to 13.
- A computer-readable storage medium, storing a computer program, the computer program being loaded and executed by a processor to cause a computer device having the processor to perform the method according to any one of claims 1 to 13.
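Illustrative Code Sketches (Non-Limiting)

The following Python sketches illustrate, under stated assumptions, how individual steps of the claimed method could be realized; they are not the claimed implementation. First, the segmentation and pairing of claims 1 and 3: the video is split into N clips based on its duration, and every unordered clip pair (P_i, P_j) with i not equal to j becomes a candidate target video clip pair. A minimal sketch assuming a fixed clip length of 10 seconds (the claims fix no clip length):

```python
from itertools import combinations

def segment_video(video_duration: float, clip_length: float = 10.0):
    """Split a to-be-recognized video of `video_duration` seconds into N
    contiguous clips. `clip_length` is an assumption; the claims only require
    that segmentation be based on the video duration."""
    boundaries, start = [], 0.0
    while start < video_duration:
        end = min(start + clip_length, video_duration)
        boundaries.append((start, end))
        start = end
    return boundaries

def target_clip_pairs(n: int):
    """Enumerate every unordered pair (i, j), i != j, of the N clips; each is
    one candidate target video clip pair in the sense of claim 1."""
    return list(combinations(range(n), 2))

clips = segment_video(47.0)            # 5 clips for a 47-second video
pairs = target_clip_pairs(len(clips))  # C(5, 2) = 10 candidate pairs
```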
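Claim 4 derives a frame extraction parameter from the frame rate of the to-be-processed video frames and uses it to sample the to-be-encoded frames. The uniform one-frame-per-second rule below is an assumption; the claim does not fix the sampling rule:

```python
def frame_indices(frame_rate: float, clip_duration: float,
                  target_fps: float = 1.0):
    """Derive a frame extraction parameter (`step`) from the native frame rate
    and return indices of the to-be-encoded frames. `target_fps` is an assumed
    sampling rate, not part of the claim."""
    step = max(1, round(frame_rate / target_fps))   # frame extraction parameter
    total_frames = int(frame_rate * clip_duration)
    return list(range(0, total_frames, step))

# Example: a 25 fps clip of 10 s keeps frames 0, 25, 50, ..., 225 (10 frames).
```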
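Claims 6 to 8 describe the first target sequence model: twin network layers with the same structure turn the encoding features S_i and S_j into learning features X_i and X_j; the first sub-network layer runs a dilated-convolution branch in parallel with two stacked convolutions and concatenates the branch outputs, and the fourth sub-network layer max-pools. A PyTorch sketch in which channel counts, kernel sizes, the dilation rate, weight sharing between the twin layers, and cosine similarity as the measurement are all assumptions not fixed by the claims:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstSubNetworkLayer(nn.Module):
    """Claim 8: a dilated-convolution branch in parallel with two stacked
    convolutions; the branch outputs are concatenated along channels."""
    def __init__(self, in_ch: int = 128, out_ch: int = 64):
        super().__init__()
        self.dilated = nn.Conv1d(in_ch, out_ch, 3, padding=2, dilation=2)  # first convolution feature
        self.conv1 = nn.Conv1d(in_ch, out_ch, 3, padding=1)                # second convolution feature
        self.conv2 = nn.Conv1d(out_ch, out_ch, 3, padding=1)               # third convolution feature
    def forward(self, s):  # s: (batch, in_ch, seq_len)
        return torch.cat([self.dilated(s), self.conv2(self.conv1(s))], dim=1)

class FirstNetworkLayer(nn.Module):
    """Claim 7: four sub-network layers; the fourth max-pools the third
    conversion feature into the learning feature X_i. The second and third
    sub-network layers are sketched here as plain convolutions."""
    def __init__(self, in_ch: int = 128, hidden: int = 128):
        super().__init__()
        self.sub1 = FirstSubNetworkLayer(in_ch, hidden // 2)
        self.sub2 = nn.Conv1d(hidden, hidden, 3, padding=1)
        self.sub3 = nn.Conv1d(hidden, hidden, 3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)  # fourth sub-network layer (maximum pooling)
    def forward(self, s):
        return self.pool(self.sub3(self.sub2(self.sub1(s)))).squeeze(-1)

class FirstTargetSequenceModel(nn.Module):
    """Claim 6: the first and second network layers have a same network
    structure (sketched as shared weights, which the claim does not require);
    a similarity measurement layer compares X_i with X_j."""
    def __init__(self, in_ch: int = 128):
        super().__init__()
        self.twin = FirstNetworkLayer(in_ch)  # applied to both clips of the pair
    def forward(self, s_i, s_j):
        x_i, x_j = self.twin(s_i), self.twin(s_j)
        return F.cosine_similarity(x_i, x_j, dim=-1)  # first target similarity

model = FirstTargetSequenceModel()
s_i = torch.randn(1, 128, 10)  # video encoding feature S_i (10 sampled frames)
s_j = torch.randn(1, 128, 10)  # video encoding feature S_j
similarity = model(s_i, s_j)   # one score per pair, in [-1, 1]
```

Concatenating a dilated branch with a stacked-convolution branch combines a wide temporal receptive field with local detail, which is one plausible motivation for the structure recited in claim 8.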
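Claim 12 weights the positive- and negative-pair losses by a parameter derived from the sample proportion, compensating for imbalance between positive and negative sample clip pairs. A sketch assuming inverse-frequency weights and binary cross-entropy, neither of which the claim fixes:

```python
import torch
import torch.nn.functional as F

def weighted_pair_loss(pred_pos: torch.Tensor, pred_neg: torch.Tensor,
                       n_pos: int, n_neg: int) -> torch.Tensor:
    """Combine per-class losses with a loss weight parameter determined from
    the positive/negative sample proportion (claim 12). `pred_pos` and
    `pred_neg` are predicted similarity probabilities in [0, 1] for positive
    (first sample label = 1) and negative (second sample label = 0) pairs."""
    total = n_pos + n_neg
    w_pos, w_neg = n_neg / total, n_pos / total   # rarer class weighted higher
    pos_loss = F.binary_cross_entropy(pred_pos, torch.ones_like(pred_pos))
    neg_loss = F.binary_cross_entropy(pred_neg, torch.zeros_like(pred_neg))
    return w_pos * pos_loss + w_neg * neg_loss    # model loss to iterate on
```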
Description
This application claims priority to Chinese Patent Application No. 202110731049.4, filed with the Chinese Patent Office on June 30, 2021 and entitled "VIDEO LOOP RECOGNITION METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM".

FIELD OF THE TECHNOLOGY

The disclosure relates to the field of computer technologies, and in particular, to video loop recognition.

BACKGROUND OF THE DISCLOSURE

Carousel recognition refers to recognizing looped content, that is, video clips that are repeated over and over within a video (video loop recognition), so that video quality can be improved. When an existing image recognition technology is applied to video loop recognition, a computer device with an image recognition function extracts an image feature of each video frame of a to-be-recognized video and matches it, frame by frame, against the image features of subsequent video frames, so as to determine duplicate video clips according to the proportion of counted duplicate frames. For example, Zhang, Y., Shao, L., & Snoek, C. G. M., Repetitive Activity Counting by Sight and Sound, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021 (CVPR), pp. 14070-14079, describes a method for counting repetitive actions in videos by sequentially processing overlapping clips using visual and audio modalities to estimate periodicity and fuse results for a final repetition count. However, when video loop recognition is performed based only on the proportion of counted duplicate frames, the computer device may mistakenly determine duplicate video frames with irregular temporal relations to be loop video frames, so that video clipping and other application scenarios cannot be reliably supported.

SUMMARY

Embodiments of the disclosure provide a video loop recognition method, a computer device, and a storage medium, according to the appended claims, which can improve the accuracy of video loop recognition. According to an aspect, an embodiment of the disclosure provides a video loop recognition method, including: acquiring a target video clip pair of a to-be-recognized video, and determining a first target encoding feature and a second target encoding feature of the target video clip pair; first modal information corresponding to the first target encoding feature being different from second modal information corresponding to the second target encoding feature; acquiring a target network model for performing loop recognition on the to-be-recognized video; the target network model including a first target sequence model correlated with the first modal information and a second target sequence model correlated with the second modal information; inputting the first target encoding feature to the first target sequence model to make the first target sequence model output a first target similarity result of the target video clip pair; inputting the second target encoding feature to the second target sequence model to make the second target sequence model output a second target similarity result of the target video clip pair; and obtaining a loop comparison result of the target video clip pair based on a comparison of the first target similarity result with the second target similarity result; the loop comparison result being used for indicating a video type of the to-be-recognized video.
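As a minimal sketch of the comparison step above: a clip pair yields a loop video result only when the two modality-specific similarity results both indicate similarity, and the video type follows from the pair results. Treating a single looping pair as sufficient to mark the whole video is an assumption; the disclosure states only that the loop comparison result indicates the video type.

```python
def loop_comparison(similar_modal_1: bool, similar_modal_2: bool) -> str:
    """A pair counts as a loop only if it is similar under BOTH the first
    modality (e.g. video frames) and the second (e.g. audio)."""
    return "loop" if similar_modal_1 and similar_modal_2 else "non-loop"

def video_type(pair_results: list) -> str:
    # Assumption: one looping clip pair marks the whole video as a loop video.
    return "loop video" if "loop" in pair_results else "normal video"
```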
According to an aspect, an embodiment of the disclosure provides a video loop recognition apparatus, including: a target encoding feature acquisition module configured to acquire a target video clip pair of a to-be-recognized video, and determine a first target encoding feature and a second target encoding feature of the target video clip pair; first modal information corresponding to the first target encoding feature being different from second modal information corresponding to the second target encoding feature; a target network model acquisition module configured to acquire a target network model for performing loop recognition on the to-be-recognized video; the target network model including a first target sequence model correlated with the first modal information and a second target sequence model correlated with the second modal information; a first target similarity result determination module configured to input the first target encoding feature to the first target sequence model to make the first target sequence model output a first target similarity result of the target video clip pair; a second target similarity result determination module configured to input the second target encoding feature to the second target sequence model to make the second target sequence model output a second target similarity result of the target video clip pair; and a target similarity result comparison module configured to obtain a loop comparison result of the target video clip pair based on a comparison of the first target similarity result with the second target similarity result; the loop comparison result being used for indicating a video type of the to-be-recognized video.