CN-114973046-B - Video analysis method, device, processor and storage medium

CN114973046B

Abstract

The invention discloses a video analysis method, a video analysis device, a processor, and a storage medium. The method comprises: acquiring a video to be analyzed; inputting the video into a feature processing model for processing to obtain features of the global context of the video and features of the local context of the video; and determining target class segments in the video based on those global and local context features. The method and device address the technical problem of low accuracy in determining the video clips of interest to a user when those clips are determined only from the local context features of the video, and thereby improve that accuracy.

Inventors

  • Chang Shuning
  • Wang Pichao
  • Wang Fan
  • Li Hao

Assignees

  • Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)

Dates

Publication Date
2026-05-12
Application Date
2021-02-22

Claims (19)

  1. A method of analyzing video, comprising: acquiring a video to be analyzed; inputting the video into a feature processing model for processing to obtain features of the global context of the video and features of the local context of the video; and determining a target class segment in the video based on the features of the global context of the video and the features of the local context of the video, wherein the target class segment is screened from a plurality of candidate video segments contained in the video according to the probability values of the candidate video segments being the target class segment, the probability value of any one candidate video segment being the target class segment is determined by the product of a start point probability value, an end point probability value, a first confidence score and a second confidence score corresponding to the candidate video segment, the start point probability value is used to characterize the probability that the candidate video segment is a start point, the end point probability value is used to characterize the probability that the candidate video segment is an end point, the first confidence score and the second confidence score are obtained by scoring the confidence of the candidate video segment with different classifiers, the start point probability value, the end point probability value, the first confidence score and the second confidence score are determined based on the features of the global context of the video and the features of the local context of the video, and the plurality of candidate video segments are obtained by dividing the video according to preset lengths.
  2. The method of claim 1, wherein inputting the video into the feature processing model for processing to obtain the features of the global context of the video and the features of the local context of the video comprises: dividing the video into a plurality of video segments; encoding the plurality of video segments to obtain a video feature matrix; and inputting the video feature matrix into the feature processing model for processing to obtain the features of the global context of the video and the features of the local context of the video.
  3. The method of claim 1, wherein determining the target class segment in the video based on the features of the global context of the video and the features of the local context of the video comprises: determining probability values of a plurality of candidate video segments in the video being the target class segment based on the features of the global context of the video and the features of the local context of the video; and screening the target class segment from the plurality of candidate video segments according to those probability values.
  4. The method of claim 3, characterized in that the method further comprises: receiving a request instruction input by a target object for acquiring the target class segment and pushing the target class segment to the target object in response to the request instruction; or pushing the target class segment to the target object when the target object is detected to play the video for the first time.
  5. The method of claim 4, wherein after pushing the target class segment to the target object, the method further comprises: obtaining a modification result if the target object is detected to have modified the target class segment; and modifying the feature processing model based on the modification result to update the feature processing model.
  6. The method of claim 2, wherein the feature processing model comprises a global feature extraction module and a local feature extraction module, and wherein inputting the video into the feature processing model for processing to obtain the features of the global context of the video and the features of the local context of the video comprises: inputting the video feature matrix into the global feature extraction module for global context feature extraction to obtain the features of the global context of the video; and inputting the video feature matrix into the local feature extraction module for local context feature extraction to obtain the features of the local context of the video.
  7. The method of claim 2, wherein the feature processing model comprises a global feature extraction module and a local feature extraction module, and wherein inputting the video into the feature processing model for processing to obtain the features of the global context of the video and the features of the local context of the video comprises: performing dimension reduction on the video feature matrix to obtain a dimension-reduced feature matrix; inputting the dimension-reduced feature matrix into the global feature extraction module for global context feature extraction to obtain the features of the global context of the video; and inputting the dimension-reduced feature matrix into the local feature extraction module for local context feature extraction to obtain the features of the local context of the video.
  8. The method of claim 7, wherein inputting the dimension-reduced feature matrix into the global feature extraction module for global context feature extraction to obtain the features of the global context of the video comprises: passing the dimension-reduced feature matrix through three different linear layers to obtain a linear projection Q, a linear projection K and a linear projection V; obtaining features carrying the global context through the linear projections Q, K and V; and inputting the features carrying the global context into a feedforward neural network for processing to obtain the features of the global context of the video.
  9. The method of claim 7 or 8, wherein before inputting the video into the feature processing model, the method further comprises: acquiring a video feature sample vector; and inputting the video feature sample vector into the global feature extraction module for learning and training, while supervising the global feature extraction module with a behavior classification loss function so as to drive the global feature extraction module to capture the features of the global context.
  10. The method of claim 7, wherein the local feature extraction module comprises two graph convolution layers, the two graph convolution layers comprising a first graph convolution layer and a second graph convolution layer, and wherein inputting the dimension-reduced feature matrix into the local feature extraction module for processing to obtain the features of the local context of the video comprises: processing the dimension-reduced feature matrix through the first graph convolution layer to obtain features of a first local context; and inputting the features of the first local context and the dimension-reduced feature matrix into the second graph convolution layer, the output of the second graph convolution layer giving the features of the local context of the video.
  11. The method of claim 3, wherein determining the probability values of the plurality of candidate video segments in the video being the target class segment based on the features of the global context of the video and the features of the local context of the video comprises: fusing the features of the global context of the video and the features of the local context of the video to obtain fused features; processing the fused features through a boundary point prediction module to obtain first probability values of the plurality of candidate video segments being the target class segment; processing the fused features through a confidence prediction module to obtain second probability values of the plurality of candidate video segments being the target class segment; and determining the probability values of the plurality of candidate video segments being the target class segment based on the first probability values and the second probability values.
  12. A method of analyzing video, comprising: receiving a service call request sent by a client, wherein the service call request carries a video to be analyzed; in response to the service call request, inputting the video into a feature processing model in a server for processing to obtain features of the global context of the video and features of the local context of the video; determining a target class segment in the video based on the features of the global context of the video and the features of the local context of the video, wherein the target class segment is screened from a plurality of candidate video segments contained in the video according to the probability values of the candidate video segments being the target class segment, the probability value of any one candidate video segment being the target class segment is determined by the product of a start point probability value, an end point probability value, a first confidence score and a second confidence score corresponding to the candidate video segment, the start point probability value is used to characterize the probability that the candidate video segment is a start point, the end point probability value is used to characterize the probability that the candidate video segment is an end point, the first confidence score and the second confidence score are obtained by scoring the confidence of the candidate video segment with different classifiers, the start point probability value, the end point probability value, the first confidence score and the second confidence score are determined based on the features of the global context of the video and the features of the local context of the video, and the plurality of candidate video segments are obtained by dividing the video according to preset lengths; and outputting the target class segment.
  13. A method of analyzing video, comprising: acquiring a live video; inputting the live video into a feature processing model for processing to obtain features of the global context of the live video and features of the local context of the live video; and determining a target class segment in the live video based on the features of the global context of the live video and the features of the local context of the live video, wherein the target class segment is screened from a plurality of candidate video segments contained in the live video according to the probability values of the candidate video segments being the target class segment, the probability value of any one candidate video segment being the target class segment is determined by the product of a start point probability value, an end point probability value, a first confidence score and a second confidence score corresponding to the candidate video segment, the start point probability value is used to characterize the probability that the candidate video segment is a start point, the end point probability value is used to characterize the probability that the candidate video segment is an end point, the first confidence score and the second confidence score are obtained by scoring the confidence of the candidate video segment with different classifiers, the start point probability value, the end point probability value, the first confidence score and the second confidence score are determined based on the features of the global context of the live video and the features of the local context of the live video, and the plurality of candidate video segments are obtained by dividing the live video according to preset lengths.
  14. The method of claim 13, wherein after determining the target class segment in the live video, the method further comprises: clipping the target class segment from the live video; and publishing the target class segment on a target application to promote the target object described in the target class segment.
  15. A video analysis apparatus, comprising: a first acquisition unit configured to acquire a video to be analyzed; a first processing unit configured to input the video into a feature processing model for processing to obtain features of the global context of the video and features of the local context of the video; and a first determining unit configured to determine a target class segment in the video based on the features of the global context of the video and the features of the local context of the video, wherein the target class segment is screened from a plurality of candidate video segments contained in the video according to the probability values of the candidate video segments being the target class segment, the probability value of any one candidate video segment being the target class segment is determined by the product of a start point probability value, an end point probability value, a first confidence score and a second confidence score corresponding to the candidate video segment, the start point probability value is used to characterize the probability that the candidate video segment is a start point, the end point probability value is used to characterize the probability that the candidate video segment is an end point, the first confidence score and the second confidence score are obtained by scoring the confidence of the candidate video segment with different classifiers, the start point probability value, the end point probability value, the first confidence score and the second confidence score are determined based on the features of the global context of the video and the features of the local context of the video, and the plurality of candidate video segments are obtained by dividing the video according to preset lengths.
  16. A video analysis apparatus, comprising: a receiving unit configured to receive a service call request sent by a client, wherein the service call request carries a video to be analyzed; a second processing unit configured to, in response to the service call request, input the video into a feature processing model in a server for processing to obtain features of the global context of the video and features of the local context of the video; and an output unit configured to output a target class segment, wherein the target class segment is screened from a plurality of candidate video segments contained in the video according to the probability values of the candidate video segments being the target class segment, the probability value of any one candidate video segment being the target class segment is determined by the product of a start point probability value, an end point probability value, a first confidence score and a second confidence score corresponding to the candidate video segment, the start point probability value is used to characterize the probability that the candidate video segment is a start point, the end point probability value is used to characterize the probability that the candidate video segment is an end point, the first confidence score and the second confidence score are obtained by scoring the confidence of the candidate video segment with different classifiers, the start point probability value, the end point probability value, the first confidence score and the second confidence score are determined based on the features of the global context of the video and the features of the local context of the video, and the plurality of candidate video segments are obtained by dividing the video according to preset lengths.
  17. A video analysis apparatus, comprising: a second acquisition unit configured to acquire a live video; a third processing unit configured to input the live video into a feature processing model for processing to obtain features of the global context of the live video and features of the local context of the live video; and a second determining unit configured to determine a target class segment in the live video based on the features of the global context of the live video and the features of the local context of the live video, wherein the target class segment is screened from a plurality of candidate video segments contained in the live video according to the probability values of the candidate video segments being the target class segment, the probability value of any one candidate video segment being the target class segment is determined by the product of a start point probability value, an end point probability value, a first confidence score and a second confidence score corresponding to the candidate video segment, the start point probability value is used to characterize the probability that the candidate video segment is a start point, the end point probability value is used to characterize the probability that the candidate video segment is an end point, the first confidence score and the second confidence score are obtained by scoring the confidence of the candidate video segment with different classifiers, the start point probability value, the end point probability value, the first confidence score and the second confidence score are determined based on the features of the global context of the live video and the features of the local context of the live video, and the plurality of candidate video segments are obtained by dividing the live video according to preset lengths.
  18. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the method of any one of claims 1 to 14.
  19. A processor for running a program, wherein the program, when run, performs the method of any one of claims 1 to 14.
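
The sketches below are illustrative readings of the mechanisms recited in the claims above, not the patented implementation. Claim 8 describes the global feature extraction module as three different linear layers producing projections Q, K and V, followed by a feedforward network. A minimal self-attention sketch of that, in PyTorch; all dimensions, the scaling factor, and the absence of residual connections are assumptions:

```python
# Illustrative sketch of claim 8 only; dimensions and layout are assumed.
import torch
import torch.nn as nn

class GlobalContextModule(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # three different linear layers, as recited in claim 8
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # feedforward network that produces the final global features
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_clips, dim), the dimension-reduced video feature matrix
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.T / (q.shape[-1] ** 0.5), dim=-1)
        ctx = attn @ v           # every clip attends to every other clip
        return self.ffn(ctx)     # features of the global context
```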
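
Claim 10 recites two graph convolution layers for the local context, with the dimension-reduced matrix fed into the second layer alongside the first layer's output. One possible realisation; the temporal-neighbour adjacency and the concatenation are assumptions, since the patent does not specify how the graph is built:

```python
# Illustrative sketch of claim 10 only; graph construction is assumed.
import torch
import torch.nn as nn

def temporal_adjacency(t: int) -> torch.Tensor:
    # each clip is linked to itself and its immediate temporal neighbours
    a = torch.eye(t)
    idx = torch.arange(t - 1)
    a[idx, idx + 1] = 1.0
    a[idx + 1, idx] = 1.0
    return a / a.sum(dim=1, keepdim=True)   # row-normalised adjacency

class LocalContextModule(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gc1 = nn.Linear(dim, dim)       # first graph convolution layer
        self.gc2 = nn.Linear(2 * dim, dim)   # second graph convolution layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = temporal_adjacency(x.shape[0])
        h1 = torch.relu(self.gc1(a @ x))     # features of the first local context
        # claim 10 feeds both the first local features and the
        # dimension-reduced matrix into the second layer
        return self.gc2(a @ torch.cat([h1, x], dim=-1))
```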
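
Claim 11 fuses the global and local features and runs them through a boundary point prediction module and a confidence prediction module. A sketch under the assumption that fusion is concatenation and that the two confidence classifiers (clf_a, clf_b, both hypothetical names) are small linear heads:

```python
# Illustrative sketch of claim 11 only; fusion and head shapes are assumed.
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.boundary = nn.Linear(dim, 2)   # start / end probability per clip
        self.clf_a = nn.Linear(dim, 1)      # first confidence classifier
        self.clf_b = nn.Linear(dim, 1)      # second confidence classifier

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor):
        fused = torch.relu(self.fuse(torch.cat([global_feat, local_feat], -1)))
        b = torch.sigmoid(self.boundary(fused))
        p_start, p_end = b[:, 0], b[:, 1]   # boundary point predictions
        return fused, p_start, p_end
```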
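
Finally, claims 1 and 3 state the screening rule itself: a candidate segment's probability of being the target class segment is the product of its start point probability, end point probability and two confidence scores, and the target class segment is screened from the candidates by these values. The product below is the formula from claim 1; the per-segment pooling, the hypothetical clf_a/clf_b heads from the sketch above, and the top-k cut-off are assumptions:

```python
# Illustrative sketch of the screening rule in claims 1 and 3.
import torch

def screen_segments(p_start, p_end, fused, clf_a, clf_b, top_k=5):
    t = p_start.shape[0]
    scored = []
    for s in range(t):
        for e in range(s + 1, t):
            seg = fused[s:e + 1].mean(dim=0)     # pooled segment feature
            conf_a = torch.sigmoid(clf_a(seg))   # first confidence score
            conf_b = torch.sigmoid(clf_b(seg))   # second confidence score
            # probability of being a target class segment: the product of
            # start probability, end probability and both confidence scores
            score = p_start[s] * p_end[e] * conf_a * conf_b
            scored.append(((s, e), float(score)))
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:top_k]                        # screened target segments
```

Under these assumptions the pieces chain together as: fused, p_start, p_end = heads(global_feat, local_feat), then screen_segments(p_start, p_end, fused, heads.clf_a, heads.clf_b).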

Description

Video analysis method, device, processor and storage medium

Technical Field

The present invention relates to the technical field of video analysis processing, and in particular to a video analysis method, a video analysis device, a processor, and a storage medium.

Background

The methods commonly used in the prior art to determine the video clips of interest to a user are temporal sliding windows and clip-by-clip probability prediction of the clips of interest. These methods can locate the video clips of interest well in a simple, ideal setting, but in practice the video clips of interest often contain a large amount of noise and irrelevant background frames, which may be produced by the behaviors themselves or by the camera work during shooting. Current technology ignores these problems, so the accuracy of determining the video clips of interest to the user suffers. For example, a video may include a cricket segment that the user needs to locate. The segment contains a cheering scene after a score; current technology does not account for such interference, and relying only on the local context cannot recognize that the cheering scene should be part of the action to be located, which again affects the accuracy of determining the video segment of interest. No effective solution to these problems has been proposed so far.

Disclosure of Invention

The embodiments of the invention provide a video analysis method, a device, a processor and a storage medium, which at least solve the technical problem of low accuracy in determining the video clips of interest to a user when those clips are determined only from the local context features of the video. According to one aspect of the embodiments of the invention, a video analysis method is provided, comprising: acquiring a video to be analyzed; inputting the video into a feature processing model for processing to obtain features of the global context of the video and features of the local context of the video; and determining target class segments in the video based on those global and local context features. Further, inputting the video into the feature processing model to obtain the features of the global context of the video and the features of the local context of the video comprises: dividing the video into a plurality of video segments; encoding the plurality of video segments to obtain a video feature matrix; and inputting the video feature matrix into the feature processing model for processing to obtain the features of the global context of the video and the features of the local context of the video. A hedged sketch of this division-and-encoding step follows.
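
A minimal sketch of the division-and-encoding step just described, assuming a hypothetical pretrained clip encoder encode_clip() that maps a fixed-length clip to a feature vector; the clip length and the handling of an incomplete tail clip are also assumptions:

```python
# Illustrative sketch only; encode_clip and clip_len are assumed.
import torch

def video_to_feature_matrix(frames: torch.Tensor, encode_clip,
                            clip_len: int = 16) -> torch.Tensor:
    """frames: (num_frames, C, H, W) -> video feature matrix (num_clips, dim)."""
    rows = []
    for clip in frames.split(clip_len):     # consecutive fixed-length clips
        if clip.shape[0] == clip_len:       # drop an incomplete tail clip
            rows.append(encode_clip(clip))  # one feature vector per clip
    return torch.stack(rows)                # rows form the video feature matrix
```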
Further, determining the target class segment in the video based on the features of the global context of the video and the features of the local context of the video comprises: determining probability values for a plurality of candidate video segments in the video being the target class segment based on those features; and screening the target class segment from the plurality of candidate video segments according to those probability values. Further, the method comprises: receiving a request instruction input by a target object to acquire the target class segment and pushing the target class segment to the target object in response to the request instruction, or pushing the target class segment to the target object when the target object is detected to play the video for the first time. Further, after pushing the target class segment to the target object, the method further comprises: obtaining a modification result if the target object is detected to have modified the target class segment; and modifying the feature processing model based on the modification result so as to update the feature processing model. Further, the feature processing model comprises a global feature extraction module and a local feature extraction module, and inputting the video into the feature processing model to obtain the features of the global context of the video and the features of the local context of the video comprises: inputting the video feature matrix into the global feature extraction module for global context feature extraction to obtain the features of the global context of the video; and inputting the video feature matrix into the local feature extraction module for local context feature extraction to obtain the features of the local context of the video. The feature processing model compri