CN-122024126-A - Video time sequence event proposal method, system, equipment and medium


Abstract

The invention discloses a video time sequence event proposal method, system, equipment and medium, belonging to the technical field of video analysis. The method comprises: obtaining an input video and feature-encoding it to obtain a time sequence feature sequence; extracting event features according to the time sequence feature sequence in combination with a learnable time query; predicting with the event features to obtain prediction parameters; establishing an initial event proposal through calculation of the prediction parameters; training a neural network model under the constraint of a loss function; and evaluating the initial event proposal with the trained neural network model to screen out a final event proposal. The method addresses three problems of prior-art video time sequence event proposal methods: event feature extraction is not precise enough; logical contradictions between the event center point and the start-stop boundary readily arise when the prediction parameters are calculated independently; and the screening result is subjective.

Inventors

LU XIANG
SU YANG
TIAN YUEWEI
YU XUAN
FU JUN
ZOU WENQIANG
LI KUN

Assignees

Guizhou Power Grid Co., Ltd. (贵州电网有限责任公司)

Dates

Publication Date: 20260512
Application Date: 20251226

Claims (10)

1. A video timing event proposal method, comprising: acquiring an input video, and performing feature coding on the input video to obtain a time sequence feature sequence; extracting event features according to the time sequence feature sequence in combination with a learnable time query; predicting with the event features to obtain prediction parameters; establishing an initial event proposal through calculation of the prediction parameters; and training a neural network model under the constraint of a loss function, evaluating the initial event proposal by using the trained neural network model, and screening out a final event proposal.
2. The video timing event proposal method according to claim 1, characterized in that the step of feature-encoding the input video comprises: extracting spatial features of the input video to obtain a frame-level feature sequence; and performing time sequence compression on the frame-level feature sequence to obtain the time sequence feature sequence.
3. The video timing event proposal method as set forth in claim 2, wherein the step of extracting the event features comprises: preprocessing the learnable time query; and interacting the preprocessed learnable time query with the time sequence feature sequence through a cross attention mechanism to acquire the event features.
4. The video timing event proposal method as set forth in claim 3, wherein the step of acquiring the prediction parameters comprises: performing a linear transformation on the event features through a prediction head to output initial prediction parameters; and processing the initial prediction parameters through an activation function to obtain the prediction parameters.
5. The video timing event proposal method as set forth in claim 4, wherein the step of establishing an initial event proposal comprises: the prediction parameters comprise a center point, a duration, a left offset ratio and a right offset ratio; constraining the center point, the duration, the left offset ratio and the right offset ratio by constraint conditions; and calculating the initial event proposal through a boundary calculation formula according to the constrained center point, duration, left offset ratio and right offset ratio.
6. The video timing event proposal method as claimed in claim 1, wherein the step of screening out the final event proposal comprises: training the neural network model through the loss function; calculating, by the trained neural network model and according to preset evaluation rules, a consistency score, a duration rationality score and a boundary quality score of the initial event proposal; weighting and fusing the consistency score, the duration rationality score and the boundary quality score to obtain a comprehensive quality score of the initial event proposal; and sorting all the initial event proposals by comprehensive quality score, and selecting the top M initial event proposals as final event proposals.
7. The video timing event proposal method as set forth in claim 6, wherein said loss function comprises: a base loss, a consistency loss, and a rationality loss; the base loss quantifies the degree of overlap between the initial event proposal and the real event boundary by calculating their generalized intersection over union (GIoU); the consistency loss constrains the geometric and logical consistency between the center point in the prediction parameters and the start-stop boundary of the initial event proposal, forcing the center point to be the arithmetic mean of the start-stop boundary; and the rationality loss constrains the start-stop boundary of the initial event proposal and the duration.
8. A video timing event proposal system applying the video timing event proposal method as claimed in any one of claims 1 to 7, comprising: a time sequence feature acquisition module, which acquires an input video and performs feature coding on the input video to obtain a time sequence feature sequence; an event feature extraction module, which extracts event features according to the time sequence feature sequence in combination with a learnable time query; a prediction parameter construction module, which predicts with the event features to obtain prediction parameters; an initial event proposal construction module, which constructs an initial event proposal through calculation of the prediction parameters; and a final event proposal construction module, which trains a neural network model under the constraint of a loss function, evaluates the initial event proposal by using the trained neural network model, and screens out a final event proposal.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of a video timing event proposal method according to any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of a video timing event proposal method according to any one of claims 1 to 7.
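
As an illustration of claims 5 to 7, the following is a minimal Python sketch of how a proposal might be computed from the four prediction parameters, how the three loss terms might be combined, and how the top M proposals might be selected. The claims do not state the boundary calculation formula, the constraint conditions, the form of the rationality loss, or the fusion weights, so the formula start = center - left ratio × duration, end = center + right ratio × duration, the clamping to a normalized [0, 1] timeline, the loss weights, and all function names are assumptions made for illustration only.

```python
def clamp(x, lo=0.0, hi=1.0):
    """Restrict x to [lo, hi]; stands in for the claimed constraint conditions (assumed)."""
    return max(lo, min(hi, x))


def proposal_bounds(center, duration, left_ratio, right_ratio):
    """ASSUMED boundary formula on a normalized [0, 1] timeline."""
    c, d = clamp(center), clamp(duration)
    left, right = clamp(left_ratio), clamp(right_ratio)
    return clamp(c - left * d), clamp(c + right * d)


def giou_1d(pred, truth):
    """Generalized IoU of two intervals: IoU - (hull - union) / hull."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    hull = max(pred[1], truth[1]) - min(pred[0], truth[0])
    iou = inter / union if union > 0.0 else 0.0  # guard zero-length intervals
    if hull <= 0.0:
        return iou
    return iou - (hull - union) / hull


def training_loss(params, truth, w_cons=1.0, w_rat=1.0):
    """Base + consistency + rationality; weights and rationality form are assumed."""
    center, duration = clamp(params[0]), clamp(params[1])
    start, end = proposal_bounds(*params)
    base = 1.0 - giou_1d((start, end), truth)
    # Consistency: the center should be the arithmetic mean of start and end.
    consistency = abs(center - (start + end) / 2.0)
    # Rationality (assumed): the span should match the duration and start <= end.
    rationality = abs((end - start) - duration) + max(0.0, start - end)
    return base + w_cons * consistency + w_rat * rationality


def select_top_m(proposals, scores, weights=(0.4, 0.3, 0.3), m=2):
    """scores[i] = (consistency, duration rationality, boundary quality) of proposal i."""
    fused = [sum(w * s for w, s in zip(weights, triple)) for triple in scores]
    ranked = sorted(range(len(proposals)), key=lambda i: fused[i], reverse=True)
    return [proposals[i] for i in ranked[:m]]


# Hypothetical prediction parameters and scores, for illustration only.
candidates = [
    proposal_bounds(0.5, 0.4, 0.5, 0.5),
    proposal_bounds(0.2, 0.2, 0.5, 0.5),
    proposal_bounds(0.8, 0.3, 0.25, 0.75),
]
scores = [(0.9, 0.8, 0.7), (0.4, 0.5, 0.6), (0.7, 0.9, 0.8)]
final = select_top_m(candidates, scores, m=2)
```

Under the assumed formula, the center equals the midpoint of the boundary only when the left and right offset ratios are equal, which is the situation the consistency term pushes toward. In the claimed method the three scores are produced by the trained model; here they are supplied by hand.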

Description

Video time sequence event proposal method, system, equipment and medium

Technical Field

The invention relates to the technical field of video analysis, and in particular to a video time sequence event proposal method, system, equipment and medium.

Background

Time sequence event proposal is a core task in the field of video understanding; its goal is to accurately locate the time segments that contain specific semantic information in long, untrimmed video. The main problems of the prior art are that center point and boundary predictions are inconsistent, that multi-objective optimization suffers from gradient conflicts, and that event screening depends on heuristic post-processing. Taking representative methods such as PDVC and UEDVC as examples: although their event proposal modules achieve end-to-end optimization, they learn the distribution of proposal durations poorly and tend to generate segments of uniform length, which departs from the diversity of real-world events; the absolute error of boundary prediction is large, so the positioning result is not accurate enough; and the event screening strategy is simple, so the finally selected event set cannot be guaranteed to have the best quality and coverage. A technical solution is therefore needed that ensures the mathematical consistency of event boundary prediction and provides an effective joint optimization strategy, thereby achieving high-quality event screening without post-processing.

Disclosure of Invention

The present invention has been made in view of the above problems. The technical problem solved by the invention is therefore how to extract the key segments of a long video with a video time sequence event proposal method while improving the efficiency of video time sequence event positioning without relying on heuristic processing.
The technical scheme comprises: obtaining an input video and feature-encoding it to obtain a time sequence feature sequence; extracting event features according to the time sequence feature sequence in combination with a learnable time query; predicting with the event features to obtain prediction parameters; establishing an initial event proposal through calculation of the prediction parameters; training a neural network model under the constraint of a loss function; and evaluating the initial event proposal with the trained neural network model to screen out the final event proposal.

In the video time sequence event proposal method, the step of feature-encoding the input video comprises extracting spatial features of the input video to obtain a frame-level feature sequence, and performing time sequence compression on the frame-level feature sequence to obtain the time sequence feature sequence.

In the video time sequence event proposal method, the step of extracting the event features comprises preprocessing the learnable time query, and interacting the preprocessed learnable time query with the time sequence feature sequence through a cross attention mechanism to obtain the event features.

This has the following advantages: preprocessing the learnable time query reduces the random interference of the initial query vector; completing the information interaction between the query and the time sequence features through the cross attention mechanism helps the time query accurately capture the key information in the time sequence features that is strongly related to events, so that the extracted event features better match real event semantics and time distribution rules; and the flow is simplified, which safeguards execution efficiency.
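
The encoding and query interaction described above can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions: the text does not specify the compression operator, the query preprocessing, or the attention configuration, so average pooling, layer normalization, and single-head scaled dot-product attention without learned projections are stand-ins chosen for brevity, and the random vectors replace real frame features and trained queries.

```python
import math
import random


def temporal_pool(frames, stride):
    """Average each group of `stride` consecutive frame vectors (assumed compression)."""
    pooled = []
    for i in range(0, len(frames), stride):
        group = frames[i:i + stride]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (assumed query preprocessing)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Each query attends over all time steps; keys and values are the features."""
    dim = len(features[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        w = softmax(scores)
        weights.append(w)
        outputs.append([
            sum(w[t] * features[t][k] for t in range(len(features)))
            for k in range(dim)
        ])
    return outputs, weights


random.seed(0)
# Hypothetical frame-level features: 12 frames, 8 dimensions each.
frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(12)]
temporal = temporal_pool(frames, stride=3)  # 4 compressed time steps
# Three "learnable" queries, randomly initialized here instead of trained.
queries = [layer_norm([random.gauss(0, 1) for _ in range(8)]) for _ in range(3)]
event_features, attn = cross_attention(queries, temporal)
```

Each row of `event_features` is a weighted average of the compressed time steps, with one row per query, so the number of queries bounds the number of candidate events.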
In the video time sequence event proposal method, the step of obtaining the prediction parameters comprises performing a linear transformation on the event features through a prediction head to output initial prediction parameters, and processing the initial prediction parameters through an activation function to obtain the prediction parameters.

In the video time sequence event proposal method, the step of establishing an initial event proposal comprises: the prediction parameters comprise a center point, a duration, a left offset ratio and a right offset ratio; constraining the center point, the duration, the left offset ratio and the right offset ratio by constraint conditions; and calculating the initial event proposal from the constrained center point, duration, left offset ratio and right offset ratio through a boundary calculation formula.

This has the following advantages: specifying the composition of the prediction parameters provides a clear data basis for event boundary calculation; limiting the parameter values with constraint conditions avoids boundary logic contradictions caused by unreasonable parameters; and the initial event proposal is generated by combining the parameters with a boundary calculation formula, timing sequence positioning accuracy of the propo