CN-122024323-A - Point-supervised temporal action localization method based on global statistical optimization

Abstract

The invention discloses a point-supervised temporal action localization method based on global statistical optimization, comprising the following steps: first, an input video and manual point-supervision annotations are acquired, and a spatio-temporal feature sequence of the video is extracted with a pre-trained two-stream three-dimensional convolutional neural network; second, an initial pseudo-label set is generated by a one-stage point-supervised action detector; third, an optimization strategy based on global statistics is constructed, in which a reference duration is computed for each action category to serve as a standard; fourth, the initial pseudo-labels are differentially reconstructed, that is, high-confidence pseudo-labels are exempted and retained to prevent negative optimization, while low-confidence pseudo-labels undergo weighted shrinkage, centered on the annotation point and combined with the reference duration, to remove background noise; and fifth, a fully supervised temporal action localization model is trained with the optimized pseudo-labels and outputs the final action localization result. The invention incurs no additional network training cost and effectively improves the accuracy of action localization through global statistical optimization.

Inventors

YANG YANG
XU ZHEN

Assignees

Xi'an Jiaotong University (西安交通大学)

Dates

Publication Date: 20260512
Application Date: 20260203

Claims (6)

1. A point-supervised temporal action localization method based on global statistical optimization, characterized in that, within a point-supervised learning framework, a global statistical optimization strategy requiring no additional trainable parameters is constructed, and the initial pseudo-labels are differentially reconstructed by jointly exploiting the global per-category duration distribution and the geometric anchor property of point annotations, the method specifically comprising the following steps: first, acquiring an untrimmed input video and manual point-supervision annotations of the action instances to be detected in the video, processing the input video with a pre-trained two-stream three-dimensional convolutional neural network, and extracting a feature sequence of the video; second, generating an initial pseudo-label set using a one-stage point-supervised action detector based on the point-supervision annotations, wherein the initial pseudo-label set comprises a predicted category, a start time, an end time, and a confidence score for each action instance; third, constructing a pseudo-label optimization strategy based on global statistics, namely computing a global reference duration for each predicted action category through a global category statistical modeling module, thereby establishing a global duration standard; fourth, applying a geometric reconstruction strategy to the low-confidence pseudo-labels, centering each on its real annotation point and removing background noise in combination with the global reference duration; and fifth, using the optimized pseudo-labels as the supervision signal to train a fully supervised temporal action localization network, performing action localization on untrimmed videos with the trained network, and outputting the final action categories and temporal boundaries.
2. The point-supervised temporal action localization method based on global statistical optimization of claim 1, characterized in that the pre-trained two-stream three-dimensional convolutional neural network is an I3D network, the one-stage point-supervised action detector is a reliability-aware snippet-level action discrimination network, and generating the initial pseudo-label set with the one-stage point-supervised action detector in the second step comprises the following operations: Step 1, inputting the feature sequence of the video into the one-stage point-supervised action detector, and outputting a class activation sequence and a class-agnostic attention sequence; Step 2, applying multi-threshold truncation to the class activation sequence to generate candidate action proposals, and computing the outer-inner contrast score of each candidate action proposal as its confidence score; Step 3, using the position information of the point-supervision annotations, selecting from the candidate action proposals those that cover an annotated point and have the highest confidence score, to form the initial pseudo-label set.
3. The point-supervised temporal action localization method based on global statistical optimization according to claim 1, wherein the third step comprises the following operations: Step 1, traversing the initial pseudo-label set of the entire dataset and aggregating all pseudo-label proposals by action category; Step 2, sorting the pseudo-label proposals of each category by confidence score, and selecting high-quality pseudo-label proposals to build a reference sample set by applying a confidence threshold or a relative ranking threshold; Step 3, computing the statistical mean of the durations of all pseudo-label proposals in the reference sample set, and defining it as the global reference duration of that action category.
4. The point-supervised temporal action localization method based on global statistical optimization according to claim 1, wherein the geometric reconstruction optimization module in the fourth step performs the following operations: Step 1, for each pseudo-label proposal to be optimized, determining whether its time range contains a real action annotation point, and if so, locking the annotation point closest to the proposal center as the geometric anchor point; Step 2, determining whether the confidence score of the pseudo-label proposal satisfies a preset high-confidence condition, and if so, judging the proposal to be a high-confidence pseudo-label and exempting it, so that its original start time and end time remain unchanged; Step 3, if the exemption condition is not satisfied, judging the proposal to be a low-confidence pseudo-label to be optimized, geometrically reconstructing it using the global reference duration and the geometric anchor point, and updating its start time and end time.
5. The point-supervised temporal action localization method based on global statistical optimization of claim 4, wherein the geometric reconstruction in Step 3 is computed as follows: first, a corrected duration is calculated with a weighted fusion formula that combines the original duration of the current pseudo-label proposal and the global reference duration of the current pseudo-label proposal using a preset fusion weight coefficient; second, the corrected duration is shrunk by a preset shrinkage coefficient to obtain a final duration; finally, with the geometric anchor point as the center, the start time and end time are recalculated according to the final duration. [The formulas, their symbols, and the value constraint on the shrinkage coefficient are not reproduced in this text.]
6. The method of claim 4, wherein a proposal is determined to be a high-confidence pseudo-label if its confidence score is higher than a preset absolute high-confidence threshold or higher than a relative high-confidence threshold for its category.
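
As an illustration only, the global statistics and differentiated reconstruction described in claims 3 to 6 might be sketched in Python as follows. The fused-duration form `alpha * original + (1 - alpha) * reference`, the multiplicative shrinkage by `gamma`, the symmetric placement around the anchor, and all names and default values (`Proposal`, `reference_durations`, `refine`, `top_ratio`, `high_conf`, `alpha`, `gamma`) are assumptions, since the formulas themselves are not reproduced in this text. The relative per-category threshold of claim 6 is omitted for brevity, and a proposal that contains no annotation point is simply returned unchanged.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from statistics import mean


@dataclass(frozen=True)
class Proposal:
    """One pseudo-label: category, temporal boundaries, confidence score."""
    label: str
    start: float
    end: float
    score: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def reference_durations(proposals, top_ratio=0.5):
    """Per-category mean duration of the top-scoring proposals (claim 3).

    top_ratio is a hypothetical relative ranking threshold.
    """
    by_class = defaultdict(list)
    for p in proposals:
        by_class[p.label].append(p)
    refs = {}
    for label, group in by_class.items():
        ranked = sorted(group, key=lambda p: p.score, reverse=True)
        keep = max(1, int(len(ranked) * top_ratio))
        refs[label] = mean(p.duration for p in ranked[:keep])
    return refs


def refine(p, points, ref, high_conf=0.8, alpha=0.5, gamma=0.9):
    """Differentiated reconstruction of one proposal (claims 4 to 6).

    points: annotated time points of the video; ref: global reference
    duration of p.label. All default values are illustrative assumptions.
    """
    inside = [t for t in points if p.start <= t <= p.end]
    if not inside:
        return p  # assumption: no anchor available, leave untouched
    center = (p.start + p.end) / 2
    anchor = min(inside, key=lambda t: abs(t - center))
    if p.score >= high_conf:
        return p  # exemption: high-confidence boundaries are kept as-is
    fused = alpha * p.duration + (1 - alpha) * ref  # assumed fusion form
    final = gamma * fused                           # assumed shrinkage form
    return replace(p, start=anchor - final / 2, end=anchor + final / 2)
```

Under these assumptions, with `alpha = 0.5`, `gamma = 0.8`, a reference duration of 3 time units, and an anchor at 24, a low-confidence proposal spanning 20 to 30 would be rebuilt as roughly 21.4 to 26.6, while a high-confidence proposal would be returned unchanged.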

Description

Point-supervised temporal action localization method based on global statistical optimization

Technical Field

The invention belongs to the technical field of computer vision and video understanding, and particularly relates to a point-supervised temporal action localization method based on global statistical optimization.

Background

With the rapid development of internet and multimedia technologies, the volume of video data has grown explosively. Temporal action localization (TAL), a core task in video understanding, aims to identify the categories of the actions that occur in an untrimmed long video and to accurately locate their start and end times. The technology has broad application prospects in intelligent video surveillance, video content retrieval, human-computer interaction, sports event analysis, and other fields. Although traditional fully supervised temporal action localization methods achieve good detection performance, they rely on precise frame-level boundary annotation, whose labor cost is extremely high. To reduce the annotation cost, point-supervised temporal action localization (P-TAL), which requires only one annotated time point per action instance, has gradually become a research hotspot. In point supervision, the early mainstream paradigm follows a two-stage "classification-guided localization" framework.
SF-Net, proposed by Ma et al. (ECCV 2020), is a cornerstone of the field: it established a pseudo-label mining mechanism under single-frame supervision, using each annotated point as a seed and mining adjacent frames to generate pseudo action labels. To address the loss of action completeness under point supervision, Lee et al. (ICCV 2021) introduced a greedy algorithm that searches the candidate sequences for an optimal pseudo sequence covering complete action instances, so as to alleviate over- and under-localization. In recent years, to further break through performance bottlenecks, researchers have turned to reliability propagation and temporal saliency information. HR-Pro, proposed by Zhang et al. (AAAI 2024), constructs a hierarchical reliability propagation framework that, for the first time, systematically exploits the inherent reliability of point annotations at different levels, introducing an online-updated memory at the snippet level and fine-tuning boundaries with a regression head at the instance level. In addition, to address the "confidence-quality misalignment" bottleneck in the traditional multiple-instance learning framework, Xia et al. (CVPR 2024) proposed realigning confidence with temporal saliency information, attempting to calibrate the scoring quality of action proposals by introducing a saliency prior. Although these methods have steadily improved detection performance, the quality of the pseudo-labels remains a significant bottleneck in practice. To raise the performance ceiling, a new "pseudo-label refinement" paradigm, represented by PseR (IEEE TMM 2025), has recently emerged.
PseR proposes a three-stage framework comprising seed proposal generation, proposal propagation, and a refinement network: it generates proposal bags by perturbing the scale and center of the initial seeds and trains an additional refinement network to select better pseudo-labels. Although this paradigm outperforms earlier two-stage frameworks, existing point-supervision techniques still have the following significant drawbacks:

1. The initial pseudo-labels suffer from over-completeness. Taking the seed generation stages of HR-Pro and PseR as examples, the initial proposals produced by multi-threshold truncation are often severely over-complete. Lacking explicit boundary constraints, the proposals generated by the model often last far longer than the real action instances and include excessive background segments.
2. Deep-learning-based refinement carries the risks of negative optimization and high cost. PseR and similar methods adopt a deep-learning-based refinement strategy, which requires building and specially training a complex refinement network (comprising a selection module and a sorting module). This purely data-driven "self-learning" approach lacks deterministic prior constraints: it generates a large number of candidates by randomly perturbing the seed proposals and then relies on network-predicted scores for screening. When the quality of the input initial pseudo-labels is already high, the random perturbation and screening may introduce bias and degrade pseudo-label quality. In addition, training additional refi