
CN-122019834-A - Unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment

CN122019834A

Abstract

The invention discloses an unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment. The method is built on a video retrieval training system based on temporal anchor mining and point-supervised learning, the system comprising a key anchor extraction module, a semantic-alignment description generation module and a point-supervised learning enhancement module. The key anchor extraction module extracts key temporal anchors from the original unannotated video sequence; the semantic-alignment description generation module constructs pseudo-label triplets with which a weakly supervised video segment retrieval model is trained; during training, the point-supervised learning enhancement module constructs a point-supervised contrastive learning objective that further optimizes the model, so that the finally trained weakly supervised video segment retrieval model outputs the start and end time boundaries of the segment related to the semantics of the query sentence. The method effectively overcomes the problems of low pseudo-label quality, fuzzy localization boundaries and inconsistent semantic spaces in the unsupervised setting, and significantly improves the localization precision and generalization ability of video segment retrieval.

Inventors

  • Huang Shuangping
  • Ke Quhui

Assignees

  • South China University of Technology (华南理工大学)

Dates

Publication Date
2026-05-12
Application Date
2026-02-07

Claims (7)

  1. An unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment, characterized by comprising the following steps: Step 1, constructing a video retrieval training system based on temporal anchor mining and point-supervised learning, the system comprising a key anchor extraction module, a semantic-alignment description generation module and a point-supervised learning enhancement module; Step 2, inputting an original video sequence, analyzing the temporal saliency of the video frames through the key anchor extraction module, generating a key-frame score curve, and extracting the peak frames of the score curve as key temporal anchors; Step 3, through the semantic-alignment description generation module, generating an original action description with a visual language model from each key temporal anchor and its context frame sequence, performing semantic correction and rewriting with a large language model to obtain a standard query sentence, and constructing pseudo-label triplets for training a weakly supervised video segment retrieval model; Step 4, constructing a point-supervised contrastive learning objective through the point-supervised learning enhancement module, the objective providing additional point-level supervision for the weakly supervised video segment retrieval model; and Step 5, inputting the video sequence to be queried and the user's query sentence into the trained weakly supervised video segment retrieval model, performing cross-modal semantic matching with the trained model, and outputting the start and end time boundaries of the segment in the queried video sequence that is semantically related to the query sentence.
  2. The unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment according to claim 1, wherein the specific steps of extracting the key temporal anchors with the key anchor extraction module in Step 2 are as follows (an illustrative code sketch follows the claims): Step 201, constructing a positive prompt representing dynamic salient events and a negative prompt representing the static background environment, forming a logically opposed pair of judgment texts; Step 202, inputting the video sequence frame by frame into a visual language model and obtaining each frame's response probabilities to the positive prompt and the negative prompt, recorded as the positive confidence score and the negative confidence score respectively; Step 203, computing the final key-frame score of a single frame from the difference between the positive confidence score and the negative confidence score, obtaining a temporal score curve that varies over the video timeline; and Step 204, performing local-maximum screening on the temporal score curve and determining the video frames that meet a preset threshold and lie at peak positions of the curve as the key temporal anchors.
  3. The unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment according to claim 2, wherein the design criteria and processing details of the positive confidence score and the negative confidence score in Step 202 comprise the following steps: Step 20201, inputting the video sequence to be queried together with the positive prompt into the visual language model; Step 20202, performing the forward inference of the visual language model; after the video sequence is received, locating by tensor slice indexing the unnormalized probability distribution vector (logits) corresponding to the end of the positive prompt in the model's output sequence, and reading from this vector the logit z_Yes of the first token "Yes", representing positive semantics, and the logit z_No of the second token "No", representing negative semantics; Step 20203, applying a two-class softmax normalization to z_Yes and z_No and computing the positive response probability P+ of the current video frame under the current positive prompt, which is taken as the corresponding positive confidence score; the positive response probability P+ is computed as P+ = exp(z_Yes) / (exp(z_Yes) + exp(z_No)), where exp denotes raising the natural constant to the power of each element of the input logits, i.e. softmax normalization of the logits, z_Yes is the logit corresponding to the first token "Yes", z_No is the logit corresponding to the second token "No", and P+ is the positive response probability of the current video frame under the current positive prompt; and Step 20204, inputting the video sequence to be queried together with the negative prompt into the visual language model, processing it with the same logic as Steps 20202 to 20203 to obtain the negative response probability that the current video frame shows a static background environment, and taking this negative response probability as the corresponding negative confidence score.
  4. The unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment according to claim 3, wherein the positive prompt in Step 201 includes an analysis instruction about visual change in the video sequence, used to guide the visual language model to attend to visual-change characteristics in a video frame and to make a Boolean judgment on whether the frame contains a semantically meaningful action or motion; and the negative prompt includes an analysis instruction about the static attributes of the video sequence, used to guide the visual language model to analyze the static visual attributes of the scene and to make a Boolean judgment on whether the frame is a background environment without any active event.
  5. The unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment according to claim 1, wherein the specific steps of constructing the pseudo-label triplets through the semantic-alignment description generation module in Step 3 are as follows (see the sketch after the claims): Step 301, taking the key temporal anchor as the center, expanding a preset temporal radius bidirectionally forwards and backwards, sampling the expanded video interval at equal intervals, and splicing the samples into a temporal image sequence that represents the evolution of the action; Step 302, inputting the temporal image sequence into a visual language model, configuring a strongly constrained description instruction, and guiding the visual language model to output a single-sentence original action description that conforms to a preset grammar template; Step 303, inputting the original action description into a large language model, using the large language model to perform semantic correction and rewriting of the original action description, and generating a standard query sentence that is semantically equivalent and normatively expressed; and Step 304, constructing the pseudo-label triplet by associating the generated standard query sentence, the corresponding original video sequence and the key temporal anchor.
  6. The unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment according to claim 1, wherein the specific processing steps by which the point-supervised learning enhancement module provides the point-level supervision information in Step 4 are as follows (see the sketch after the claims): Step 401, encoding the video sequence into a frame-wise visual feature sequence with the weakly supervised video segment retrieval model, completing cross-modal semantic matching in combination with the semantic features of the standard query sentence, and predicting a plurality of candidate time intervals; Step 402, judging each candidate time interval with the key temporal anchor: if the candidate interval covers the key temporal anchor on the time axis it is marked as a positive sample interval, otherwise it is marked as a negative sample interval; Step 403, aggregating the frame-wise visual features of the positive and negative sample intervals into interval-level features and constructing the point-supervised contrastive learning objective from these interval-level features; and Step 404, jointly optimizing the point-supervised contrastive learning objective with the supervision objective of the weakly supervised video segment retrieval model, computing a joint loss function, and training the weakly supervised video segment retrieval model by minimizing the joint loss.
  7. The unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment according to claim 6, wherein constructing the point-supervised contrastive learning objective by the point-supervised learning enhancement module in Step 403 comprises the following steps (see the sketch after the claims): Step 40301, extracting from the frame-wise visual feature sequence the frame feature at the position of the key temporal anchor and using it as the reference anchor representation; Step 40302, for every positive sample interval and negative sample interval determined in Step 402, extracting the frame-wise visual features of all video frames contained in the interval and computing their mean by mean pooling as the interval-level feature of that sample interval; Step 40303, pairing every positive sample interval with every negative sample interval to construct contrastive sample pairs covering all positive and negative sample intervals; Step 40304, for each contrastive sample pair, computing the cosine similarity s+ between the positive interval feature and the reference anchor representation and the cosine similarity s- between the negative interval feature and the reference anchor representation; and Step 40305, computing the contrastive loss based on the cosine similarities of Step 40304 as L = max(0, Δ - s+ + s-), where L is the value of the contrastive loss, max denotes taking the maximum value, and Δ is a preset margin parameter; and averaging the contrastive loss values of all contrastive sample pairs to obtain the final point-supervised contrastive learning objective.
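
The following is a minimal illustrative sketch of the key-anchor extraction described in claims 2 to 4, assuming the per-frame "Yes"/"No" logits have already been obtained from a visual language model for a positive and a negative prompt; the prompt strings, the threshold value and the helper names (`response_probability`, `keyframe_scores`, `find_key_anchors`) are assumptions for illustration, not definitions from the patent.

```python
import numpy as np

# Hypothetical prompt pair (claim 4): a positive prompt probing for dynamic,
# semantically meaningful action and a negative prompt probing for static background.
POSITIVE_PROMPT = "Is a salient, semantically meaningful action happening in this frame? Answer Yes or No."
NEGATIVE_PROMPT = "Is this frame only a static background scene with no active event? Answer Yes or No."

def response_probability(yes_logit: float, no_logit: float) -> float:
    """Two-class softmax over the 'Yes'/'No' logits (claim 3, step 20203):
    P+ = exp(z_Yes) / (exp(z_Yes) + exp(z_No))."""
    z = np.array([yes_logit, no_logit], dtype=np.float64)
    z -= z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return float(e[0] / e.sum())

def keyframe_scores(pos_logits: np.ndarray, neg_logits: np.ndarray) -> np.ndarray:
    """Per-frame key-frame score (claim 2, step 203): positive confidence under the
    positive prompt minus negative confidence under the negative prompt. Each input
    has shape (num_frames, 2) holding the (yes_logit, no_logit) pair for that prompt."""
    pos_conf = np.array([response_probability(y, n) for y, n in pos_logits])
    neg_conf = np.array([response_probability(y, n) for y, n in neg_logits])
    return pos_conf - neg_conf

def find_key_anchors(scores: np.ndarray, threshold: float = 0.2) -> list[int]:
    """Local-maximum screening of the temporal score curve (claim 2, step 204):
    keep frames that are local peaks and exceed a preset threshold."""
    anchors = []
    for t in range(1, len(scores) - 1):
        if scores[t] >= threshold and scores[t] > scores[t - 1] and scores[t] >= scores[t + 1]:
            anchors.append(t)
    return anchors

# Toy usage with random logits standing in for real VLM outputs.
rng = np.random.default_rng(0)
curve = keyframe_scores(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
print(find_key_anchors(curve))
```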
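
Next, a sketch of the pseudo-label triplet construction of claim 5, assuming the video is available as a sequence of decoded frames and that `describe_frames` (a VLM captioning call) and `rewrite_query` (an LLM rewriting call) are supplied by the caller; both callables, the radius and the sample count are stand-in assumptions, not APIs defined by the patent.

```python
from typing import Callable, Sequence

def build_pseudo_triplet(
    frames: Sequence,                          # decoded frames of the original video
    anchor_idx: int,                           # key temporal anchor (frame index)
    fps: float,
    describe_frames: Callable[[list], str],    # VLM captioning stub (step 302)
    rewrite_query: Callable[[str], str],       # LLM rewriting stub (step 303)
    radius_sec: float = 2.0,                   # preset temporal radius (assumption)
    num_samples: int = 8,                      # equal-interval samples in the window
) -> dict:
    """Steps 301-304: expand a window around the anchor, sample it at equal intervals,
    caption the frame sequence, rewrite the caption into a standard query sentence,
    and associate everything into a pseudo-label triplet."""
    radius = int(radius_sec * fps)
    start = max(0, anchor_idx - radius)
    end = min(len(frames) - 1, anchor_idx + radius)

    # Step 301: equal-interval sampling of the expanded interval.
    idxs = [start + round(i * (end - start) / (num_samples - 1)) for i in range(num_samples)]
    clip = [frames[i] for i in idxs]

    # Step 302: single-sentence original action description from the VLM.
    raw_caption = describe_frames(clip)

    # Step 303: semantic correction and rewriting into a standard query sentence.
    query = rewrite_query(raw_caption)

    # Step 304: pseudo-label triplet (original video, standard query, key temporal anchor).
    return {"video": frames, "query": query, "anchor_time": anchor_idx / fps}
```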
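
The point-level supervision of claim 6 can be sketched as follows, assuming the retrieval model has already produced candidate intervals as (start, end) frame indices; the 0.5 weight on the contrastive term is an illustrative assumption.

```python
def split_by_anchor(candidates: list[tuple[int, int]], anchor_idx: int):
    """Step 402: a candidate interval is a positive sample if it covers the key
    temporal anchor on the time axis, otherwise it is a negative sample."""
    positives = [(s, e) for s, e in candidates if s <= anchor_idx <= e]
    negatives = [(s, e) for s, e in candidates if not (s <= anchor_idx <= e)]
    return positives, negatives

def joint_loss(weak_supervision_loss: float, contrastive_loss: float,
               contrastive_weight: float = 0.5) -> float:
    """Step 404: joint objective = weakly supervised loss plus a weighted
    point-supervised contrastive loss (the weight is an assumption)."""
    return weak_supervision_loss + contrastive_weight * contrastive_loss
```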
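
Finally, a sketch of the point-supervised contrastive objective of claim 7, assuming the frame-wise visual features are stored in a (num_frames, dim) array; the margin value is an assumption.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def point_supervised_contrastive_loss(
    frame_feats: np.ndarray,                 # (num_frames, dim) frame-wise visual features
    anchor_idx: int,                         # key temporal anchor (frame index)
    positives: list[tuple[int, int]],        # intervals covering the anchor
    negatives: list[tuple[int, int]],        # intervals not covering the anchor
    margin: float = 0.2,                     # preset margin parameter (assumption)
) -> float:
    """Steps 40301-40305: the anchor frame feature is the reference representation;
    interval-level features are mean-pooled frame features; every positive interval
    is paired with every negative interval and scored with a margin hinge loss
    L = max(0, margin - s_pos + s_neg), averaged over all pairs."""
    anchor_feat = frame_feats[anchor_idx]                              # step 40301
    pool = lambda s, e: frame_feats[s:e + 1].mean(axis=0)              # step 40302
    losses = []
    for ps, pe in positives:                                           # step 40303
        s_pos = cosine(pool(ps, pe), anchor_feat)                      # step 40304
        for ns, ne in negatives:
            s_neg = cosine(pool(ns, ne), anchor_feat)
            losses.append(max(0.0, margin - s_pos + s_neg))            # step 40305
    return float(np.mean(losses)) if losses else 0.0
```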

Description

Unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment

Technical Field

The invention belongs to the technical field of computer vision and artificial intelligence, and in particular relates to an unsupervised video segment retrieval method based on temporal anchor mining and semantic alignment.

Background

Video moment retrieval (Video Moment Retrieval, VMR) aims to precisely localize a specific temporal segment in an untrimmed long video according to a natural language query. This capability lays the technical foundation for tasks that require fine-grained video understanding, such as automatic surveillance analysis, intelligent video search and human-computer interaction; by associating language concepts with temporal intervals, VMR enables computer systems to efficiently navigate and analyze massive amounts of unstructured video data. Traditional video segment retrieval methods depend heavily on large-scale datasets with accurate temporal annotation. However, annotating accurate start and end boundaries for every video-query pair is extremely expensive in labor cost and difficult to scale; unlike image-text pairs, which can be crawled at scale from the Internet, the high annotation cost of video temporal localization severely limits the generalization ability and practical deployment of supervised models. To break through the annotation bottleneck, research has gradually shifted to zero-shot video moment retrieval (Zero-Shot Video Moment Retrieval, ZSVMR). This paradigm works directly on unlabeled raw video and uses the extensive knowledge contained in pre-trained visual language models (VLMs) to establish temporal-semantic alignment, thereby eliminating the reliance on manual labeling. Under an unsupervised training setting, the biggest challenge is how to build fine-grained video-text associations from video without any annotations. At present, two main unsupervised implementation paths exist in academia and industry. The first is a fully supervised training paradigm based on generated pseudo temporal boundaries: a multimodal large model is used to directly generate pseudo labels with start and end boundaries, and the retrieval model is trained in a fully supervised manner on these pseudo labels. This approach has a notable technical bottleneck: because existing multimodal large models are not designed for long-video temporal localization, the generated pseudo boundaries are typically noisy and highly inaccurate, so the downstream retrieval model not only learns erroneous boundary distributions but also has its performance ceiling limited by the zero-shot localization ability of the pre-trained model; the trained model may even underperform the original model used for direct inference. The second is a training-free prediction approach based on a visual language model: it abandons model training, directly uses the visual language model to compute frame-level similarities against the test query, and extracts segments with complex post-processing logic.
However, this approach struggles to meet the real-time requirements of practical applications: frame-by-frame or dense sliding-window inference over a long video is computationally very expensive and the inference latency is extremely high. Meanwhile, because of the redundancy of visual information, the similarity curve output by the visual language model is typically noisy, so the method depends heavily on manually designed screening thresholds and post-processing rules and lacks robustness across scenes. In addition, existing unsupervised methods commonly face a semantic-gap problem: the raw video descriptions generated by a model in an unconstrained state often contain grammatical errors, uncommon vocabulary or redundant visual detail, so a clear semantic distribution gap exists between these descriptions and the refined, standardized query sentences used by users in actual retrieval, further increasing the difficulty of video-text semantic alignment in unsupervised training. In view of the inaccuracy caused by forcing the prediction of time boundaries and the high computational cost caused by direct inference in the prior art, the invention avoids inaccurate segment-boundary prediction and instead mines high-quality temporal anchors by exploiting the image description capability at which multimodal models excel, and enhances the boundary perception of the weakly supervised model by constructing high-precision point-level supervision information.