
CN-122024124-A - Target behavior detection method, system and device based on long-time-sequence video analysis

CN122024124A

Abstract

The invention discloses a target behavior detection method, system, and device based on long-time-sequence video analysis, relating to the fields of artificial intelligence and computer vision. A long-time-sequence video is acquired and divided into a plurality of video segments; each video segment is processed by a visual feature extraction module in a vision-language model to obtain a segment-level visual feature sequence, which an adaptive temporal pooling module further processes into a video representation vector. A hierarchical text prompt module determines three-layer text representation vectors for a plurality of behavior categories, and target behaviors are detected by a preset feature alignment and classification module in the vision-language model, finally yielding each target behavior in the long-time-sequence video, its corresponding occurrence time period, and a target clinical abstract text. The scheme can reliably identify each target behavior and its occurrence time period in long-time-sequence video, and produces readable target clinical abstract text, thereby realizing video-text retrieval with stronger interpretability and traceability.

Inventors

  • HE XIANGJIAN
  • ZHANG HAN
  • XING CHANG

Assignees

  • 宁波诺丁汉大学 (University of Nottingham Ningbo China)

Dates

Publication Date
2026-05-12
Application Date
2025-12-24

Claims (10)

  1. A target behavior detection method based on long-time-sequence video analysis, characterized by comprising the following steps: acquiring a long-time-sequence video to be analyzed; dividing the long-time-sequence video according to a preset time window to obtain a plurality of video segments corresponding to different time periods; for each video segment, processing the video segment by using a visual feature extraction module in a pre-trained vision-language model to obtain a segment-level visual feature sequence composed of frame-level visual feature vectors, and further processing the segment-level visual feature sequence by an adaptive temporal pooling module in the vision-language model to obtain a video representation vector corresponding to the video segment; generating, by a hierarchical text prompt module in the vision-language model, three-layer text representation vectors for a plurality of behavior categories, the three-layer text representation vectors comprising a first-layer text representation vector corresponding to a behavior category prompt text, a second-layer text representation vector corresponding to a behavior action detail text, and a third-layer text representation vector corresponding to a clinical interpretation description text; and determining whether the behavior corresponding to the video segment is a target behavior according to the video representation vector, the three-layer text representation vectors, and a preset feature alignment and classification module in the vision-language model, and, when a target behavior is determined, retrieving the corresponding clinical interpretation description text as a target clinical abstract text, so as to obtain each target behavior in the long-time-sequence video, its corresponding occurrence time period, and the target clinical abstract text.
  2. The target behavior detection method based on long-time-sequence video analysis according to claim 1, wherein processing the video segment by using the visual feature extraction module in the pre-trained vision-language model to obtain the segment-level visual feature sequence composed of frame-level visual feature vectors comprises: sampling the video segment at a preset fixed frame rate, and resizing each sampled frame image to a preset size to obtain a frame image sequence comprising T frame images, T being an integer greater than 1; and performing forward inference on each frame image in the frame image sequence using a pre-trained CLIP visual encoder to obtain a D-dimensional frame-level visual feature vector corresponding to the current frame image, thereby obtaining a segment-level visual feature sequence of dimension T×D composed of the T frame-level visual feature vectors, D being an integer greater than 1.
  3. The target behavior detection method based on long-time-sequence video analysis according to claim 2, wherein processing the segment-level visual feature sequence by the adaptive temporal pooling module in the vision-language model to obtain the video representation vector corresponding to the video segment comprises: determining a local window centered on the t-th frame-level visual feature vector in the segment-level visual feature sequence, and performing one-dimensional convolution on the frame-level visual feature vectors within the local window to obtain an initial attention score, wherein the local window comprises the center, the o frame-level visual feature vectors before the center, and the o frame-level visual feature vectors after the center, o being an integer not less than 1 and t being an integer with 1 ≤ t ≤ T; calculating the cosine similarities between the t-th frame-level visual feature vector and all frame-level visual feature vectors in the segment-level visual feature sequence, and determining the average of these cosine similarities as a consistency score; determining a final raw score of the t-th frame-level visual feature vector using a learnable parameter and a first preset relational expression: s_t = β·e_conv_t + (1−β)·e_sim_t, where s_t denotes the final raw score, β the learnable parameter, e_conv_t the initial attention score, and e_sim_t the consistency score; and normalizing the T final raw scores through softmax to obtain a per-frame attention weight, and weighting and summing the frame-level visual feature vectors according to their attention weights to obtain the video representation vector corresponding to the video segment.
  4. The target behavior detection method based on long-time-sequence video analysis according to claim 2, wherein generating the three-layer text representation vectors by the hierarchical text prompt module in the vision-language model for the plurality of behavior categories comprises: generating three layers of prompt texts for each behavior category from a preset text prompt library, the three layers comprising a first-layer behavior category prompt text, a second-layer behavior action detail text, and a third-layer clinical interpretation description text; inputting each of the three layers of prompt texts into a CLIP text encoder to obtain three corresponding layers of D-dimensional text feature vectors; and, for each layer, fine-tuning the D-dimensional text feature vector of the current layer using adapters connected in series at the current layer, so as to obtain the D-dimensional text representation vector of the current layer.
  5. The target behavior detection method based on long-time-sequence video analysis according to claim 2, wherein determining whether the behavior corresponding to the video segment is a target behavior according to the video representation vector, the three-layer text representation vectors, and the preset feature alignment and classification module in the vision-language model comprises: processing the video representation vector with a video linear transformation matrix to obtain a mapped video representation vector, and processing the three-layer text representation vectors with a text linear transformation matrix to obtain three-layer mapped text representation vectors, wherein the mapped video representation vector and the three-layer mapped text representation vectors belong to the same d-dimensional feature space, d being an integer greater than 1; for each behavior category, calculating the cosine similarities between the three-layer mapped text representation vectors of the current behavior category and the mapped video representation vector, and taking the maximum of these cosine similarities as the score of the current behavior category; and performing a softmax operation on the scores of all behavior categories to obtain a probability distribution over the behavior categories, and determining, based on the probability distribution, whether the behavior corresponding to the video segment is a target behavior.
  6. The target behavior detection method based on long-time-sequence video analysis according to claim 1, wherein the vision-language model is constructed based on CLIP and is pre-trained based on a preset loss function and a dataset generated by a data acquisition and labeling module, so as to maximize the similarity between a positively matched video representation and the corresponding text representation; generating the dataset by the data acquisition and labeling module comprises: acquiring a plurality of long-time-sequence videos and labeling each long-time-sequence video to obtain a labeled video-text paired dataset for training; labeling the plurality of long-time-sequence videos comprises: obtaining the labeling result input for each long-time-sequence video, determining a plurality of target behavior segments in each long-time-sequence video, and recording the behavior category and corresponding timestamp of each target behavior segment; and acquiring the clinical interpretation description text input for each target behavior segment, so as to record the behavior category, corresponding timestamp, and clinical interpretation description text of the target behavior segment, thereby realizing video-text paired labeling.
  7. The target behavior detection method based on long-time-sequence video analysis according to claim 5, wherein the pre-training step of the vision-language model comprises: in each round of a few-shot training task, extracting N behavior categories from the dataset, extracting K labeled video-text pairs from each behavior category as a support set, and using the remaining labeled video-text pairs as a query set.
  8. The target behavior detection method based on long-time-sequence video analysis according to claim 1, further comprising: when an input search text for the long-time-sequence video to be analyzed is obtained, calculating the cosine similarity between the input search text and the video representation vector corresponding to each video segment, ranking the video segments by cosine similarity from high to low, and determining, according to the ranking, at least one video segment most relevant to the input search text as a target retrieval result.
  9. A target behavior detection system based on long-time-sequence video analysis, comprising: an acquisition unit for acquiring a long-time-sequence video to be analyzed; a dividing unit for dividing the long-time-sequence video according to a preset time window to obtain a plurality of video segments corresponding to different time periods; a video representation vector determining unit for processing each video segment by using a visual feature extraction module in a pre-trained vision-language model to obtain a segment-level visual feature sequence composed of frame-level visual feature vectors, and further processing the segment-level visual feature sequence by an adaptive temporal pooling module in the vision-language model to obtain a video representation vector corresponding to the video segment; a text representation vector determining unit for generating, by a hierarchical text prompt module in the vision-language model, three-layer text representation vectors for a plurality of behavior categories, the three-layer text representation vectors comprising a first-layer text representation vector corresponding to a behavior category prompt text, a second-layer text representation vector corresponding to a behavior action detail text, and a third-layer text representation vector corresponding to a clinical interpretation description text; and a target behavior detection unit for determining whether the behavior corresponding to the video segment is a target behavior according to the video representation vector, the three-layer text representation vectors, and a preset feature alignment and classification module in the vision-language model, and, when a target behavior is determined, retrieving the corresponding clinical interpretation description text as a target clinical abstract text, so as to obtain each target behavior in the long-time-sequence video, its corresponding occurrence time period, and the target clinical abstract text.
  10. A target behavior detection device based on long-time-sequence video analysis, comprising: a memory for storing a computer program; and a processor for implementing, when executing the computer program, the steps of the target behavior detection method based on long-time-sequence video analysis according to any one of claims 1 to 8.
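The adaptive temporal pooling of claim 3 can be sketched in a few lines of NumPy. This is a hypothetical re-implementation, not the patent's code: the function name `adaptive_temporal_pooling` is invented, and the learned 1-D convolution is replaced by a simple window statistic as a stand-in for `e_conv_t`; only the blending relation s_t = β·e_conv_t + (1−β)·e_sim_t and the softmax-weighted sum follow the claim directly.

```python
import numpy as np

def adaptive_temporal_pooling(F, beta=0.5, o=2):
    """Pool a T x D frame-feature sequence into one D-dim video vector.

    Hypothetical sketch of the claimed adaptive temporal pooling: a local
    window score blended with a global cosine-consistency score, then
    softmax-weighted averaging over frames.
    """
    T, D = F.shape
    # Unit-normalise frames once for the cosine similarities.
    Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-8)

    # e_conv_t: stand-in for the learned 1-D convolution over the local
    # window [t-o, t+o] -- here simply the norm of the window mean.
    e_conv = np.empty(T)
    for t in range(T):
        lo, hi = max(0, t - o), min(T, t + o + 1)
        e_conv[t] = np.linalg.norm(F[lo:hi].mean(axis=0))

    # e_sim_t: mean cosine similarity of frame t to all T frames.
    sims = Fn @ Fn.T                      # T x T cosine-similarity matrix
    e_sim = sims.mean(axis=1)

    # s_t = beta * e_conv_t + (1 - beta) * e_sim_t  (claim 3's relation)
    s = beta * e_conv + (1.0 - beta) * e_sim

    # Softmax over the T raw scores -> per-frame attention weights.
    w = np.exp(s - s.max())
    w /= w.sum()

    return w @ F                          # weighted sum: D-dim video vector
```

Pooling a sequence of identical frames returns that frame's vector unchanged, since the attention weights are uniform in that case.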
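The feature alignment and classification of claim 5 (shared-space projection, per-layer cosine similarity, max over the three prompt layers, softmax over categories) can likewise be sketched. The helper name `classify_segment`, the temperature `tau`, and the toy projection matrices are assumptions for illustration; the patent specifies only the learned linear transformation matrices and the max-then-softmax scoring.

```python
import numpy as np

def classify_segment(v, text_vecs, W_v, W_t, tau=0.07):
    """Score behavior categories for one video vector (claim 5, sketched).

    v         : D-dim video representation vector
    text_vecs : dict {category: (3, D) array of the three prompt layers}
    W_v, W_t  : hypothetical learned (d x D) projection matrices into a
                shared d-dim space
    Returns (probability dict over categories, best category).
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    z_v = W_v @ v                               # map video into d-dim space
    scores = {}
    for cat, layers in text_vecs.items():
        # Category score = max cosine similarity over the three mapped
        # text layers, as in claim 5.
        scores[cat] = max(cos(z_v, W_t @ layer) for layer in layers)

    cats = list(scores)
    logits = np.array([scores[c] for c in cats]) / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()                                # softmax -> distribution
    probs = dict(zip(cats, p))
    return probs, max(probs, key=probs.get)
```

With identity-like projections and a category whose first prompt layer matches the video vector, that category receives the highest probability.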
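The N-way K-shot episode construction in claim 7 reduces to a simple sampling routine. A minimal sketch, assuming the dataset is a category-to-pairs mapping; the function name `sample_episode` is hypothetical.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=2, seed=None):
    """Build one few-shot training episode per claim 7: draw N behavior
    categories, K labeled video-text pairs per category as the support
    set, and use the remaining pairs of those categories as the query set.

    dataset : dict {category: [video-text pairs]}, each list having more
              than k_shot items
    """
    rng = random.Random(seed)
    categories = rng.sample(sorted(dataset), n_way)   # N-way category draw
    support, query = {}, {}
    for cat in categories:
        items = list(dataset[cat])
        rng.shuffle(items)
        support[cat] = items[:k_shot]                 # K-shot support set
        query[cat] = items[k_shot:]                   # remainder -> query set
    return support, query
```

Support and query sets are disjoint by construction, which is what the episodic pre-training loop requires.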
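The text-to-video retrieval of claim 8 is a cosine-similarity ranking over the per-segment video representation vectors. A minimal sketch, assuming the search text has already been encoded into the same shared space as the segments; `retrieve_segments` is an invented helper name.

```python
import numpy as np

def retrieve_segments(query_vec, segment_vecs, top_k=3):
    """Rank all segment vectors by cosine similarity to the encoded
    search text, highest first, and return the top_k matches (claim 8).

    query_vec    : d-dim embedding of the input search text
    segment_vecs : list of (segment_id, d-dim vector) pairs
    """
    qn = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    scored = []
    for seg_id, v in segment_vecs:
        vn = v / (np.linalg.norm(v) + 1e-8)
        scored.append((seg_id, float(qn @ vn)))
    scored.sort(key=lambda x: x[1], reverse=True)   # high-to-low ranking
    return scored[:top_k]
```

The most relevant segments, i.e. those whose pooled representation points closest to the query embedding, appear first in the returned list.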

Description

Target behavior detection method, system and device based on long-time-sequence video analysis

Technical Field

The invention relates to the technical field of artificial intelligence and computer vision, and in particular to a target behavior detection method, system, and device based on long-time-sequence video analysis.

Background

Autism spectrum disorders are a class of neurodevelopmental disorders characterized primarily by impaired social interaction, narrow interests, and stereotyped repetitive behaviors; early screening and intervention can significantly improve the long-term developmental outcome of children. However, current clinical evaluation relies mainly on trained evaluators making subjective observations and scores in outpatient settings using scales or structured interviews (e.g., ADOS-2). This evaluation procedure is time- and labor-consuming, requires high labor cost, has limited consistency between different evaluators, and inadequately captures abnormal behaviors that occur in transient, concealed, or natural home situations.
With the development of computer vision and artificial intelligence technology, attempts have been made to objectively identify autism-related behaviors through automatic video analysis. Typically, short video clips of children are collected, and stereotyped actions (such as clapping, toe walking, and rocking) are classified using hand-crafted features (such as histograms of optical flow, HOF, or histograms of oriented gradients, HOG) or models such as 2D/3D convolutional neural networks and Transformers. These methods still have shortcomings. On the one hand, they can only process short-time-sequence videos, such as clips of 30-90 seconds, because existing models are trained on public datasets (such as SSBD and ESBD) that contain only 30-90 second clips, usually featuring a single stereotyped behavior; as a result, their classification outputs can hardly reflect the occurrence order, context, and co-occurrence relations of behaviors within the 20-30 minutes of continuous interaction found in real clinical or family settings, and the cost of temporal modeling rises significantly for long-time-sequence videos. On the other hand, existing models output only discrete visual category labels without natural-language descriptions aligned with clinical terminology, so their results are difficult to interpret, trace back, and compare against clinical assessment.
Therefore, a technical solution is needed that can work on long-time-sequence natural interaction video and can establish an alignment relationship between behavior patterns in the video and clinical language.

Disclosure of Invention

The invention provides a target behavior detection method, system, and device based on long-time-sequence video analysis, realizing bidirectional video-text and text-video retrieval, which is more convenient for practical application and gives the scheme stronger interpretability and traceability. To solve the above technical problems, the application provides a target behavior detection method based on long-time-sequence video analysis, comprising the following steps: acquiring a long-time-sequence video to be analyzed; dividing the long-time-sequence video according to a preset time window to obtain a plurality of video segments corresponding to different time periods; for each video segment, processing the video segment by using a visual feature extraction module in a pre-trained vision-language model to obtain a segment-level visual feature sequence composed of frame-level visual feature vectors, and further processing the segment-level visual feature sequence by an adaptive temporal pooling module in the vision-language model to obtain a video representation vector corresponding to the video segment; generating, by a hierarchical text prompt module in the vision-language model, three-layer text representation vectors for a plurality of behavior categories, the three-layer text representation vectors comprising a first-layer text representation vector corresponding to a behavior category prompt text, a second-layer text representation vector corresponding to a behavior action detail text, and a third-layer text representation vector corresponding to a clinical interpretation description text; and determining whether the behavior
correspondin