CN-121996814-A - Multi-target accurate retrieval method and system in security video


Abstract

The invention provides a method and system for accurate multi-target retrieval in security video, relating to the field of interdisciplinary applications. The method comprises performing standardized preprocessing on an original security video and deploying an improved YOLOv model on the preprocessed key frames to perform high-precision multi-target detection. The advantages of the invention are as follows: the improved YOLOv model synchronously extracts three complementary features of each target (appearance, motion and semantics); an enhanced DeepSORT tracker performs cross-frame association and trajectory modeling; and a compact feature index library built with a locality-sensitive hashing algorithm is combined with a three-level progressive retrieval mechanism. Together these markedly improve the precision and efficiency of multi-target retrieval in massive video collections and strengthen target association and semantic understanding of complex scenes. Fast approximate matching and a graded retrieval strategy also greatly reduce computational cost, meeting the requirements of security applications for high accuracy, rapid response and real-time intelligent analysis and judgment.

Inventors

XUE WEINA
YU CHANGXIU

Assignees

Zhejiang University of Technology (浙江工业大学)

Dates

Publication Date: 20260508
Application Date: 20260114

Claims (10)

1. A multi-target accurate retrieval method in security video, characterized by comprising the following steps:
performing standardized preprocessing on an original security video, including unified frame-rate adjustment, resolution standardization and video segmentation combined with scene change detection, and on this basis screening a representative frame sequence using a key frame extraction strategy;
deploying an improved YOLOv model on the preprocessed key frames to perform high-precision multi-target detection and obtain a detection result, wherein the YOLOv model not only outputs target bounding boxes but also synchronously extracts three complementary features, namely appearance features, motion features and semantic features;
inputting the detection result into an improved DeepSORT tracker, achieving cross-frame target association using a strategy that fuses appearance features and motion features, and tracking the detected targets across frames to construct a motion feature trajectory for each target;
fusing the static appearance obtained from the detection result with the motion feature trajectory, generating a unified high-dimensional target characterization vector from the appearance features, motion features and semantic features or motion trajectory information, and performing approximate nearest neighbor mapping on the vector using a locality-sensitive hashing algorithm to construct a compact feature index library supporting fast similarity queries;
executing a three-level progressive retrieval mechanism based on the established feature index library, the mechanism comprising coarse retrieval, fine retrieval and association retrieval; and
quantifying the confidence of the output retrieval results, performing deduplication, ranking and visual integration, and generating a structured and interpretable multi-target retrieval report.
2. The multi-target accurate retrieval method in security video according to claim 1, wherein the improved YOLOv model is based on the original YOLOv architecture and introduces a multi-branch feature fusion module to synchronously extract the appearance features, motion features and semantic features of targets, wherein the appearance features capture high-resolution target texture information by enhancing the retention of spatial detail in the backbone network, the motion features capture the dynamic properties of targets by embedding a lightweight optical flow estimation sub-network or an inter-frame difference guidance mechanism, and the semantic features are embedding vectors with high-level semantic consistency generated by a projection head aligned with a pre-trained vision-language model.
3. The multi-target accurate retrieval method in security video according to claim 1, wherein the improved DeepSORT tracker builds an enhanced deep appearance embedding model from the appearance features and dynamically optimizes target appearance templates through an online update strategy; and semantic features are introduced as a high-level constraint, with semantic consistency correction applied to the matching cost matrix during trajectory association, so as to effectively suppress ID switches caused by occlusion, deformation or illumination change.
4. The multi-target accurate retrieval method in security video according to claim 2, wherein the appearance features are extracted with a ResNet-50 deep neural network and include visual information such as a pedestrian's clothing, hairstyle and posture; and the motion features comprise a spatial trajectory, a temporal trajectory and a behavior trajectory of a target, wherein the behavior trajectory is obtained by analyzing the target's speed changes, movement direction and dwell patterns.
5. The multi-target accurate retrieval method in security video according to claim 4, wherein the spatio-temporal index structure uses a two-dimensional time-space index to achieve rapid target positioning and filtering of relevant video segments; and in the multi-level retrieval, the coarse retrieval filters quickly through the spatio-temporal index, the fine retrieval matches precisely based on feature similarity, and the association retrieval uses a target association model to identify different appearances of the same target.
6. A multi-target accurate retrieval system in security video, characterized by comprising:
a data acquisition module, configured to perform standardized preprocessing on an original security video, including unified frame-rate adjustment and resolution standardization, to segment the video using scene change detection, and on this basis to screen a representative frame sequence using a key frame extraction strategy, the module supporting multiple video input formats and real-time stream processing;
an intelligent analysis module, configured to deploy an improved YOLOv model on the preprocessed key frames to perform high-precision multi-target detection, and comprising a target detection unit, a feature extraction unit and a target tracking unit;
an index management module, configured to perform approximate nearest neighbor mapping on the vector using a locality-sensitive hashing algorithm and to construct a compact feature index library supporting fast similarity queries, the module constructing and maintaining a feature index and a spatio-temporal index;
a retrieval service module, configured to execute a three-level progressive retrieval mechanism, comprising coarse retrieval, fine retrieval and association retrieval, based on the established index structure, the module providing multi-level retrieval algorithms and result ranking; and
a system management module, configured to generate a structured and interpretable multi-target retrieval report, to support backtracking and intelligent analysis and judgment of security events, and to provide a user interface and permission management functions.
7. The multi-target accurate retrieval system in security video according to claim 6, wherein the target detection unit uses an improved YOLOv deep learning model and can detect multiple target types, including pedestrians and vehicles.
8. The multi-target accurate retrieval system in security video according to claim 7, wherein the feature extraction unit is capable of extracting three types of features, namely appearance features, motion features and semantic features, and fusing the three types of features into a unified target characterization vector.
9. The multi-target accurate retrieval system in security video according to claim 6, wherein the index management module uses an improved locality-sensitive hashing algorithm to construct the feature index and supports dynamic updating and maintenance.
10. The multi-target accurate retrieval system in security video according to claim 6, wherein the retrieval service module supports multiple retrieval modes, including retrieval based on feature conditions, retrieval based on spatio-temporal conditions and retrieval based on behavior conditions.
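
Illustrative sketch (not part of the claims): claims 1 and 9 name a locality-sensitive hashing index over target characterization vectors but do not specify the hash family. The Python sketch below assumes random-hyperplane (sign of random projection) hashing with a single hash table, and interprets "adjacent buckets" as buckets whose signature differs from the query's by one bit. The class name `HyperplaneLSHIndex`, its parameters and the random test vectors are hypothetical and chosen only for illustration.

```python
import numpy as np
from collections import defaultdict


class HyperplaneLSHIndex:
    """Toy LSH index: random-hyperplane signatures, one hash table (assumed design)."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # Each row is the normal vector of one random hyperplane.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)  # signature -> list of target ids
        self.vectors = {}                 # target id -> unit-length vector

    def _signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        bits = (self.planes @ vec) >= 0
        return tuple(bool(b) for b in bits)

    def add(self, target_id, vec):
        vec = np.asarray(vec, dtype=float)
        vec = vec / np.linalg.norm(vec)
        self.vectors[target_id] = vec
        self.buckets[self._signature(vec)].append(target_id)

    def _probe_keys(self, sig):
        # The query's own bucket, then every bucket exactly one bit away.
        yield sig
        for i in range(len(sig)):
            yield sig[:i] + (not sig[i],) + sig[i + 1:]

    def query(self, vec, top_k=5):
        vec = np.asarray(vec, dtype=float)
        vec = vec / np.linalg.norm(vec)
        candidates = set()
        for key in self._probe_keys(self._signature(vec)):
            candidates.update(self.buckets.get(key, ()))
        # Exact cosine similarity is computed only for the candidates.
        scored = [(float(self.vectors[t] @ vec), t) for t in candidates]
        scored.sort(reverse=True)
        return [(t, s) for s, t in scored[:top_k]]


# Usage with random stand-in vectors (real vectors would come from the
# fused appearance, motion and semantic features).
rng = np.random.default_rng(1)
raw_vectors = {f"target_{i}": rng.standard_normal(16) for i in range(200)}
index = HyperplaneLSHIndex(dim=16)
for target_id, vector in raw_vectors.items():
    index.add(target_id, vector)
results = index.query(raw_vectors["target_42"], top_k=3)
```

With 8 signature bits, each query inspects at most 9 of the 256 possible buckets, which is the source of the reduced search range. A practical index would likely use several hash tables and tune the bit count against recall; those choices are not stated in the source.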

Description

Multi-target accurate retrieval method and system in security video

Technical Field

The invention relates to the field of interdisciplinary applications, and in particular to a multi-target accurate retrieval method and system in security video.

Background

With accelerating urbanization and rising public safety requirements, security video surveillance systems have become an important part of urban security. Although data acquisition and storage in current systems are relatively mature, fast and accurate multi-target retrieval in massive video data still faces many challenges, such as low retrieval precision, poor efficiency, weak multi-target association, a lack of semantic understanding and insufficient real-time performance in complex environments. The prior art therefore struggles to meet the requirements of high accuracy, fast response and intelligent analysis and judgment in real security scenarios, and a new video analysis method is needed that integrates appearance, motion and semantic features and provides efficient indexing and a multi-stage retrieval mechanism.

Disclosure of Invention

The invention aims to provide a multi-target accurate retrieval method and system in security video that solve the technical problems of low retrieval precision, poor efficiency, weak multi-target association, lack of semantic understanding and insufficient real-time performance in prior-art security video surveillance systems.

To achieve this aim, the invention adopts the following technical scheme. A multi-target accurate retrieval method in security video is characterized by comprising the following steps:
performing standardized preprocessing on an original security video, including unified frame-rate adjustment, resolution standardization and video segmentation combined with scene change detection, and on this basis screening a representative frame sequence using a key frame extraction strategy;
deploying an improved YOLOv model on the preprocessed key frames to perform high-precision multi-target detection and obtain a detection result, wherein the YOLOv model not only outputs target bounding boxes but also synchronously extracts three complementary features, namely appearance features, motion features and semantic features;
inputting the detection result into an improved DeepSORT tracker, achieving cross-frame target association using a strategy that fuses appearance features and motion features, and tracking the detected targets across frames to construct a motion feature trajectory for each target;
fusing the static appearance obtained from the detection result with the motion feature trajectory, generating a unified high-dimensional target characterization vector from the appearance features, motion features and semantic features or motion trajectory information, and performing approximate nearest neighbor mapping on the vector using a locality-sensitive hashing algorithm to construct a compact feature index library supporting fast similarity queries;
executing a three-level progressive retrieval mechanism based on the established feature index library, the mechanism comprising coarse retrieval, fine retrieval and association retrieval; and
quantifying the confidence of the output retrieval results, performing deduplication, ranking and visual integration, and generating a structured and interpretable multi-target retrieval report.

The improved locality-sensitive hashing algorithm is used for fast similarity retrieval over the high-dimensional target characterization vectors that fuse appearance, motion and semantic information, and thereby to construct a compact feature index library. Each historical target (such as a pedestrian or a vehicle) is represented as a high-dimensional vector fusing appearance, motion and semantic information, and this vector is mapped by the locality-sensitive hash functions into several hash buckets and stored. When a user issues a retrieval request such as "search for people wearing red clothes and walking left", the system converts the query condition into a corresponding vector and computes its hash value with the same locality-sensitive hash functions. Exact similarity matching is then needed only within the target hash bucket and its adjacent buckets, which markedly narrows the search range and greatly reduces computational cost. The rough detection method in