CN-122024122-A - Video target detection method, system and storage medium based on artificial intelligence
Abstract
The application relates to the technical field of artificial intelligence and discloses an artificial-intelligence-based video target detection method, system and storage medium, aiming to solve problems of the prior art: insufficient detection stability caused by neglecting inter-frame temporal association, poor environmental adaptability, and difficulty in balancing real-time performance with precision. The method comprises: extracting a continuous video frame sequence; performing joint spatio-temporal modeling to obtain a multi-scale spatio-temporal feature tensor; generating dynamic candidate regions through a spatio-temporal attention mechanism; performing semantic classification, bounding-box regression and cross-frame track association on the candidate regions; outputting detection results carrying identity labels; and dynamically adjusting model parameters according to deviations in detection performance. The scheme significantly improves robustness in complex scenes such as occlusion and motion blur, strengthens target identity consistency, and achieves both high precision and real-time performance.
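As a minimal illustration of the frame-sequence preparation step summarized above (with the concrete parameters given later in claim 5: pixel values scaled to [0, 1], clip length fixed at 16 frames, short videos padded by repeating the last frame), the following Python sketch shows one possible implementation. Function and parameter names are illustrative, not taken from the patent:

```python
import numpy as np

def prepare_clip(frames, clip_len=16):
    """Normalize frames to [0, 1] and pad to a fixed clip length.

    Per claim 5, pixel values are scaled to [0, 1], and a video shorter
    than `clip_len` is padded by repeating its last frame.
    """
    frames = [np.asarray(f, dtype=np.float32) / 255.0 for f in frames]
    while len(frames) < clip_len:
        frames.append(frames[-1].copy())  # repeat the last frame
    return np.stack(frames[:clip_len])    # shape: (T, H, W, C)
```

In a real pipeline the sampling at 30 frames per second and the resolution check (at least 1280×720, RGB) would happen upstream of this function.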
Inventors
- WANG XIAODAN
Assignees
- 河北并济茂利科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-23
Claims (10)
- 1. A video target detection system based on artificial intelligence, characterized by comprising: a video frame sequence input module for receiving an original video stream and extracting a continuous video frame sequence in temporal order; a multi-scale spatio-temporal feature extraction module for performing joint spatio-temporal modeling on the video frame sequence and extracting multi-scale spatio-temporal feature tensors containing target appearance and motion information; a dynamic target candidate region generation module for generating, based on the multi-scale spatio-temporal feature tensor, a set of candidate target regions that evolves dynamically over time through a spatio-temporal attention mechanism; a target semantic refinement and track association module for performing semantic classification, bounding-box regression and cross-frame track consistency constraints on the candidate target region set and outputting stable target detection results with identity labels; and a model adaptive optimization module for dynamically adjusting the parameter weights of feature extraction and candidate region generation according to the deviation between the detection results and a preset confidence threshold; wherein the multi-scale spatio-temporal feature extraction module adopts a hybrid architecture of a three-dimensional convolutional neural network and deformable spatio-temporal convolution, the three-dimensional convolutional neural network slides a 5×5 convolution kernel along the time dimension, the deformable spatio-temporal convolution dynamically adjusts the standard convolution sampling positions by learning an offset field, the offset field is generated by a lightweight sub-network from the gradient responses of the current frame and its preceding and following frames, and the multi-scale spatio-temporal feature tensor is formed by splicing the feature maps output by network layers of different depths after up-sampling or down-sampling alignment, with 3 scales whose receptive fields are 32, 64 and 128 pixels respectively.
- 2. The artificial intelligence-based video target detection system of claim 1, wherein the dynamic target candidate region generation module comprises a spatio-temporal attention weight calculation unit and a region proposal network, and the spatio-temporal attention weight calculation unit performs self-attention over the time dimension of the multi-scale spatio-temporal feature tensor to generate a temporal attention weight matrix.
- 3. The artificial intelligence-based video target detection system according to claim 1, wherein the target semantic refinement and track association module comprises a target classification-regression sub-module and a cross-frame track association sub-module; the target classification-regression sub-module uses a two-layer fully connected network to perform category prediction and bounding-box coordinate correction for each candidate region, the number of categories is 80, and the bounding-box correction uses a Smooth L1 loss function; the cross-frame track association sub-module matches detection boxes in adjacent frames whose IoU is greater than 0.5 and whose categories are consistent, based on the Hungarian algorithm, and introduces Kalman filtering to predict the target motion state, the state vector comprising position, velocity and size, with the covariance matrix initialized to the diagonal matrix diag([10, 10, 1, 1, 5, 5]).
- 4. The artificial intelligence-based video target detection system of claim 1, wherein the model adaptive optimization module calculates, via an online evaluation module, the difference between the mean average precision (mAP) of the current detection results and a preset threshold of 0.85, and triggers a parameter fine-tuning mechanism if the mAP remains below the threshold for 5 consecutive frames.
- 5. The artificial intelligence-based video target detection system of claim 1, wherein the video frame sequence input module extracts a frame sequence from the original video stream at a sampling rate of 30 frames per second and normalizes each frame, scaling pixel values to the [0, 1] interval; the length of the continuous video frame sequence is fixed at 16 frames, and if the total number of video frames is insufficient, the last frame is repeated at the end until the length requirement is met; the original video stream has a resolution of no less than 1280×720 and an RGB color space.
- 6. The artificial intelligence-based video target detection system of claim 1, wherein the output feature tensor of the multi-scale spatio-temporal feature extraction module has dimensions C×T×H×W, where C=256 is the number of channels, T=16 is the number of time steps, and H and W are the height and width of the feature map, determined by the input resolution and the network downsampling rate; the feature tensor undergoes layer normalization to stabilize training before being input to the dynamic target candidate region generation module, and the normalization parameters are frozen at the inference stage.
- 7. A video target detection method based on artificial intelligence, characterized by comprising the following steps: step S110, receiving an original video stream and extracting a continuous video frame sequence in temporal order; step S120, performing joint spatio-temporal modeling on the video frame sequence and extracting multi-scale spatio-temporal feature tensors containing target appearance and motion information; step S130, generating, based on the multi-scale spatio-temporal feature tensor, a set of candidate target regions that evolves dynamically over time through a spatio-temporal attention mechanism; step S140, performing semantic classification, bounding-box regression and cross-frame track consistency constraints on the candidate target region set and outputting stable target detection results with identity labels; step S150, dynamically adjusting the parameter weights of feature extraction and candidate region generation according to the deviation between the detection results and a preset confidence threshold; wherein step S120 adopts a hybrid architecture of a three-dimensional convolutional neural network and deformable spatio-temporal convolution, the three-dimensional convolutional neural network slides a 5×5 convolution kernel along the time dimension, the deformable spatio-temporal convolution dynamically adjusts the standard convolution sampling positions by learning an offset field, the offset field is generated by a lightweight sub-network from the gradient responses of the current frame and its preceding and following frames, and the multi-scale spatio-temporal feature tensor is formed by splicing the feature maps output by network layers of different depths after up-sampling or down-sampling alignment, with 3 scales whose receptive fields are 32, 64 and 128 pixels respectively.
- 8. The method according to claim 7, wherein in step S130 a spatio-temporal attention weight calculation unit performs self-attention over the time dimension of the multi-scale spatio-temporal feature tensor to generate a temporal attention weight matrix.
- 9. The method of claim 7, wherein in step S140 a two-layer fully connected network is used to perform category prediction and bounding-box coordinate correction for each candidate region, the number of categories is 80, and the bounding-box correction uses a Smooth L1 loss function; cross-frame track association matches detection boxes in adjacent frames whose IoU is greater than 0.5 and whose categories are consistent, based on the Hungarian algorithm, and Kalman filtering is introduced to predict the target motion state, the state vector comprising position, velocity and size, with the covariance matrix initialized to the diagonal matrix diag([10, 10, 1, 1, 5, 5]).
- 10. A non-transitory computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the artificial intelligence-based video target detection method described above.
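Claims 3 and 9 specify the track-association step concretely: adjacent-frame detection boxes are matched when their IoU exceeds 0.5 and their categories agree, using the Hungarian algorithm, with a Kalman filter whose covariance is initialized to diag([10, 10, 1, 1, 5, 5]). The sketch below illustrates only the matching step; it replaces the Hungarian algorithm with an exhaustive optimal assignment (which yields the same result for the small per-frame problem sizes shown, assuming no more previous boxes than current boxes), and all names are illustrative rather than taken from the patent:

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_tracks(prev_boxes, curr_boxes, iou_thresh=0.5):
    """Optimal one-to-one matching maximizing total IoU; pairs whose IoU
    does not exceed `iou_thresh` are discarded, as in claim 3."""
    if not prev_boxes or not curr_boxes:
        return []
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(curr_boxes)), len(prev_boxes)):
        pairs = list(enumerate(perm))
        score = sum(iou(prev_boxes[i], curr_boxes[j]) for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return [(i, j) for i, j in best_pairs
            if iou(prev_boxes[i], curr_boxes[j]) > iou_thresh]

# Kalman state covariance initialized per claim 3, for a state
# vector (x, y, vx, vy, w, h): diag([10, 10, 1, 1, 5, 5]).
P0 = [[10.0 if r == c and r < 2 else
       1.0 if r == c and r < 4 else
       5.0 if r == c else 0.0 for c in range(6)] for r in range(6)]
```

In practice a library assignment solver (for example `scipy.optimize.linear_sum_assignment`) would replace the exhaustive search, and the Kalman predict/update steps would use `P0` as the initial covariance.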
Description
Video target detection method, system and storage medium based on artificial intelligence
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to an artificial-intelligence-based video target detection method, system and storage medium.
Background
With the rapid development of artificial intelligence technology, video target detection has become one of the core tasks in computer vision, playing an increasingly important role in key scenarios such as intelligent security, autonomous driving, industrial quality inspection and smart cities. Video target detection aims to accurately identify and locate targets of specific classes in successive video frames; its performance directly determines the reliability and level of intelligence of downstream applications. Deep-learning-based video target detection has become the mainstream technical path in recent years, modeling spatio-temporal features through convolutional neural networks or Transformer architectures to achieve efficient perception of targets in dynamic scenes. Such methods generally rely on large-scale annotated data for end-to-end training and introduce temporal modeling mechanisms on top of static image detection to improve detection accuracy and robustness.
However, the prior art still faces multiple challenges in practical application. First, traditional methods mostly adopt a frame-by-frame independent processing strategy that ignores the strong temporal correlation between video frames, resulting in high computational redundancy and difficulty in maintaining target identity consistency. Second, existing models have limited adaptability to complex dynamic scenes (such as occlusion, abrupt illumination changes and rapid motion), so detection stability is insufficient. Third, most systems lack a lightweight design, making it hard to balance real-time performance and precision when deployed on edge devices. Finally, model training is highly dependent on large amounts of manual annotation, so annotation costs are high and generalization is limited to specific scene distributions. These problems seriously restrict the large-scale deployment of video target detection technology in real, complex environments, and a new artificial-intelligence detection scheme combining temporal modeling efficiency, environmental robustness and deployment flexibility is needed.
Disclosure of Invention
The invention aims to remedy the defects of the prior art and provides an artificial-intelligence-based video target detection method, system and storage medium, which can effectively solve the problems described in the background.
To achieve this aim, the invention provides the following technical scheme. In one aspect, the artificial-intelligence-based system comprises: a video frame sequence input module for receiving an original video stream and extracting a continuous video frame sequence in temporal order; a multi-scale spatio-temporal feature extraction module for performing joint spatio-temporal modeling on the video frame sequence and extracting multi-scale spatio-temporal feature tensors containing target appearance and motion information; a dynamic target candidate region generation module; a target semantic refinement and track association module; and a model adaptive optimization module for dynamically adjusting the parameter weights of feature extraction and candidate region generation. In another aspect, the artificial-intelligence-based video target detection method comprises the steps of: receiving an original video stream and extracting a continuous video frame sequence in temporal order (step S110); performing joint spatio-temporal modeling on the video frame sequence and extracting a multi-scale spatio-temporal feature tensor containing target appearance and motion information (step S120); generating, based on the multi-scale spatio-temporal feature tensor, a set of candidate target regions that evolves dynamically over time through a spatio-temporal attention mechanism (step S130); performing semantic classification, bounding-box regression and cross-frame track consistency constraints on the candidate target region set and outputting stable target detection results with identity labels (step S140); and dynamically adjusting the parameter weights of feature extraction and candidate region generation according to the deviation between the detection results and a preset confidence threshold (step S150). The multi-scale spatio-temporal feature extraction module adopts a hybrid architecture combining a three-dimensional convolutional neural network and deformable spatio-temporal convolution.
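The "splicing after up-sampling or down-sampling alignment" step that the description and claim 1 attribute to the multi-scale feature extraction module can be sketched as follows. This is a hedged NumPy illustration using nearest-neighbour resizing with illustrative names; in the patented architecture the per-scale maps would come from the learned 3D and deformable convolutions, not from hand-built arrays:

```python
import numpy as np

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour spatial resize of a (C, H, W) feature map."""
    c, h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source col for each output col
    return fmap[:, rows][:, :, cols]

def fuse_scales(feature_maps, out_h, out_w):
    """Align per-scale maps to one spatial size and concatenate channels,
    mirroring the 'splice after up/down-sampling alignment' in claim 1."""
    aligned = [resize_nearest(f, out_h, out_w) for f in feature_maps]
    return np.concatenate(aligned, axis=0)  # channel-wise splice
```

Applied per time step to the three scales (receptive fields of 32, 64 and 128 pixels), this channel-wise splice is one plausible way the C×T×H×W tensor of claim 6 could be assembled.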