CN-121982657-A - Video stream processing method and device for video monitoring scene based on YOLO model

Abstract

The invention discloses a video stream processing method and device for video monitoring scenes based on a YOLO model. The method comprises: collecting a video stream in real time and extracting multi-scale motion information and spatial structure information of the video stream to output a temporal saliency map; extracting basic visual features of the video stream to output a basic feature map and dividing the basic feature map into spatial blocks; dividing the temporal saliency map into spatial blocks, determining the saliency category of each spatial block, and generating a corresponding control signal for each spatial block of the basic feature map according to its saliency category; and performing conditional computation on each spatial block of the basic feature map according to the control signals and outputting a target detection result. For edge video analysis scenarios, the method identifies inter-frame motion and scene changes and allocates computing resources dynamically, markedly reducing computational cost and power consumption while preserving detection accuracy for key targets.

Inventors

  • TANG WENKAI
  • CHENG HAO

Assignees

  • 上海芯力基半导体有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-09

Claims (13)

  1. A video stream processing method for a video monitoring scene based on a YOLO model, characterized in that it is used for detecting target detection objects in the video monitoring scene, performing conditional computation separately on target detection objects and non-target detection objects, and outputting a target detection result, the method comprising the following steps: collecting a video stream in real time and extracting multi-scale motion information and spatial structure information of the video stream to output a temporal saliency map, wherein the multi-scale motion information is the difference features of the video stream and the spatial structure information is the original frame features of the video stream; extracting basic visual features of the video stream to output a basic feature map and dividing the basic feature map into spatial blocks; dividing the temporal saliency map into spatial blocks, determining the saliency category of each spatial block, and generating a corresponding control signal for each spatial block of the basic feature map according to its saliency category, wherein the spatial blocks of the basic feature map correspond one-to-one with the spatial blocks of the temporal saliency map, the saliency categories of the spatial blocks of the temporal saliency map comprise high-saliency blocks, medium-saliency blocks, low-saliency blocks and extremely-low-saliency blocks, and each control signal designates the computation mode to be executed on the corresponding spatial block of the basic feature map; and performing conditional computation on each spatial block of the basic feature map according to the control signals and outputting a target detection result, wherein the target detection result is the video stream annotated with the positions of the target detection objects, and the conditional computation consists of performing complete computation on spatial blocks of the basic feature map corresponding to high-saliency blocks, performing no computation on spatial blocks corresponding to extremely-low-saliency blocks, and performing sparse computation on spatial blocks corresponding to medium-saliency and low-saliency blocks.
  2. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 1, wherein the difference features are obtained by pixel-level subtraction and weighted summation and comprise, for each frame image in the video stream, the adjacent frame difference, the interval frame difference and the accumulated weighted difference over N frame images, and the temporal saliency map is obtained as follows: a video frame sequence of size N×H×W×K is input; the adjacent frame differences, interval frame differences and accumulated weighted differences of the N frame images of the sequence are calculated; the original frame features of the video stream and the calculated adjacent frame differences, interval frame differences and accumulated weighted differences are concatenated and fused along the channel dimension to form an input tensor; a 3D depthwise separable convolution operation and temporal compression are applied to the input tensor; and a temporal saliency map of size 1×H/4×W/4 is output, wherein N is the number of frames of the input video stream, H is the picture height, W is the picture width and K is the channel dimension.
  3. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 2, wherein the adjacent frame difference is calculated as Frame_t − Frame_{t−1} and the interval frame difference is calculated as Frame_t − Frame_{t−2}, where Frame_t denotes the features of the t-th frame image, Frame_{t−1} the features of the (t−1)-th frame image, Frame_{t−2} the features of the (t−2)-th frame image, and t ≥ 5.
  4. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 2, wherein the 3D depthwise separable convolution operation and the temporal compression are performed by a neural network formed by stacking three 3D depthwise separable convolution layers and one 2D depthwise separable convolution layer, the three 3D layers being the 1st, 2nd and 3rd 3D depthwise separable convolution layers; the 1st 3D depthwise separable convolution layer has a convolution kernel size of (3, 3, 3), a stride of (1, 2, 2) and 32 convolution channels; the 2nd 3D depthwise separable convolution layer has a kernel size of (3, 3, 3), a stride of (1, 2, 2) and 64 channels; the 3rd 3D depthwise separable convolution layer has a kernel size of (3, 3, 3), a stride of (1, 1, 1) and 64 channels; and the 2D depthwise separable convolution layer has a kernel size of (1, 1) and 1 channel, and compresses the time dimension.
  5. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 4, wherein the 1st 3D depthwise separable convolution layer outputs a feature map of size N×H/2×W/2×Q1, where Q1 is the number of channels of the 1st 3D depthwise separable convolution layer; the 2nd 3D depthwise separable convolution layer outputs a feature map of size N×H/4×W/4×Q2, where Q2 is the number of channels of the 2nd 3D depthwise separable convolution layer; and the 3rd 3D depthwise separable convolution layer outputs a feature map of size N×H/4×W/4×Q3, where Q3 is the number of channels of the 3rd 3D depthwise separable convolution layer.
  6. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 4, wherein the 1st 3D depthwise separable convolution layer, the 2nd 3D depthwise separable convolution layer, the 3rd 3D depthwise separable convolution layer and the 2D depthwise separable convolution layer are connected in sequence, and the 1st 3D depthwise separable convolution layer and the 2nd 3D depthwise separable convolution layer are each additionally connected to the 2D depthwise separable convolution layer through skip connections.
  7. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 1, wherein the temporal saliency map is divided into spatial blocks, the saliency category of each spatial block is determined, and a corresponding control signal is generated for each spatial block of the basic feature map according to its saliency category, as follows: the temporal saliency map is divided into spatial blocks of 4×4 pixels, each spatial block containing 16 pixels; the maximum saliency value S_max, the average saliency value S_avg and the static count C_static of each spatial block of the temporal saliency map are calculated; if S_max of the current spatial block exceeds the threshold T_high, the current spatial block is judged to be a high-saliency block and a first control signal is generated for it; otherwise, C_static of the current spatial block is compared with the frame count N of the video stream, and if S_max ≤ T_high and C_static ≥ N, the current spatial block is judged to be an extremely-low-saliency block and a second control signal is generated for it; if S_max ≤ T_high and C_static < N, S_avg of the current spatial block is compared with the threshold T_low: if S_avg > T_low, the current spatial block is judged to be a medium-saliency block and a third control signal is generated for it, and if S_avg ≤ T_low, the current spatial block is judged to be a low-saliency block and a fourth control signal is generated for it.
  8. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 7, wherein the maximum saliency value S_max of each spatial block of the temporal saliency map is the maximum of the saliency values of the 16 pixels in that block, the average saliency value S_avg of each spatial block is the mean of the saliency values of the 16 pixels in that block, and the static count C_static of each spatial block is the number of times the current spatial block has consecutively been judged to be a low-saliency block and/or an extremely-low-saliency block.
  9. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 8, wherein the saliency value of each pixel in each spatial block of the temporal saliency map is calculated by the following formula: s = σ(Conv2D_(1,1)(β3(β2(β1(Concat(M1, M2, M3, F)))))), where s is the saliency value of each pixel in each spatial block of the temporal saliency map, with a value range of 0 to 1, σ() is the sigmoid activation function, Conv2D_(1,1) is a 2D convolution operation with a (1, 1) kernel, Concat() is the channel-wise concatenation function, β1, β2 and β3 are the operations of the first, second and third layers respectively, M1 is the adjacent frame difference, M2 is the interval frame difference, M3 is the accumulated weighted difference, and F is the original frame features.
  10. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 7, wherein conditional computation is performed on each spatial block of the basic feature map according to the control signals as follows: complete computation is performed on spatial blocks of the basic feature map corresponding to high-saliency blocks according to the first control signal; no computation is performed on spatial blocks of the basic feature map corresponding to extremely-low-saliency blocks according to the second control signal, and the feature maps of those blocks are cached in the system for direct inter-frame reuse in subsequent frames; and the third and fourth control signals trigger sparse computation; the complete computation comprises performing the standard convolution computation of all subsequent layers, performing the complete upsampling and fusion operations in the feature pyramid network, performing the computation of all attention heads if an attention module is present, and giving priority to the computation tasks of high-saliency blocks in resource scheduling.
  11. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 10, wherein the sparse computation performed according to the third control signal comprises, for spatial blocks of the basic feature map corresponding to medium-saliency blocks, performing the subsequent depthwise separable convolution computation of all layers and performing the complete upsampling and fusion operations in the feature pyramid network; and the sparse computation performed according to the fourth control signal comprises, for spatial blocks of the basic feature map corresponding to low-saliency blocks, performing the subsequent depthwise separable convolution computation of all layers, skipping the complete upsampling and fusion operations in the feature pyramid network, and switching off the attention mechanism.
  12. The video stream processing method for a video monitoring scene based on a YOLO model according to claim 10, wherein the original frame features are obtained by directly reading the RGB three-channel pixel values of the input video stream, and the basic feature map is obtained by performing multi-level downsampling and feature extraction on a single frame image through the standard 2D convolution, batch normalization, activation function and pooling layers in the backbone network of the YOLO model.
  13. A video stream processing device for a video monitoring scene based on a YOLO model, characterized in that the device processes a video stream using the video stream processing method for a video monitoring scene based on a YOLO model according to any one of claims 1 to 12; the device comprises a temporal saliency perception module and a structured sparse execution module, and the YOLO model comprises a backbone network, a feature pyramid and a detection head; the temporal saliency perception module is connected in parallel with the backbone network; the backbone network is configured to extract basic visual features of the video stream to output a basic feature map; the temporal saliency perception module is configured to extract multi-scale motion information and spatial structure information of the video stream to output a temporal saliency map; the structured sparse execution module is configured to divide the temporal saliency map into spatial blocks, determine the saliency category of each spatial block, and generate, according to the saliency categories, corresponding control signals for scheduling the conditional computation performed on the basic feature map in the backbone network, the feature pyramid and/or the detection head; and the feature pyramid and the detection head are configured to compute the target detection result from the conditionally computed features.
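The sketches below illustrate the computations named in the claims. They are minimal Python reconstructions, not part of the patent text, and every identifier, threshold, weight and framework choice in them is an assumption. A first sketch covers the three difference features of claims 2 and 3; the decay weights of the accumulated weighted difference are assumed, since the claims only name the quantity.

    import numpy as np

    def difference_features(frames: np.ndarray, t: int, weights=None):
        """frames: (N, H, W) sequence of frame features; t >= 5 per claim 3.
        Returns the adjacent, interval and accumulated weighted differences."""
        adjacent = frames[t] - frames[t - 1]   # Frame_t - Frame_{t-1}
        interval = frames[t] - frames[t - 2]   # Frame_t - Frame_{t-2}
        if weights is None:
            # assumed decay weights over the last 5 frames (they sum to 1)
            weights = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
        accumulated = sum(w * (frames[t] - frames[t - 1 - i])
                          for i, w in enumerate(weights))
        return adjacent, interval, accumulated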
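A second sketch, in PyTorch, follows the layer stack of claims 4-6: three 3D depthwise separable convolution layers, skip connections from layers 1 and 2 to a final (1, 1) 2D layer, and temporal compression. How the skips are spatially aligned and fused, and the use of mean pooling over time before the 2D layer, are assumptions.

    import torch
    import torch.nn as nn

    class DWSepConv3d(nn.Module):
        """Depthwise separable 3D convolution: depthwise conv + 1x1x1 pointwise."""
        def __init__(self, c_in, c_out, stride):
            super().__init__()
            self.dw = nn.Conv3d(c_in, c_in, kernel_size=3, stride=stride,
                                padding=1, groups=c_in)
            self.pw = nn.Conv3d(c_in, c_out, kernel_size=1)
            self.act = nn.ReLU()

        def forward(self, x):
            return self.act(self.pw(self.dw(x)))

    class SaliencyHead(nn.Module):
        def __init__(self, c_in):
            super().__init__()
            self.l1 = DWSepConv3d(c_in, 32, stride=(1, 2, 2))  # -> N x H/2 x W/2
            self.l2 = DWSepConv3d(32, 64, stride=(1, 2, 2))    # -> N x H/4 x W/4
            self.l3 = DWSepConv3d(64, 64, stride=(1, 1, 1))    # -> N x H/4 x W/4
            self.skip1 = nn.AvgPool3d((1, 2, 2))  # align the layer-1 skip to H/4
            self.fuse = nn.Conv2d(32 + 64 + 64, 1, kernel_size=1)

        def forward(self, x):                      # x: (B, K, N, H, W)
            f1 = self.l1(x)
            f2 = self.l2(f1)
            f3 = self.l3(f2)
            cat = torch.cat([self.skip1(f1), f2, f3], dim=1)
            cat = cat.mean(dim=2)                  # compress the time dimension
            return torch.sigmoid(self.fuse(cat))   # (B, 1, H/4, W/4), values in 0-1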
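A third sketch implements the block classification of claims 7 and 8. The threshold values t_high and t_low and the exact counter update rule are assumptions.

    import numpy as np

    def classify_blocks(saliency, c_static, n_frames, t_high=0.5, t_low=0.2):
        """saliency: (H/4, W/4) temporal saliency map with values in 0-1;
        c_static: per-block consecutive-static counters, updated in place.
        Returns a grid of control signals 1..4 per claim 7."""
        h, w = saliency.shape
        blocks = saliency.reshape(h // 4, 4, w // 4, 4)
        s_max = blocks.max(axis=(1, 3))             # per-block maximum (claim 8)
        s_avg = blocks.mean(axis=(1, 3))            # per-block average (claim 8)
        signal = np.full(s_max.shape, 4, dtype=np.int8)  # default: low saliency
        signal[s_avg > t_low] = 3                   # medium saliency
        signal[c_static >= n_frames] = 2            # extremely low (long static)
        signal[s_max > t_high] = 1                  # high saliency takes precedence
        # count consecutive low / extremely-low judgements, reset otherwise
        static_now = (signal == 2) | (signal == 4)
        c_static[:] = np.where(static_now, c_static + 1, 0)
        return signal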
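A fourth sketch shows the per-block dispatch of claims 10 and 11. The 'layers' bundle and its callables (full, dwsep, fpn) are hypothetical interfaces standing in for the YOLO stages.

    def conditional_compute(block_feat, signal, cache, key, layers):
        """Dispatch one spatial block of the basic feature map by control signal."""
        if signal == 1:    # high: full standard convs + FPN (+ all attention heads)
            out = layers.fpn(layers.full(block_feat))
        elif signal == 2:  # extremely low: skip computation, reuse cached features
            return cache[key]
        elif signal == 3:  # medium: depthwise separable convs, keep FPN fusion
            out = layers.fpn(layers.dwsep(block_feat))
        else:              # low: depthwise separable convs only, skip FPN, no attention
            out = layers.dwsep(block_feat)
        cache[key] = out   # buffer for inter-frame reuse (claim 10)
        return out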
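Finally, a sketch of the claim-13 device wiring, with all module callables assumed: the temporal saliency perception branch runs alongside the YOLO backbone, and the structured sparse execution module turns the block categories into the control signals consumed downstream.

    def process_frame_window(frames, backbone, saliency_head, classify, detect):
        """frames: the most recent N video frames of the stream."""
        base_feat = backbone(frames[-1])   # basic feature map from the current frame
        sal_map = saliency_head(frames)    # temporal saliency map, 1 x H/4 x W/4
        signals = classify(sal_map)        # control signals 1..4 per spatial block
        return detect(base_feat, signals)  # conditional computation + detection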

Description

Video stream processing method and device for video monitoring scene based on YOLO model

Technical Field

The invention relates to the technical field of computers, and in particular to a video stream processing method and device for video monitoring scenes based on a YOLO model.

Background

With the deep integration of artificial intelligence and edge computing, video analytics applications have migrated from centralized cloud processing to distributed edge-side perception. In edge video analysis scenarios such as intelligent security, autonomous driving, unmanned aerial vehicle inspection and unmanned ship monitoring, devices must perform real-time target detection on continuous video streams locally. These edge devices (e.g., embedded AI chips, mobile GPUs, NPUs) are severely limited in computing power, memory capacity and power budget. How to balance high-precision real-time detection against limited hardware resources has become a key bottleneck restricting the large-scale deployment of edge video analysis technology. The current mainstream technical route focuses on model compression and optimization, including lightweight structural design (such as MobileNet and ShuffleNet), knowledge distillation, network pruning and parameter quantization; these techniques globally optimize the model at the training stage to produce smaller and faster model versions suited to edge deployment. Existing optimization methods are static and offline in nature: each frame is processed independently, and the high correlation between consecutive frames is ignored entirely. In an actual video scene, large background areas remain unchanged between frames, yet existing methods still recompute them frame by frame, causing serious temporal redundancy in video stream computation and wasting resources.

Disclosure of Invention

The invention aims to provide a video stream processing method and device for video monitoring scenes based on a YOLO model that can identify inter-frame motion and scene changes in edge video analysis scenarios, dynamically allocate computing resources, and markedly reduce computational cost and power consumption while preserving detection accuracy for key targets.
To achieve the above purpose, the invention adopts the following technical scheme. A video stream processing method for a video monitoring scene based on a YOLO model is used for detecting target detection objects in the video monitoring scene, performing conditional computation separately on target detection objects and non-target detection objects, and outputting a target detection result; the method comprises the following steps: collecting a video stream in real time and extracting multi-scale motion information and spatial structure information of the video stream to output a temporal saliency map, wherein the multi-scale motion information is the difference features of the video stream and the spatial structure information is the original frame features of the video stream; extracting basic visual features of the video stream to output a basic feature map and dividing the basic feature map into spatial blocks; dividing the temporal saliency map into spatial blocks, determining the saliency category of each spatial block, and generating a corresponding control signal for each spatial block of the basic feature map according to its saliency category, wherein the spatial blocks of the basic feature map correspond one-to-one with the spatial blocks of the temporal saliency map, the saliency categories comprise high-saliency, medium-saliency, low-saliency and extremely-low-saliency blocks, and each control signal designates the computation mode to be executed on the corresponding spatial block of the basic feature map; and performing conditional computation on each spatial block of the basic feature map according to the control signals and outputting a target detection result, wherein the target detection result is the video stream annotated with the positions of the target detection objects, and the conditional computation consists of performing complete computation on spatial blocks of the basic feature map corresponding to high-saliency blocks, performing no computation on spatial blocks corresponding to extremely-low-saliency blocks, and performing sparse computation on spatial blocks corresponding to medium-saliency and low-saliency blocks. By extracting temporal saliency information from the input video stream to identify inter-frame change regions, this design markedly enhances the perception of moving objects while suppressing responses to the static background, so that the network can simultaneously perceive a static structure (