CN-117576600-B - Convolutional neural network inference optimization method based on video detection
Abstract
The invention discloses a convolutional neural network inference optimization method based on video detection. The method comprises a video frame counting module, a background frame storage module, a similarity detection module, a convolutional neural network inference calculation module and a correlation matrix calculation module. The video frame counting module records the total number of input video frames and the number of video frames input since the background frame was last updated; the background frame storage module stores background frames and their inference results and records the interval since the background frame was updated; the similarity detection module compares the similarity between an input video frame and the background frame; the convolutional neural network inference calculation module obtains the inference results of input video frames; and the correlation matrix calculation module calculates a threshold matrix and an accumulated frame matrix. The invention provides a low-cache-cost convolutional neural network inference optimization method for video tasks that requires neither a large amount of search time nor additional learning cost. By dynamically updating the background frame and comparing its similarity with input frames, the background frame's inference result is reused where reasonable, so that the amount of computation the convolutional neural network spends on surveillance video is reduced and inference time is shortened with acceptable precision loss.
Inventors
- ZHAO HONGZHI
- ZHEN XIN
- LIU SHUN
Assignees
- 北京交通大学 (Beijing Jiaotong University)
Dates
- Publication Date: 2026-05-08
- Application Date: 2022-08-04
Claims (6)
- 1. A convolutional neural network inference optimization method based on video detection, characterized by comprising a video frame counting module, a convolutional neural network inference calculation module, a similarity detection module and a correlation matrix calculation module. The video frame counting module records the total number of input video frames and the number of video frames input since the background frame was last updated; the convolutional neural network inference calculation module stores background frames, calculates and stores their inference results, and records the time elapsed from one background-frame update to the next; the similarity detection module compares the similarity between the input video frame and the background frame; the correlation matrix calculation module calculates a threshold matrix and an accumulated frame matrix. The method comprises the following working steps:
  S1, a video frame is input; go to step S2.
  S2, the video frame counting module adds 1 to the total input video frame count N; go to step S3.
  S3, judge whether N is greater than 1: if N = 1, go to step S4; if N > 1, go to step S10.
  S4, the convolutional neural network inference calculation module sets the video frame at N = 1 as the initial background frame, saves it as the background frame, calculates and stores its inference result, and starts timing Time; go to step S5.
  S5, the correlation matrix calculation module calculates the accumulated frame matrix; go to step S6.
  S6, the convolutional neural network inference calculation module judges whether Time has reached the set update duration t: if Time < t, go to step S7; if Time ≥ t, go to step S8.
  S7, judge whether another video frame is input: if yes, go to step S1; otherwise, end.
  S8, the convolutional neural network inference calculation module reads the accumulated frame matrix from the correlation matrix calculation module, calculates the historical average frame, sets it as the background frame and saves it, calculates and stores its inference result, and sets Time to 0 to restart timing; go to step S9.
  S9, the correlation matrix calculation module resets every element of the threshold matrix to 0; go to step S20.
  S10, the video frame counting module adds 1 to N′; go to step S11.
  S11, judge whether N′ is greater than 1: if N′ = 1, go to step S15; if N′ > 1, go to step S12.
  S12, the similarity detection module calculates the similarity matrix between the input video frame and the background frame, compares it element by element with the threshold matrix, and counts the number num of elements greater than the corresponding threshold-matrix elements; go to step S13.
  S13, the similarity detection module judges whether num is greater than the similarity threshold variable α: if num ≤ α, go to step S14; if num > α, go to step S17.
  S14, the similarity detection module sets the similarity flag variable to indicate that the frames are similar; go to step S15.
  S15, the convolutional neural network inference calculation module reads the stored inference result of the background frame and outputs it directly; go to step S16.
  S16, the correlation matrix calculation module calculates and updates the threshold matrix and the accumulated frame matrix; go to step S6.
  S17, the similarity detection module sets the similarity flag variable to indicate that the frames are dissimilar; go to step S18.
  S18, the convolutional neural network inference calculation module sets the input video frame as the background frame and saves it, calculates and saves its inference result, and sets Time to 0 to restart timing; go to step S19.
  S19, the correlation matrix calculation module resets every element of the threshold matrix to 0 and calculates the updated accumulated frame matrix; go to step S20.
  S20, the video frame counting module sets N′ to 0; go to step S6.
  In step S16, the correlation matrix calculation module calculates the threshold matrix from the average change of the pixel value at each pixel point over all historical video frames from the background frame to the current input video frame; the accumulated frame matrix is the sum of the pixel values at each pixel point over all those historical video frames.
- 2. The convolutional neural network inference optimization method based on video detection as set forth in claim 1, wherein the video frame counting module records the total number N of input video frames and the number N′ of video frames input since the background frame was last updated; the initial values of N and N′ are both 0, and background frames themselves are not counted in N′. N increases by one for every input video frame; N′ increases by one for every video frame input after a background-frame update; and N′ is reset to 0 whenever the background frame stored in the convolutional neural network inference calculation module is updated.
- 3. The method for optimizing convolutional neural network inference based on video detection of claim 1, wherein the convolutional neural network inference calculation module stores the background frame, runs the convolutional neural network algorithm to calculate and store the inference result of the background frame, directly reuses that stored inference result for video frames similar to the background frame, and records the time elapsed from one background-frame update to the next, so as to ensure that the background frame is refreshed in time rather than left un-updated for a long period.
- 4. The method for inference optimization of convolutional neural network based on video detection as set forth in claim 1, wherein the similarity detection module subtracts the pixel values at each corresponding pixel point of the input video frame and the background frame to obtain a similarity matrix, and compares the similarity matrix element by element with the threshold matrix of the correlation matrix calculation module to judge the similarity between the input video frame and the background frame.
- 5. The method of convolutional neural network inference optimization based on video detection as set forth in claim 1, wherein the calculation flow comprises three phases: an initialization phase, a similar-frame processing phase and a background-frame updating phase. The initialization phase comprises steps S4 and S5; the similar-frame processing phase comprises steps S14, S15 and S16; and the background-frame updating phase comprises steps S8, S9, S17, S18, S19 and S20. Every time the background frame is updated, the video frame counting module resets N′ to 0, the convolutional neural network inference calculation module resets Time to 0 to restart timing, and the correlation matrix calculation module resets the threshold matrix to a zero matrix.
- 6. The method for convolutional neural network inference optimization based on video detection as set forth in claim 1, wherein in step S8 the convolutional neural network inference calculation module reads the accumulated frame matrix from the correlation matrix calculation module, divides it by N′ to obtain the historical average frame, sets the historical average frame as the background frame and saves it, and runs the convolutional neural network algorithm to obtain and store the inference result of the new background frame.
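The working flow of steps S1–S20 in claim 1 can be sketched in pure Python as below. This is a minimal illustration, not the patented implementation: the function and variable names (`video_inference`, `infer`, `accum`), the frame-count stand-in for the wall-clock `Time`, and the exact threshold-matrix formula (here the absolute difference between the running per-pixel average and the background frame) are assumptions filled in where the source is ambiguous.

```python
def video_inference(frames, infer, update_interval=10, alpha=4):
    """Sketch of the S1-S20 control flow: cache one background frame and its
    inference result, and reuse that result for sufficiently similar frames."""
    results = []
    background = None   # cached background frame (2-D grayscale list)
    bg_result = None    # cached inference result of the background frame
    elapsed = 0         # frame-count stand-in for the patent's wall-clock Time
    n_prime = 0         # N': frames seen since the last background update
    threshold = None    # threshold matrix (per-pixel change tolerance)
    accum = None        # accumulated frame matrix (per-pixel sums)

    for n, frame in enumerate(frames, start=1):          # S1-S2: count frames
        h, w = len(frame), len(frame[0])
        if n == 1:                                       # S3-S5: initialise
            background, bg_result = frame, infer(frame)
            accum = [[0.0] * w for _ in range(h)]
            threshold = [[0.0] * w for _ in range(h)]
            results.append(bg_result)
            continue
        n_prime += 1                                     # S10
        if n_prime == 1:                                 # S11: reuse right after an update
            num = 0
        else:                                            # S12: |difference| vs threshold matrix
            num = sum(1 for i in range(h) for j in range(w)
                      if abs(frame[i][j] - background[i][j]) > threshold[i][j])
        if num <= alpha:                                 # S13-S16: similar -> reuse result
            results.append(bg_result)
            for i in range(h):
                for j in range(w):
                    accum[i][j] += frame[i][j]
                    # one plausible reading of the "average change" threshold
                    threshold[i][j] = abs(accum[i][j] / n_prime - background[i][j])
        else:                                            # S17-S20: dissimilar -> new background
            background, bg_result = frame, infer(frame)
            accum = [[0.0] * w for _ in range(h)]
            threshold = [[0.0] * w for _ in range(h)]
            elapsed, n_prime = 0, 0
            results.append(bg_result)
        elapsed += 1
        if elapsed >= update_interval and n_prime > 0:   # S6, S8-S9: periodic refresh
            background = [[accum[i][j] / n_prime for j in range(w)] for i in range(h)]
            bg_result = infer(background)
            accum = [[0.0] * w for _ in range(h)]
            threshold = [[0.0] * w for _ in range(h)]
            elapsed, n_prime = 0, 0
    return results
```

With a static scene the expensive `infer` call runs only when the background frame is initialised, periodically refreshed, or invalidated by a dissimilar frame, which is the computation saving the method aims at.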
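Claim 4's element-wise similarity test can be sketched as follows. The function name is illustrative, and taking the absolute value of the pixel difference is an assumption (the claim says only that corresponding pixel values are subtracted):

```python
def count_changed_pixels(frame, background, threshold_matrix):
    """Claim 4 sketch: subtract corresponding pixels of the input frame and
    the background frame to form a similarity (difference) matrix, compare it
    element by element with the threshold matrix, and count the elements num
    that exceed their threshold."""
    num = 0
    for row_f, row_b, row_t in zip(frame, background, threshold_matrix):
        for f, b, t in zip(row_f, row_b, row_t):
            if abs(f - b) > t:  # absolute difference is an assumption
                num += 1
    return num
```

The resulting `num` is then compared against the similarity threshold variable α in step S13 to decide between reusing the cached result and updating the background frame.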
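Claim 6's construction of the new background frame, the accumulated frame matrix divided by N′, can be sketched as follows (names are illustrative):

```python
def historical_average_frame(accumulated, n_prime):
    """Claim 6 sketch: divide the accumulated frame matrix (per-pixel sums
    over the N' frames since the last update) by N' to obtain the
    historical average frame used as the new background frame."""
    if n_prime <= 0:
        raise ValueError("need at least one accumulated frame")
    return [[pixel_sum / n_prime for pixel_sum in row] for row in accumulated]
```

Averaging the N′ frames accumulated since the last update smooths out transient motion, so the refreshed background tracks slow scene drift rather than mirroring any single frame.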
Description
Technical Field
The invention belongs to the field of convolutional neural network algorithms, and more particularly relates to a convolutional neural network inference optimization method based on video detection.
Background
Convolutional neural networks (CNN, Convolutional Neural Network) have become representative of the deep learning field. In the development of convolutional neural networks, researchers have adopted deeper and more complex network structures to improve model accuracy, so the parameter counts and amounts of computation of the models have grown continuously; the computation of network models such as VGG16 and VGG19 even exceeds 100 GFLOPs. Currently, when a convolutional neural network performs inference on video, every frame of the video is fed into the network model for inference calculation, so the amount of computation of the convolutional neural network directly affects the processing efficiency of a video inference task. In recent years the Internet of Things has developed rapidly, and in order to respond quickly to user requests and protect private user data, deploying and running deep learning algorithms such as convolutional neural networks on resource-limited Internet of Things terminal devices has gradually become a development trend. It is therefore desirable to reduce the computation of the convolutional neural network model as much as possible while keeping the loss of model accuracy within an acceptable range, reducing memory occupation and shortening model inference time.
When the video source is a real-time video surveillance system, there is a certain correlation in the time domain between the pixels of adjacent frames of the image sequence; this correlation is also called the temporal locality of the video. When the consecutive frames fed into the network contain no obviously moving objects, a large number of similar scenes appear; when objects do move, the correlated pixel values fluctuate strongly. Currently, some research exploits the temporal and spatial locality of adjacent frames in video detection tasks to optimize the inference computation of convolutional neural networks. Three main schemes exist. One caches the per-layer calculation results of key video frames, performs block search for similar regions, and reuses intermediate results for those regions. Another also caches the per-layer calculation results of key video frames, but updates only the calculation results of the regions corresponding to changed pixels, as determined from the video frame's changed pixels. A third caches part of the key intermediate calculation results and binds them to the network output; if during inference any intermediate result matches a previous one, the remaining computation is skipped and the result is output directly. These schemes trade space for time, but optimizing the network this way adds a large amount of extra cache space to store data while increasing search cost, and building a hash-table structure to manage the data adds further model cache and model learning cost. All of this brings complications and limitations when deploying on devices whose resources are themselves limited.
Disclosure of the Invention
1. Technical problem to be solved
In most surveillance scenes the camera is stationary for a substantial portion of the time, there are no moving people or objects in the picture, and the background is static or changes slowly. Aiming at the defects of the prior art, the invention uses this characteristic of slowly changing video-frame backgrounds to provide a convolutional neural network inference optimization scheme with low cache cost, so as to reduce the amount of computation in the convolutional neural network inference process, increase the speed of inference calculation, and reduce memory occupation on resource-limited devices.
2. Technical solution
The invention provides a convolutional neural network inference optimization method based on video detection, which aims to use the spatio-temporal locality between surveillance video frames to construct the long-static regions of a video into a background frame, cache the background frame together with the final inference result the convolutional neural network produces for it, compare the similarity between the current video frame and the background frame, and directly reuse the final inference result of the background frame when the