CN-122019432-A - NPU-based edge computing video stream processing method and system
Abstract
The invention discloses an NPU-based edge computing video stream processing method and system. The method comprises video stream data acquisition and writing, zero-copy hardware preprocessing, model loading and asynchronous inference, and result retrieval and post-processing. By constructing a hardware-level architecture that decouples control from computation and dispatching inference to a dedicated NPU, the invention greatly reduces CPU load and power consumption and eliminates the frequency throttling and system hangs caused by CPU overheating. By establishing a full-link zero-copy data channel based on DMA and RGA hardware, the processing latency of the video stream is significantly reduced, enabling stable real-time analysis at high frame rates. By adopting operator fusion and INT8 hybrid quantization, the model's compute and memory-access efficiency on the NPU is doubled, yielding a higher energy-efficiency ratio and throughput while preserving accuracy. As a result, the edge device can operate stably at low temperature for long periods in a sealed enclosure without fan cooling.
Inventors
- Lai Xifei
- Zhang Zheng
- Liu Chang
- Sun Feng
- Feng Lei
Assignees
- 宁波兴博元智能技术有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-22
Claims (9)
- 1. An NPU-based edge computing video stream processing method, characterized by comprising the following steps: S1, video stream data acquisition and writing: controlling an image acquisition device to acquire video stream data, and writing the acquired data directly into a pre-established physically contiguous shared memory pool via direct memory access (DMA); S2, zero-copy hardware preprocessing: invoking a dedicated hardware preprocessing unit integrated in the SoC to directly preprocess the video stream data stored in the physically contiguous shared memory pool, generating tensor data that meets the input requirements of a target neural network model, and writing the processed data directly into the NPU's input memory region; S3, model loading and asynchronous inference: loading the target neural network model, after computational graph optimization, operator fusion and low-precision quantization, onto the on-board NPU; the central processing unit (CPU) asynchronously issues an inference command for the tensor data to the NPU, and the NPU independently executes the inference computation task; S4, result retrieval and post-processing: after the NPU completes the inference computation, notifying the CPU via an interrupt mechanism to retrieve the structured inference result, and the CPU post-processing the retrieved inference result.
- 2. The method according to claim 1, wherein in step S2, the dedicated hardware preprocessing unit is a 2D graphics acceleration engine (RGA), the preprocessing includes at least color space conversion and image scaling, and the physically contiguous shared memory pool is allocated and managed via the direct rendering manager (DRM) and DMA-BUF mechanisms of the Linux operating system.
- 3. The NPU-based edge computing video stream processing method according to claim 1, wherein in step S3, the computational graph optimization and operator fusion performed on the target neural network model include identifying consecutive computation nodes in the model and merging a convolution layer, a batch normalization layer and an activation layer into a single composite operator, so that intermediate feature data flows within the NPU's internal cache and write-backs to external dynamic random access memory (DDR) are avoided.
- 4. The NPU-based edge computing video stream processing method according to claim 1 or 3, wherein in step S3, the low-precision quantization uses either full-integer INT8 quantization based on KL-divergence calibration, or a hybrid quantization strategy in which precision-sensitive network layers of the model are kept in FP16 format while the feature extraction layers of the model are quantized to INT8 format.
- 5. The NPU-based edge computing video stream processing method according to claim 1, wherein in step S3, the CPU asynchronously issuing an inference command to the NPU specifically means that, after calling the NPU operation interface to issue the inference command, the CPU immediately returns to execute other control tasks or enters a suspended state, without blocking to wait for the NPU inference to complete.
- 6. The method according to claim 1 or 5, further comprising a dynamic voltage and frequency scaling (DVFS) step: monitoring the NPU load and/or the SoC temperature in real time, and dynamically adjusting the operating voltage and frequency of the NPU according to the monitoring result.
- 7. An NPU-based edge computing video stream processing system running on a system-on-chip (SoC) integrated with an NPU and a dedicated hardware preprocessing unit, comprising: a zero-copy data channel management module, used to create and manage a physically contiguous shared memory pool, to control the writing of image acquisition data directly into the shared memory pool via DMA, and to control the dedicated hardware preprocessing unit to directly read and write data in the shared memory pool; a model optimization and loading module, used to perform computational graph optimization, operator fusion and low-precision quantization on the target neural network model, and to load the optimized model onto the NPU; an asynchronous inference scheduling module, used to let the CPU asynchronously issue inference tasks to the NPU and to receive the returned result after the NPU completes the computation; and a system tuning module, used to dynamically adjust the operating state of the NPU according to the NPU load and/or the SoC temperature.
- 8. The system of claim 7, wherein the dedicated hardware preprocessing unit is a 2D graphics acceleration engine (RGA), and the zero-copy data channel management module implements reservation and management of the physically contiguous shared memory pool through the direct rendering manager (DRM) subsystem and the contiguous memory allocator (CMA) mechanism of the Linux kernel.
- 9. The system of claim 7, wherein the asynchronous inference scheduling module comprises an asynchronous callback unit and a dynamic frequency scaling unit, the asynchronous callback unit being used to implement non-blocking calls from the CPU and interrupt-driven completion notification from the NPU, and the dynamic frequency scaling unit being used to execute the decisions of the system tuning module and adjust the NPU's voltage and frequency in real time.
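The Conv + BatchNorm merging described in claim 3 is a standard weight-folding transformation. The sketch below (NumPy, illustrative only; the function name and shapes are our own, not from the patent) shows why the fused operator is mathematically equivalent to running the two layers separately.

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer (gamma, beta, running mean/var) into the
    preceding convolution's weights W (out_ch, in_ch, kh, kw) and bias b.
    After folding, a single convolution reproduces Conv followed by BN."""
    scale = gamma / np.sqrt(var + eps)        # per-output-channel BN scale
    W_fused = W * scale[:, None, None, None]  # scale each output filter
    b_fused = (b - mean) * scale + beta       # fold running stats into bias
    return W_fused, b_fused
```

Because the fused operator produces bit-for-bit (up to floating-point rounding) the same output as the Conv/BN pair, intermediate feature maps never need to leave the NPU's on-chip cache, which is the DDR-traffic saving the claim describes.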
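The KL-divergence calibration of claim 4 can be illustrated as a threshold search over an activation histogram: for each candidate saturation threshold, clip the tail, approximate the clipped distribution with 128 quantization levels, and keep the threshold whose approximation diverges least from the original. The following is a simplified sketch of that idea (parameter choices and the coarse search stride are our own assumptions, not the patent's procedure).

```python
import numpy as np

def kl_calibrate(acts, bins=2048, levels=128, stride=16):
    """Pick a saturation threshold for symmetric INT8 quantization by
    minimizing the KL divergence between the FP32 activation histogram
    and its 128-level quantized approximation (simplified sketch)."""
    hist, edges = np.histogram(np.abs(acts), bins=bins)
    best_t, best_kl = edges[-1], np.inf
    for i in range(levels, bins + 1, stride):
        p = hist[:i].astype(float)
        p[-1] += hist[i:].sum()               # clip the tail into the last bin
        # Expand the `levels`-bucket approximation back over i bins,
        # spreading each bucket's mass uniformly over its nonzero bins.
        q_parts = []
        for c in np.array_split(p, levels):
            nz = (c > 0)
            q_parts.append(np.full(len(c), c.sum() / max(nz.sum(), 1)) * nz)
        q = np.concatenate(q_parts)
        pp = p / p.sum()
        qq = q / max(q.sum(), 1e-12)
        mask = pp > 0
        kl = np.sum(pp[mask] * np.log(pp[mask] / np.maximum(qq[mask], 1e-12)))
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t  # INT8 scale would then be best_t / 127
```

In a hybrid scheme, layers whose calibrated KL loss remains high would be the "precision-sensitive" layers kept in FP16, while the rest are quantized to INT8.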
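The non-blocking dispatch of claims 5 and 9 can be sketched with a worker thread standing in for the NPU and a callback standing in for the completion interrupt; the class and method names below are hypothetical, chosen only to mirror the claim language.

```python
import queue
import threading

class AsyncNPU:
    """Sketch of claim 5's asynchronous dispatch: infer_async() returns
    immediately, a background worker (standing in for the NPU) runs the
    job, and a callback (standing in for the interrupt handler of the
    asynchronous callback unit in claim 9) delivers the result."""

    def __init__(self):
        self._jobs = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            tensor, callback = self._jobs.get()
            result = sum(tensor)   # placeholder for the NPU's computation
            callback(result)       # "interrupt": notify the CPU side

    def infer_async(self, tensor, callback):
        self._jobs.put((tensor, callback))  # enqueue and return at once
```

The CPU thread is free between `infer_async()` and the callback, which is exactly the window claim 5 uses for "other control tasks or a suspended state".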
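The DVFS step of claim 6 reduces to a policy mapping (NPU load, SoC temperature) to an operating point. A toy decision function is sketched below; the frequency steps, load thresholds, and 80 °C limit are illustrative assumptions, not values taken from the patent.

```python
def dvfs_select(load_pct, temp_c,
                freqs=(300, 600, 900),  # candidate NPU frequencies, MHz
                temp_limit=80.0):       # illustrative thermal ceiling, °C
    """Toy DVFS policy for claims 6/9: throttle to the lowest frequency
    when the SoC is over its thermal limit, otherwise pick the lowest
    frequency whose headroom covers the current NPU load."""
    if temp_c >= temp_limit:
        return freqs[0]   # thermal protection dominates load
    if load_pct > 70:
        return freqs[-1]  # heavy load: full speed
    if load_pct > 30:
        return freqs[1]   # moderate load: mid step saves power
    return freqs[0]       # light load: lowest voltage/frequency pair
```

In the claimed system this decision would be made by the system tuning module and applied by the dynamic frequency scaling unit; lowering frequency also permits a lower voltage, which is where most of the power saving comes from.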
Description
Technical Field
The invention relates to the technical field of embedded artificial intelligence and edge computing, and in particular to a method and system for processing an edge computing video stream with a neural processing unit (NPU).
Background
With the deep integration of artificial intelligence and Internet of Things technology, intelligent video surveillance systems are increasingly widely applied in fields such as intelligent transportation, public safety, and industrial inspection. Such applications typically require computation-intensive tasks such as target detection and feature extraction to be performed in real time on massive video streams, and place stringent demands on the response speed, power consumption, and stability of the system. The traditional centralized cloud computing model struggles to meet real-time requirements because of limits on network bandwidth and transmission latency, so edge computing architectures that push computing tasks down to network edge devices (such as intelligent cameras and embedded industrial PCs) have become an important development direction for the industry. Currently, most mainstream embedded edge vision systems adopt a computing architecture centered on a general-purpose processor (CPU), sometimes assisted by a graphics processing unit (GPU) for acceleration.
However, when processing continuous high-resolution video streams and running modern deep neural network models such as YOLO, such general-purpose computing architectures expose the following inherent drawbacks, which become the main bottlenecks limiting system performance and reliability: (1) Architecture mismatch and resource preemption. The CPU, as a general-purpose processor, is architecturally suited to complex control logic and serial tasks, but is ill-suited to the highly parallel convolution operations of deep learning models. In practical deployments, the CPU often has to take on video stream acquisition, image preprocessing, neural network inference, and upper-layer application logic (e.g., alarms, data uploading) simultaneously. This deep coupling of control and compute tasks means that CPU resources are heavily occupied by high-load inference, causing delayed responses in upper-layer business logic, reduced overall system throughput, and even stuttering, so that the real-time requirements of highly dynamic scenes cannot be met. (2) In a conventional processing flow based on a general-purpose operating system such as Linux, video frame data usually has to be copied from the camera driver's kernel-space buffer into user-space application memory before the CPU can preprocess and post-process it. Such frequent memory copy operations not only consume a significant number of CPU cycles but, more severely, occupy limited external memory (DDR) bandwidth, introducing substantial end-to-end processing latency. The problem is especially pronounced for high-frame-rate video streams and becomes a major obstacle to deterministic low-latency processing.
(3) Thermal failure and system stability risks caused by high power consumption. The energy-efficiency ratio (TOPS/W) of a CPU performing floating-point-dense neural network inference is low. Under prolonged high-load operation the chip generates a large amount of heat, so the core temperature rises rapidly. When the temperature exceeds the hardware protection threshold (typically 75-85 °C), the system triggers a thermal throttling mechanism that forces the CPU operating frequency down to reduce heating, which directly causes a cliff-like drop in processing frame rate. In extreme cases, the system may restart or crash due to overheating, and continuous, stable 7×24-hour operation in unattended, enclosed industrial scenarios cannot be guaranteed. To address the above challenges, the prior art has proposed improvements. For example, CN116977826A, a reconfigurable neural network target detection system and method under an edge-side computing architecture, proposes an edge-side target detection system in which a reconfigurable neural network accelerator works cooperatively with a processor; CN113240101A, a heterogeneous SoC implementation method for software-hardware collaborative acceleration of convolutional neural networks, focuses on implementing convolutional neural network acceleration through software-hardware co-design. However, these schemes either fail to fully implement hardware-level decoupling of control and computation tasks, or do not systematically solve the problem of full-link zero copy from data