CN-122023776-A - Multi-target detection method based on task driving
Abstract
The invention discloses a task-driven multi-target detection method comprising the following steps. S1, constructing and preprocessing a data set: image data of the region to be detected and monitored are collected according to the target detection task, the obtained images are annotated, the data set is assembled, and the data set is preprocessed. S2, constructing and training a multi-target detection model: a multi-target detection network is built and trained on the data set. S3, detecting and verifying: the multi-target detection model is deployed locally, multiple real-time video streams are fed into the model for multi-target detection, the presence of targets in the monitored region is judged, and the multi-stream push results are compared with the round-robin algorithm's push results to output the final result. The method combines a lightweight UIB module with a round-robin algorithm to save bandwidth, and uses the PIoUv2 loss function to improve detection precision, accuracy, real-time performance, and adaptability under limited hardware and limited bandwidth.
Inventors
- CHEN DALONG
- ZHANG DONGDONG
- Fu Yunkai
- ZHOU JIE
- Zhu Ronghe
- MENG WEI
Assignees
- Nanjing Huasu Technology Co., Ltd. (南京华苏科技有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-03
Claims (10)
- 1. A multi-target detection method based on task driving, characterized by comprising the following steps: S1, constructing and preprocessing a data set: acquiring image data of the region to be detected and monitored according to the target detection task, annotating the acquired images with maximum bounding rectangles, and dividing the annotated data set into a training set and a test set; S2, constructing and training a multi-target detection model: building the multi-target detection network YOLOv n_UIB_PIoUv2 and training it on the data set to obtain the multi-target detection model; and S3, detecting and verifying: deploying the multi-target detection model locally, feeding multiple real-time video streams into the model for multi-target detection, judging whether targets are present in the monitored region, and comparing the multi-stream push results with the round-robin algorithm's push results to output the final result.
- 2. The task-driven multi-target detection method according to claim 1, wherein in step S2 a YOLOv target detection model is selected as the detection base; a lightweight UIB module and a round-robin algorithm are combined to save bandwidth, and the PIoUv2 loss function is combined to improve detection accuracy; the YOLOv n_UIB_PIoUv2 network is built and trained to obtain the YOLOv n_UIB_PIoUv2 multi-target detection model; the network introduces a C2f_UIB module in place of the C2f module, and the C2f_UIB module uses an inverted bottleneck module with an additional depthwise convolution for feature extraction, ensuring computational efficiency while minimizing the parameter count.
- 3. The task-driven multi-target detection method according to claim 2, wherein in step S2 the YOLOv n_UIB_PIoUv2 network uses the PIoUv2 loss function in the prediction output instead of the original CIoU loss function; the PIoU loss function integrates an adaptive penalty factor that scales with the target size and enhances gradients according to anchor-box quality, and Powerful-IoUv2 introduces a non-monotonic priority mechanism that strengthens the emphasis on anchor boxes of medium quality (IoU of 0.3-0.7).
- 4. The task-driven multi-target detection method according to claim 3, wherein the PIoU loss function is formulated as: P = ((dw1 + dw2)/w_gt + (dh1 + dh2)/h_gt)/4, where dw1, dw2, dh1, dh2 are the distances between the corresponding edges of the predicted box and the target box, and w_gt, h_gt are the width and height of the target box; f(x) = 1 - e^(-x^2); L_PIoU = L_IoU + f(P) = 1 - IoU + 1 - e^(-P^2); q = e^(-P); u(x) = 3x·e^(-x^2); L_PIoUv2 = u(λq)·L_PIoU; wherein f(P) is the function applied to the adaptive penalty factor, q is the anchor-box quality factor, u(λq) is the non-monotonic priority mechanism, L_IoU is the IoU loss function, L_PIoU is the PIoU-based loss function, L_PIoUv2 is the improved PIoUv2 loss function, P is the adaptive penalty factor, λ is the priority adjustment parameter, and u is the non-monotonic priority weighting function.
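The claim-4 formulation can be checked numerically. The sketch below implements the PIoUv2 loss in Python for axis-aligned boxes in (x1, y1, x2, y2) form; the function name and the default value of the priority adjustment parameter `lam` (λ) are illustrative assumptions, not part of the patent.

```python
import math

def piou_v2_loss(pred, target, lam=1.3):
    """Sketch of the PIoUv2 loss for axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection-over-union of the two boxes.
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    iou = inter / (area_p + area_t - inter)

    # Adaptive penalty factor P: edge distances normalised by the target
    # box's own width/height, so the penalty adapts to the target size.
    w_gt = target[2] - target[0]
    h_gt = target[3] - target[1]
    dw1, dw2 = abs(pred[0] - target[0]), abs(pred[2] - target[2])
    dh1, dh2 = abs(pred[1] - target[1]), abs(pred[3] - target[3])
    P = ((dw1 + dw2) / w_gt + (dh1 + dh2) / h_gt) / 4.0

    # L_PIoU = L_IoU + f(P), with f(x) = 1 - exp(-x^2).
    l_piou = (1.0 - iou) + (1.0 - math.exp(-P * P))

    # Non-monotonic priority: q = exp(-P) is the anchor quality and
    # u(x) = 3x * exp(-x^2) peaks at medium quality.
    q = math.exp(-P)
    u = 3.0 * (lam * q) * math.exp(-((lam * q) ** 2))
    return u * l_piou

# A perfectly aligned prediction has P = 0 and IoU = 1, so the loss is 0.
print(piou_v2_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # → 0.0
```

Because u(λq) is non-monotonic, a mildly misaligned box (medium q) receives a larger weight than either a perfect or a very poor one, which is the "medium-quality emphasis" described in claim 3.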
- 5. The task-driven multi-target detection method according to claim 3, wherein in step S2 a round-robin algorithm is introduced into the YOLOv n_UIB_PIoUv2 multi-target detection model, and a periodic round-robin grouping mechanism makes the edge device execute the target detection task on only part of the video streams in each time slot; specifically, for N video streams accessed in real time, the algorithm divides them into M detection groups, and the edge device performs target detection on each group in turn according to a preset period, thereby time-division multiplexing the edge computing resources and the transmission bandwidth.
- 6. The task-driven multi-target detection method according to claim 5, wherein the specific steps of the round-robin algorithm are as follows: (1) dividing the N video streams accessed in real time into M detection groups, denoted G1~GM, initializing a counter i=1, and setting the period to M; (2) extracting the current frame data of group Gi; (3) calling the multi-target detection model, namely feeding the current frame data of video-stream group Gi into the target detection model; (4) temporarily storing or uploading the detection result output by the target detection model in step (3); (5) round-robin switching, namely releasing the resources and setting i=i+1, and if i > M, setting i=1; (6) repeating the cycle, namely returning to step (2) to detect the current frame again after the period M completes.
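The six steps above can be sketched as a small scheduler. The function names and the placeholder `detect` stub below are illustrative assumptions, not part of the claim; the grouping and wrap-around counter follow steps (1) and (5).

```python
import itertools

def round_robin_groups(streams, m):
    """Split N stream ids into M detection groups G1..GM (step 1)."""
    return [streams[i::m] for i in range(m)]

def detect(frames):
    # Stand-in for the multi-target detection model call (step 3);
    # here it just reports which streams were processed in this slot.
    return sorted(frames)

def run_polling(streams, m, slots):
    """Run `slots` time slots of the round-robin cycle (steps 2-6)."""
    groups = round_robin_groups(streams, m)
    results = []
    cycle = itertools.cycle(range(m))      # i = i + 1, wrapping after M
    for _ in range(slots):
        i = next(cycle)
        results.append(detect(groups[i]))  # only group Gi is detected
    return results

# 8 streams in 4 groups: one full cycle of 4 slots covers every stream once.
print(run_polling(list(range(8)), 4, 4))
```

Only one group occupies the detector and the uplink per slot, which is the time-division multiplexing effect claimed in claim 5.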
- 7. The task-driven multi-target detection method according to claim 5, wherein when training the YOLOv n_UIB_PIoUv2 model in step S2, the loss function is PIoUv2, the learning rate of the AdamW optimizer is 0.01, the momentum is set to 0.937, the weight decay is 0.0005, the number of training epochs is 300, the batch size is set to 8, input images are resized to 640×640, all images in the training set are traversed, and model training is completed to obtain the multi-target detection model.
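For reference, the hyperparameters of claim 7 can be collected into a single configuration mapping. The key names below follow a common YOLO-training convention and are an assumption for illustration, not part of the patent.

```python
# Claim-7 hyperparameters as a training-configuration sketch
# (key names are assumed, the values are from the claim).
train_cfg = {
    "optimizer": "AdamW",
    "lr0": 0.01,            # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "epochs": 300,          # "training wheel number" in the translation
    "batch": 8,
    "imgsz": 640,           # inputs resized to 640x640
}
print(train_cfg)
```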
- 8. The task-driven multi-target detection method according to claim 7, wherein in step S2 the C2f_UIB module implements a three-step expansion-convolution-reduction strategy with an inverted residual framework using an inverted bottleneck module, wherein an initial dimension expansion increases the feature representation capability, a depthwise separable convolution then extracts features, and finally a linearly activated dimension reduction maintains information fidelity.
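A rough parameter count shows why the expansion-depthwise-reduction pattern of claim 8 keeps parameters low relative to a standard convolution; the expansion ratio used here is an assumed example value, not specified in the patent.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def inverted_bottleneck_params(c, expand=2, k=3):
    """Parameters of the expand -> depthwise -> reduce pattern of claim 8.

    1x1 expansion, k x k depthwise convolution (one filter per channel),
    then a 1x1 linear projection back to c channels.
    """
    hidden = c * expand
    return (c * hidden          # 1x1 expansion
            + hidden * k * k    # depthwise k x k
            + hidden * c)       # 1x1 linear reduction

c = 64
print(conv_params(c, c, 3))            # 64*64*9          = 36864
print(inverted_bottleneck_params(c))   # 8192+1152+8192   = 17536
```

The depthwise stage costs only `hidden * k * k` parameters instead of `hidden * hidden * k * k`, which is where the "minimize parameters while keeping computation efficient" property of the C2f_UIB module comes from.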
- 9. The task-driven multi-target detection method according to claim 5, wherein step S2 further comprises constructing one or more of a Faster R-CNN multi-target detection model, an SSD multi-target detection model, a RetinaNet multi-target detection model, a YOLOv s multi-target detection model, and a YOLOv n multi-target detection model.
- 10. The task-driven multi-target detection method according to claim 5, wherein the target detection model in step S2 is a Faster R-CNN model comprising a feature extraction network, a region proposal network, a ROI Pooling layer, and a classification and regression network; the model's input end receives an image of arbitrary size, which is standardized and fed into the network; the feature extraction network uses a pre-trained CNN to generate an image feature map through convolution and pooling operations; the region proposal network takes the feature map as input and, by means of preset anchors, outputs in parallel the foreground/background judgment of candidate regions and the bounding-box offsets through a convolutional layer, realizing end-to-end candidate-region generation; the ROI Pooling layer maps the candidate regions generated by the RPN onto the feature map and pools them into fixed-size features; and the classification and regression network further processes the fixed-size features, outputs target class probabilities and bounding-box refinement parameters, and adopts a cross-entropy loss and a smooth L1 loss to form a multi-task loss function.

  Alternatively, the target detection model in step S2 is an SSD model comprising a base network, multi-scale feature layers, and detection heads; the model's input end receives a fixed-size image, which is preprocessed and fed into the network; the base network adopts a VGG CNN to extract features from the bottom to the middle layers of the image, and the multi-scale feature layers apply convolution operations to the base network's output features to generate feature maps of different scales, adapting to targets of different sizes.

  Alternatively, the target detection model in step S2 is RetinaNet, comprising a backbone network, a feature pyramid network, a classification sub-network, and a regression sub-network; the model's input end receives images and performs standardized preprocessing; the backbone network uses a ResNet to extract basic features; the feature pyramid network builds a multi-scale feature hierarchy and fuses high- and low-level features, enhancing the representation of targets at different scales; and the classification sub-network processes the features of different scales through several convolutional layers and outputs target class scores.

  Alternatively, the target detection model in step S2 is a YOLOv s model whose network comprises four parts, an input end, a backbone network, a neck network, and a prediction output end, balancing detection precision and efficiency. The input end inherits and optimizes Mosaic data enhancement to expand data-set diversity, adds adaptive anchor-box calculation that automatically obtains optimal anchor parameters through iterative training, and adopts adaptive image scaling that adds minimal black borders to reduce computation and improve inference speed. The backbone network adds a Focus structure that slices the image and outputs a highly concentrated feature map through convolution, providing high-quality features for subsequent detection. The neck network adopts a fine-tuned FPN+PAN structure, where the top-down FPN fuses features to improve detection capability and the bottom-up PAN propagates localization information to increase localization precision; the prediction output end uses CIoU_Loss for the regression task and BCE_Loss for the classification task, the double loss guaranteeing detection precision and stability.

  The YOLOv model network used in step S2 takes an input end, a backbone network, a neck network, and a prediction output end as its core; the backbone network adopts the C2f module, the neck network adopts the FPN+PAN structure updated to a C2f+SPPF combination, the prediction output end adopts DIoU_Loss/CIoU_Loss for the regression task to further optimize bounding-box regression accuracy, and the classification task combines BCE_Loss and Focal_Loss to alleviate class imbalance, the double loss jointly guaranteeing detection stability.
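The multi-task loss mentioned for the Faster R-CNN variant in claim 10 (cross-entropy for classification plus smooth L1 for box regression) can be sketched as follows; the function names and the weighting parameter `lam` are illustrative assumptions.

```python
import math

def smooth_l1(x, beta=1.0):
    """Smooth L1 on one regression residual: quadratic near 0, linear beyond beta."""
    ax = abs(x)
    return 0.5 * x * x / beta if ax < beta else ax - 0.5 * beta

def cross_entropy(probs, label):
    """Cross-entropy for one RoI given softmax class probabilities."""
    return -math.log(probs[label])

def multitask_loss(probs, label, box_residuals, lam=1.0):
    """Classification loss + lam * box-regression loss, as in claim 10.

    `lam` balances the two terms; its value is an assumed example.
    """
    reg = sum(smooth_l1(r) for r in box_residuals)
    return cross_entropy(probs, label) + lam * reg

# Small residual -> quadratic branch; large residual -> linear branch.
print(smooth_l1(0.5))  # → 0.125
print(smooth_l1(2.0))  # → 1.5
```

The quadratic branch keeps gradients small for nearly correct boxes, while the linear branch caps the gradient for outliers, which is why smooth L1 is the standard choice for the bounding-box refinement term.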
Description
Multi-target detection method based on task driving

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a task-driven multi-target detection method.

Background

With the rapid development of computer vision technology, multi-target detection, as one of its core technologies, has extremely high application value in the field of real-time supervision (for example, intelligent traffic control, industrial production safety monitoring, and public-area security monitoring). In intelligent traffic scenes, targets such as vehicles, pedestrians, and traffic signs must be detected in real time and their accurate position information output, providing data support for violation identification and traffic scheduling. In industrial production safety monitoring, abnormal equipment states, personnel violations, and dangerous goods must be identified rapidly to ensure the continuity and safety of the production process. In public-area security monitoring, suspicious persons and the goods they carry must be tracked and detected in real time in dense crowds, assisting the early warning and rapid disposal of security events. The development of multi-target detection technology has always revolved around precision improvement and efficiency optimization, and has undergone a revolutionary evolution from traditional machine learning to deep learning; yet under the complex requirements of real-time supervision scenes, the core technical bottlenecks have not been broken through. Before the rise of deep learning, multi-target detection was mainly built on traditional machine learning, whose core logic is the combination of manual feature extraction and classifier judgment.
Algorithms at this stage rely on expert experience to design feature-extraction rules, such as edge detection with the Sobel operator and the Canny algorithm, or capturing target texture and shape with manual feature descriptors such as HOG and SIFT, which are then combined with classifiers such as the Support Vector Machine (SVM) and random forest to distinguish targets from the background. For example, in early traffic monitoring, vehicle contours were often extracted with HOG features and vehicle detection was achieved with an SVM classifier. The inherent limitations of this approach are that manual feature design depends heavily on the specific scene and generalizes extremely poorly: features designed for daytime illumination are almost useless in a low-light night environment; the complexity of feature engineering makes it difficult for an algorithm to detect multiple targets simultaneously; and adaptability to occlusion, scale change, and similar conditions is insufficient, so the detection requirements of multiple targets and complex environments in real-time supervision cannot be fully met. The rise of deep learning, especially the application of the Convolutional Neural Network (CNN), brought a qualitative leap to multi-target detection and realized the key transition from manual features to automatic features. Through stacked multi-layer convolution and pooling, a CNN automatically learns hierarchical feature representations, from bottom-level edge and texture features to high-level semantic features, significantly improving detection accuracy and generalization capability.
The evolution of algorithms at this stage formed two main directions: the two-stage detection framework represented by Faster R-CNN, and the one-stage detection framework represented by YOLO and SSD. The appearance of Faster R-CNN marked the mature application of deep learning in target detection: by introducing the Region Proposal Network (RPN) to generate candidate regions automatically, it replaced the inefficient selective search of earlier methods and realized end-to-end optimization of candidate-region generation, feature extraction, and classification regression, while the RoI Pooling layer inherited from Fast R-CNN unified the feature-extraction scale and further improved detection efficiency, bringing algorithm precision to a new height. However, the inherent complexity of the two-stage architecture limits inference speed: the complex region generation and feature screening make it difficult to meet the millisecond response requirements of real-time supervision scenes; for example, in highway-section vehicle detection, the processing frame rate is well below the real-time standard of 25 frames per second. To solve the speed problem, one-stage algorithms such as the YOLO series are ge