CN-116206241-B - Lightweight target identification method and system integrating position enhancement and self-adaptive labels

Abstract

The invention discloses a lightweight target recognition method integrating position enhancement and adaptive label assignment, belonging to the technical field of lightweight target recognition. Aimed at target features under a video-monitoring view angle, shallow features rich in position information and deep features rich in context information are fused to extract target positions via the geometric offset relation between the context information and the targets; the multi-scale anchors obtained by grouped clustering are then assigned to matching targets through adaptive labels, thereby realizing target recognition. The method suits bounding-box (bbox) localization on edge devices and the characteristics of small-target recognition, with reliable results and high processing efficiency.

Inventors

  • LI XUEMEI

Assignees

  • 广西泰绘信息科技有限公司 (Guangxi Taihui Information Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2023-02-17

Claims (7)

  1. A lightweight target recognition method integrating position enhancement and adaptive label assignment, characterized by comprising the following steps: comparing the width and height of each target's real frame with the width and height of the picture under the video-monitoring view angle, clustering grouped anchor frames according to the ratios, and assigning anchors of different scales to the targets; passing an image under the video-monitoring view angle through a backbone network composed of ShuffleNetV2 lightweight networks, the backbone network initially extracting feature maps of the targets in the image; identifying multi-scale targets from the target feature maps through the position-enhancement FPN, and performing feature fusion of shallow feature maps and deep feature maps to extract the target positions; combining the anchors of different scales obtained by grouped anchor-frame clustering, and assigning matched anchors and class labels to targets through adaptive labels; according to the anchors and class labels assigned to each target, comparing with the target real frames, and calculating the geometric offset and the classification loss value between each assigned anchor and its corresponding real frame; estimating all parameter values of the network model according to the classification loss value and the regression loss value in combination with the SGD optimization algorithm, until the computed loss value reaches its minimum, so as to complete target recognition training; then discarding label assignment and the SGD optimization algorithm, repeating the network model's computation flow, and applying the non-maximum suppression algorithm NMS so as to complete target recognition; wherein assigning matched anchors to targets through adaptive labels, in combination with the anchors of different scales obtained by grouped anchor-frame clustering, comprises the following steps: combining the anchors of different scales obtained by grouped anchor-frame clustering, and assigning at least one anchor of similar shape to each small-scale and medium/large-scale target through shape-similarity matching; if an anchor assigned by shape-similarity matching does not satisfy the IoU minimum-value match, the target that fails the IoU minimum-value match participates in label assignment at the next level, so that every target's real frame is guaranteed to be assigned an anchor; wherein assigning at least one anchor of similar shape to each small-scale and medium/large-scale target through shape-similarity matching comprises the following steps: calculating the mean and standard deviation of the width and height ratio values between each target's real frame and the different anchors at the same level, and taking the sum of the mean and the standard deviation as the upper boundary constraint value of that target's real frame with respect to the different anchors, so as to ensure that every target's real frame is assigned at least one anchor of similar shape; the upper boundary constraint value is: r_w^{ij} = w_i^{gt} / w_j^{a}, r_h^{ij} = h_i^{gt} / h_j^{a}, r^{ij} = max(r_w^{ij}, r_h^{ij}), m_i = (1/n) · Σ_{j=1}^{n} r^{ij}, v_i = sqrt( (1/n) · Σ_{j=1}^{n} (r^{ij} − m_i)² ), t_i = m_i + v_i, wherein h_i^{gt} represents the height of the i-th target real frame, h_j^{a} represents the height of the j-th anchor, w_i^{gt} represents the width of the i-th target real frame, w_j^{a} represents the width of the j-th anchor, r_w^{ij} represents the width ratio of the i-th real frame to the j-th anchor, r_h^{ij} represents the height ratio of the i-th real frame to the j-th anchor, n represents the number of anchors, r^{ij} represents the maximum of the width ratio and the height ratio of the i-th real frame to the j-th anchor, m_i represents the mean of the r^{ij} of the i-th real frame, v_i represents the standard deviation of the r^{ij} of the i-th real frame, and t_i represents the upper boundary constraint value of the i-th real frame.
  2. The method of claim 1, wherein comparing the width and height of each target's real frame with the width and height of the picture under the video-monitoring view angle, clustering grouped anchor frames according to the ratios, and assigning anchors of different scales to the targets comprises: obtaining the ratio of the width of each target's real frame to the width of the picture under the video-monitoring view angle, and the ratio of the height of the real frame to the height of the picture; clustering the real frames whose minimum of the width ratio and the height ratio is smaller than a threshold α into n classes, where n ≥ 1, and setting anchors of different scales for small-scale targets; and clustering the real frames whose minimum of the width ratio and the height ratio is greater than or equal to α into m classes, where m ≥ 1, and setting anchors of different scales for medium/large-scale targets.
  3. The method of claim 2, wherein identifying multi-scale targets through the position-enhancement FPN according to the target feature maps, and performing feature fusion of shallow feature maps and deep feature maps to extract the target positions, comprises: for a medium/large-scale target, first downsampling the feature map output by the first ShuffleV2 Block convolution block to obtain C1; then performing Concat feature fusion of C1 with the feature map output by the third ShuffleV2 Block convolution block to obtain C11; and extracting the position of the medium/large-scale target by convolving C11; for a small-scale target, first upsampling the result of convolving C11 to obtain C13; then downsampling the feature map output by the first ShuffleV2 Block convolution block to obtain C2, and performing Concat feature fusion of C2 with the feature map output by the second ShuffleV2 Block convolution block to obtain C21, C21 thereby fusing deep semantic information; convolving C21 to obtain C22; and finally performing Add feature fusion of C22 with C13 to extract the position of the small-scale target.
  4. The method of claim 2, wherein the IoU minimum-value match is: calculating the mean and standard deviation of the IoU between each target's real frame and the different anchors at the same level, and taking the sum of the mean and the standard deviation as the lower boundary constraint value of that real frame's IoU.
  5. The method of claim 4, wherein the lower boundary constraint value is: m_i^{IoU} = (1/n) · Σ_{j=1}^{n} IoU_{ij}, v_i^{IoU} = sqrt( (1/n) · Σ_{j=1}^{n} (IoU_{ij} − m_i^{IoU})² ), t_i^{IoU} = m_i^{IoU} + v_i^{IoU}, wherein IoU_{ij} represents the IoU of the i-th real frame and the j-th anchor, m_i^{IoU} represents the mean of all IoUs of the i-th real frame, v_i^{IoU} represents the standard deviation of all IoUs of the i-th real frame, n represents the number of anchors, and t_i^{IoU} represents the lower boundary constraint value of the i-th real frame.
  6. A lightweight target recognition system integrating position enhancement and adaptive label assignment, comprising: a grouped anchor-frame clustering module, used for comparing the width and height of each target's real frame with the width and height of the picture under the video-monitoring view angle, clustering grouped anchor frames according to the ratios, and assigning anchors of different scales to the targets; a backbone network construction module, used for initially extracting feature maps of the targets in an image under the video-monitoring view angle through a backbone network composed of ShuffleNetV2 lightweight networks; a position enhancement module, used for identifying multi-scale targets from the target feature maps through the position-enhancement FPN, and performing feature fusion of shallow feature maps and deep feature maps to extract the target positions; an adaptive label assignment module, used for combining the anchors of different scales obtained by grouped anchor-frame clustering and assigning matched anchors and class labels to targets through adaptive labels; a loss calculation module, used for comparing, according to the anchors and class labels assigned to each target, with the target real frames, and calculating the geometric offset and the classification loss value between each assigned anchor and its corresponding real frame; and a target recognition module, used for estimating all parameter values of the network model according to the classification loss value and the regression loss value in combination with the SGD optimization algorithm, until the computed loss value reaches its minimum, so as to complete target recognition training, then discarding label assignment and the SGD optimization algorithm, repeating the network model's computation flow, and applying the non-maximum suppression algorithm NMS so as to complete target recognition; wherein assigning matched anchors to targets through adaptive labels, in combination with the anchors of different scales obtained by grouped anchor-frame clustering, comprises the following steps: combining the anchors of different scales obtained by grouped anchor-frame clustering, and assigning at least one anchor of similar shape to each small-scale and medium/large-scale target through shape-similarity matching; if an anchor assigned by shape-similarity matching does not satisfy the IoU minimum-value match, the target that fails the IoU minimum-value match participates in label assignment at the next level, so that every target's real frame is guaranteed to be assigned an anchor; wherein assigning at least one anchor of similar shape through shape-similarity matching comprises: calculating the mean and standard deviation of the width and height ratio values between each target's real frame and the different anchors at the same level, and taking the sum of the mean and the standard deviation as the upper boundary constraint value of that target's real frame, so as to ensure that every target's real frame is assigned at least one anchor of similar shape; the upper boundary constraint value is: r_w^{ij} = w_i^{gt} / w_j^{a}, r_h^{ij} = h_i^{gt} / h_j^{a}, r^{ij} = max(r_w^{ij}, r_h^{ij}), m_i = (1/n) · Σ_{j=1}^{n} r^{ij}, v_i = sqrt( (1/n) · Σ_{j=1}^{n} (r^{ij} − m_i)² ), t_i = m_i + v_i, wherein h_i^{gt} represents the height of the i-th target real frame, h_j^{a} represents the height of the j-th anchor, w_i^{gt} represents the width of the i-th target real frame, w_j^{a} represents the width of the j-th anchor, r_w^{ij} represents the width ratio of the i-th real frame to the j-th anchor, r_h^{ij} represents the height ratio of the i-th real frame to the j-th anchor, n represents the number of anchors, r^{ij} represents the maximum of the width ratio and the height ratio of the i-th real frame to the j-th anchor, m_i represents the mean of the r^{ij} of the i-th real frame, v_i represents the standard deviation of the r^{ij} of the i-th real frame, and t_i represents the upper boundary constraint value of the i-th real frame.
  7. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
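As an illustration of the grouped anchor-frame clustering of claim 2, the sketch below splits real frames into small and medium/large groups by comparing each frame's width and height with the picture's. The threshold value of `alpha` and all function and variable names are assumptions of this sketch; the patent does not fix a value for α in this text.

```python
import numpy as np

def split_targets_by_scale(gt_wh, img_wh, alpha=0.04):
    """Split real frames into small vs. medium/large groups (claim 2 sketch).

    gt_wh:  (N, 2) array of real-frame (width, height) in pixels
    img_wh: (width, height) of the picture under the monitoring view angle
    alpha:  hypothetical threshold on the smaller of the two ratios
    """
    ratio_w = gt_wh[:, 0] / img_wh[0]         # width ratio to the picture
    ratio_h = gt_wh[:, 1] / img_wh[1]         # height ratio to the picture
    min_ratio = np.minimum(ratio_w, ratio_h)  # minimum of the two ratios
    small = gt_wh[min_ratio < alpha]          # to be clustered into n classes
    medium_large = gt_wh[min_ratio >= alpha]  # to be clustered into m classes
    return small, medium_large
```

Each group would then be clustered separately (for example, k-means on width and height) to yield the n small-scale and m medium/large-scale anchors.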
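The adaptive label assignment of claims 1, 4 and 5 uses the same per-target statistic twice: the sum of mean and standard deviation, once as an upper boundary on shape ratios and once as a lower boundary on IoU. The following is a minimal sketch of both statistics; the orientation of the ratios (real frame over anchor) and all names are assumptions, since the patent's formula images are not reproduced in this text.

```python
import numpy as np

def shape_ratio_upper_bound(gt_wh, anchor_wh):
    """t_i = m_i + v_i over the shape ratios r^{ij} (claim 1 sketch)."""
    r_w = gt_wh[:, None, 0] / anchor_wh[None, :, 0]  # width ratios r_w^{ij}
    r_h = gt_wh[:, None, 1] / anchor_wh[None, :, 1]  # height ratios r_h^{ij}
    r = np.maximum(r_w, r_h)                         # r^{ij} = max(r_w, r_h)
    return r.mean(axis=1) + r.std(axis=1)            # mean + std per real frame

def iou_lower_bound(ious):
    """t_i^{IoU} = m_i^{IoU} + v_i^{IoU} at one level (claims 4-5 sketch).

    ious: (num_real_frames, num_anchors) IoU matrix
    """
    return ious.mean(axis=1) + ious.std(axis=1)
```

Under this reading, anchors whose r^{ij} does not exceed the upper bound count as shape-similar, and targets whose matched anchors fall below the IoU lower bound move on to label assignment at the next level.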
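At inference, claims 1 and 6 discard label assignment and apply the standard non-maximum suppression algorithm. A plain NumPy version of textbook greedy NMS is sketched below; the IoU threshold of 0.5 is illustrative, not taken from the patent.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, suppress overlapping ones.

    boxes:  (N, 4) array of (x1, y1, x2, y2)
    scores: (N,) confidence scores
    """
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the kept box with each remaining box
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)  # IoU against the kept box
        order = order[1:][iou <= iou_thresh]     # drop high-overlap boxes
    return keep
```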

Description

Lightweight target identification method and system integrating position enhancement and self-adaptive labels

Technical Field

The invention belongs to the technical field of lightweight target recognition, and particularly relates to a lightweight target recognition method and system integrating position enhancement and adaptive label assignment.

Background

Currently, industrial applications of target recognition commonly acquire video in real time through online cameras and upload it to a cloud or local server where a deep-learning recognition model is deployed. With the development of edge computing in recent years, the model's computing tasks have moved toward the edge of the smart device: the recognition model is deployed directly on the edge device, whose own computing resources bear part of the computation, thereby improving the real-time performance of recognition and relieving pressure on the central server.

A lightweight target detection model must therefore be built into the edge device. Current methods for building lightweight models fall into manual design and model compression. Manually designed lightweight models, such as the MobileNet, ShuffleNet, and GhostNet series, pursue efficient feature extraction under constraints of computing power and model size. Model compression reduces the computation of an original model without degrading recognition accuracy, for example by pruning, knowledge distillation, or quantization. Most existing lightweight target recognition models perform well on open-source data sets but poorly on data captured under a video-monitoring view angle; that is, their generalization ability is poor. Compared with open-source reference data sets, data acquired under a video-monitoring view angle features widely varying target scales and a large number of low-resolution targets. This causes current lightweight target recognition models to suffer from inaccurate bbox localization, low recognition rates, and unreasonable selection of positive training samples. A lightweight target recognition method aimed at the video-monitoring view angle and small-target characteristics is therefore needed.

Disclosure of Invention

Aiming at the defects or improvement demands of the prior art, the invention provides, for edge devices with limited computing and storage resources and in combination with small-scale target characteristics, a lightweight target recognition method and system integrating position enhancement and adaptive label assignment. To achieve the above object, the invention provides a lightweight target recognition method integrating position enhancement and adaptive label assignment, including: comparing the width and height of each target's real frame with the width and height of the picture under the video-monitoring view angle, clustering grouped anchor frames according to the ratios, and assigning anchors of different scales to the targets; passing an image under the video-monitoring view angle through a backbone network composed of ShuffleNetV2 lightweight networks, the backbone network initially extracting feature maps of the targets in the image; identifying multi-scale targets from the target feature maps through the position-enhancement FPN, and performing feature fusion of shallow feature maps and deep feature maps to extract the target positions; combining the anchors of different scales obtained by grouped anchor-frame clustering, and assigning matched anchors and class labels to targets through adaptive labels; according to the anchors and class labels assigned to each target, comparing with the target real frames, and calculating the geometric offset and the classification loss value between each assigned anchor and its corresponding real frame; and, according to the classification loss value and the regression loss value, estimating all parameter values of the network model in combination with the SGD optimization algorithm, until the computed loss value reaches its minimum, so as to complete target recognition training, then discarding label assignment and the SGD optimization algorithm, repeating the network model's computation flow, and applying the non-maximum suppression algorithm NMS so as to complete target recognition. In some optional embodiments, clustering the grouped anchor frames according to the ratio of the width and height of the target's real frame to the width and height of the picture under the video-monitoring view angle, and allocating anchors of different scales to the t