CN-122023898-A - Object identification method and device based on target detection and twin network
Abstract
The application provides an object identification method and device based on target detection and a twin network. In an offline stage, a feature extraction module extracts features from object images of different object categories in an object library to obtain fixed-dimension feature vectors for the corresponding images, and these vectors are stored in an object feature database. In an online stage, the current input image is acquired from a camera video stream, a target detection model locates the objects in the input image to obtain a positioning frame for each object, a local image of each object is cropped out based on its positioning frame, and the feature extraction module extracts a feature vector from each local image. Similarity matching between the extracted feature vectors and the fixed-dimension feature vectors stored in the object feature database then yields the object category matched to each local image. The method achieves object identification with high precision, high extensibility, and low maintenance cost.
Inventors
- YANG MENG
- ZHANG JIANNAN
Assignees
- China Mobile Group Liaoning Co., Ltd. (中国移动通信集团辽宁有限公司)
- China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-27
Claims (10)
- 1. An object recognition method based on target detection and a twin network, the method comprising: in an offline stage, extracting features from object images of different object categories in an object library by using a feature extraction module to obtain fixed-dimension feature vectors corresponding to the respective object images, and storing the fixed-dimension feature vectors in an object feature database; in an online stage, acquiring a current input image from a camera video stream, and locating objects in the input image by using a target detection model to obtain a positioning frame for each object; cropping out a local image of each object based on its positioning frame, and extracting a feature vector from each local image by using the feature extraction module; and performing similarity matching between the extracted feature vectors and the fixed-dimension feature vectors stored in the object feature database to obtain the object category matched to each local image.
- 2. The method of claim 1, wherein the target detection model is a YOLOv model that adopts CSPDarknet53 as the backbone network, outputs multi-scale feature maps, performs feature fusion through a path aggregation network, and outputs positioning frames; the loss function for training the YOLOv model is: L = λ1·Lcls + λ2·Lobj + λ3·Lloc, where Lcls is the classification loss, Lobj is the objectness loss, Lloc is the localization loss, and λ1, λ2, λ3 are loss weight coefficients.
- 3. The method of claim 2, wherein the objectness loss is computed as: Lobj = Lobj_s + Lobj_m + Lobj_l, where Lobj_s is the objectness loss corresponding to small targets, Lobj_m is the objectness loss corresponding to medium targets, and Lobj_l is the objectness loss corresponding to large targets.
- 4. The method of claim 1, wherein the feature extraction module is implemented based on an improved twin network; the improved twin network has a two-branch symmetric structure comprising an input layer, a feature extraction layer, a feature normalization layer, and a loss calculation layer; the input to the input layer is a triplet of images comprising an anchor image, a positive image, and a negative image, where the anchor image is an object image from the object database, the positive image is a real-shot image of the same object category as the anchor image, and the negative image is a real-shot image of a different object category from the anchor image.
- 5. The method of claim 4, wherein the feature extraction layer comprises a ResNet backbone network, a channel attention module, and a GeM pooling layer connected in sequence, the channel attention module being arranged at the output of each residual block of the ResNet backbone network; the channel attention module executes the following steps: (1) perform global average pooling on the feature map F output by the residual block to obtain a channel feature vector x: x_c = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} F_c(i, j), where x_c is the global pooling result for the c-th channel, F_c(i, j) is the pixel value of the feature map at position (i, j) of the c-th channel, H is the height of the feature map, and W is its width; (2) adaptively select the 1D convolution kernel size k according to the channel number C: k = |log2(C)/γ + b/γ|_odd, where γ and b are preset hyperparameters and |·|_odd is a rounding function that ensures k is an odd number; (3) apply a 1D convolution to the channel feature vector x and generate the attention weights w through a Sigmoid function: w = σ(Conv1D_k(x)), where σ is the Sigmoid function and Conv1D_k denotes a 1D convolution with kernel size k; (4) weight the feature map F output by the residual block channel by channel with the attention weights w to obtain the enhanced feature map F′: F′ = w ⊗ F.
- 6. The method according to claim 3, wherein after feature extraction the twin network compresses the feature map into a vector of fixed dimension using the GeM pooling layer, and performs an L2 normalization operation on the resulting fixed-dimension vector so that its modulus is 1.
- 7. The method of claim 1, wherein the object feature database is implemented using a vector search database with an IVF_FLAT index, and the similarity measure employs the L2 Euclidean distance; performing similarity matching between the extracted feature vector and the fixed-dimension feature vectors stored in the object feature database comprises: calculating the L2 distance between the extracted feature vector and each fixed-dimension feature vector stored in the object feature database, and taking the Top-1 vector with the minimum L2 distance as the matching result; and if the L2 distance is smaller than a first preset threshold, determining that the object category is the object category corresponding to the matching vector.
- 8. An object recognition device based on target detection and a twin network, the device comprising: an extraction unit, configured to extract features from object images of different object categories in an object library by using a feature extraction module in an offline stage, obtaining fixed-dimension feature vectors corresponding to the respective object images; a storage unit, configured to store the fixed-dimension feature vectors in an object feature database; an acquisition unit, configured to acquire a current input image from a camera video stream in an online stage; a positioning unit, configured to locate objects in the input image by using a target detection model to obtain a positioning frame for each object; the extraction unit being further configured to crop out a local image of each object based on its positioning frame and to extract a feature vector from each local image by using the feature extraction module; and a matching unit, configured to perform similarity matching between the extracted feature vectors and the fixed-dimension feature vectors stored in the object feature database to obtain the object category matched to each local image.
- 9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method of any one of claims 1-7 when executing the program stored in the memory.
- 10. A computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
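The offline/online flow of claim 1 can be sketched end to end. The snippet below is a minimal illustration, not the claimed implementation: `extract_features` is a toy stand-in for the twin-network feature extractor, a plain dict stands in for the vector database, and the images, names, and threshold are invented for the example.

```python
import numpy as np

def extract_features(image):
    """Toy stand-in for the twin-network feature extraction module:
    maps an image to a fixed-dimension, L2-normalized vector."""
    v = np.resize(image.astype(np.float64).ravel(), 8)  # fixed 8-dim "feature"
    return v / (np.linalg.norm(v) + 1e-12)

# Offline stage: extract a feature vector for every library image and store it.
library = {"cola": np.full((4, 4), 3.0), "chips": np.eye(4) * 5.0}
feature_db = {name: extract_features(img) for name, img in library.items()}

# Online stage: a cropped object image is matched against the database.
def identify(crop, db, threshold=0.5):
    q = extract_features(crop)
    # Top-1 by minimum L2 distance, as in the claimed matching step.
    name, dist = min(((n, np.linalg.norm(q - v)) for n, v in db.items()),
                     key=lambda t: t[1])
    return name if dist < threshold else None

print(identify(np.full((4, 4), 3.0), feature_db))   # prints: cola
```

Because classification happens by database lookup rather than inside the detector, adding a new object category only requires inserting its feature vector offline, which is the low-maintenance property the application claims.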
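The composite training loss of claim 2 and the per-scale objectness loss of claim 3 reduce to simple weighted sums. A sketch, with all loss values and weight coefficients invented for illustration:

```python
def total_loss(l_cls, l_obj, l_loc, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum L = λ1·Lcls + λ2·Lobj + λ3·Lloc as in claim 2."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_cls + lam2 * l_obj + lam3 * l_loc

def objectness_loss(l_small, l_medium, l_large):
    """Objectness losses summed over the three detection heads (claim 3):
    one head each for small, medium, and large targets."""
    return l_small + l_medium + l_large

l_obj = objectness_loss(0.2, 0.1, 0.05)
print(total_loss(l_cls=0.4, l_obj=l_obj, l_loc=0.3, lambdas=(0.5, 1.0, 0.05)))
```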
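Claim 4 feeds the twin network anchor/positive/negative triplets. The loss calculation layer is not spelled out in the claims, but a standard triplet margin loss over such inputs might look like the following; the margin value and the 2-dimensional embeddings are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss: pull the positive toward the anchor,
    push the negative at least `margin` further away."""
    d_ap = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

a = np.array([1.0, 0.0])   # anchor: library object image
p = np.array([0.9, 0.1])   # positive: same-category real-shot image
n = np.array([0.0, 1.0])   # negative: different-category real-shot image
print(triplet_loss(a, p, n))   # well-separated triplet -> 0.0
```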
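Steps (1)-(4) of the channel attention module in claim 5 can be sketched with NumPy. The 1D convolution kernel here is a fixed averaging kernel purely for illustration (in the network it would be learned), and γ = 2, b = 1 are assumed hyperparameter values.

```python
import math
import numpy as np

def eca_attention(F, gamma=2, b=1):
    """Channel attention over a feature map F of shape (C, H, W),
    following steps (1)-(4) of claim 5."""
    C, H, W = F.shape
    # (1) global average pooling per channel -> channel feature vector x
    x = F.mean(axis=(1, 2))
    # (2) adaptive 1D kernel size from the channel count C, forced odd
    k = int(abs(math.log2(C) / gamma + b / gamma))
    k = k if k % 2 == 1 else k + 1
    # (3) 1D convolution across channels, then Sigmoid -> attention weights w
    xp = np.pad(x, k // 2, mode="edge")       # pad so output length stays C
    kernel = np.full(k, 1.0 / k)              # fixed averaging kernel (learned in practice)
    w = 1.0 / (1.0 + np.exp(-np.convolve(xp, kernel, mode="valid")))
    # (4) channel-wise reweighting of the feature map
    return F * w[:, None, None]

out = eca_attention(np.ones((16, 4, 4)))
print(out.shape)   # (16, 4, 4)
```

For C = 16 channels the adaptive rule gives k = 3, so each channel weight is influenced only by its immediate neighbors, which is the local cross-channel interaction this style of attention is designed for.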
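GeM pooling followed by L2 normalization, as in claim 6, compresses a (C, H, W) feature map into a unit-length C-dimensional descriptor. A minimal sketch; the pooling exponent p = 3 is an assumption, since the claims do not fix it.

```python
import numpy as np

def gem_pool(F, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling of F (C, H, W) to one value per channel:
    p = 1 recovers average pooling; large p approaches max pooling."""
    return (np.clip(F, eps, None) ** p).mean(axis=(1, 2)) ** (1.0 / p)

def l2_normalize(v, eps=1e-12):
    """Scale the descriptor to unit L2 norm (modulus 1), as claim 6 requires,
    so that L2 distances between descriptors are directly comparable."""
    return v / (np.linalg.norm(v) + eps)

F = np.random.default_rng(0).random((8, 7, 7))   # dummy feature map
desc = l2_normalize(gem_pool(F))
print(desc.shape, float(np.linalg.norm(desc)))   # (8,) and a norm of ~1.0
```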
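The Top-1 L2 matching with a preset threshold from claim 7 corresponds to an exact (FLAT) search over the stored vectors; an IVF index would merely restrict the search to a few coarse clusters. A brute-force sketch with invented database entries and threshold:

```python
import numpy as np

def match_top1(query, db_vectors, db_labels, threshold):
    """Exact L2 search: return the Top-1 label and its distance when the
    distance is below the preset threshold, otherwise (None, distance)."""
    d = np.linalg.norm(db_vectors - query, axis=1)   # L2 distance to every entry
    i = int(np.argmin(d))
    label = db_labels[i] if d[i] < threshold else None
    return label, float(d[i])

db = np.array([[1.0, 0.0], [0.0, 1.0]])      # stored fixed-dimension vectors
labels = ["bottle", "box"]                   # their object categories
print(match_top1(np.array([0.9, 0.1]), db, labels, threshold=0.5))
```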
Description
Object identification method and device based on target detection and twin network

Technical Field

The application relates to the technical field of object recognition, and in particular to an object recognition method and device based on target detection and a twin network.

Background

Object recognition is a key technology in the modern science and technology field. Through deep learning and traditional machine learning algorithms, it enables a computer to recognize and classify different objects from images or videos, and it is widely applied in fields such as automatic driving, security monitoring, medical image analysis, retail, augmented reality, robotics, agriculture, and manufacturing. Existing object recognition mainly takes two forms. In traditional machine-learning-based methods, key features of object shape, texture, and color are manually extracted using descriptors such as SIFT, SURF, and HOG, and classifiers such as support vector machines, random forests, and k-nearest neighbors are then trained to recognize object categories. In deep-learning methods based on target detection, such as the YOLO series of algorithms and Fast R-CNN, taking YOLOv as an example, a large amount of object image data covering different brands, angles, and illumination conditions must first be collected; after the positions and categories of the objects are annotated with labeling tools, the data is fed into model training, and detection performance is improved by adjusting parameters such as the learning rate and batch size.
Both approaches have drawbacks. Traditional machine-learning-based object recognition relies on manual feature extraction, which requires researchers to have deep domain knowledge and experience; it struggles to capture the complex abstract features of images, its performance is limited on diverse, high-dimensional data, and its generalization ability is weak, with recognition accuracy dropping markedly on new object categories. At the same time, tuning and optimizing the model consumes a great deal of time and resources, overfitting occurs easily when data is scarce, and efficiency and scalability on large-scale datasets are also challenged. Object recognition schemes based on target detection must re-collect data and retrain the target detection model whenever new object categories are added, resulting in high maintenance cost and poor extensibility and maintainability.

Disclosure of Invention

The application provides an object recognition method and device based on target detection and a twin network, which decouple object recognition into two independent subtasks, positioning and classification, and combine improved target detection with a twin network to achieve object recognition with high precision, high extensibility, and low maintenance cost.
In a first aspect, there is provided an object recognition method based on target detection and a twin network, and the method may comprise: in an offline stage, extracting features from object images of different object categories in an object library by using a feature extraction module to obtain fixed-dimension feature vectors corresponding to the respective object images, and storing the fixed-dimension feature vectors in an object feature database; in an online stage, acquiring a current input image from a camera video stream, and locating objects in the input image by using a target detection model to obtain a positioning frame for each object; cropping out a local image of each object based on its positioning frame, and extracting a feature vector from each local image by using the feature extraction module; and performing similarity matching between the extracted feature vectors and the fixed-dimension feature vectors stored in the object feature database to obtain the object category matched to each local image.

In one possible implementation, the target detection model is a YOLOv model that adopts CSPDarknet53 as the backbone network, outputs multi-scale feature maps, performs feature fusion through a path aggregation network, and outputs positioning frames; the loss function for training the YOLOv model is: L = λ1·Lcls + λ2·Lobj + λ3·Lloc, where Lcls is the classification loss, Lobj is the objectness loss, Lloc is the localization loss, and λ1, λ2, λ3 are loss weight coefficients.

In one possible implementation, the objectness loss is computed as: Lobj = Lobj_s + Lobj_m + Lobj_l, where Lobj_s is the objectness loss corresponding to small targets, Lobj_m is the objectness loss corresponding to medium targets, and Lobj_l is the objectness loss corresponding to large targets.

In one possible implementation, the feature extraction module is based on an improved twin network