EP-3633553-B1 - METHOD AND APPARATUS FOR TRAINING OBJECT DETECTION MODEL


Inventors

  • ZHANG, CHANGZHENG
  • JIN, XIN
  • TU, Dandan

Dates

Publication Date
2026-05-06
Application Date
2019-03-05

Claims (11)

  1. An object detection model training method performed by a computing device, wherein the method comprises: obtaining a training image, and establishing a backbone network based on the training image, wherein the backbone network includes a convolutional network of K convolution layers, K is a positive integer greater than 0, and the backbone network extracts feature maps of the training image; inputting, into a region proposal network, feature maps output by the backbone network; selecting, by the region proposal network based on a region proposal parameter, a plurality of proposal regions from the feature maps output by the backbone network, and inputting feature submaps corresponding to the plurality of proposal regions into a classifier, wherein the region proposal parameter comprises a length and a width of a proposal region; detecting, by the classifier, a to-be-detected object in the training image based on the feature submaps corresponding to the plurality of proposal regions, wherein a size of the to-be-detected object is not distinguished; comparing a detection result output by the classifier with a prior result of the to-be-detected object in the training image to obtain a comparison result, and exciting, based on the comparison result, at least one of the following: a model parameter of a convolution kernel of the backbone network, a model parameter of a convolution kernel of the region proposal network, the region proposal parameter, and a parameter of the classifier; duplicating the classifier after the comparing step to obtain P copies, wherein P is at least two, thereby obtaining at least two classifiers; classifying, by the region proposal network, the plurality of proposal regions into P proposal region sets based on sizes of the plurality of proposal regions, wherein each proposal region set comprises at least one proposal region; inputting, by the region proposal network into the at least two classifiers,
feature submaps corresponding to proposal regions comprised in each proposal region set, wherein proposal regions in each proposal region set have approximate sizes, and wherein feature submaps corresponding to the P proposal region sets are separately input into the P classifiers, so that one proposal region set corresponds to one classifier, and a feature submap corresponding to a proposal region in the proposal region set is input into the corresponding classifier; performing, by each of the at least two classifiers, the following actions: detecting a to-be-detected object in the training image based on the feature submap corresponding to the proposal region comprised in an obtained proposal region set, for detecting to-be-detected objects with approximate sizes; and comparing the detection result output by each of the at least two classifiers with the prior result of a size of the to-be-detected object that is in the training image and that corresponds to the feature submap input into the corresponding classifier, thereby obtaining a respective comparison difference, and exciting, based on the respective comparison difference, at least one of the following: the model parameter of the convolution kernel of the backbone network, the model parameter of the convolution kernel of the region proposal network, the region proposal parameter, and a parameter of each classifier.
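The size-based routing in claim 1, where each of the P duplicated classifiers receives only proposal regions of approximate size, can be illustrated with a minimal sketch. The area thresholds, function name, and (width, height) tuple layout are assumptions for illustration, not details taken from the patent.

```python
import bisect

def partition_proposals(proposals, boundaries):
    """Split proposal regions into P = len(boundaries) + 1 sets by size.

    `proposals` is a list of (width, height) tuples and `boundaries` is a
    sorted list of area thresholds separating the size ranges. Proposals
    that fall in the same range have approximate sizes and would all be
    routed to the same duplicated classifier.
    """
    sets = [[] for _ in range(len(boundaries) + 1)]
    for w, h in proposals:
        # Index of the size range that this proposal's area falls into.
        idx = bisect.bisect_right(boundaries, w * h)
        sets[idx].append((w, h))
    return sets

# Each set would then be fed to its own classifier, e.g.:
# detections = [clf(submaps(s)) for clf, s in zip(classifiers, sets)]
```

With two thresholds this yields P = 3 sets, matching a configuration with three duplicated classifiers for small, medium, and large objects.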
  2. The method according to claim 1, wherein the method further comprises: obtaining a system parameter, wherein the system parameter comprises at least one of the following: a quantity of size clusters of to-be-detected objects in the training image and a training computing capability; and determining, based on the system parameter, a quantity of classifiers that are obtained through duplication.
  3. The method according to claim 2, wherein when the system parameter comprises the quantity of size clusters of the to-be-detected objects in the training image, the obtaining a system parameter comprises: performing clustering on sizes of the to-be-detected objects in the training image, to obtain the quantity of size clusters of the to-be-detected objects in the training image.
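Claim 3 obtains the quantity of size clusters by clustering the sizes of the to-be-detected objects. One hedged way to sketch such one-dimensional clustering is a gap-based pass over sorted sizes; the `gap_ratio` threshold and function name are illustrative assumptions, not the patent's method.

```python
def count_size_clusters(sizes, gap_ratio=2.0):
    """Estimate the quantity of size clusters among to-be-detected objects.

    Sorted object sizes are split into a new cluster wherever two
    consecutive sizes differ by more than a factor of `gap_ratio`
    (an assumed threshold). The returned count could then determine
    how many classifier copies to create.
    """
    if not sizes:
        return 0
    ordered = sorted(sizes)
    clusters = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur > prev * gap_ratio:
            clusters += 1  # large gap: start a new size cluster
    return clusters
```

For example, object sizes [8, 9, 10, 30, 32, 100] would yield three clusters, suggesting P = 3 classifiers.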
  4. The method according to any one of claims 1 to 3, wherein the feature maps output by the backbone network comprise at least two feature maps.
  5. An object detection model training apparatus, comprising: an object detection model, configured to: obtain a training image, and establish a backbone network based on the training image, wherein the backbone network includes a convolutional network of K convolution layers, K is a positive integer greater than 0, and the backbone network extracts feature maps of the training image; select, based on a region proposal parameter, a plurality of proposal regions from feature maps output by the backbone network, and input feature submaps corresponding to the plurality of proposal regions into a classifier, wherein the region proposal parameter comprises a length and a width of a proposal region; and detect a to-be-detected object in the training image based on the feature submaps corresponding to the plurality of proposal regions, wherein a size of the to-be-detected object is not distinguished; an excitation module, configured to: compare the detection result with a prior result of the to-be-detected object in the training image to obtain a comparison result, and excite, based on the comparison result, at least one of the following: a model parameter of a convolution kernel of the backbone network, a model parameter of a convolution kernel of the region proposal network, the region proposal parameter, and a parameter of the classifier; and an initialization module, configured to duplicate the classifier after the comparing step to obtain P copies, wherein P is at least two, thereby obtaining at least two classifiers, wherein the object detection model is further configured to: classify the plurality of proposal regions into P proposal region sets based on sizes of the plurality of proposal regions, wherein each proposal region set comprises at least one proposal region; and input, into the at least two classifiers, feature submaps corresponding to proposal regions comprised in each proposal region set, wherein proposal regions in each proposal region set have
approximate sizes, and wherein feature submaps corresponding to the P proposal region sets are separately input into the P classifiers, so that one proposal region set corresponds to one classifier, and a feature submap corresponding to a proposal region in the proposal region set is input into the corresponding classifier; and each of the at least two classifiers performs the following actions: detecting a to-be-detected object in the training image based on the feature submap corresponding to a proposal region comprised in an obtained proposal region set, for detecting to-be-detected objects with approximate sizes; and comparing the detection result output by each of the at least two classifiers with the prior result of a size of the to-be-detected object that is in the training image and that corresponds to the feature submap input into the corresponding classifier, thereby obtaining a respective comparison difference, and exciting, based on the respective comparison difference, at least one of the following: the model parameter of the convolution kernel of the backbone network, the model parameter of the convolution kernel of the region proposal network, the region proposal parameter, and a parameter of each classifier.
  6. The apparatus according to claim 5, wherein the initialization module is further configured to: obtain a system parameter, wherein the system parameter comprises at least one of the following: a quantity of size clusters of to-be-detected objects in the training image and a training computing capability; and determine, based on the system parameter, a quantity of classifiers that are obtained through duplication and that are in the at least two classifiers.
  7. The apparatus according to claim 6, wherein the initialization module is further configured to perform clustering on sizes of the to-be-detected objects in the training image, to obtain the quantity of size clusters of the to-be-detected objects in the training image.
  8. The apparatus according to any one of claims 5 to 7, wherein the feature maps output by the backbone network comprise at least two feature maps.
  9. A computing device system, comprising at least one computing device, wherein each computing device comprises a processor and a memory, and a processor of the at least one computing device is configured to perform the method according to any one of claims 1 to 4.
  10. A non-transient readable storage medium comprising instructions which, when executed by at least one computing device in a computing device system, cause the at least one computing device to perform the method according to any one of claims 1 to 4.
  11. A computing device program product, wherein when the computing device program product is executed by at least one computing device in a computing device system, the at least one computing device performs the method according to any one of claims 1 to 4.

Description

TECHNICAL FIELD

This application relates to the field of computer technologies, and in particular, to an object detection model training method, and an apparatus and a computing device for performing the method.

BACKGROUND

Object detection is an artificial intelligence technology used for accurately locating and detecting, by type, objects in an image or video. Object detection includes a plurality of segment fields such as general object detection, face detection, pedestrian detection, and text detection. In recent years, with much research in the academic and industrial circles and increasingly mature algorithms, deep learning based object detection solutions have been used in actual products in municipal security protection (pedestrian detection, vehicle detection, license plate detection, and the like), finance (object detection, face scanning login, and the like), the internet (identity verification), intelligent terminals, and the like. Currently, object detection is widely applied in a plurality of simple and medium-complexity scenarios (for example, face detection in an access control scenario or a checkpoint scenario). In an open environment, how to maintain robustness of a trained object detection model against a plurality of adverse factors, such as a greatly changeable size, blocking, and distortion of a to-be-detected object, and how to improve detection precision are still problems to be resolved. The document "SInet: A Scale-insensitive Convolutional Neural Network for Fast Vehicle Detection" by Xiaowei Hu et al. introduces a vehicle detection approach based on a scale-insensitive convolutional neural network.

SUMMARY

This application provides an object detection model training method. The method improves detection precision of a trained object detection model. According to a first aspect, an object detection model training method performed by a computing device according to claim 1 is provided.
According to the foregoing method, the training image is input into an object detection model twice to train the object detection model. In the training in the first phase, a size of a to-be-detected object is not distinguished, so that the trained classifier has a global view. In the training in the second phase, each classifier obtained through duplication is responsible for detecting to-be-detected objects in one proposal region set, that is, for detecting to-be-detected objects with approximate sizes, so that each trained classifier is more sensitive to to-be-detected objects of its corresponding size range. The trainings in the two phases improve precision of the trained object detection model for detecting to-be-detected objects with different sizes.

In a possible implementation, the method further includes: obtaining a system parameter, where the system parameter includes at least one of the following: a quantity of size clusters of to-be-detected objects in the training image and a training computing capability; and determining, based on the system parameter, a quantity of classifiers that are obtained through duplication and that are in the at least two classifiers. The quantity of classifiers obtained through duplication may be manually configured, or may be calculated based on a condition of the to-be-detected objects in the training image. Properly selecting the quantity of classifiers obtained through duplication further improves the precision of the trained object detection model for detecting to-be-detected objects with different sizes.

In a possible implementation, when the system parameter includes the quantity of size clusters of the to-be-detected objects in the training image, the obtaining a system parameter includes: performing clustering on sizes of the to-be-detected objects in the training image, to obtain the quantity of size clusters of the to-be-detected objects in the training image.
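The two-phase schedule described above can be sketched in outline. This is a minimal illustration, assuming a hypothetical `train_step` callback and a dict-based model layout; neither is specified by the patent.

```python
import copy

def two_phase_training(model, images, priors, num_classifiers, train_step):
    """Hedged sketch of the two-phase training schedule.

    Phase 1: one size-agnostic classifier is trained with a global view.
    Phase 2: the phase-1 classifier is duplicated P times and each copy
    is trained only on proposal regions from its own size range.
    `model`, `train_step`, and the dict layout are illustrative.
    """
    # Phase 1: train backbone, region proposal network, and the single
    # classifier together, without distinguishing object sizes.
    for image, prior in zip(images, priors):
        train_step(model, model["classifier"], image, prior)

    # Duplicate the trained classifier to obtain P specialized copies.
    model["classifiers"] = [copy.deepcopy(model["classifier"])
                            for _ in range(num_classifiers)]

    # Phase 2: each copy sees only proposals of approximate size
    # (identified here by the size-set index `k`).
    for image, prior in zip(images, priors):
        for k, clf in enumerate(model["classifiers"]):
            train_step(model, clf, image, prior, size_set=k)
    return model
```

The comparison against the prior result and the excitation of model parameters would live inside `train_step`; the sketch only shows the order of the two phases and the duplication step between them.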
In a possible implementation, the feature maps output by the backbone network include at least two feature maps. Different convolution layers of the backbone network may correspond to different strides. Therefore, to-be-detected objects in proposal regions in feature maps output at the different convolution layers may also have different sizes, and at least two feature maps are extracted by the backbone network. In this way, sources of proposal regions are increased, and the precision of the trained object detection model for detecting to-be-detected objects with different sizes is further improved.

A second aspect of this application provides an object detection model training apparatus according to claim 5. In a possible implementation, the initialization module is further configured to: obtain a system parameter, where the system parameter includes at least one of the following: a quantity of size clusters of to-be-detected objects in the training image and a training computing capability; and determine, based on the system parameter, a quantity of classifiers that are