CN-116052219-B - Human body detection model training method, device, equipment and storage medium

Abstract

The application provides a human body detection model training method, device, equipment and storage medium. The method comprises: inputting a training set into a pre-configured student model and extracting features from the sample images using a trunk module of the student model; inputting the feature maps output by the trunk module into a neck module and an auxiliary neck module, respectively, each of which performs feature fusion to obtain a fused feature map; obtaining a target value corresponding to each channel of the fused feature maps and converting the target values to obtain the feature distributions corresponding to the neck module and the auxiliary neck module, respectively; calculating a distillation loss from the feature distributions and training the neck module of the student model with the distillation loss; and sending the output of the trained neck module to a head module for training, to obtain a complete trained human body detection model. The application achieves online knowledge distillation of the human body detection model and reduces the time and cost of training it.

Inventors

HE XIANG
HUANG ZEYUAN

Assignees

北京龙智数科科技服务有限公司

Dates

Publication Date: 20260508
Application Date: 20230201

Claims (9)

1. A human body detection model training method, comprising: inputting a training set into a pre-configured student model, and extracting features of sample images in the training set by using a trunk module of the student model; inputting the feature maps output by the trunk module into a neck module and an auxiliary neck module, respectively, and performing feature fusion with the neck module and the auxiliary neck module, respectively, to obtain fused feature maps, wherein the auxiliary neck module keeps the same input and output channels as those of the student model, and the depth of the cross-stage partial (CSP) structure of the auxiliary neck module is increased so as to guide the output of the neck module of the student model; acquiring a target value corresponding to each channel of the fused feature maps, and converting the target values to obtain feature distributions corresponding to the neck module and the auxiliary neck module, respectively, wherein acquiring the target value comprises determining the number of channels in the fused feature map, acquiring a feature map height H and a feature map width W corresponding to each channel, and taking the H × W values of each channel as the target value corresponding to that channel; calculating a distillation loss according to the feature distributions, and training the neck module of the student model by using the distillation loss; and sending the output of the trained neck module to a head module for training, to obtain a complete trained human body detection model.
2. The method according to claim 1, wherein inputting the feature maps output by the trunk module into a neck module and an auxiliary neck module, respectively, and performing feature fusion with the neck module and the auxiliary neck module, respectively, comprises: taking the three layers of feature maps output by the trunk module as the input of both the neck module and the auxiliary neck module; and fusing the three layers of feature maps with the neck module and with the auxiliary neck module, respectively, to obtain a fused feature map corresponding to the neck module and a fused feature map corresponding to the auxiliary neck module.
3. The method according to claim 2, wherein converting the target values to obtain the feature distributions corresponding to the neck module and the auxiliary neck module, respectively, comprises: performing a normalization operation on the target values of the fused feature map corresponding to the neck module and on the target values of the fused feature map corresponding to the auxiliary neck module, respectively, to obtain the feature distributions corresponding to the neck module and the auxiliary neck module, respectively.
4. The method of claim 1, wherein calculating a distillation loss according to the feature distributions comprises calculating the distillation loss using the following robust function: [formula not reproduced in the source text] wherein [symbol missing in the source text] represents the difference between the feature distribution of the neck module and the feature distribution of the auxiliary neck module.
5. The method of claim 1, wherein training the neck module of the student model by using the distillation loss comprises: training the neck module of the student model channel by channel, using the distillation loss calculated for each of the channels, so as to minimize the difference between the fused feature map output by the neck module and the fused feature map output by the auxiliary neck module.
6. The method according to any one of claims 1 to 5, wherein the human body detection model is a model built on the YOLOX algorithm.
7. A human body detection model training device, comprising: an extraction module configured to input a training set into a pre-configured student model and to extract features of sample images in the training set by using a trunk module of the student model; a fusion module configured to input the feature maps output by the trunk module into a neck module and an auxiliary neck module, respectively, and to perform feature fusion with the neck module and the auxiliary neck module, respectively, to obtain fused feature maps, wherein the auxiliary neck module keeps the same input and output channels as those of the student model, and the depth of the cross-stage partial (CSP) structure of the auxiliary neck module is increased so as to guide the output of the neck module of the student model; a conversion module configured to acquire a target value corresponding to each channel of the fused feature maps and to convert the target values to obtain feature distributions corresponding to the neck module and the auxiliary neck module, respectively, wherein acquiring the target value comprises determining the number of channels in the fused feature map, acquiring a feature map height H and a feature map width W corresponding to each channel, and taking the H × W values of each channel as the target value corresponding to that channel; a calculation module configured to calculate a distillation loss according to the feature distributions and to train the neck module of the student model by using the distillation loss; and a training module configured to send the output of the trained neck module to a head module for training, to obtain a complete trained human body detection model.
8. An electronic device comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the method of any one of claims 1 to 6.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 6.
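
The per-channel conversion and loss described in claims 1, 3, 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the softmax normalization is one possible reading of the "normalization operation" in claim 3, the Huber function is only a stand-in for the robust function of claim 4 (whose formula is not reproduced in this text), and all function names and the `delta` parameter are hypothetical.

```python
import math

def channel_distribution(channel):
    """Flatten one H x W channel and normalize its H*W values.

    Softmax is an assumption; the source only says "normalization operation".
    """
    values = [v for row in channel for v in row]
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def huber(diff, delta=1.0):
    """Huber function, an assumed stand-in for the robust function of claim 4."""
    a = abs(diff)
    return 0.5 * diff * diff if a <= delta else delta * (a - 0.5 * delta)

def channel_distillation_loss(neck_map, aux_map, delta=1.0):
    """Return one loss value per channel for a C x H x W pair of fused maps."""
    if len(neck_map) != len(aux_map):
        raise ValueError("neck and auxiliary neck must have the same channel count")
    per_channel = []
    for neck_ch, aux_ch in zip(neck_map, aux_map):
        p = channel_distribution(neck_ch)  # student neck distribution
        q = channel_distribution(aux_ch)   # auxiliary neck distribution
        per_channel.append(sum(huber(pi - qi, delta) for pi, qi in zip(p, q)))
    return per_channel

# Toy fused maps with C=2, H=2, W=2; the second channel is identical in both.
neck = [[[0.0, 1.0], [2.0, 3.0]], [[1.0, 1.0], [1.0, 1.0]]]
aux = [[[0.0, 1.0], [2.0, 3.5]], [[1.0, 1.0], [1.0, 1.0]]]
losses = channel_distillation_loss(neck, aux)
```

In this toy input only the first channel differs between the two necks, so only that channel contributes a non-zero loss; summing or averaging the per-channel values into a single training loss is a further design choice the claims do not fix.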

Description

Human body detection model training method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a human body detection model training method, apparatus, device, and storage medium.

Background

Object detection algorithms are widely used in practice. The YOLOX algorithm is a common choice and, because of its high detection accuracy, is often used in industrial projects such as human body detection. Knowledge distillation uses a large model to improve the generalization of a small model, and is widely used because it improves model accuracy without reducing model speed.

In human body detection tasks, knowledge distillation can therefore improve model accuracy while preserving inference speed. However, existing knowledge-distillation-based training methods for human body detection models first train a large model as a teacher model and then use the teacher model to train the student model. Because an additional teacher model has to be trained, these methods are time-consuming and costly.

Disclosure of Invention

In view of the above, the embodiments of the present application provide a human body detection model training method, apparatus, device, and storage medium, so as to solve the prior-art problems of long training time and high cost.
According to a first aspect of the embodiments of the present application, a human body detection model training method is provided, comprising: inputting a training set into a pre-configured student model, and extracting features of sample images in the training set by using a trunk module of the student model; inputting the feature maps output by the trunk module into a neck module and an auxiliary neck module, respectively, and performing feature fusion with the neck module and the auxiliary neck module, respectively, to obtain fused feature maps; obtaining a target value corresponding to each channel of the fused feature maps, and converting the target values to obtain the feature distributions corresponding to the neck module and the auxiliary neck module, respectively; calculating a distillation loss according to the feature distributions, and training the neck module of the student model by using the distillation loss; and sending the output of the trained neck module to a head module for training, to obtain a complete trained human body detection model.
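
The data flow of the first aspect can be sketched as a single forward pass. This is only an illustration of the order of operations: the function name, its parameters, and the toy callables are all hypothetical, and how the distillation loss is combined with the detection loss during training is not specified here.

```python
def forward_with_auxiliary_neck(images, backbone, neck, aux_neck, head, distill_loss):
    """One forward pass of the student model with the auxiliary neck attached.

    All arguments are hypothetical callables supplied by the caller.
    """
    features = backbone(images)     # multi-scale feature maps from the trunk module
    fused = neck(features)          # student neck output, one fused map per scale
    fused_aux = aux_neck(features)  # auxiliary neck output with matching shapes
    # The distillation loss compares the two necks scale by scale.
    loss = sum(distill_loss(f, g) for f, g in zip(fused, fused_aux))
    predictions = head(fused)       # only the student neck feeds the head
    return predictions, loss

# Toy stand-ins: scalars play the role of feature maps.
preds, loss = forward_with_auxiliary_neck(
    images=None,
    backbone=lambda _: [1.0, 2.0, 3.0],
    neck=lambda fs: [2 * f for f in fs],
    aux_neck=lambda fs: [2 * f + 1 for f in fs],
    head=lambda fused: sum(fused),
    distill_loss=lambda a, b: abs(a - b),
)
```

Because the auxiliary neck only supplies a training signal and never feeds the head, it can be dropped after training, which is what keeps the deployed student model's inference cost unchanged.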
According to a second aspect of the embodiments of the present application, a human body detection model training device is provided, comprising an extraction module, a fusion module, a conversion module, a calculation module and a training module. The extraction module is configured to input a training set into a pre-configured student model and to extract features of sample images in the training set by using a trunk module of the student model. The fusion module is configured to input the feature maps output by the trunk module into a neck module and an auxiliary neck module, respectively, and to perform feature fusion with the neck module and the auxiliary neck module, respectively, to obtain fused feature maps. The conversion module is configured to obtain a target value corresponding to each channel of the fused feature maps and to convert the target values to obtain the feature distributions corresponding to the neck module and the auxiliary neck module, respectively. The calculation module is configured to calculate a distillation loss according to the feature distributions and to train the neck module of the student model by using the distillation loss. The training module is configured to send the output of the trained neck module to a head module for training, to obtain a complete trained human body detection model.

In a third aspect of the embodiments of the present application, there is provided an electronic device including a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.

In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
The at least one technical solution adopted by the embodiments of the present application can achieve the following beneficial effects: The method comprises inputting a training set into a pre-configured student model, extracting features of sample images in the training set by using a trunk module of the student model, inputting the feature maps output by the trunk module into a neck module and an auxiliary neck module, respectively, respectively carryi