CN-121982756-A - Method, device, equipment and storage medium for identifying expression of driver based on YOLO model
Abstract
The application discloses a method, an apparatus, a device, and a storage medium for recognizing a driver's expression based on a YOLO model, relating to the technical field of facial expression recognition. The method comprises: constructing and training a model based on a YOLO model and a facial expression data set to obtain a GFB-YOLO model, wherein the GFB-YOLO model comprises a backbone network, a scale edge generator, a dynamic depth-separable convolution module, and a spatial flexible mixing module; inputting data to be recognized into the GFB-YOLO model, and processing the data to be recognized through the backbone network, the scale edge generator, the dynamic depth-separable convolution module, and the spatial flexible mixing module to obtain an expression feature map; and determining the driver's expression category according to the expression feature map. Through these steps, the method improves the model's recognition accuracy and robustness for subtle expressions in complex scenes.
Inventors
- Nong Zhibin
- Chen Xuanqi
- Yin Minggun
- Luo Jia
- Zeng Xiandi
- Li Yingying
- Chen Zhong
- Wei Qijun
- Liang Huiping
Assignees
- Dongfeng Liuzhou Motor Co., Ltd. (东风柳州汽车有限公司)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-22
Claims (10)
- 1. A method for recognizing a driver's expression based on a YOLO model, the method comprising: acquiring data to be recognized of a driver's face region and a facial expression data set; constructing and training a model based on the YOLO model and the facial expression data set to obtain a GFB-YOLO model, wherein the GFB-YOLO model comprises a backbone network, a scale edge generator, a dynamic depth-separable convolution module, and a spatial flexible mixing module; inputting the data to be recognized into the GFB-YOLO model so as to process the data to be recognized through the backbone network, the scale edge generator, the dynamic depth-separable convolution module, and the spatial flexible mixing module to obtain an expression feature map; and determining the driver's expression category according to the expression feature map.
- 2. The method of claim 1, wherein the step of constructing and training a model based on the YOLO model and the facial expression data set to obtain a GFB-YOLO model comprises: constructing an initial YOLO network based on a YOLO model and taking the initial YOLO network as the backbone network; constructing a parallel global edge information transmission path in the backbone network, wherein the global edge information transmission path comprises the scale edge generator; replacing a first module in a neck network of the backbone network with a second module, wherein the second module comprises the dynamic depth-separable convolution module and the spatial flexible mixing module; and training a network comprising the backbone network, the scale edge generator, the dynamic depth-separable convolution module, and the spatial flexible mixing module according to the facial expression data set to obtain the GFB-YOLO model.
- 3. The method of claim 1, wherein the step of processing the data to be recognized through the backbone network, the scale edge generator, the dynamic depth-separable convolution module, and the spatial flexible mixing module to obtain an expression feature map comprises: performing multi-scale edge feature extraction and fusion on the data to be recognized according to the scale edge generator to obtain an edge-enhanced feature map; performing multi-scale convolution processing on the edge-enhanced feature map according to the dynamic depth-separable convolution module to obtain a dynamic convolution feature map; and performing feature fusion and channel weighting processing on the dynamic convolution feature map according to the spatial flexible mixing module to obtain an expression feature map.
- 4. The method according to claim 3, wherein the step of performing multi-scale edge feature extraction and fusion on the data to be recognized according to the scale edge generator to obtain an edge-enhanced feature map comprises: extracting features from the data to be recognized according to a max pooling layer and a convolution layer in the scale edge generator to obtain resolution feature maps, wherein the resolution feature maps comprise a first resolution feature map, a second resolution feature map, and a third resolution feature map, the first resolution being greater than the second resolution and the second resolution being greater than the third resolution; performing edge gradient calculation on the resolution feature maps according to a Sobel operator in the scale edge generator to obtain horizontal edge gradient features and vertical edge gradient features; fusing the horizontal edge gradient features and the vertical edge gradient features with the feature maps of the corresponding resolutions to obtain multi-scale edge features; and performing channel alignment and concatenation on the multi-scale edge features to obtain an edge-enhanced feature map.
- 5. The method of claim 3, wherein the step of performing multi-scale convolution processing on the edge-enhanced feature map according to the dynamic depth-separable convolution module to obtain a dynamic convolution feature map comprises: dividing the edge-enhanced feature map into three branches of input data along the channel dimension according to the dynamic depth-separable convolution module; performing depth-separable convolution on the input data respectively according to a rectangular convolution kernel, a horizontal band-shaped convolution kernel, and a vertical band-shaped convolution kernel in the dynamic depth-separable convolution module to obtain three branch features; and performing weighted fusion and channel concatenation on the three branch features according to a global average pooling layer and a normalization layer in the dynamic depth-separable convolution module to obtain a dynamic convolution feature map.
- 6. The method of claim 3, wherein the step of performing feature fusion and channel weighting processing on the dynamic convolution feature map according to the spatial flexible mixing module to obtain an expression feature map comprises: dividing the dynamic convolution feature map equally along the channel dimension according to the spatial flexible mixing module to obtain a first feature group and a second feature group; performing dynamic depth-separable convolution on the first feature group according to a first convolution kernel in the spatial flexible mixing module to obtain global features; performing dynamic depth-separable convolution on the second feature group according to a second convolution kernel in the spatial flexible mixing module to obtain local features; fusing the global features and the local features according to a third convolution kernel in the spatial flexible mixing module to obtain fused features, wherein the first convolution kernel is larger than the second convolution kernel and the second convolution kernel is larger than the third convolution kernel; and weighting the fused features channel by channel according to a scaling coefficient in the spatial flexible mixing module and performing residual connection with the dynamic convolution feature map to obtain an expression feature map.
- 7. The method of claim 6, wherein the scaling coefficients comprise a first scaling coefficient and a second scaling coefficient, and the step of weighting the fused features channel by channel according to the scaling coefficients in the spatial flexible mixing module and performing residual connection with the dynamic convolution feature map comprises: weighting the fused features channel by channel according to the first scaling coefficient in the spatial flexible mixing module to obtain first weighted features; performing residual connection between the first weighted features and the dynamic convolution feature map to obtain first residual features; performing nonlinear transformation on the first residual features according to a convolutional gated linear unit in the spatial flexible mixing module to obtain transformed features; weighting the transformed features channel by channel according to the second scaling coefficient in the spatial flexible mixing module to obtain second weighted features; and performing residual connection between the second weighted features and the first residual features to obtain an expression feature map.
- 8. A driver expression recognition apparatus based on a YOLO model, the apparatus comprising: a data processing module for acquiring data to be recognized of a driver's face region and a facial expression data set; a model construction module for constructing and training a model based on the YOLO model and the facial expression data set to obtain a GFB-YOLO model, wherein the GFB-YOLO model comprises a backbone network, a scale edge generator, a dynamic depth-separable convolution module, and a spatial flexible mixing module; a feature extraction module for inputting the data to be recognized into the GFB-YOLO model so as to process the data to be recognized through the backbone network, the scale edge generator, the dynamic depth-separable convolution module, and the spatial flexible mixing module to obtain an expression feature map; and an expression recognition module for determining the driver's expression category according to the expression feature map.
- 9. A driver expression recognition device based on a YOLO model, wherein the device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the YOLO-model-based driver expression recognition method according to any one of claims 1 to 7.
- 10. A storage medium, wherein the storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the YOLO-model-based driver expression recognition method according to any one of claims 1 to 7.
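As an illustration of the multi-scale Sobel edge extraction described in claim 4, the scale edge generator might be sketched as follows in PyTorch. This is a minimal, non-authoritative sketch: the channel counts, the 3×3 projection convolutions, the additive gradient fusion, and the bilinear upsampling used for channel alignment are all assumptions not specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleEdgeGenerator(nn.Module):
    """Sketch of the scale edge generator of claim 4 (sizes are assumptions)."""

    def __init__(self, in_ch: int = 3, out_ch: int = 32):
        super().__init__()
        # Max pooling + convolution yield three progressively coarser resolutions.
        self.pool = nn.MaxPool2d(2)
        self.proj = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 3, padding=1) for _ in range(3))
        # Fixed Sobel kernels for horizontal / vertical edge gradients.
        sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("sobel_x", sx.view(1, 1, 3, 3))
        self.register_buffer("sobel_y", sx.t().contiguous().view(1, 1, 3, 3))
        # 1x1 convolutions for channel alignment before concatenation.
        self.align = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 1) for _ in range(3))

    def _sobel(self, f: torch.Tensor):
        # Depthwise Sobel filtering: the same kernel is applied to every channel.
        c = f.shape[1]
        gx = F.conv2d(f, self.sobel_x.expand(c, 1, 3, 3).contiguous(), padding=1, groups=c)
        gy = F.conv2d(f, self.sobel_y.expand(c, 1, 3, 3).contiguous(), padding=1, groups=c)
        return gx, gy

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats, cur = [], x
        for i in range(3):                    # first > second > third resolution
            f = self.proj[i](cur)
            gx, gy = self._sobel(f)
            fused = f + gx + gy               # fuse gradients with the same-resolution map
            # Upsample to the input resolution, then channel-align.
            fused = F.interpolate(fused, size=x.shape[-2:], mode="bilinear",
                                  align_corners=False)
            feats.append(self.align[i](fused))
            cur = self.pool(cur)
        return torch.cat(feats, dim=1)        # edge-enhanced feature map
```

With a 1×3×64×64 input, the sketch returns a 1×96×64×64 edge-enhanced map (three aligned 32-channel scales concatenated along the channel axis).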
Description
Method, device, equipment and storage medium for identifying expression of driver based on YOLO model
Technical Field
The present application relates to the field of facial expression recognition technologies, and in particular to a method, an apparatus, a device, and a storage medium for recognizing a driver's expression based on a YOLO model.
Background
Driver expression recognition based on computer vision is a key technology in the field of driver state monitoring and is of great significance for improving driving safety. At present, facial expression recognition methods based on deep learning have become mainstream; among them, end-to-end expression recognition is performed using a single-stage object detection model such as YOLO, with attention to the balance between speed and accuracy. In actual driving scenarios, however, this approach still faces many challenges. First, the driving environment is complex and changeable, with illumination variation, partial occlusion (such as glasses and masks), and complex background interference, so the model must be highly robust. Second, a driver's expressions are often subtle, especially the weak expression changes that occur during fatigue or distraction, so the model must accurately capture local detail features. In addition, in-vehicle systems impose strict real-time requirements on the algorithm, so the model must keep computational cost low while maintaining high accuracy. Existing solutions attempt to address these problems, but each has limitations. For example, some recognition methods based on improved YOLO models have insufficient feature extraction capability for subtle expression differences, and their recognition accuracy remains to be improved.
Other schemes that adopt complex network structures (such as combined attention mechanisms) improve accuracy to some extent but suffer from larger parameter counts, heavy computation, difficulty in meeting real-time processing requirements, and weaker generalization under complex backgrounds. Therefore, how to improve the model's recognition accuracy and robustness for subtle expressions in complex scenes while ensuring real-time performance is a problem urgently to be solved.
Disclosure of Invention
The main purpose of the present application is to provide a method, an apparatus, a device, and a storage medium for recognizing a driver's expression based on a YOLO model, aiming to solve the technical problem of improving the model's recognition accuracy and robustness for subtle expressions in complex scenes while ensuring real-time performance. To achieve the above object, the present application provides a method for recognizing a driver's expression based on a YOLO model, the method comprising: acquiring data to be recognized of a driver's face region and a facial expression data set; constructing and training a model based on the YOLO model and the facial expression data set to obtain a GFB-YOLO model, wherein the GFB-YOLO model comprises a backbone network, a scale edge generator, a dynamic depth-separable convolution module, and a spatial flexible mixing module; inputting the data to be recognized into the GFB-YOLO model so as to process the data to be recognized through the backbone network, the scale edge generator, the dynamic depth-separable convolution module, and the spatial flexible mixing module to obtain an expression feature map; and determining the driver's expression category according to the expression feature map.
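Of the modules named above, the three-branch dynamic depth-separable convolution (claim 5) can likewise be sketched. The kernel sizes (3×3 for the rectangular branch, 1×5 and 5×1 for the band-shaped branches), the softmax normalization of the branch weights, and the pointwise convolution are assumptions; the patent itself specifies only the channel split into three branches, the three kernel shapes, and the global-average-pooling plus normalization fusion.

```python
import torch
import torch.nn as nn

class DynamicDWConv(nn.Module):
    """Sketch of the dynamic depth-separable convolution module of claim 5.

    Kernel sizes and the softmax-based branch weighting are illustrative
    assumptions, not taken from the patent text.
    """

    def __init__(self, ch: int = 48):
        super().__init__()
        assert ch % 3 == 0, "channels must split evenly into three branches"
        b = ch // 3
        # Rectangular, horizontal-band and vertical-band depthwise kernels.
        self.square = nn.Conv2d(b, b, 3, padding=1, groups=b)
        self.hband = nn.Conv2d(b, b, (1, 5), padding=(0, 2), groups=b)
        self.vband = nn.Conv2d(b, b, (5, 1), padding=(2, 0), groups=b)
        self.point = nn.Conv2d(ch, ch, 1)      # pointwise half of the separable conv
        self.gap = nn.AdaptiveAvgPool2d(1)     # global average pooling for weights
        self.norm = nn.BatchNorm2d(ch)         # normalization layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b1, b2, b3 = torch.chunk(x, 3, dim=1)  # split along the channel dimension
        outs = [self.square(b1), self.hband(b2), self.vband(b3)]
        # One scalar weight per branch from global average pooling, softmax-normalized.
        w = torch.stack([self.gap(o).mean(dim=1, keepdim=True) for o in outs], dim=0)
        w = torch.softmax(w, dim=0)
        outs = [o * w[i] for i, o in enumerate(outs)]  # weighted fusion
        y = torch.cat(outs, dim=1)                     # channel concatenation
        return self.norm(self.point(y))                # dynamic convolution feature map
```

The band-shaped 1×5 and 5×1 kernels give the branches elongated receptive fields along each axis while keeping the depthwise parameter count low, which is consistent with the real-time constraint discussed above.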
In one embodiment, the step of constructing and training a model based on the YOLO model and the facial expression data set to obtain a GFB-YOLO model includes: constructing an initial YOLO network based on a YOLO model and taking the initial YOLO network as the backbone network; constructing a parallel global edge information transmission path in the backbone network, wherein the global edge information transmission path comprises the scale edge generator; replacing a first module in a neck network of the backbone network with a second module, wherein the second module comprises the dynamic depth-separable convolution module and the spatial flexible mixing module; and training a network comprising the backbone network, the scale edge generator, the dynamic depth-separable convolution module, and the spatial flexible mixing module according to the facial expression data set to obtain the GFB-YOLO model. In one embodiment, the step of processing the data to be recognized by the scale edge generator, the dynamic depth-separable convolution module, and the spatial flexible mixing module to obtain an expression feature map includes: performing multi-scale edge feature extraction and fusion on the data to be recognized according to the scale edge generator to obtain an edge-enhanced feature map; performing multi-scale convo