
CN-121415262-B - Deep learning-based method and device for identifying disordered materials, electronic equipment and program product

CN121415262B

Abstract

The application discloses a deep learning-based method and device for identifying disordered materials, electronic equipment and a program product. The method is realized by a trained identification model in which a C3k2-ASL module is introduced into the backbone network; the ASL Block in this module models long-range structural dependencies along the width and height directions at low computational cost, enhancing the model's perception of, and identification robustness for, slender, scattered and low-contrast disordered materials. To improve the model's ability to capture the overall spatial distribution of disordered materials, an MGCA module is introduced into the neck network; by extracting multi-dimensional global context information and fusing it through a dynamic attention mechanism, it improves the precise recognition of different types of disordered materials in complex urban scenes. To address class imbalance, the loss function is improved with a constrained logarithmic re-weighting modulation mechanism, so that it degenerates to uniform weighting when the data are balanced and optimizes stably under long-tailed distributions, improving the generalization performance and training stability of the model.

Inventors

  • Wang Peng
  • Liu Jiamei
  • Cai Dashi
  • Zhang Kai

Assignees

  • 深圳市锐明像素科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-24

Claims (8)

  1. A deep learning-based method for identifying disordered materials, characterized by comprising the following steps: extracting features from an image to be identified, which depicts urban public space, based on the backbone network of a pre-trained disordered material identification model to obtain image features; fusing the image features based on the neck network of the disordered material identification model to obtain target fusion features; and identifying the target fusion features based on the identification network of the disordered material identification model to obtain an identification result for the disordered materials in the image to be identified; wherein the neck network comprises an MGCA module used to extract global context information of its first input feature along the width and height dimensions and to enhance the feature response of stacking areas through a dynamic attention mechanism; the MGCA module comprises a segmentation layer, a first branch, a second branch, an activation layer and a first fusion layer, and for the first input feature of the MGCA module: a segmentation operation is executed on the first input feature through the segmentation layer to obtain G groups of segmented features; a feature extraction and fusion operation is performed on the segmented features in the height direction by the first branch and in the width direction by the second branch; a first activation operation is executed through the activation layer on the first-branch output and second-branch output corresponding to each group of segmented features to respectively obtain the height-direction weights and width-direction weights of each group of segmented features; and the height-direction weights and width-direction weights of each group of segmented features are fused with the first input feature through the first fusion layer based on residual connection to obtain the corresponding first output feature; the backbone network comprises a C3k2-ASL module obtained by improving C3k2 with an ASL Block, the ASL Block comprising an SSA sub-module, an RCA sub-module, a space-channel fusion layer, a second splicing layer, a feature extraction structure and a third splicing layer, and for the second input feature of the ASL Block: structural modeling and weighted fusion are performed on the second input feature in the spatial direction through the SSA sub-module to obtain a corresponding spatial fusion feature; a channel attention mechanism is executed on the spatial fusion feature through the RCA sub-module, and the obtained channel weights are fused with the spatial fusion feature to obtain a channel fusion feature; the space-channel fusion layer fuses the spatial fusion feature and the channel fusion feature to obtain a space-channel fusion feature; the space-channel fusion feature, processed by a first scaling factor, is spliced with the second input feature through the second splicing layer to obtain a one-stage feature; a batch normalization operation and an MLP operation are performed on the one-stage feature through the feature extraction structure to obtain a two-stage feature; and the two-stage feature, processed by a second scaling factor, is spliced with the one-stage feature through the third splicing layer to obtain the corresponding second output feature.
  2. The method for identifying disordered materials of claim 1, wherein the first branch and the second branch share a similar branch structure comprising a first global average pooling layer, a global maximum pooling layer, a convolution structure and a first splicing layer, and wherein performing the feature extraction and fusion operation on the segmented features in the height direction through the first branch and in the width direction through the second branch comprises, based on the branch structure, for each group of segmented features: respectively executing a global average pooling operation and a global maximum pooling operation on the segmented features in the corresponding direction through the first global average pooling layer and the global maximum pooling layer to obtain an average pooling feature and a maximum pooling feature; and subjecting the average pooling feature and the maximum pooling feature in sequence to a first convolution operation, a batch normalization operation and a second activation operation through the convolution structure to obtain two convolution features; wherein the corresponding direction of the first branch is the height direction and the corresponding direction of the second branch is the width direction.
  3. The method for identifying disordered materials according to claim 1, wherein the SSA sub-module comprises a BN layer, parallel width-direction and height-direction depthwise separable convolution fusion layers, and a fourth splicing layer, and wherein performing structural modeling and weighted fusion on the second input feature in the spatial direction through the SSA sub-module to obtain the corresponding spatial fusion feature comprises: performing a batch normalization operation on the second input feature through the BN layer to obtain a batch normalization result; performing a width-direction depthwise separable convolution operation on the batch normalization result through the width-direction depthwise separable convolution fusion layer, and weighting and fusing the obtained convolution results through learnable width-direction fusion weights to obtain a width-direction fusion feature; performing a height-direction depthwise separable convolution operation on the batch normalization result through the height-direction depthwise separable convolution fusion layer, and weighting and fusing the obtained convolution results through learnable height-direction fusion weights to obtain a height-direction fusion feature; and splicing the width-direction fusion feature and the height-direction fusion feature through the fourth splicing layer to obtain the spatial fusion feature.
  4. The method of claim 1, wherein the RCA sub-module comprises a second global average pooling layer, a channel attention mechanism layer and a second fusion layer, and wherein executing the channel attention mechanism on the spatial fusion feature through the RCA sub-module and fusing the obtained channel weights with the spatial fusion feature to obtain the channel fusion feature comprises: executing a global average pooling operation on the spatial fusion feature through the second global average pooling layer to obtain a pooling result; sequentially executing a second convolution operation, a third activation operation, a third convolution operation and a fourth activation operation on the pooling result through the channel attention mechanism layer to obtain channel weights; and performing weighted fusion on the spatial fusion feature based on the channel weights in the second fusion layer to obtain the channel fusion feature.
  5. The method for identifying disordered materials according to claim 1 or 2, wherein the loss function for training the disordered material identification model comprises CLRW-BCE Loss, the binary cross-entropy loss with constrained logarithmic re-weighted modulation, whose formula combines: a per-class weight; a parameter controlling the logarithmic sensitivity; a parameter controlling the intensity of the inverse-frequency modulation, the weight degenerating to the logarithmic term alone when this parameter is 0; the total number of categories; the class frequency; and the highest class frequency; L_BCE is the binary cross-entropy loss, n is the number of image samples of all the disordered materials, and the true class and predicted class of the i-th image sample enter L_BCE.
  6. A device for identifying disordered materials, comprising: an extraction module for extracting features from an image to be identified, which depicts urban public space, based on the backbone network of a pre-trained disordered material identification model to obtain image features; a fusion module for fusing the image features based on the neck network of the disordered material identification model to obtain target fusion features; and an identification module for identifying the target fusion features based on the identification network of the disordered material identification model to obtain an identification result for the disordered materials in the image to be identified; wherein the neck network comprises an MGCA module used to extract global context information of its first input feature along the width and height dimensions and to enhance the feature response of stacking areas through a dynamic attention mechanism; the MGCA module comprises a segmentation layer, a first branch, a second branch, an activation layer and a first fusion layer, and for the first input feature of the MGCA module: a segmentation operation is executed on the first input feature through the segmentation layer to obtain G groups of segmented features; a feature extraction and fusion operation is performed on the segmented features in the height direction by the first branch and in the width direction by the second branch; a first activation operation is executed through the activation layer on the first-branch output and second-branch output corresponding to each group of segmented features to respectively obtain the height-direction weights and width-direction weights of each group of segmented features; and the height-direction weights and width-direction weights of each group of segmented features are fused with the first input feature through the first fusion layer based on residual connection to obtain the corresponding first output feature; the backbone network comprises a C3k2-ASL module obtained by improving C3k2 with an ASL Block, the ASL Block comprising an SSA sub-module, an RCA sub-module, a space-channel fusion layer, a second splicing layer, a feature extraction structure and a third splicing layer, and the extraction module comprises an extraction unit configured, for the second input feature of the ASL Block, to: perform structural modeling and weighted fusion on the second input feature in the spatial direction through the SSA sub-module to obtain a corresponding spatial fusion feature; execute a channel attention mechanism on the spatial fusion feature through the RCA sub-module, and fuse the obtained channel weights with the spatial fusion feature to obtain a channel fusion feature; fuse the spatial fusion feature and the channel fusion feature through the space-channel fusion layer to obtain a space-channel fusion feature; splice the space-channel fusion feature, processed by a first scaling factor, with the second input feature through the second splicing layer to obtain a one-stage feature; perform a batch normalization operation and an MLP operation on the one-stage feature through the feature extraction structure to obtain a two-stage feature; and splice the two-stage feature, processed by a second scaling factor, with the one-stage feature through the third splicing layer to obtain the corresponding second output feature.
  7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the deep learning-based method for identifying disordered materials of any one of claims 1 to 5.
  8. A computer program product comprising a computer program which, when executed by a processor, implements the deep learning-based method for identifying disordered materials of any one of claims 1 to 5.
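The formula of the CLRW-BCE Loss in claim 5 is described only in words in this text, so the NumPy sketch below guesses a weight that satisfies the two stated properties: it is uniform when all class frequencies are equal, and it reduces to the logarithmic term alone when the modulation intensity is zero. The parameter names `alpha` and `beta` and the exact functional form are assumptions, not the patent's formula.

```python
import numpy as np

def clrw_weights(class_freqs, alpha=1.0, beta=0.5):
    """Per-class weights for a CLRW-style re-weighting (hypothetical form).
    alpha scales the log sensitivity; beta scales the inverse-frequency
    modulation.  With equal frequencies both factors are 1 (uniform
    weighting); with beta == 0 only the logarithmic term remains."""
    f = np.asarray(class_freqs, dtype=float)
    f_max = f.max()
    log_term = 1.0 + alpha * np.log(f_max / f)  # 1 for the most frequent class
    inv_freq = (f_max / f) ** beta              # 1 everywhere when beta == 0
    return log_term * inv_freq

def clrw_bce(p_true, y_cls, weights):
    """Weighted binary cross-entropy over n samples: p_true holds each
    sample's predicted probability for its true class index y_cls."""
    w = weights[np.asarray(y_cls)]
    bce = -np.log(np.clip(np.asarray(p_true, dtype=float), 1e-7, 1.0))
    return float(np.mean(w * bce))
```

For example, `clrw_weights([100, 100, 100])` yields uniform weights of 1, while a long-tailed count vector up-weights the rare classes through both factors.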

Description

Deep learning-based method and device for identifying disordered materials, electronic equipment and program product

Technical Field

The application belongs to the technical field of image processing, and particularly relates to a deep learning-based disordered material identification method, a disordered material identification device, electronic equipment and a computer program product.

Background

With the improvement of urban construction and fine-grained management, the random piling of building waste, discarded furniture, household garbage and other sundries in public spaces such as roads, sidewalks and green belts has become increasingly prominent, spoiling the urban landscape and potentially causing traffic congestion, sanitation problems and safety hazards. The traditional mode of relying on manual inspection is inefficient and costly, and can hardly achieve real-time, comprehensive supervision of disorderly stacking behavior. Although deep learning has been applied to debris-stacking identification, existing methods generally adopt a convolutional neural network (CNN), which lacks the capability to perceive diverse materials against complex backgrounds, so the prior art still struggles to meet the high-precision, high-robustness requirements of actual city management.

Disclosure of Invention

The application provides a deep learning-based disordered material identification method, a disordered material identification device, electronic equipment and a computer program product, which can effectively perceive the key characteristics of multiple types of disordered materials under complex background conditions and improve the identification precision and robustness of the model, thereby meeting the requirements of high-precision, high-reliability identification in actual city management scenarios.
In a first aspect, the application provides a method for identifying disordered materials based on deep learning, comprising the following steps: extracting features from an image to be identified, which depicts urban public space, based on the backbone network of a pre-trained disordered material identification model to obtain image features; fusing the image features based on the neck network of the disordered material identification model to obtain target fusion features; and identifying the target fusion features based on the identification network of the disordered material identification model to obtain an identification result for the disordered materials in the image to be identified; wherein the neck network comprises an MGCA module used to extract global context information of its first input feature along the width and height dimensions and to enhance the feature response of stacking areas through a dynamic attention mechanism.
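The MGCA flow described above can be sketched minimally in NumPy: channels are split into groups, directional global context is gathered along each spatial axis, turned into sigmoid weights, and fused back with a residual connection. The real module's convolution structure, batch normalization and activation choices are omitted, so this is an illustrative simplification with hypothetical names, not the patented module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mgca(x, groups=2):
    """Grouped directional attention in the spirit of the MGCA description.
    x has shape (C, H, W) with C divisible by `groups`.  Per group, global
    average + max descriptors taken along the width (height branch) and
    along the height (width branch) become direction-wise weights that
    modulate the input, with a residual connection."""
    C, H, W = x.shape
    out = np.empty_like(x)
    for g in np.split(np.arange(C), groups):
        xg = x[g]                                  # one channel group
        h_desc = xg.mean(axis=2) + xg.max(axis=2)  # (Cg, H): context across width
        w_desc = xg.mean(axis=1) + xg.max(axis=1)  # (Cg, W): context across height
        wh = sigmoid(h_desc)[:, :, None]           # height-direction weights
        ww = sigmoid(w_desc)[:, None, :]           # width-direction weights
        out[g] = xg + xg * wh * ww                 # residual fusion
    return out
```

The residual term keeps the original activations intact, so the attention can only emphasize stacking regions rather than suppress the rest of the map to zero.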
Further, the MGCA module comprises a segmentation layer, a first branch, a second branch, an activation layer and a first fusion layer, and for a first input feature: a segmentation operation is executed on the first input feature through the segmentation layer to obtain G groups of segmented features; a feature extraction and fusion operation is performed on the segmented features in the height direction through the first branch and in the width direction through the second branch; a first activation operation is executed through the activation layer on the first-branch output and second-branch output corresponding to each group of segmented features to respectively obtain the height-direction weight and width-direction weight of each group of segmented features; and the height-direction weight and width-direction weight of each group of segmented features are fused with the first input feature through the first fusion layer based on residual connection to obtain the corresponding first output feature. Further, the first branch and the second branch share a similar branch structure comprising a first global average pooling layer, a global maximum pooling layer, a convolution structure and a first splicing layer; performing the feature extraction and fusion operation on the segmented features in the height direction through the first branch and in the width direction through the second branch comprises, based on the branch structure, for each group of segmented features: respectively executing a global average pooling operation and a global maximum pooling operation on the segmented features in the corresponding direction through a fir