CN-121999213-A - RGB-D instance segmentation method, equipment and medium for densely stacked industrial scene
Abstract
The application provides an RGB-D instance segmentation method, device, and medium for densely stacked industrial scenes. The method acquires and preprocesses RGB and depth images; extracts multi-level features through a backbone network; performs modal arbitration at each scale based on a depth confidence assessment, dynamically fusing RGB visual features with depth geometric features to generate a multi-scale arbitration feature map; inputs the feature map into a decoder, where an attention modulation module based on confidence-conditioned depth position encoding modulates the position encoding according to the depth confidence, enhancing reliable geometric perception and suppressing noise, and produces modulated multi-scale fusion features; and predicts the final instance segmentation result from these features through a segmentation head network. By combining dynamic arbitration with attention modulation, the method effectively fuses multimodal information and markedly improves instance segmentation accuracy and robustness under depth noise, occlusion, and dense stacking in complex industrial scenes.
Inventors
- KOU XIRONG
- TANG WENZHONG
- ZHANG HAONAN
Assignees
- Beihang University (北京航空航天大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-06
Claims (10)
- 1. An RGB-D instance segmentation method for a densely stacked industrial scene, characterized by comprising the following steps: S1, acquiring an RGB image and a depth image of a target scene and preprocessing them to obtain aligned RGB image information and depth image information; S2, extracting multi-level RGB visual features and depth geometric features from the RGB image and the depth image through a backbone convolutional neural network, performing a modal arbitration operation on the RGB visual features and the depth geometric features at each feature scale based on a depth confidence assessment result, and dynamically fusing them to generate a multi-scale arbitration feature map; S3, inputting the multi-scale arbitration feature map into a decoder network and processing it with an attention modulation module based on confidence-conditioned depth position encoding, wherein the module modulates the position encoding according to the depth confidence so as to enhance perception of reliable geometric structure and suppress noise, generating modulated multi-scale fusion features; S4, based on the modulated multi-scale fusion features, obtaining a final instance segmentation result through segmentation head network prediction.
- 2. The method of claim 1, wherein the depth image information comprises a depth gradient magnitude map, an invalid pixel mask, and a boundary sharpness mask.
- 3. The method according to claim 1, wherein S2 specifically comprises: S21, concatenating the RGB visual features $F_{rgb}^l$ and the depth geometric features $F_d^l$ from the same level $l$ with a set of predefined depth quality prior indicators P; S22, inputting the concatenated features into a lightweight multi-scale convolutional prediction head to predict a pixel-level depth quality confidence map m for that level, where the value at each pixel lies in (0, 1) and characterizes the reliability of the depth data at that point; S23, performing weighted fusion of the RGB features and the depth features through the confidence map m; S24, repeating steps S21 to S23 to fuse the features of all levels, obtaining a set of fused multi-scale feature maps (an illustrative sketch of S21-S23 follows the claims).
- 4. The method of claim 3, wherein the depth quality prior indicators P comprise a sensor invalid-pixel mask and a Sobel gradient magnitude computed from the depth map, which provide the network with an explicit cue for depth data reliability.
- 5. The method according to claim 3, wherein the weighted fusion of the RGB features and the depth features in step S23 is performed by the following fusion formula: $F_{fuse}^l = \mathrm{LN}\big(m \odot F_d^l + (1 - m + \epsilon) \odot F_{rgb}^l\big)$; wherein $\epsilon$ is a residual factor that guarantees a minimum contribution of the RGB stream, $\odot$ denotes element-wise multiplication, and $\mathrm{LN}$ denotes a layer normalization operation.
- 6. The method according to claim 1, wherein S3 specifically comprises: S31, inputting the multi-scale arbitration feature map into the decoder of a Transformer-based instance segmentation model; S32, in the cross-attention layers of the decoder, applying a modulated confidence-conditioned depth position encoding to the key vectors, said position encoding being composed of a standard 2D sinusoidal position encoding $PE_{2D}$ and a confidence-weighted geometric position encoding, computed as: $PE_{key} = PE_{2D} + m \odot PE_{geo}$; wherein $PE_{geo}$ is the geometric position encoding computed from the raw depth map and $m$ is the confidence map (see the position encoding sketch after the claims).
- 7. The method according to claim 1, wherein S4 specifically comprises: S41, decoding the fused features from the encoder with a decoder based on the modulated attention mechanism to generate a set of instance-aware feature embeddings; S42, feeding the feature embeddings into a segmentation head, which finally outputs a pixel-level segmentation mask for each electronic component through matrix multiplication and upsampling operations.
- 8. The method according to claim 3, wherein a composite loss function is employed for supervised training when generating the pixel-level depth quality confidence map, the composite loss function comprising: a binary cross-entropy loss term based on the invalid-pixel mask ground truth in the depth image; an entropy regularization term that encourages decisive confidence predictions; and a total variation regularization term that promotes spatial smoothness of the confidence map; wherein the composite loss function is expressed as: $\mathcal{L}_{conf} = \sum_{l} \big( \lambda_t \, \mathrm{BCE}(m_l, M_l) + \alpha \, H(m_l) + \beta \, \|\nabla m_l\|_1 \big)$; where $l$ denotes the feature level, $M_l$ is the supervision target derived from the invalid-pixel mask, $\lambda_t$ is an invalid-mask supervision weight that increases with training step $t$, BCE is the binary cross-entropy loss function, $H$ is the binary entropy function, $\|\nabla m_l\|_1$ denotes the L1 norm of the gradient of the confidence map $m_l$, and $\alpha$ and $\beta$ are the weight coefficients of the entropy regularization and the total variation regularization, respectively (see the loss sketch after the claims).
- 9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the RGB-D instance segmentation method for densely stacked industrial scenes according to any one of claims 1-8.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the RGB-D instance segmentation method for densely stacked industrial scenes according to any one of claims 1-8.
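To make claims 3-5 concrete, the following is a minimal PyTorch sketch of the per-level modal arbitration (steps S21-S23), assuming the fusion formula $F_{fuse}^l = \mathrm{LN}(m \odot F_d^l + (1 - m + \epsilon) \odot F_{rgb}^l)$ reconstructed above. The names depth_quality_prior, ArbitrationFusion, and eps are hypothetical, and GroupNorm stands in for the claimed layer normalization over spatial feature maps; the patent publishes no reference implementation.

```python
# Hypothetical sketch of the per-level modal arbitration (claims 3-5).
import torch
import torch.nn as nn
import torch.nn.functional as F

def depth_quality_prior(depth: torch.Tensor) -> torch.Tensor:
    """Prior indicators P from a (B,1,H,W) depth map: sensor invalid-pixel
    mask plus Sobel gradient magnitude (claim 4)."""
    invalid = (depth <= 0).float()
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=depth.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx, gy = F.conv2d(depth, kx, padding=1), F.conv2d(depth, ky, padding=1)
    grad = torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)
    return torch.cat([invalid, grad], dim=1)                 # (B,2,H,W)

class ArbitrationFusion(nn.Module):
    """One feature level (S21-S23): predict confidence m, then fuse RGB and depth."""
    def __init__(self, channels: int, prior_channels: int = 2, eps: float = 0.1):
        super().__init__()
        self.eps = eps                                       # minimum RGB contribution
        self.head = nn.Sequential(                           # lightweight conv prediction head
            nn.Conv2d(2 * channels + prior_channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
        )
        self.norm = nn.GroupNorm(1, channels)

    def forward(self, f_rgb, f_d, prior):
        prior = F.interpolate(prior, size=f_rgb.shape[-2:], mode="bilinear",
                              align_corners=False)           # match feature resolution
        m = torch.sigmoid(self.head(torch.cat([f_rgb, f_d, prior], dim=1)))  # m in (0,1)
        fused = self.norm(m * f_d + (1.0 - m + self.eps) * f_rgb)
        return fused, m

# usage: one level of the arbitration, repeated per scale (S24)
fuse = ArbitrationFusion(channels=64)
fused, m = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32),
                depth_quality_prior(torch.rand(1, 1, 128, 128)))
```

The residual term (1 - m + eps) guarantees the RGB stream a contribution of at least eps even where the network is fully confident in the depth data, matching the role of the residual factor in claim 5.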
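Next, a sketch of the confidence-conditioned depth position encoding of claim 6, under the assumption that both the 2D coordinates and the raw depth are embedded sinusoidally; the claim fixes only the combination rule $PE_{key} = PE_{2D} + m \odot PE_{geo}$, so the embedding layout and the function names here are illustrative.

```python
# Sketch of claim 6: PE_key = PE_2D + m * PE_geo (sinusoidal layout is an assumption).
import math
import torch

def sine_embed(x: torch.Tensor, dim: int, temperature: float = 10000.0) -> torch.Tensor:
    """Embed a (B,H,W) scalar field into (B,dim,H,W) sinusoidal features."""
    freqs = torch.arange(dim // 2, device=x.device, dtype=torch.float32)
    freqs = temperature ** (2.0 * freqs / dim)
    ang = x.unsqueeze(1) / freqs.view(1, -1, 1, 1)           # (B, dim/2, H, W)
    return torch.cat([ang.sin(), ang.cos()], dim=1)          # (B, dim,   H, W)

def confidence_conditioned_pe(depth: torch.Tensor, m: torch.Tensor, dim: int = 256):
    """depth, m: (B,1,H,W). Returns PE_key = PE_2D + m * PE_geo, shape (B,dim,H,W)."""
    B, _, H, W = depth.shape
    ys = torch.linspace(0, 2 * math.pi, H, device=depth.device)
    xs = torch.linspace(0, 2 * math.pi, W, device=depth.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pe_2d = torch.cat([sine_embed(gy.expand(B, H, W), dim // 2),   # standard 2D PE
                       sine_embed(gx.expand(B, H, W), dim // 2)], dim=1)
    pe_geo = sine_embed(depth.squeeze(1), dim)               # geometric PE from raw depth
    return pe_2d + m * pe_geo                                # low confidence mutes geometry
```

Where the confidence m approaches zero, the geometric term vanishes and the keys fall back to plain 2D positional encoding, which is how the module suppresses noisy depth in the cross-attention.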
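Finally, a sketch of the composite confidence supervision of claim 8. The construction of the BCE target (valid pixels map to confidence 1) and the default weights alpha and beta are illustrative assumptions; the claim specifies only the three terms and the time-increasing schedule lambda_t.

```python
# Illustrative sketch of the composite confidence loss of claim 8.
import torch
import torch.nn.functional as F

def confidence_loss(m_levels, invalid_mask, lambda_t, alpha=0.01, beta=0.1):
    """m_levels: per-level confidence maps m_l in (0,1), each (B,1,h,w).
    invalid_mask: (B,1,H,W) ground-truth invalid-pixel mask from the depth image.
    lambda_t: BCE weight, scheduled to increase with training step t."""
    total = 0.0
    for m in m_levels:
        inv = F.interpolate(invalid_mask, size=m.shape[-2:], mode="nearest")
        bce = F.binary_cross_entropy(m, 1.0 - inv)           # invalid depth -> low confidence
        ent = -(m * (m + 1e-8).log()
                + (1 - m) * (1 - m + 1e-8).log()).mean()     # binary entropy H(m):
                                                             # drives m toward 0 or 1
        tv = (m[..., :, 1:] - m[..., :, :-1]).abs().mean() + \
             (m[..., 1:, :] - m[..., :-1, :]).abs().mean()   # total variation smoothness
        total = total + lambda_t * bce + alpha * ent + beta * tv
    return total

# usage with two feature levels
ms = [torch.sigmoid(torch.randn(2, 1, 32, 32)), torch.sigmoid(torch.randn(2, 1, 16, 16))]
inv = (torch.rand(2, 1, 128, 128) > 0.9).float()
loss = confidence_loss(ms, inv, lambda_t=1.0)
```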
Description
RGB-D instance segmentation method, equipment and medium for densely stacked industrial scene

Technical Field

The present document relates to the field of instance segmentation technology, and in particular to a method, an apparatus, and a medium for RGB-D instance segmentation in densely stacked industrial scenes.

Background

With the deepening of intelligent manufacturing, industrial automation lines place ever higher demands on the intelligence and flexibility of vision systems. In fields such as electronics manufacturing and automobile assembly, automatic identification and accurate localization of electronic components are key to guaranteeing production efficiency and product quality. Instance segmentation based on RGB-D cameras has become the mainstream approach for robotic random bin picking. These techniques aim to separate each individual electronic component from a complex scene and assign a semantic label to every pixel.

In the prior art, the three-dimensional object reconstruction and singulation method and system disclosed in patent CN120451380A extracts features from RGB-D data for object segmentation and three-dimensional reconstruction. Depth information is generally treated as input that is as reliable as the RGB information, and the data of the two modalities are combined through simple feature concatenation or late fusion strategies to improve segmentation accuracy. Such approaches have significant limitations, particularly in industrial settings where electronic components have weak texture, strong specular reflection, and are often densely stacked. The main drawbacks are as follows:

1. Sensitivity to depth noise; segmentation accuracy drops sharply at occlusion boundaries. Depth sensors produce significant noise and artifacts at object edges, on transparent or highly reflective surfaces (such as the pins and housings of many electronic components), and in mutually occluded regions. During feature fusion, CN120451380A lacks a mechanism for judging the reliability of the depth data, so noise and valid geometric information are fed into the network together. Unreliable depth features at critical occlusion boundaries then contaminate otherwise reliable RGB features, producing segmentation errors such as blurred, merged, or fractured boundaries.

2. Insufficient generalization and robustness in real industrial scenes. Traditional fusion strategies are static, blind fusions that cannot adapt dynamically to the data quality of individual pixels, and the network architecture provides no adaptive arbitration mechanism for depth channel reliability. When facing sensor noise or new occlusion patterns whose distribution differs from the training data, the model struggles to make correct judgments, which limits its practical use in changing production environments.

3. Reliance on large-scale, high-quality real-world annotations for training. Because of the inherent shortcomings above, such models require large amounts of precisely labeled data covering diverse noise and occlusion conditions in order to learn compensations and reach acceptable performance.
However, producing pixel-level labels for densely cluttered electronic component scenes is extremely costly, and it is difficult to exhaust all real-world complications. There is therefore a need for a robust RGB-D instance segmentation method that adaptively evaluates depth information reliability, dynamically fuses multimodal features, and reduces reliance on precisely labeled data, so as to address the high reflectivity, weak texture, and dense stacking of electronic components in industrial scenarios.

Disclosure of Invention

The specification provides an RGB-D instance segmentation method, device, and medium for densely stacked industrial scenes. According to an embodiment of the invention, there is provided an RGB-D instance segmentation method for a densely stacked industrial scene, including: S1, acquiring an RGB image and a depth image of a target scene and preprocessing them to obtain aligned RGB image information and depth image information; S2, extracting multi-level RGB visual features and depth geometric features from the RGB image and the depth image through a backbone convolutional neural network, performing a modal arbitration operation on the RGB visual features and the depth geometric features at each feature scale based on a depth confidence assessment result, and dynamically fusing them to generate a multi-scale arbitration feature map; S3, inputting the multi-scale arbitration feature map into a decoder network and processing it with an attention modulation module based on confidence-conditioned depth position encoding, wherein the module modulates the position encoding according to the depth confidence so as to enhance perception of reliable geometric structure and suppress noise, generating modulated multi-scale fusion features.
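To show how the stages connect, below is a minimal end-to-end skeleton of S1-S4 that reuses ArbitrationFusion, depth_quality_prior, and confidence_conditioned_pe from the sketches following the claims. The tiny two-level backbone, the single decoded level, the learned queries, and the mask head are placeholders standing in for the claimed networks, not the patented architecture.

```python
# Hypothetical end-to-end skeleton of S1-S4; reuses ArbitrationFusion,
# depth_quality_prior, and confidence_conditioned_pe from the sketches above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Stand-in two-level CNN for the backbone of S2."""
    def __init__(self, in_ch: int, ch: int = 64):
        super().__init__()
        self.s1 = nn.Sequential(nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.s2 = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.s1(x)
        return [f1, self.s2(f1)]                             # two feature levels

class RGBDInstanceSeg(nn.Module):
    def __init__(self, ch: int = 64, num_queries: int = 20):
        super().__init__()
        self.rgb_net = TinyBackbone(3, ch)                   # RGB visual features
        self.depth_net = TinyBackbone(1, ch)                 # depth geometric features
        self.fuse = nn.ModuleList(ArbitrationFusion(ch) for _ in range(2))
        self.queries = nn.Parameter(torch.randn(num_queries, ch))
        self.attn = nn.MultiheadAttention(ch, 4, batch_first=True)
        self.mask_proj = nn.Conv2d(ch, ch, 1)

    def forward(self, rgb, depth):
        prior = depth_quality_prior(depth)                   # S2: prior indicators P
        levels = [fuse(fr, fd, prior) for fr, fd, fuse in
                  zip(self.rgb_net(rgb), self.depth_net(depth), self.fuse)]
        f, m = levels[-1]                                    # one level, for brevity
        d = F.interpolate(depth, size=f.shape[-2:], mode="bilinear", align_corners=False)
        pe = confidence_conditioned_pe(d, m, dim=f.shape[1]) # S3: modulated PE on keys
        B, C, H, W = f.shape
        keys = (f + pe).flatten(2).transpose(1, 2)           # (B, HW, C)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)      # (B, N, C) instance queries
        emb, _ = self.attn(q, keys, keys)                    # cross-attention decoding
        pix = self.mask_proj(f).flatten(2)                   # (B, C, HW)
        masks = torch.bmm(emb, pix).view(B, -1, H, W)        # S4: embeddings x pixels
        return F.interpolate(masks, scale_factor=4, mode="bilinear", align_corners=False)

model = RGBDInstanceSeg()
out = model(torch.randn(2, 3, 128, 128), torch.rand(2, 1, 128, 128))
print(out.shape)                                             # torch.Size([2, 20, 128, 128])
```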