US-12620122-B2 - Target detection method and apparatus


Abstract

Embodiments of this application provide a target detection method and apparatus, an electronic device, and a computer storage medium. The method includes: obtaining a target image including a target object; performing instance segmentation on the target image, to obtain a segmentation mask corresponding to the target object; obtaining, based on the segmentation mask, position relationship features between target pixels in a target region in which the target object is located in the target image; obtaining position relationship features between standard pixels in a preset region of interest in a standard image, where the standard image includes a standard object corresponding to the target object; and matching the position relationship features between the target pixels and the position relationship features between the standard pixels, to obtain a correspondence between the target pixels and the standard pixels, and obtaining pose information of the target object based on the correspondence.
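
Read as a pipeline, the abstract describes four steps: segment the target object, compute pairwise features in both images, match the features, and estimate the pose from the resulting correspondence. The following Python sketch mirrors that flow with toy stand-ins; the thresholding, the coordinate pairing, and the distance-only feature are illustrative assumptions for orientation, not the patented implementation.

    import numpy as np
    from itertools import combinations

    def segment_instance(image):
        # Toy stand-in for the pre-trained instance segmentation model:
        # a simple threshold produces a boolean mask of the object region.
        return image > image.mean()

    def pair_features(coords):
        # Toy stand-in for position relationship features: the Euclidean
        # distance of every pixel pair (the claims also use normal-vector
        # angles; see claim 3 below).
        return {(i, j): float(np.linalg.norm(coords[i] - coords[j]))
                for i, j in combinations(range(len(coords)), 2)}

    target_image = np.random.rand(8, 8)    # image containing the target object
    standard_image = np.random.rand(8, 8)  # standard image of the reference object

    target_coords = np.argwhere(segment_instance(target_image))
    standard_coords = np.argwhere(standard_image > 0.5)  # preset region of interest

    target_features = pair_features(target_coords)
    standard_features = pair_features(standard_coords)
    # Matching target_features against standard_features yields pixel
    # correspondences, from which the pose of the target object is estimated.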

Inventors

  • Wei Yao
  • Dong Li
  • Chuan Yu ZHANG

Assignees

  • SIEMENS AKTIENGESELLSCHAFT

Dates

Publication Date
2026-05-05
Application Date
2022-11-17
Priority Date
2021-11-30

Claims (5)

  1. A target detection method comprising: obtaining a target image representing a target object; performing instance segmentation on the target image to obtain a segmentation mask corresponding to the target object; obtaining first position relationship features based on characteristics of target pixels in a target region in which the target object is located in the target image using the segmentation mask; obtaining second position relationship features based on characteristics of standard pixels in a preset region of interest of a standard object in a standard image; matching the first position relationship features of the target pixels to the second position relationship features of the standard pixels to obtain a correspondence between the target pixels and the standard pixels; obtaining pose information of the target object based on the correspondence; inputting the target image into a pre-trained instance segmentation model; and performing instance segmentation on the target image by using the pre-trained instance segmentation model to obtain the segmentation mask corresponding to the target object; wherein the pre-trained instance segmentation model comprises: a feature extraction network, a feature fusion network, a region generation network, a feature alignment layer, a classification and regression network, and a segmentation mask network; wherein performing instance segmentation on the target image by using the pre-trained instance segmentation model comprises: putting the target image into the feature extraction network in the pre-trained instance segmentation model, and performing multi-scale feature extraction on the target image by using the feature extraction network, to obtain a plurality of levels of initial feature maps corresponding to the target image; performing feature fusion on the plurality of levels of initial feature maps using the feature fusion network to obtain fused feature maps; obtaining information about an initial region of the target object based on a resulting fused feature map and using the region generation network; performing feature extraction on the plurality of levels of initial feature maps based on the information about the initial region and using the feature alignment layer, to obtain a region feature map corresponding to the initial region in the plurality of levels of initial feature maps; obtaining category information and position information of the target object based on the region feature map and using the classification and regression network; and obtaining the segmentation mask corresponding to the target object based on the region feature map and using the segmentation mask network; and wherein performing feature fusion on the plurality of levels of initial feature maps using the feature fusion network to obtain the fused feature maps comprises: performing a convolution operation on each of the plurality of levels of initial feature maps by using the feature fusion network to obtain a plurality of levels of initial dimension-reduced feature maps; sequentially performing fusion processing on every two adjacent levels of the plurality of levels of initial dimension-reduced feature maps according to a descending order of levels, to obtain a plurality of initially fused feature maps, and updating an initial dimension-reduced feature map at a lower level in the adjacent levels by using a corresponding initially fused feature map of the plurality of initially fused feature maps, wherein a size of an initial dimension-reduced feature map at an upper level is less than a size of the initial dimension-reduced feature map at the lower level; performing the convolution operation on each of the plurality of initially fused feature maps, to obtain a plurality of levels of dimension-reduced feature maps; and sequentially performing fusion processing on every two adjacent levels of the plurality of levels of dimension-reduced feature maps according to an ascending order of levels, to obtain a transition feature map, performing fusion processing on the transition feature map and a corresponding initial feature map from the plurality of levels of initial feature maps, to obtain a fused feature map, and updating a dimension-reduced feature map at an upper level in the adjacent levels using the fused feature map, wherein a size of the dimension-reduced feature map at the upper level is less than a size of a dimension-reduced feature map at a lower level.
  2. The method according to claim 1, wherein: obtaining the first position relationship features of the target pixels comprises combining, based on the segmentation mask, the target pixels in the target region in which the target object is located in the target image in pairs, to obtain a plurality of target pixel pairs, and obtaining, for each target pixel pair of the plurality of target pixel pairs, a position relationship feature between two target pixels in the target pixel pair; and obtaining the second position relationship features of the standard pixels in the preset region of interest of the standard object in the standard image comprises obtaining the standard image and the preset region of interest of the standard object in the standard image; and the method further comprises combining the standard pixels in the preset region of interest in pairs, to obtain a plurality of standard pixel pairs, and obtaining, for each standard pixel pair of the plurality of standard pixel pairs, a position relationship feature between two standard pixels in the standard pixel pair.
  3. The method according to claim 2, wherein: for each target pixel pair of the plurality of target pixel pairs, the position relationship feature between the two target pixels in the target pixel pair is obtained based on a distance between the two target pixels, an angle between normal vectors corresponding to the two target pixels respectively, and angles between the normal vectors corresponding to the two target pixels and a connection line between the two target pixels; and for each standard pixel pair of the plurality of standard pixel pairs, the position relationship feature between the two standard pixels in the standard pixel pair is obtained based on a distance between the two standard pixels, an angle between normal vectors corresponding to the two standard pixels respectively, and angles between the normal vectors corresponding to the two standard pixels and a connection line between the two standard pixels.
  4. The method according to claim 1, wherein: the feature extraction network comprises two concatenated convolution layers; a size of a convolution kernel of a first of the two concatenated convolution layers is 1*1; a convolution stride of the first convolution layer is 1; and a convolution stride of a second of the two concatenated convolution layers is less than or equal to a size of a convolution kernel of the second convolution layer.
  5. An electronic device comprising: a processor; a memory; a communication interface; and a communication bus providing mutual communication between the processor, the memory, and the communication interface; wherein the memory is configured to store at least one executable instruction, and the at least one executable instruction causes the processor to: obtain a target image comprising a target object; perform instance segmentation on the target image to obtain a segmentation mask corresponding to the target object; obtain position relationship features based on characteristics of target pixels in a target region in which the target object is located in the target image using the segmentation mask; obtain position relationship features based on characteristics of standard pixels in a preset region of interest of a standard object in a standard image; match the position relationship features of the target pixels to the position relationship features of the standard pixels to obtain a correspondence between the target pixels and the standard pixels; obtain pose information of the target object based on the correspondence; put the target image into a pre-trained instance segmentation model; and perform instance segmentation on the target image by using the pre-trained instance segmentation model to obtain the segmentation mask corresponding to the target object; wherein the pre-trained instance segmentation model comprises: a feature extraction network, a feature fusion network, a region generation network, a feature alignment layer, a classification and regression network, and a segmentation mask network; wherein performing instance segmentation on the target image by using the pre-trained instance segmentation model comprises: putting the target image into the feature extraction network in the pre-trained instance segmentation model, and performing multi-scale feature extraction on the target image by using the feature extraction network, to obtain a plurality of levels of initial feature maps corresponding to the target image; performing feature fusion on the plurality of levels of initial feature maps using the feature fusion network to obtain fused feature maps; obtaining information about an initial region of the target object based on a resulting fused feature map and using the region generation network; performing feature extraction on the plurality of levels of initial feature maps based on the information about the initial region and using the feature alignment layer, to obtain a region feature map corresponding to the initial region in the plurality of levels of initial feature maps; obtaining category information and position information of the target object based on the region feature map and using the classification and regression network; and obtaining the segmentation mask corresponding to the target object based on the region feature map and using the segmentation mask network; and wherein performing feature fusion on the plurality of levels of initial feature maps using the feature fusion network to obtain the fused feature maps comprises: performing a convolution operation on each of the plurality of levels of initial feature maps by using the feature fusion network to obtain a plurality of levels of initial dimension-reduced feature maps; sequentially performing fusion processing on every two adjacent levels of the plurality of levels of initial dimension-reduced feature maps according to a descending order of levels, to obtain a plurality of initially fused feature maps, and updating an initial dimension-reduced feature map at a lower level in the adjacent levels by using a corresponding initially fused feature map of the plurality of initially fused feature maps, wherein a size of an initial dimension-reduced feature map at an upper level is less than a size of the initial dimension-reduced feature map at the lower level; performing the convolution operation on each of the plurality of initially fused feature maps, to obtain a plurality of levels of dimension-reduced feature maps; and sequentially performing fusion processing on every two adjacent levels of the plurality of levels of dimension-reduced feature maps according to an ascending order of levels, to obtain a transition feature map, performing fusion processing on the transition feature map and a corresponding initial feature map from the plurality of levels of initial feature maps, to obtain a fused feature map, and updating a dimension-reduced feature map at an upper level in the adjacent levels using the fused feature map, wherein a size of the dimension-reduced feature map at the upper level is less than a size of a dimension-reduced feature map at a lower level.
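
The position relationship feature recited in claim 3 resembles the classic point pair feature used in pose estimation: the distance between two points, the angle between their normal vectors, and the angles each normal makes with the connecting line. Here is a minimal Python sketch under that reading; the patent's exact feature definition may differ.

    import numpy as np

    def angle_between(u, v):
        # Angle between two vectors, clipped for numerical safety.
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return float(np.arccos(np.clip(cos, -1.0, 1.0)))

    def pair_feature(p1, n1, p2, n2):
        # Feature for one pixel pair: distance between the two points,
        # angle between their normals, and the angles between each normal
        # and the connection line, as recited in claim 3.
        d = p2 - p1
        return (float(np.linalg.norm(d)),
                angle_between(n1, n2),
                angle_between(n1, d),
                angle_between(n2, d))

    # Toy usage with two surface points and unit normal vectors.
    f = pair_feature(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
                     np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))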

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a U.S. National Stage Application of International Application No. PCT/CN2022/132660 filed Nov. 17, 2022, which designates the United States of America, and claims priority to EP Application No. 21211271.8 filed Nov. 30, 2021, the contents of which are hereby incorporated by reference in their entirety.

TECHNICAL FIELD

This disclosure generally relates to image processing technologies. Various embodiments of the teachings herein include target detection methods and/or apparatus.

BACKGROUND

Target detection technologies can be applied to various scenarios as the technologies mature. For example, in fields such as industrial production, target detection enables workpieces to be automatically picked and assembled by intelligent robots. Specifically, an image including a workpiece may first be obtained, and target detection is then performed on the image to obtain pose information (position information and posture information) of the target workpiece, so that the intelligent robot can pick up the target workpiece according to the pose information and assemble it. Existing target detection methods have relatively low detection efficiency, so improving target detection efficiency is an urgent problem to be resolved.

SUMMARY

The teachings of the present disclosure include target detection methods and apparatus that address the relatively low detection efficiency of the related art. For example, some embodiments include a target detection method including: obtaining a target image including a target object; performing instance segmentation on the target image to obtain a segmentation mask corresponding to the target object; obtaining, based on the segmentation mask, position relationship features between target pixels in a target region in which the target object is located in the target image; obtaining position relationship features between standard pixels in a preset region of interest of a standard object in a standard image; and matching the position relationship features between the target pixels and the position relationship features between the standard pixels to obtain a correspondence between the target pixels and the standard pixels, and obtaining pose information of the target object based on the correspondence.

In some embodiments, obtaining, based on the segmentation mask, the position relationship features between the target pixels includes: combining, based on the segmentation mask, the target pixels in the target region in pairs to obtain a plurality of target pixel pairs, and obtaining, for each target pixel pair, a position relationship feature between the two target pixels in the pair. Obtaining the position relationship features between the standard pixels includes: obtaining the standard image and the preset region of interest of the standard object in the standard image; combining the standard pixels in the preset region of interest in pairs to obtain a plurality of standard pixel pairs; and obtaining, for each standard pixel pair, a position relationship feature between the two standard pixels in the pair. A minimal illustration of this pairing step follows.
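
A short Python illustration of the pairing step, with arbitrary toy pixel coordinates:

    from itertools import combinations

    # Each pixel in the masked region is combined with every other pixel
    # exactly once; a position relationship feature is then computed per pair.
    pixels = [(2, 3), (2, 4), (5, 1)]            # toy masked pixel coordinates
    pixel_pairs = list(combinations(pixels, 2))
    # -> [((2, 3), (2, 4)), ((2, 3), (5, 1)), ((2, 4), (5, 1))]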
In some embodiments, for each target pixel pair, the position relationship feature between the two target pixels in the pair is obtained based on a distance between the two target pixels, an angle between the normal vectors corresponding to the two target pixels, and angles between those normal vectors and the connection line between the two target pixels; and for each standard pixel pair, the position relationship feature between the two standard pixels in the pair is obtained in the same manner, from the distance between the two standard pixels, the angle between their normal vectors, and the angles between those normal vectors and the connection line between the two standard pixels.

In some embodiments, performing instance segmentation on the target image to obtain the segmentation mask corresponding to the target object includes: inputting the target image into a pre-trained instance segmentation model, and performing instance segmentation on the target image by using the instance segmentation model to obtain the segmentation mask corresponding to the target object.

In some embodiments, the instance segmentation model includes: a feature extraction network, a feature fusion network, a region generation network, a feature alignment layer, a classification and regression network, and a segmentation mask network; and inputting the target image into a pre-trained instance segmentation model
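
Claims 1 and 5 describe the feature fusion network as a top-down pass over dimension-reduced maps followed by a bottom-up pass that folds the original maps back in, broadly in the spirit of FPN/PANet. The sketch below is one plausible PyTorch reading of that description; the channel width, the use of addition as the fusion operation, and nearest-neighbor resampling are all assumptions.

    import torch
    import torch.nn.functional as F

    def fuse_levels(initial_maps, channels=64):
        # initial_maps: feature maps ordered from lower to upper level,
        # where upper levels have smaller spatial sizes.
        # 1) 1x1 convolutions reduce every level to a common channel width.
        reduced = [torch.nn.Conv2d(m.shape[1], channels, 1)(m)
                   for m in initial_maps]

        # 2) Top-down pass (descending order of levels): resample the upper
        #    level to the lower level's size and fuse, updating the lower map.
        for i in range(len(reduced) - 1, 0, -1):
            up = F.interpolate(reduced[i], size=reduced[i - 1].shape[-2:])
            reduced[i - 1] = reduced[i - 1] + up

        # 3) A further convolution over each initially fused map.
        reduced = [torch.nn.Conv2d(channels, channels, 1)(m) for m in reduced]

        # 4) Bottom-up pass (ascending order): fuse each transition map with
        #    a 1x1-projected copy of the corresponding original initial map.
        fused = [reduced[0]]
        for i in range(1, len(reduced)):
            down = F.interpolate(fused[-1], size=reduced[i].shape[-2:])
            transition = reduced[i] + down
            skip = torch.nn.Conv2d(initial_maps[i].shape[1], channels, 1)(
                initial_maps[i])
            fused.append(transition + skip)
        return fused

    # Toy usage: three pyramid levels with shrinking spatial size.
    maps = [torch.randn(1, c, s, s) for c, s in [(16, 32), (32, 16), (64, 8)]]
    out = fuse_levels(maps)

In a real model the convolutions would be trained modules held by the network rather than constructed on each call, and the fusion details would follow the claim language exactly.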