US-12626492-B2 - Perception network and data processing method
Abstract
This disclosure describes methods, apparatuses, and systems related to perception networks. In an implementation, a method comprises: performing convolution processing on input data to obtain M target feature maps, performing convolution processing on M1 target feature maps in the M target feature maps to obtain M1 first feature maps, wherein M1 is less than M, processing M2 target feature maps in the M target feature maps to obtain M2 second feature maps, wherein M2 is less than M, and concatenating the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map.
Inventors
- Jianyuan GUO
- Kai HAN
- Yunhe Wang
- Chunjing Xu
Assignees
- HUAWEI TECHNOLOGIES CO., LTD.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-08-25
- Priority Date: 2021-02-27
Claims (20)
- 1 . A data processing method implemented by a feature extraction network, wherein the method comprises: performing convolution processing on input data to obtain M target feature maps; performing convolution processing on M1 target feature maps in the M target feature maps to obtain M1 first feature maps, wherein M1 is less than M; processing M2 target feature maps in the M target feature maps to obtain M2 second feature maps, wherein M2 is less than M; and concatenating the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map; fusing the feature map output by each second block by using a fusion operation, to obtain a fused feature map, wherein a size of the fused feature map is the same as a size of the M2 second feature maps; and performing an addition operation on the fused feature map and the M2 second feature maps, to obtain processed M2 second feature maps, wherein the concatenating the M1 first feature maps and the M2 second feature maps comprises: concatenating the M1 first feature maps and the processed M2 second feature maps, to obtain the concatenated feature map; wherein the feature extraction network is configured to: obtain an input image; perform feature extraction on the input image; and output a feature map of the input image; and wherein the method further comprises: processing a corresponding task based on the feature map of the input image by using a task network, to obtain a processing result.
- 2 . The method according to claim 1 , wherein the feature extraction network comprises a first block, at least one second block connected in series, a target operation, and a concatenation operation, wherein the input data is an input of the first block, the M target feature maps are outputs of the first block and an input of the at least one second block, and the M1 first feature maps are an output of the at least one second block and an input of the target operation.
- 3 . The method according to claim 2 , wherein the first block and M second blocks are blocks in a same stage in the feature extraction network.
- 4 . The method according to claim 2 , wherein a quantity of parameters of the target operation is less than a quantity of parameters of the M second blocks.
- 5 . The method according to claim 2 , wherein the target operation is a convolution operation whose quantity of parameters is less than a quantity of parameters of the at least one second block; or the target operation is a residual connection operation between an output of the first block and an output of the concatenation operation.
- 6 . The method according to claim 2 , wherein an output of a second block that is farthest from the first block in the at least one second block is the M1 first feature maps.
- 7 . The method according to claim 1 , wherein the fusing the feature map output by each second block by using a fusion operation comprises: performing concatenation and dimension reduction operations on an output of each second block by using the fusion operation, to obtain the fused feature map whose size is the same as the size of the M2 second feature maps.
- 8 . The method according to claim 2 , wherein the first block and the M second blocks are blocks in a target stage in the feature extraction network, and the concatenated feature map is used as an output feature map of the target stage in the feature extraction network; and wherein the method further comprises: performing a convolution operation on the concatenated feature map to obtain an output feature map of the target stage.
- 9 . The method according to claim 1 , wherein the task comprises target detection, image segmentation, or image classification.
- 10 . A data processing apparatus applied in a feature extraction network, wherein the apparatus comprises a memory and at least one processor, the memory stores instructions for execution by the at least one processor to perform operations comprising: performing convolution processing on input data to obtain M target feature maps; performing convolution processing on M1 target feature maps in the M target feature maps to obtain M1 first feature maps, wherein M1 is less than M; processing M2 target feature maps in the M target feature maps to obtain M2 second feature maps, wherein M2 is less than M; and concatenating the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map; fusing the feature map output by each second block by using a fusion operation, to obtain a fused feature map, wherein a size of the fused feature map is the same as a size of the M2 second feature maps; and performing an addition operation on the fused feature map and the M2 second feature maps, to obtain processed M2 second feature maps, wherein the concatenating the M1 first feature maps and the M2 second feature maps comprises: concatenating the M1 first feature maps and the processed M2 second feature maps, to obtain the concatenated feature map; wherein the feature extraction network is configured to: obtain an input image; perform feature extraction on the input image; and output a feature map of the input image; and wherein the operations further comprise: processing a corresponding task based on the feature map of the input image by using a task network, to obtain a processing result.
- 11 . The data processing apparatus according to claim 10 , wherein the feature extraction network comprises a first block, at least one second block connected in series, a target operation, and a concatenation operation, wherein the input data is an input of the first block, the M target feature maps are outputs of the first block and an input of the at least one second block, and the M1 first feature maps are an output of the at least one second block and an input of the target operation.
- 12 . The data processing apparatus according to claim 11 , wherein the first block and M second blocks are blocks in a same stage in the feature extraction network.
- 13 . The method according to claim 1 , wherein an intersection set of the M1 target feature maps and the M2 target feature maps is empty, a sum of M1 and M2 is M, and a quantity of channels of the concatenated feature map is M.
- 14 . The data processing apparatus according to claim 11 , wherein a quantity of parameters of the target operation is less than a quantity of parameters of the M second blocks.
- 15 . The data processing apparatus according to claim 11 , wherein the target operation is a convolution operation whose quantity of parameters is less than a quantity of parameters of the at least one second block; or the target operation is a residual connection operation between an output of the first block and an output of the concatenation operation.
- 16 . The data processing apparatus according to claim 11 , wherein an output of a second block that is farthest from the first block in the at least one second block is the M1 first feature maps.
- 17 . The data processing apparatus according to claim 10 , wherein the fusing the feature map output by each second block by using a fusion operation comprises: performing concatenation and dimension reduction operations on an output of each second block by using the fusion operation, to obtain the fused feature map whose size is the same as the size of the M2 second feature maps.
- 18 . The data processing apparatus according to claim 11 , wherein the first block and the M second blocks are blocks in a target stage in the feature extraction network, and the concatenated feature map is used as an output feature map of the target stage in the feature extraction network; and wherein the operations further comprise: performing a convolution operation on the concatenated feature map, to obtain an output feature map of the target stage.
- 19 . The data processing apparatus according to claim 10 , wherein an intersection set of the M1 target feature maps and the M2 target feature maps is empty, a sum of M1 and M2 is M, and a quantity of channels of the concatenated feature map is M.
- 20 . A non-transitory computer storage medium applied in a feature extraction network, wherein the computer storage medium stores one or more instructions, and when the instructions are executed by one or more computers, the one or more computers perform operations comprising: performing convolution processing on input data to obtain M target feature maps; performing convolution processing on M1 target feature maps in the M target feature maps to obtain M1 first feature maps, wherein M1 is less than M; processing M2 target feature maps in the M target feature maps to obtain M2 second feature maps, wherein M2 is less than M; and concatenating the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map; fusing the feature map output by each second block by using a fusion operation, to obtain a fused feature map, wherein a size of the fused feature map is the same as a size of the M2 second feature maps; and performing an addition operation on the fused feature map and the M2 second feature maps, to obtain processed M2 second feature maps, wherein the concatenating the M1 first feature maps and the M2 second feature maps comprises: concatenating the M1 first feature maps and the processed M2 second feature maps, to obtain the concatenated feature map; wherein the feature extraction network is configured to: obtain an input image; perform feature extraction on the input image; and output a feature map of the input image; and wherein the operations further comprise: processing a corresponding task based on the feature map of the input image by using a task network, to obtain a processing result.
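As an illustration only (not part of the claims), the data flow recited in claim 1 can be sketched in NumPy. All shapes, the random 1x1-convolution weights, the identity "cheap" processing of the M2 maps, and the fusion implemented as a channel-reducing projection are assumptions chosen for the sketch, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_ch):
    """A 1x1 convolution over channel-first data (C, H, W); random weights for illustration."""
    w = rng.standard_normal((out_ch, x.shape[0]))
    return np.einsum('oc,chw->ohw', w, x)

M, M1, M2 = 8, 5, 3                     # M1 + M2 == M, per claims 13 and 19
x = rng.standard_normal((4, 16, 16))    # input data (channels, height, width)

target = conv1x1(x, M)                  # first block: M target feature maps
t1, t2 = target[:M1], target[M1:]       # disjoint split (empty intersection)

first = conv1x1(t1, M1)                 # second block(s): M1 first feature maps
second = t2                             # cheap processing of M2 maps (identity here)

fused = conv1x1(first, M2)              # fusion: reduce to the size of the M2 maps
second = second + fused                 # addition operation on fused and M2 maps
out = np.concatenate([first, second])   # concatenated feature map: M channels

assert out.shape == (M, 16, 16)
```

The concatenated output keeps M1 + M2 = M channels, matching the channel count stated in claims 13 and 19.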
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Application No. PCT/CN2022/077881, filed on Feb. 25, 2022, which claims priority to Chinese Patent Application No. 202110221934.8, filed on Feb. 27, 2021. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

This disclosure relates to the artificial intelligence field, and in particular, to a perception network and a data processing method.

BACKGROUND

Artificial intelligence (AI) is a theory, method, technology, or application system that simulates, extends, and expands human intelligence by using a digital computer or a machine controlled by a digital computer, to perceive an environment, obtain knowledge, and achieve an optimal result based on the knowledge. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence research covers the design principles and implementation methods of various intelligent machines, so that the machines have perception, inference, and decision-making functions.

Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and military affairs. Computer vision is the study of how to use a camera/video camera and a computer to obtain required data and information about a photographed subject. Figuratively, an eye (the camera/video camera) and a brain (an algorithm) are installed on the computer to replace the human eye in recognizing, tracking, and measuring a target, so that the computer can perceive an environment. Perceiving may be considered as extracting information from a perceptual signal.
Therefore, computer vision may also be considered the science of studying how to make an artificial system perceive an image or multi-dimensional data. Generally, computer vision replaces the visual organ with various imaging systems to obtain input information, and then replaces the brain with a computer to process and interpret that information. A final objective of computer vision research is to enable a computer to observe and understand the world through vision as a human being does, and to adapt automatically to an environment.

Inference models based on convolutional neural networks are widely applied to various terminal tasks based on computer vision, for example, scenarios such as image recognition, target detection, and instance segmentation. With a conventional basic neural network, various terminal tasks usually cannot be carried out in real time due to a large quantity of parameters and a large amount of computation. An existing lightweight inference network (for example, MobileNet, EfficientNet, or ShuffleNet) is designed for a mobile device such as a central processing unit (CPU) or an ARM (advanced RISC machine) device, but has unsatisfactory performance on a processing unit designed for large throughput, for example, a graphics processing unit (GPU), tensor processing unit (TPU), or neural network processing unit (NPU) device; its inference speed can even be slower than that of a conventional convolutional neural network.

SUMMARY

According to a first aspect, this disclosure provides a perception network.
The perception network includes a feature extraction network. The feature extraction network includes a first block, at least one second block connected in series, a target operation, and a concatenation operation; the first block and M second blocks are blocks in a same stage of the feature extraction network, and a quantity of parameters of the target operation is less than a quantity of parameters of the M second blocks. The target operation may also be referred to as a cheap operation; it is a general term for a series of operations with a small quantity of parameters, and the term distinguishes it from a conventional convolution operation. The quantity of parameters describes how many parameters a neural network includes, and is used to evaluate the size of a model. The concatenation operation (concat) concatenates feature maps without changing the data of the feature maps. For example, a result of performing a concatenation operation on a feature map 1 and a feature map 2 is (feature map 1, feature map 2); the sequence of feature map 1 and feature map 2 is not limited. More specifically, the result of concatenating a feature map having three semantic channels and a feature map having five semantic channels is a feature map having eight semantic channels. The first block is configured to perform convolution processing on input data, to obtain M target feature maps. Each target feature map corresponds to one channel.
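The channel-count behavior of concatenation described above (three semantic channels plus five yielding eight) can be checked with a minimal NumPy sketch; the spatial size of 32x32 is an arbitrary choice for illustration:

```python
import numpy as np

fm1 = np.zeros((3, 32, 32))  # feature map 1: 3 semantic channels (C, H, W)
fm2 = np.zeros((5, 32, 32))  # feature map 2: 5 semantic channels

# Concatenation joins the maps along the channel axis without altering their data.
result = np.concatenate([fm1, fm2], axis=0)
print(result.shape)  # (8, 32, 32): a feature map with eight semantic channels
```

Because the operation only stacks the inputs, the values of `fm1` and `fm2` appear unchanged in the result, consistent with "concatenate feature maps without changing data of the feature maps."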