JP-7855089-B2 - Information processing device, information processing method, learning model, program, and storage medium
Inventors
- 大熊 顕至
- Nagpure, Vikrant
Assignees
- Honda Motor Co., Ltd. (本田技研工業株式会社)
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2023-12-20
- Priority Date
- 2022-12-22
Claims (16)
- An information processing device that recognizes a target, or a state of a target, present in a captured image, comprising: acquisition means for acquiring feature quantities at multiple resolutions of the image; feature extraction means for extracting a notable feature from the feature quantities at the multiple resolutions using multiple transformer encoders; and output means for outputting the target or the state of the target as a recognition result based on output results of the multiple transformer encoders, the information processing device being characterized in that the feature extraction means extracts the notable feature from the feature quantities at the multiple resolutions by inputting, into one of the multiple transformer encoders that is associated with one of the multiple resolutions, a first feature quantity at that resolution extracted from the image and a second feature quantity at another resolution among the multiple resolutions.
- The information processing device according to claim 1, characterized in that the feature extraction means inputs the first feature quantity as the key and value of the transformer encoder and the second feature quantity as the query of the transformer encoder, thereby extracting components of the first feature quantity that have a high correlation with the second feature quantity.
- The information processing apparatus according to claim 1, characterized in that the feature extraction means inputs, as the second feature quantity, a feature quantity obtained by concatenating the feature quantities at the other resolutions among the multiple resolutions into the transformer encoder associated with the one resolution.
- The information processing apparatus according to claim 1, characterized in that each of the multiple transformer encoders is associated with a different one of the multiple resolutions.
- The information processing apparatus according to claim 1, characterized in that the number of transformer encoders equals the number of distinct resolutions among the multiple resolutions.
- The information processing apparatus according to claim 1, characterized in that the number of transformer encoders is four or less.
- The information processing apparatus according to claim 1, characterized in that the plurality of transformer encoders are not connected in series with one another.
- The information processing apparatus according to claim 1, characterized in that the output means includes a network layer trained to output the target or the state of the target as a recognition result based on the output results of the multiple transformer encoders.
- The information processing apparatus according to claim 8, characterized in that the output means applies average pooling to the output results of each of the multiple transformer encoders and inputs the pooled results to the network layer.
- The information processing apparatus according to claim 1, characterized in that the target includes a person's face, and the state of the target includes the gaze direction of the person's face.
- The information processing apparatus according to claim 1, characterized in that the acquisition means includes a second feature extraction means for extracting feature quantities at multiple resolutions of the image using a neural network.
- The information processing apparatus according to claim 11, wherein the second feature extraction means uses a high-resolution network that repeatedly extracts features at the highest resolution among the multiple resolutions while simultaneously extracting features at the lower resolutions, exchanging features between the resolutions.
- An information processing method performed in an information processing device that recognizes a target, or a state of a target, present in a captured image, comprising: an acquisition step of acquiring feature quantities at multiple resolutions of the image; a feature extraction step of extracting a notable feature from the feature quantities at the multiple resolutions using multiple transformer encoders; and an output step of outputting the target or the state of the target as a recognition result based on output results of the multiple transformer encoders, the information processing method being characterized in that the notable feature is extracted from the feature quantities at the multiple resolutions by inputting, into one of the multiple transformer encoders that is associated with one of the multiple resolutions, a first feature quantity at that resolution extracted from the image and a second feature quantity at another resolution.
- A learning model for recognizing a target or a state of a target present in an image, comprising: a first neural network including multiple transformer encoders that take feature quantities at multiple resolutions of the image as input and extract notable feature quantities from the feature quantities at the multiple resolutions; and a second neural network trained to output the target or the state of the target as a recognition result based on output results of the multiple transformer encoders, the learning model being characterized in that it causes a computer to function such that the multiple transformer encoders extract a notable feature from the feature quantities at the multiple resolutions by inputting, into one of the multiple transformer encoders, a first feature quantity at one of the multiple resolutions extracted from the image and a second feature quantity obtained by concatenating feature quantities at the other resolutions among the multiple resolutions.
- A program for causing a computer to function as each means of the information processing apparatus according to any one of claims 1 to 12.
- A storage medium storing a program for causing a computer to function as each means of the information processing apparatus according to any one of claims 1 to 12.
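The fusion described in claims 1 to 3 and 9 can be sketched in code: for each resolution, the transformer encoder associated with that resolution receives the feature at its own resolution as key and value and the concatenation of the features at the other resolutions as query, and the encoder outputs are average-pooled before the trained output layer. The following is a minimal single-head NumPy sketch under those assumptions; it omits the learned projections, multi-head structure, and feed-forward sublayers of a full transformer encoder, and all function names are illustrative rather than taken from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Scaled dot-product attention: each query token attends over the
    # key/value tokens; the output has one row per query token.
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))    # (n_q, n_kv)
    return weights @ v                         # (n_q, d)

def fuse_multi_resolution(features):
    # features: one (num_tokens_i, d) array per resolution.
    # For resolution i, key/value come from the feature at i (the "first
    # feature quantity") and the query is the concatenation of the
    # features at the other resolutions (the "second feature quantity").
    pooled = []
    for i, kv in enumerate(features):
        q = np.concatenate(
            [f for j, f in enumerate(features) if j != i], axis=0)
        out = cross_attention(q, kv, kv)       # encoder i's output
        pooled.append(out.mean(axis=0))        # average pooling (claim 9)
    # The concatenated pooled outputs would feed the trained output layer.
    return np.concatenate(pooled)
```

With three resolutions of, say, 16, 8, and 4 tokens and feature dimension 8, the fused vector has length 3 × 8 = 24: one pooled d-dimensional output per encoder, matching one encoder per resolution (claims 4 and 5).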
Description
This invention relates to an information processing device, an information processing method, a learning model, a program, and a storage medium.

In recent years, techniques have been proposed that use deep neural networks to recognize the state of objects and people (collectively, targets) within an image, for example the pose of a target or the direction of a person's gaze. Non-Patent Document 1 proposes a technique for recognizing human posture with high accuracy using a high-resolution network (High-Resolution Net). The high-resolution network exchanges feature information obtained through convolutional processing between parallel high-resolution and low-resolution subnetworks; the technique disclosed in Non-Patent Document 1 achieves high-accuracy posture recognition by using such a network. Furthermore, a model called the Vision Transformer (ViT) is known that applies the Transformer, which performs strongly as a building block of deep neural networks for sequential data such as natural language, to image processing (Non-Patent Document 2). In Non-Patent Document 2, the Transformer is applied to image processing by treating the image as sequence data consisting of a series of image patches.

- Non-Patent Document 1: Ke Sun, et al., "Deep High-Resolution Representation Learning for Human Pose Estimation", arXiv:1902.09212v1 [cs.CV], February 25, 2019.
- Non-Patent Document 2: "A confident model for image recognition! A thorough explanation of Vision Transformer (ViT), which has broken away from CNN," [online], [accessed October 19, 2022], <URL: https://deepsquare.jp/2020/10/vision-transformer/#outline__1>

The attached drawings are included in, and constitute part of, the specification; they illustrate embodiments of the present invention and, together with the description, serve to explain its principles.

- Block diagram showing an example of the functional configuration of the vehicle according to this embodiment.
- Diagram illustrating the main configuration for the driver assistance function in the vehicle according to this embodiment.
- Diagram schematically illustrating an example configuration of the deep neural network (DNN) model in the model processing unit according to this embodiment.
- Diagram schematically illustrating an example configuration of the multi-resolution fusion transformer in the DNN model according to this embodiment.
- Diagram schematically illustrating the neural architecture search (NAS) used to train the DNN model in the model processing unit according to this embodiment.
- Flowchart illustrating a series of recognition processing operations in the model processing unit according to this embodiment.
- Flowchart illustrating a series of operations in the driver assistance process according to this embodiment.

The embodiments will be described in detail below with reference to the attached drawings. Note that the following embodiments do not limit the invention as defined in the claims, and not all combinations of features described in the embodiments are essential to the invention. Two or more of the features described in the embodiments may be combined in any way. Identical or similar configurations are given the same reference numerals, and redundant descriptions are omitted.

<Vehicle Configuration>
First, an example of the functional configuration of the vehicle 100 according to this embodiment will be described with reference to Figure 1. Note that each of the functional blocks described with reference to the following figures may be integrated or separated, and a described function may be implemented in another block. Also, what is described as hardware may be implemented in software, and vice versa.
In the following example, the case in which the control unit 108 is incorporated in the vehicle 100 is described, but the control unit 108 may instead be configured as a stand-alone control module or information processing device. That is, the present invention can be realized as a control module or information processing device having the configuration of the processor 110 and the model processing unit 114 included in the control unit 108. The sensor unit 101 includes a camera (imaging means) that outputs images of the area in front of the vehicle 100 (or additionally of its sides and rear). The sensor unit 101 may further include a LiDAR (Light Detection and Ranging) sensor that outputs a distance image obtained by measuring distances in front of the vehicle (or additionally to its sides and rear). The sensor unit 101 further includes a camera (imaging means) located inside the vehicle 100 that captures the driver's face. The image of the driver is used, for example, in the inference process by which the model processing unit 114 recognizes a target or the state of a target. The sensor unit 101 may also include various sensors that output the