
KR-20260062785-A - ELECTRONIC DEVICE AND METHOD OF EXECUTION THEREOF


Abstract

An electronic device and a method executed by the electronic device are disclosed. The electronic device may include a memory and a processor that executes instructions stored in the memory. When instructions are executed by the processor, the electronic device may acquire a first modal feature representing a feature extracted from an image acquired through a first sensor and a second modal feature representing a feature extracted from a point cloud acquired through a second sensor different from the first sensor, acquire a first augmented feature by performing feature augmentation processing on the first modal feature using the second modal feature, acquire a second augmented feature by performing feature augmentation processing on the second modal feature using the first modal feature, acquire a fused feature by fusing the first augmented feature and the second augmented feature, and perform a target operation using the acquired fused feature.

Inventors

  • Xiaoshuai Hao
  • Chao Zhang
  • Hui Zhang
  • Weiming Li
  • Mengchuan Wei

Assignees

  • Samsung Electronics Co., Ltd. (삼성전자주식회사)

Dates

Publication Date
2026-05-07
Application Date
2025-03-12
Priority Date
2024-10-29

Claims (20)

  1. An electronic device comprising: a memory; and a processor configured to execute instructions stored in the memory, wherein, when the instructions are executed by the processor, the electronic device: acquires a first modal feature representing a feature extracted from an image acquired through a first sensor and a second modal feature representing a feature extracted from a point cloud acquired through a second sensor different from the first sensor; acquires a first augmented feature by performing feature augmentation processing on the first modal feature using the second modal feature; acquires a second augmented feature by performing feature augmentation processing on the second modal feature using the first modal feature; acquires a fused feature by fusing the first augmented feature and the second augmented feature; and performs a target task using the acquired fused feature.
  2. The electronic device of claim 1, wherein, when the instructions are executed by the processor, the electronic device: obtains the first augmented feature, in which the first modal feature is augmented, from a first feature augmentation model that takes as input the first modal feature and a second query obtained from a second feature mapping layer of a second feature augmentation model; and obtains the second augmented feature, in which the second modal feature is augmented, from the second feature augmentation model that takes as input the second modal feature and a first query output from a first feature mapping layer of the first feature augmentation model.
  3. The electronic device of claim 2, wherein the first feature augmentation model comprises: the first feature mapping layer, which extracts an input of a first attention layer from the first modal feature; and the first attention layer, which outputs a feature based on the first modal feature and the second modal feature, and wherein, when the instructions are executed by the processor, the electronic device: obtains a first key and a first value used in the first attention layer from the first feature mapping layer, which takes the first modal feature as input; obtains the second query from the second feature mapping layer, which takes the second modal feature as input; obtains a first feature from the first attention layer, which takes the second query, the first key, and the first value as inputs; and acquires the first augmented feature based on the first feature and the second query.
  4. The electronic device of claim 3, wherein the first feature augmentation model further comprises: a first normalization layer that normalizes an output of the first attention layer; and a first multilayer perceptron layer connected to the first normalization layer, and wherein, when the instructions are executed by the processor, the electronic device: obtains a second feature from the first normalization layer, which takes the first feature and the second query as input; obtains a third feature from the first multilayer perceptron layer, which takes the second feature as input; and obtains the first augmented feature from a second normalization layer, which takes the second feature and the third feature as inputs.
  5. The electronic device of claim 2, wherein the second feature augmentation model comprises: the second feature mapping layer, which extracts an input of a second attention layer from the second modal feature; and the second attention layer, which outputs a feature based on the second modal feature and the first modal feature, and wherein, when the instructions are executed by the processor, the electronic device: obtains a second key and a second value used in the second attention layer from the second feature mapping layer, which takes the second modal feature as input; obtains the first query from the first feature mapping layer, which takes the first modal feature as input; obtains a fourth feature from the second attention layer, which takes the first query, the second key, and the second value as inputs; and acquires the second augmented feature based on the fourth feature and the first query.
  6. The electronic device of claim 5, wherein the second feature augmentation model further comprises: a third normalization layer that normalizes an output of the second attention layer; and a second multilayer perceptron layer connected to the third normalization layer, and wherein, when the instructions are executed by the processor, the electronic device: obtains a fifth feature from the third normalization layer, which takes the fourth feature and the first query as input; obtains a sixth feature from the second multilayer perceptron layer, which takes the fifth feature as input; and obtains the second augmented feature from a fourth normalization layer, which takes the fifth feature and the sixth feature as inputs.
  7. The electronic device of claim 4, wherein the first feature augmentation model comprises a plurality of first feature augmentation sub-models, each of the plurality of first feature augmentation sub-models comprising the first feature mapping layer, the first attention layer, the first normalization layer, and the first multilayer perceptron layer, wherein the plurality of first feature augmentation sub-models are connected in series with each other, and wherein an output of a preceding first feature augmentation sub-model and the second query obtained from the second feature mapping layer of the second feature augmentation model are inputs of a following first feature augmentation sub-model.
  8. The electronic device of claim 7, wherein the second feature augmentation model comprises a plurality of second feature augmentation sub-models, each of the plurality of second feature augmentation sub-models comprising the second feature mapping layer, the second attention layer, the third normalization layer, and the second multilayer perceptron layer, wherein the plurality of second feature augmentation sub-models are connected in series with each other, wherein an output of a preceding first feature augmentation sub-model and the second query obtained from the second feature mapping layer of a preceding second feature augmentation sub-model are inputs of a following first feature augmentation sub-model, and wherein the preceding first feature augmentation sub-model is a model corresponding to the preceding second feature augmentation sub-model.
  9. The electronic device of claim 6, wherein the second feature augmentation model comprises a plurality of second feature augmentation sub-models, each of the plurality of second feature augmentation sub-models comprising the second feature mapping layer, the second attention layer, the third normalization layer, and the second multilayer perceptron layer, wherein the plurality of second feature augmentation sub-models are connected in series with each other, and wherein an output of a preceding second feature augmentation sub-model and the first query obtained from the first feature mapping layer of the first feature augmentation model are inputs of a following second feature augmentation sub-model.
  10. The electronic device of claim 9, wherein the first feature augmentation model comprises a plurality of first feature augmentation sub-models, each of the plurality of first feature augmentation sub-models comprising the first feature mapping layer, the first attention layer, the first normalization layer, and the first multilayer perceptron layer, wherein the plurality of first feature augmentation sub-models are connected in series with each other, wherein an output of a preceding second feature augmentation sub-model and the first query obtained from the first feature mapping layer of a preceding first feature augmentation sub-model are inputs of a following second feature augmentation sub-model, and wherein the preceding second feature augmentation sub-model is a model corresponding to the preceding first feature augmentation sub-model.
  11. The electronic device of claim 2, wherein the first attention layer of the first feature augmentation model is based on a multi-head attention mechanism, and the second attention layer of the second feature augmentation model is based on a multi-head attention mechanism.
  12. The electronic device of claim 1, wherein, when the instructions are executed by the processor, the electronic device obtains the fused feature from a feature fusion model based on the first augmented feature and the second augmented feature.
  13. The electronic device of claim 12, wherein, when the instructions are executed by the processor, the electronic device: obtains a cascaded feature by cascading the first augmented feature and the second augmented feature; obtains a feature extracted from the cascaded feature from the feature fusion model, which takes the cascaded feature as input; obtains sub-fusion features used to generate the fused feature based on the extracted feature, the first augmented feature, and the second augmented feature; and acquires the fused feature by cascading the sub-fusion features.
  14. The electronic device of claim 1, wherein the first sensor is a camera sensor and the second sensor is a LiDAR sensor.
  15. A method executed by an electronic device, the method comprising: acquiring a first modal feature representing a feature extracted from an image acquired through a first sensor and a second modal feature representing a feature extracted from a point cloud acquired through a second sensor different from the first sensor; acquiring a first augmented feature by performing feature augmentation processing on the first modal feature using the second modal feature; acquiring a second augmented feature by performing feature augmentation processing on the second modal feature using the first modal feature; acquiring a fused feature by fusing the first augmented feature and the second augmented feature; and performing a target task using the acquired fused feature.
  16. The method of claim 15, wherein acquiring the first augmented feature comprises obtaining the first augmented feature, in which the first modal feature is augmented, from a first feature augmentation model that takes as input the first modal feature and a second query obtained from a second feature mapping layer of a second feature augmentation model, and wherein acquiring the second augmented feature comprises obtaining the second augmented feature, in which the second modal feature is augmented, from the second feature augmentation model that takes as input the second modal feature and a first query output from a first feature mapping layer of the first feature augmentation model.
  17. The method of claim 16, wherein the first feature augmentation model comprises: the first feature mapping layer, which extracts an input of a first attention layer from the first modal feature; and the first attention layer, which outputs a feature based on the first modal feature and the second modal feature, and wherein acquiring the first augmented feature comprises: obtaining a first key and a first value used in the first attention layer from the first feature mapping layer, which takes the first modal feature as input; obtaining the second query from the second feature mapping layer, which takes the second modal feature as input; obtaining a first feature from the first attention layer, which takes the second query, the first key, and the first value as inputs; and acquiring the first augmented feature based on the first feature and the second query.
  18. The method of claim 16, wherein the second feature augmentation model comprises: the second feature mapping layer, which extracts an input of a second attention layer from the second modal feature; and the second attention layer, which outputs a feature based on the second modal feature and the first modal feature, and wherein acquiring the second augmented feature comprises: obtaining a second key and a second value used in the second attention layer from the second feature mapping layer, which takes the second modal feature as input; obtaining the first query from the first feature mapping layer, which takes the first modal feature as input; obtaining a fourth feature from the second attention layer, which takes the first query, the second key, and the second value as inputs; and acquiring the second augmented feature based on the fourth feature and the first query.
  19. A vehicle system comprising: a first sensor configured to acquire an image of a target area; a second sensor configured to acquire a point cloud of the target area; a memory in which instructions are stored; and a processor configured to execute the instructions stored in the memory, wherein, when the instructions are executed by the processor, the vehicle system is controlled to: acquire a first modal feature representing a feature extracted from an image acquired through the first sensor and a second modal feature representing a feature extracted from a point cloud acquired through the second sensor different from the first sensor; acquire a first augmented feature by performing feature augmentation processing on the first modal feature using the second modal feature; acquire a second augmented feature by performing feature augmentation processing on the second modal feature using the first modal feature; acquire a fused feature by fusing the first augmented feature and the second augmented feature; and perform a target task using the acquired fused feature.
  20. The vehicle system of claim 19, wherein, when the instructions are executed by the processor, the vehicle system is controlled to: obtain the first augmented feature, in which the first modal feature is augmented, from a first feature augmentation model that takes as input the first modal feature and a second query obtained from a second feature mapping layer of a second feature augmentation model; and obtain the second augmented feature, in which the second modal feature is augmented, from the second feature augmentation model that takes as input the second modal feature and a first query output from a first feature mapping layer of the first feature augmentation model.
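The claims above describe a bidirectional cross-attention scheme: each modality's feature mapping layer yields a key and value for its own attention layer plus a query consumed by the other modality's attention layer, followed by normalization and a multilayer perceptron, with the two augmented features finally fused. The sketch below is a minimal, hypothetical NumPy illustration of one such exchange, not the patented implementation: all names, dimensions, the single-head attention, and the tanh MLP are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                  # illustrative feature dimension (assumption)
n_img, n_pts = 4, 6    # tokens per modality (assumption)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

class FeatureMapping:
    """Feature mapping layer: projects a modal feature to a query, key, and value."""
    def __init__(self):
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) / np.sqrt(d)
                                     for _ in range(3))
    def __call__(self, x):
        return x @ self.Wq, x @ self.Wk, x @ self.Wv

def cross_attention(q, k, v):
    """Attention layer: query from the other modality, key/value from its own."""
    return softmax(q @ k.T / np.sqrt(d)) @ v

def augment(q_other, k_own, v_own, mlp_W):
    # attention -> add query -> normalize -> MLP -> add -> normalize
    # (one reading of the normalization/MLP chain in claims 4 and 6)
    attn = cross_attention(q_other, k_own, v_own)
    h = layer_norm(attn + q_other)
    return layer_norm(h + np.tanh(h @ mlp_W))

img_feat = rng.standard_normal((n_img, d))   # first modal feature (camera image)
pts_feat = rng.standard_normal((n_pts, d))   # second modal feature (point cloud)

map1, map2 = FeatureMapping(), FeatureMapping()
mlp1, mlp2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

q1, k1, v1 = map1(img_feat)   # first query, first key, first value
q2, k2, v2 = map2(pts_feat)   # second query, second key, second value

aug1 = augment(q2, k1, v1, mlp1)   # first augmented feature (cf. claim 3)
aug2 = augment(q1, k2, v2, mlp2)   # second augmented feature (cf. claim 5)

# Fusion sketched here as cascading (concatenating) the augmented features
# along the token axis, loosely following claim 13.
fused = np.concatenate([aug1, aug2], axis=0)
print(fused.shape)   # prints (10, 8)
```

Claims 7 through 10 would stack several such augmentation blocks in series, feeding each block's output and the corresponding query from the other chain into the next block, and claim 11 would replace the single-head `cross_attention` above with a multi-head variant.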

Description

ELECTRONIC DEVICE AND METHOD OF EXECUTION THEREOF

The following disclosure relates to an electronic device and a method executed by the electronic device. Multimodal feature fusion techniques can be utilized in tasks such as map building and target detection. Multimodal data represents different types of data, and multimodal features represent features extracted from multimodal data. To improve the consistency of the meanings represented by multimodal features, multimodal features obtained from machine learning models can be fused, or multimodal data can be fused and input into machine learning models for feature extraction.

FIG. 1 is a flowchart illustrating a method performed by an electronic device according to one embodiment. FIG. 2 is a drawing explaining how an augmented feature is obtained according to one embodiment. FIG. 3 is a diagram illustrating the fusion of augmented features according to one embodiment. FIG. 4 is a diagram illustrating an example in which an electronic device according to one embodiment is used for map construction. FIG. 5 is a block diagram illustrating the configuration of an electronic device according to one embodiment. FIG. 6 is a block diagram illustrating the connection relationships among the components of an electronic device according to one embodiment. FIG. 7 is a block diagram illustrating a vehicle system using an electronic device according to one embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, or substitutions included in the technical concept described by the embodiments.
Terms such as "first" or "second" may be used to describe various components, but these terms should be interpreted solely for the purpose of distinguishing one component from another. For example, the first component may be named the second component, and similarly, the second component may be named the first component. When it is stated that a component is "connected" to another component, it should be understood that it may be directly connected to or coupled with that other component, or that there may be other components in between. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this document, phrases such as “A or B,” “at least one of A and B,” “at least one of A or B,” “A, B or C,” “at least one of A, B and C,” and “at least one of A, B, or C” may each include any one of the items listed together with the corresponding phrase, or all possible combinations thereof. In this specification, terms such as “comprising” or “having” are intended to designate the existence of the described feature, number, step, action, component, part, or combination thereof, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof. Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as generally understood by those skilled in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology, and should not be interpreted in an ideal or overly formal sense unless explicitly defined in this specification. Hereinafter, embodiments will be described in detail with reference to the attached drawings. 
In the description with reference to the attached drawings, identical components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted. FIG. 1 is a flowchart illustrating a method performed by an electronic device according to one embodiment. A multimodal feature fusion method (e.g., fusing features extracted from images with features extracted from point clouds) can be used for map-building tasks. An image represents data in which visual information is expressed in a two-dimensional space, and features extracted from an image can represent the features of the data expressed in the two-dimensional space. A point cloud can represent a set of points placed in a three-dimensional space, and features extracted from a point cloud can represent the features of the set of points placed in the three-dimensional space. Map construction can be performed based on a method of predicting map elements from a bird's-eye view (BEV). Map elements represent components that constitute the map (e.g., crosswalks, lane dividers, road boundaries). Methods of representing map elements may include vectorized representation and masked representation. Vectorized representation may refer to a method of representing map element