US-12620396-B2 - Electronic device and controlling method of electronic device

Abstract

Provided are an electronic device and a method of controlling an electronic device. The electronic device includes: a memory storing at least one instruction; and at least one processor configured to execute the at least one instruction, wherein one or more of the at least one processor is configured to: acquire a first vector corresponding to each of a plurality of sections of a voice signal by inputting the voice signal to a common encoder based on acquiring the voice signal; acquire a second vector corresponding to each of the plurality of sections and independent of a context of the voice signal by inputting the first vector into a first individual encoder; acquire a phoneme sequence corresponding to the second vector by inputting the second vector into a first decoder; acquire a third vector corresponding to at least two sections among the plurality of sections and dependent on the context of the voice signal by inputting the first vectors into a second individual encoder; acquire a sub-word sequence corresponding to the third vector by inputting the third vector into a second decoder; and acquire text information corresponding to the plurality of sections by correcting the sub-word sequence based on the phoneme sequence, through a text information acquisition module.
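
The branching of the encoders described in the abstract can be illustrated with a toy sketch. This is not the disclosed model: the arithmetic inside each function is a placeholder, the decoders and the text information acquisition module are omitted, and all function names and the `bidirectional` flag are assumptions introduced for illustration. The sketch only shows the dependency structure, in which a second vector depends on its own section alone, while a third vector depends on several sections.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def common_encoder(sections: Sequence[Vector]) -> List[Vector]:
    # Placeholder: one "first vector" per section of the voice signal.
    return [[0.5 * x for x in s] for s in sections]


def context_independent_encoder(first: Sequence[Vector]) -> List[Vector]:
    # Each "second vector" is computed from its own section only.
    return [[x + 1.0 for x in v] for v in first]


def context_dependent_encoder(
    first: Sequence[Vector], bidirectional: bool = False
) -> List[Vector]:
    # Each "third vector" averages first vectors over several sections:
    # sections up to and including t (causal), or all sections (bidirectional).
    out: List[Vector] = []
    for t in range(len(first)):
        window = first if bidirectional else first[: t + 1]
        dim = len(first[0])
        out.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return out


def encode(
    sections: Sequence[Vector], bidirectional: bool = False
) -> Tuple[List[Vector], List[Vector]]:
    # Shared first vectors feed both individual encoders.
    first = common_encoder(sections)
    second = context_independent_encoder(first)
    third = context_dependent_encoder(first, bidirectional)
    return second, third
```

In this sketch, changing an earlier section leaves the second vector of a later section unchanged but alters its third vector, which mirrors the context-independent and context-dependent roles of the two individual encoders.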

Inventors

Dhananjaya Nagaraja GOWDA
Jiyeon Kim
Abhinav Garg
Chanwoo Kim

Assignees

SAMSUNG ELECTRONICS CO., LTD.

Dates

Publication Date: 20260505
Application Date: 20240520
Priority Date: 20230109

Claims (15)

1. An electronic device comprising: a memory storing at least one instruction; and at least one processor configured to execute the at least one instruction, wherein one or more of the at least one processor is configured to: acquire a first vector corresponding to each of a plurality of sections of a voice signal by inputting the voice signal to a common encoder based on acquiring the voice signal; acquire a second vector corresponding to each of the plurality of sections and independent of a context of the voice signal by inputting the first vector into a first individual encoder; acquire a phoneme sequence corresponding to the second vector by inputting the second vector into a first decoder; acquire a third vector corresponding to at least two sections among the plurality of sections and dependent on the context of the voice signal by inputting the first vectors into a second individual encoder; acquire a sub-word sequence corresponding to the third vector by inputting the third vector into a second decoder; and acquire text information corresponding to the plurality of sections by correcting the sub-word sequence based on the phoneme sequence, through a text information acquisition module.
2. The device as claimed in claim 1, wherein the text information acquisition module comprises circuitry including a spell correction module comprising circuitry, and the spell correction module is configured to acquire the text information by correcting a spelling of the sub-word sequence based on the phoneme sequence based on identifying that the sub-word sequence violates a specified spelling.
3. The device as claimed in claim 1, wherein the text information acquisition module comprises circuitry including a named entity correction module comprising circuitry, and the named entity correction module is configured to acquire the text information by correcting a named entity of the sub-word sequence based on the phoneme sequence based on identifying that the sub-word sequence is not included in a plurality of specified named entities.
4. The device as claimed in claim 1, wherein the common encoder is configured to learn to acquire the first vector suitable for both the first individual encoder and the second individual encoder without a specified constraint.
5. The device as claimed in claim 1, wherein the first individual encoder is configured to learn to acquire the second vector representing a feature of the phoneme sequence based on unlabeled learning data, and the first decoder is configured to learn to acquire the phoneme sequence based on labeled learning data.
6. The device as claimed in claim 1, wherein the second individual encoder is configured to learn to acquire the third vector representing a feature of the sub-word sequence based on unlabeled learning data, and the second decoder is configured to learn to acquire the sub-word sequence based on labeled learning data.
7. The device as claimed in claim 1, wherein the at least two sections among the plurality of sections include all sections received before a specific time point among the plurality of sections or all the sections received before and after the specific time point among the plurality of sections.
8. A method of controlling an electronic device, the method comprising: acquiring a first vector corresponding to each of a plurality of sections of a voice signal by inputting the voice signal to a common encoder based on the voice signal being acquired; acquiring a second vector corresponding to each of the plurality of sections and independent of a context of the voice signal by inputting the first vector into a first individual encoder; acquiring a phoneme sequence corresponding to the second vector by inputting the second vector into a first decoder; acquiring a third vector corresponding to at least two sections among the plurality of sections and dependent on the context of the voice signal by inputting the first vectors into a second individual encoder; acquiring a sub-word sequence corresponding to the third vector by inputting the third vector into a second decoder; and acquiring text information corresponding to the plurality of sections by correcting the sub-word sequence based on the phoneme sequence, through a text information acquisition module.
9. The method as claimed in claim 8, further comprising acquiring the text information by correcting a spelling of the sub-word sequence based on the phoneme sequence based on identifying that the sub-word sequence violates a specified spelling.
10. The method as claimed in claim 8, further comprising acquiring the text information by correcting a named entity of the sub-word sequence based on the phoneme sequence based on identifying that the sub-word sequence is not included in a plurality of specified named entities.
11. The method as claimed in claim 8, further comprising learning to acquire the first vector suitable for both the first individual encoder and the second individual encoder without a specified constraint.
12. The method as claimed in claim 8, further comprising: learning to acquire the second vector representing a feature of the phoneme sequence based on unlabeled learning data, and learning to acquire the phoneme sequence based on labeled learning data.
13. The method as claimed in claim 8, further comprising: learning to acquire the third vector representing a feature of the sub-word sequence based on unlabeled learning data, and learning to acquire the sub-word sequence based on labeled learning data.
14. The method as claimed in claim 8, wherein the at least two sections among the plurality of sections include all sections received before a specific time point among the plurality of sections or all the sections received before and after the specific time point among the plurality of sections.
15. A non-transitory computer-readable recording medium including a program which, when executed by one or more of at least one processor of an electronic device, causes the electronic device to perform operations comprising: acquiring a first vector corresponding to each of a plurality of sections of a voice signal by inputting the voice signal to a common encoder based on the voice signal being acquired; acquiring a second vector corresponding to each of the plurality of sections and independent of a context of the voice signal by inputting the first vector into a first individual encoder; acquiring a phoneme sequence corresponding to the second vector by inputting the second vector into a first decoder; acquiring a third vector corresponding to at least two sections among the plurality of sections and dependent on the context of the voice signal by inputting the first vectors into a second individual encoder; acquiring a sub-word sequence corresponding to the third vector by inputting the third vector into a second decoder; and acquiring text information corresponding to the plurality of sections by inputting the phoneme sequence and the sub-word sequence into a text information acquisition module.
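
The named entity correction recited in claims 3 and 10 can be pictured with a minimal sketch. The claims do not specify how the phoneme sequence is matched against the specified named entities, so the sketch assumes one possible approach: a hypothetical table `entities` mapping each named entity to a reference pronunciation, a phoneme-level edit distance, and an assumed threshold `max_distance`. The phoneme symbols in the usage note are illustrative only.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    # Levenshtein distance over phoneme symbols (row-by-row dynamic programming).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def correct_named_entity(
    sub_words: Sequence[str],
    phonemes: Sequence[str],
    entities: Dict[str, List[str]],  # hypothetical: entity -> pronunciation
    max_distance: int = 2,           # assumed acceptance threshold
) -> str:
    word = "".join(sub_words)
    # No correction if the word is already a specified named entity.
    if word in entities or not entities:
        return word
    # Otherwise pick the entity whose pronunciation is closest to the
    # context-independent phoneme sequence.
    best = min(entities, key=lambda name: edit_distance(phonemes, entities[name]))
    if edit_distance(phonemes, entities[best]) <= max_distance:
        return best
    return word
```

For example, with `entities = {"Siobhan": ["SH", "IH", "V", "AO", "N"], "Shannon": ["SH", "AE", "N", "AH", "N"]}`, the sub-words `["shiv", "on"]` and the phonemes `["SH", "IH", "V", "AA", "N"]` yield `"Siobhan"`, because the decoded phonemes differ from that entity's pronunciation by a single symbol.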

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Application No. PCT/KR2024/000350 designating the United States, filed on Jan. 8, 2024, in the Korean Intellectual Property Receiving Office and claiming priority to Korean Patent Application No. 10-2023-0002862, filed on Jan. 9, 2023, in the Korean Intellectual Property Office, the disclosures of each of which are incorporated by reference herein in their entireties.

BACKGROUND

Field

The disclosure relates to an electronic device and a controlling method of an electronic device, and for example, to an electronic device which may acquire text information corresponding to a voice signal, and a controlling method of an electronic device.

Description of the Related Art

In recent years, with advances in artificial intelligence (AI), the development of technology that acquires text information matching a user's speech intention through accurate recognition of the user's voice has accelerated.

Among the prior art, there is technology in which a voice recognition model (e.g., an automatic speech recognition (ASR) model) considers the context of the voice signal received before a specific time point, or the context of all sections of the voice signal received before and after the specific time point, so as to fully reflect the language information included in the voice signal. However, because this prior art uses a context-dependent encoder and decoder, it may be strongly biased by a previous word and thus fail to accurately recognize a foreign word in particular.

Meanwhile, there is also prior-art technology that performs voice recognition using a limited section of the voice signal. However, this prior art may not fully reflect the context of the voice signal and may thus produce a recognition result that does not match the user's speech intention.

SUMMARY

Embodiments of the disclosure provide an electronic device with improved accuracy of voice recognition by classifying encoders included in a voice recognition model into a context-independent encoder and a context-dependent encoder, and a controlling method of an electronic device.

According to various example embodiments of the disclosure, an electronic device includes: a memory storing at least one instruction; and at least one processor configured to execute the at least one instruction, wherein one or more of the at least one processor is configured to: acquire a first vector corresponding to each of a plurality of sections of a voice signal by inputting the voice signal to a common encoder based on acquiring the voice signal; acquire a second vector corresponding to each of the plurality of sections and independent of a context of the voice signal by inputting the first vector into a first individual encoder; acquire a phoneme sequence corresponding to the second vector by inputting the second vector into a first decoder; acquire a third vector corresponding to at least two sections among the plurality of sections and dependent on the context of the voice signal by inputting the first vectors into a second individual encoder; acquire a sub-word sequence corresponding to the third vector by inputting the third vector into a second decoder; and acquire text information corresponding to the plurality of sections by correcting the sub-word sequence based on the phoneme sequence, through a text information acquisition module.

The text information acquisition module may include a spell correction module, and the spell correction module may be configured to acquire the text information by correcting a spelling of the sub-word sequence based on the phoneme sequence based on identifying that the sub-word sequence violates a specified spelling.
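
A minimal sketch of such a spell correction step follows. The disclosure as quoted here does not say how a spelling violation is detected or how the phoneme sequence selects the corrected spelling, so the sketch assumes a set of valid spellings (`valid_words`) and a hypothetical pronunciation table (`pron_to_word`) keyed by phoneme tuples. Both are illustrative stand-ins.

```python
from typing import Dict, Sequence, Set, Tuple


def correct_spelling(
    sub_words: Sequence[str],
    phonemes: Sequence[str],
    valid_words: Set[str],                      # assumed "specified spelling" set
    pron_to_word: Dict[Tuple[str, ...], str],   # hypothetical pronunciation table
) -> str:
    word = "".join(sub_words)
    # The sub-word sequence already follows the specified spelling.
    if word in valid_words:
        return word
    # Otherwise fall back to the spelling associated with the decoded
    # phoneme sequence; keep the original word if no entry matches.
    return pron_to_word.get(tuple(phonemes), word)
```

For instance, the sub-words `["rec", "ieve"]` join to a word outside `valid_words = {"receive"}`, so a table entry for the phonemes `("R", "IH", "S", "IY", "V")` would replace it with `"receive"`.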

The text information acquisition module may include a named entity correction module, and the named entity correction module may be configured to acquire the text information by correcting a named entity of the sub-word sequence based on the phoneme sequence based on identifying that the sub-word sequence is not included in a plurality of specified named entities.

The common encoder may be configured to learn to acquire the first vector suitable for both the first individual encoder and the second individual encoder without a specified constraint.

The first individual encoder may be configured to learn to acquire the second vector representing a feature of the phoneme sequence based on unlabeled learning data, and the first decoder may be configured to learn to acquire the phoneme sequence based on labeled learning data.

The second individual encoder may be configured to learn to acquire the third vector representing a feature of the sub-word sequence based on unlabeled learning data, and the second decoder may be configured to learn to acquire the sub-word sequence based on labeled learning data.

The at least two sections among the plurality of sections may include all the sections received before a