EP-4207192-B1 - ELECTRONIC DEVICE AND METHOD FOR CONTROLLING SAME


Inventors

PARK, SANGJUN
CHOO, KIHYUN
KANG, Taehwa
SUNG, HOSANG
JEONG, JongHoon

Dates

Publication Date: 2026-05-06
Application Date: 2021-09-17

Claims (7)

1. An electronic device (100) comprising: a memory (110); and a processor (120) configured to: input acoustic data of a first quality into a first encoder model (210); obtain first feature data for estimating a waveform based on inputting the acoustic data of the first quality into the first encoder model (210); input the first feature data into a decoder model (230); and obtain waveform data of a second quality that is higher quality than the first quality based on inputting the first feature data into the decoder model (230), wherein the first encoder model (210) is trained to output second feature data based on training acoustic data of the first quality being input, wherein the processor (120) is further configured to: input training acoustic data of the second quality to a second encoder model (220) of the second quality to obtain feature data for estimating training waveform data of the second quality; train the first encoder model (210) based on an error between the second feature data, obtained by inputting the training acoustic data of the first quality into the first encoder model (210), and the feature data for estimating the training waveform data of the second quality; input the feature data for estimating the training waveform data of the second quality to the decoder model (230) to obtain waveform data; and train the decoder model (230) based on an error between the obtained waveform data and the training waveform data of the second quality.
2. The electronic device (100) according to claim 1, wherein the processor (120) is further configured to: input the first feature data into a first restoration model (240) for restoring feature data for estimating a waveform to acoustic data; and obtain the acoustic data of the second quality based on inputting the first feature data into the first restoration model (240) for restoring the feature data for estimating the waveform to acoustic data, wherein the first restoration model (240) is to output training acoustic data of the second quality based on training feature data of the first quality being input.
3. The electronic device (100) according to claim 1, wherein the processor (120) is further configured to: input first acoustic data related to a first domain among the acoustic data of the first quality into the second encoder model (220, 900), wherein a domain of the acoustic data refers to a type of the acoustic data and includes at least one of a spectrum, a mel-spectrum, a cepstrum, and pitch data; obtain third feature data based on inputting the first acoustic data related to the first domain among the acoustic data of the first quality to the second encoder model (220, 900); input second acoustic data related to a second domain among the acoustic data of the first quality into a third encoder model (920); obtain fourth feature data based on inputting the second acoustic data related to the second domain among the acoustic data of the first quality to the third encoder model (920); and obtain the waveform data of the second quality corresponding to the first domain based on inputting the third feature data and the fourth feature data into the decoder model (230), wherein the second encoder model (220, 900) is trained to output feature data for estimating the training waveform data of the second quality corresponding to the first domain, based on first training acoustic data related to the first domain among the training acoustic data of the first quality being input.
4. The electronic device (100) according to claim 1, wherein the processor (120) is further configured to: input the acoustic data of the first quality to an improvement model trained to improve a quality; obtain the acoustic data of the second quality based on inputting the acoustic data of the first quality to an improvement model trained to improve quality of the acoustic data; input the acoustic data of the second quality into the first encoder model (210); obtain feature data having an improved quality as compared with the first feature data based on inputting the acoustic data of the second quality into the first encoder model (210); input the feature data with the improved quality as compared with the first feature data into the decoder model (230); obtain an excitation signal based on inputting the feature data with the improved quality as compared with the first feature data into the decoder model (230); input the excitation signal and the acoustic data of the second quality into a signal processing model; and obtain the waveform data of the second quality based on inputting the excitation signal and the acoustic data of the second quality into the signal processing model.
5. A control method of an electronic device (100), the method comprising: inputting acoustic data of a first quality into a first encoder model (210); obtaining first feature data for estimating a waveform based on inputting the acoustic data of the first quality into the first encoder model (210); inputting the first feature data into a decoder model (230); and obtaining waveform data of a second quality that is higher quality than the first quality based on inputting the first feature data into the decoder model (230), wherein the first encoder model (210) is trained to output second feature data based on training acoustic data of the first quality being input, and wherein the method further comprises: inputting training acoustic data of the second quality to a second encoder model (220) of the second quality to obtain feature data for estimating training waveform data of the second quality; training the first encoder model (210) based on an error between the second feature data obtained by inputting the training acoustic data of the first quality into the first encoder model (210) and the feature data for estimating the training waveform data of the second quality; inputting the feature data for estimating the training waveform data of the second quality to the decoder model (230) to obtain waveform data; and training the decoder model (230) based on an error between the obtained waveform data and the training waveform data of the second quality.
6. The control method according to claim 5, further comprising: inputting the first feature data to a first restoration model (240) for restoring feature data for estimating a waveform to acoustic data; and obtaining the acoustic data of the second quality based on inputting the first feature data to the first restoration model (240) for restoring the feature data for estimating the waveform to the acoustic data, wherein the first restoration model (240) is to output training acoustic data of the second quality based on training feature data of the first quality being input.
7. The control method according to claim 5, further comprising: inputting first acoustic data related to a first domain among the acoustic data of the first quality into the second encoder model (220, 900), wherein a domain of the acoustic data refers to a type of the acoustic data and includes at least one of a spectrum, a mel-spectrum, a cepstrum, and pitch data; obtaining third feature data based on inputting the first acoustic data related to the first domain among the acoustic data of the first quality into the second encoder model (220, 900); inputting second acoustic data related to a second domain among the acoustic data of the first quality into a third encoder model (920); obtaining fourth feature data based on inputting the second acoustic data related to the second domain among the acoustic data of the first quality to the third encoder model (920); inputting the third feature data and the fourth feature data into a decoder model (230); and obtaining the waveform data having the second quality corresponding to the first domain based on inputting the third feature data and the fourth feature data into the decoder model (230, 930), wherein the second encoder model (220, 900) is trained to output feature data for estimating training waveform data having an improved quality corresponding to the first domain, based on first training acoustic data related to the first domain among the training acoustic data of the first quality being input.
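
For illustration only, the training scheme recited in claims 1 and 5 can be sketched with toy scalar models. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Affine` class, the synthetic data, the learning rate, the step count, and the choice to keep the second encoder model fixed are all hypothetical. Real encoder and decoder models would be neural networks operating on acoustic frames and waveform samples.

```python
import random


class Affine:
    """Toy stand-in for a neural model: maps each scalar x to w * x + b."""

    def __init__(self, w=0.0, b=0.0):
        self.w, self.b = w, b

    def __call__(self, xs):
        return [self.w * x + self.b for x in xs]

    def gd_step(self, xs, targets, lr=0.1):
        """One full-batch gradient-descent step on the mean squared error."""
        preds = self(xs)
        n = len(xs)
        grad_w = sum(2 * (p - t) * x for p, t, x in zip(preds, targets, xs)) / n
        grad_b = sum(2 * (p - t) for p, t in zip(preds, targets)) / n
        self.w -= lr * grad_w
        self.b -= lr * grad_b


def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


rng = random.Random(0)
# Hypothetical paired training data (every relation below is invented).
hq_acoustic = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # second quality
lq_acoustic = [0.5 * a + 0.1 for a in hq_acoustic]         # first quality
hq_waveform = [3.0 * a - 0.2 for a in hq_acoustic]         # training waveform

second_encoder = Affine(w=2.0)  # assumed already available and kept fixed
first_encoder = Affine()
decoder = Affine()

# Feature data for estimating the training waveform data of the second quality.
target_features = second_encoder(hq_acoustic)

for _ in range(3000):
    # Decoder: error between its output waveform and the high-quality waveform.
    decoder.gd_step(target_features, hq_waveform)
    # First encoder: error between its features and the second encoder's features.
    first_encoder.gd_step(lq_acoustic, target_features)

# At inference time only the first encoder model and the decoder model are used.
restored = decoder(first_encoder(lq_acoustic))
final_error = mse(restored, hq_waveform)
```

In this sketch the first encoder model never sees a waveform: it only learns to reproduce, from low-quality input, the features that the second encoder model extracts from high-quality input. The decoder model, trained on those same high-quality features, can then be reused unchanged.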

Description

[Technical Field] The disclosure relates to an electronic device and a control method thereof, and more particularly, to an electronic device that obtains high-quality waveform data using an artificial intelligence model, and a control method thereof.

[Background Art] Conventionally, various algorithms (for example, an algorithm for filtering noise included in a voice signal, a beamforming-based algorithm, and the like) have been developed and utilized to improve the sound quality of a voice. Recently, algorithms for improving the sound quality of a voice based on an artificial intelligence system have been developed. An artificial intelligence system refers to a system that, unlike an existing rule-based system, performs training and inference based on a neural network model, and has been utilized in various fields such as voice recognition, image recognition, and future prediction. In particular, artificial intelligence systems that solve a given problem through a deep neural network based on deep learning have recently been developed.

Meanwhile, for a deep neural network, the smaller the computational amount (that is, the model complexity), the lower the performance, and the more complex or difficult the task it is trained to perform, the lower the performance. Therefore, an approach that lowers the difficulty of the task performed by the deep neural network is required in order to improve performance at a limited model complexity.

CHENG JIAMING ET AL: "A Deep Adaptation Network for Speech Enhancement: Combining a Relativistic Discriminator With Multi-Kernel Maximum Mean Discrepancy", ARXIV: 1806.04885V2 vol. 29, (2020-11-09), pages 41-53, discloses a domain adaptive method for combining two adaptations to improve the generalization of unlabeled noisy speech.

JP 2020 149504 A discloses that a learning apparatus executes, with respect to each learning data set, a first training step of training a second encoder and a second metadata identifier such that the identification result by the second metadata identifier matches the metadata, a second training step of training encoders and an estimator such that the result of estimation performed by the estimator matches correct answer data, a third training step of training a first metadata identifier such that the result of identification performed by the first metadata identifier matches the metadata, and a fourth training step of training a first encoder such that the result of identification performed by the first metadata identifier does not match the metadata. The third training step and the fourth training step are executed alternately and repeatedly.

[Disclosure] [Technical Problem] The disclosure provides an electronic device that obtains waveform data with an improved quality using an artificial intelligence model trained to output high-quality waveform data, and a control method thereof.

[Technical Solution] The present invention is defined by the appended set of claims. Preferred embodiments are defined by the dependent claims. According to an aspect of an example embodiment, an electronic device may include a memory; and a processor configured to: input acoustic data of a first quality into a first encoder model; obtain first feature data for estimating a waveform based on inputting the acoustic data of the first quality into the first encoder model; input the first feature data into a decoder model; and obtain waveform data of a second quality that is higher quality than the first quality based on inputting the first feature data into the decoder model, wherein the first encoder model is trained to output second feature data based on training acoustic data of the first quality being input.
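
As a rough illustration of the inference path just described (acoustic data of the first quality into the first encoder model, first feature data into the decoder model), the data flow could look like the sketch below. The function `enhance`, the toy models, the frame layout, and the hop size `HOP` are hypothetical placeholders introduced here; they only show how the pieces connect and how frame-rate features become sample-rate waveform data.

```python
from typing import Callable, List

Frame = List[float]    # one frame of acoustic data (for example, mel-spectrum bins)
Feature = List[float]  # feature data for estimating a waveform


def enhance(
    acoustic_lq: List[Frame],
    first_encoder: Callable[[List[Frame]], List[Feature]],
    decoder: Callable[[List[Feature]], List[float]],
) -> List[float]:
    """Low-quality acoustic data -> first feature data -> waveform data."""
    first_feature_data = first_encoder(acoustic_lq)
    return decoder(first_feature_data)


# Hypothetical stand-ins for the trained models.
HOP = 4  # waveform samples produced per acoustic frame (assumed value)


def toy_encoder(frames: List[Frame]) -> List[Feature]:
    # Summarize each frame by its mean and its peak value.
    return [[sum(f) / len(f), max(f)] for f in frames]


def toy_decoder(features: List[Feature]) -> List[float]:
    # Emit HOP samples per feature vector, ramping up to the frame mean.
    return [feat[0] * (k + 1) / HOP for feat in features for k in range(HOP)]


frames = [[0.0, 1.0, 2.0], [2.0, 2.0, 2.0]]
waveform = enhance(frames, toy_encoder, toy_decoder)
```

Passing the models in as callables keeps the inference path independent of how they were trained, which mirrors the separation between the inference steps and the training steps in the text.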
The processor is further configured to: input training acoustic data of the second quality to a second encoder model of the second quality to obtain feature data for estimating training waveform data of the second quality; train the first encoder model based on an error between the second feature data, obtained by inputting the training acoustic data of the first quality into the first encoder model, and the feature data for estimating the training waveform data of the second quality; input the feature data for estimating the training waveform data of the second quality to the decoder model to obtain waveform data; and train the decoder model based on an error between the obtained waveform data and the training waveform data of the second quality.

According to an aspect of an example embodiment, a control method of an electronic device may include inputting acoustic data of a first quality into a first encoder model; obtaining first feature data for estimating a waveform based on inputting the acoustic data of the first quality into the first encoder model; inputting the first feature data into a decoder model; obtaining waveform data of a second quality that is higher quality than the first quality based on inputting the first feature data into the decoder mod