KR-20260067175-A - Packet loss concealment system based on diffusion model and learning method thereof


Abstract

A packet loss concealment system according to a disclosed embodiment includes a memory that stores an artificial intelligence model composed of a plurality of residual layers, and a processor that trains the artificial intelligence model. The processor inputs voice data to which Gaussian noise has been added, together with a noise level embedding, into the artificial intelligence model; generates a first combined value by combining local conditioning information with the voice data and the noise level embedding through a first FiLM (Feature-wise Linear Modulation) layer; generates a second combined value by combining a packet loss embedding with the first combined value through a second FiLM layer; and, through a reverse process, outputs the noise estimated by the residual layer at each time step.
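The two-stage conditioning described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the way the noise level embedding is merged with the voice features (simple per-channel addition), and the assumption that each conditioning signal has already been turned into a scale and a shift are all hypothetical choices made for illustration. Only the ordering (local conditioning in the first FiLM layer, packet loss embedding in the second) comes from the text.

```python
def film(features, scale, shift):
    """Feature-wise linear modulation on a (channels x time) list of lists.

    Each feature value is multiplied by its scale and offset by its shift.
    """
    return [
        [g * x + b for x, g, b in zip(f_row, g_row, b_row)]
        for f_row, g_row, b_row in zip(features, scale, shift)
    ]


def condition_residual_input(noisy_voice, noise_level_emb,
                             local_scale, local_shift,
                             loss_scale, loss_shift):
    """Apply the two FiLM stages in the order given in the abstract."""
    # Assumption: the per-channel noise level embedding is simply added to
    # the noisy voice features before modulation.
    hidden = [[x + e for x in row]
              for row, e in zip(noisy_voice, noise_level_emb)]
    # First FiLM layer: modulation derived from local conditioning information.
    first_combined = film(hidden, local_scale, local_shift)
    # Second FiLM layer: modulation derived from the packet loss embedding.
    second_combined = film(first_combined, loss_scale, loss_shift)
    return first_combined, second_combined


# Toy example: 2 channels, 3 time steps, last time step treated as lost.
noisy_voice = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
noise_level_emb = [0.5, -0.25]
local_scale = [[2.0] * 3, [2.0] * 3]
local_shift = [[1.0] * 3, [1.0] * 3]
loss_scale = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
loss_shift = [[0.0] * 3, [0.0] * 3]

first, second = condition_residual_input(
    noisy_voice, noise_level_emb,
    local_scale, local_shift, loss_scale, loss_shift,
)
print(first)
print(second)
```

In this toy setup the second stage zeroes the features at the lost time step, which is only one possible effect of a packet-loss-derived modulation; in a trained model the scales and shifts would be learned.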

Inventors

장준혁
양다희

Assignees

한양대학교 산학협력단

Dates

Publication Date: 20260512
Application Date: 20241105

Claims (12)

A packet loss concealment system comprising: a memory for storing an artificial intelligence model composed of a plurality of residual layers; and a processor for training the artificial intelligence model, wherein the processor: inputs voice data to which Gaussian noise has been added and a noise level embedding into the artificial intelligence model; generates a first combined value by combining local conditioning information with the voice data and the noise level embedding through a first FiLM (Feature-wise Linear Modulation) layer; generates a second combined value by combining a packet loss embedding with the first combined value through a second FiLM layer; and causes the residual layer to output estimated noise at each time step through a reverse process.
The packet loss concealment system of claim 1, wherein the processor trains the artificial intelligence model based on the difference between the estimated noise and the comparison target noise.
The packet loss concealment system of claim 1, wherein the processor: inputs the local conditioning information into a first 1x1 convolution layer; inputs a first output value output from the first 1x1 convolution layer into a Bi-DilConv layer; and inputs a second output value output from the Bi-DilConv layer into the first FiLM layer.
The packet loss concealment system of claim 1, wherein the processor: inputs the packet loss embedding into a second 1x1 convolution layer; and inputs a third output value output from the second 1x1 convolution layer into the second FiLM layer.
A learning method for a packet loss concealment system, the method comprising: inputting voice data to which Gaussian noise has been added and a noise level embedding into a residual layer; generating a first combined value by combining local conditioning information with the voice data and the noise level embedding through a first FiLM layer; generating a second combined value by combining a packet loss embedding with the first combined value through a second FiLM layer; and outputting, by the residual layer, estimated noise at each time step through a reverse process.
The learning method of claim 5, further comprising: training the artificial intelligence model based on the difference between the estimated noise and the comparison target noise.
The learning method of claim 5, wherein generating the first combined value comprises: inputting the local conditioning information into a first 1x1 convolution layer; inputting a first output value output from the first 1x1 convolution layer into a Bi-DilConv layer; and inputting a second output value output from the Bi-DilConv layer into the first FiLM layer.
The learning method of claim 5, wherein generating the second combined value comprises: inputting the packet loss embedding into a second 1x1 convolution layer; and inputting a third output value output from the second 1x1 convolution layer into the second FiLM layer.
A computer program stored on a computer-readable storage medium, wherein the computer program, when executed on one or more processors, performs operations for conditioning a diffusion model, the operations comprising: inputting voice data to which Gaussian noise has been added and a noise level embedding into a residual layer; generating a first combined value by combining local conditioning information with the voice data and the noise level embedding through a first FiLM (Feature-wise Linear Modulation) layer; generating a second combined value by combining a packet loss embedding with the first combined value through a second FiLM layer; and outputting, by the residual layer, estimated noise at each time step through a reverse process.
The computer program of claim 9, wherein the operations further comprise: training the residual layer based on the difference between the estimated noise and the comparison target noise.
The computer program of claim 9, wherein generating the first combined value comprises: inputting the local conditioning information into a first 1x1 convolution layer; inputting a first output value output from the first 1x1 convolution layer into a Bi-DilConv layer; and inputting a second output value output from the Bi-DilConv layer into the first FiLM layer.
The computer program of claim 9, wherein generating the second combined value comprises: inputting the packet loss embedding into a second 1x1 convolution layer; and inputting a third output value output from the second 1x1 convolution layer into the second FiLM layer.
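The training criterion in the claims (a difference between the estimated noise and the noise it is compared against) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the forward noising formula is the standard one for denoising diffusion models, the comparison target is assumed to be the Gaussian noise that was added, the squared-error form of the difference is an assumption (the claims do not name a norm), and `zero_noise_model` is a hypothetical stand-in for the residual-layer network.

```python
import math
import random


def add_gaussian_noise(clean, alpha_bar, rng):
    """Return (noisy, eps), where
    noisy = sqrt(alpha_bar) * clean + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in clean]
    keep = math.sqrt(alpha_bar)           # share of the clean signal kept
    spread = math.sqrt(1.0 - alpha_bar)   # share of Gaussian noise mixed in
    noisy = [keep * x + spread * e for x, e in zip(clean, eps)]
    return noisy, eps


def noise_loss(estimated, reference):
    """Mean squared difference between estimated and reference noise.

    Assumption: squared error; the claims only speak of a difference.
    """
    return sum((a - b) ** 2 for a, b in zip(estimated, reference)) / len(reference)


def zero_noise_model(noisy, alpha_bar):
    """Hypothetical stand-in for the network: always predicts zero noise."""
    return [0.0 for _ in noisy]


rng = random.Random(0)
clean = [0.0, 0.5, -0.5, 0.25]
noisy, eps = add_gaussian_noise(clean, alpha_bar=0.5, rng=rng)
estimated = zero_noise_model(noisy, 0.5)
print("loss of the zero predictor:", noise_loss(estimated, eps))
print("loss of a perfect predictor:", noise_loss(eps, eps))
```

A real training loop would replace `zero_noise_model` with the conditioned residual layers and update their parameters to reduce this loss.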

Description

Packet loss concealment system and learning method thereof
The present invention relates to a packet loss concealment system based on a diffusion model and to a learning method for the packet loss concealment system.
Packet loss concealment (PLC) is a technology used to minimize the degradation of voice quality caused by packet loss during network transmission. PLC is used primarily in real-time voice streaming environments such as voice calls and VoIP (Voice over IP), where it maintains natural voice quality by predicting or correcting the voice data of lost packets.
Conventional PLC systems have evolved from classical methods that fill in missing parts with previous frames to models based on deep learning. For example, prior paper 1 describes a representative PLC system based on a predictive model rather than a generative model; it uses the structure of a speech recognition system and was presented at ICASSP 2023. As another example, prior paper 2 describes a PLC system that uses a GAN, which is one type of generative model.
[Prior paper 1] Viet-Anh Nguyen, Anh H. T. Nguyen, and Andy W. H. Khong, "Improving performance of real-time full-band blind packet-loss concealment with predictive network," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1-5.
[Prior paper 2] J. Wang, Y. Guan, C. Zheng, R. Peng, and X. Li, "A temporal-spectral generative adversarial network based end-to-end packet loss concealment for wideband speech transmission," The Journal of the Acoustical Society of America, vol. 150, no. 4, pp. 2577-2588, 2021.
FIG. 1 is a control block diagram of a packet loss concealment system according to a disclosed embodiment. FIG. 2 is a flowchart illustrating a learning method for the disclosed packet loss concealment system. FIG. 3 is a diagram supplementing the explanation of the learning method of FIG. 2.
Throughout the specification, the same reference numerals refer to the same components.
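For contrast with the diffusion-based approach, the classical strategy mentioned in the background (filling a lost part with a previous frame) can be sketched as below. This is an illustrative baseline only: the function name, the frame representation (lists of samples), and the use of silence when no earlier frame exists are assumptions, not details taken from the specification.

```python
def conceal_by_repetition(frames, received):
    """Replace each lost frame with the most recent earlier frame.

    frames:   list of frames, each a list of samples
    received: list of booleans, True if the frame arrived intact
    Assumption: if the very first frame is lost, silence is used instead.
    """
    concealed = []
    previous = None
    for frame, ok in zip(frames, received):
        if ok:
            previous = list(frame)            # remember the latest good frame
        elif previous is None:
            previous = [0.0] * len(frame)     # nothing to repeat yet: silence
        concealed.append(list(previous))
    return concealed


frames = [[1, 2], [3, 4], [5, 6], [7, 8]]
received = [True, False, False, True]
print(conceal_by_repetition(frames, received))
```

Repetition of this kind keeps the stream continuous but tends to sound unnatural over longer loss bursts, which is part of the motivation for the learned models discussed in the background.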
This specification does not describe every element of the embodiments; content that is general knowledge in the art to which the invention pertains, or that overlaps between embodiments, is omitted.
Throughout the specification, when a part is described as being "connected" to another part, this covers both direct and indirect connection, and indirect connection includes connection via a wireless communication network. When a part is said to "include" a component, this means that, unless specifically stated otherwise, other components are not excluded and may additionally be included. Singular expressions include plural expressions unless the context clearly indicates otherwise.
Terms such as "~part," "~unit," "~block," and "~module" may refer to a unit that processes at least one function or operation. For example, these terms may refer to at least one piece of hardware such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), at least one piece of software stored in memory, or at least one process executed by a processor.
The reference symbols attached to each step are used to identify the steps and do not indicate their order; the steps may be performed in an order different from the one stated unless the context clearly specifies an order.
Hereinafter, a packet loss concealment system and a learning method for the packet loss concealment system according to the disclosed embodiment are described in detail with reference to the attached drawings.
FIG. 1 is a control block diagram of a packet loss concealment system according to a disclosed embodiment. In the disclosed embodiment, the packet loss concealment system (1) may be implemented as a computer or portable terminal (hereinafter, user terminal 3) capable of connecting to a communication network such as the Internet.
Here, the computer includes, for example, a notebook, desktop, laptop, tablet PC, or slate PC equipped with a web browser. The portable terminal is a wireless communication device that ensures portability and mobility and may be implemented as, for example, any type of handheld wireless communication device such as a smartphone.
Referring to FIG. 1, the user terminal (3) may include: a communication interface (11) for receiving, from an external server (2), voice data composed of packets and an artificial intelligence model required for training the disclosed packet loss concealment system; a memory (12) for storing the received voice data and the artificial intelligence model; an input unit (13) for receiving, from a user, operation commands for training and inference of the artificial intelligence model; a processor (10) for controlling the overall system; and an output unit (14) for displaying the training results of the artificial intelligence model produced by the processor (10). However