CN-116074298-B - Packet loss concealment method and device, computing device and storage medium
Abstract
A packet loss concealment method is described, comprising: determining a current frame in a voice data stream of a target object, the voice data stream comprising multiple frames of voice data having target voice parameters; and performing a first data processing operation on the current frame in response to packet loss occurring in the current frame, the first data processing operation comprising: determining the current frame as a packet loss frame; determining predicted voice parameters of the packet loss frame from one or more frames preceding the packet loss frame by using a packet loss concealment model, wherein the packet loss concealment model is established based on historical voice data of a plurality of objects; correcting the predicted voice parameters of the packet loss frame by using a correction model to obtain corrected voice parameters of the packet loss frame, wherein the correction model is established based on historical voice data of the target object; and determining the voice data of the packet loss frame according to the corrected voice parameters of the packet loss frame. Embodiments of the invention can be applied to various scenarios such as packet loss concealment, data transmission, and voice communication.
Inventors
- LIANG JUNBIN
Assignees
- Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2021-11-04
Claims (14)
- 1. A packet loss concealment method, comprising: determining a current frame in a voice data stream of a target object, the voice data stream comprising multiple frames of voice data having target voice parameters; and, in response to packet loss occurring in the current frame, performing a first data processing operation on the current frame, the first data processing operation comprising: determining the current frame as a packet loss frame; determining predicted voice parameters of the packet loss frame from one or more frames preceding the packet loss frame by using a packet loss concealment model, wherein the packet loss concealment model is established based on historical voice data of a plurality of objects; correcting the predicted voice parameters of the packet loss frame by using a correction model to obtain corrected voice parameters of the packet loss frame, wherein the correction model is established based on historical voice data of the target object; and determining the voice data of the packet loss frame according to the corrected voice parameters of the packet loss frame; wherein the correction model is obtained by training a deep learning model on the historical voice data of the target object through the following training steps: determining a target frame in the historical voice data of the target object and the target voice parameters of the target frame; determining predicted voice parameters of the target frame from one or more frames preceding the target frame by using the packet loss concealment model; correcting the predicted voice parameters of the target frame by using the deep learning model to obtain corrected voice parameters of the target frame; and adjusting parameters of the deep learning model to minimize the error between the corrected voice parameters of the target frame and the target voice parameters of the target frame, thereby obtaining the correction model.
- 2. The method of claim 1, further comprising: in response to no packet loss occurring in the current frame, performing a second data processing operation on the current frame, the second data processing operation comprising: determining predicted voice parameters of the current frame from one or more frames preceding the current frame by using the packet loss concealment model; correcting the predicted voice parameters of the current frame by using the correction model to obtain corrected voice parameters of the current frame; and updating parameters of the correction model so that the error between the corrected voice parameters of the current frame and the target voice parameters of the current frame is minimized.
- 3. The method of claim 1, further comprising: determining a type of the current frame, wherein the type of the current frame represents a change trend of the target voice parameters of one or more frames preceding the current frame; wherein correcting the predicted voice parameters of the packet loss frame by using the correction model to obtain the corrected voice parameters of the packet loss frame comprises: correcting the predicted voice parameters of the packet loss frame by using a correction model corresponding to the type of the current frame to obtain the corrected voice parameters of the packet loss frame.
- 4. The method of claim 3, further comprising: in response to no packet loss occurring in the current frame, performing a third data processing operation on the current frame, the third data processing operation comprising: determining predicted voice parameters of the current frame from one or more frames preceding the current frame by using the packet loss concealment model; correcting the predicted voice parameters of the current frame by using the correction model corresponding to the type of the current frame to obtain corrected voice parameters of the current frame; and updating parameters of the correction model corresponding to the type of the current frame so that the error between the corrected voice parameters of the current frame and the target voice parameters of the current frame is minimized.
- 5. The method of claim 4, wherein, for the correction model corresponding to the type of the current frame, there are a plurality of historical prediction errors, each historical prediction error being an error between the predicted voice parameters of a historical frame obtained by the packet loss concealment model and the target voice parameters of that historical frame, the historical frame being of the same type as the current frame; and wherein updating the parameters of the correction model corresponding to the type of the current frame so that the error between the corrected voice parameters of the current frame and the target voice parameters of the current frame is minimized comprises: updating the parameters of the correction model corresponding to the type of the current frame by using the plurality of historical prediction errors, so that the error between the corrected voice parameters of the current frame and the target voice parameters of the current frame is minimized.
- 6. The method of claim 5, wherein updating the parameters of the correction model corresponding to the type of the current frame by using the plurality of historical prediction errors comprises: updating the parameters of the correction model corresponding to the type of the current frame so that the difference between (a) the error between the corrected voice parameters and the predicted voice parameters of the current frame and (b) the center error of the plurality of historical prediction errors is minimized, whereby the error between the corrected voice parameters of the current frame and the target voice parameters of the current frame is minimized.
- 7. The method of claim 6, wherein the center error of the plurality of historical prediction errors is determined by an error determining step comprising: establishing a point set according to the plurality of historical prediction errors, wherein each point in the point set represents one historical prediction error; clustering the point set to obtain a target point set; determining a central value of the historical prediction errors corresponding to the points in the target point set; and determining the central value as the center error of the plurality of historical prediction errors.
- 8. The method of claim 7, wherein clustering the point set to obtain the target point set comprises: removing outliers from the point set, wherein an outlier is a point whose average distance to the other points in the point set is greater than a preset threshold value; and determining the target point set so that the target point set includes the points in the point set other than the outliers.
- 9. The method of claim 1, wherein the target voice parameters include one or more of a pitch frequency, line spectral pairs, and a gain of the voice data.
- 10. An apparatus for packet loss concealment, comprising: a determination module configured to determine a current frame in a voice data stream of a target object, the voice data stream comprising multiple frames of voice data having target voice parameters; and a processing module configured to perform a first data processing operation on the current frame in response to packet loss occurring in the current frame, the first data processing operation comprising: determining the current frame as a packet loss frame; determining predicted voice parameters of the packet loss frame from one or more frames preceding the packet loss frame by using a packet loss concealment model, wherein the packet loss concealment model is established based on historical voice data of a plurality of objects; correcting the predicted voice parameters of the packet loss frame by using a correction model to obtain corrected voice parameters of the packet loss frame, wherein the correction model is established based on historical voice data of the target object; and determining the voice data of the packet loss frame according to the corrected voice parameters of the packet loss frame; wherein the processing module is further configured to: determine a target frame in the historical voice data of the target object and the target voice parameters of the target frame; determine predicted voice parameters of the target frame from one or more frames preceding the target frame by using the packet loss concealment model; correct the predicted voice parameters of the target frame by using a deep learning model to obtain corrected voice parameters of the target frame; and adjust parameters of the deep learning model to minimize the error between the corrected voice parameters of the target frame and the target voice parameters of the target frame, thereby obtaining the correction model.
- 11. The apparatus of claim 10, wherein the processing module is further configured to: in response to no packet loss occurring in the current frame, perform a second data processing operation on the current frame, the second data processing operation comprising: determining predicted voice parameters of the current frame from one or more frames preceding the current frame by using the packet loss concealment model; correcting the predicted voice parameters of the current frame by using the correction model to obtain corrected voice parameters of the current frame; and updating parameters of the correction model so that the error between the corrected voice parameters of the current frame and the target voice parameters of the current frame is minimized.
- 12. A computing device, comprising: a memory configured to store computer-executable instructions; and a processor configured to perform the method according to any one of claims 1-9 when the computer-executable instructions are executed by the processor.
- 13. A computer-readable storage medium storing computer-executable instructions which, when executed, perform the method of any one of claims 1-9.
- 14. A computer program product comprising computer-executable instructions which, when executed by a processor, perform the method according to any one of claims 1-9.
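Claims 3-5 keep one correction model per frame type, where the type reflects the change trend of the preceding frames' target parameters. A possible classification might look like the sketch below; the three-way rising/falling/steady split and the threshold value are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def frame_type(prev_params, eps=0.05):
    """Classify the current frame by the change trend of the target
    voice parameters of the preceding frames (cf. claim 3)."""
    trend = np.diff(np.asarray(prev_params, dtype=float), axis=0).mean()
    if trend > eps:
        return "rising"
    if trend < -eps:
        return "falling"
    return "steady"

# One correction model would then be selected per type.
print(frame_type([[0.10], [0.20], [0.35]]))  # rising
print(frame_type([[0.30], [0.29], [0.31]]))  # steady
```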
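Claims 7 and 8 describe the center-error computation: build a point set from the historical prediction errors, drop points whose average distance to the others exceeds a preset threshold, and take a central value of what remains. A minimal sketch follows; treating errors as scalars and using the median as the "central value" are assumptions made for illustration.

```python
import numpy as np

def center_error(historical_errors, threshold):
    """Center error per claims 7-8: remove outliers by average pairwise
    distance, then return a central value (here, the median) of the
    remaining points."""
    pts = np.asarray(historical_errors, dtype=float)
    # average distance from each point to every other point
    pairwise = np.abs(pts[:, None] - pts[None, :])
    avg_dist = pairwise.sum(axis=1) / (len(pts) - 1)
    target_set = pts[avg_dist <= threshold]   # outliers removed
    return float(np.median(target_set))

# A single large error is discarded; the center tracks the typical errors.
errors = [0.10, 0.12, 0.09, 0.11, 5.0]
print(center_error(errors, threshold=2.0))  # ≈ 0.105
```

Note that the outlier inflates every point's average distance, so the threshold must be chosen with the expected error scale in mind.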
Description
Packet loss concealment method and device, computing device and storage medium

Technical Field

The disclosure relates to the technical field of network communication, and in particular to a packet loss concealment method and device, a computing device and a storage medium.

Background

Packet loss, one of the main factors degrading the quality of voice communication, inevitably occurs during network transmission. Packet loss concealment (PLC) reconstructs the signal at a packet loss position from the audio signal information before and after that position, thereby reducing the influence of packet loss on voice call quality. In related packet loss concealment schemes, when packet loss is detected, the voice parameters of the lost frame are predicted from the voice parameters of one or more normal voice frames preceding the loss, and the lost frame signal is recovered from those parameters. However, because vocal characteristics and pronunciation habits differ greatly from person to person, and pronunciation differs even more markedly across languages, such schemes can recover the voice poorly (that is, the packet loss concealment effect is poor).

Disclosure of Invention

In view of the above, the present disclosure provides methods and apparatus for packet loss concealment, which desirably overcome some or all of the above-mentioned drawbacks, as well as other possible drawbacks.
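The related scheme described in the background — predicting the lost frame's parameters from the preceding normal frames — can be sketched as follows. This is a toy illustration; the linear-extrapolation rule is an assumption, not the method of any particular codec or of this patent.

```python
import numpy as np

def predict_lost_frame(prev_params):
    """Baseline PLC: extrapolate the lost frame's voice parameters from
    the last two received frames by continuing the recent trend."""
    prev = np.asarray(prev_params, dtype=float)
    if len(prev) < 2:
        return prev[-1]                      # only one frame: repeat it
    return prev[-1] + (prev[-1] - prev[-2])  # linear continuation

# A pitch-like parameter rising 100 -> 110 -> 120; the lost frame is
# predicted to continue the trend.
print(predict_lost_frame([[100.0], [110.0], [120.0]]))  # [130.]
```

Because this prediction uses only generic continuity, it cannot capture speaker-specific behavior, which is the gap the correction model of this disclosure is meant to close.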
According to a first aspect of the disclosure, a packet loss concealment method is provided, comprising: determining a current frame in a voice data stream of a target object, the voice data stream comprising multiple frames of voice data having target voice parameters; and performing a first data processing operation on the current frame in response to packet loss occurring in the current frame, the first data processing operation comprising: determining the current frame as a packet loss frame; determining predicted voice parameters of the packet loss frame from one or more frames preceding the packet loss frame by using a packet loss concealment model, wherein the packet loss concealment model is established based on historical voice data of a plurality of objects; correcting the predicted voice parameters of the packet loss frame by using a correction model to obtain corrected voice parameters of the packet loss frame, wherein the correction model is established based on historical voice data of the target object; and determining the voice data of the packet loss frame according to the corrected voice parameters of the packet loss frame.
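The first data processing operation can be sketched as a two-stage pipeline: a generic (multi-speaker) predictor followed by a speaker-specific correction. All names below are illustrative assumptions, and the parameter-to-audio synthesis step is reduced to a placeholder.

```python
import numpy as np

def generic_plc(prev_frames):
    """Stand-in for the multi-speaker packet loss concealment model:
    predict the lost frame's parameters as the mean of the preceding ones."""
    return np.mean(np.asarray(prev_frames, dtype=float), axis=0)

def synthesize(params):
    """Placeholder decoder: a real system would convert pitch, line
    spectral pairs and gain back into a waveform."""
    return params

def conceal_lost_frame(stream, lost_index, correction, history=2):
    """First data processing operation: predict, correct, then rebuild
    the lost frame's voice data from the corrected parameters."""
    prev = stream[lost_index - history:lost_index]
    predicted = generic_plc(prev)          # generic prediction
    corrected = correction(predicted)      # speaker-specific correction
    return synthesize(corrected)           # parameters -> voice data

# A toy speaker-specific correction that adds a learned offset.
stream = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), None]  # frame 2 lost
out = conceal_lost_frame(stream, 2, correction=lambda p: p + 0.5)
print(out)  # [2.5 3.5]
```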
In some embodiments, the correction model is obtained by training a deep learning model on historical voice data of the target object, the training comprising: determining a target frame in the historical voice data of the target object and the target voice parameters of the target frame; determining predicted voice parameters of the target frame from one or more frames preceding the target frame by using the packet loss concealment model; correcting the predicted voice parameters of the target frame by using the deep learning model to obtain corrected voice parameters of the target frame; and adjusting parameters of the deep learning model to minimize the error between the corrected voice parameters of the target frame and the target voice parameters of the target frame, so as to obtain the correction model.

In some embodiments, the method further comprises performing a second data processing operation on the current frame in response to no packet loss occurring in the current frame, the second data processing operation comprising: determining predicted voice parameters of the current frame from one or more frames preceding the current frame by using the packet loss concealment model; correcting the predicted voice parameters of the current frame by using the correction model to obtain corrected voice parameters of the current frame; and updating parameters of the correction model so that the error between the corrected voice parameters of the current frame and the target voice parameters of the current frame is minimized.
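The training step described above can be sketched with a simple linear correction model fitted by stochastic gradient descent. The affine model form and the synthetic data (a speaker-specific offset that the generic predictor misses) are assumptions made purely for illustration; the patent's deep learning model is not specified here.

```python
import numpy as np

class CorrectionModel:
    """Affine correction y = W x + b of the generic PLC prediction,
    trained on a single speaker's historical frames."""
    def __init__(self, dim, lr=0.05):
        self.W = np.eye(dim)
        self.b = np.zeros(dim)
        self.lr = lr

    def correct(self, predicted):
        return self.W @ predicted + self.b

    def update(self, predicted, target):
        # One SGD step on the squared corrected-vs-target error.
        err = self.correct(predicted) - target
        self.W -= self.lr * np.outer(err, predicted)
        self.b -= self.lr * err

rng = np.random.default_rng(0)
dim, speaker_bias = 4, np.array([0.5, -0.2, 0.1, 0.3])

def frames(n):
    """Synthetic (prediction, target) pairs: the generic model's output
    lacks the speaker-specific offset present in the true parameters."""
    base = rng.normal(size=(n, dim))
    return base + 0.1 * rng.normal(size=(n, dim)), base + speaker_bias

model = CorrectionModel(dim)
for pred, tgt in zip(*frames(300)):      # train on historical frames
    model.update(pred, tgt)

test_pred, test_tgt = frames(50)
raw = np.mean((test_pred - test_tgt) ** 2)
fixed = np.mean([(model.correct(p) - t) ** 2
                 for p, t in zip(test_pred, test_tgt)])
print(fixed < raw)  # corrected parameters are closer to the target
```

The same `update` rule also serves the second data processing operation: whenever a frame arrives intact, its known target parameters can refine the correction model online.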
In some embodiments, the method further comprises determining a type of the current frame, the type of the current frame representing a change trend of the target voice parameters of one or more frames preceding the current frame, wherein correcting the predicted voice parameters of the packet loss frame with the correction model to obtain the corrected voice parameters of the packet loss frame comprises correcting the predicted voice parameters of the packet loss frame with a correction model corresponding to the type of the current frame. In some embodiments, the metho