US-12626728-B2 - Method and device for timing alignment of audio signals

US12626728B2US 12626728 B2US12626728 B2US 12626728B2US-12626728-B2

Abstract

A method and device for timing alignment of audio signals. The method includes: generating frequency domain images respectively for an audio signal to be aligned and a template audio signal (S110); inputting the frequency domain images into a twin neural network of a timing offset prediction model respectively, to obtain two frequency domain features output by the twin neural network (S120); fusing the two frequency domain features to obtain a fused feature (S130); inputting the fused feature into a prediction network of the timing offset prediction model to obtain a timing offset output by the prediction network (S140); and performing timing alignment processing on the audio signal to be aligned according to the timing offset (S150). The technical solution is more robust, and especially in a noisy environment, features extracted by a deep neural network are more intrinsic and more stable. An end-to-end timing offset prediction model is more accurate and faster.
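Step S110 above turns each 1-D audio signal into a "frequency domain image" (a magnitude spectrogram). A minimal numpy sketch of this step follows; the frame length, hop size, and window choice are illustrative assumptions, since the abstract fixes none of them:

```python
import numpy as np

def frequency_domain_image(signal, frame_len=256, hop=128):
    """Build a magnitude spectrogram ("frequency domain image") from a 1-D signal.

    frame_len, hop, and the Hann window are hypothetical choices,
    not values taken from the patent.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)   # FFT of each windowed frame
    return np.abs(spectrum).T                # shape: (freq bins, time frames)

# 1 second of a 440 Hz tone at 16 kHz as a toy input
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
img = frequency_domain_image(sig)
```

The same function would be applied to both the signal to be aligned and the template signal, so the two images are directly comparable.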

Inventors

  • Libing Zou
  • Yifan Zhang
  • Xueqiang Wang
  • Fuqiang Zhang

Assignees

  • GOERTEK INC.

Dates

Publication Date
2026-05-12
Application Date
2021-10-20
Priority Date
2020-12-09

Claims (15)

  1. A method for timing alignment of audio signals, comprising: generating frequency domain images respectively for an audio signal to be aligned and a template audio signal; inputting the frequency domain images into a twin neural network of a timing offset prediction model respectively, to obtain two frequency domain features output by the twin neural network; fusing the two frequency domain features to obtain a fused feature; inputting the fused feature into a prediction network of the timing offset prediction model to obtain a timing offset output by the prediction network; and performing timing alignment processing on the audio signal to be aligned according to the timing offset.
  2. The method according to claim 1, wherein said "generating frequency domain images respectively for an audio signal to be aligned and a template audio signal" comprises: cutting the audio signal to be aligned according to a duration of the template audio signal, so that a duration of the audio signal to be aligned after cutting equals the duration of the template audio signal; and generating frequency domain images respectively for the audio signal to be aligned after cutting and the template audio signal.
  3. The method according to claim 1, wherein said "generating frequency domain images respectively for an audio signal to be aligned and a template audio signal" comprises: generating frequency domain images respectively for the audio signal to be aligned and the template audio signal by using a Fast Fourier Transform method.
  4. The method according to claim 1, wherein said "fusing the two frequency domain features to obtain a fused feature" comprises: concatenating the two frequency domain features to obtain the fused feature; and said "performing timing alignment processing on the audio signal to be aligned according to the timing offset" comprises: determining a way of using the timing offset according to an order of the two frequency domain features during concatenation.
  5. The method according to claim 1, wherein said "inputting the fused feature into a prediction network of the timing offset prediction model to obtain a timing offset output by the prediction network" comprises: performing fully connected processing on the fused feature by a fully connected layer of the prediction network to obtain a fully connected feature; performing flattening processing on the fully connected feature by a Flat layer of the prediction network to obtain a flattened one-dimensional feature; and outputting a predicted timing offset by an output layer of the prediction network according to the one-dimensional feature.
  6. The method according to claim 1, wherein the timing offset prediction model is obtained by training in the following manner: inputting a group of training sample images into the twin neural network of the timing offset prediction model, to obtain two frequency domain features output by the twin neural network; fusing the two frequency domain features to obtain a fused feature; inputting the fused feature into the prediction network of the timing offset prediction model, to obtain the timing offset output by the prediction network as a sample predictive value; and calculating a training loss value according to the sample predictive value and a sample actual value of the group of training sample images, and updating parameters of the twin neural network and parameters of the prediction network according to the training loss value.
  7. The method of claim 6, further comprising: generating a first frequency domain image of a sample signal; processing the first frequency domain image to obtain a second frequency domain image to simulate a signal to be aligned of the sample signal; and using the first frequency domain image and the second frequency domain image as a group of training sample images to perform online learning and training on the timing offset prediction model.
  8. The method of claim 7, wherein said "processing the first frequency domain image" comprises: performing offset processing on the first frequency domain image, wherein an offset used in the offset processing is used as the sample actual value of the group of training sample images.
  9. The method of claim 7, wherein said "processing the first frequency domain image" comprises: adding noise to the first frequency domain image to simulate noise interference in an actual scene.
  10. A device for timing alignment of audio signals, comprising: a processor; and a memory storing computer-executable instructions that, when executed by the processor, cause the processor to: generate frequency domain images respectively for an audio signal to be aligned and a template audio signal; input the frequency domain images into a twin neural network of a timing offset prediction model respectively, to obtain two frequency domain features output by the twin neural network; fuse the two frequency domain features to obtain a fused feature; input the fused feature to a prediction network of the timing offset prediction model to obtain a timing offset output by the prediction network; and perform timing alignment processing on the audio signal to be aligned according to the timing offset.
  11. The device of claim 10, wherein the computer-executable instructions, when executed by the processor, cause the processor to: cut the audio signal to be aligned according to a duration of the template audio signal, so that a duration of the audio signal to be aligned after cutting equals the duration of the template audio signal; and generate frequency domain images respectively for the audio signal to be aligned after cutting and the template audio signal.
  12. The device of claim 10, wherein the computer-executable instructions, when executed by the processor, cause the processor to: concatenate the two frequency domain features to obtain the fused feature; and perform the timing alignment processing on the audio signal to be aligned according to the timing offset by determining a way of using the timing offset according to an order of the two frequency domain features during concatenation.
  13. The device of claim 10, wherein the computer-executable instructions, when executed by the processor, cause the processor to: perform fully connected processing on the fused feature by a fully connected layer of the prediction network to obtain a fully connected feature; perform flattening processing on the fully connected feature by a Flat layer of the prediction network to obtain a flattened one-dimensional feature; and output a predicted timing offset by an output layer of the prediction network according to the one-dimensional feature.
  14. The device of claim 10, wherein the computer-executable instructions, when executed by the processor, further cause the processor to train the timing offset prediction model by: inputting a group of training sample images into the twin neural network of the timing offset prediction model, to obtain the two frequency domain features output by the twin neural network; fusing the two frequency domain features to obtain a fused feature; inputting the fused feature into the prediction network of the timing offset prediction model, to obtain the timing offset output by the prediction network as a sample predictive value; and calculating a training loss value according to the sample predictive value and a sample actual value of the group of training sample images, and updating parameters of the twin neural network and parameters of the prediction network according to the training loss value.
  15. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores one or more programs, and when the one or more programs are executed by an electronic apparatus comprising a plurality of applications, the electronic apparatus executes the following method for timing alignment of audio signals: generating frequency domain images respectively for an audio signal to be aligned and a template audio signal; inputting the frequency domain images into a twin neural network of a timing offset prediction model respectively, to obtain two frequency domain features output by the twin neural network; fusing the two frequency domain features to obtain a fused feature; inputting the fused feature into a prediction network of the timing offset prediction model to obtain a timing offset output by the prediction network; and performing timing alignment processing on the audio signal to be aligned according to the timing offset.
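The claimed model structure (twin branches with shared weights, concatenation fusion, then a prediction head per claims 4, 5, and 10) can be sketched as a toy numpy forward pass. All layer sizes, activations, and weights below are illustrative assumptions; only the overall structure comes from the claims:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared ("twin") weights: both branches apply the same projection, so the
# two frequency domain images are embedded in the same feature space.
W_twin = rng.standard_normal((64, 32)) * 0.1  # branch weights (hypothetical sizes)
W_fc   = rng.standard_normal((64, 16)) * 0.1  # fully connected layer of the prediction network
w_out  = rng.standard_normal(16) * 0.1        # output layer -> scalar timing offset

def branch(x):
    """Twin branch: identical weights applied to each input (weight sharing)."""
    return np.maximum(x @ W_twin, 0.0)        # ReLU feature

def predict_offset(img_a, img_b):
    f_a, f_b = branch(img_a), branch(img_b)   # two frequency domain features
    fused = np.concatenate([f_a, f_b])        # fusion by concatenation (claim 4)
    hidden = np.maximum(fused @ W_fc, 0.0)    # fully connected processing
    flat = hidden.ravel()                     # "Flat" layer: flatten to 1-D (claim 5)
    return float(flat @ w_out)                # scalar predicted timing offset

offset = predict_offset(rng.standard_normal(64), rng.standard_normal(64))
```

Because the fused feature is an ordered concatenation, swapping the two inputs generally changes the prediction, which is why claim 4 ties the way the offset is used to the concatenation order.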

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a National Stage of International Application No. PCT/CN2021/124876, filed on Oct. 20, 2021, which claims priority to Chinese Patent Application No. 202011447392.8 filed on Dec. 9, 2020, both of which are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

The present application relates to the technical field of audio signal processing, and in particular to a method and device for timing alignment of audio signals.

BACKGROUND

In the process of industrial production, by analyzing whether an audio signal generated by a production apparatus is abnormal, failure of the apparatus can be discovered in time to avoid accidents. For example, in the process of railway transportation, by detecting abnormal sound of a wheel and a track during operation, damage to the track or wheel can be discovered in time, so that the damaged apparatus can be replaced promptly to avoid wheel accidents during operation. In addition, in the production process of an acoustic apparatus, by playing specific sound signals of different frequency bands, it is possible to analyze and determine whether the acoustic apparatus is faulty, and to analyze the frequency band and time in which the fault occurs, so as to improve the production process and the overall quality of products. Generally, a section of an audio signal sequence generated by an apparatus under inspection is acquired and compared with a standard signal sequence, so that a position where an abnormal signal is generated can be determined. However, due to the acquisition apparatus or its operation, the timing of a signal acquired by the apparatus usually does not match the timing of the original signal, so it is necessary to align the acquired signal sequence with the standard signal sequence to facilitate subsequent processing.
SUMMARY

The present application provides a method and device for timing alignment of audio signals, so as to use the powerful feature expression capability of deep neural networks to filter noise signals, and finally achieve end-to-end timing alignment of audio signals. The embodiments of the application use the following technical solutions.

In a first aspect, an embodiment of the present application provides a method for timing alignment of audio signals, including: generating frequency domain images respectively for an audio signal to be aligned and a template audio signal; inputting the frequency domain images into a twin neural network of a timing offset prediction model respectively, to obtain two frequency domain features output by the twin neural network; fusing the two frequency domain features to obtain a fused feature; inputting the fused feature into a prediction network of the timing offset prediction model to obtain a timing offset output by the prediction network; and performing timing alignment processing on the audio signal to be aligned according to the timing offset.

In a second aspect, the embodiment of the present application also provides a device for timing alignment of audio signals, including: an image generating unit, configured to generate frequency domain images for an audio signal to be aligned and a template audio signal respectively; a predicting unit, configured to input the frequency domain images into a twin neural network of a timing offset prediction model respectively, to obtain two frequency domain features output by the twin neural network, fuse the two frequency domain features to obtain a fused feature, and input the fused feature to a prediction network of the timing offset prediction model to obtain a timing offset output by the prediction network; and an aligning unit, configured to perform timing alignment processing on the audio signal to be aligned according to the timing offset.
In a third aspect, the embodiment of the present application also provides an electronic apparatus, including: a processor; and a memory arranged to store computer-executable instructions which, when executed, enable the processor to perform the above method for timing alignment of audio signals.

In a fourth aspect, the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores one or more programs, and when the one or more programs are executed by an electronic apparatus including a plurality of applications, the electronic apparatus executes the above method for timing alignment of audio signals.

The above-mentioned at least one technical solution adopted in the embodiments of the present application can achieve the following beneficial effects: by extracting features from frequency domain images of an audio signal to be aligned and a template audio signal using a deep neural network, better robustness can be obtained compared with the traditional artificial feature method, and especially in a noisy environment, the features extracted by the deep neural network are more intrinsic and more stable.
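Claims 7 to 9 describe generating training pairs by offsetting a sample's frequency domain image and adding noise, with the applied offset serving as the ground-truth label. A minimal sketch of that sample-generation step follows; the maximum shift, noise level, and the use of a circular shift are illustrative assumptions, not values from the patent:

```python
import numpy as np

def make_training_pair(first_img, max_shift=10, noise_std=0.05, rng=None):
    """Simulate a signal to be aligned: shift the first frequency domain image
    along its time axis and add noise. The shift is the sample actual value.

    max_shift and noise_std are hypothetical; np.roll gives a circular shift,
    used here only as a simple stand-in for the offset processing of claim 8.
    """
    if rng is None:
        rng = np.random.default_rng()
    shift = int(rng.integers(-max_shift, max_shift + 1))
    second_img = np.roll(first_img, shift, axis=1)               # offset processing (claim 8)
    second_img = second_img + rng.normal(0.0, noise_std,
                                         second_img.shape)       # simulated noise (claim 9)
    return (first_img, second_img), shift

# Toy "first frequency domain image" with the shape a spectrogram might have
base = np.abs(np.random.default_rng(1).standard_normal((129, 124)))
pair, label = make_training_pair(base, rng=np.random.default_rng(2))
```

Each generated pair can then be fed to the twin network as a group of training sample images, with `label` used as the sample actual value when computing the training loss.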