US-20260128051-A1 - SYSTEM AND METHOD FOR AUTOMATIC ALIGNMENT OF PHONETIC CONTENT FOR REAL-TIME ACCENT CONVERSION

Abstract

The disclosed technology relates to methods, accent conversion systems, and non-transitory computer readable media for real-time accent conversion. In some examples, a set of phonetic embedding vectors is obtained from input audio data for phonetic content representing a source accent. A trained machine learning model is applied to the set of phonetic embedding vectors to generate a set of transformed phonetic embedding vectors corresponding to phonetic characteristics of speech data in a target accent. An alignment is determined by maximizing a cosine distance between the set of phonetic embedding vectors and the set of transformed phonetic embedding vectors. The speech data is then aligned to the phonetic content based on the determined alignment to generate output audio data representing the target accent. The disclosed technology transforms phonetic characteristics of a source accent to match the target accent more closely for efficient and seamless accent conversion in real-time applications.
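Read as a processing pipeline, the abstract amounts to: embed the source-accent phonetics, transform the embeddings with a trained model, score a frame-by-frame cosine match against the transformed embeddings, and use that match to align target-accent speech frames. The NumPy sketch below is a minimal, hypothetical rendering of that flow, not the patented implementation; the softmax-based soft alignment, its temperature, and the toy shapes are illustrative assumptions chosen to keep the alignment differentiable. (The abstract speaks of maximizing a cosine distance; the sketch uses cosine similarity, the usual reading of a normalized dot-product objective.)

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Scale every embedding to unit magnitude, preserving its direction,
    # so pairwise dot products become cosines of the angles between frames.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T  # shape: (frames of a, frames of b)

def soft_align(source_emb, transformed_emb, target_frames, temperature=10.0):
    # Each source frame takes a softmax-weighted mixture of target-accent
    # frames, weighted by how well its embedding matches the transformed one.
    sim = cosine_similarity_matrix(source_emb, transformed_emb)
    weights = np.exp(temperature * sim)
    weights /= weights.sum(axis=1, keepdims=True)  # rows sum to one
    return weights @ target_frames  # aligned output, one row per source frame

# Toy shapes: 40 source frames, 55 target frames, 64-dim embeddings/features.
source_emb = np.random.randn(40, 64)
transformed_emb = np.random.randn(55, 64)   # output of the trained model
target_frames = np.random.randn(55, 64)     # target-accent speech features
aligned = soft_align(source_emb, transformed_emb, target_frames)  # (40, 64)
```

Because every step here is a differentiable tensor operation, gradients can flow through the alignment itself, which is the property the claims emphasize.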

Inventors

  • Lukas Pfeifenberger
  • Shawn Zhang

Assignees

  • Sanas.ai Inc.

Dates

Publication Date
2026-05-07
Application Date
2025-11-21

Claims (20)

  1. A system, comprising an audio interface, a communication interface, memory having instructions stored thereon, and one or more processors coupled to the memory and configured to execute the instructions to: receive output audio data representing a target accent and comprising speech data aligned to phonetic content representing a source accent based on a differentiable alignment determined based on first and second phonetic embedding vectors, wherein: the first phonetic embedding vectors are generated from input audio data and are for the phonetic content; and the second phonetic embedding vectors are generated based on an application of a trained neural network to the first phonetic embedding vectors and correspond to first phonetic characteristics of speech data in the target accent; store the output audio data in the memory; and output the output audio data from the memory and via the audio interface.
  2. The system of claim 1, wherein the first phonetic embedding vectors represent second phonetic characteristics of input speech in the input audio data in a numerical format.
  3. The system of claim 1, wherein the neural network comprises an encoder layer configured to encode the first phonetic embedding vectors into a latent representation and a decoder layer configured to decode the latent representation to generate the second phonetic embedding vectors.
  4. The system of claim 1, wherein the differentiable alignment is determined by jointly maximizing a cosine distance between the first phonetic embedding vectors and the second phonetic embedding vectors.
  5. The system of claim 4, wherein the cosine distance is determined based on a generated dot product of a normalization of the first and second phonetic embedding vectors based on a scaling of the first and second phonetic embedding vectors to have a magnitude of one and a preservation of a relative direction of the first and second phonetic embedding vectors.
  6. The system of claim 4, wherein the joint maximization of the cosine distance is optimized based on an application of a gradient-based optimization algorithm.
  7. The system of claim 1, wherein the neural network is trained to learn a mapping between the first phonetic embedding vectors and the second phonetic embedding vectors using a labeled dataset comprising paired samples of source accent phonetic embedding vectors and corresponding target accent phonetic embedding vectors.
  8. One or more non-transitory computer-readable media having stored thereon output audio data representing a target accent and comprising speech data aligned to phonetic content representing a source accent based on a differentiable alignment determined based on first and second phonetic embedding vectors, wherein: the first phonetic embedding vectors are generated from input audio data and are for the phonetic content; and the second phonetic embedding vectors are generated based on an application of a trained neural network to the first phonetic embedding vectors and correspond to first phonetic characteristics of speech data in the target accent.
  9. The one or more non-transitory computer-readable media of claim 8, wherein the first phonetic embedding vectors encode one or more of phonetic features, patterns, phonemes, pronunciation, intonation, speech sounds, or phonetic units present in input speech in the input audio data.
  10. The one or more non-transitory computer-readable media of claim 8, wherein the output audio data is further generated based on an alignment of first frames of the speech data with corresponding second frames of the phonetic content.
  11. The one or more non-transitory computer-readable media of claim 8, wherein the output audio data is further generated based on an application of one or more techniques comprising prosody modeling, intonation adjustment, or accent-specific acoustic modeling.
  12. The one or more non-transitory computer-readable media of claim 8, wherein the output audio data is further generated based on an adjustment of a speech rate, pitch, or gender.
  13. The one or more non-transitory computer-readable media of claim 8, wherein the output audio data preserves linguistic content of the input audio data.
  14. A method, comprising: receiving output audio data representing a target accent and comprising speech data aligned to phonetic content representing a source accent based on a differentiable alignment determined based on first and second phonetic embedding vectors, wherein: the first phonetic embedding vectors are generated from input audio data and are for the phonetic content; and the second phonetic embedding vectors are generated based on an application of a trained neural network to the first phonetic embedding vectors and correspond to first phonetic characteristics of speech data in the target accent; and outputting the output audio data via an audio interface, wherein the output audio data represents an accent-converted version of the input audio data.
  15. The method of claim 14, wherein the first and second phonetic embedding vectors are pre-processed based on an application of one or more dimensionality reduction techniques.
  16. The method of claim 14, wherein the first phonetic embedding vectors represent second phonetic characteristics of input speech in the input audio data in a numerical format.
  17. The method of claim 14, wherein the neural network comprises an encoder layer configured to encode the first phonetic embedding vectors into a latent representation and a decoder layer configured to decode the latent representation to generate the second phonetic embedding vectors.
  18. The method of claim 14, wherein the differentiable alignment is determined by jointly maximizing a cosine distance between the first phonetic embedding vectors and the second phonetic embedding vectors.
  19. The method of claim 14, wherein the neural network is trained to learn a mapping between the first phonetic embedding vectors and the second phonetic embedding vectors using a labeled dataset comprising paired samples of source accent phonetic embedding vectors and corresponding target accent phonetic embedding vectors.
  20. The method of claim 14, wherein the output audio data preserves linguistic content of the input audio data.
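Claims 3 through 7 above (and their method counterparts, claims 17 through 19) together describe a trainable encoder-decoder that maps source-accent embeddings to target-accent embeddings, with a normalized dot-product cosine objective optimized by a gradient-based algorithm on labeled pairs. The PyTorch sketch below is one plausible way to read those claims side by side, not the patented implementation; the class name, layer sizes, Adam optimizer, and randomly generated stand-in dataset are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AccentTransformNet(nn.Module):
    """Encoder-decoder over phonetic embedding vectors (cf. claims 3 and 17):
    the encoder compresses each embedding to a latent representation and the
    decoder expands the latent back into a target-accent embedding."""
    def __init__(self, dim: int = 256, latent: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, latent), nn.ReLU())
        self.decoder = nn.Linear(latent, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def cosine_objective(first: torch.Tensor, second: torch.Tensor) -> torch.Tensor:
    # Claim 5: scale each vector to magnitude one, preserving its relative
    # direction, then take the dot product, i.e. the cosine between vectors.
    first = first / first.norm(dim=-1, keepdim=True)
    second = second / second.norm(dim=-1, keepdim=True)
    return (first * second).sum(dim=-1).mean()

# Hypothetical stand-in for the labeled dataset of claim 7: paired samples of
# source-accent and corresponding target-accent phonetic embedding vectors.
paired_dataset = [(torch.randn(50, 256), torch.randn(50, 256)) for _ in range(8)]

model = AccentTransformNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # gradient-based (claim 6)
for source_emb, target_emb in paired_dataset:
    transformed = model(source_emb)                    # "second" embedding vectors
    loss = -cosine_objective(transformed, target_emb)  # ascend the cosine objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Negating the objective turns the joint maximization of claims 4 and 18 into a standard minimization, so any off-the-shelf gradient-based optimizer applies.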

Description

This application is a continuation of U.S. patent application Ser. No. 18/905,439, filed Oct. 3, 2024, which is a continuation of U.S. patent application Ser. No. 18/754,280, filed Jun. 26, 2024 (now U.S. Pat. No. 12,131,745, issued Oct. 29, 2024), which claims priority to U.S. Provisional Patent Application Ser. No. 63/510,487, filed Jun. 27, 2023, each of which is hereby incorporated herein by reference in its entirety.

FIELD

This technology generally relates to audio analysis and, more particularly, to methods and systems for automatic alignment of phonetic content for real-time accent conversion.

BACKGROUND

Real-time accent conversion is the process of transforming speech from one accent to another in real-time. For instance, a speaker with an Indian accent could have their speech automatically converted into an American accent while they are speaking. This transformation involves aligning phonetically dissimilar audio of two accents, which can be challenging due to the unique pronunciation style of each speaker and associated accent.

One approach to aligning two audio sequences uses a dynamic time warping (DTW) algorithm. DTW finds the optimal temporal alignment of two sequences by stretching or compressing them in time. However, DTW has limitations, such as being non-differentiable and not providing gradient information, as the sketch below makes concrete.
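For reference, here is a minimal NumPy rendering of classic DTW (not part of the disclosure). The hard min over the three predecessor cells is exactly the discrete choice that blocks gradient flow and makes the algorithm non-differentiable:

```python
import numpy as np

def dtw_cost(x: np.ndarray, y: np.ndarray) -> float:
    """Classic dynamic time warping between two feature sequences,
    stretching or compressing them in time to find the optimal alignment."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            # Hard, discrete min over match / insertion / deletion:
            # no gradient information flows through this choice.
            D[i, j] = step + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])  # total alignment cost
```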
As a result, training an accent conversion model of an accent conversion system using DTW requires two separate steps: first, using DTW to align the audio of the two accents, and second, training the accent conversion model on the aligned data. This approach can limit the overall performance of the accent conversion system, since the accent conversion model can only learn from the aligned data and not from the original audio.

Non-differentiability is also a significant issue that makes it difficult to train an accent conversion model effectively using DTW, thereby limiting its performance in real-world scenarios. Specifically, the non-differentiability of DTW makes it challenging to optimize current accent conversion systems using gradient-based methods, which are widely used in deep learning models. This limitation can lead to inaccuracies and errors in the accent conversion process and, in turn, to poor-quality audio signals.

Non-monotonicity and instability are other significant issues that lead to alignment errors and negatively impact the accuracy of current accent conversion systems. Non-monotonicity refers to the fact that some alignment algorithms, including DTW, do not always guarantee that the alignment will be strictly increasing in time, which may lead to alignment errors and inaccurate accent conversions. Instability refers to the fact that an alignment algorithm may produce different results when the input signals are slightly perturbed, leading to inconsistencies in the accent conversion process.

Another deficiency of existing accent conversion methods is that they do not handle complex accents that deviate significantly from the data used to train the accent conversion model; in such cases, current accent conversion systems may produce inaccurate or inconsistent results. Additionally, existing accent conversion methods are not able to capture the nuances and variations of different accents accurately, which may affect the naturalness and intelligibility of the converted speech.

Furthermore, existing accent conversion methods require a significant amount of training data, which may be a challenge to collect and annotate, limiting the scalability of current systems and making it challenging for them to adapt to new accents or languages. These and other limitations make it challenging to develop and deploy effective real-time accent conversion models and systems that accurately convert accented speech in different audio signals. Accordingly, current accent conversion systems have limited performance, accuracy, and effectiveness for real-time accent conversion.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosed technology is illustrated by way of example and not limitation in the accompanying figures, in which like references indicate similar elements:

FIG. 1 is a block diagram of an exemplary network environment that includes an accent conversion system;

FIG. 2 is a block diagram of an exemplary storage device of the accent conversion system of FIG. 1; and

FIG. 3 is a flowchart of an exemplary method for automatic alignment of phonetic content for real-time accent conversion.

DETAILED DESCRIPTION

Examples described below may be used to provide a method, a device (e.g., non-transitory computer readable medium), an apparatus, and/or a system for automatic alignment of phonetic content for real-time accent conversion. Although the technology has been described with reference to specific examples, various modifications may be made to these examples without departing from the broader spirit and scope of the various examples.