CN-116524902-B - Word pronunciation scoring method and device, electronic equipment and storage medium


Abstract

The application provides a word pronunciation scoring method and device, electronic equipment, and a storage medium. The method comprises: inputting the audio features of each audio frame in the pronunciation audio into a phoneme detection model to obtain the pronunciation probability of each audio frame for each standard pronunciation phoneme of a word; constructing a two-dimensional table; taking the grid in the first row and first column of the table as the starting point, moving directly from the latest starting point to the grid immediately to its right or the grid immediately below it, taking the grid now occupied as the new starting point, and repeating this step until no further move is possible; taking the final grid as the end point and determining the moving paths between the initial starting point and the end point; and determining a preliminary pronunciation score based on the pronunciation probabilities written in the grids passed by a target moving path and on the number of standard pronunciation phonemes. The application scores the user's word pronunciation at the phoneme level, so that the user's mispronunciations can be corrected more accurately.
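
As a rough illustration of the table-and-path step summarized above, the following Python sketch fills a small table with made-up probabilities and searches the paths that move only right or down (as in claim 1) from the top-left grid to the bottom-right grid. It rests on assumptions that this text does not state: that the target moving path is the one with the largest probability sum, and that the preliminary score is that sum divided by the number of standard pronunciation phonemes. The names `best_path` and `preliminary_score` are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def best_path(table: List[List[float]]) -> Tuple[float, List[Cell]]:
    """Return (probability sum, cells) of a max-sum path from the top-left
    grid to the bottom-right grid, using only right and down moves.

    ASSUMPTION: the "target moving path" is taken to be the max-sum path.
    """
    n, m = len(table), len(table[0])  # n audio frames (rows), m phonemes (columns)
    # best[i][j] = largest probability sum of any path ending at grid (i, j)
    best = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                prev = 0.0
            elif i == 0:
                prev = best[i][j - 1]      # first row: reachable only from the left
            elif j == 0:
                prev = best[i - 1][j]      # first column: reachable only from above
            else:
                prev = max(best[i - 1][j], best[i][j - 1])
            best[i][j] = table[i][j] + prev

    # Walk back from the end point to recover the grids on the path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i == 0:
            j -= 1
        elif j == 0:
            i -= 1
        elif best[i - 1][j] >= best[i][j - 1]:
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    path.reverse()
    return best[n - 1][m - 1], path


def preliminary_score(table: List[List[float]]) -> float:
    """ASSUMED formula: path probability sum divided by the phoneme count m."""
    total, _ = best_path(table)
    return total / len(table[0])


if __name__ == "__main__":
    # Made-up probabilities: 3 audio frames (rows) x 2 standard phonemes (columns).
    table = [
        [0.9, 0.1],
        [0.8, 0.2],
        [0.1, 0.7],
    ]
    total, path = best_path(table)
    print(path, round(total, 3), round(preliminary_score(table), 3))
```

With these example values the selected path is (0, 0), (1, 0), (1, 1), (2, 1), with a probability sum of about 2.6. Because the assumed normalization divides by the number of phonemes rather than the path length, the result is not confined to the range 0 to 1; a real scoring formula may normalize differently.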

Inventors

LUO JUNSONG
LIANG DENG
HAO XUEYUAN
ZHENG CHEN
TANG JIAQIANG
DONG JINPENG
LI MINGMING

Assignees

北京外研在线数字科技有限公司

Dates

Publication Date: 20260508
Application Date: 20230506

Claims (10)

1. A word pronunciation scoring method, the method comprising:
receiving pronunciation audio of a target word from a user;
inputting the audio features of each audio frame in the pronunciation audio into a trained phoneme detection model for phoneme recognition, so as to obtain the pronunciation probability of each audio frame for each standard pronunciation phoneme of the target word;
constructing a target two-dimensional table, wherein the target two-dimensional table comprises n × m grids, n is the number of currently existing audio frames, m is the number of standard pronunciation phonemes, and the grid in the ith row and jth column of the target two-dimensional table is written with the pronunciation probability of the ith currently existing audio frame, in chronological order, for the jth standard pronunciation phoneme of the target word, where 1 ≤ i ≤ n and 1 ≤ j ≤ m;
taking the grid in the first row and first column of the target two-dimensional table as the starting point, moving directly from the latest starting point to the grid immediately to its right or the grid immediately below it, taking the grid now occupied as the new starting point, and repeating this step until no further move is possible; taking the grid finally occupied as the end point, and determining the moving paths between the initial starting point and the end point; and
determining, based on the pronunciation probabilities written in the grids passed by a target moving path and on the number of standard pronunciation phonemes, a preliminary pronunciation score representing the accuracy of the user's pronunciation of the target word, wherein the target moving path is selected from the moving paths according to their probability sums, the probability sum of each moving path being the sum of the pronunciation probabilities written in the grids that the moving path passes through, and a higher preliminary pronunciation score represents a more accurate pronunciation of the target word by the user.
2. The word pronunciation scoring method of claim 1, wherein the phoneme detection model is a CTC model comprising, in order, a batch normalization layer, a zero-padding layer, a convolutional neural network layer, a max-pooling layer, a batch normalization layer, a gated recurrent unit (GRU) layer, a time-distributed dense layer, and a CTC output layer.
3. The word pronunciation scoring method of claim 1, wherein inputting the audio features of each audio frame in the pronunciation audio into the trained phoneme detection model for phoneme recognition, so as to obtain the pronunciation probability of each audio frame for each standard pronunciation phoneme of the target word, comprises:
inputting the audio features of each audio frame in the pronunciation audio into the phoneme detection model for phoneme recognition, so as to obtain the pronunciation probability of each audio frame for each phoneme among all phonemes;
and wherein, after determining the preliminary pronunciation score representing the accuracy of the user's pronunciation of the target word based on the pronunciation probabilities written in the grids passed by the target moving path and on the number of standard pronunciation phonemes, the method further comprises:
for each currently existing audio frame, determining the phoneme with the highest pronunciation probability among all phonemes for that audio frame as the actual pronunciation phoneme of the audio frame;
determining the user's pronunciation status for the target word according to the actual pronunciation phonemes of the audio frames, arranged in chronological order, and the standard pronunciation phonemes of the target word, wherein the pronunciation status comprises misreading, multi-reading, and consistency; and
inputting the standard pronunciation phonemes of the target word, the preliminary pronunciation score, the pronunciation probabilities written in the grids passed by the target moving path, and the pronunciation status into a trained word pronunciation scoring model to obtain a final pronunciation score of the user for the target word, wherein the word pronunciation scoring model is an XGBoost regression model.
4. The word pronunciation scoring method of claim 3, wherein, after inputting the audio features of each audio frame in the pronunciation audio into the trained phoneme detection model for phoneme recognition, the method further comprises:
if at least two consecutive reference audio frames with the same actual pronunciation phoneme appear, removing every reference audio frame after the first reference audio frame; and
for each audio frame, if the actual pronunciation phoneme of the audio frame is a blank phoneme, removing the audio frame.
5. A word pronunciation scoring device, the device comprising:
a receiving module, configured to receive pronunciation audio of a target word from a user;
an input module, configured to input the audio features of each audio frame in the pronunciation audio into a trained phoneme detection model for phoneme recognition, so as to obtain the pronunciation probability of each audio frame for each standard pronunciation phoneme of the target word;
a table construction module, configured to construct a target two-dimensional table, wherein the target two-dimensional table comprises n × m grids, n is the number of currently existing audio frames, m is the number of standard pronunciation phonemes, and the grid in the ith row and jth column of the target two-dimensional table is written with the pronunciation probability of the ith currently existing audio frame, in chronological order, for the jth standard pronunciation phoneme of the target word, where 1 ≤ i ≤ n and 1 ≤ j ≤ m;
a path determining module, configured to take the grid in the first row and first column of the target two-dimensional table as the starting point, move directly from the latest starting point to the grid immediately to its right or the grid immediately below it, take the grid now occupied as the new starting point, and repeat this step until no further move is possible; and to take the grid finally occupied as the end point and determine the moving paths between the initial starting point and the end point; and
a preliminary scoring module, configured to determine, based on the pronunciation probabilities written in the grids passed by a target moving path and on the number of standard pronunciation phonemes, a preliminary pronunciation score representing the accuracy of the user's pronunciation of the target word, wherein the target moving path is selected from the moving paths according to their probability sums, the probability sum of each moving path being the sum of the pronunciation probabilities written in the grids that the moving path passes through, and a higher preliminary pronunciation score represents a more accurate pronunciation of the target word by the user.
6. The word pronunciation scoring device of claim 5, wherein the phoneme detection model is a CTC model comprising, in order, a batch normalization layer, a zero-padding layer, a convolutional neural network layer, a max-pooling layer, a batch normalization layer, a gated recurrent unit (GRU) layer, a time-distributed dense layer, and a CTC output layer.
7. The word pronunciation scoring device of claim 5, wherein the input module is specifically configured to:
input the audio features of each audio frame in the pronunciation audio into the phoneme detection model for phoneme recognition, so as to obtain the pronunciation probability of each audio frame for each phoneme among all phonemes;
and wherein the device further comprises:
a phoneme determining module, configured to, after the preliminary pronunciation score representing the accuracy of the user's pronunciation of the target word has been determined based on the pronunciation probabilities written in the grids passed by the target moving path and on the number of standard pronunciation phonemes, determine, for each currently existing audio frame, the phoneme with the highest pronunciation probability among all phonemes for that audio frame as the actual pronunciation phoneme of the audio frame;
a pronunciation status determining module, configured to determine the user's pronunciation status for the target word according to the actual pronunciation phonemes of the audio frames, arranged in chronological order, and the standard pronunciation phonemes of the target word, wherein the pronunciation status comprises misreading, multi-reading, and consistency; and
a final scoring module, configured to input the standard pronunciation phonemes of the target word, the preliminary pronunciation score, the pronunciation probabilities written in the grids passed by the target moving path, and the pronunciation status into a trained word pronunciation scoring model to obtain a final pronunciation score of the user for the target word, wherein the word pronunciation scoring model is an XGBoost regression model.
8. The word pronunciation scoring device of claim 7, further comprising:
a first removing module, configured to, after the audio features of each audio frame in the pronunciation audio have been input into the trained phoneme detection model for phoneme recognition to obtain the pronunciation probability of each audio frame for each standard pronunciation phoneme of the target word, remove every reference audio frame after the first reference audio frame if at least two consecutive reference audio frames with the same actual pronunciation phoneme appear; and
a second removing module, configured to, for each audio frame, remove the audio frame if the actual pronunciation phoneme of the audio frame is a blank phoneme.
9. An electronic device comprising a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor in communication with the storage medium via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the word pronunciation scoring method according to any one of claims 1 to 4.
10. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the word pronunciation scoring method according to any one of claims 1 to 4.
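
The frame clean-up recited in claims 4 and 8 resembles greedy CTC decoding, and can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the label `<blank>`, the function name `decode_frames`, the phoneme labels, and all probabilities are made up for the example, and the order of operations (merge repeated frames first, then drop blank frames) follows the order in which claim 8 lists its two removing modules.

```python
from typing import List, Sequence, Tuple

BLANK = "<blank>"  # hypothetical label for the blank (null) phoneme


def decode_frames(
    frame_probs: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[Tuple[int, str]]:
    """Return (frame index, actual phoneme) for the frames that survive clean-up.

    frame_probs[i][k] is the probability of audio frame i for phoneme labels[k].
    """
    # Actual pronunciation phoneme of a frame = phoneme with the highest probability.
    actual = [
        labels[max(range(len(probs)), key=lambda k: probs[k])]
        for probs in frame_probs
    ]

    # Step 1: in each run of consecutive frames with the same actual phoneme,
    # keep only the first frame.
    kept: List[Tuple[int, str]] = []
    for index, phoneme in enumerate(actual):
        if not kept or kept[-1][1] != phoneme:
            kept.append((index, phoneme))

    # Step 2: remove frames whose actual phoneme is the blank phoneme.
    return [(index, phoneme) for index, phoneme in kept if phoneme != BLANK]


if __name__ == "__main__":
    labels = [BLANK, "k", "ae", "t"]  # made-up phoneme inventory
    frame_probs = [
        [0.10, 0.80, 0.05, 0.05],  # k
        [0.20, 0.70, 0.05, 0.05],  # k (repeat, removed)
        [0.90, 0.05, 0.03, 0.02],  # blank (removed)
        [0.10, 0.10, 0.70, 0.10],  # ae
        [0.10, 0.10, 0.60, 0.20],  # ae (repeat, removed)
        [0.80, 0.10, 0.05, 0.05],  # blank (removed)
        [0.10, 0.10, 0.10, 0.70],  # t
    ]
    print(decode_frames(frame_probs, labels))
```

For the example input, frames 0, 3, and 6 remain, carrying the phonemes k, ae, and t. The surviving frames are the "currently existing" audio frames whose count gives n, the number of rows of the target two-dimensional table.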

Description

Word pronunciation scoring method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of natural language processing, and in particular to a word pronunciation scoring method and device, electronic equipment, and a storage medium.

Background

With the continuous development of technology, a user's pronunciation of words and sentences can now be scored to reflect its accuracy, helping the user learn a language better. However, existing scoring methods are generally coarse-grained: they can accurately reflect how well the user pronounces a sentence, but not how well the user pronounces a word. In other words, existing methods cannot accurately judge the quality of the user's word pronunciation at the phoneme level.

Disclosure of Invention

In view of the above, an object of the present application is to provide a word pronunciation scoring method and device, electronic equipment, and a storage medium that score the user's word pronunciation at the phoneme level, so that the user's mispronunciations can be corrected more accurately.

In a first aspect, an embodiment of the present application provides a word pronunciation scoring method, the method comprising:
receiving pronunciation audio of a target word from a user;
inputting the audio features of each audio frame in the pronunciation audio into a trained phoneme detection model for phoneme recognition, so as to obtain the pronunciation probability of each audio frame for each standard pronunciation phoneme of the target word;
constructing a target two-dimensional table, wherein the target two-dimensional table comprises n × m grids, n is the number of currently existing audio frames, m is the number of standard pronunciation phonemes, and the grid in the ith row and jth column of the target two-dimensional table is written with the pronunciation probability of the ith currently existing audio frame, in chronological order, for the jth standard pronunciation phoneme of the target word, where 1 ≤ i ≤ n and 1 ≤ j ≤ m;
taking the grid in the first row and first column of the target two-dimensional table as the starting point, moving directly from the latest starting point to the grid immediately to its right or the grid immediately below it, taking the grid now occupied as the new starting point, and repeating this step until no further move is possible; taking the grid finally occupied as the end point, and determining the moving paths between the initial starting point and the end point; and
determining, based on the pronunciation probabilities written in the grids passed by a target moving path and on the number of standard pronunciation phonemes, a preliminary pronunciation score representing the accuracy of the user's pronunciation of the target word, wherein the target moving path is selected from the moving paths according to their probability sums, the probability sum of each moving path being the sum of the pronunciation probabilities written in the grids that the moving path passes through, and a higher preliminary pronunciation score represents a more accurate pronunciation of the target word by the user.

In one possible implementation, the phoneme detection model is a CTC model comprising, in order, a batch normalization layer, a zero-padding layer, a convolutional neural network layer, a max-pooling layer, a batch normalization layer, a gated recurrent unit (GRU) layer, a time-distributed dense layer, and a CTC output layer.

In one possible implementation, inputting the audio features of each audio frame in the pronunciation audio into the trained phoneme detection model for phoneme recognition, so as to obtain the pronunciation probability of each audio frame for each standard pronunciation phoneme of the target word, comprises:
inputting the audio features of each audio frame in the pronunciation audio into the phoneme detection model for phoneme recognition, so as to obtain the pronunciation probability of each audio frame for each phoneme among all phonemes;
after determining the preliminary pronunciation score representing the accuracy of the user's pronunciation of the target word based on the pronunciation probabilities written in the grids passed by the target moving path and on the number of standard pronunciation phonemes, the method further comprises: For each c