CN-122024774-A - Pronunciation assessment method, pronunciation assessment device, electronic equipment and storage medium

Abstract

The invention provides a pronunciation assessment method, a pronunciation assessment device, an electronic device, and a storage medium. The method comprises: encoding read-aloud speech to obtain a speech vector; obtaining a phoneme sequence of the read-aloud text corresponding to the read-aloud speech, and encoding the phoneme sequence to obtain a sequence vector; fusing the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain a fusion vector; and performing pronunciation assessment on the read-aloud speech based on the fusion vector. The method, device, electronic device, and storage medium softly align the phonemes in the read-aloud speech with the phoneme sequence in vector space and assess pronunciation from the fusion vector, so that no forced alignment is required. This avoids assessment errors caused by inaccurate forced alignment, ensures the accuracy and reliability of the assessment, and improves assessment efficiency, because neither a complex decoding network nor global decoding by dynamic programming is needed.

Inventors

Wang Bingjue
WU KUI
ZHANG KAIBO
SHENG ZHICHAO
WANG SHIJIN
LIU CONG
HU GUOPING

Assignees

iFLYTEK Co., Ltd. (科大讯飞股份有限公司)

Dates

Publication Date: 20260512
Application Date: 20260325

Claims (10)

1. A pronunciation assessment method, comprising: encoding read-aloud speech to obtain a speech vector; acquiring a phoneme sequence of a read-aloud text corresponding to the read-aloud speech, and encoding the phoneme sequence to obtain a sequence vector; fusing the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain a fusion vector; and performing pronunciation assessment on the read-aloud speech based on the fusion vector.
2. The pronunciation assessment method of claim 1, wherein the fusing of the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain a fusion vector comprises: performing cross-attention interaction between the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain the fusion vector.
3. The pronunciation assessment method of claim 1, further comprising: performing self-attention interaction on the speech vector based on the similarity between units in the speech vector; and performing self-attention interaction on the sequence vector based on the similarity between units in the sequence vector.
4. The pronunciation assessment method of any one of claims 1 to 3, wherein the encoding of the read-aloud speech to obtain a speech vector comprises: encoding the read-aloud speech to obtain an initial vector of the read-aloud speech; and performing self-attention interaction on the initial vector, based on the similarity among units belonging to the same window in the initial vector, to obtain the speech vector of the read-aloud speech.
5. The pronunciation assessment method of any one of claims 1 to 3, wherein the acquiring of a phoneme sequence of a read-aloud text corresponding to the read-aloud speech comprises: acquiring a plurality of candidate phoneme sequences of the read-aloud text corresponding to the read-aloud speech, wherein different candidate phoneme sequences correspond to different pronunciations of a polyphonic word in the read-aloud text; encoding each of the candidate phoneme sequences to obtain a candidate sequence vector; and determining the phoneme sequence of the read-aloud text from the plurality of candidate phoneme sequences based on the similarity between the speech vector and each candidate sequence vector.
6. The pronunciation assessment method of any one of claims 1 to 3, wherein the performing of pronunciation assessment on the read-aloud speech based on the fusion vector comprises: performing pronunciation assessment on the read-aloud speech based on the fusion vector and a pronunciation assessment scenario corresponding to the read-aloud speech.
7. A pronunciation assessment device, comprising: a speech encoding unit, configured to encode read-aloud speech to obtain a speech vector; a phoneme encoding unit, configured to obtain a phoneme sequence of a read-aloud text corresponding to the read-aloud speech and to encode the phoneme sequence to obtain a sequence vector; a fusion unit, configured to fuse the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain a fusion vector; and an assessment unit, configured to perform pronunciation assessment on the read-aloud speech based on the fusion vector.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor implements the pronunciation assessment method of any one of claims 1 to 6 when executing the computer program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the pronunciation assessment method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the pronunciation assessment method as claimed in any one of claims 1 to 6.
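
As an illustration of the candidate selection recited in claim 5, the following is a minimal sketch, not the claimed implementation. It assumes that the speech vector and each candidate sequence vector are lists of equal-dimension unit vectors, and that "similarity" is the cosine similarity of their mean-pooled forms. The claims do not fix a similarity measure, so the pooling, the cosine measure, and the names `mean_pool`, `cosine`, and `pick_candidate` are all hypothetical.

```python
import math


def mean_pool(units):
    """Average a list of unit vectors into one vector (assumed pooling)."""
    n = len(units)
    return [sum(u[d] for u in units) / n for d in range(len(units[0]))]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def pick_candidate(speech_vec, candidate_vecs):
    """Return (index, scores) of the candidate most similar to the speech.

    speech_vec:     list of frame vectors for the read-aloud speech
    candidate_vecs: one list of phoneme vectors per candidate phoneme
                    sequence (one candidate per pronunciation variant)
    """
    pooled = mean_pool(speech_vec)
    scores = [cosine(pooled, mean_pool(c)) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data: the second candidate points the same way as the speech frames.
speech = [[1.0, 0.0], [0.8, 0.2]]
candidates = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.1]],
]
best, scores = pick_candidate(speech, candidates)
```

In this toy case the second candidate (index 1) scores highest. A trained system would compare learned encoder outputs rather than hand-written vectors.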

Description

Pronunciation assessment method, pronunciation assessment device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of speech processing technologies, and in particular to a pronunciation assessment method, a pronunciation assessment device, an electronic device, and a storage medium.

Background

Pronunciation assessment is an important task in computer-aided language learning (CALL) and is widely used in spoken-language examinations and learning scenarios. Currently, mainstream pronunciation assessment methods depend on the forced alignment (FA) of the read-aloud speech with the read-aloud text. In real examination and learning scenarios, however, the speaker may not follow the given text strictly. Omissions, repetitions, and insertions degrade the accuracy of the forced alignment result and, in turn, the accuracy of the pronunciation assessment result.

Disclosure of Invention

The invention provides a pronunciation assessment method, a pronunciation assessment device, an electronic device, and a storage medium, to address the low accuracy that results from the reliance of related-art pronunciation assessment on forced alignment results.

The invention provides a pronunciation assessment method, comprising: encoding read-aloud speech to obtain a speech vector; acquiring a phoneme sequence of a read-aloud text corresponding to the read-aloud speech, and encoding the phoneme sequence to obtain a sequence vector; fusing the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain a fusion vector; and performing pronunciation assessment on the read-aloud speech based on the fusion vector.
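
The similarity-based fusion step can be illustrated with a minimal sketch. It is one possible reading, not the disclosed implementation, and rests on several assumptions: each phoneme unit acts as a query over the speech frames, similarity is a scaled dot product normalized by softmax, no learned projections are used, and the fusion vector is the concatenation of each phoneme unit with its attended speech context. The names `softmax`, `dot`, and `fuse` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse(speech, phonemes):
    """Softly align each phoneme unit with the speech frames.

    speech:   list of frame vectors (the speech vector)
    phonemes: list of phoneme vectors (the sequence vector)
    Returns (fused, weights): fused[i] is phoneme i concatenated with a
    similarity-weighted average of the speech frames; weights[i][j] is
    the soft-alignment weight of frame j for phoneme i.
    """
    scale = math.sqrt(len(phonemes[0]))
    dim = len(speech[0])
    fused, weights = [], []
    for p in phonemes:
        w = softmax([dot(p, f) / scale for f in speech])
        context = [sum(wj * f[d] for wj, f in zip(w, speech)) for d in range(dim)]
        fused.append(list(p) + context)  # assumed fusion: concatenation
        weights.append(w)
    return fused, weights


# Toy data: 3 speech frames, 2 phoneme units, dimension 2.
speech = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
phonemes = [[4.0, 0.0], [0.0, 4.0]]
fused, weights = fuse(speech, phonemes)
```

Each row of `weights` sums to 1, so every phoneme is spread over the frames it resembles instead of being assigned a hard time boundary. That is the sense in which the alignment is soft and no forced alignment is needed.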
According to the pronunciation assessment method provided by the invention, fusing the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain a fusion vector comprises: performing cross-attention interaction between the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain the fusion vector.

The pronunciation assessment method provided by the invention further comprises: performing self-attention interaction on the speech vector based on the similarity between units in the speech vector; and performing self-attention interaction on the sequence vector based on the similarity between units in the sequence vector.

According to the pronunciation assessment method provided by the invention, encoding the read-aloud speech to obtain a speech vector comprises: encoding the read-aloud speech to obtain an initial vector of the read-aloud speech; and performing self-attention interaction on the initial vector, based on the similarity among units belonging to the same window in the initial vector, to obtain the speech vector of the read-aloud speech.
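
The window-restricted self-attention used to obtain the speech vector can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: windows are taken to be non-overlapping blocks of a fixed number of frames (the text says only that units belong to "the same window"), attention is a scaled dot product without learned projections, and the function name `windowed_self_attention` and its `window` parameter are hypothetical.

```python
import math


def windowed_self_attention(frames, window):
    """Self-attention in which each frame attends only to frames that
    fall in the same non-overlapping window of `window` frames.

    frames: list of frame vectors (the initial vector)
    Returns a list of the same length (the speech vector).
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    out = []
    for start in range(0, len(frames), window):
        block = frames[start:start + window]
        for q in block:
            # Similarity of this frame to every frame in its own window.
            scores = [sum(a * b for a, b in zip(q, k)) / scale for k in block]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            # Similarity-weighted average of the frames in the window.
            out.append([sum(e / z * k[d] for e, k in zip(exps, block))
                        for d in range(dim)])
    return out


frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
local = windowed_self_attention(frames, window=2)   # frames mix within pairs
whole = windowed_self_attention(frames, window=4)   # one window: global mixing
```

With `window=2` the two halves of the toy input never interact. With `window=4` the call reduces to ordinary self-attention over all frames. Restricting the window keeps each frame focused on local acoustic context and reduces the number of similarity computations.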
According to the pronunciation assessment method provided by the invention, acquiring the phoneme sequence of the read-aloud text corresponding to the read-aloud speech comprises: acquiring a plurality of candidate phoneme sequences of the read-aloud text corresponding to the read-aloud speech, wherein different candidate phoneme sequences correspond to different pronunciations of a polyphonic word in the read-aloud text; encoding each of the candidate phoneme sequences to obtain a candidate sequence vector; and determining the phoneme sequence of the read-aloud text from the plurality of candidate phoneme sequences based on the similarity between the speech vector and each candidate sequence vector.

According to the pronunciation assessment method provided by the invention, performing pronunciation assessment on the read-aloud speech based on the fusion vector comprises: performing pronunciation assessment on the read-aloud speech based on the fusion vector and a pronunciation assessment scenario corresponding to the read-aloud speech.

The invention also provides a pronunciation assessment device, comprising: a speech encoding unit, configured to encode read-aloud speech to obtain a speech vector; a phoneme encoding unit, configured to obtain a phoneme sequence of a read-aloud text corresponding to the read-aloud speech and to encode the phoneme sequence to obtain a sequence vector; a fusion unit, configured to fuse the speech vector and the sequence vector, based on the similarity between each unit in the speech vector and each unit in the sequence vector, to obtain a fusion vector; and an assessment unit, configured to perform pronunciation assessment on the read-aloud speech based on the fusion vector.

The invention also provides an electronic device comprising a memory, a processor and