CN-122024734-A - Data calibration method, device, equipment and medium


Abstract

The invention relates to the technical field of data processing and discloses a data calibration method, device, equipment and medium. The method comprises: performing phoneme recognition on target audio through a target phoneme recognition model to obtain the predicted phonemes and prediction probabilities of each audio frame; determining the mapped phonemes of each sub-word in the transcribed text corresponding to the target audio according to a preset mapping table; for any audio frame, matching the audio frame against each sub-word according to the predicted phonemes of the audio frame, the prediction probabilities of the corresponding predicted phonemes and the mapped phonemes of each sub-word, and determining a target sub-word corresponding to the audio frame according to the matching result; calibrating the predicted phonemes of the audio frame according to the mapped phonemes of the target sub-word; and fusing the calibration results of the predicted phonemes of all audio frames to form a final calibration result. The invention can be applied in the field of financial technology; it improves the accuracy of phoneme recognition and provides a more reliable acoustic characterization basis for downstream speech processing tasks that depend on phoneme recognition.

Inventors

SHI YAN
CHEN MINCHUAN

Assignees

Ping An Technology (Shenzhen) Co., Ltd. (平安科技（深圳）有限公司)

Dates

Publication Date: 20260512
Application Date: 20260115

Claims (10)

1. A data calibration method, comprising: acquiring target audio and a transcribed text corresponding to the target audio, and performing phoneme recognition on the target audio through a target phoneme recognition model to obtain a predicted phoneme of each audio frame in the target audio and a prediction probability of the corresponding predicted phoneme; dividing the transcribed text according to pronunciation units to obtain sub-words, and determining mapped phonemes of each sub-word according to a preset mapping table, wherein the preset mapping table comprises mapping relations between each sub-word and its corresponding mapped phonemes; for any audio frame, matching the audio frame against each sub-word according to the predicted phoneme of the audio frame, the prediction probability of the corresponding predicted phoneme and the mapped phonemes of each sub-word to obtain a matching result, and determining a target sub-word corresponding to the audio frame according to the matching result; and calibrating the predicted phonemes of the audio frame according to the mapped phonemes of the target sub-word to obtain a calibration result of the predicted phonemes of the audio frame, and fusing the calibration results of the predicted phonemes of all the audio frames to form a final calibration result.
2. The data calibration method according to claim 1, wherein the performing phoneme recognition on the target audio through a target phoneme recognition model to obtain a predicted phoneme of each audio frame in the target audio and a prediction probability of the corresponding predicted phoneme comprises: inputting the target audio into an encoder of the target phoneme recognition model, and extracting features of the target audio through the encoder to obtain acoustic features corresponding to the target audio; and inputting the acoustic features into a decoder of the target phoneme recognition model, and decoding through the decoder to obtain the predicted phoneme of each audio frame and the prediction probability of the corresponding predicted phoneme.
3. The data calibration method according to claim 1, wherein, for any audio frame, the matching the audio frame against each sub-word according to the predicted phoneme of the audio frame, the prediction probability of the corresponding predicted phoneme and the mapped phonemes of each sub-word to obtain a matching result, and determining the target sub-word corresponding to the audio frame according to the matching result comprises: for any sub-word, calculating, according to the predicted phoneme of the audio frame, the prediction probability of the corresponding predicted phoneme and the mapped phonemes of the sub-word, a prediction probability score characterizing the prediction of the target phoneme recognition model, on the audio frame, for the mapped phonemes of the sub-word; and determining, from the matching result, the sub-word with the highest prediction probability score as the target sub-word corresponding to the audio frame.
4. The data calibration method according to claim 3, wherein, for any sub-word, the calculating, according to the predicted phoneme of the audio frame, the prediction probability of the corresponding predicted phoneme and the mapped phonemes of the sub-word, a prediction probability score characterizing the prediction of the target phoneme recognition model, on the audio frame, for the mapped phonemes of the sub-word comprises: comparing each mapped phoneme of the sub-word with each predicted phoneme of the audio frame for consistency; if the mapped phonemes of the sub-word include a mapped phoneme that is consistent with a predicted phoneme of the audio frame, determining that mapped phoneme as a target mapped phoneme, and determining the predicted phoneme of the audio frame that is consistent with the target mapped phoneme as a target predicted phoneme; and summing the prediction probabilities of the target predicted phonemes corresponding to all the target mapped phonemes of the sub-word to obtain the prediction probability score characterizing the prediction of the target phoneme recognition model, on the audio frame, for the mapped phonemes of the sub-word.
5. The data calibration method according to claim 4, further comprising, after the comparing each mapped phoneme of the sub-word with each predicted phoneme of the audio frame for consistency: if none of the mapped phonemes of the sub-word is consistent with any predicted phoneme of the audio frame, determining that the prediction probability score characterizing the prediction of the target phoneme recognition model, on the audio frame, for the mapped phonemes of the sub-word is zero.
6. The data calibration method according to claim 4, wherein the calibrating the predicted phonemes of the audio frame according to the mapped phonemes of the target sub-word to obtain the calibration result of the predicted phonemes of the audio frame comprises: determining, according to the prediction probabilities of the predicted phonemes of the audio frame, the predicted phoneme with the highest prediction probability as a predicted phoneme to be calibrated; and if the predicted phoneme to be calibrated is not one of the target predicted phonemes of the audio frame that are consistent with the target mapped phonemes of the target sub-word, determining, from among those target predicted phonemes, the target predicted phoneme with the highest prediction probability as the calibration result of the predicted phonemes of the audio frame.
7. The data calibration method according to claim 6, further comprising, after the determining the predicted phoneme with the highest prediction probability as the predicted phoneme to be calibrated: if the predicted phoneme to be calibrated is one of the target predicted phonemes of the audio frame that are consistent with the target mapped phonemes of the target sub-word, determining the predicted phoneme to be calibrated as the calibration result of the predicted phonemes of the audio frame.
8. A data calibration apparatus, comprising: an acquisition module, configured to acquire target audio and a transcribed text corresponding to the target audio, and to perform phoneme recognition on the target audio through a target phoneme recognition model to obtain a predicted phoneme of each audio frame in the target audio and a prediction probability of the corresponding predicted phoneme; a mapping module, configured to divide the transcribed text according to pronunciation units to obtain sub-words, and to determine mapped phonemes of each sub-word according to a preset mapping table, wherein the preset mapping table comprises mapping relations between each sub-word and its corresponding mapped phonemes; a matching module, configured to, for any audio frame, match the audio frame against each sub-word according to the predicted phoneme of the audio frame, the prediction probability of the corresponding predicted phoneme and the mapped phonemes of each sub-word to obtain a matching result, and to determine a target sub-word corresponding to the audio frame according to the matching result; and a calibration module, configured to calibrate the predicted phonemes of the audio frame according to the mapped phonemes of the target sub-word to obtain a calibration result of the predicted phonemes of the audio frame, and to fuse the calibration results of the predicted phonemes of all the audio frames to form a final calibration result.
9. A computer device, comprising a processor, a memory and a computer program stored in the memory and executable on the processor, wherein the processor implements the data calibration method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data calibration method according to any one of claims 1 to 7.
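
The per-frame logic of claims 3 to 7, together with the fusion step of claim 1, can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes that each audio frame is represented as a dictionary from candidate predicted phonemes to prediction probabilities, that fusion simply collects the per-frame results in frame order, and that a frame with no matching phoneme at all keeps its top prediction (the claims do not address that case). All function names and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Frame = Dict[str, float]            # predicted phoneme -> prediction probability
SubWord = Tuple[str, List[str]]     # (sub-word, its mapped phonemes)


def subword_score(frame: Frame, mapped_phonemes: List[str]) -> float:
    """Claims 4 and 5: sum the probabilities of the frame's predicted phonemes
    that are consistent with a mapped phoneme; 0.0 if none is."""
    return sum((frame[p] for p in set(mapped_phonemes) if p in frame), 0.0)


def pick_target_subword(frame: Frame, subwords: List[SubWord]) -> int:
    """Claim 3: index of the sub-word with the highest score on this frame."""
    scores = [subword_score(frame, phonemes) for _, phonemes in subwords]
    return max(range(len(subwords)), key=scores.__getitem__)


def calibrate_frame(frame: Frame, target_phonemes: List[str]) -> str:
    """Claims 6 and 7: keep the top prediction if it matches the target
    sub-word, otherwise use the most probable matching prediction."""
    top = max(frame, key=frame.get)
    matched = [p for p in frame if p in target_phonemes]
    if top in matched or not matched:   # 'not matched' fallback is an assumption
        return top
    return max(matched, key=frame.get)


def calibrate_audio(frames: List[Frame], subwords: List[SubWord]) -> List[str]:
    """Claim 1, last step: fuse per-frame results (assumed: ordered list)."""
    result = []
    for frame in frames:
        _, target_phonemes = subwords[pick_target_subword(frame, subwords)]
        result.append(calibrate_frame(frame, target_phonemes))
    return result


# Hypothetical data: two sub-words and two frames.
subwords = [("ni", ["n", "i"]), ("hao", ["h", "ao"])]
frames = [
    {"l": 0.45, "n": 0.40, "i": 0.05, "h": 0.10},  # top "l" is a confusion
    {"ao": 0.70, "h": 0.20, "i": 0.10},            # top "ao" already matches
]
calibrated = calibrate_audio(frames, subwords)
```

In the first sample frame the top prediction "l" does not belong to the target sub-word "ni", so the frame is calibrated to "n", the most probable matching prediction; the second frame keeps "ao".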

Description

Data calibration method, device, equipment and medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data calibration method, apparatus, device, and medium.

Background

Phoneme recognition is a basic step in speech processing that converts a continuous speech signal into a sequence of discrete phoneme units. End-to-end deep learning models are now widely used for phoneme recognition, providing a finer-grained acoustic characterization basis for downstream speech processing tasks such as speech recognition, speaker verification and speech emotion analysis. For example, in scenarios that place great weight on security and compliance, such as finance and insurance, automatic quality inspection of customer service recordings, voiceprint authentication and fraud-risk voice analysis all depend on the accuracy of front-end phoneme recognition.

However, because of interference from complex acoustic environments, dialect and accent differences in user pronunciation, and the coarticulation and ambiguity of speech signals, the phoneme recognition accuracy of existing end-to-end models remains significantly limited. In particular, under low signal-to-noise ratio, fast speech or non-standard pronunciation, the output phoneme sequence contains insertion, deletion and confusion errors, which directly impair the performance and reliability of downstream speech processing tasks. How to calibrate the data produced by phoneme recognition so as to improve recognition accuracy is therefore a problem to be solved.

Disclosure of Invention

The embodiments of the invention provide a data calibration method, device, equipment and medium, to solve the problem of how to calibrate the data produced by phoneme recognition so as to improve the accuracy of phoneme recognition.
In a first aspect, a data calibration method is provided, comprising: acquiring target audio and a transcribed text corresponding to the target audio, and performing phoneme recognition on the target audio through a target phoneme recognition model to obtain a predicted phoneme of each audio frame in the target audio and a prediction probability of the corresponding predicted phoneme; dividing the transcribed text according to pronunciation units to obtain sub-words, and determining mapped phonemes of each sub-word according to a preset mapping table, wherein the preset mapping table comprises mapping relations between each sub-word and its corresponding mapped phonemes; for any audio frame, matching the audio frame against each sub-word according to the predicted phoneme of the audio frame, the prediction probability of the corresponding predicted phoneme and the mapped phonemes of each sub-word to obtain a matching result, and determining a target sub-word corresponding to the audio frame according to the matching result; and calibrating the predicted phonemes of the audio frame according to the mapped phonemes of the target sub-word to obtain a calibration result of the predicted phonemes of the audio frame, and fusing the calibration results of the predicted phonemes of all the audio frames to form a final calibration result.
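
The second step of the method (dividing the transcribed text into sub-words and looking up their mapped phonemes) can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: it assumes that one non-whitespace character of the transcript is one pronunciation unit, and it uses a tiny hypothetical mapping table, since the text above only states that a preset table exists. Raising an error for a sub-word missing from the table is likewise an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical preset mapping table: sub-word -> mapped phonemes.
MAPPING_TABLE: Dict[str, List[str]] = {
    "你": ["n", "i"],
    "好": ["h", "ao"],
}


def split_into_subwords(text: str) -> List[str]:
    """Assumption: each non-whitespace character is one pronunciation unit."""
    return [ch for ch in text if not ch.isspace()]


def map_subwords(
    text: str, table: Dict[str, List[str]]
) -> List[Tuple[str, List[str]]]:
    """Pair every sub-word of the transcript with its mapped phonemes."""
    mapped = []
    for subword in split_into_subwords(text):
        if subword not in table:
            # Assumed behaviour; the source does not say how gaps are handled.
            raise KeyError(f"no mapped phonemes for sub-word {subword!r}")
        mapped.append((subword, list(table[subword])))
    return mapped


subword_phonemes = map_subwords("你 好", MAPPING_TABLE)
```

The resulting list of (sub-word, mapped phonemes) pairs is the input that the per-frame matching step compares against each frame's predicted phonemes.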
In a second aspect, a data calibration device is provided, comprising: an acquisition module, configured to acquire target audio and a transcribed text corresponding to the target audio, and to perform phoneme recognition on the target audio through a target phoneme recognition model to obtain a predicted phoneme of each audio frame in the target audio and a prediction probability of the corresponding predicted phoneme; a mapping module, configured to divide the transcribed text according to pronunciation units to obtain sub-words, and to determine mapped phonemes of each sub-word according to a preset mapping table, wherein the preset mapping table comprises mapping relations between each sub-word and its corresponding mapped phonemes; a matching module, configured to, for any audio frame, match the audio frame against each sub-word according to the predicted phoneme of the audio frame, the prediction probability of the corresponding predicted phoneme and the mapped phonemes of each sub-word to obtain a matching result, and to determine a target sub-word corresponding to the audio frame according to the matching result; and a calibration module, configured to calibrate the predicted phonemes of the audio frame according to the mapped phonemes of the target sub-word to obtain a calibration result of the predicted phonemes of the audio frame, and to fuse the calibration results of the predicted phonemes of all the audio frames to form a final calibration result.
In a third aspect, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the p