
CN-121747549-B - Voice transcription method and system based on multi-feature fusion and scene semantic association

CN121747549B

Abstract

The embodiments of the present application disclose a voice transcription method and system based on multi-feature fusion and scene semantic association. The method comprises: obtaining audio data to be recognized and obtaining a trained voice recognition model; using a feature extraction module to extract multi-dimensional core features, multi-dimensional low-frequency features, and short-time energy entropy features from the audio data to be recognized, and fusing these features to obtain a multi-fusion acoustic feature; and performing fragment transcription of the multi-fusion acoustic feature through a language model layer equipped with a scene semantic association model, to obtain the text of the audio data to be recognized. The embodiments address the technical problem of how to improve voice transcription accuracy in a target scene.

Inventors

  • DENG GANG
  • ZHAO HONGLIANG
  • WANG XINYAO

Assignees

  • 深圳市长丰影像器材有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-02-26

Claims (9)

  1. A speech transcription method based on multi-feature fusion and scene semantic association, the method comprising: obtaining audio data to be recognized and obtaining a trained voice recognition model, wherein the voice recognition model is obtained through dynamic adaptive weighted training on a training set constructed from a general spoken-language corpus and professional corpora of various industries; extracting multi-dimensional core features, multi-dimensional low-frequency features, and short-time energy entropy features of the audio data to be recognized using a feature extraction module, and fusing them to obtain a multi-fusion acoustic feature; and performing fragment transcription of the multi-fusion acoustic feature through a language model layer equipped with a scene semantic association model to obtain the text of the audio data to be recognized, wherein the scene semantic association model adopts a multi-head attention mechanism and is pre-trained on a plurality of annotated corpora carrying the semantic association rules of a target scene, and during the transcription of the multi-fusion acoustic feature by the language model layer, the scene semantic association model performs the following steps: receiving the multi-fusion acoustic features, synchronously reading the corresponding timestamp information, and ordering the multi-fusion acoustic features in time sequence; performing, based on the multi-head attention mechanism, cross-correlation calculation between the semantic features of the current transcription fragment and those of adjacent transcription fragments, and extracting inter-fragment semantic association points; and verifying the spliced transcription against the pre-trained rules of the target scene, and filtering out spliced results that do not conform to the semantic logic of the target scene.
  2. The speech transcription method based on multi-feature fusion and scene semantic association according to claim 1, wherein the dynamic adaptive weighted training on the training set constructed from the general spoken-language corpus and the professional corpora of each industry comprises: performing layered mixing preprocessing on the general spoken-language corpus and the professional corpora of each industry, wherein the preprocessing comprises screening daily-dialogue and impromptu-speech fragments of the target scene from the general spoken-language corpus while retaining their real acoustic characteristics, classifying the professional corpora by industry while retaining their professional expressions and scene-specific sentence patterns, splicing the corpora in random order during mixing, and labeling them with scene tags; in the initial training stage of the voice recognition model, setting the weight of the general spoken-language corpus greater than that of the professional corpora of each industry; in the middle training stage, setting the weight of the general spoken-language corpus equal to that of the professional corpora; and in the later training stage, setting up a validation set and dynamically fine-tuning the weights of the professional corpora of each industry according to validation-set accuracy.
  3. The speech transcription method based on multi-feature fusion and scene semantic association according to claim 1, wherein extracting the multi-dimensional core features, multi-dimensional low-frequency features, and short-time energy entropy features of the audio data to be recognized using the feature extraction module and fusing them to obtain the multi-fusion acoustic feature comprises: normalizing the audio data to be recognized, and removing DC components and extreme outliers from the audio data; dividing the processed audio data into frames according to a preset frame length and frame shift, and applying a Hanning window to suppress spectral leakage; extracting multi-dimensional core features from each frame based on the MFCC algorithm, the core features comprising the core human-voice frequency band and spectral envelope features; extracting multi-dimensional low-frequency features based on the LPCC algorithm; computing a one-dimensional feature based on a short-time energy entropy algorithm to quantify the energy distribution of each frame and distinguish valid speech fragments from environmental-noise fragments, thereby obtaining the short-time energy entropy feature; and fusing the extracted multi-dimensional core features, multi-dimensional low-frequency features, and short-time energy entropy features through an attention mechanism to obtain the multi-fusion acoustic feature, wherein the weights of the three feature types decrease in that order during fusion, and the attention mechanism recognizes the feature importance of the current frame in real time and dynamically adjusts the weight distribution.
  4. The speech transcription method based on multi-feature fusion and scene semantic association according to claim 3, further comprising: introducing a temporal attention mechanism into the feature extraction module, performing association calculation on the fused features of a preset number of consecutive frames, and capturing the temporal association of features between frames; and setting up a noise-robustness training branch in the feature extraction module, inputting the multi-fusion acoustic feature into the voice recognition model for training, and iteratively optimizing the extraction parameters of the feature extraction module based on a gradient descent algorithm, so that the feature extraction module adapts to the audio characteristics of the target scene.
  5. The speech transcription method based on multi-feature fusion and scene semantic association according to claim 1, wherein the language model layer is further provided with a term enhancement layer configured to convert professional terms from a dictionary into multi-dimensional vectors based on a term embedding algorithm, embed the vectors into the vocabulary of the language model layer, and learn the collocation rules of terms in different sentence patterns by tuning the term context-adaptation logic; the method further comprising optimizing the loss function of the language model layer by introducing a joint loss combining cross-entropy loss and term-matching loss, wherein the weight of the cross-entropy loss is greater than that of the term-matching loss, the cross-entropy loss optimizes the overall transcription accuracy of the voice recognition model, and the term-matching loss reinforces its term transcription accuracy; and linking the language model layer with the acoustic feature extraction module, training and tuning with text-annotated data of the target scene, and iterating until the term transcription accuracy and semantic consistency of the voice recognition model reach preset values.
  6. The speech transcription method based on multi-feature fusion and scene semantic association according to claim 1, further comprising: performing semantic verification on the spliced result output by the scene semantic association model; and, if the spliced result is found to be erroneous, feeding it back to the scene semantic association model, which recalculates the semantic association degree of adjacent transcription fragments, adjusts the splicing boundaries, and reversely optimizes its own semantic association rules in combination with transcription data corrected by the user.
  7. The speech transcription method based on multi-feature fusion and scene semantic association according to claim 1, further comprising error correction of the text derived from the audio data to be recognized, the error correction being performed in at least one of the following ways: preferentially screening text for error correction based on the audio characteristics of the audio data to be recognized and the confidence of each transcribed segment output by the voice recognition model; correcting grammar-level errors in the text based on the training results of the general spoken-language corpus and a Chinese grammar rule base used during training of the voice recognition model; comparing the semantic association between the current transcription fragment and adjacent fragments based on the scene semantic association model and the semantic rules of the target scene, and linking the timestamp information in the audio data to correct semantic-level errors in the text; identifying suspected terms in the text, checking the plausibility of term collocations and unifying term expressions based on a multi-industry term dictionary, preferentially invoking the term dictionary of the industry corresponding to the target scene; identifying actual speech pauses based on the timestamp information and audio energy characteristics of the audio data, correcting the punctuation of sentence breaks in the text accordingly, and correcting semantic deviations caused by misjudged pauses; establishing a noise mis-transcription feature library based on the noise type and noise intensity fed back by the recording device, and automatically identifying and deleting meaningless noise-induced transcription content; and iteratively optimizing the voice recognition model based on user-corrected transcription data.
  8. The speech transcription method based on multi-feature fusion and scene semantic association according to claim 7, further comprising: after correcting the text of the audio data to be recognized, invoking the scene semantic association model to verify the corrected text, judging its semantic consistency, grammatical consistency, and term uniformity with the preceding and following fragments; if contradictory error-correction results exist, preferentially selecting the result conforming to the semantic logic, in combination with the audio characteristics of the audio data and the transcription confidence; and if the verification passes, outputting the corrected transcription text and feeding the error-correction data generated in the process back to the voice recognition model for iterative upgrading.
  9. A speech transcription system comprising a server configured to perform the method according to any one of claims 1 to 8.
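The cross-correlation step recited in claim 1 — scoring the semantic features of the current transcription fragment against those of adjacent fragments with multi-head attention — can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the patent's actual model: the fragment features, dimensions, and head count are invented for illustration, and the returned attention weights stand in for the "inter-fragment semantic association points".

```python
import numpy as np

def multi_head_cross_attention(query, context, num_heads=4):
    """Multi-head scaled dot-product attention of a current fragment's
    semantic features (query) over an adjacent fragment's features
    (context). Returns the attended features and per-head weights,
    the latter playing the role of inter-fragment association points."""
    d_model = query.shape[-1]
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    def split(x):  # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(query), split(context), split(context)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Numerically stable softmax over the context positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attended = weights @ v                       # (heads, n_query, d_head)
    out = attended.transpose(1, 0, 2).reshape(query.shape[0], d_model)
    return out, weights

# Toy fragment features, ordered by timestamp (values are random stand-ins).
rng = np.random.default_rng(0)
current = rng.standard_normal((5, 32))    # current transcription fragment
adjacent = rng.standard_normal((8, 32))   # adjacent transcription fragment
fused, assoc = multi_head_cross_attention(current, adjacent)
```

In this sketch, rows of `assoc` with sharply peaked weights would mark context positions that are strongly associated with the current fragment, which is the kind of signal the claim uses to decide splicing boundaries.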

Description

Voice transcription method and system based on multi-feature fusion and scene semantic association

Technical Field

The present application relates to the technical field of artificial intelligence, and in particular to a voice transcription method and system based on multi-feature fusion and scene semantic association.

Background

With the development of voice recognition technology, voice transcription has made breakthrough progress, and the real-time transcription of speech into text has been successfully applied in many scenes. For example, the conference is an important scene for voice transcription: speeches in a conference are converted into text through online or offline voice transcription to form a detailed conference record, so that the conference content can be conveniently stored, queried, searched, and distributed. However, existing speech transcription techniques rely on generic ASR (automatic speech recognition) models and are not designed to adapt to the audio characteristics of a particular scene, which limits their transcription accuracy in some specific scenes.

Disclosure of Invention

The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the present application provides a voice transcription method and system based on multi-feature fusion and scene semantic association, aiming to solve the technical problem of how to improve voice transcription accuracy in a target scene.
In a first aspect, the present application provides a speech transcription method based on multi-feature fusion and scene semantic association, the method comprising: obtaining audio data to be recognized and obtaining a trained voice recognition model, wherein the voice recognition model is obtained through dynamic adaptive weighted training on a training set constructed from a general spoken-language corpus and professional corpora of various industries; extracting multi-dimensional core features, multi-dimensional low-frequency features, and short-time energy entropy features of the audio data to be recognized using a feature extraction module, and fusing them to obtain a multi-fusion acoustic feature; and performing fragment transcription of the multi-fusion acoustic feature through a language model layer equipped with a scene semantic association model to obtain the text of the audio data to be recognized, wherein the scene semantic association model adopts a multi-head attention mechanism and is pre-trained on a plurality of annotated corpora carrying the semantic association rules of a target scene.
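The first-aspect flow above — order fragments by timestamp, decode each, and keep only splices that pass the scene semantic-logic check — can be sketched as a pipeline skeleton. All names here (`Fragment`, `transcribe`, the stand-in decoder and checker) are hypothetical; the real acoustic decoding and semantic verification are the trained models described in this application.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    start: float     # timestamp read alongside the multi-fusion features
    features: list   # multi-fusion acoustic feature vectors for the fragment
    text: str = ""

def transcribe(fragments, acoustic_decode, splice_ok):
    """Hypothetical pipeline skeleton: sort fragments by timestamp,
    decode each one, and append its text only when the splice with the
    previous piece passes the scene semantic-logic filter."""
    ordered = sorted(fragments, key=lambda f: f.start)
    pieces = []
    for frag in ordered:
        frag.text = acoustic_decode(frag.features)
        if not pieces or splice_ok(pieces[-1], frag.text):
            pieces.append(frag.text)
    return " ".join(pieces)

# Toy stand-ins for the model components (assumptions, not the patent's models).
decode = lambda feats: f"seg{len(feats)}"
ok = lambda prev, cur: True
result = transcribe([Fragment(1.0, [0, 1]), Fragment(0.0, [0])], decode, ok)
```

Note how the out-of-order input is re-sequenced by timestamp before splicing, mirroring the "ordering the multi-fusion acoustic features in time sequence" step.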
In at least some embodiments of the present application, the dynamic adaptive weighted training on the training set constructed from the general spoken-language corpus and the professional corpora of each industry comprises: performing layered mixing preprocessing on the general spoken-language corpus and the professional corpora of each industry, wherein the preprocessing comprises screening daily-dialogue and impromptu-speech fragments of the target scene from the general spoken-language corpus while retaining their real acoustic characteristics, classifying the professional corpora by industry while retaining their professional expressions and scene-specific sentence patterns, splicing the corpora in random order during mixing, and labeling them with scene tags; in the initial training stage of the voice recognition model, setting the weight of the general spoken-language corpus greater than that of the professional corpora of each industry; in the middle training stage, setting the weight of the general spoken-language corpus equal to that of the professional corpora; and in the later training stage, setting up a validation set and dynamically fine-tuning the weights of the professional corpora of each industry according to validation-set accuracy.
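The three-stage weighting schedule described above can be sketched as a small scheduling function. The concrete weight values and the fine-tuning rule are illustrative assumptions — the patent specifies only the ordering relations (general > professional, then equal, then validation-driven fine-tuning), not the numbers.

```python
def corpus_weights(stage, val_accuracy=None):
    """Return (general_weight, professional_weight) for a training stage.
    Values are illustrative; only the inequalities come from the method:
    initial: general > professional; middle: equal; later: fine-tuned
    from validation-set accuracy (weaker accuracy -> more professional
    corpus weight)."""
    if stage == "initial":
        return 0.7, 0.3
    if stage == "middle":
        return 0.5, 0.5
    if stage == "later":
        # Assumed fine-tuning rule: boost professional corpora when the
        # validation accuracy on them is low.
        prof = 0.5 + 0.3 * (1.0 - val_accuracy)
        return 1.0 - prof, prof
    raise ValueError(f"unknown stage: {stage}")
```

A per-industry variant would return one weight per professional corpus, fine-tuned from that industry's own validation accuracy.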
In at least some embodiments of the present application, extracting the multi-dimensional core features, multi-dimensional low-frequency features, and short-time energy entropy features of the audio data to be recognized using the feature extraction module and fusing them to obtain the multi-fusion acoustic feature comprises: normalizing the audio data to be recognized, and removing DC components and extreme outliers from the audio data.
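The preprocessing and short-time energy entropy steps (DC removal, framing with a preset frame length and shift, Hanning windowing, and per-frame energy entropy as recited in claim 3) can be sketched as follows. The frame length, hop, and sub-block count are assumed example values, not figures from the patent.

```python
import numpy as np

def short_time_energy_entropy(signal, frame_len=400, hop=160, n_bins=8):
    """Frame a DC-removed signal with a Hanning window and compute a
    per-frame energy entropy: each frame is split into n_bins sub-blocks,
    their energies are normalized into a probability distribution, and
    the entropy of that distribution is returned. Noise-like frames with
    evenly spread energy score near log2(n_bins); voiced frames with
    concentrated energy score lower. Parameter values are illustrative."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()            # remove the DC component
    window = np.hanning(frame_len)             # suppress spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    entropies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        sub = frame.reshape(n_bins, -1)
        energy = (sub ** 2).sum(axis=1) + 1e-12   # sub-block energies
        p = energy / energy.sum()
        entropies[i] = -(p * np.log2(p)).sum()
    return entropies
```

Thresholding these per-frame entropies is one simple way to separate valid speech fragments from environmental-noise fragments before the MFCC and LPCC features are computed and attention-fused.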