CN-121983061-A - Voice recognition method and device


Abstract

The invention provides a voice recognition method and device. The method comprises: obtaining original voice data; performing signal preprocessing on the original voice data to obtain preprocessed voice data; obtaining target feature data from the preprocessed voice data, the target feature data comprising short-time energy and short-time zero-crossing rate; extracting effective voice data from the preprocessed voice data according to preset extraction conditions and the target feature data; obtaining effective feature data according to the linear spectrum of the voice frames in the effective voice data and preset data extraction parameters; performing voice recognition on the effective feature data to obtain a text sequence; and performing text splicing on the text sequence to obtain a voice recognition result. The invention can eliminate invalid voice and noise, improve the accuracy of voice recognition, reduce the workload of manual post-processing and quality inspection, and reduce the cost of voice recognition.

Inventors

WANG XINYONG
LUO XINKAI
SUN PICHAO
FENG SHANSHAN

Assignees

中译文娱科技(青岛)有限公司

Dates

Publication Date: 20260505
Application Date: 20260409

Claims (10)

1. A voice recognition method, comprising: acquiring original voice data; performing signal preprocessing on the original voice data to obtain preprocessed voice data; obtaining target feature data from the preprocessed voice data, wherein the target feature data comprises short-time energy and short-time zero-crossing rate; extracting effective voice data from the preprocessed voice data according to preset extraction conditions and the target feature data; obtaining effective feature data according to the linear spectrum of the voice frame in the effective voice data and preset data extraction parameters; performing voice recognition according to the effective feature data to obtain a text sequence; and performing text splicing processing according to the text sequence to obtain a voice recognition result.
2. The method of claim 1, wherein performing signal preprocessing on the original voice data to obtain preprocessed voice data comprises: performing pre-emphasis on the original voice data to obtain pre-emphasized voice data; windowing the pre-emphasized voice data to obtain windowed voice data; and performing noise reduction on the windowed voice data to obtain the preprocessed voice data.
3. The method of claim 1, wherein obtaining target feature data from the preprocessed voice data comprises: obtaining the short-time energy according to [formula not reproduced in source]; obtaining the short-time zero-crossing rate according to [formula not reproduced in source]; and obtaining the target feature data according to the short-time energy and the short-time zero-crossing rate; wherein E is the short-time energy, N is the number of sampling points in a single frame, [symbol] is the amplitude of the n-th sampling point in the preprocessed voice data, Z is the short-time zero-crossing rate, [symbol] is the sign function, and [symbol] is the amplitude of the (n-1)-th sampling point in the preprocessed voice data.
4. The voice recognition method according to claim 1, wherein extracting effective voice data from the preprocessed voice data according to preset extraction conditions and the target feature data comprises: obtaining the preset extraction conditions, wherein the preset extraction conditions comprise the short-time energy being greater than a preset energy high threshold and the short-time zero-crossing rate meeting a preset voice threshold; and extracting the effective voice data from the preprocessed voice data according to the result of comparing the target feature data with the preset extraction conditions.
5. The method according to claim 1, wherein obtaining effective feature data according to the linear spectrum of the voice frame in the effective voice data and the preset data extraction parameters comprises: obtaining first feature data according to [formula not reproduced in source]; obtaining second feature data according to [formula not reproduced in source]; obtaining third feature data according to [formula not reproduced in source]; and splicing the first feature data, the second feature data, and the third feature data to obtain the effective feature data; wherein [symbol] is the j-th dimension of the first feature data, M is the total number of filters in the preset data extraction parameters, [symbol] is the logarithmic energy output by the m-th filter, given by [formula not reproduced in source], [symbol] is the linear spectrum obtained by Fourier transform of a voice frame in the effective voice data, [symbol] is the frequency response of the m-th filter, F is the total number of sampling points of the fast Fourier transform, [symbol] is the n-th dimension of the second feature data, K is the difference window size in the preset data extraction parameters, [symbol] is the first feature data of frame j+k, [symbol] is the first feature data of frame j-k, [symbol] is the j-th dimension of the third feature data, [symbol] is the second feature data of frame j+k, and [symbol] is the second feature data of frame j-k.
6. The method according to claim 1, wherein performing voice recognition according to the effective feature data to obtain a text sequence comprises: performing a nonlinear transformation on the effective feature data to obtain character probabilities; obtaining a global probability of the text sequence according to the character probabilities; obtaining initial text data according to a preset search strategy and the global probability of the text sequence; and correcting the initial text data to obtain the text sequence.
7. The method according to claim 1, wherein performing text splicing processing according to the text sequence to obtain a voice recognition result comprises: splicing the text sequence in time order to obtain a text splicing result; and correcting the text splicing result to obtain the voice recognition result.
8. A voice recognition device, comprising: an acquisition module configured to acquire original voice data; and a processing module configured to perform signal preprocessing on the original voice data to obtain preprocessed voice data; obtain target feature data from the preprocessed voice data, wherein the target feature data comprises short-time energy and short-time zero-crossing rate; extract effective voice data from the preprocessed voice data according to preset extraction conditions and the target feature data; extract effective feature data from the effective voice data; perform voice recognition according to the effective feature data to obtain a text sequence; and perform text splicing processing according to the text sequence to obtain a voice recognition result.
9. A computing device comprising a processor and a memory storing a computer program which, when executed by the processor, performs the method of any one of claims 1 to 7.
10. A computer readable storage medium storing instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 7.
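
As an illustration of the feature computation recited in claim 5, the following Python sketch assumes that the first, second, and third feature data correspond to conventional cepstral coefficients and their first- and second-order differences. That reading is an assumption: the formulas are not reproduced in this text, so the cosine-transform form, the regression-style difference, the edge clamping, and every function name below are hypothetical.

```python
import math

def log_filterbank_energies(power_spectrum, filters):
    """Assumed form: S(m) = ln(sum_k |X(k)|^2 * H_m(k)) for each filter m."""
    eps = 1e-10  # guards against log(0); an implementation detail, not from the source
    return [math.log(sum(p * h for p, h in zip(power_spectrum, H)) + eps)
            for H in filters]

def cepstral_coefficients(log_energies, num_coeffs):
    """Assumed form: c_j = sum_{m=1..M} S(m) * cos(pi * j * (m - 0.5) / M)."""
    M = len(log_energies)
    # enumerate() is 0-based, so (m + 0.5) here equals (m - 0.5) for 1-based m
    return [sum(S * math.cos(math.pi * j * (m + 0.5) / M)
                for m, S in enumerate(log_energies))
            for j in range(num_coeffs)]

def delta(frames, K):
    """Assumed form: d_t = sum_{k=1..K} k * (c_{t+k} - c_{t-k}) / (2 * sum k^2).

    frames is a list of per-frame feature vectors; indices past either end
    are clamped to the nearest frame (an assumption).
    """
    denom = 2 * sum(k * k for k in range(1, K + 1))
    T = len(frames)
    out = []
    for t in range(T):
        row = []
        for d in range(len(frames[t])):
            num = sum(k * (frames[min(t + k, T - 1)][d] - frames[max(t - k, 0)][d])
                      for k in range(1, K + 1))
            row.append(num / denom)
        out.append(row)
    return out

def effective_features(static, K=2):
    """Concatenate static, first-difference, and second-difference features per frame."""
    d1 = delta(static, K)   # "second feature data" under the stated assumption
    d2 = delta(d1, K)       # "third feature data" under the stated assumption
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

Here K plays the role of the difference window size and the number of filters M is the length of `filters`; both would come from the preset data extraction parameters.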

Description

Voice recognition method and device

Technical Field

The invention relates to the technical field of voice recognition, and in particular to a voice recognition method and device.

Background

Current mainstream automatic voice recognition systems generally process the full voice segment directly, without fine-grained segmentation or screening of invalid portions of the original voice stream. They must therefore process complete long recordings containing large amounts of silence, noise, and irrelevant sound, in which invalid data accounts for 30-60 percent, so that computing power, memory, and bandwidth are heavily occupied.

Disclosure of Invention

The invention aims to provide a voice recognition method and device for improving the efficiency and accuracy of voice recognition.

In order to solve the above technical problem, the technical solution of the invention is as follows:

In a first aspect, the invention provides a voice recognition method comprising: acquiring original voice data; performing signal preprocessing on the original voice data to obtain preprocessed voice data; obtaining target feature data from the preprocessed voice data, wherein the target feature data comprises short-time energy and short-time zero-crossing rate; extracting effective voice data from the preprocessed voice data according to preset extraction conditions and the target feature data; extracting effective feature data from the effective voice data; performing voice recognition according to the effective feature data to obtain a text sequence; and performing text splicing processing according to the text sequence to obtain a voice recognition result.
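
The final splicing step can be sketched in Python as follows, assuming (as claim 7 recites) that the per-segment texts are joined in time order. The `SegmentText` type, its field names, and the separator are hypothetical, and the subsequent correction of the spliced text is not specified in the source and is omitted here.

```python
from dataclasses import dataclass

@dataclass
class SegmentText:
    start_s: float  # hypothetical: start time of the voice segment, in seconds
    text: str       # recognized text for that segment

def splice_texts(segments, separator=" "):
    """Join per-segment texts in order of segment start time, skipping empty ones."""
    ordered = sorted(segments, key=lambda s: s.start_s)
    return separator.join(s.text.strip() for s in ordered if s.text.strip())
```

For example, segments recognized out of order at 2.0 s and 0.5 s are joined with the 0.5 s text first. For languages written without spaces, an empty separator would be the natural choice.
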
Optionally, performing signal preprocessing on the original voice data to obtain preprocessed voice data comprises: performing pre-emphasis on the original voice data to obtain pre-emphasized voice data; windowing the pre-emphasized voice data to obtain windowed voice data; and performing noise reduction on the windowed voice data to obtain the preprocessed voice data.
Optionally, obtaining target feature data from the preprocessed voice data comprises: obtaining the short-time energy according to [formula not reproduced in source]; obtaining the short-time zero-crossing rate according to [formula not reproduced in source]; and obtaining the target feature data according to the short-time energy and the short-time zero-crossing rate.
Optionally, extracting effective voice data from the preprocessed voice data according to preset extraction conditions and the target feature data comprises: obtaining the preset extraction conditions, wherein the preset extraction conditions comprise the short-time energy meeting a preset energy high threshold, the short-time zero-crossing rate meeting a preset voice threshold, and the duration of consecutive effective voice frames meeting a preset duration threshold; and extracting the effective voice data from the preprocessed voice data according to the result of comparing the target feature data with the preset extraction conditions.
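
The preprocessing and extraction steps above can be sketched in Python as follows. The formulas for short-time energy and zero-crossing rate are not legible in this text, so the sketch uses the textbook definitions E = sum of x(n)^2 and Z = 1/2 * sum of |sgn(x(n)) - sgn(x(n-1))|, which may differ from the patent's exact form. The pre-emphasis coefficient, the Hamming window, the direction of the zero-crossing comparison, and all names are assumptions; noise reduction is left out.

```python
import math

def pre_emphasis(samples, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha = 0.97 is a common default, not from the source."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def frame_and_window(samples, frame_len, hop):
    """Split into frames of frame_len (> 1) samples and apply a Hamming window (assumed)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[samples[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Textbook form: sum of squared sample amplitudes in the frame."""
    return sum(x * x for x in frame)

def sign(x):
    return 1 if x >= 0 else -1

def zero_crossing_rate(frame):
    """Textbook form: half the summed absolute sign differences, i.e. the sign-change count."""
    return sum(abs(sign(frame[n]) - sign(frame[n - 1]))
               for n in range(1, len(frame))) / 2

def extract_valid_segments(frames, energy_high, zcr_max, min_frames):
    """Return (start, end) frame ranges, end exclusive, that satisfy all three conditions.

    A frame counts as speech when its energy exceeds energy_high and its
    zero-crossing count is at most zcr_max (the comparison direction is an
    assumption); runs shorter than min_frames are discarded.
    """
    segments, start = [], None
    for i, frame in enumerate(frames):
        is_speech = (short_time_energy(frame) > energy_high
                     and zero_crossing_rate(frame) <= zcr_max)
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_frames:
                segments.append((start, i))
            start = None
    if start is not None and len(frames) - start >= min_frames:
        segments.append((start, len(frames)))
    return segments
```

Only the frames inside the returned ranges would be passed on to feature extraction, which is how silence and noise-only stretches are dropped before recognition.
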
Optionally, obtaining the effective feature data according to the linear spectrum of the voice frame in the effective voice data and the preset data extraction parameters comprises: obtaining first feature data according to [formula not reproduced in source]; obtaining second feature data according to [formula not reproduced in source]; obtaining third feature data according to [formula not reproduced in source]; and splicing the first feature data, the second feature data, and the third feature data to obtain the effective feature data; wherein [symbol] is the j-th dimension of the first feature data, M is the total number of filters in the preset data extraction parameters, [symbol] is the logarithmic energy output by the m-th filter, given by [formula not reproduced in source], [symbol] is the linear spectrum obtained by Fourier transform of a voice frame in the effective voice data, [symbol] is the frequency response of the m-th filter, F is the total number of sampling points of the fast Fourier transform, [symbol] is the n-th dimension of the second feature data, K is the difference window size in the preset data extraction parameters, [symbol] is the first feature data of frame j+k, [symbol] is the first feature data of frame j-k, [symbol] is the j-th dimension of the third feature data, [symbol] is the second feature data of frame j+k, and [symbol] is the second feature data of frame j-k.
Optionally, performing voice recognition according to the effective feature data to obtain a text sequence comprises: performing a nonlinear transformation on the effective feature data to obtain character probabilities; obtaining a global probability of the text sequence according to the character probabilities; obtaining initial text data according to a preset search strategy and the global probability of the text sequence; and correcting the initial text data to