CN-115762472-B - Voice rhythm recognition method, system, equipment and storage medium

CN115762472B

Abstract

The embodiments of the invention disclose a voice prosody recognition method, system, device, and storage medium. First, voice in a customer service telephone call is collected to obtain a dialogue voice signal. The dialogue voice signal is preprocessed to obtain a preprocessed dialogue voice file, which is then vectorized to obtain a corresponding feature matrix. The feature matrix is input into a trained prosody model to obtain a model calculation result; corresponding template thresholds are obtained from a Mandarin template and a dialect template; and a prosody recognition result for the feature matrix is obtained from the model calculation result and the template thresholds. Finally, text mapping is performed on the feature matrix according to the prosody recognition result to obtain a dialogue word-sequence text. The embodiments effectively improve the recognition accuracy and recognition efficiency for speech with dialect.

Inventors

  • JIANG XIAODAN
  • ZHANG JING
  • AN JUNGANG
  • WANG SHUANG
  • DENG XIONG
  • ZHANG CHENGKAI
  • FAN HUI

Assignees

  • 北京伽睿智能科技集团有限公司

Dates

Publication Date
2026-05-05
Application Date
2022-11-23

Claims (10)

  1. A method of speech prosody recognition, the method comprising: collecting voice in a customer service telephone call to obtain a dialogue voice signal; performing signal preprocessing on the dialogue voice signal to obtain a preprocessed dialogue voice file; performing feature extraction on the preprocessed dialogue voice file, and performing vectorization processing on it based on the feature extraction result to obtain a corresponding feature matrix; generating a prosody model based on signal features of historical voice data, and inputting the feature matrix into the prosody model to obtain a model calculation result; calculating a Mandarin template threshold and a dialect template threshold from a preset Mandarin template and a preset dialect template, respectively; obtaining a prosody recognition result corresponding to the feature matrix using the model calculation result, the Mandarin template threshold, and the dialect template threshold; and performing text mapping processing on the feature matrix according to the prosody recognition result to obtain a dialogue word-sequence text.
  2. The speech prosody recognition method of claim 1, wherein performing signal preprocessing on the dialogue voice signal to obtain a preprocessed dialogue voice file comprises: performing first beamforming processing on the dialogue voice signal to obtain a first preprocessed signal; performing second beamforming processing on the first preprocessed signal to obtain a second preprocessed signal; and performing spectrum signal control processing on the second preprocessed signal to obtain the dialogue voice file.
  3. The speech prosody recognition method of claim 2, wherein performing feature extraction on the preprocessed dialogue voice file and performing vectorization processing on it based on the feature extraction result to obtain a corresponding feature matrix comprises: segmenting the dialogue voice file in time sequence to obtain a segmented dialogue voice file; performing feature extraction on each segment of the segmented dialogue voice file to obtain voice spectrum features corresponding to the segmented dialogue voice file, wherein the voice spectrum features comprise a spectrum weight parameter t_n, a signal delay parameter y_n, and a dialect tone-intensity parameter τ_n, n being an integer greater than or equal to 0 and less than the total number of segments; and calculating a first spectrum feature matrix A from the spectrum weight parameter t_n, the signal delay parameter y_n, and the dialect tone-intensity parameter τ_n, wherein the first spectrum feature matrix A is calculated as: A = {A(n)}, A(n) = y_n × s(t_n + τ_n), where A(n) denotes the n-th element of the first spectrum feature matrix A and s is a multi-element nonlinear fitting parameter.
  4. The speech prosody recognition method of claim 3, wherein generating a prosody model based on signal features of historical voice data and inputting the feature matrix into the prosody model to obtain a model calculation result comprises: extracting pitch, tone-intensity, and tone-length feature data from the historical voice data; averaging the pitch feature data, the tone-intensity feature data, and the tone-length feature data to obtain a corresponding pitch feature parameter ω, tone-intensity feature parameter θ, and tone-length feature parameter v; and inputting the first spectrum feature matrix A into the prosody recognition model and calculating a prosody model calculation result X according to the formula given in the filing (rendered as an image in the source and not reproduced here), where m is determined by the length of the first spectrum feature matrix A, j is a preset weighting parameter, and x is a preset parameter.
  5. The speech prosody recognition method of claim 4, wherein calculating a Mandarin template threshold and a dialect template threshold from a preset Mandarin template and a preset dialect template, respectively, comprises: segmenting a preset Mandarin template voice file to obtain a segmented Mandarin template voice file; calculating a corresponding second spectrum feature matrix B from the segmented Mandarin template voice file; calculating the Mandarin template threshold X′ from the second spectrum feature matrix B according to the formula given in the filing (rendered as an image in the source and not reproduced here), where m′ is determined by the length of the second spectrum feature matrix B, B(n′) denotes the n′-th element of the second spectrum feature matrix B, and n′ is an integer greater than or equal to 0 and less than the total number of segments of the segmented Mandarin template voice file; segmenting a preset dialect template voice file to obtain a segmented dialect template voice file; calculating a corresponding third spectrum feature matrix C from the segmented dialect template voice file; and calculating the dialect template threshold X″ from the third spectrum feature matrix C according to the formula given in the filing (rendered as an image in the source and not reproduced here), where m″ is determined by the length of the third spectrum feature matrix C, C(n″) denotes the n″-th element of the third spectrum feature matrix C, and n″ is an integer greater than or equal to 0 and less than the total number of segments of the segmented dialect template voice file.
  6. The speech prosody recognition method of claim 5, wherein obtaining a prosody recognition result corresponding to the feature matrix using the model calculation result, the Mandarin template threshold, and the dialect template threshold comprises: calculating a first absolute difference C_1 from the prosody model calculation result X and the Mandarin template threshold X′ as C_1 = ||X| − |X′||; calculating a second absolute difference C_2 from the prosody model calculation result X and the dialect template threshold X″ as C_2 = ||X| − |X″||; judging whether the first absolute difference C_1 is larger than the second absolute difference C_2; if C_1 is greater than C_2, the prosody recognition result of the first spectrum feature matrix A is dialect; and if C_1 is smaller than or equal to C_2, the prosody recognition result of the first spectrum feature matrix A is Mandarin.
  7. The speech prosody recognition method of claim 6, wherein performing text mapping processing on the feature matrix according to the prosody recognition result to obtain a dialogue word-sequence text comprises: judging the prosody recognition result of the first spectrum feature matrix A; if the prosody recognition result of the first spectrum feature matrix A is Mandarin, performing first text mapping processing on the first spectrum feature matrix A to obtain a first dialogue word-sequence text D; and if the prosody recognition result of the first spectrum feature matrix A is dialect, performing second text mapping processing on the first spectrum feature matrix A to obtain a second dialogue word-sequence text D′.
  8. A speech prosody recognition system, comprising: a voice signal acquisition module for collecting voice in a customer service telephone call to obtain a dialogue voice signal; a voice signal preprocessing module for performing signal preprocessing on the dialogue voice signal to obtain a preprocessed dialogue voice file; a feature matrix mapping module for performing feature extraction on the preprocessed dialogue voice file and performing vectorization processing on it based on the feature extraction result to obtain a corresponding feature matrix; a prosody model module for generating a prosody model based on signal features of historical voice data and inputting the feature matrix into the prosody model to obtain a model calculation result; a template threshold generating module for calculating a Mandarin template threshold and a dialect template threshold from a preset Mandarin template and a preset dialect template, respectively; a prosody recognition module for obtaining a prosody recognition result corresponding to the feature matrix using the model calculation result, the Mandarin template threshold, and the dialect template threshold; and a word-sequence text mapping module for performing text mapping processing on the feature matrix according to the prosody recognition result to obtain a dialogue word-sequence text.
  9. A speech prosody recognition device, comprising a processor and a memory, wherein the memory is configured to store one or more program instructions, and the processor is configured to execute the one or more program instructions to perform the steps of the speech prosody recognition method of any one of claims 1 to 7.
  10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech prosody recognition method of any one of claims 1 to 7.
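The decision pipeline of claims 3-6 can be sketched in Python. This is an illustrative stand-in, not the patented implementation: the patent leaves the role of the fitting parameter s unspecified (it is treated here as a scalar multiplier), and the formula for the model result X appears only as an image in the filing, so a plain mean over the elements of A is substituted for it. Only the feature formula A(n) = y_n × s(t_n + τ_n) and the C_1/C_2 comparison of claim 6 come from the text.

```python
import numpy as np

def spectral_feature_matrix(t, y, tau, s=1.0):
    # Claim 3: A(n) = y_n * s(t_n + tau_n).  The patent calls s a
    # "multi-element nonlinear fitting parameter" without defining it;
    # treating it as a scalar multiplier is an assumption.
    t, y, tau = (np.asarray(v, dtype=float) for v in (t, y, tau))
    return y * s * (t + tau)

def model_result(A):
    # Stand-in for the prosody model result X: the patent's formula is
    # an image and not reproduced in the text, so a mean is used here.
    return float(np.mean(A))

def classify(X, X_mandarin, X_dialect):
    # Claim 6: C1 = ||X| - |X'||, C2 = ||X| - |X''||; dialect if C1 > C2,
    # otherwise Mandarin.
    c1 = abs(abs(X) - abs(X_mandarin))
    c2 = abs(abs(X) - abs(X_dialect))
    return "dialect" if c1 > c2 else "mandarin"

A = spectral_feature_matrix(t=[0.2, 0.4], y=[1.0, 0.5], tau=[0.1, 0.3], s=2.0)
X = model_result(A)                      # (0.6 + 0.7) / 2 = 0.65
label = classify(X, X_mandarin=0.6, X_dialect=0.9)  # closer to Mandarin
```

The template thresholds X′ and X″ would be produced the same way from the segmented Mandarin and dialect template files (claim 5) before any live call is classified.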

Description

Voice rhythm recognition method, system, equipment and storage medium

Technical Field

The embodiments of the invention relate to the technical field of speech recognition, and in particular to a voice prosody recognition method, system, device, and storage medium.

Background

With the development of cloud computing and big-data technology, customer service call centers in the telecommunications industry need to collect dialogue speech from customer service phone calls, perform speech recognition on the collected dialogue speech, and transcribe it into text. Prior-art schemes are based on mainstream Mandarin speech recognition and have a high error rate when recognizing speech with dialect. Existing deep-learning-based expert knowledge systems add an extra compute unit when dialect speech must be processed and need to collect and simulate a large dialect corpus, yet such corpora are small samples in real life. If a large amount of data and deeper model training were required, as in conventional deep-learning schemes, business complexity and model flexibility would deteriorate, cost would increase, and the efficiency of recognizing speech with dialect would be low.

Disclosure of Invention

Therefore, the embodiments of the invention provide a voice prosody recognition method, system, device, and storage medium to solve the prior-art problems of a high speech recognition error rate and low recognition efficiency for speech with dialect.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions. According to a first aspect of an embodiment of the present invention, there is provided a speech prosody recognition method, the method including: collecting voice in a customer service telephone call to obtain a dialogue voice signal; performing signal preprocessing on the dialogue voice signal to obtain a preprocessed dialogue voice file; performing feature extraction on the preprocessed dialogue voice file, and performing vectorization processing on it based on the feature extraction result to obtain a corresponding feature matrix; generating a prosody model based on signal features of historical voice data, and inputting the feature matrix into the prosody model to obtain a model calculation result; calculating a Mandarin template threshold and a dialect template threshold from a preset Mandarin template and a preset dialect template, respectively; obtaining a prosody recognition result corresponding to the feature matrix using the model calculation result, the Mandarin template threshold, and the dialect template threshold; and performing text mapping processing on the feature matrix according to the prosody recognition result to obtain a dialogue word-sequence text. Further, performing signal preprocessing on the dialogue voice signal to obtain a preprocessed dialogue voice file includes: performing first beamforming processing on the dialogue voice signal to obtain a first preprocessed signal; performing second beamforming processing on the first preprocessed signal to obtain a second preprocessed signal; and performing spectrum signal control processing on the second preprocessed signal to obtain the dialogue voice file.
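The text does not specify what the "first" and "second" beamforming passes compute, so the following is only a minimal, hypothetical sketch of one common beamforming technique (delay-and-sum) to make the preprocessing step concrete; the per-channel delays and the wrap-around alignment are simplifying assumptions, not details from the patent.

```python
import numpy as np

def delay_and_sum(channels, delays):
    # Hypothetical delay-and-sum beamformer: each microphone channel is
    # shifted by its integer sample delay so the channels line up, then
    # the aligned channels are averaged to reinforce the common signal.
    channels = np.asarray(channels, dtype=float)
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -int(d))  # crude alignment; edges wrap around
    return out / len(channels)

# Two microphones hearing the same 4-sample signal, the second one
# sample late; aligning and averaging recovers the signal.
aligned = delay_and_sum([[1, 2, 3, 4],
                         [4, 1, 2, 3]], delays=[0, 1])
```

A second pass over the output of the first, followed by whatever "spectrum signal control" denotes in the filing, would correspond to claim 2's chain of first beamforming, second beamforming, and spectrum control.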
Further, performing feature extraction on the preprocessed dialogue voice file and performing vectorization processing on it based on the feature extraction result to obtain a corresponding feature matrix includes: segmenting the dialogue voice file in time sequence to obtain a segmented dialogue voice file; performing feature extraction on each segment of the segmented dialogue voice file to obtain voice spectrum features corresponding to the segmented dialogue voice file, wherein the voice spectrum features comprise a spectrum weight parameter t_n, a signal delay parameter y_n, and a dialect tone-intensity parameter τ_n, n being an integer greater than or equal to 0 and less than the total number of segments; and calculating a first spectrum feature matrix A from the spectrum weight parameter t_n, the signal delay parameter y_n, and the dialect tone-intensity parameter τ_n, wherein the first spectrum feature matrix A is calculated as: A = {A(n)}, A(n) = y_n × s(t_n + τ_n), where A(n) denotes the n-th element of the first spectrum feature matrix A and s is a multi-element nonlinear fitting parameter. Further, generating a prosody model based on signal features of the historical voice data and inputting the feature matrix into the prosody model to obtain a model calculation result includes: extracting pitch characteristic, tone intensity characteristic and tone l