CN-121983036-A - Voice recognition method and system based on big data

Abstract

The invention relates to the technical field of voice recognition, and in particular to a voice recognition method and system based on big data, comprising the following steps: collecting each voice signal frame and its fundamental frequency in a target area; obtaining a first matching index between each voice signal frame and the rising tone, a second matching index with the falling tone, a third matching index with the level tone, and a fourth matching index with the dipping tone; taking the maximum of the first, second, third and fourth matching indices of each voice signal frame as the characteristic value of that frame; obtaining the label of each characteristic value; and inputting the label and characteristic value of each voice signal frame into a trained machine learning model to obtain the voice content of each voice signal frame. The invention addresses the problem of low accuracy in voice content recognition.

Inventors

  • Cai Tie
  • Liu Siping
  • Tang Dingqing

Assignees

  • 广东九四智能科技有限公司 (Guangdong Jiusi Intelligent Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-04-08

Claims (10)

  1. A big data based speech recognition method, comprising: collecting each voice signal frame and its fundamental frequency in a target area; recording the sequence formed by the fundamental frequencies of all voice signal frames within a set neighborhood radius centered on each voice signal frame as the target fundamental frequency sequence; performing first-order differencing on the target fundamental frequency sequence to obtain a differential sequence; acquiring a first matching index between each voice signal frame and the rising tone (yang ping) and a second matching index with the falling tone (qu sheng), wherein the first matching index is positively correlated with the sum of the values exhibiting an ascending trend in the target fundamental frequency sequence and with the sum of the positive numbers in the differential sequence, and inversely correlated with the absolute value of the sum of the negative numbers in the differential sequence, and the second matching index is positively correlated with the sum of the values exhibiting a descending trend in the target fundamental frequency sequence and with the absolute value of the sum of all negative numbers in the differential sequence, and inversely correlated with the sum of all positive numbers in the differential sequence; acquiring a third matching index between each voice signal frame and the level tone (yin ping) and a fourth matching index with the dipping tone (shang sheng), wherein the third and fourth matching indices characterize the degree of fluctuation of each voice signal frame; and taking the maximum of the first, second, third and fourth matching indices of each voice signal frame as the characteristic value of that frame, obtaining the label of each characteristic value, and inputting the label and characteristic value of each voice signal frame into a trained machine learning model to obtain the voice content of each voice signal frame.
  2. The big data based speech recognition method of claim 1, wherein the first matching index of the i-th voice signal frame is determined by: the sum of the values exhibiting an ascending trend in the target fundamental frequency sequence of the i-th voice signal frame, the sum of all positive numbers in the differential sequence of the i-th voice signal frame, and the absolute value of the sum of all negative numbers in the differential sequence of the i-th voice signal frame; the first matching index is positively correlated with the first two quantities and inversely correlated with the third.
  3. The big data based speech recognition method of claim 1, wherein the second matching index of the i-th voice signal frame is determined by: the sum of the values exhibiting a descending trend in the target fundamental frequency sequence of the i-th voice signal frame, the absolute value of the sum of all negative numbers in the differential sequence of the i-th voice signal frame, and the sum of all positive numbers in the differential sequence of the i-th voice signal frame; the second matching index is positively correlated with the first two quantities and inversely correlated with the third.
  4. The big data based speech recognition method of claim 1, wherein the third matching index of the i-th voice signal frame is determined by the mean and the standard deviation of all fundamental frequencies in the target fundamental frequency sequence of the i-th voice signal frame.
  5. The big data based speech recognition method of claim 1, wherein the fourth matching index is obtained as follows: the sequence formed by the fundamental frequencies of all voice signal frames within the set neighborhood radius to the left of each voice signal frame is recorded as the first sequence, and the sequence formed by the fundamental frequencies of all voice signal frames within the set neighborhood radius to the right is recorded as the second sequence; the fourth matching index of the i-th voice signal frame is determined by the sum of the values exhibiting a descending trend in the first sequence of the i-th voice signal frame, the sum of the values exhibiting an ascending trend in the second sequence of the i-th voice signal frame, and a preset hyperparameter.
  6. The big data based speech recognition method of claim 1, wherein obtaining the label of each characteristic value is specifically: taking the characteristic values as the input of the AMPD peak detection algorithm to obtain the peaks among the characteristic values; the label of the characteristic value of a voice signal frame corresponding to a peak is marked as 1, and the label of the characteristic value of a voice signal frame corresponding to a non-peak is marked as 0.
  7. The big data based speech recognition method of claim 1, wherein the machine learning model is a long short-term memory (LSTM) artificial neural network model.
  8. The big data based speech recognition method of claim 1, further comprising training the machine learning model, specifically: inputting a training set into the pre-built machine learning model; during training, computing the loss between the predicted values output by the model and the labels; adjusting the model parameters by gradient descent to minimize the prediction error; and iteratively adjusting the parameters of the machine learning model until the loss is smaller than a set value or a set number of training iterations is reached, thereby obtaining the trained machine learning model.
  9. The big data based speech recognition method of claim 1, wherein each voice signal frame is denoised.
  10. A big data based speech recognition system, comprising a memory and a processor, wherein the memory stores computer program instructions which, when executed by the processor, implement the big data based speech recognition method as claimed in any one of claims 1 to 9.
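As a hedged illustration of claims 1 to 6, the Python sketch below computes the differential sequence, four per-frame matching indices, the characteristic value, and peak-based labels. The patent text here does not reproduce the exact formulas, so the particular combinations used (ratios of trend sums for the rising and falling tones, a mean/standard-deviation ratio for the level tone, a product of left-fall and right-rise sums for the dipping tone) are assumptions for illustration, and a plain local-maximum test stands in for the AMPD peak detector of claim 6.

```python
import numpy as np

def matching_indices(f0, i, r=5, eps=1e-6, alpha=1.0):
    """Assumed matching indices of frame i against the four Mandarin tones.

    f0 : 1-D array of per-frame fundamental frequencies
    r  : set neighborhood radius (in frames); alpha : preset hyperparameter
    The exact patent formulas are not given in this text; these forms only
    combine the quantities named in claims 2-5.
    """
    seq = f0[max(0, i - r): i + r + 1]       # target fundamental frequency sequence
    d = np.diff(seq)                         # first-order differential sequence
    up = seq[1:][d > 0].sum()                # sum of values on an ascending trend
    down = seq[1:][d < 0].sum()              # sum of values on a descending trend
    pos = d[d > 0].sum()                     # sum of positive differences
    neg_abs = abs(d[d < 0].sum())            # |sum of negative differences|
    a1 = up * pos / (neg_abs + eps)          # rising tone (claim 2, assumed form)
    a2 = down * neg_abs / (pos + eps)        # falling tone (claim 3, assumed form)
    a3 = seq.mean() / (seq.std() + eps)      # level tone (claim 4, assumed form)
    left, right = f0[max(0, i - r): i], f0[i + 1: i + r + 1]
    dl, dr = np.diff(left), np.diff(right)
    fall = left[1:][dl < 0].sum() if len(left) > 1 else 0.0
    rise = right[1:][dr > 0].sum() if len(right) > 1 else 0.0
    a4 = alpha * fall * rise                 # dipping tone (claim 5, assumed form)
    return a1, a2, a3, a4

def feature_values(f0, r=5):
    # Claim 1: the characteristic value is the maximum of the four indices.
    return np.array([max(matching_indices(f0, i, r)) for i in range(len(f0))])

def peak_labels(v):
    # Claim 6 uses the AMPD peak detector; a simple local-maximum test
    # stands in for it here (an assumption, not the patented choice).
    lab = np.zeros(len(v), dtype=int)
    lab[1:-1][(v[1:-1] > v[:-2]) & (v[1:-1] > v[2:])] = 1
    return lab
```

On a steadily rising pitch contour the first (rising-tone) index dominates the second, matching the correlations stated in claim 1.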

Description

Voice recognition method and system based on big data

Technical Field

The invention relates to the technical field of voice recognition, and more particularly to a voice recognition method and system based on big data.

Background

With the continuous progress of artificial intelligence technology and the continuous improvement of computing power, speech recognition, as an important means of human-machine interaction, has been widely applied in many fields such as smartphones, smart speakers, vehicle-mounted systems, online customer service, education, and medical assistance. Traditional speech recognition systems mainly rely on the cooperation of an acoustic model, a language model and a pronunciation dictionary, converting speech to text by preprocessing, feature extraction and pattern matching on the input speech signal. However, owing to factors such as language diversity, dialect differences, noise interference and speech-rate fluctuation, the accuracy and robustness of traditional speech recognition in complex scenes still face great challenges. In Chinese speech recognition in particular, Mandarin has four basic tones (level, rising, dipping, and falling); the same syllable may correspond to entirely different Chinese characters depending on its tone, so tone information is important for accurately recognizing text content. Existing speech recognition technology, however, generally treats tone as a subordinate attribute of speech and lacks deep, fine-grained modeling and discrimination of it, so words with similar tones are easily confused during recognition. With the development of big data and machine learning technology, feature learning and modeling based on large-scale speech data has become a trend.
By statistically analyzing a large amount of real speech data, more representative speech feature patterns can be extracted, improving recognition accuracy and system adaptability. Meanwhile, implicit information in big data, such as context and intonation trends, offers a new way to improve the accuracy of tone identification. However, traditional speech recognition methods have obvious limitations in processing Chinese speech: they generally depend on manually annotated speech data or preset fixed recognition rules, which not only increases the complexity of data preparation and model training, but also performs inefficiently on large-scale speech data or real-time speech streams and fails to capture subtle differences and dynamic variation in tone. This leads to recognition errors and, in turn, to low accuracy of voice content recognition.

Disclosure of the Invention

To solve the problem of low accuracy of voice content recognition described in the background above, the present invention provides solutions in the following aspects.
In a first aspect, the invention provides a big data based speech recognition method, comprising: collecting each voice signal frame and its fundamental frequency in a target area; recording the sequence formed by the fundamental frequencies of all voice signal frames within a set neighborhood radius centered on each voice signal frame as the target fundamental frequency sequence; performing first-order differencing on the target fundamental frequency sequence to obtain a differential sequence; obtaining a first matching index between each voice signal frame and the rising tone and a second matching index with the falling tone, wherein the first matching index is positively correlated with the sum of the values exhibiting an ascending trend in the target fundamental frequency sequence and with the sum of the positive numbers in the differential sequence, and inversely correlated with the absolute value of the sum of the negative numbers in the differential sequence, and the second matching index is positively correlated with the sum of the values exhibiting a descending trend in the target fundamental frequency sequence and with the absolute value of the sum of the negative numbers in the differential sequence, and inversely correlated with the sum of the positive numbers in the differential sequence; obtaining a third matching index between each voice signal frame and the level tone and a fourth matching index with the dipping tone, wherein the third and fourth matching indices characterize the degree of fluctuation of each voice signal frame; taking the maximum of the four matching indices of each voice signal frame as the characteristic value of that frame; obtaining the label of each characteristic value; and inputting the label and characteristic value of each voice signal frame into a trained machine learning model to obtain the voice content of each voice signal frame.
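The training procedure of claim 8 (iterate gradient descent until the loss falls below a set value or a set number of iterations is reached) can be sketched as follows. To keep the sketch self-contained, a tiny logistic-regression model stands in for the LSTM of claim 7; the learning rate, loss threshold, and all names are illustrative assumptions, not values from the patent.

```python
import numpy as np

def train_until(X, y, lr=0.1, loss_limit=0.05, max_iters=5000):
    """Gradient-descent loop per claim 8: compute the loss between the
    model's predicted values and the labels, step the parameters downhill,
    and stop when loss < loss_limit or max_iters is reached.
    (Logistic regression stands in for the LSTM; the stopping logic is
    the point being illustrated.)
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted values
        # binary cross-entropy loss between predictions and labels
        loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
        if loss < loss_limit:                    # loss below the set value
            break
        grad = p - y                             # dL/dlogits for this loss
        w -= lr * X.T @ grad / len(y)            # gradient descent step
        b -= lr * grad.mean()
    return w, b, loss
```

On a small linearly separable toy set, the loop drives the loss well below its initial value before hitting the iteration cap.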