CN-116259304-B - Continuous interactive method of voice and related products
Abstract
Embodiments of this application provide a continuous voice interaction method and related products. In the method, a terminal obtains first voice data input by a target object; after the first voice data is recognized and determined to be a wake-up voice, preset data in a cache is sent to the terminal's voice activity detection (VAD) engine. The VAD engine of the terminal continuously monitors and processes the dialogue between the target object and the in-vehicle head unit, and after the VAD engine detects that the target object has finished speaking, the preset data in the cache is sent to the VAD engine again to realize continuous voice interaction. The technical scheme of this application realizes continuous interaction and improves the user experience.
Inventors
- YONG XIAOWEN
Assignees
- 博泰车联网科技(上海)股份有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20211202
Claims (7)
- 1. A continuous voice interaction method, the method comprising the steps of: a terminal obtains first voice data input by a target object, and after the first voice data is recognized and determined to be a wake-up voice, sends preset data in a cache to a voice activity detection (VAD) engine of the terminal; the VAD engine of the terminal continuously monitors and processes the dialogue between the target object and the in-vehicle head unit; after the VAD engine of the terminal detects that the target object has finished speaking, the preset data in the cache is sent to the VAD engine of the terminal again to realize continuous voice interaction; if it is detected that the target object has finished speaking, whether a continuous-voice ending condition is met is detected, and if the continuous-voice ending condition is met, sending the preset data in the cache to the VAD engine of the terminal is stopped so as to stop continuous voice interaction; the detecting whether the continuous-voice ending condition is met specifically includes: after the target object finishes speaking, receiving second voice data input again by the target object, and determining that the continuous-voice ending condition is met when the second voice data is recognized as a specific voice for ending continuous voice interaction; the recognizing whether the second voice data is a specific voice for ending continuous voice interaction specifically comprises: determining x confidence coefficients of x words corresponding to each pronunciation group in the second voice data by using an RNN recognition algorithm, and determining y confidence coefficients of y words corresponding to each pronunciation group by using an LSTM recognition algorithm; the terminal device adds the two confidence coefficients of a same first word appearing in both the x words and the y words of a first pronunciation group to obtain a confidence sum of the first word, traverses all words common to the x words and the y words to obtain a confidence sum for each such word, determines the word with the maximum confidence sum as the word corresponding to the first pronunciation group, traverses all pronunciation groups to obtain the word corresponding to each pronunciation group, forms text data of the second voice data from all the words in time order, and, if the text data matches a specific text, determines that the second voice data is a specific voice for ending continuous voice interaction.
- 2. The method according to claim 1, wherein detecting whether the continuous-voice ending condition is met specifically comprises: after the target object finishes speaking, starting a timer, wherein the timer stops when voice data of the target object is received again; acquiring the first duration of the timer, and determining that the continuous-voice ending condition is met if the first duration is greater than a time threshold.
- 3. The method of claim 1, wherein the data preset in the cache specifically includes: the terminal acquires preset recording data, performs noise reduction and echo cancellation on the recording data to obtain processed data, and stores the processed data into the cache as the preset data.
- 4. The method of claim 1, wherein determining, by using an RNN recognition algorithm on the second voice data, the x confidence coefficients of the x words corresponding to each pronunciation group specifically comprises: St = Xt × W + St-1 × W; Ot = f(St); wherein W represents a weight, Xt-1 represents input data of the input layer of the second voice data at time t-1, Xt represents input data of the input layer at time t, St-1 represents the output of the hidden layer at time t-1, f represents an activation function, and Ot-1 represents the output of the output layer at time t-1; the x confidence coefficients of the x words corresponding to each pronunciation group are determined from the output.
- 5. The method according to claim 4, wherein the activation function specifically comprises: a sigmoid function or a tanh function.
- 6. A continuous voice interaction system, the system comprising: an acquisition unit, configured to acquire first voice data input by a target object; a processing unit, configured to send preset data in a cache to a voice activity detection (VAD) engine of the terminal after the first voice data is recognized and determined to be a wake-up voice; the VAD engine, configured to continuously monitor and process the dialogue between the target object and the in-vehicle head unit; the processing unit is further configured to re-send the preset data in the cache to the VAD engine of the terminal to realize continuous voice interaction; the processing unit is specifically configured to start a timer after the target object finishes speaking, wherein the timer stops when voice data of the target object is received again, to acquire the first duration of the timer, and to determine that the continuous-voice ending condition is met if the first duration is greater than a time threshold; the processing unit is further configured to receive second voice data input again by the target object after the target object finishes speaking, and to determine that the continuous-voice ending condition is met when the second voice data is recognized as a specific voice for ending continuous voice interaction; the processing unit is further configured to determine x confidence coefficients of x words corresponding to each pronunciation group in the second voice data by using an RNN recognition algorithm, determine y confidence coefficients of y words corresponding to each pronunciation group by using an LSTM recognition algorithm, add the two confidence coefficients of a same first word appearing in both the x words and the y words of a first pronunciation group to obtain a confidence sum of the first word, traverse all words common to the x words and the y words to obtain a confidence sum for each such word, determine the word with the maximum confidence sum as the word corresponding to the first pronunciation group, traverse all pronunciation groups to obtain the word corresponding to each pronunciation group, form text data of the second voice data in time order, and determine that the second voice data is a specific voice for ending continuous voice interaction if the text data matches a specific text.
- 7. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-5.
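The two-model fusion in claims 1 and 6 can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the per-word confidences produced by the RNN and LSTM recognizers for each pronunciation group are assumed to be given as dictionaries, and all function names and example values are hypothetical.

```python
def fuse_pronunciation_group(rnn_conf, lstm_conf):
    """For one pronunciation group, sum the confidences of each word that
    appears in both the RNN's x-word list and the LSTM's y-word list,
    then pick the word with the maximum confidence sum."""
    common = rnn_conf.keys() & lstm_conf.keys()
    if not common:
        return None
    sums = {w: rnn_conf[w] + lstm_conf[w] for w in common}
    return max(sums, key=sums.get)

def transcribe(groups, specific_texts):
    """Traverse all pronunciation groups in time order, form the text data
    of the second voice data, and check whether it matches a 'specific
    text' that ends continuous voice interaction."""
    words = [fuse_pronunciation_group(r, l) for r, l in groups]
    text = "".join(w for w in words if w is not None)
    return text, text in specific_texts

# Hypothetical confidences for two pronunciation groups of "停止" ("stop"):
groups = [
    ({"停": 0.6, "听": 0.3}, {"停": 0.5, "厅": 0.4}),
    ({"止": 0.7, "纸": 0.2}, {"止": 0.6, "只": 0.3}),
]
text, should_end = transcribe(groups, specific_texts={"停止"})
print(text, should_end)  # 停止 True
```

Summing the two confidences rewards words on which the RNN and LSTM agree, which is the rationale the claims describe for combining the two recognizers.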
Description
Continuous interactive method of voice and related products

Technical Field

The application relates to the technical field of voice processing, and in particular to a continuous voice interaction method and related products.

Background

Vehicle-mounted voice interaction is of vital importance in the internet of vehicles; operating navigation and listening to multimedia by voice while driving is becoming more and more popular, but existing vehicle-mounted voice interaction cannot realize continuous interaction, which affects the user experience.

Disclosure of Invention

Embodiments of this application disclose a continuous voice interaction method and related products, which can realize continuity of vehicle-mounted voice interaction and improve the user experience.

In a first aspect, a continuous voice interaction method is provided, the method comprising the steps of: a terminal obtains first voice data input by a target object, and after the first voice data is recognized and determined to be a wake-up voice, sends preset data in a cache to a voice activity detection (VAD) engine of the terminal; the VAD engine of the terminal continuously monitors and processes the dialogue between the target object and the in-vehicle head unit; after the VAD engine of the terminal detects that the target object has finished speaking, the preset data in the cache is sent to the VAD engine of the terminal again to realize continuous voice interaction.

Optionally, the method further comprises: if a continuous-voice ending condition is met, stopping sending the preset data in the cache to the VAD engine of the terminal so as to stop continuous voice interaction.
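The first-aspect flow can be sketched as a simple loop. This is a minimal sketch under assumptions: the `VadEngine` class, the tag strings, and the byte-string cache are hypothetical placeholders, and real wake-word recognition and audio capture are replaced by a plain list of utterances.

```python
WAKE_WORD = "hello car"  # hypothetical wake-up phrase

class VadEngine:
    """Stand-in for the terminal's VAD engine; it only records what it is fed."""
    def __init__(self):
        self.fed = []
    def feed(self, data, tag):
        self.fed.append(tag)

def continuous_interaction(utterances, cache, vad, max_turns=3):
    """Sketch of the flow: recognize the wake-up voice, send the cached
    preset data to the VAD engine, and after each finished utterance send
    the preset data again so the engine keeps listening."""
    if not utterances or utterances[0] != WAKE_WORD:
        return vad                        # first voice data is not a wake-up voice
    vad.feed(cache, "preset")             # preset data primes the VAD engine
    for utt in utterances[1:1 + max_turns]:
        vad.feed(utt, "utterance")        # dialogue is monitored and processed
        vad.feed(cache, "preset")         # re-sent after the object finishes speaking
    return vad

vad = continuous_interaction(
    ["hello car", "play music", "next song"], cache=b"\x00" * 320, vad=VadEngine())
print(vad.fed)  # ['preset', 'utterance', 'preset', 'utterance', 'preset']
```

The point of the re-send step is that the engine never falls back to waiting for a new wake-up voice between turns, which is what makes the interaction continuous.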
Optionally, detecting whether the continuous-voice ending condition is met specifically includes: after the target object finishes speaking, starting a timer, wherein the timer stops when voice data of the target object is received again; acquiring the first duration of the timer, and determining that the continuous-voice ending condition is met if the first duration is greater than a time threshold. Optionally, detecting whether the continuous-voice ending condition is met specifically includes: after the target object finishes speaking, receiving second voice data input again by the target object, and determining that the continuous-voice ending condition is met when the second voice data is recognized as a specific voice for ending continuous voice interaction. Optionally, the data preset in the cache specifically includes: the terminal acquires preset recording data, performs noise reduction and echo cancellation on the recording data to obtain processed data, and stores the processed data into the cache as the preset data.
Optionally, determining by recognition of the second voice data that it is a specific voice for ending continuous voice interaction specifically includes: determining x confidence coefficients of x words corresponding to each pronunciation group in the second voice data by using an RNN recognition algorithm, and determining y confidence coefficients of y words corresponding to each pronunciation group by using an LSTM recognition algorithm; the terminal device adds the two confidence coefficients of a same first word appearing in both the x words and the y words of a first pronunciation group to obtain a confidence sum of the first word, traverses all words common to the x words and the y words to obtain a confidence sum for each such word, determines the word with the maximum confidence sum as the word corresponding to the first pronunciation group, traverses all pronunciation groups to obtain the word corresponding to each pronunciation group, forms text data of the second voice data from all the words in time order, and, if the text data matches a specific text, determines that the second voice data is a specific voice for ending continuous voice interaction. Optionally, determining, by using an RNN recognition algorithm on the second voice data, the x confidence coefficients of the x words corresponding to each pronunciation group specifically includes: St = Xt × W + St-1 × W; Ot = f(St); wherein W represents a weight, Xt-1 represents input data of the input layer of the second voice data at time t-1, Xt represents input data of the input layer at time t, St-1 represents the output of the hidden layer at time t-1, f represents an activation function, and Ot-1 represents the output of the output layer at time t-1; the x confidence coefficients of the x words corresponding to each pronunciation group are determined from the output.
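The recurrence St = Xt × W + St-1 × W, Ot = f(St) can be sketched numerically with a single hidden unit. The weight and input values below are made up for illustration; a real recognizer would use learned weight matrices.

```python
import math

def sigmoid(s):
    """Sigmoid activation, one of the two activation functions named above."""
    return 1.0 / (1.0 + math.exp(-s))

def rnn_outputs(inputs, w, f=sigmoid):
    """One-unit sketch of the recurrence: St = Xt*W + St-1*W, Ot = f(St)."""
    s_prev, outputs = 0.0, []
    for x_t in inputs:
        s_t = x_t * w + s_prev * w   # hidden state mixes current input and previous state
        outputs.append(f(s_t))       # output layer applies the activation function
        s_prev = s_t
    return outputs

outs = rnn_outputs([1.0, 0.5, -0.2], w=0.3)
print([round(o, 3) for o in outs])
```

Because St depends on St-1, each output reflects the whole input history, which is what lets the recognizer score words over a sequence of pronunciation frames rather than one frame at a time.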
Optionally, the activation function specifically includes: a sigmoid function or a tanh function. In a second aspect, a continuous voice interaction system is provided, the system comprising: an acquisition unit, configured to acquire first voice