CN-122024724-A - Accompanying robot voice intelligent recognition system based on machine learning

Abstract

The invention discloses a machine learning-based accompanying robot voice intelligent recognition system comprising a voice input processing module, an accompanying interaction state processing module, a consistency inference module, a consistency recognition module, a structure consistency state generation module, a voice recognition trigger judging module, and a voice recognition execution module. The voice input processing module generates a voice input sequence. The accompanying interaction state processing module collects accompanying interaction state information and generates an accompanying interaction state sequence. The consistency inference module inputs the accompanying interaction state sequence into an improved AIN and generates a voice consistency prediction result. The consistency recognition module generates a voice consistency recognition result. The structure consistency state generation module generates a structure consistency state result. The voice recognition trigger judging module generates a voice recognition trigger instruction. The voice recognition execution module executes voice recognition processing when the voice recognition trigger instruction is generated. By controlling the voice recognition trigger through consistency analysis of the accompanying interaction structure, the invention performs recognition on demand, with low resource consumption and high interaction stability.
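
The data flow summarized above, from two time-ordered sequences to a per-window difference, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration only: the source does not define the improved AIN, so `predict` is a hypothetical stand-in passed in by the caller; the scalar `value` field, the fixed window length `window_s`, and the use of an absolute difference of window means are likewise not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Sample:
    t: float      # timestamp in seconds
    value: float  # hypothetical scalar feature (voice activity or encoded state)


def split_into_windows(samples: Sequence[Sample], window_s: float) -> Dict[int, List[float]]:
    """Order samples by time and group them by window index floor(t / window_s)."""
    windows: Dict[int, List[float]] = {}
    for s in sorted(samples, key=lambda s: s.t):
        windows.setdefault(int(s.t // window_s), []).append(s.value)
    return windows


def window_differences(
    voice: Sequence[Sample],
    states: Sequence[Sample],
    predict: Callable[[List[float]], float],  # stand-in for the improved AIN (assumption)
    window_s: float,
) -> Dict[int, float]:
    """Per-window absolute difference between observed voice and the prediction."""
    voice_w = split_into_windows(voice, window_s)
    state_w = split_into_windows(states, window_s)
    diffs: Dict[int, float] = {}
    # Only windows present in both sequences can be compared.
    for idx in sorted(voice_w.keys() & state_w.keys()):
        observed = sum(voice_w[idx]) / len(voice_w[idx])
        diffs[idx] = abs(observed - predict(state_w[idx]))
    return diffs
```

In this sketch a small difference means the observed voice input matches what the interaction state leads the predictor to expect; how the actual system maps the difference to a recognition result is not specified in the text reproduced here.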

Inventors

GE JIAYIN
LI CHEN
BAI CHANGXU

Assignees

杭州极视科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260309

Claims (8)

1. A machine learning-based accompanying robot voice intelligent recognition system, characterized by comprising: a voice input processing module, used for collecting voice input during operation of the accompanying robot and generating a voice input sequence arranged in time order; an accompanying interaction state processing module, used for collecting accompanying interaction state information and generating an accompanying interaction state sequence aligned by time index; a consistency inference module, used for inputting the accompanying interaction state sequence into an improved AIN, executing accompanying consistency inference processing, and generating a voice consistency prediction result corresponding to a preset time window; a consistency recognition module, used for calculating the difference between the voice input sequence and the voice consistency prediction result within the corresponding preset time window to generate a voice consistency recognition result; a structure consistency state generation module, used for performing time aggregation processing on the voice consistency recognition results in a plurality of consecutive preset time windows to generate a structure consistency state result; a voice recognition trigger judging module, used for comparing the structure consistency state result with a voice recognition trigger threshold and generating a voice recognition trigger instruction when the recognition judgment result meets the voice recognition trigger condition; and a voice recognition execution module, used for executing voice recognition processing on the voice input sequence when the voice recognition trigger instruction is generated.
2. The machine learning-based accompanying robot voice intelligent recognition system according to claim 1, wherein the modules are implemented as follows: collecting voice input and accompanying interaction state information during operation of the accompanying robot, generating a voice input sequence by arranging the voice input in time order, and generating an accompanying interaction state sequence by aligning the accompanying interaction state information by time index; inputting the accompanying interaction state sequence into the improved AIN, executing accompanying consistency inference processing, and generating a voice consistency prediction result corresponding to a preset time window; calculating the difference between the voice input sequence and the voice consistency prediction result within the corresponding time window to generate a voice consistency recognition result; performing time aggregation processing on the voice consistency recognition results in a plurality of consecutive time windows to generate a structure consistency state result; comparing the structure consistency state result with a preset voice recognition trigger threshold to judge whether the voice recognition processing flow needs to be entered, and generating a voice recognition trigger instruction when the recognition judgment result meets the voice recognition trigger condition; and, when the voice recognition trigger instruction is generated, performing voice recognition processing on the voice input sequence.
3. The machine learning-based accompanying robot voice intelligent recognition system according to claim 2, wherein the accompanying interaction state information specifically comprises: interaction stage identification information of the current accompanying interaction; current output behavior state information of the accompanying robot; output content type information corresponding to the current accompanying interaction; interaction rhythm state information between the accompanying robot and the user; and interaction context state information of the accompanying interaction process.
4. The machine learning-based accompanying robot voice intelligent recognition system according to claim 2, wherein the generation of the voice consistency prediction result specifically comprises: during operation of the accompanying robot, performing time segmentation on the accompanying interaction state sequence according to preset time windows to form an accompanying interaction state subsequence for each preset time window; recombining the accompanying interaction state subsequences in time index order to generate an accompanying interaction state input sequence; inputting the accompanying interaction state input sequence into the improved AIN and executing continuous state evolution on it in the time dimension to generate an accompanying interaction evolution state sequence corresponding to each time index; based on the accompanying interaction evolution state sequence, performing, in the improved AIN, consistency inference on the association relations among the accompanying interaction states at different time indexes within the same preset time window to generate accompanying interaction consistency candidate results; performing consistency integration on the accompanying interaction consistency candidate results according to the time window range to form an accompanying consistency intermediate result corresponding to the preset time window; executing time stability constraint processing on the accompanying consistency intermediate result to generate an accompanying consistency characterization; and performing consistency mapping based on the accompanying consistency characterization to obtain the voice consistency prediction result.
5. The machine learning-based accompanying robot voice intelligent recognition system according to claim 2, wherein the generation of the voice consistency recognition result specifically comprises: during operation of the accompanying robot, intercepting from the voice input sequence the voice input subsequence corresponding to each preset time window; time-aligning the voice input subsequences in time index order to generate a window voice input sequence with consistent time ordering; within the same preset time window, performing time semantic alignment between the window voice input sequence and the voice consistency prediction result to form a voice consistency comparison object; based on the voice consistency comparison object, calculating the difference between the voice behavior characteristics of the window voice input sequence in each preset time window and the consistency state represented by the voice consistency prediction result, to generate a voice consistency difference characterization for each preset time window; performing difference integration on the voice consistency difference characterizations within the preset time window range to generate a voice consistency difference result; and performing consistency discrimination based on the voice consistency difference result to form the voice consistency recognition result.
6. The machine learning-based accompanying robot voice intelligent recognition system according to claim 2, wherein the generation of the structure consistency state result specifically comprises: acquiring the voice consistency recognition results in time index order and associating each voice consistency recognition result with its preset time window to form a voice consistency recognition result sequence; taking the current preset time window as a reference, selecting from the voice consistency recognition result sequence a plurality of consecutive time windows adjacent to the current preset time window, thereby determining a time window set for time aggregation; extracting from the voice consistency recognition result sequence the voice consistency recognition results corresponding to the time window set to form an aggregation input sequence arranged in time order; combining and summarizing the voice consistency recognition results in the aggregation input sequence to obtain an aggregation result; based on the aggregation result, summarizing the voice consistency states over the plurality of consecutive time windows and determining the consistency state value corresponding to the current preset time window; associating the consistency state value with the corresponding preset time window to generate a structure consistency state identifier; and outputting the structure consistency state identifiers in time index order to form the structure consistency state result.
7. The machine learning-based accompanying robot voice intelligent recognition system according to claim 2, wherein the generation of the voice recognition trigger instruction specifically comprises: obtaining the structure consistency state results in time index order and associating each structure consistency state result with its preset time window to form a structure consistency state sequence; configuring a preset voice recognition trigger threshold for each preset time window and associating each threshold with its preset time window to form a voice recognition trigger threshold set; for each preset time window, selecting the corresponding structure consistency state result and obtaining the preset voice recognition trigger threshold corresponding to that window; comparing the structure consistency state result with the corresponding preset voice recognition trigger threshold to obtain a recognition judgment result for the current preset time window; and, when the recognition judgment result meets the voice recognition trigger condition, generating a voice recognition trigger instruction corresponding to the preset time window.
8. The machine learning-based accompanying robot voice intelligent recognition system according to claim 2, wherein the voice recognition processing specifically comprises: when a voice recognition trigger instruction is generated, acquiring the preset time window information corresponding to the voice recognition trigger instruction; selecting from the voice input sequence, according to the preset time window information, the voice input subsequence corresponding to the preset time window to form a voice input sequence to be recognized; dividing the voice input sequence to be recognized into voice segments in time order to generate a voice segment sequence; performing voice quality adjustment on the voice segment sequence to generate a standardized voice segment sequence that meets the voice recognition input requirements; extracting voice features for voice recognition from the standardized voice segment sequence to generate a voice feature sequence; and generating corresponding voice recognition intermediate results from the voice feature sequence and arranging them in time order to form a voice recognition result.
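
The aggregation and per-window threshold comparison recited in claims 6 and 7 can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: recognition results are assumed to be numeric scores where a higher value favors triggering, the "adjacent" windows are assumed to be the current window plus the ones immediately before it, the aggregation is assumed to be an arithmetic mean, and the trigger condition is assumed to be "state value at or above the threshold". The names `aggregate_recent`, `trigger_windows`, `span`, and `default_threshold` are hypothetical.

```python
from typing import Dict, List


def aggregate_recent(results: Dict[int, float], current: int, span: int) -> float:
    """Mean of the results for `current` and up to span-1 preceding windows.

    Assumption: a causal window set and a mean; the claims only require
    'combining and summarizing' results from adjacent consecutive windows.
    """
    idxs = [i for i in range(current - span + 1, current + 1) if i in results]
    if not idxs:
        raise ValueError(f"no recognition result available near window {current}")
    return sum(results[i] for i in idxs) / len(idxs)


def trigger_windows(
    results: Dict[int, float],     # window index -> voice consistency recognition result
    thresholds: Dict[int, float],  # window index -> preset trigger threshold
    span: int,
    default_threshold: float,
) -> List[int]:
    """Return the window indexes for which a trigger instruction would be generated."""
    triggered: List[int] = []
    for idx in sorted(results):
        state = aggregate_recent(results, idx, span)
        # Assumed trigger condition: aggregated state reaches the window's threshold.
        if state >= thresholds.get(idx, default_threshold):
            triggered.append(idx)
    return triggered
```

A caller would then run voice recognition only on the voice input subsequences of the returned windows, which is the on-demand behavior the claims describe; whether the real trigger fires above or below the threshold is not stated in the claims and would need to be confirmed against the full description.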

Description

Accompanying robot voice intelligent recognition system based on machine learning

Technical Field

The invention relates to the field of intelligent voice recognition, and in particular to a machine learning-based accompanying robot voice intelligent recognition system.

Background

Conventional accompanying robots generally treat voice recognition as a default processing flow: once voice input is detected, they directly enter voice recognition processing to obtain user instructions or interactive content. The related technology focuses on acoustic feature analysis of voice signals and on improving recognition accuracy, for example by optimizing voice segmentation, feature extraction, or recognition model parameters. However, it generally assumes that any voice input carries a clear interaction semantic requirement, and it lacks a comprehensive judgment of the overall state and context structure of the accompanying interaction. In actual accompanying interaction, environmental noise, non-directed vocalization, emotional speech, and voice input occurring while the accompanying robot is in a non-directional stage are all frequent. The prior art cannot distinguish whether such voice input needs to enter the voice recognition processing flow, so voice recognition is easily triggered frequently and to no effect. This wastes system computing resources, increases energy consumption, degrades the continuity and stability of the accompanying interaction, and makes it difficult to meet the requirement that an accompanying robot operate stably over the long term.

Disclosure of Invention

The invention aims to provide a machine learning-based accompanying robot voice intelligent recognition system that controls voice recognition triggering based on consistency analysis of the accompanying interaction structure, so as to realize on-demand recognition with low resource consumption and high interaction stability.

According to an embodiment of the invention, the machine learning-based accompanying robot voice intelligent recognition system comprises: a voice input processing module, used for collecting voice input during operation of the accompanying robot and generating a voice input sequence arranged in time order; an accompanying interaction state processing module, used for collecting accompanying interaction state information and generating an accompanying interaction state sequence aligned by time index; a consistency inference module, used for inputting the accompanying interaction state sequence into an improved AIN, executing accompanying consistency inference processing, and generating a voice consistency prediction result corresponding to a preset time window; a consistency recognition module, used for calculating the difference between the voice input sequence and the voice consistency prediction result within the corresponding preset time window to generate a voice consistency recognition result; a structure consistency state generation module, used for performing time aggregation processing on the voice consistency recognition results in a plurality of consecutive preset time windows to generate a structure consistency state result; a voice recognition trigger judging module, used for comparing the structure consistency state result with a voice recognition trigger threshold and generating a voice recognition trigger instruction when the recognition judgment result meets the voice recognition trigger condition; and a voice recognition execution module, used for executing voice recognition processing on the voice input sequence when the voice recognition trigger instruction is generated.

Optionally, the modules are implemented as follows: collecting voice input and accompanying interaction state information during operation of the accompanying robot, generating a voice input sequence by arranging the voice input in time order, and generating an accompanying interaction state sequence by aligning the accompanying interaction state information by time index; inputting the accompanying interaction state sequence into the improved AIN, executing accompanying consistency inference processing, and generating a voice consistency prediction result corresponding to a preset time window; calculating the difference between the voice input sequence and the voice consistency prediction result within the corresponding time window to generate a voice consistency recognition result; performing time aggregation processing on the voice consistency recognition results in a plurality of consecutive time windows to generate a structure consistency state result; comparing the structure consistency state result with a preset voice recognition trigger threshold to judge whether the voice recognition processing flow needs to be entered, and generating a voice rec