CN-122024736-A - Response acceleration method and equipment for half-duplex voice dialogue system

Abstract

The invention relates to the technical field of voice dialogue interaction and discloses a response acceleration method and response acceleration equipment for a half-duplex voice dialogue system. The method comprises: continuously obtaining text frames output by automatic speech recognition (ASR) and calculating the similarity between the recognized texts of adjacent text frames; when the similarity is greater than a similarity stability threshold within a first threshold time, recording the stable text and generating a pre-request carrying a unique request identifier; performing language model inference on the recognized text in the pre-request while simultaneously performing silence detection on the user's voice signal; continuously monitoring the contrast similarity between new text frames and the stable text; when the contrast similarity falls below a preset threshold, judging that the stable text has become invalid and cancelling the pre-request; and when the silence duration reaches a second threshold and the stable text is still valid, outputting the inference result of the pre-request. The voice activity detection (VAD) waiting time and the large language model (LLM) inference time thereby overlap, which reduces the response delay of the system and improves the smoothness of interaction.
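The pre-request lifecycle described in the abstract can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the class and method names (Controller, PreRequest, on_text_frame, on_silence), all threshold values, and the use of difflib.SequenceMatcher as a stand-in for the similarity score are assumptions made for illustration. The stability rule used here (adjacent frames remaining similar for the whole first threshold time) is one possible reading of the text, and language model inference itself is left as a comment.

```python
import itertools
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; NOT the weighted multi-level score of the source."""
    return SequenceMatcher(None, a, b).ratio()


@dataclass
class PreRequest:
    request_id: int
    text: str
    state: str = "active"  # "active", "cancelled" or "final"


class Controller:
    """Tracks ASR text frames and manages speculative pre-requests."""

    def __init__(self, stability_threshold: float = 0.9,
                 first_threshold_ms: int = 300,
                 invalidation_threshold: float = 0.8,
                 second_threshold_ms: int = 1400) -> None:
        # All default values are illustrative assumptions.
        self.stability_threshold = stability_threshold
        self.first_threshold_ms = first_threshold_ms
        self.invalidation_threshold = invalidation_threshold
        self.second_threshold_ms = second_threshold_ms
        self.requests: List[PreRequest] = []
        self._ids = itertools.count(1)            # unique request identifiers
        self._active: Optional[PreRequest] = None
        self._prev: Optional[Tuple[int, str]] = None  # (time_ms, text) of last frame
        self._stable_since: Optional[int] = None

    def _launch(self, text: str) -> None:
        self._active = PreRequest(next(self._ids), text)
        self.requests.append(self._active)
        # A real system would start LLM inference for this request here.

    def on_text_frame(self, t_ms: int, text: str) -> bool:
        """Feed one ASR frame; return True if the silence timer must be reset."""
        reset_silence = False
        if self._active is not None:
            # Compare the new frame with the recorded stable text.
            if similarity(text, self._active.text) < self.invalidation_threshold:
                self._active.state = "cancelled"
                self._launch(text)  # the new frame becomes the new stable text
                reset_silence = True
        elif (self._prev is not None
              and similarity(self._prev[1], text) > self.stability_threshold):
            if self._stable_since is None:
                self._stable_since = self._prev[0]
            if t_ms - self._stable_since >= self.first_threshold_ms:
                self._launch(text)
        else:
            self._stable_since = None
        self._prev = (t_ms, text)
        return reset_silence

    def on_silence(self, silence_ms: int) -> Optional[PreRequest]:
        """Return the final request once silence reaches the second threshold."""
        if self._active is None or silence_ms < self.second_threshold_ms:
            return None
        final, self._active = self._active, None
        final.state = "final"
        self._prev = self._stable_since = None
        return final
```

In this sketch, a caller feeds each ASR frame to on_text_frame and the running silence duration to on_silence. When on_silence returns a request, the inference result already computed for that request would be sent on to speech synthesis.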

Inventors

GU JIAWEI
ZHU ZETONG
SUN BIN
GUO ZIHAO
MA HUAIZHI

Assignees

上海灵宇宙科技发展有限公司

Dates

Publication Date: 20260512
Application Date: 20260312

Claims (10)

1. A response acceleration method for a half-duplex voice dialogue system, the method comprising: continuously acquiring a plurality of text frames output by automatic speech recognition, wherein each text frame comprises a recognized text; calculating the similarity between the recognized texts of each pair of adjacent text frames; when the similarity is greater than a similarity stability threshold within a first threshold time, recording the recognized text at the trigger time as a stable text, and generating a pre-request carrying a unique request identifier; performing language model inference on the recognized text in the pre-request; performing silence detection on the user voice signal while the language model inference is performed on the recognized text in the pre-request, and recording a silence duration; before the silence duration reaches a second threshold, comparing the recognized text of each newly acquired text frame with the stable text, and calculating a contrast similarity; if the contrast similarity falls below a preset threshold before the silence duration reaches the second threshold, setting the pre-request to a cancelled state, resetting the silence duration, and taking the recognized text of the newly acquired text frame as a new stable text to generate a new pre-request for language model inference; and if the silence duration reaches the second threshold, confirming that the pre-request currently in the active state is the final request, and outputting the inference result corresponding to the final request.
2. The method according to claim 1, wherein calculating the similarity between the recognized texts of each pair of adjacent text frames comprises: extracting the recognized texts of two adjacent text frames and recording them as a first text and a second text; calculating the minimum number of character editing operations required to convert the first text into the second text, and subtracting from 1 the quotient of the number of editing operations and the greater of the two text lengths in characters, to obtain a character-level similarity; performing word segmentation on the first text and the second text respectively and filtering stop words to obtain a first core word set and a second core word set, and dividing the number of elements in the intersection of the two sets by the number of elements in their union to obtain a word-level similarity; inputting the first text and the second text respectively into a semantic coding model to extract semantic vectors, applying L2 normalization to the semantic vectors, and then calculating their dot product to obtain a semantic-level similarity; obtaining the confidence scores of the current text frame and a plurality of previous text frames to form a confidence sequence, calculating a short-term variance, a medium-term variance, a long-term variance and a plateau flatness of the confidence sequence, and weighting them according to preset weights to obtain a confidence stability score; and performing a weighted summation of the character-level similarity, the word-level similarity, the semantic-level similarity and the confidence stability score according to preset weights to obtain the similarity.
3. The method according to claim 2, wherein calculating the minimum number of character editing operations required to convert the first text into the second text specifically comprises: constructing a two-dimensional matrix, wherein the rows of the matrix correspond to character positions of the first text, the columns correspond to character positions of the second text, and each matrix element represents the edit distance between the corresponding character positions; starting from the initial position of the matrix, selecting the editing operation cost according to whether the characters are the same, and filling the matrix by recursive calculation; terminating the calculation and judging the texts to be dissimilar when the accumulated edit distance during the calculation exceeds a preset proportion of the greater of the two text lengths in characters; and extracting the element at the final position of the matrix as the number of editing operations.
4. The method according to claim 2, wherein: the short-term variance is the unbiased variance of the confidence scores of the most recent first preset number of text frames; calculating the medium-term variance comprises performing a linear fit on the confidence scores of the most recent second preset number of text frames, calculating the residuals between the actual confidence scores and the fitted values, and calculating the variance of the residuals; calculating the long-term variance comprises applying high-pass filtering to the confidence scores of the most recent third preset number of text frames, and calculating the variance of the filtered values; and calculating the plateau flatness comprises detecting, in the confidence sequence, plateau segments in which the absolute value of the first-order difference is smaller than a preset absolute value threshold for at least a preset number of consecutive frames, calculating the ratio of the variance within the plateau segments to the global variance, and subtracting the ratio from 1 to obtain the plateau flatness.
5. The method according to claim 2, wherein inputting the first text and the second text respectively into the semantic coding model to extract semantic vectors specifically comprises: performing sub-word tokenization on the text using a tokenizer, adding a [CLS] token at the start position and a [SEP] token at the end position to generate a token sequence; inputting the token sequence into the semantic coding model, and extracting the vector at the [CLS] token position in the hidden state of the last Transformer layer as an initial semantic vector; extracting the vectors at the [CLS] token position in the hidden states of the last three Transformer layers and computing their weighted average to obtain an enhanced semantic vector; and applying L2 normalization to the enhanced semantic vector to obtain the semantic vector.
6. The method according to claim 1, wherein before calculating the similarity between the recognized texts of each pair of adjacent text frames, the method further comprises: identifying the scene type of the current dialogue; when the scene type is a command control scene, setting the similarity stability threshold to a first stability threshold, and setting the number of consecutive stable frames to a corresponding first preset frame number; when the scene type is a natural dialogue scene, setting the similarity stability threshold to a value between the first stability threshold and a second stability threshold, setting the number of consecutive stable frames to a corresponding second preset frame number, and allowing synonym matching when calculating the word-level similarity; and when the scene type is a dictation transcription scene, increasing the weight of punctuation stability when calculating the similarity.
7. The method according to claim 1, wherein the method further comprises: counting the number of pre-requests marked as being in the cancelled state within a preset time window, and recording it as the cancellation count; counting the total number of pre-requests generated within the preset time window, and recording it as the total count; dividing the cancellation count by the total count to obtain a cancellation rate; and when the cancellation rate is greater than a preset cancellation rate threshold, increasing the first threshold time by a preset adjustment duration.
8. A response accelerating device of a half-duplex speech dialog system, characterized in that the response accelerating device of the half-duplex speech dialog system comprises one or more processors and a memory, the memory being coupled with the one or more processors, the memory being for storing computer program code comprising computer instructions, the one or more processors invoking the computer instructions to cause the response accelerating device of the half-duplex speech dialog system to perform the method of any of claims 1-7.
9. A computer readable storage medium comprising instructions which, when run on a response acceleration device of a half-duplex speech dialog system, cause the response acceleration device of the half-duplex speech dialog system to perform the method of any of claims 1-7.
10. A computer program product, characterized in that the computer program product, when run on a response acceleration device of a half-duplex speech dialog system, causes the response acceleration device of the half-duplex speech dialog system to perform the method of any of claims 1-7.
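
The character-level and word-level similarity of claims 2 and 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the function names, the 0.5 default cut-off ratio, the whitespace tokenizer and the small stop-word list are hypothetical, and the semantic-level similarity and confidence stability score are omitted. The early-exit test uses the row minimum of the matrix as the "accumulated edit distance", which is one possible reading of claim 3.

```python
from typing import Optional, Set

# Hypothetical stop-word list; a real system would use a language-specific one.
STOP_WORDS = frozenset({"the", "a", "an", "to", "of", "please"})


def bounded_edit_distance(a: str, b: str, max_ratio: float = 0.5) -> Optional[int]:
    """Levenshtein distance via a 2-D matrix; None when the texts are judged dissimilar.

    The calculation stops early once every entry of the current row exceeds
    max_ratio * max(len(a), len(b)), because the final distance cannot be
    smaller than the minimum of any row.
    """
    limit = max_ratio * max(len(a), len(b))
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j  # inserting j characters
    for i in range(1, rows):
        d[i][0] = i  # deleting i characters
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
        if min(d[i]) > limit:
            return None  # early termination: judged dissimilar
    return d[-1][-1] if d[-1][-1] <= limit else None


def char_similarity(a: str, b: str, max_ratio: float = 0.5) -> float:
    """1 - edit_distance / max(len(a), len(b)); 0.0 when judged dissimilar."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    dist = bounded_edit_distance(a, b, max_ratio)
    return 0.0 if dist is None else 1.0 - dist / longest


def core_words(text: str) -> Set[str]:
    """Whitespace tokenization plus stop-word filtering (an assumption)."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def word_similarity(a: str, b: str) -> float:
    """Jaccard index of the two core word sets: |intersection| / |union|."""
    sa, sb = core_words(a), core_words(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0
```

For example, "turn on the light" and "turn on the lights" differ by one edit over 18 characters, giving a character-level similarity of 17/18, while their core word sets {turn, on, light} and {turn, on, lights} share two of four distinct words, giving a word-level similarity of 0.5.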

Description

Response acceleration method and equipment for half-duplex voice dialogue system

Technical Field

The application relates to the technical field of human-machine dialogue, and in particular to a response acceleration method and response acceleration equipment for a half-duplex voice dialogue system.

Background

A voice dialogue system implements human-machine interaction through modules such as voice activity detection (VAD), automatic speech recognition (ASR), large language model (LLM) inference, and text-to-speech (TTS). Depending on the interaction mode, voice dialogue systems are divided into full-duplex and half-duplex systems. The full-duplex mode allows the user to interrupt the system's reply at any time, so the interaction feels natural, but false triggering occurs easily in noisy environments, the system logic is complex, and resource consumption is high. The half-duplex mode uses a strict "listen-talk" alternating mechanism: the VAD module must detect continuous silence (typically 1 to 2 seconds) before deciding that the user has finished speaking, and only then are ASR recognition and LLM inference started in sequence.

In some prior art, optimization schemes for half-duplex systems mainly include dynamic adjustment of the VAD timeout and end-to-end model-based prediction of the sentence end point. Dynamic VAD schemes shorten the VAD timeout by detecting a drop in speech energy or specific stop words, but to avoid truncating the user's normal sentences, the time that can be saved is very limited, typically less than 300 milliseconds. End-to-end prediction schemes try to predict the sentence end point directly, but they require a large amount of annotated training data, are costly to deploy, and have unstable prediction accuracy in changing environments, which easily causes the system to answer prematurely or to cut off the user.

The processing flow of a half-duplex system is serial in nature: the VAD module must wait for the silence timeout to confirm that the user has finished speaking, and ASR recognition and LLM inference can only be performed in sequence after the VAD confirmation. Taking typical parameters as an example, the VAD wait takes about 1.4 seconds, ASR processing takes about 0.2 seconds, and LLM inference takes about 0.7 seconds, so the user waits about 2.3 seconds (1.4 + 0.2 + 0.7) from finishing speaking to hearing the system reply. In this serial flow, the system is idle during the VAD waiting period, while LLM inference is a major component of the response delay; chaining the two in series makes the waiting time perceived by the user too long and impairs the fluency of interaction.

Disclosure of Invention

The application provides a response acceleration method and response acceleration equipment for a half-duplex voice dialogue system, which are used to improve the response speed of the voice dialogue system.

The application provides a response acceleration method for a half-duplex voice dialogue system, which comprises: continuously obtaining a plurality of text frames output by automatic speech recognition; calculating the similarity between the text frames; when the similarity is greater than a similarity stability threshold within a first threshold time, recording the text frame at the trigger time as a stable text and generating a pre-request carrying a unique request identifier; performing language model inference on the text frame in the pre-request; performing silence detection on the user voice signal while the language model inference is performed, and recording a silence duration; before the silence duration reaches a second threshold, comparing each newly obtained text frame with the stable text and calculating the similarity; if the similarity falls below a preset threshold before the silence duration reaches the second threshold, setting the pre-request to a cancelled state, resetting the silence duration, and taking the newly obtained text frame as a new stable text to generate a new pre-request for language model inference; and if the silence duration reaches the second threshold, confirming the pre-request currently in the active state as the final request and outputting the inference result corresponding to the final request.

In this embodiment, the similarity between adjacent text frames output by the ASR is calculated continuously, and LLM inference is triggered as soon as the recognized text is detected to be stable within the first threshold time. The inference process, which originally could only be executed after VAD confirmation, is thus moved forward to run in parallel with the VAD silence detection period, breaking the serial "wait-recognize-infer" chain. Meanwhile, by continuously monitoring the contrast similarity of the new text frame and the recorded stable text during the subsequent sil