CN-122001982-A - Outbound real-time interrupt processing method and device, computer equipment and storage medium

Abstract

The invention discloses an outbound real-time interrupt processing method, an outbound real-time interrupt processing device, computer equipment and a storage medium. The method comprises: while a preset script is being played, monitoring an audio stream and converting it into a text stream in real time through VAD and ASR services; inputting the processed text stream into a stream interruption model and judging whether interruption is supported; if interruption is supported, stopping playback of the preset script and switching to a waiting/recognition mode when the confidence of the stream interruption model exceeds a threshold; if interruption is not supported, continuing to monitor the audio stream without interrupting playback until the VAD confirms that the user has finished speaking; acquiring the whole-segment dialogue content of the round; inputting that content into a whole-segment hang-up model and a high-risk model for posterior judgment and performing the corresponding processing; and temporarily storing the whole-segment dialogue content of the round. The method ensures high-precision intent recognition while providing a low-latency response, and effectively reduces unnecessary false triggers.

Inventors

XU XINGBIAO
GAO PENG
YUAN LAN
QIAN LEI
CHE CAIDE

Assignees

杭州摸象大数据科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260410

Claims (10)

1. An outbound real-time interrupt processing method, characterized by comprising the following steps: while a preset script is being played, monitoring an audio stream and converting it into a text stream in real time through VAD and ASR services; inputting the processed text stream into a stream interruption model for node configuration analysis to obtain an analysis result; judging whether the analysis result indicates that interruption is supported; if the analysis result indicates that interruption is supported, stopping playback of the preset script and switching to a waiting/recognition mode when the confidence of the stream interruption model exceeds a threshold, until the VAD confirms that the user has finished speaking; if the analysis result indicates that interruption is not supported, continuing to monitor the audio stream without interrupting playback until the VAD confirms that the user has finished speaking; acquiring the whole-segment dialogue content of the round; inputting the whole-segment dialogue content of the round into a whole-segment hang-up model and a high-risk model for posterior judgment to obtain a posterior judgment result; and performing corresponding processing according to the posterior judgment result, and temporarily storing the whole-segment dialogue content of the round.
2. The outbound real-time interrupt processing method according to claim 1, wherein monitoring the audio stream and converting it into a text stream in real time through the VAD and ASR services while the preset script is being played comprises: while the preset script is being played, using the VAD service to monitor the audio stream in real time so as to determine the start and end time points of the user's speech; and using the ASR service to convert the audio signal into a text stream, wherein each time a new streaming result is received, the newly obtained text segment is appended to the text stream to update the text stream.
3. The outbound real-time interrupt processing method according to claim 1, wherein the stream interruption model is obtained by training a text classification model on samples obtained by processing text streams labeled with whether to interrupt, and N-gram features are introduced during processing of the text streams to capture local word order information.
4. The outbound real-time interrupt processing method according to claim 1, wherein the whole-segment hang-up model is obtained by training a text classification model on samples obtained by processing text streams labeled with whether to hang up, and N-gram features are introduced during processing of the text streams to capture local word order information.
5. The outbound real-time interrupt processing method according to claim 1, wherein the high-risk model is obtained by training a text classification model on samples obtained by processing text streams labeled with specified key content in the user's speech, and N-gram features are introduced during processing of the text streams to capture local word order information.
6. The outbound real-time interrupt processing method according to claim 3, wherein inputting the processed text stream into the stream interruption model for node configuration analysis to obtain an analysis result comprises: constructing a vocabulary based on ASR-transcribed text, splitting the text stream into words and N-gram features, and converting the words and N-gram features into integer IDs through index mapping to form an input sequence; and inputting the input sequence into the stream interruption model for node configuration analysis to obtain the analysis result.
7. The outbound real-time interrupt processing method according to claim 6, wherein inputting the input sequence into the stream interruption model for node configuration analysis to obtain the analysis result comprises: in an input layer of the stream interruption model, converting the input sequence into corresponding vector representations by using a predefined embedding matrix; in a hidden layer of the stream interruption model, applying a global averaging operation to all vectors provided by the input layer, that is, summing the vectors and then averaging them, to generate a fixed-length sentence vector that contains the semantic information of the words and local word order features; and in an output layer of the stream interruption model, receiving the sentence vector generated by the hidden layer, generating a probability distribution over the categories through a linear transformation followed by a Softmax function, and determining that the analysis result is an intent that supports interruption when the predicted probability of a specified category exceeds a set threshold.
8. An outbound real-time interrupt processing device, characterized by comprising: a conversion unit, configured to monitor an audio stream and convert it into a text stream in real time through VAD and ASR services while a preset script is being played; a node analysis unit, configured to input the processed text stream into a stream interruption model for node configuration analysis to obtain an analysis result; a judging unit, configured to judge whether the analysis result indicates that interruption is supported; a node interruption processing unit, configured to, if the analysis result indicates that interruption is supported, stop playback of the preset script and switch to a waiting/recognition mode when the confidence of the stream interruption model exceeds a threshold, until the VAD confirms that the user has finished speaking; a continuous monitoring unit, configured to, if the analysis result indicates that interruption is not supported, monitor the audio stream without interrupting playback until the VAD confirms that the user has finished speaking; a whole-segment content acquisition unit, configured to acquire the whole-segment dialogue content of the round; a posterior judgment unit, configured to input the whole-segment dialogue content of the round into a whole-segment hang-up model and a high-risk model for posterior judgment to obtain a posterior judgment result; and a temporary storage unit, configured to perform corresponding processing according to the posterior judgment result and temporarily store the whole-segment dialogue content of the round.
9. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-7.
10. A storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.
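
Illustrative note (not part of the claims): the feature pipeline and three-layer classifier recited in claims 6 and 7 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function and class names, the use of bigrams, the `<unk>` ID, the embedding size, the random untrained weights and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import math
import random


def featurize(tokens, n=2):
    """Return the tokens followed by their word n-grams (bigrams by default)."""
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return list(tokens) + ngrams


def build_vocab(corpus, n=2):
    """Map every word and n-gram in the transcripts to an integer ID (0 = unknown)."""
    vocab = {"<unk>": 0}
    for tokens in corpus:
        for feat in featurize(tokens, n):
            vocab.setdefault(feat, len(vocab))
    return vocab


def encode(tokens, vocab, n=2):
    """Index mapping: turn words and n-grams into the integer input sequence."""
    return [vocab.get(feat, 0) for feat in featurize(tokens, n)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class AveragedEmbeddingClassifier:
    """Input layer (embedding lookup), hidden layer (average), output layer (linear + softmax)."""

    def __init__(self, vocab_size, dim, num_classes, seed=0):
        rng = random.Random(seed)  # untrained, random weights: for shape illustration only
        self.embedding = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                          for _ in range(vocab_size)]
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(num_classes)]
        self.bias = [0.0] * num_classes

    def sentence_vector(self, ids):
        """Sum the embedding vectors and divide by their count (global average)."""
        dim = len(self.embedding[0])
        if not ids:
            return [0.0] * dim
        return [sum(self.embedding[i][d] for i in ids) / len(ids) for d in range(dim)]

    def predict_proba(self, ids):
        vec = self.sentence_vector(ids)
        logits = [sum(w * v for w, v in zip(row, vec)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(logits)


def supports_interruption(probs, interrupt_class=1, threshold=0.8):
    """True when the probability of the specified category exceeds the threshold."""
    return probs[interrupt_class] > threshold


corpus = [["wait", "a", "moment"], ["ok", "go", "on"]]
vocab = build_vocab(corpus)
model = AveragedEmbeddingClassifier(len(vocab), dim=8, num_classes=2)
probs = model.predict_proba(encode(["wait", "a", "moment"], vocab))
```

Because the weights here are random, `probs` carries no meaning; a trained model would supply the embedding matrix and output weights. Averaging the embeddings gives a fixed-length sentence vector regardless of utterance length, which keeps inference cost low on partial ASR results.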

Description

Outbound real-time interrupt processing method and device, computer equipment and storage medium

Technical Field

The invention relates to human-machine interaction methods, and in particular to an outbound real-time interrupt processing method, an outbound real-time interrupt processing device, computer equipment and a storage medium.

Background

In an outbound call robot system, the conventional practice is to recognize and analyze the user's voice input only after the robot has finished playing a preset script or after a period of silence. This mechanism usually buffers the user's voice, waits for the robot to finish playing the current script segment, and then sends the buffered audio data to the system's intent recognition module for processing.

This workflow exposes several technical problems in practice. When a user tries to interject or interrupt the robot, a system that cannot detect the behavior immediately and adjust accordingly keeps playing the scheduled script. This degrades the user experience and may also reduce the accuracy of subsequent intent recognition because critical information is missed. In some special cases, such as voicemail, voice assistant interaction or extension prompt audio, the system may fail to identify the situation accurately, so that call resources are occupied needlessly, or the system is even falsely triggered into an unrelated service logic branch. This can significantly reduce outbound efficiency and lead to confusion in the business process logic.

Therefore, it is necessary to design a new method that monitors the user's voice input in real time while the outbound robot is playing, accurately distinguishes the user's interruptions from unexpected hang-ups, and executes the corresponding policy actions according to the dialogue node.
Such a method should ensure high-precision intent recognition while providing a low-latency response, effectively reduce unnecessary false triggers, and thereby improve the interaction experience and operating efficiency of the whole system.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides an outbound real-time interrupt processing method, an outbound real-time interrupt processing device, computer equipment and a storage medium.

To achieve this purpose, the invention adopts the following technical scheme. The outbound real-time interrupt processing method comprises the following steps:
while a preset script is being played, monitoring an audio stream and converting it into a text stream in real time through VAD and ASR services;
inputting the processed text stream into a stream interruption model for node configuration analysis to obtain an analysis result;
judging whether the analysis result indicates that interruption is supported;
if the analysis result indicates that interruption is supported, stopping playback of the preset script and switching to a waiting/recognition mode when the confidence of the stream interruption model exceeds a threshold, until the VAD confirms that the user has finished speaking;
if the analysis result indicates that interruption is not supported, monitoring the audio stream without interrupting playback until the VAD confirms that the user has finished speaking;
acquiring the whole-segment dialogue content of the round;
inputting the whole-segment dialogue content of the round into a whole-segment hang-up model and a high-risk model for posterior judgment to obtain a posterior judgment result;
and performing corresponding processing according to the posterior judgment result, and temporarily storing the whole-segment dialogue content of the round.
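
The control flow of the steps above can be sketched roughly as follows. This is a hedged illustration only, not the patented implementation: it assumes the ASR service is seen as an iterable of `(text_segment, vad_end_of_speech)` pairs, treats "interruption supported" as a per-node flag combined with the model's confidence, and represents the three models as plain callables. The names `process_turn`, `TurnResult`, `node_allows_interrupt` and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class TurnResult:
    playback_stopped: bool  # True if the preset script was cut off
    transcript: str         # whole-segment dialogue content of the round
    hang_up: bool           # posterior judgment of the hang-up model
    high_risk: bool         # posterior judgment of the high-risk model


def process_turn(
    asr_segments: Iterable[Tuple[str, bool]],
    node_allows_interrupt: bool,
    interrupt_score: Callable[[str], float],
    hang_up_model: Callable[[str], bool],
    high_risk_model: Callable[[str], bool],
    threshold: float = 0.8,
) -> TurnResult:
    """Consume streaming ASR results for one round while the script is playing."""
    text_stream: List[str] = []
    playback_stopped = False
    for segment, speech_ended in asr_segments:
        if segment:
            text_stream.append(segment)  # append the new segment to the text stream
        partial = " ".join(text_stream)
        if node_allows_interrupt and not playback_stopped and partial:
            if interrupt_score(partial) > threshold:
                # Stop the script and switch to waiting/recognition mode.
                playback_stopped = True
        if speech_ended:  # VAD confirms that the user has finished speaking
            break
    transcript = " ".join(text_stream)
    # Posterior judgment on the whole segment, regardless of the interrupt branch.
    return TurnResult(playback_stopped, transcript,
                      hang_up_model(transcript), high_risk_model(transcript))


segments = [("wait", False), ("a moment", False), ("", True)]
result = process_turn(
    segments,
    node_allows_interrupt=True,
    interrupt_score=lambda text: 0.9 if "moment" in text else 0.1,
    hang_up_model=lambda text: "bye" in text,
    high_risk_model=lambda text: "complaint" in text,
)
```

In this sketch the streaming model only decides whether to stop playback, while the hang-up and high-risk checks run once on the full transcript after end of speech, mirroring the split between real-time and posterior judgment described above.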
In a further technical scheme, monitoring the audio stream and converting it into a text stream in real time through the VAD and ASR services while the preset script is being played comprises: while the preset script is being played, using the VAD service to monitor the audio stream in real time so as to determine the start and end time points of the user's speech; and using the ASR service to convert the audio signal into a text stream, wherein each time a new streaming result is received, the newly obtained text segment is appended to the text stream to update the text stream.

In a further technical scheme, the stream interruption model is obtained by training a text classification model on samples obtained by processing text streams labeled with whether to interrupt, and N-gram features are introduced during processing of the text streams to capture local word order information.

The whole-segment hang-up model is obtained by training a text classification model on samples obtained by processing text streams labeled with whether to hang up, and N-gram features are introduced in the process of processing the text st