
CN-121985073-A - Method for realizing asynchronous double-end communication in real-time voice communication system


Abstract

The invention discloses a method for realizing asynchronous double-ended communication in a real-time voice communication system. The method comprises a session initialization stage, a process in which the user terminal transmits user voice data, a process in which the server receives, processes and outputs the user voice data, and a session interrupt mechanism. By building the voice receiving and transmitting flow on thread concurrency and asynchronous queues, the method supports parallel processing of multi-user voice sessions, lets the user interrupt the interaction, and improves both the intelligence of the system and the fluency of human-machine conversation.
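The session initialization of Step S1 (a unique SessionID plus a per-session object STS holding a user voice queue and an answer voice queue) and the per-session routing of incoming voice blocks can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the class names `SessionServer` and `STS`, the use of `uuid.uuid4()` for the SessionID, and `queue.Queue` as the queue type are illustrative choices, and the interrupt manager and the producer/consumer threads are omitted so the sketch stays focused on how one STS per SessionID keeps several users' sessions independent.

```python
import queue
import threading
import uuid
from dataclasses import dataclass, field
from enum import Enum, auto


class SessionState(Enum):
    RUNNING = auto()             # session running state
    INTERRUPTING = auto()        # session interrupt state
    INTERRUPT_COMPLETE = auto()  # session interrupt completion state


@dataclass
class STS:
    """Hypothetical per-session object holding the two queues from Step S1."""
    session_id: str
    user_voice_queue: queue.Queue = field(default_factory=queue.Queue)
    answer_voice_queue: queue.Queue = field(default_factory=queue.Queue)
    state: SessionState = SessionState.RUNNING


class SessionServer:
    """Hypothetical registry: one STS per SessionID, so sessions stay independent."""

    def __init__(self) -> None:
        self._sessions: dict = {}
        self._lock = threading.Lock()  # guards the registry against concurrent requests

    def establish_session(self) -> str:
        # Step S1: create a unique SessionID and a fresh STS, then return the ID.
        session_id = uuid.uuid4().hex
        with self._lock:
            self._sessions[session_id] = STS(session_id)
        return session_id

    def receive_chunk(self, session_id: str, chunk: bytes) -> None:
        # Step S3.1: write each streamed block into that session's user voice queue.
        with self._lock:
            sts = self._sessions[session_id]
        sts.user_voice_queue.put(chunk)

    def session(self, session_id: str) -> STS:
        with self._lock:
            return self._sessions[session_id]
```

Because every SessionID maps to its own STS with its own queues, blocks streamed by one user never reach another user's producer, which is what allows the sessions to be processed in parallel.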

Inventors

  • LI XINGPING
  • XU KAIXIANG
  • LAI XIAOHANG

Assignees

  • 四三九九网络股份有限公司 (4399 Network Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-02-06

Claims (4)

  1. A method for implementing asynchronous double-ended communication in a real-time voice call system, comprising the following steps:
     Step S1, session initialization: when the server receives a session establishment request initiated by a user terminal, the server generates a unique session identifier SessionID and returns it to the user terminal. In the session object STS, the server creates a user voice queue, an answer voice queue, an interrupt manager, an answer producer and an answer consumer, the answer producer and the answer consumer each running in its own asynchronous thread. The server maintains a session state that takes one of three values: the session running state, the session interrupt state, and the session interrupt completion state.
     Step S2, the user terminal transmits the user voice data stream: after session initialization, once the user terminal first collects user voice data, it streams the data to the server in blocks.
     Step S3, the server receives, processes and outputs the user voice data:
     Step S3.1, as the server receives the blocks of user voice data streamed by the user terminal, it writes each block into the user voice queue in real time;
     Step S3.3, the answer producer continuously monitors the user voice queue and the session state; in the session running state, when it detects user voice data written to the user voice queue, it generates answer speech from all the user voice data in the user voice queue and writes the answer speech to the answer voice queue in blocks;
     Step S3.4, the answer consumer continuously monitors the answer voice queue and, whenever the queue is not empty, reads answer speech from it in real time and streams it to the user terminal.
     Step S4, interrupt mechanism: when, during the session, the server receives session interrupt information uploaded by the user terminal, the session interrupt mechanism is triggered, comprising:
     Step S4.1, the server passes the session interrupt information to the interrupt manager;
     Step S4.2, the interrupt manager updates the session state to the session interrupt state, empties the user voice queue at the same time, and keeps monitoring the session state;
     Step S4.3, the answer producer keeps monitoring the session state during answer production; when it detects the session interrupt state, it terminates the current answer production, updates the session state to the session interrupt completion state, and returns to step S3.3;
     Step S4.4, while monitoring the session state, the interrupt manager empties the answer voice queue once it detects the session interrupt completion state, and after the queue has been emptied it updates the session state to the session running state.
  2. The method according to claim 1, wherein in step S3.3 the answer producer generating answer speech from all the user voice data in the user voice queue comprises:
     Step S3.3.1, the answer producer continuously monitors the user voice queue and reads the pending user voice data from it in order;
     Step S3.3.2, the answer producer performs speech recognition on the read user voice data through a speech recognition module (ASR) to obtain user text;
     Step S3.3.3, the answer producer checks whether the user voice queue is empty; if not, it returns to step S3.3.1; if so, it proceeds to step S3.3.4;
     Step S3.3.4, the answer producer submits all of the recognized user text to an LLM to obtain answer text;
     Step S3.3.5, the answer producer converts the answer text into answer speech through a speech synthesis module (TTS).
  3. The method for implementing asynchronous double-ended communication in a real-time voice communication system according to claim 2, wherein in step S4.3, the answer producer monitoring the session state during answer production, terminating the current answer production upon detecting the session interrupt state, and updating the session state to the session interrupt completion state comprises:
     between steps S3.3.2 and S3.3.3, after the answer producer obtains the user text through the speech recognition module ASR, checking whether the session state is the session interrupt state and, if so, terminating the answer production; and
     between steps S3.3.4 and S3.3.5, after the answer producer obtains the answer text through the LLM, checking whether the session state is the session interrupt state and, if so, terminating the answer production.
  4. The method for implementing asynchronous double-ended communication in a real-time voice communication system according to claim 1, wherein the server receives the session interrupt information uploaded by the user terminal in one of the following ways:
     during the session, when the interrupt button on the user terminal is clicked, the user terminal is triggered to report a session interrupt signal to the server; or
     during the session, when the user terminal collects new user voice data, it treats the previous question-and-answer round as interrupted: it first reports a session interrupt signal to the server and then, after a set delay, streams the new user voice data to the server in blocks.
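The interplay between the answer producer, the answer consumer and the interrupt manager described in claims 1 to 3 can be illustrated with the Python sketch below. It is one possible reading of those claims, not the applicant's code. `asr`, `llm`, `tts` and `send` are hypothetical injected callables standing in for the ASR module, the LLM, the TTS module and the link back to the user terminal; the 10 ms polling interval is arbitrary; a lock serializes state transitions; and having an idle producer acknowledge an interrupt between rounds is an assumption, added so the interrupt manager never waits for a completion state that nobody would set.

```python
import queue
import threading
from enum import Enum, auto
from typing import Callable, List


class SessionState(Enum):
    RUNNING = auto()             # session running state
    INTERRUPTING = auto()        # session interrupt state
    INTERRUPT_COMPLETE = auto()  # session interrupt completion state


def drain(q: queue.Queue) -> None:
    while True:  # empty a queue without blocking
        try:
            q.get_nowait()
        except queue.Empty:
            return


class VoiceSession:
    """Hypothetical STS with answer producer, answer consumer and interrupt manager."""

    def __init__(self, asr: Callable[[bytes], str], llm: Callable[[List[str]], str],
                 tts: Callable[[str], List[bytes]], send: Callable[[bytes], None]) -> None:
        self.asr, self.llm, self.tts, self.send = asr, llm, tts, send
        self.user_voice_queue, self.answer_voice_queue = queue.Queue(), queue.Queue()
        self.state = SessionState.RUNNING
        self._lock = threading.Lock()  # serializes session-state transitions
        self._stop = threading.Event()
        self._threads: List[threading.Thread] = []

    def _ack_interrupt(self) -> bool:
        # Step S4.3: if an interrupt is pending, stop this round and report completion.
        with self._lock:
            if self.state is SessionState.INTERRUPTING:
                self.state = SessionState.INTERRUPT_COMPLETE
                return True
            return False

    def _answer_round(self, chunk: bytes) -> None:
        texts: List[str] = []
        while True:
            texts.append(self.asr(chunk))                   # S3.3.2: recognize one block
            if self._ack_interrupt():                       # checkpoint after ASR (claim 3)
                return
            try:
                chunk = self.user_voice_queue.get_nowait()  # S3.3.3: not empty, keep reading
            except queue.Empty:
                break                                       # empty: go on to S3.3.4
        answer_text = self.llm(texts)                       # S3.3.4
        if self._ack_interrupt():                           # checkpoint after LLM (claim 3)
            return
        for block in self.tts(answer_text):                 # S3.3.5, queued in blocks
            self.answer_voice_queue.put(block)

    def _producer(self) -> None:
        while not self._stop.is_set():
            # Assumption: an idle producer also acknowledges interrupts between rounds.
            if self._ack_interrupt() or self.state is not SessionState.RUNNING:
                self._stop.wait(0.01)
                continue
            try:
                chunk = self.user_voice_queue.get(timeout=0.01)  # S3.3.1
            except queue.Empty:
                continue
            self._answer_round(chunk)

    def _consumer(self) -> None:
        while not self._stop.is_set():
            try:
                block = self.answer_voice_queue.get(timeout=0.01)  # S3.4
            except queue.Empty:
                continue
            self.send(block)  # stream to the user terminal

    def interrupt(self) -> threading.Thread:
        with self._lock:                                    # S4.2: enter interrupt state
            self.state = SessionState.INTERRUPTING
        drain(self.user_voice_queue)                        # S4.2: empty user voice queue
        def watch() -> None:                                # S4.4: wait for completion state
            while not self._stop.is_set():
                with self._lock:
                    if self.state is SessionState.INTERRUPT_COMPLETE:
                        drain(self.answer_voice_queue)
                        self.state = SessionState.RUNNING
                        return
                self._stop.wait(0.01)
        watcher = threading.Thread(target=watch, daemon=True)
        watcher.start()
        return watcher

    def start(self) -> None:
        for target in (self._producer, self._consumer):
            self._threads.append(threading.Thread(target=target, daemon=True))
            self._threads[-1].start()

    def stop(self) -> None:
        self._stop.set()
        for t in self._threads:
            t.join()
```

With the two checkpoints placed after ASR and after the LLM call, an interrupt that arrives mid-generation discards the pending answer before any TTS audio is queued; an interrupt that arrives while earlier answer blocks are still waiting in the answer voice queue is handled by the interrupt manager emptying that queue before returning the session to the running state.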
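On the user-terminal side, the two interrupt triggers of claim 4 might look like the following Python sketch. `VoiceClient`, the `send_interrupt` and `send_chunk` callbacks, and the `answer_in_progress` flag are hypothetical names; the 0.2 s default for the set delay is only a placeholder, since the claim gives no value; and treating new speech as an interrupt only while an earlier answer is still in progress is an assumption about when a "previous question-and-answer round" exists.

```python
import time
from typing import Callable, Iterable


class VoiceClient:
    """Hypothetical user-terminal logic for the two interrupt triggers in claim 4."""

    def __init__(self, session_id: str, send_interrupt: Callable[[str], None],
                 send_chunk: Callable[[str, bytes], None], delay_s: float = 0.2) -> None:
        self.session_id = session_id
        self.send_interrupt = send_interrupt  # reports a session interrupt signal
        self.send_chunk = send_chunk          # streams one block of user voice data
        self.delay_s = delay_s                # the "set delay"; 0.2 s is a placeholder
        self.answer_in_progress = False       # assumption: True while an answer is playing

    def on_interrupt_button(self) -> None:
        # Trigger 1: the interrupt button is clicked.
        self.send_interrupt(self.session_id)

    def on_new_speech(self, chunks: Iterable[bytes],
                      sleep: Callable[[float], None] = time.sleep) -> None:
        # Trigger 2: new speech arrives while the previous round is still in progress.
        if self.answer_in_progress:
            self.send_interrupt(self.session_id)  # report the interrupt first
            self.answer_in_progress = False
            sleep(self.delay_s)                   # then wait for the set delay
        for chunk in chunks:                      # finally stream the new speech in blocks
            self.send_chunk(self.session_id, chunk)
```

The delay gives the server time to pass the interrupt through the interrupt manager and clear both queues before the new user voice blocks start arriving.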

Description

Technical Field

The invention relates to the field of communication technology, and in particular to a method for realizing asynchronous double-ended communication in a real-time voice communication system.

Background

In a conventional voice call system, double-ended communication is generally implemented as follows:
A. The user records speech with the recording device on the user side, and the whole voice file is then sent to the server.
B. The server recognizes the text of the received voice file through a speech recognition module (ASR) and uses it as the user question text.
C. The server analyzes the user question text through an answer module (usually based on a dialogue system or a large language model) to generate an answer text.
D. Finally, the server converts the answer text into answer speech through a speech synthesis module (TTS) and pushes it back to the user side for playback.
This process is typically carried out synchronously and serially: the server must finish generating the current round's answer speech and transmitting it to the client before it starts processing a new round of user speech. The entire process flow is shown in Fig. 2.
However, this conventional architecture has the following limitations:
1. When the user speaks continuously for a long time, the server has to wait until the user has finished before it can start processing the complete voice file, which delays the response and reduces interaction efficiency.
2. The system has no interrupt mechanism, so a user who has already understood the answer cannot cut off the spoken answer, which is at odds with everyday conversational habits.
3. Computing resources are wasted: during a response, the system keeps generating and transmitting the current round's reply speech even if the user has started speaking again.
This not only breaks up the human-machine interaction and seriously affects the fluency of the conversation, but also wastes computing resources.

Disclosure of Invention

In view of the above defects in the prior art, the invention provides a method for realizing asynchronous double-ended communication in a real-time voice communication system that effectively solves these problems.
The technical solution adopted by the invention is as follows:
The invention provides a method for realizing asynchronous double-ended communication in a real-time voice communication system, comprising the following steps:
Step S1, session initialization: when the server receives a session establishment request initiated by a user terminal, the server generates a unique session identifier SessionID and returns it to the user terminal. In the session object STS, the server creates a user voice queue, an answer voice queue, an interrupt manager, an answer producer and an answer consumer, the answer producer and the answer consumer each running in its own asynchronous thread. The server maintains a session state that takes one of three values: the session running state, the session interrupt state, and the session interrupt completion state.
Step S2, the user terminal transmits the user voice data stream: after session initialization, once the user terminal first collects user voice data, it streams the data to the server in blocks.
Step S3, the server receives, processes and outputs the user voice data:
Step S3.1, as the server receives the blocks of user voice data streamed by the user terminal, it writes each block into the user voice queue in real time;
Step S3.3, the answer producer continuously monitors the user voice queue and the session state; in the session running state, when it detects user voice data written to the user voice queue, it generates answer speech from all the user voice data in the user voice queue and writes the answer speech to the answer voice queue in blocks;
Step S3.4, the answer consumer continuously monitors the answer voice queue and, whenever the queue is not empty, reads answer speech from it in real time and streams it to the user terminal.
Step S4, interrupt mechanism: when, during the session, the server receives session interrupt information uploaded by the user terminal, the session interrupt mechanism is triggered, comprising:
Step S4.1, the server passes the session interrupt information to the interrupt manager;
Step S4.2, the interrupt manager updates the session state to the session interrupt state and simultaneously empties the u