
CN-122027609-A - Low-delay voice communication method based on end-to-end large model


Abstract

The application relates to the technical field of voice communication and discloses a low-delay voice communication method based on an end-to-end large model. A sending end collects a voice waveform, constructs current input features, and inputs them into a streaming voice communication model to obtain temporary tokens and a stability result. The sending end determines a current confirmation anchor point by combining the historical inference record, generates correction information when the confirmation anchor point moves forward, and transmits a fast packet and a confirmation packet. A receiving end generates to-be-played audio segments from the fast packet and records the mapping relation; according to the confirmation packet, it confirms tokens, replaces temporary tokens, and reconstructs the affected segments; it outputs the audio at the planned playing time. The method balances the real-time performance and the output stability of voice communication.

Inventors

  • ZHAO JUN
  • ZHAO YING

Assignees

  • 杭州星语人工智能有限公司

Dates

Publication Date
20260512
Application Date
20260415

Claims (10)

  1. A low-delay voice communication method based on an end-to-end large model, characterized by comprising the following steps: collecting a voice waveform at a sending end, constructing current input features, and inputting them into a streaming voice communication model to obtain temporary tokens and a stability result; determining a current confirmation anchor point according to multiple inference results in a historical inference record and the stability result, and, when the current confirmation anchor point moves forward, generating correction information according to the difference between the temporary tokens obtained by the current inference and the already-sent temporary tokens in an unconfirmed interval; sending a fast packet, and sending a confirmation packet when the current confirmation anchor point moves forward, wherein the fast packet carries newly added temporary tokens and the confirmation packet carries the current confirmation anchor point and the correction information; after receiving the fast packet, a receiving end writes the newly added temporary tokens into a temporary token buffer, generates a to-be-played audio segment from the newly added temporary tokens that have not yet been synthesized, writes the to-be-played audio segment into a to-be-played audio queue, and records a mapping relation between the token interval corresponding to the newly added temporary tokens and an audio interval; after receiving the confirmation packet, the receiving end writes the temporary tokens up to the current confirmation anchor point into a confirmed token buffer, replaces the corresponding temporary tokens in the temporary token buffer according to the correction information, and reconstructs the affected to-be-played audio segments according to the mapping relation; and when a to-be-played audio segment in the to-be-played audio queue reaches its planned playing time, the receiving end outputs that to-be-played audio segment.
  2. The low-delay voice communication method based on an end-to-end large model according to claim 1, wherein constructing the current input features comprises: framing the voice waveform and extracting features to obtain feature frames arranged in time order; extracting the feature frames corresponding to the current processing block from a feature buffer; and concatenating the feature frames corresponding to the right context to obtain the current input features.
  3. The low-delay voice communication method based on an end-to-end large model according to claim 2, wherein the streaming voice communication model comprises, in order, a time downsampling module, a streaming encoding module, a token projection module, and a stability estimation module; the time downsampling module receives the current input features and outputs a compressed feature sequence; the streaming encoding module receives the compressed feature sequence and outputs a high-level encoded representation; the token projection module outputs the temporary tokens according to the high-level encoded representation; and the stability estimation module outputs the stability result according to the high-level encoded representation and the temporary tokens.
  4. The low-delay voice communication method based on an end-to-end large model according to claim 1, wherein determining the current confirmation anchor point comprises: extracting, from the historical inference record, the multiple inference results corresponding to each temporary token to obtain a consistency result for each temporary token; obtaining a combined stability result for each temporary token according to its consistency result and its corresponding stability result; starting from the temporary token following the last confirmation anchor point, determining, in token order, the prefix interval that continuously satisfies a confirmation condition; and taking the end position of the prefix interval as the current confirmation anchor point.
  5. The low-delay voice communication method based on an end-to-end large model according to claim 4, wherein the consistency result is the ratio of the number of inferences, among the most recent multiple inference results, that produced the same token as the current temporary token to the total number of those most recent inferences; the combined stability result is obtained by combining the consistency result and the stability result according to preset weights; and the confirmation condition comprises the combined stability result being not lower than a confirmation threshold.
  6. The low-delay voice communication method based on an end-to-end large model according to claim 1, wherein the fast packet comprises a session identifier, a packet type, a packet sequence number, a timestamp, a token start index, a token end index, and the newly added temporary tokens; the confirmation packet comprises the session identifier, the packet type, the packet sequence number, the timestamp, the current confirmation anchor point, and the correction information; and the correction information is obtained by comparing, at each token position in the unconfirmed interval, the temporary token obtained by the current inference with the already-sent temporary token, and recording each differing position together with its replacement token.
  7. The low-delay voice communication method based on an end-to-end large model according to claim 1, wherein, when the to-be-played audio segment is written into the to-be-played audio queue, a planned playing time is set for the to-be-played audio segment; the planned playing time is determined by the enqueue time of the to-be-played audio segment and a playback hold time; and the playback hold time is determined according to a network jitter upper bound, a maximum arrival delay of the confirmation packet, and a time reserved for reconstructing to-be-played audio segments.
  8. The method of claim 7, wherein the mapping relation comprises a token interval, an audio interval, and a state identifier; and wherein, after replacing the corresponding temporary tokens in the temporary token buffer according to the correction information, the receiving end uses the mapping relation to find each to-be-played audio segment that intersects the replaced temporary tokens and has not yet reached its planned playing time, deletes that to-be-played audio segment, generates a replacement to-be-played audio segment from the replaced temporary tokens, and writes the replacement to-be-played audio segment back into the to-be-played audio queue.
  9. The method of claim 1, wherein generating the to-be-played audio segment from the newly added temporary tokens that have not yet been synthesized comprises: inputting the newly added temporary tokens into a token embedding layer to obtain a sequence of continuous vectors; inputting the sequence of continuous vectors into a causal spectrum decoder to obtain a sequence of spectrum frames; and inputting the sequence of spectrum frames into a vocoder to obtain the to-be-played audio segment.
  10. A low-delay voice communication system based on an end-to-end large model, comprising a sending end and a receiving end; the sending end is configured to collect a voice waveform at the sending end, construct current input features, and input them into a streaming voice communication model to obtain temporary tokens and a stability result; the sending end is further configured to determine a current confirmation anchor point according to multiple inference results in a historical inference record and the stability result, and, when the current confirmation anchor point moves forward, to generate correction information according to the difference between the temporary tokens obtained by the current inference and the already-sent temporary tokens in an unconfirmed interval; the sending end is further configured to send a fast packet, and to send a confirmation packet when the current confirmation anchor point moves forward, wherein the fast packet carries newly added temporary tokens and the confirmation packet carries the current confirmation anchor point and the correction information; the receiving end is configured, after receiving the fast packet, to write the newly added temporary tokens into a temporary token buffer, generate a to-be-played audio segment from the newly added temporary tokens that have not yet been synthesized, write the to-be-played audio segment into a to-be-played audio queue, and record a mapping relation between the token interval corresponding to the newly added temporary tokens and an audio interval; the receiving end is further configured, after receiving the confirmation packet, to write the temporary tokens up to the current confirmation anchor point into a confirmed token buffer, replace the corresponding temporary tokens in the temporary token buffer according to the correction information, and reconstruct the affected to-be-played audio segments according to the mapping relation; and the receiving end is further configured to output the corresponding to-be-played audio segment when a to-be-played audio segment in the to-be-played audio queue reaches its planned playing time.
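
The sender-side logic of claims 4 to 6 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names TokenStats, advance_anchor and build_corrections are hypothetical, the combined stability result is assumed to be a weighted sum (the claims say only "preset weights"), the anchor is assumed to be the index one past the last confirmed token, and the default weight and threshold values are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    current: int        # token produced at this position by the current inference
    recent: list[int]   # tokens produced at this position by the most recent inferences
    stability: float    # model stability result, assumed here to lie in [0, 1]


def consistency(stats: TokenStats) -> float:
    """Share of recent inferences that agree with the current token (claim 5)."""
    if not stats.recent:
        return 0.0
    same = sum(1 for tok in stats.recent if tok == stats.current)
    return same / len(stats.recent)


def combined_stability(stats: TokenStats, weight: float = 0.5) -> float:
    """Assumed combination rule: weighted sum of consistency and stability."""
    return weight * consistency(stats) + (1.0 - weight) * stats.stability


def advance_anchor(positions: list[TokenStats], last_anchor: int,
                   threshold: float = 0.8, weight: float = 0.5) -> int:
    """Extend the anchor over the contiguous prefix that meets the threshold.

    The scan stops at the first token that fails, so a stable token located
    after an unstable one is not confirmed yet (claim 4).
    """
    anchor = last_anchor
    while (anchor < len(positions)
           and combined_stability(positions[anchor], weight) >= threshold):
        anchor += 1
    return anchor


def build_corrections(sent: list[int], current: list[int],
                      last_anchor: int) -> dict[int, int]:
    """Record each differing position and its replacement token (claim 6)."""
    end = min(len(sent), len(current))
    return {i: current[i] for i in range(last_anchor, end) if sent[i] != current[i]}
```

In this sketch a confirmation packet would be sent only when advance_anchor returns a value larger than last_anchor, and it would carry that value together with the dictionary returned by build_corrections.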

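The receiver-side behaviour of claims 1, 7 and 8 can be sketched in the same hedged way. The Receiver and Segment classes and the synthesize placeholder are hypothetical. The hold time is assumed to be the plain sum of the three quantities named in claim 7, times are integer milliseconds, and the anchor is assumed to be an exclusive token index. Two simplifications are made: corrections are applied before tokens are copied to the confirmed buffer, and a rebuilt segment replaces the old one in place with its original planned playing time, whereas claim 8 describes deleting the segment and writing a new one back.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    token_start: int        # inclusive token index
    token_end: int          # exclusive token index
    play_at: int            # planned playing time in milliseconds
    audio: bytes            # synthesized audio for this token interval
    state: str = "pending"  # state identifier: "pending" or "rebuilt"


def hold_time(jitter_bound: int, max_confirm_delay: int, rebuild_reserve: int) -> int:
    """Assumed rule: the hold time is the sum of the three bounds in claim 7."""
    return jitter_bound + max_confirm_delay + rebuild_reserve


def synthesize(tokens: list[int]) -> bytes:
    """Placeholder for token embedding, causal spectrum decoder and vocoder."""
    return bytes(tok % 256 for tok in tokens)


class Receiver:
    def __init__(self, hold: int) -> None:
        self.hold = hold
        self.temp_tokens: dict[int, int] = {}  # temporary token buffer
        self.confirmed: dict[int, int] = {}    # confirmed token buffer
        self.queue: list[Segment] = []         # to-be-played audio queue

    def on_fast_packet(self, start: int, tokens: list[int], now: int) -> None:
        """Buffer new tokens, synthesize them and schedule the segment."""
        for offset, tok in enumerate(tokens):
            self.temp_tokens[start + offset] = tok
        self.queue.append(Segment(start, start + len(tokens),
                                  now + self.hold, synthesize(tokens)))

    def on_confirm_packet(self, anchor: int, corrections: dict[int, int],
                          now: int) -> None:
        """Apply corrections, rebuild affected unplayed segments, confirm."""
        self.temp_tokens.update(corrections)
        for seg in self.queue:
            hit = any(seg.token_start <= i < seg.token_end for i in corrections)
            if hit and now < seg.play_at:
                toks = [self.temp_tokens[i]
                        for i in range(seg.token_start, seg.token_end)]
                seg.audio = synthesize(toks)
                seg.state = "rebuilt"
        for i in range(anchor):
            if i in self.temp_tokens:
                self.confirmed[i] = self.temp_tokens[i]

    def pop_due(self, now: int) -> list[Segment]:
        """Return and remove the segments whose planned playing time has arrived."""
        due = [s for s in self.queue if s.play_at <= now]
        self.queue = [s for s in self.queue if s.play_at > now]
        return due
```

The hold time is what leaves room for a correction to arrive: a segment enqueued at time 0 with a 120 ms hold can still be rebuilt by a confirmation packet that arrives at 50 ms, while a segment that has already reached its planned playing time is left untouched.
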
Description

Low-delay voice communication method based on end-to-end large model

Technical Field

The application relates to the technical field of voice communication, in particular to a low-delay voice communication method based on an end-to-end large model.

Background

With the development of applications such as two-way voice calls, multi-party conferences, real-time voice translation, and voice assistant call links, a voice communication system must ensure the continuity and intelligibility of voice transmission while reducing end-to-end delay as far as possible to meet real-time interaction requirements.

In existing voice communication technology, one class of schemes focuses on the traditional voice coding, transmission, and playback pipeline. Although these implementations are relatively mature, when streaming voice processing must be combined with an end-to-end large model, it is often difficult to achieve both low-delay output and stable results. Another class of schemes can perform streaming inference on the input voice and generate intermediate results early. However, as the model continues to receive subsequent context, the inference result at an earlier position may change. If the receiving end synthesizes and plays voice directly from the intermediate results, problems such as inconsistent played content, local voice distortion, semantic jumps, or an excessively large range of repeated correction easily occur. Conversely, if the receiving end waits for more context, or even until the whole utterance is stable, before performing unified synthesis and playback, the playback waiting time increases significantly and the real-time performance of voice communication suffers.

In actual network transmission, packet reordering, network jitter, and delayed arrival of confirmation information also occur, which further increases the difficulty of buffer management, segment replacement, and playback control at the receiving end. Therefore, how to effectively manage inference results that are not yet stable during streaming voice communication, accurately correct the affected voice segments while maintaining low communication delay, and reduce unnecessary wholesale reconstruction and playback interruption has become a technical problem to be solved.

Disclosure of Invention

The application aims to provide a low-delay voice communication method and system based on an end-to-end large model, so as to solve the problems described in the Background.

According to one aspect of the present application, there is provided a low-delay voice communication method based on an end-to-end large model, comprising the following steps:

collecting a voice waveform at a sending end, constructing current input features, and inputting them into a streaming voice communication model to obtain temporary tokens and a stability result;

determining a current confirmation anchor point according to multiple inference results in a historical inference record and the stability result, and, when the current confirmation anchor point moves forward, generating correction information according to the difference between the temporary tokens obtained by the current inference and the already-sent temporary tokens in an unconfirmed interval;

sending a fast packet, and sending a confirmation packet when the current confirmation anchor point moves forward, wherein the fast packet carries newly added temporary tokens and the confirmation packet carries the current confirmation anchor point and the correction information;

after receiving the fast packet, a receiving end writes the newly added temporary tokens into a temporary token buffer, generates a to-be-played audio segment from the newly added temporary tokens that have not yet been synthesized, writes the to-be-played audio segment into a to-be-played audio queue, and records a mapping relation between the token interval corresponding to the newly added temporary tokens and an audio interval;

after receiving the confirmation packet, the receiving end writes the temporary tokens up to the current confirmation anchor point into a confirmed token buffer, replaces the corresponding temporary tokens in the temporary token buffer according to the correction information, and reconstructs the affected to-be-played audio segments according to the mapping relation; and

when a to-be-played audio segment in the to-be-played audio queue reaches its planned playing time, the receiving end outputs that to-be-played audio segment.

Preferably, constructing the current input features comprises: framing the voice waveform and extracting features to obtain feature frames arranged in time order, extracting the feature frames corresponding to a current processing block from a featu