KR-102963165-B1 - Audio synthesis for synchronous communication

Abstract

A computer-implemented method comprises the step of receiving a first audio stream of a performance associated with a first client device. The method further comprises, during a time window of the performance, wherein the time window is shorter than the total time of the performance: generating a synthesized first audio stream that predicts the future of the performance based on audio features of the first audio stream; and mixing the synthesized first audio stream with a second audio stream associated with a second client device to form a combined audio stream that synchronizes the synthesized first audio stream and the second audio stream, wherein the time window is advanced and the generating step and the mixing step are repeated until the performance is completed.
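The windowed loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `naive_predict`, `mix`, and `run_performance` are hypothetical, audio is modeled as plain lists of float samples, the predictor simply repeats the most recent samples as a stand-in for whatever feature-based model an embodiment would use, and mixing is a plain per-sample average.

```python
from typing import Callable, List, Sequence

Samples = List[float]


def naive_predict(history: Sequence[float], n: int) -> Samples:
    """Hypothetical stand-in predictor: repeat the most recent samples.

    A real embodiment would predict from audio features (pitch, speed,
    phase); repeating the tail is only a placeholder.
    """
    if not history or n == 0:
        return [0.0] * n
    tail = list(history[-n:])
    # Cycle through the tail when fewer than n samples of history exist.
    return [tail[i % len(tail)] for i in range(n)]


def mix(a: Sequence[float], b: Sequence[float]) -> Samples:
    """Mix two equally long windows by averaging sample pairs."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def run_performance(first: Sequence[float],
                    second: Sequence[float],
                    window: int,
                    predict: Callable[[Sequence[float], int], Samples] = naive_predict) -> Samples:
    """Advance a window over the performance, predicting and mixing each time."""
    if window <= 0:
        raise ValueError("window must be positive")
    combined: Samples = []
    total = min(len(first), len(second))
    start = 0
    while start < total:  # repeat until the performance is completed
        end = min(start + window, total)
        history = first[:start]  # only audio received before this window
        synthesized = predict(history, end - start)
        combined.extend(mix(synthesized, second[start:end]))
        start = end  # advance the time window
    return combined
```

For example, `run_performance([1.0] * 4, [0.0] * 4, window=2)` predicts silence for the first window (no history yet) and repeats the received samples for the second, so the window is always shorter than the whole performance and each pass uses only audio that has already arrived.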

Inventors

난드와나, 마헤쉬 쿠마르
바트, 키란
맥과이어, 모건

Assignees

로브록스 코포레이션

Dates

Publication Date: 2026-05-11
Application Date: 2023-10-02
Priority Date: 2022-10-04

Claims (20)

1. A computer-implemented method comprising: receiving a first audio stream of a performance associated with a first client device; and during a time window of the performance, wherein the time window is shorter than a total time of the performance: generating a synthesized first audio stream that predicts a future of the performance based on audio features of the first audio stream; and mixing the synthesized first audio stream with a second audio stream associated with a second client device to form a combined audio stream that synchronizes the synthesized first audio stream and the second audio stream; wherein the time window is advanced, and the generating and the mixing are repeated, until the performance is completed.
2. The method of claim 1, further comprising: in response to receiving the first audio stream, determining a performance identifier for the performance associated with the first audio stream; and receiving reference audio based on the performance identifier.
3. The method of claim 2, wherein generating the synthesized first audio stream includes determining a time offset between the first audio stream and the reference audio, wherein the time offset occurs when the first audio stream has a different starting point from the reference audio, and wherein generating the synthesized first audio stream is further based on the time offset.
4. The method of claim 2, wherein generating the synthesized first audio stream includes determining a speed of the first audio stream compared with a speed of the reference audio, and wherein generating the synthesized first audio stream is further based on the speed of the first audio stream compared with the speed of the reference audio.
5. The method of claim 1, wherein the audio features of the first audio stream are selected from a group of pitch, speed, phase, or a combination thereof.
6. The method of claim 1, wherein the audio features of the first audio stream comprise one or more speaker identifiers detected in the first audio stream.
7. The method of claim 1, further comprising: determining whether a time difference between the first audio stream and the second audio stream exceeds a threshold time difference; and generating graphic data for displaying a user interface that includes user guidance for the performance and a movement indicator that prompts a performer associated with the second client device to perform in a manner that reduces the time difference between the first audio stream and the second audio stream.
8. The method of claim 1, further comprising modifying the combined audio stream to match the acoustics of an environment in which the second client device is located.
9. The method of claim 1, wherein generating the synthesized first audio stream comprises: identifying whether a portion of the performance was skipped in the first audio stream; and synthesizing the first audio stream to modify the skipped portion of the performance.
10. The method of claim 1, further comprising synchronizing the combined audio stream to match movements of a graphically displayed performer.
11. A device comprising: a processor; and a memory coupled to the processor and having instructions stored thereon that, when executed by the processor, cause the processor to perform operations comprising: receiving a first audio stream of a performance associated with a first client device; and during a time window of the performance, wherein the time window is shorter than a total time of the performance: generating a synthesized first audio stream that predicts a future of the performance based on audio features of the first audio stream; and mixing the synthesized first audio stream with a second audio stream associated with a second client device to form a combined audio stream that synchronizes the synthesized first audio stream and the second audio stream; wherein the time window is advanced, and the generating and the mixing are repeated, until the performance is completed.
12. The device of claim 11, wherein the operations further comprise: in response to receiving the first audio stream, determining a performance identifier for the performance associated with the first audio stream; and receiving reference audio based on the performance identifier.
13. The device of claim 12, wherein generating the synthesized first audio stream includes determining a time offset between the first audio stream and the reference audio, wherein the time offset occurs when the first audio stream has a different starting point from the reference audio, and wherein generating the synthesized first audio stream is further based on the time offset.
14. The device of claim 12, wherein generating the synthesized first audio stream includes determining a speed of the first audio stream compared with a speed of the reference audio, and wherein generating the synthesized first audio stream is further based on the speed of the first audio stream compared with the speed of the reference audio.
15. The device of claim 11, wherein the audio features of the first audio stream are selected from a group of pitch, speed, phase, or a combination thereof.
16. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed by one or more computers, cause the one or more computers to perform operations comprising: receiving a first audio stream of a performance associated with a first client device; and during a time window of the performance, wherein the time window is shorter than a total time of the performance: generating a synthesized first audio stream that predicts a future of the performance based on audio features of the first audio stream; and mixing the synthesized first audio stream with a second audio stream associated with a second client device to form a combined audio stream that synchronizes the synthesized first audio stream and the second audio stream; wherein the time window is advanced, and the generating and the mixing are repeated, until the performance is completed.
17. The non-transitory computer-readable storage medium of claim 16, wherein the operations further comprise: in response to receiving the first audio stream, determining a performance identifier for the performance associated with the first audio stream; and receiving reference audio based on the performance identifier.
18. The non-transitory computer-readable storage medium of claim 17, wherein generating the synthesized first audio stream includes determining a time offset between the first audio stream and the reference audio, wherein the time offset occurs when the first audio stream has a different starting point from the reference audio, and wherein generating the synthesized first audio stream is further based on the time offset.
19. The non-transitory computer-readable storage medium of claim 17, wherein generating the synthesized first audio stream includes determining a speed of the first audio stream compared with a speed of the reference audio, and wherein generating the synthesized first audio stream is further based on the speed of the first audio stream compared with the speed of the reference audio.
20. The non-transitory computer-readable storage medium of claim 16, wherein the audio features of the first audio stream are selected from a group of pitch, speed, phase, or a combination thereof.
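
The threshold check recited in the claims above (a time difference between the two streams that triggers a movement indicator for the second performer) might look like the following sketch. The function name `movement_cue`, the cue strings, and the use of playback positions in seconds are all assumptions made for illustration; the source does not specify how the time difference is measured or how the indicator is rendered.

```python
from typing import Optional


def movement_cue(first_position_s: float,
                 second_position_s: float,
                 threshold_s: float) -> Optional[str]:
    """Return a hypothetical cue for the second performer, or None if in sync.

    Positions are how far (in seconds) each stream has progressed through
    the performance. A cue is produced only when the time difference
    exceeds the threshold time difference.
    """
    if threshold_s < 0:
        raise ValueError("threshold_s must be non-negative")
    difference = second_position_s - first_position_s
    if abs(difference) <= threshold_s:
        return None  # within tolerance: no indicator needed
    # Prompt the second performer in the direction that shrinks the gap.
    return "slow down" if difference > 0 else "speed up"
```

A user interface layer could map the returned cue to graphic data, for example an arrow or a pulsing marker next to the user guidance for the performance.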

Description

Audio synthesis for synchronous communication

This application is an international application that claims the benefit of priority under 35 U.S.C. §119(e) to U.S. Patent Application No. 17/959,736, filed on October 4, 2022 and titled Synthesizing Audio for Synchronous Communication, the entire contents of which are incorporated herein by reference.

It is impossible for multiple people who are connected by a computer network but physically located in different places to perform music, chants, or conversations synchronously. This is because the performance of person P0, as observed by person P1, is always in the past due to the transmission delay of the computer network. If each of the n-1 performers P1, P2, P3, … P(n-1) is delayed by exactly the appropriate amount of time relative to performer P0, then P0 will observe all the others as synchronized with one another, but P0 cannot synchronize with any of them.

Delay can be caused by one or more of at least three types of latency: network latency, input latency, and processing latency. Network latency occurs when network transmission is slowed by processing delays at the physical equipment or nodes used in the network. Input latency occurs when a user receives a delayed response or uses a client device with limited functionality. Processing latency can occur when a delay is intentionally introduced to perform modulation analysis.

The background description provided herein is intended to present the context of the present disclosure. Aspects of the description that may not otherwise qualify as prior art at the time of filing, as well as the work of the presently named inventors to the extent described in this background section, are neither expressly nor impliedly admitted as prior art against the present disclosure.

The embodiments generally relate to a system and method for synthesizing audio for synchronous communication.
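
To make the three latency sources in the background concrete, the sketch below adds them into a single delay and converts that delay into a number of audio samples. Treating the sum as the amount of future audio a predictor would need to cover is an assumption made here for illustration, not something the source states; the class name `LatencyBudget`, its fields, and the example figures are likewise hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Hypothetical container for the three latency types, in milliseconds."""
    network_ms: float
    input_ms: float
    processing_ms: float

    def total_ms(self) -> float:
        """Total one-way delay as the simple sum of the three sources."""
        return self.network_ms + self.input_ms + self.processing_ms

    def lookahead_samples(self, sample_rate_hz: int) -> int:
        """Samples of future audio needed to cover the delay (assumed model)."""
        return math.ceil(self.total_ms() * sample_rate_hz / 1000.0)


# Illustrative numbers only: 40 ms network, 10 ms input, 10 ms processing.
budget = LatencyBudget(network_ms=40.0, input_ms=10.0, processing_ms=10.0)
print(budget.total_ms(), budget.lookahead_samples(48_000))
```

Under this simple model, a larger delay means the prediction must reach further ahead, which is why the delay sources are worth separating and measuring individually.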
According to one aspect, a computer-implemented method comprises the step of receiving a first audio stream of a performance associated with a first client device. The method further comprises the step of generating a synthesized first audio stream that predicts the future of the performance based on audio features of the first audio stream during a time window of the performance, wherein the time window is shorter than the total time of the performance; and the step of mixing the synthesized first audio stream and a second audio stream associated with a second client device to form a combined audio stream that synchronizes the synthesized first audio stream and the second audio stream, wherein the time window is advanced and the steps of generating and mixing are repeated until the performance is completed. In some embodiments, the method further comprises, in response to the step of receiving a first audio stream, the step of determining a performance identifier for a performance associated with the first audio stream and the step of receiving a reference audio based on the performance identifier. In some embodiments, the step of generating a synthesized first audio stream includes the step of determining a time offset between the first audio stream and the reference audio, the time offset occurs when the first audio stream has a different starting point from the reference audio, and the step of generating a synthesized first audio stream is further based on the time offset. In some embodiments, the step of generating a synthesized first audio stream includes the step of determining the speed of the first audio stream compared to the speed of the reference audio, and the step of generating a synthesized first audio stream is further based on the speed of the first audio stream compared to the speed of the reference audio. In some embodiments, the audio features of the first audio stream are selected from a group of pitch, speed, phase, or a combination thereof. 
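The time offset and relative speed mentioned above could be estimated in many ways; the source does not name a technique. As one hedged possibility, the sketch below uses brute-force cross-correlation for the offset and a ratio of elapsed spans between matched events (for example, note onsets) for the speed. The function names `estimate_offset` and `speed_ratio`, and the assumption that matched event times are already available, are illustrative only.

```python
from typing import Sequence


def estimate_offset(stream: Sequence[float],
                    reference: Sequence[float],
                    max_lag: int) -> int:
    """Return the lag (in samples) that best aligns stream[i] with reference[i + lag].

    Brute-force cross-correlation over lags in [-max_lag, max_lag].
    """
    best_lag = 0
    best_score = float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(stream):
            j = i + lag
            if 0 <= j < len(reference):
                score += sample * reference[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def speed_ratio(stream_event_times: Sequence[float],
                reference_event_times: Sequence[float]) -> float:
    """Return performer speed relative to the reference (above 1.0 means faster).

    Both sequences hold times, in seconds, of the same matched events.
    """
    if len(stream_event_times) != len(reference_event_times):
        raise ValueError("event lists must have the same length")
    if len(stream_event_times) < 2:
        raise ValueError("need at least two matched events")
    stream_span = stream_event_times[-1] - stream_event_times[0]
    reference_span = reference_event_times[-1] - reference_event_times[0]
    if stream_span <= 0:
        raise ValueError("stream events must span a positive duration")
    # Covering the same events in less time means a higher speed.
    return reference_span / stream_span
```

A positive lag from `estimate_offset` indicates that the first audio stream started that many samples into the reference audio, which is the different-starting-point case described above. Either value could then feed the step that generates the synthesized first audio stream.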
In some embodiments, the audio features of the first audio stream include one or more speaker identifiers detected in the first audio stream. In some embodiments, the method further comprises the steps of determining whether the time difference between a first audio stream and a second audio stream exceeds a threshold time difference, and generating graphic data for displaying a user interface that includes user guidance on the performance and a movement indicator that prompts a performer associated with a second client device to perform in a manner that reduces the time difference between the first audio stream and the second audio stream. In some embodiments, the method further comprises the step of modifying the combined audio stream to match the acoustics of the environment in which the second client device is located. In some embodiments, the step of generating a synthesized first audio stream includes the step of identifying that a portion of the performance has been skipped in the first audio stream and the step of synthesizing the first audio strea