
CN-115668765-B - Automatic mixing of audio descriptions

CN 115668765 B

Abstract

A computer-implemented audio processing method includes receiving audio object data and audio description data, where the audio object data includes a first plurality of audio objects. The method calculates a long-term loudness of the audio object data and of the audio description data, and a plurality of short-term loudness values of the audio object data and of the audio description data. The method reads a first plurality of mixing parameters corresponding to the audio object data and generates a second plurality of mixing parameters based on the first plurality of mixing parameters and the computed long-term and short-term loudness values. The method generates a gain adjustment visualization corresponding to the second plurality of mixing parameters, the audio object data, and the audio description data, and generates mixed audio object data by mixing the audio object data and the audio description data according to the second plurality of mixing parameters. The mixed audio object data includes a second plurality of audio objects corresponding to the first plurality of audio objects mixed with the audio description data according to the second plurality of mixing parameters.

Inventors

  • D. World Wide
  • S. Panki

Assignees

  • Dolby Laboratories Licensing Corporation

Dates

Publication Date
2026-05-12
Application Date
2021-04-12
Priority Date
2020-04-13

Claims (18)

  1. A computer-implemented audio processing method, the method comprising: receiving audio object data and audio description data, wherein the audio object data comprises a first plurality of audio objects; calculating a long-term loudness of the audio object data and a long-term loudness of the audio description data; calculating a plurality of short-term loudness values of the audio object data and a plurality of short-term loudness values of the audio description data; reading a first plurality of mixing parameters corresponding to the audio object data, wherein the first plurality of mixing parameters comprises a look-ahead parameter, a ramp parameter, and a maximum delta parameter; generating a second plurality of mixing parameters based on the first plurality of mixing parameters, the long-term loudness of the audio object data, the long-term loudness of the audio description data, the plurality of short-term loudness values of the audio object data, and the plurality of short-term loudness values of the audio description data; generating a gain adjustment visualization corresponding to the second plurality of mixing parameters, the audio object data, and the audio description data; and generating mixed audio object data by mixing the audio object data and the audio description data according to the second plurality of mixing parameters, wherein the mixed audio object data comprises a second plurality of audio objects, and wherein the second plurality of audio objects corresponds to the first plurality of audio objects mixed with the audio description data according to the second plurality of mixing parameters.
  2. The method of claim 1, wherein the long-term loudness of the audio object data is calculated from a plurality of samples of the audio object data, wherein the long-term loudness of the audio description data is calculated from a plurality of samples of the audio description data, wherein each of the plurality of short-term loudness values of the audio object data is calculated from a single sample of the audio object data, and wherein each of the plurality of short-term loudness values of the audio description data is calculated from a single sample of the audio description data.
  3. The method of any of claims 1-2, wherein the first plurality of mixing parameters is associated with one of a plurality of genres, wherein each of the plurality of genres is associated with a corresponding set of mixing parameters.
  4. The method of claim 3, wherein the plurality of genres includes an action genre, a horror genre, a suspense genre, a news genre, a dialogue genre, a sports genre, and a talk show genre.
  5. The method of claim 1, wherein the look-ahead parameter corresponds to maintaining a uniform gain adjustment during an audio pause of the audio description data.
  6. The method of any of claims 1 or 5, wherein the ramp parameter corresponds to a period of time during which a gain adjustment is gradually applied.
  7. The method of any of claims 1 or 5, wherein the maximum delta parameter corresponds to a maximum loudness difference between a frame of the audio object data and a corresponding frame of the audio description data.
  8. The method of any one of claims 1 to 2, further comprising: receiving user input adjusting the second plurality of mixing parameters prior to generating the mixed audio object data; and generating a modified gain adjustment visualization corresponding to the second plurality of mixing parameters as adjusted according to the user input, wherein the mixed audio object data is generated based on the adjusted second plurality of mixing parameters.
  9. The method of any one of claims 1 to 2, further comprising: prior to receiving the audio object data: receiving audio data, wherein the audio data does not include an audio object, and converting the audio data into the audio object data; and after generating the mixed audio object data: converting the mixed audio object data into mixed audio data, wherein the mixed audio data corresponds to the audio data mixed with the audio description data.
  10. A non-transitory computer-readable medium storing a computer program that, when executed by a processor, controls a device to perform a process comprising the method of any of claims 1 to 9.
  11. An apparatus for audio processing, the apparatus comprising: a processor, wherein the processor is configured to control the apparatus to receive audio object data and audio description data, wherein the audio object data comprises a first plurality of audio objects; wherein the processor is configured to control the apparatus to calculate a long-term loudness of the audio object data and a long-term loudness of the audio description data; wherein the processor is configured to control the apparatus to calculate a plurality of short-term loudness values of the audio object data and a plurality of short-term loudness values of the audio description data; wherein the processor is configured to control the apparatus to read a first plurality of mixing parameters corresponding to the audio object data, wherein the first plurality of mixing parameters includes a look-ahead parameter, a ramp parameter, and a maximum delta parameter; wherein the processor is configured to control the apparatus to generate a second plurality of mixing parameters based on the first plurality of mixing parameters, the long-term loudness of the audio object data, the long-term loudness of the audio description data, the plurality of short-term loudness values of the audio object data, and the plurality of short-term loudness values of the audio description data; wherein the processor is configured to control the apparatus to generate a gain adjustment visualization corresponding to the second plurality of mixing parameters, the audio object data, and the audio description data; and wherein the processor is configured to control the apparatus to generate mixed audio object data by mixing the audio object data and the audio description data according to the second plurality of mixing parameters, wherein the mixed audio object data comprises a second plurality of audio objects, and wherein the second plurality of audio objects corresponds to the first plurality of audio objects mixed with the audio description data according to the second plurality of mixing parameters.
  12. The apparatus of claim 11, further comprising a display configured to display the gain adjustment visualization.
  13. The apparatus of any one of claims 11 to 12, wherein the long-term loudness of the audio object data is calculated from a plurality of samples of the audio object data, wherein the long-term loudness of the audio description data is calculated from a plurality of samples of the audio description data, wherein each of the plurality of short-term loudness values of the audio object data is calculated from a single sample of the audio object data, and wherein each of the plurality of short-term loudness values of the audio description data is calculated from a single sample of the audio description data.
  14. The apparatus of any of claims 11-12, wherein the first plurality of mixing parameters is associated with one of a plurality of genres, wherein each of the plurality of genres is associated with a corresponding set of mixing parameters.
  15. The apparatus of claim 11, wherein the look-ahead parameter corresponds to maintaining a uniform gain adjustment during an audio pause of the audio description data.
  16. The apparatus of any one of claims 11 or 15, wherein the ramp parameter corresponds to a period of time during which a gain adjustment is gradually applied.
  17. The apparatus of any of claims 11 or 15, wherein the maximum delta parameter corresponds to a maximum loudness difference between a frame of the audio object data and a corresponding frame of the audio description data.
  18. The apparatus of any of claims 11-12, wherein the processor is configured to control the apparatus to receive user input adjusting the second plurality of mixing parameters prior to generating the mixed audio object data; wherein the processor is configured to control the apparatus to generate a revised gain adjustment visualization corresponding to the second plurality of mixing parameters as adjusted according to the user input; and wherein the mixed audio object data is generated based on the adjusted second plurality of mixing parameters.
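The ducking behavior implied by claims 1 and 5-7 (a per-frame gain adjustment bounded by a maximum delta, applied gradually over a ramp, and held flat during description pauses) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: frame-level RMS stands in for a true loudness measure, and the names `ducking_gains_db` and `active_floor_db` are hypothetical.

```python
import numpy as np

def short_term_loudness_db(frames: np.ndarray) -> np.ndarray:
    """Per-frame RMS level in dB, a simple stand-in for short-term loudness."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(np.maximum(rms, 1e-12))

def ducking_gains_db(obj_frames: np.ndarray, desc_frames: np.ndarray,
                     max_delta_db: float = 6.0, ramp_frames: int = 4,
                     active_floor_db: float = -60.0) -> np.ndarray:
    """Per-frame gain (in dB, <= 0) applied to the primary audio so the
    description stays audible. The attenuation is capped at max_delta_db
    (the maximum delta parameter) and smoothed over ramp_frames (the ramp
    parameter). Frames where the description falls below active_floor_db
    (pauses) request no attenuation."""
    obj_db = short_term_loudness_db(obj_frames)
    desc_db = short_term_loudness_db(desc_frames)
    # Attenuate only while the description is active, by at most max_delta_db.
    raw = np.where(desc_db > active_floor_db,
                   np.clip(obj_db - desc_db, 0.0, max_delta_db),
                   0.0)
    # Moving-average smoothing so the gain change is applied gradually.
    kernel = np.ones(ramp_frames) / ramp_frames
    return -np.convolve(raw, kernel, mode="same")
```

With a loud primary signal and a quieter description, the frames where the description is active receive up to `max_delta_db` of attenuation, ramping in and out at the boundaries; silent description frames far from the boundary receive none.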

Description

Automatic mixing of audio descriptions

Cross Reference to Related Applications

The present application claims the benefit of U.S. provisional application No. 63/009,327, filed on April 13, 2020 and directed to automatically mixing audio description into immersive media, which is incorporated herein by reference.

Technical Field

The present disclosure relates to audio processing, and in particular to audio mixing.

Background

Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

Audio description generally refers to a verbal description of the visual component of an audiovisual medium such as a movie. The audio description aids vision-impaired consumers in perceiving audiovisual media. For example, the audio description may verbally describe visual aspects of the movie, such as movements of characters and objects, facial expressions, and so forth. The audio description is distinct from the primary audio (also called the default audio), which refers to the audio aspects of the audiovisual content itself (e.g., dialog, sound effects, background music, etc.).

Typically, the audio description is generated as a separate file that an audio engineer mixes with the main audio file to create an audio version that contains the audio description. The audio engineer performs the mixing to create a coordinated listening experience, for example so that the audio description is audible in noisy scenes and not too loud in quiet scenes. Applying a gain that reduces the loudness level (e.g., a gain less than 1.0) may be referred to as ducking. The content provider (e.g., the Netflix™ service, the Amazon Prime Video™ service, the Hulu™ service, the Apple TV+™ service, etc.) may then provide various audio file versions from which the consumer may select.
These versions may include main audio files in various formats (stereo, 5.1-channel surround sound, etc.), various languages (e.g., English, Spanish, French, Japanese, Korean, etc.), versions with audio descriptions, and so on. The content provider stores the audio file versions and provides the selected audio file to the consumer (e.g., via the Hypertext Transfer Protocol (HTTP) Live Streaming (HLS) protocol), for example as the audio component of a stream of audiovisual data.

As mentioned above, the audio file versions may have a variety of formats, including mono, stereo, 5.1-channel surround, 7.1-channel surround, and the like. Other audio formats that have evolved more recently include the Ambisonics format (also known as the B-format), the Dolby Atmos™ format, and the like. In general, the Ambisonics format corresponds to a three-dimensional representation of sound pressure and sound pressure gradients in various dimensions. In general, the Dolby Atmos™ format corresponds to a collection of audio objects, each of which includes an audio track and metadata defining where the audio track is to be output.

Disclosure of Invention

One problem with existing systems is the time required to perform the mixing. Mixing typically requires an audio engineer to spend multiple man-hours on each hour of content. For example, a 90-minute movie may involve 16 to 24 hours of work to generate an audio mix containing an audio description. In addition, the audio may exist in multiple base formats (e.g., stereo, 5.1-channel surround sound) and multiple languages, and generating an audio description mix for each combination of format and language multiplies the time required. Embodiments relate to automatically generating a mix containing an audio description in order to reduce the time required by the audio engineer.
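The audio-object representation described above (a track plus metadata saying where to render it) can be illustrated with a minimal data structure. This is an illustrative sketch only: `AudioObject` and its fields are hypothetical names, not the actual Dolby Atmos schema.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class AudioObject:
    """Minimal stand-in for an object-audio element: a mono track plus
    positional metadata (illustrative fields, not a real format)."""
    track: np.ndarray   # mono PCM samples
    position: tuple     # (x, y, z) in a normalized room
    gain: float = 1.0

def mix_description_into_objects(objects, description, gains_lin):
    """Return new objects with a per-sample ducking gain applied to each
    primary object; the description is carried as an additional object
    placed at an assumed front-center position."""
    ducked = [AudioObject(obj.track * gains_lin, obj.position, obj.gain)
              for obj in objects]
    ducked.append(AudioObject(description, (0.0, 1.0, 0.0)))
    return ducked
```

Keeping the description as its own object, rather than baking it into a channel bed, means a single object-based mix can later be rendered to stereo, 5.1, or any other target layout.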
According to an embodiment, a computer-implemented audio processing method includes receiving audio object data and audio description data, wherein the audio object data includes a first set of audio objects. The method further includes calculating a long-term loudness of the audio object data and a long-term loudness of the audio description data. The method further includes calculating a short-term loudness of the audio object data and a short-term loudness of the audio description data. The method further includes reading a first set of mixing parameters corresponding to the audio object data. The method further includes generating a second set of mixing parameters based on the first set of mixing parameters, the long-term loudness of the audio object data, the long-term loudness of the audio description data, the short-term loudness of the audio object data, and the short-term loudness of the audio description data. The method further includes generating a gain adjustment visualization corresponding to the second set of mixing parameters, the audio object data, and the audio description data. The method further includes generating mixed audio object data by mixing the audio object data and the audio description data according to the second set of mixing parameters.
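One way to read the role of the long-term loudness values in the method above is as a static level alignment applied before any dynamic, short-term ducking. The sketch below illustrates that idea under explicit assumptions: overall RMS is used as a crude proxy for long-term loudness (a production system would use an integrated loudness meter such as one conforming to ITU-R BS.1770), and `target_offset_db` is a hypothetical parameter, not one named in the claims.

```python
import numpy as np

def long_term_loudness_db(samples) -> float:
    """Programme-level RMS in dB over all samples, a simple proxy for
    integrated (long-term) loudness."""
    rms = np.sqrt(np.mean(np.asarray(samples, dtype=float) ** 2))
    return 20.0 * np.log10(rms + 1e-12)

def static_alignment_gain(obj_samples, desc_samples,
                          target_offset_db: float = -3.0) -> float:
    """Linear gain that places the description target_offset_db below the
    primary audio's long-term loudness, before dynamic ducking is applied."""
    delta_db = (long_term_loudness_db(obj_samples) + target_offset_db
                - long_term_loudness_db(desc_samples))
    return 10.0 ** (delta_db / 20.0)
```

Splitting the problem this way (one static gain from the long-term loudness difference, then frame-by-frame adjustments from the short-term loudness values) keeps the dynamic ducking stage small and bounded, which is consistent with capping it via a maximum delta parameter.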