US-12621419-B2 - Systems and methods for generating video based on informational audio data
Abstract
In one embodiment, a computer-implemented method may include receiving a media file and extracting, using an artificial intelligence engine including one or more trained machine learning models, one or more audio features from the media file. The one or more audio features include at least one of a time-synchronized transcript, speaker recognition data, mood data, index data, visual asset speaker data, color palette data, and written description speaker data. The method may include generating, based on the one or more audio features, a video, wherein the video is presented via a media player on a user interface of a computing device.
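As a rough illustration of the pipeline the abstract describes (receive a media file, extract time-synchronized audio features with trained models, then render a video from them), here is a minimal sketch. The `TranscriptSegment` and `AudioFeatures` containers and the `extract`/`render` callables are hypothetical names invented for this example; the patent does not disclose source code.

```python
# Illustrative sketch only; not the patented implementation.
# All names below are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class TranscriptSegment:
    start: float     # segment start time, in seconds
    end: float       # segment end time, in seconds
    speaker_id: str  # label produced by speaker recognition (diarization)
    text: str        # words spoken during this span

@dataclass
class AudioFeatures:
    transcript: list[TranscriptSegment] = field(default_factory=list)
    mood: Optional[str] = None                                     # e.g., "calm", "upbeat"
    speaker_assets: dict[str, str] = field(default_factory=dict)   # speaker_id -> avatar image path
    color_palette: list[str] = field(default_factory=list)         # theme colors for the video
    speaker_descriptions: dict[str, str] = field(default_factory=dict)  # speaker_id -> written blurb

def generate_video(audio_path: str,
                   extract: Callable[[str], AudioFeatures],
                   render: Callable[[AudioFeatures], bytes]) -> bytes:
    """Receive an audio file, extract features with ML models, return an encoded video."""
    features = extract(audio_path)  # one or more trained models sit behind this call
    return render(features)         # compose visuals driven by the extracted features
```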
Inventors
- Marco Paglia
- Alessandro Camedda
- Davide Soranzio
Assignees
- Musixmatch S.P.A.
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-06-07
Claims (17)
- 1. A computer-implemented method comprising: receiving an audio file; extracting, using an artificial intelligence engine comprising one or more trained machine learning models, one or more audio features from the audio file, wherein the one or more audio features comprise at least one of a time-synchronized transcript, speaker recognition data, mood data, visual asset speaker data, color palette data, and written description speaker data, wherein the visual asset speaker data comprises information pertaining to one or more visual representations of one or more speakers associated with the speaker recognition data, and the color palette data comprises information pertaining to the images; assigning, using the visual asset speaker data, one or more visual representations to one or more speakers identified in the speaker recognition data; and generating, based on the one or more audio features, a video, wherein the video is presented via a media player on a user interface of a computing device, and the video presents the one or more visual representations of the one or more speakers when the one or more speakers speak during the video.
- 2. The computer-implemented method of claim 1, further comprising causing the time-synchronized transcript to be dynamically presented in conjunction with the one or more visual representations of the one or more speakers during playback of the video.
- 3. The computer-implemented method of claim 1, further comprising modifying, based on the mood data, a visual representation of speech of a speaker during playback of the video.
- 4. The computer-implemented method of claim 1, further comprising modifying, based on the written description speaker data, presented text in the video during playback of the video.
- 5. The computer-implemented method of claim 1, further comprising: identifying, based on the time-synchronized transcript, an entity associated with a word; obtaining an image associated with the entity; and modifying the video to include the image during playback of the video.
- 6. The computer-implemented method of claim 5, further comprising: obtaining written description data of the entity; and modifying the video to include the written description data of the entity and the image during playback of the video.
- 7. The computer-implemented method of claim 6, further comprising: receiving a selection of the image or the written description data of the entity; and responsive to receiving the selection, causing additional information pertaining to the entity to be presented via the user interface.
- 8. A tangible, non-transitory computer-readable media storing instructions that, when executed, cause one or more processing devices to: receive an audio file; extract, using an artificial intelligence engine comprising one or more trained machine learning models, one or more audio features from the audio file, wherein the one or more audio features comprise at least one of a time-synchronized transcript, speaker recognition data, mood data, visual asset speaker data, color palette data, and written description speaker data, wherein the visual asset speaker data comprises information pertaining to one or more visual representations of one or more speakers associated with the speaker recognition data, and the color palette data comprises information pertaining to the images; assign, using the visual asset speaker data, one or more visual representations to one or more speakers identified in the speaker recognition data; and generate, based on the one or more audio features, a video, wherein the video is presented via a media player on a user interface of a computing device, and the video presents the one or more visual representations of the one or more speakers when the one or more speakers speak during the video.
- 9. The computer-readable media of claim 8, further comprising causing the time-synchronized transcript to be dynamically presented in conjunction with the one or more visual representations of the one or more speakers during playback of the video.
- 10. The computer-readable media of claim 8, further comprising modifying, based on the mood data, a visual representation of speech of a speaker during playback of the video.
- 11. The computer-readable media of claim 8, further comprising modifying, based on the written description speaker data, presented text in the video during playback of the video.
- 12. The computer-readable media of claim 8, further comprising: identifying, based on the time-synchronized transcript, an entity associated with a word; obtaining an image associated with the entity; and modifying the video to include the image during playback of the video.
- 13. The computer-readable media of claim 12, further comprising: obtaining written description data of the entity; and modifying the video to include the written description data of the entity and the image during playback of the video.
- 14. The computer-readable media of claim 13, further comprising: receiving a selection of the image or the written description data of the entity; and responsive to receiving the selection, causing additional information pertaining to the entity to be presented via the user interface.
- 15. A system comprising: a memory device storing instructions; and a processing device communicatively coupled to the memory device, wherein the processing device executes the instructions to: receive an audio file; extract, using an artificial intelligence engine comprising one or more trained machine learning models, one or more audio features from the audio file, wherein the one or more audio features comprise at least one of a time-synchronized transcript, speaker recognition data, mood data, visual asset speaker data, color palette data, and written description speaker data, wherein the visual asset speaker data comprises information pertaining to one or more visual representations of one or more speakers associated with the speaker recognition data, and the color palette data comprises information pertaining to the images; assign, using the visual asset speaker data, one or more visual representations to one or more speakers identified in the speaker recognition data; and generate, based on the one or more audio features, a video, wherein the video is presented via a media player on a user interface of a computing device, and the video presents the one or more visual representations of the one or more speakers when the one or more speakers speak during the video.
- 16. The system of claim 15, further comprising causing the time-synchronized transcript to be dynamically presented in conjunction with the one or more visual representations of the one or more speakers during playback of the video.
- 17. The system of claim 15, further comprising modifying, based on the mood data, a visual representation of speech of a speaker during playback of the video.
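Claims 1 through 4 (mirrored in claims 8 through 11 and 15 through 17) describe a video in which each speaker's assigned visual representation appears while that speaker talks, the time-synchronized transcript is presented alongside it, and mood data modifies how speech is drawn. The sketch below is one hedged reading of that playback logic, reusing the hypothetical `AudioFeatures` shape from the sketch after the abstract; the mood-to-style mapping is an assumption for illustration, not part of the claims.

```python
# Hedged illustration of claims 1-4; the styling rules are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Frame:
    """What the media player would draw at one playback instant."""
    avatar: Optional[str]   # visual representation of the currently active speaker (claim 1)
    caption: Optional[str]  # time-synchronized transcript text (claim 2)
    text_color: str         # mood-driven styling of the rendered speech (claim 3)

# Hypothetical mood -> style mapping; the patent does not specify one.
MOOD_STYLES = {"calm": "#4a90d9", "tense": "#d94a4a"}

def frame_at(t: float, features) -> Frame:
    """Resolve the transcript segment active at time t and style it.

    `features` is expected to have the AudioFeatures shape sketched earlier:
    .transcript, .speaker_assets, and .mood attributes.
    """
    for seg in features.transcript:
        if seg.start <= t < seg.end:
            return Frame(
                avatar=features.speaker_assets.get(seg.speaker_id),  # assigned visual asset
                caption=seg.text,
                text_color=MOOD_STYLES.get(features.mood, "#ffffff"),
            )
    return Frame(avatar=None, caption=None, text_color="#ffffff")  # silence: no speaker shown
```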
Description
CROSS-REFERENCE TO RELATED APPLICATION AND CLAIM OF PRIORITY

This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/506,946, filed on Jun. 8, 2023, titled “SYSTEMS AND METHODS FOR GENERATING VIDEO BASED ON INFORMATIONAL AUDIO DATA.” The above-identified provisional patent application is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

This disclosure relates to content. More specifically, this disclosure relates to systems and methods for generating video based on informational audio data.

BACKGROUND

Content items (e.g., songs, movies, videos, podcasts, transcriptions, etc.) are conventionally played via a computing device, such as a smartphone, laptop, desktop, television, or the like.

SUMMARY

In one embodiment, a computer-implemented method may include receiving an audio file and extracting, using an artificial intelligence engine including one or more trained machine learning models, one or more audio features from the audio file. The one or more audio features include at least one of a time-synchronized transcript, speaker recognition data, mood data, index data, visual asset speaker data, color palette data, and written description speaker data. The method may include generating, based on the one or more audio features, a video, wherein the video is presented via a media player on a user interface of a computing device.

In one embodiment, a tangible, non-transitory computer-readable medium stores instructions that, when executed, cause a processing device to perform any operation of any method disclosed herein.

In one embodiment, a system includes a memory device storing instructions and a processing device communicatively coupled to the memory device. The processing device executes the instructions to perform any operation of any method disclosed herein.

Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

For a detailed description of example embodiments, reference will now be made to the accompanying drawings, in which:

FIG. 1 illustrates a system architecture according to certain embodiments of this disclosure;
FIG. 2 illustrates an example of a method for generating a customized video based on at least audio features extracted from an audio file according to certain embodiments of this disclosure;
FIG. 3 illustrates an example video generated based on one or more audio features extracted from an audio file according to certain embodiments of this disclosure;
FIG. 4 illustrates another example video generated based on one or more audio features extracted from an audio file according to certain embodiments of this disclosure;
FIG. 5 illustrates another example video generated based on one or more audio features extracted from an audio file according to certain embodiments of this disclosure;
FIG. 6 illustrates another example video generated based on one or more audio features extracted from an audio file according to certain embodiments of this disclosure;
FIG. 7 illustrates another example video generated based on one or more audio features extracted from an audio file according to certain embodiments of this disclosure;
FIG. 8 illustrates another example video generated based on one or more audio features extracted from an audio file according to certain embodiments of this disclosure;
FIG. 9 illustrates an example computer system according to embodiments of this disclosure.
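FIGS. 3 through 8, per the brief description above, depict videos generated from extracted audio features. To make the entity-enrichment behavior recited in claims 5 through 7 (and mirrored in claims 12 through 14) concrete, the following is a minimal sketch, assuming hypothetical `entity_linker`, `image_lookup`, and `describe` helpers; the patent does not name or specify these components, and this is not the claimed implementation.

```python
# Hedged sketch of the entity enrichment recited in claims 5-7.
# entity_linker, image_lookup, and describe are hypothetical stand-ins for
# whatever models or services an implementation would actually use.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EntityOverlay:
    start: float       # when the overlay appears during playback
    entity: str        # entity identified behind a transcribed word (claim 5)
    image_path: str    # image associated with the entity (claim 5)
    description: str   # written description data of the entity (claim 6)

def build_overlays(segments,
                   entity_linker: Callable[[str], Optional[str]],
                   image_lookup: Callable[[str], str],
                   describe: Callable[[str], str]) -> list[EntityOverlay]:
    """Scan time-synchronized transcript segments and attach enrichments."""
    overlays = []
    for seg in segments:
        entity = entity_linker(seg.text)  # identify an entity associated with a word
        if entity is not None:
            overlays.append(EntityOverlay(seg.start, entity,
                                          image_lookup(entity), describe(entity)))
    return overlays

def on_overlay_selected(overlay: EntityOverlay) -> str:
    """Claim 7: selecting the image or description surfaces additional information."""
    return f"More about {overlay.entity}: {overlay.description}"
```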
NOTATION AND NOMENCLATURE

Various terms are used to refer to particular system components. Different entities may refer to a component by different names—this document does not intend to distinguish between components that differ in name but not function.

In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . .” Also, the term “couple” or “couples” is intended to mean either an indirect or direct connection. Thus, if a first device couples to a second device, that connection may be through a direct connection or through an indirect connection via other devices and connections.

The terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms “a,” “an,” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise.

The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order discussed or illustrated, unless specifically identified as an order of performance. It is also to be understood that additional or alternative steps may be employed.

The terms first, second, third, etc. may be used herein to describe various elements, components, regions, layers and/or sections; however, these elements should not be limited by these terms.