US-12626718-B2 - Cover song identification method and system

US 12626718 B2

Abstract

A cover song identification method implemented by a computing system comprises receiving, by a computing system and from a user device, harmonic pitch class profile (HPCP) information that specifies one or more HPCP features associated with target audio content. A major chord profile feature and a minor chord profile feature associated with the target audio content are derived from the HPCP features. Machine learning logic of the computing system determines, based on the major chord profile feature and the minor chord profile feature, a relatedness between the target audio content and each of a plurality of audio content items specified in records of a database. Each audio content item is associated with cover song information. Cover song information associated with an audio content item having a highest relatedness to the target audio content is communicated to the user device.
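The abstract's final step, returning the cover song information for the database item most related to the target, amounts to a nearest-neighbor lookup over relatedness scores. The sketch below models relatedness as cosine similarity between embedding vectors; this measure, the `best_match` name, and the record layout are illustrative assumptions, since the patent leaves the relatedness computation to its machine learning logic.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_match(target_embedding, database):
    """Return the cover song information of the most related database item.

    `database` is a list of (embedding, cover_song_info) records. Relatedness
    is modeled as cosine similarity -- an assumption, not the patent's method.
    """
    return max(database,
               key=lambda rec: cosine_similarity(target_embedding, rec[0]))[1]
```

In a deployed system the database-side embeddings would be precomputed and indexed, so only the target embedding is computed at query time.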

Inventors

  • Xiaochen Liu
  • Joseph P. Renner
  • Joshua E. Morris
  • Todd J. Hodges
  • Robert Coover
  • Zafar Rafii

Assignees

  • GRACENOTE, INC.

Dates

Publication Date
2026-05-12
Application Date
2024-07-10

Claims (20)

  1. A tangible, non-transitory computer-readable storage medium comprising instructions that, when executed by one or more processors, cause a computing device to perform a set of operations comprising: deriving, from one or more harmonic pitch class profile (HPCP) features associated with target audio content, a major chord profile feature and a minor chord profile feature associated with the target audio content, wherein deriving the major chord profile feature and the minor chord profile feature associated with the target audio content comprises selecting HPCP features that correlate with twelve different major chords and twelve different minor chords, time-aligning the selected HPCP features to a nearest bar line or measure based on an estimated tempo and beat associated with the target audio content, and normalizing the time-aligned selected HPCP features, to emphasize a sequential structure of the target audio content with major and minor chords that are present in the target audio content; and determining, by machine learning logic, based on at least the major chord profile feature and the minor chord profile feature, a relatedness between the target audio content and at least one audio content item of a database, wherein the at least one audio content item is associated with cover song information.
  2. The tangible, non-transitory computer-readable storage medium of claim 1, wherein the set of operations further comprises communicating cover song information associated with the at least one audio content item based on a relatedness to the target audio content.
  3. The tangible, non-transitory computer-readable storage medium of claim 1, wherein the at least one audio content item comprises a plurality of audio content items.
  4. The tangible, non-transitory computer-readable storage medium of claim 3, wherein determining a relatedness between the target audio content and at least one audio content item of a database comprises determining a relatedness between the target audio content and each of the plurality of audio content items of the database.
  5. The tangible, non-transitory computer-readable storage medium of claim 4, wherein the set of operations further comprises communicating cover song information associated with an audio content item of the plurality of audio content items having a highest relatedness to the target audio content.
  6. The tangible, non-transitory computer-readable storage medium of claim 1, wherein determining the relatedness between the target audio content and at least one audio content item of a database further comprises: determining, by the machine learning logic, a target embedding associated with the major chord profile feature and the minor chord profile feature associated with the target audio content.
  7. The tangible, non-transitory computer-readable storage medium of claim 6, wherein determining the target embedding comprises: inputting the major chord profile feature and the minor chord profile feature associated with the target audio content into a convolutional neural network (CNN) of the machine learning logic; receiving, by a recurrent neural network (RNN) of the machine learning logic, an output of the CNN; and reshaping an output of the RNN to a vector that corresponds to the target embedding.
  8. A computing device that facilitates cover song identification comprising: one or more processors; and a tangible, non-transitory computer-readable storage medium comprising instructions that, when executed by the one or more processors, cause the computing device to perform a set of operations comprising: deriving, from one or more harmonic pitch class profile (HPCP) features associated with target audio content, a major chord profile feature and a minor chord profile feature associated with the target audio content, wherein deriving the major chord profile feature and the minor chord profile feature associated with the target audio content comprises selecting HPCP features that correlate with twelve different major chords and twelve different minor chords, time-aligning the selected HPCP features to a nearest bar line or measure based on an estimated tempo and beat associated with the target audio content, and normalizing the time-aligned selected HPCP features, to emphasize a sequential structure of the target audio content with major and minor chords that are present in the target audio content; and determining, by machine learning logic, based on at least the major chord profile feature and the minor chord profile feature, a relatedness between the target audio content and at least one audio content item of a database, wherein the at least one audio content item is associated with cover song information.
  9. The computing device of claim 8, wherein the set of operations further comprises communicating cover song information associated with the at least one audio content item based on a relatedness to the target audio content.
  10. The computing device of claim 8, wherein the at least one audio content item comprises a plurality of audio content items.
  11. The computing device of claim 10, wherein determining a relatedness between the target audio content and at least one audio content item of a database comprises determining a relatedness between the target audio content and each of the plurality of audio content items of the database.
  12. The computing device of claim 11, wherein the set of operations further comprises communicating cover song information associated with an audio content item of the plurality of audio content items having a highest relatedness to the target audio content.
  13. The computing device of claim 8, wherein determining the relatedness between the target audio content and at least one audio content item of a database further comprises: determining, by the machine learning logic, a target embedding associated with the major chord profile feature and the minor chord profile feature associated with the target audio content.
  14. The computing device of claim 13, wherein determining the target embedding comprises: inputting the major chord profile feature and the minor chord profile feature associated with the target audio content into a convolutional neural network (CNN) of the machine learning logic; receiving, by a recurrent neural network (RNN) of the machine learning logic, an output of the CNN; and reshaping an output of the RNN to a vector that corresponds to the target embedding.
  15. A computer-implemented method comprising: deriving, from one or more harmonic pitch class profile (HPCP) features associated with target audio content, a major chord profile feature and a minor chord profile feature associated with the target audio content, wherein deriving the major chord profile feature and the minor chord profile feature associated with the target audio content comprises selecting HPCP features that correlate with twelve different major chords and twelve different minor chords, time-aligning the selected HPCP features to a nearest bar line or measure based on an estimated tempo and beat associated with the target audio content, and normalizing the time-aligned selected HPCP features, to emphasize a sequential structure of the target audio content with major and minor chords that are present in the target audio content; and determining, by machine learning logic, based on at least the major chord profile feature and the minor chord profile feature, a relatedness between the target audio content and at least one audio content item of a database, wherein the at least one audio content item is associated with cover song information.
  16. The computer-implemented method of claim 15, wherein the method further comprises communicating cover song information associated with the at least one audio content item based on a relatedness to the target audio content.
  17. The computer-implemented method of claim 15, wherein the at least one audio content item comprises a plurality of audio content items.
  18. The computer-implemented method of claim 17, wherein determining a relatedness between the target audio content and at least one audio content item of a database comprises determining a relatedness between the target audio content and each of the plurality of audio content items of the database.
  19. The computer-implemented method of claim 18, wherein the method further comprises communicating cover song information associated with an audio content item of the plurality of audio content items having a highest relatedness to the target audio content.
  20. The computer-implemented method of claim 15, wherein determining the relatedness between the target audio content and at least one audio content item of a database further comprises: determining, by the machine learning logic, a target embedding associated with the major chord profile feature and the minor chord profile feature associated with the target audio content.
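Claims 1, 8, and 15 all recite the same derivation: select HPCP features that correlate with the twelve major and twelve minor triads, time-align them based on estimated tempo and beat, and normalize the result. A minimal sketch of that idea follows, assuming binary triad templates over the 12 pitch classes and beat-level (rather than bar-level) alignment; the function names, array shapes, and normalization choice are illustrative, not taken from the patent.

```python
import numpy as np

# Binary triad templates over the 12 pitch classes (C, C#, ..., B):
# a major triad is root + major third + fifth; a minor triad swaps in a minor third.
MAJOR_INTERVALS = [0, 4, 7]
MINOR_INTERVALS = [0, 3, 7]

def chord_templates(intervals):
    """One 12-dimensional binary template per chord root, twelve roots total."""
    templates = np.zeros((12, 12))
    for root in range(12):
        for interval in intervals:
            templates[root, (root + interval) % 12] = 1.0
    return templates

def chord_profiles(hpcp, beat_frames):
    """Derive major and minor chord profile features from an HPCP matrix.

    hpcp: (n_frames, 12) array of HPCP features.
    beat_frames: frame indices of estimated beats, used here as a stand-in
    for the patent's bar-line/measure alignment.
    Returns (major, minor), each (n_beats - 1, 12), with each row normalized
    to unit length so the sequential chord structure, not loudness, dominates.
    """
    maj_t, min_t = chord_templates(MAJOR_INTERVALS), chord_templates(MINOR_INTERVALS)
    major, minor = [], []
    for start, end in zip(beat_frames[:-1], beat_frames[1:]):
        segment = hpcp[start:end].mean(axis=0)  # average HPCP over one beat
        major.append(maj_t @ segment)           # correlation with 12 major chords
        minor.append(min_t @ segment)           # correlation with 12 minor chords
    major, minor = np.array(major), np.array(minor)
    major /= np.linalg.norm(major, axis=1, keepdims=True) + 1e-9
    minor /= np.linalg.norm(minor, axis=1, keepdims=True) + 1e-9
    return major, minor
```

Because the templates and normalization are key-relative, a cover transposed to another key shifts the profile rows circularly rather than changing their shape, which is what makes this representation useful for cover matching.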

Description

RELATED APPLICATIONS

This application claims the benefit of priority under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 63/133,042, filed Dec. 31, 2020, the content of which is incorporated herein by reference in its entirety.

BACKGROUND

Field

This application generally relates to audio content recognition. In particular, this application describes a cover song identification method and system.

Description of Related Art

Cover song identification (CSI) is a popular task in music information retrieval (MIR) that aims to identify whether two music recordings are different renditions, or covers, of the same composition. CSI is utilized in applications such as the classification of musical works, music rights management, and general music similarity search. Covers typically vary in key, tempo, singer, or instrumentation, which can make identifying a particular cover song challenging.

SUMMARY

In a first aspect, a cover song identification method implemented by a computing system comprises receiving, by the computing system and from a user device, harmonic pitch class profile (HPCP) information that specifies one or more HPCP features associated with target audio content. A major chord profile feature and a minor chord profile feature associated with the target audio content are derived from the HPCP features. Machine learning logic of the computing system determines, based on the major chord profile feature and the minor chord profile feature, a relatedness between the target audio content and each of a plurality of audio content items specified in records of a database. Each audio content item is associated with cover song information. Cover song information associated with the audio content item having the highest relatedness to the target audio content is communicated to the user device.

In a second aspect, a computing system that facilitates cover song identification includes a memory and a processor. The memory stores instruction code, and the processor is in communication with the memory. The instruction code is executable by the processor to cause the computing system to perform operations that include receiving, by the computing system and from a user device, HPCP information that specifies one or more HPCP features associated with target audio content. A major chord profile feature and a minor chord profile feature associated with the target audio content are derived from the HPCP features. Machine learning logic of the computing system determines, based on these features, a relatedness between the target audio content and each of a plurality of audio content items specified in records of a database. Each audio content item is associated with cover song information. Cover song information associated with the audio content item having the highest relatedness to the target audio content is communicated to the user device.

In a third aspect, a non-transitory computer-readable medium has stored thereon instruction code that facilitates cover song identification. When the instruction code is executed by a processor, the processor performs operations that include receiving, by a computing system and from a user device, HPCP information that specifies one or more HPCP features associated with target audio content. A major chord profile feature and a minor chord profile feature associated with the target audio content are derived from the HPCP features. Machine learning logic determines, based on these features, a relatedness between the target audio content and each of a plurality of audio content items specified in records of a database. Each audio content item is associated with cover song information. Cover song information associated with the audio content item having the highest relatedness to the target audio content is communicated to the user device.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are included to provide a further understanding of the claims, are incorporated in, and constitute a part of this specification. The detailed description and illustrated examples serve to explain the principles defined by the claims.

FIG. 1 illustrates an environment that includes various systems/devices that facilitate performing audio content recognition, in accordance with an example.

FIG. 2 illustrates an audio source device, in accordance with an example.

FIG. 3 illustrates a content recognition system (CRS), in accordance with an example.

FIG. 4 illustrates machine learning (ML) logic implemented by the CRS, in accordance with an example.

FIG. 5 illustrates content matching records stored in a database of the CRS, in accordance with an example.

FIG. 6 illustrates operations performed by the audio source device and/or the CRS, in accordance with an example.
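Claims 7 and 14 specify only the ordering of the machine learning logic: the chord profile features feed a CNN, the CNN output feeds an RNN, and the RNN output is reshaped to an embedding vector. The following PyTorch sketch mirrors that ordering; the layer counts, kernel sizes, and embedding dimension are assumptions made for illustration, not values disclosed in the patent.

```python
import torch
import torch.nn as nn

class CoverSongEmbedder(nn.Module):
    """Sketch of the claimed CNN -> RNN -> embedding pipeline.

    Input: (batch, 2, time, 12) -- the major and minor chord profile
    features stacked as two channels over beat-aligned time steps.
    All layer sizes here are illustrative assumptions.
    """
    def __init__(self, embed_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Each time step becomes a 32 * 12 feature vector for the RNN.
        self.rnn = nn.GRU(input_size=32 * 12, hidden_size=embed_dim,
                          batch_first=True)

    def forward(self, x):
        feats = self.cnn(x)                                  # (batch, 32, time, 12)
        b, c, t, p = feats.shape
        seq = feats.permute(0, 2, 1, 3).reshape(b, t, c * p)  # one vector per step
        _, hidden = self.rnn(seq)          # final hidden state summarizes the song
        return hidden.squeeze(0)           # reshape to a (batch, embed_dim) embedding

# A target embedding for 64 beat-aligned frames of chord profile features:
model = CoverSongEmbedder()
embedding = model(torch.randn(1, 2, 64, 12))
```

The resulting fixed-length embedding is what the relatedness comparison would operate on: two renditions of the same composition should land near each other in the embedding space regardless of differences in key, tempo, or instrumentation.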