CN-118136019-B - Audio data rhythm analysis playing method, equipment and storage medium
Abstract
The invention relates to the field of audio analysis, and discloses a method, a device, and a storage medium for analyzing and playing the musical rhythm of audio data. The method comprises the steps of: receiving audio data; performing phoneme recognition processing on the audio data according to a preset phoneme recognition algorithm to obtain a phoneme sequence; dividing and marking the phoneme sequence according to preset rhythm data to obtain a rhythm sequence; performing description matching processing on the rhythm sequence by using a preset description library to obtain descriptive text corresponding to the audio data; and, when the audio data is played, displaying the text segment corresponding to the playing position in the descriptive text based on the playing position of the audio data. The embodiments of the invention address the technical problem that existing analysis technology lacks analysis of the rhythm of audio and cannot meet people's demand for the analysis of audio music.
Inventors
- ZHANG SIJIAN
- YANG DEWEN
- PI BIHONG
- LONG DINGFEN
Assignees
- 深圳市同行者科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-01-15
Claims (6)
- 1. A method for analyzing and playing the musical rhythm of audio data, comprising the following steps: receiving audio data; performing phoneme recognition processing on the audio data according to a preset phoneme recognition algorithm to obtain a phoneme sequence; dividing and marking the phoneme sequence according to preset rhythm data to obtain a rhythm sequence; performing description matching processing on the rhythm sequence by using a preset description library to obtain descriptive text corresponding to the audio data; and, when the audio data is played, displaying the text segment corresponding to the playing position in the descriptive text based on the playing position of the audio data; wherein performing phoneme recognition processing on the audio data according to the preset phoneme recognition algorithm to obtain the phoneme sequence comprises: performing frame decomposition processing on the audio data according to a preset frame window to obtain N frames of audio, where N is a positive integer; vectorizing the N frames of audio data to obtain N frame vectors; performing convolution processing on the N frame vectors respectively according to a preset convolution matrix set to obtain N convolved frame vectors; normalizing the N convolved frame vectors to obtain N normalized frame vectors; activating the N normalized frame vectors according to a preset activation function to obtain N activation values; and generating the phoneme sequence based on the N activation values; wherein vectorizing the N frames of audio data to obtain the N frame vectors comprises: performing sequential convolution processing on the frame audio data to obtain convolved frame data; and performing time frame averaging processing on the convolved frame data to obtain a frame vector; wherein performing the sequential convolution processing on the frame audio data to obtain the convolved frame data comprises: based on a preset first convolution kernel, extracting the frame data at t-2, t-1, t, t+1 and t+2 within the range T of the frame audio data and performing convolution processing to obtain a first convolution sub-audio; based on a preset second convolution kernel, extracting the frame data at t-2, t and t+2 within the range T of the frame audio data and performing convolution processing to obtain a second convolution sub-audio; based on a preset third convolution kernel, extracting the frame data at t within the range T of the frame audio data and performing convolution processing to obtain a third convolution sub-audio; and splicing the first convolution sub-audio, the second convolution sub-audio and the third convolution sub-audio in parallel to obtain the convolved frame data; wherein obtaining the descriptive text corresponding to the audio data comprises: selecting a target rhythm name in the rhythm sequence; matching the description field corresponding to the target rhythm name in the preset description library, and establishing a mapping relation between the description field and the rhythm sequence; and, based on the rhythm ordering of the rhythm sequence, combining the description fields corresponding to all the rhythm names in the rhythm sequence to generate the descriptive text corresponding to the audio data.
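The three-branch convolution in claim 1 can be sketched as follows. The first kernel covers the dense neighborhood t-2..t+2, the second covers t-2, t, t+2 (a dilation of 2), and the third covers t alone; the branch outputs are "spliced in parallel" by stacking. Zero padding at the frame edges and the kernel values are illustrative assumptions, not specified by the patent.

```python
import numpy as np

def sequential_convolution(frame, k1, k2, k3):
    """Three parallel convolution branches over one frame of audio samples.

    k1 has length 5 (taps t-2..t+2), k2 has length 3 (taps t-2, t, t+2),
    k3 has length 1 (tap t). Returns the "convolved frame data" as an
    array of shape (3, T).
    """
    frame = np.asarray(frame, dtype=float)
    T = len(frame)
    padded = np.pad(frame, 2)  # zeros outside [0, T): an assumed boundary rule
    h1 = np.array([padded[t:t + 5] @ k1 for t in range(T)])            # dense branch
    h2 = np.array([padded[[t, t + 2, t + 4]] @ k2 for t in range(T)])  # dilated branch
    h3 = frame * k3[0]                                                 # pointwise branch
    return np.stack([h1, h2, h3])  # spliced in parallel
```

With all-ones kernels, each output row reduces to a neighborhood sum, which makes the tap pattern of each branch easy to verify by hand.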
- 2. The method for analyzing and playing the musical rhythm of audio data according to claim 1, wherein performing the time frame averaging processing on the convolved frame data to obtain the frame vector comprises: [formula not reproduced in the source text], wherein H is the average time frame, H_t1 is the audio value of the first convolution sub-audio at time t, H_t2 is the audio value of the second convolution sub-audio at time t, H_t3 is the audio value of the third convolution sub-audio at time t, and T is the end time of the frame audio data; and generating the frame vector according to the average time frame.
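Claim 2's formula appears only as an image in the patent and is not reproduced in this text. One plausible reading, given the variable definitions, is an elementwise mean of the three convolution sub-audios at each time t; the sketch below implements that assumption, not the patent's verbatim equation.

```python
import numpy as np

def time_frame_average(h1, h2, h3):
    """Average the three convolution sub-audio values H_t1, H_t2, H_t3 at
    each time t to form the average time frame H, from which the frame
    vector is generated. The elementwise mean is an assumption."""
    return (np.asarray(h1, dtype=float)
            + np.asarray(h2, dtype=float)
            + np.asarray(h3, dtype=float)) / 3.0
```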
- 3. The method of claim 1, wherein activating the N normalized frame vectors according to a preset activation function to obtain the N activation values comprises: activating the N normalized frame vectors respectively according to a softmax activation function to obtain the N activation values.
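The softmax activation named in claim 3 can be sketched as below. The claim does not say how a whole vector reduces to a single activation value; taking the peak softmax probability per frame is an assumption made here for illustration.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax: subtracting the max avoids overflow."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def activation_values(normalized_frames):
    """Apply softmax to each of the N normalized frame vectors and keep
    the peak probability as that frame's activation value (an assumed
    reduction; the patent does not specify one)."""
    return [float(softmax(np.asarray(v, dtype=float)).max())
            for v in normalized_frames]
```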
- 4. The method for analyzing and playing the musical rhythm of audio data according to claim 1, wherein dividing and marking the phoneme sequence according to the preset rhythm data to obtain the rhythm sequence comprises: reading a rhythm sequence from the preset rhythm data; judging whether the rhythm sequence has a matching sequence in the phoneme sequence; if so, marking the matching sequence in the phoneme sequence; and if not, reading another rhythm sequence from the preset rhythm data.
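The divide-and-mark loop of claim 4 can be sketched as a subsequence match: each preset rhythm pattern is tried against the phoneme sequence, and matched spans are replaced by the pattern's rhythm name. The dictionary representation of the rhythm data and the greedy left-to-right scan are illustrative assumptions.

```python
def mark_rhythms(phonemes, rhythm_patterns):
    """Divide and mark a phoneme sequence against preset rhythm patterns.

    phonemes        -- list of phoneme symbols
    rhythm_patterns -- dict mapping a rhythm name to its phoneme pattern
    Returns the rhythm sequence: matched spans become rhythm names,
    unmatched phonemes are kept as-is (an assumed fallback).
    """
    marked = []
    i = 0
    while i < len(phonemes):
        for name, pattern in rhythm_patterns.items():
            if phonemes[i:i + len(pattern)] == pattern:
                marked.append(name)        # mark the matching sequence
                i += len(pattern)
                break
        else:
            marked.append(phonemes[i])     # no pattern matched here
            i += 1
    return marked
```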
- 5. A device for analyzing and playing the musical rhythm of audio data, comprising a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected through a line; the at least one processor invokes the instructions in the memory to cause the device to perform the method for analyzing and playing the musical rhythm of audio data according to any one of claims 1-4.
- 6. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for analyzing and playing the musical rhythm of audio data according to any one of claims 1-4.
Description
Technical Field

The present invention relates to the field of audio analysis, and in particular to a method, a device, and a storage medium for analyzing and playing the musical rhythm of audio data.

Background

Audio analysis refers to the digital processing and analysis of an audio signal to extract useful information from it. It is applied in many fields, such as speech recognition, music information retrieval, sound event detection, and emotion recognition. In speech recognition, audio analysis converts speech signals into text so that a computer can understand human language. In music information retrieval, audio analysis extracts characteristics of songs, such as rhythm, melody, and tone, to support functions such as song search and recommendation. In sound event detection, audio analysis identifies sound events in the environment, such as car horns, dog barks, and human voices. In emotion recognition, audio analysis infers the speaker's emotional state, such as anger, sadness, or happiness, from acoustic features. Although the targets and functions of audio analysis are rich and varied, current analysis technology lacks analysis of the rhythm of audio and cannot meet people's demand for the analysis of audio music, so a new technique is needed to solve this problem.

Disclosure of Invention

The invention mainly aims to solve the technical problem that existing analysis technology lacks analysis of the rhythm of audio and cannot meet people's demand for the analysis of audio music.
The first aspect of the present invention provides a method for analyzing and playing the musical rhythm of audio data, including: receiving audio data; performing phoneme recognition processing on the audio data according to a preset phoneme recognition algorithm to obtain a phoneme sequence; dividing and marking the phoneme sequence according to preset rhythm data to obtain a rhythm sequence; performing description matching processing on the rhythm sequence by using a preset description library to obtain descriptive text corresponding to the audio data; and, when the audio data is played, displaying the text segment corresponding to the playing position in the descriptive text based on the playing position of the audio data. Optionally, in a first implementation manner of the first aspect of the present invention, performing phoneme recognition processing on the audio data according to the preset phoneme recognition algorithm to obtain the phoneme sequence includes: performing frame decomposition processing on the audio data according to a preset frame window to obtain N frames of audio, where N is a positive integer; vectorizing the N frames of audio data to obtain N frame vectors; performing convolution processing on the N frame vectors respectively according to a preset convolution matrix set to obtain N convolved frame vectors; normalizing the N convolved frame vectors to obtain N normalized frame vectors; activating the N normalized frame vectors according to a preset activation function to obtain N activation values; and generating the phoneme sequence based on the N activation values.
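The first two steps of the recognition pipeline above, frame decomposition and normalization, can be sketched as follows. Dropping the trailing samples that do not fill a whole window, and standardizing to zero mean and unit variance, are illustrative assumptions the patent leaves open.

```python
import numpy as np

def frame_decompose(audio, window):
    """Split the audio into N frames of `window` samples each (the
    'preset frame window'). Trailing samples that do not fill a whole
    window are dropped here, an assumed policy."""
    n = len(audio) // window
    return [np.asarray(audio[i * window:(i + 1) * window], dtype=float)
            for i in range(n)]

def normalize(v):
    """Normalize a convolved frame vector to zero mean and unit variance
    (the patent names the step but not the norm; standardization is an
    illustrative choice)."""
    s = v.std()
    return (v - v.mean()) / s if s else v - v.mean()
```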
Optionally, in a second implementation manner of the first aspect of the present invention, vectorizing the N frames of audio data to obtain the N frame vectors includes: performing sequential convolution processing on the frame audio data to obtain convolved frame data; and performing time frame averaging processing on the convolved frame data to obtain a frame vector. Optionally, in a third implementation manner of the first aspect of the present invention, performing the sequential convolution processing on the frame audio data to obtain the convolved frame data includes: based on a preset first convolution kernel, extracting the frame data at t-2, t-1, t, t+1 and t+2 within the range T of the frame audio data and performing convolution processing to obtain a first convolution sub-audio; based on a preset second convolution kernel, extracting the frame data at t-2, t and t+2 within the range T of the frame audio data and performing convolution processing to obtain a second convolution sub-audio; based on a preset third convolution kernel, extracting the frame data at t within the range T of the frame audio data and performing convolution processing to obtain a third convolution sub-audio; and splicing the first convolution sub-audio, the second convolution sub-audio and the third convolution sub-audio in parallel to obtain the convolved frame data. Optionally, in a fourth implementation manner of the first aspect of the present invention, performing time frame averaging processing on th