CN-116580703-B - Audio segmentation and classification method based on multi-granularity slices
Abstract
The invention discloses an audio segmentation and classification method based on multi-granularity slices. The method comprises: preprocessing audio to obtain an audio file with a uniform sampling rate; slicing the audio file at different time granularities, each according to its corresponding time granularity; performing MFCC feature extraction on each slice at each time granularity and imaging the extracted features; establishing an image classification convolutional neural network model and performing training and verification; inputting the processed audio into the model to obtain a classification result for each slice; and performing aggregation analysis on the classification results to obtain the segmentation points and segment types of the audio file. By cutting long audio at different time granularities, judging and grouping the slice types with the image classification convolutional neural network model, and finally performing aggregation analysis, the invention can rapidly and accurately find the cut points between different types of audio and determine the audio types of the segments before and after each cut point.
Inventors
- LIU QIANG
- ZHENG ZHU
Assignees
- 四川中云智网科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-06-06
Claims (8)
- 1. A multi-granularity slice-based audio segmentation and classification method is characterized by comprising the following steps: S1, preprocessing audio to obtain an audio file with a uniform sampling rate; S2, slicing the audio files with the uniform sampling rate from beginning to end at different time granularities, each according to its corresponding time granularity; S3, performing MFCC feature extraction on each slice at each time granularity, and imaging the extracted feature data to obtain an MFCC feature image; S4, establishing an image classification convolutional neural network model, constructing a sample set of audios, processing 80% of the audios in the sample set according to steps S1-S3, associating the processed audios with corresponding classification type labels, and inputting them into the image classification convolutional neural network model to complete its training, then verifying the trained model with the remaining 20% of the audios in the sample set as a test set to obtain a final image classification convolutional neural network model; S5, processing the audio to be cut according to steps S1-S3 to obtain the MFCC feature images of each slice at each time granularity, and inputting the obtained MFCC feature images into the final image classification convolutional neural network model to obtain a classification result for each slice; and S6, obtaining the segmentation points and segment types of the audio, to the accuracy of the minimum granularity, by performing aggregation analysis on the classification results of the slices at the different time granularities.
- 2. The method for audio segmentation and classification based on multi-granularity slices as set forth in claim 1, wherein the preprocessing of the audio in step S1 comprises the following steps: S11, selecting a corresponding decoder according to the format of the audio file and decoding to obtain audio sample data; S12, storing the audio sample data as an audio file in the uncompressed WAV format; and S13, analyzing the sampling rate of the audio file to ensure a uniform sampling rate across audio files, and resampling any audio that does not satisfy it.
- 3. The method of claim 1 or 2, wherein the time granularity in step S2 is determined by defining the audio frame length and frame number of the audio file.
- 4. The method of audio segmentation and classification according to claim 3, wherein the audio frame lengths defined for the different time granularities in step S2 are identical, the minimum time granularity is determined by the number of frames, and the remaining time granularities are multiples of the minimum time granularity.
- 5. The method of claim 1 or 4, wherein extracting the MFCC features in step S3 comprises pre-emphasis of the audio data, windowing of the audio signal, discrete Fourier transform of the audio signal, mel filtering, computing Fbank features, inverse discrete cosine transform computation, computing differential and energy features, and composing the MFCC features.
- 6. The method of claim 1, wherein the classification result aggregate analysis in step S6 comprises parallel aggregate analysis and serial aggregate analysis.
- 7. The method for audio segmentation and classification based on multi-granularity slices of claim 6, wherein the parallel aggregation analysis comprises performing classification aggregation analysis on the classified slices at the different time granularities simultaneously.
- 8. The method for audio segmentation and classification based on multi-granularity slices as set forth in claim 6, wherein the serial aggregation analysis comprises performing the classification aggregation analysis on the classified slices at the different time granularities sequentially, from the maximum time granularity to the minimum time granularity.
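Claims 3 and 4 pin each time granularity to a fixed audio frame length and a per-granularity frame count, with every coarser granularity an integer multiple of the minimum. The slicing they describe can be sketched as follows; the function name, the 25 ms frame length, and the 40-frames-per-slice figure are illustrative assumptions, not values taken from the patent:

```python
def slice_indices(num_samples, sample_rate, frame_len_s, frames_per_slice, multiple=1):
    """Return (start, end) sample indices of slices at one time granularity.

    The minimum granularity spans `frames_per_slice` frames of
    `frame_len_s` seconds each; coarser granularities are integer
    multiples of that minimum (claim 4).
    """
    slice_samples = int(sample_rate * frame_len_s * frames_per_slice) * multiple
    slices = []
    start = 0
    while start < num_samples:
        end = min(start + slice_samples, num_samples)  # last slice may be short
        slices.append((start, end))
        start = end
    return slices

# Slice 10 s of 16 kHz audio: 25 ms frames, 40 frames -> 1 s minimum granularity.
sr = 16000
fine = slice_indices(10 * sr, sr, 0.025, 40, multiple=1)    # ten 1 s slices
coarse = slice_indices(10 * sr, sr, 0.025, 40, multiple=2)  # five 2 s slices
```

Because every coarser granularity is a multiple of the minimum, each coarse slice boundary coincides with a fine slice boundary, which is what later makes aggregation across granularities well defined.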
Description
Audio segmentation and classification method based on multi-granularity slices

Technical Field

The invention belongs to the technical field of computer hearing, and particularly relates to an audio segmentation and classification method based on multi-granularity slices.

Background

Audio segmentation and classification technology refers to segmenting the continuous content of long audio into segments using signal processing and pattern recognition methods, and recognizing the type of each segment's content. Current common audio segmentation methods search for mutation points in the audio data or audio features and segment there, or segment the audio signal according to the results of a local similarity analysis; the common classification approach is mainly to use a neural network classification model to classify and manage the segmented audio. These segmentation and classification methods cannot achieve automatic segmentation for audio of different lengths and with different precision requirements.

Disclosure of the Invention

Aiming at the defects of the prior art, the invention provides an audio segmentation and classification method based on multi-granularity slicing, which slices the audio at different time granularities, classifies the slices and aggregates the types, thereby obtaining the audio cut points and the types of the cut segments, and is suitable for automatic segmentation of audio of different lengths and with different precision requirements.
The technical scheme adopted by the invention is an audio segmentation and classification method based on multi-granularity slices, comprising the following steps: S1, preprocessing audio to obtain an audio file with a uniform sampling rate; S2, slicing the audio files with the uniform sampling rate from beginning to end at different time granularities, each according to its corresponding time granularity; S3, performing MFCC feature extraction on each slice at each time granularity, and imaging the extracted feature data to obtain an MFCC feature image; S4, establishing an image classification convolutional neural network model, constructing a sample set of audios, processing 80% of the audios in the sample set according to steps S1-S3, associating the processed audios with corresponding classification type labels, and inputting them into the image classification convolutional neural network model to complete its training, then verifying the trained model with the remaining 20% of the audios in the sample set as a test set to obtain a final image classification convolutional neural network model; S5, processing the audio to be cut according to steps S1-S3 to obtain the MFCC feature images of each slice at each time granularity, and inputting the obtained MFCC feature images into the final image classification convolutional neural network model to obtain a classification result for each slice; and S6, obtaining the segmentation points and segment types of the audio, to the accuracy of the minimum granularity, by performing aggregation analysis on the classification results of the slices at the different time granularities.
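The S3 feature chain (pre-emphasis, windowing, discrete Fourier transform, mel filtering, Fbank features, cepstral transform, differential and energy features) can be sketched end to end with NumPy. This is a generic MFCC pipeline under common defaults (0.97 pre-emphasis, Hamming window, 512-point FFT, 26 mel filters, 13 cepstral coefficients); the patent does not specify these parameter values, so treat them all as assumptions:

```python
import numpy as np

def mfcc_image(signal, sr, n_mels=26, n_mfcc=13, frame_len=400, hop=160):
    """Sketch of the S3 chain; returns a (n_frames, 2*n_mfcc + 1) matrix."""
    # 1. Pre-emphasis to boost high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing plus Hamming window.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 3. Power spectrum via a 512-point DFT.
    power = np.abs(np.fft.rfft(frames, n=512)) ** 2 / 512
    # 4. Triangular mel filterbank between 0 Hz and sr/2.
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((512 + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, 512 // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fbank[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    # 5. Log-Fbank features.
    feats = np.log(power @ fbank.T + 1e-10)
    # 6. DCT-II to decorrelate, keeping n_mfcc cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    mfcc = feats @ dct.T
    # 7. First-order differences and per-frame log energy.
    delta = np.vstack([mfcc[1:] - mfcc[:-1], np.zeros((1, n_mfcc))])
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]
    return np.hstack([mfcc, delta, energy])

sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone
img = mfcc_image(tone, sr)  # shape (98, 27): 13 MFCC + 13 deltas + energy
```

The returned matrix is the feature data that the patent's imaging step would render as the MFCC feature image fed to the convolutional network.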
Preferably, the preprocessing of the audio in step S1 includes the following steps: S11, selecting a corresponding decoder according to the format of the audio file and decoding to obtain audio sample data; S12, storing the audio sample data as an audio file in the uncompressed WAV format; and S13, analyzing the sampling rate of the audio file to ensure a uniform sampling rate across audio files, and resampling any audio that does not satisfy it.

Preferably, the time granularity in step S2 is determined by defining the audio frame length and the number of frames of the audio file.

Preferably, the audio frame lengths defined for the different time granularities in step S2 are identical, the minimum time granularity is determined by the number of frames, and the remaining time granularities are multiples of the minimum time granularity.

Preferably, the MFCC feature extraction in step S3 includes sequentially performing pre-emphasis of the audio data, windowing of the audio signal, discrete Fourier transform of the audio signal, mel filtering, computing Fbank features, inverse discrete cosine transform computation, computing differential and energy features, and composing the MFCC features.

Preferably, the classification result aggregation analysis in step S6 includes parallel aggregation analysis and serial aggregation analysis.

Preferably, the parallel aggregation analysis comprises performing classification aggregation analysis on the classified slices at the different time granularities simultaneously.

Preferably, the serial aggregation analysis comprises performing the classification aggregation analysis on the classified slices at the different time granularities sequentially, from the maximum time granularity to the minimum time granularity.
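One plausible reading of the serial aggregation in step S6 is a coarse-to-fine pass: the coarse slices give provisional type labels cheaply, and only the coarse slices where the label changes are re-examined at the minimum granularity, so the cut point is located with minimum-granularity accuracy. A minimal sketch under that reading; the function name, the two-level setup, and the music/speech example are illustrative assumptions, not details from the patent:

```python
def serial_aggregate(coarse, fine, factor):
    """Locate cut points with minimum-granularity accuracy.

    coarse : per-slice class labels at a coarse granularity
    fine   : per-slice class labels at the minimum granularity
    factor : number of minimum-granularity slices per coarse slice
    Returns a list of (start_fine_index, label) segments.
    """
    segments = [(0, fine[0])]
    for i in range(1, len(coarse)):
        if coarse[i] == coarse[i - 1]:
            continue  # no type change in the coarse pass; skip refinement
        # Refine: scan the fine slices under the two coarse slices at the boundary.
        for j in range((i - 1) * factor + 1, min((i + 1) * factor, len(fine))):
            if fine[j] != fine[j - 1]:
                segments.append((j, fine[j]))
                break
    return segments

# Ten 1 s fine slices, 2 s coarse slices (factor 2); music changes to speech at t = 5 s.
fine = ["music"] * 5 + ["speech"] * 5
coarse = ["music", "music", "speech", "speech", "speech"]
cuts = serial_aggregate(coarse, fine, 2)  # [(0, 'music'), (5, 'speech')]
```

The parallel variant of claim 7 would instead compare the labels at all granularities at once for each minimum-granularity position, for example by majority vote, rather than refining only at coarse boundaries.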