CN-121983091-A - Snore detection method and device and related equipment

CN121983091A

Abstract

The application provides a snore detection method, device, and related equipment in the technical field of audio detection. The method comprises: converting audio data acquired in a preset time period to obtain a spectrogram, wherein the audio data comprises at least one of snore audio and noise audio, and the spectrogram represents the time-frequency characteristics of the audio data; and inputting the spectrogram into a pre-trained convolutional neural network model to obtain a detection result for the audio data, wherein the model performs feature extraction on the spectrogram to obtain a first feature and a second feature and determines the detection result based on both, the first feature representing local features of the audio data and the second feature representing global features of the audio data. Snoring can thus be identified effectively on Internet of Things devices with limited computing power, which addresses the poor detection accuracy of existing snore detection methods on such devices.

Inventors

  • CHEN XIAOLIANG
  • YU XIN
  • CHANG LE
  • HUANG BINHE
  • JING TENG

Assignees

  • 北京中科声智科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-03-10

Claims (10)

  1. A method for detecting snoring, applied to a terminal device, the method comprising: converting audio data acquired in a preset time period to obtain a spectrogram, wherein the audio data comprises at least one of snore audio and noise audio, and the spectrogram represents time-frequency characteristics of the audio data; and inputting the spectrogram into a pre-trained convolutional neural network model to obtain a detection result for the audio data, wherein the convolutional neural network model performs feature extraction on the spectrogram to obtain a first feature and a second feature and determines the detection result based on the first feature and the second feature, the first feature representing local features of the audio data and the second feature representing global features of the audio data.
  2. The method of claim 1, wherein converting the audio data acquired in the preset time period to obtain the spectrogram comprises: preprocessing the audio data acquired in the preset time period to obtain preprocessed audio data, wherein the preprocessing comprises removing a direct-current component from the audio data and normalizing the audio data; performing a short-time Fourier transform (STFT) on the preprocessed audio data to convert the time-domain signal into a frequency-domain signal, obtaining an original spectrum of the preprocessed audio data; filtering the original spectrum through a Mel filter bank, mapping the frequency-domain signal onto the Mel scale to obtain a Mel spectrum; and applying a logarithmic transformation to the Mel spectrum to obtain the spectrogram, wherein the spectrogram is a Log-Mel spectrogram of size 64×N, N being the number of frames corresponding to the audio data in the preset time period.
  3. The method of claim 1, wherein inputting the spectrogram into the pre-trained convolutional neural network model to obtain the detection result for the audio data comprises: extracting local features of the spectrogram through a first convolution block of the convolutional neural network model to obtain the first feature, wherein the first feature represents spectral details and local voiceprint variations of the audio data; performing global feature extraction on the first feature through a second convolution block of the convolutional neural network model to obtain the second feature, wherein the second feature represents the overall spectral distribution and the inherent snore pattern of the audio data, and the second convolution block follows the first convolution block; and determining the detection result for the audio data according to the first feature and the second feature.
  4. The method of claim 3, wherein determining the detection result for the audio data from the first feature and the second feature comprises: performing fusion and global average pooling on the first feature and the second feature to obtain a fused feature; inputting the fused feature into a classification layer of the convolutional neural network model, and computing, through the fully-connected mapping and sigmoid activation function of the classification layer, a probability value that snoring is present in the preset time period, the probability value being used to determine whether snoring is present in the preset time period; and, when snoring is present in the preset time period, determining start and end timestamps of the snoring according to the temporal distribution of the fused feature and the time positions of the corresponding frames of the audio data.
  5. The method of any one of claims 1 to 4, wherein the convolutional neural network model comprises an input layer, a backbone network, a downsampling module, and a classification layer; the input layer is configured to receive data; the backbone network comprises the first convolution block, configured for local feature extraction, and the second convolution block, configured for global feature extraction, wherein the first convolution block comprises a first sub-convolution block and a second sub-convolution block, the second convolution block comprises a third sub-convolution block and a fourth sub-convolution block, each sub-convolution block comprises a Conv3×3 convolutional layer, a BatchNorm batch normalization layer, and a ReLU activation layer, and the channel numbers of the first through fourth sub-convolution blocks are [32, 32, 64, 128] in sequence; the downsampling module is configured to perform a 2× max-pooling (MaxPooling) operation in each round; and the classification layer is configured to output the probability value that snoring is present in the preset time period.
  6. A snore detection device, the device comprising: a processing module configured to convert audio data acquired in a preset time period to obtain a spectrogram, wherein the audio data comprises at least one of snore audio and noise audio, and the spectrogram represents time-frequency characteristics of the audio data; and an input module configured to input the spectrogram into a pre-trained convolutional neural network model to obtain a detection result for the audio data, wherein the convolutional neural network model performs feature extraction on the spectrogram to obtain a first feature and a second feature, the detection result is determined based on the first feature and the second feature, the first feature represents local features of the audio data, and the second feature represents global features of the audio data.
  7. The device of claim 6, wherein the processing module is specifically configured to: preprocess the audio data acquired in the preset time period to obtain preprocessed audio data, wherein the preprocessing comprises removing a direct-current component from the audio data and normalizing the audio data; perform a short-time Fourier transform (STFT) on the preprocessed audio data to convert the time-domain signal into a frequency-domain signal, obtaining an original spectrum of the preprocessed audio data; filter the original spectrum through a Mel filter bank, mapping the frequency-domain signal onto the Mel scale to obtain a Mel spectrum; and apply a logarithmic transformation to the Mel spectrum to obtain the spectrogram, wherein the spectrogram is a Log-Mel spectrogram of size 64×N, N being the number of frames corresponding to the audio data in the preset time period.
  8. An electronic device comprising a processor, a memory, and a program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps of the method according to any one of claims 1 to 6.
  9. A computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
  10. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 6.
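Claim 5 fixes the backbone topology but leaves some details open (pooling placement, input layout). As a minimal sketch, assuming a 64×N single-channel Log-Mel input, 'same'-padded Conv3×3 layers, and one 2×2 max-pool after each of the four sub-convolution blocks, the feature-map shapes would evolve as follows; the function name and these placement assumptions are illustrative, not taken from the patent:

```python
def trace_backbone_shapes(n_frames):
    """Trace (channels, height, width) after each sub-convolution block."""
    h, w = 64, n_frames               # Log-Mel input: 64 Mel bins x N frames
    shapes = []
    for out_ch in [32, 32, 64, 128]:  # channel numbers from claim 5
        # Conv3x3 ('same' padding) + BatchNorm + ReLU preserve H and W;
        # the 2x MaxPooling downsampling halves both spatial dimensions.
        h, w = h // 2, w // 2
        shapes.append((out_ch, h, w))
    return shapes
```

Under these assumptions, a clip yielding N = 64 frames gives (32, 32, 32) → (32, 16, 16) → (64, 8, 8) → (128, 4, 4), after which global average pooling would reduce the final map to a 128-dimensional vector for the classification layer.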

Description

Snore detection method and device and related equipment

Technical Field

The present application relates to the field of audio detection technologies, and in particular to a snore detection method, device, and related equipment.

Background

More and more users suffer from snoring, and severe snorers are prone to sleep apnea syndrome, which reduces the oxygen content of the blood and, over a long period, tends to induce chronic diseases. Detecting snoring therefore helps with the user's health management. At present, snore detection is usually performed on high-power-consumption equipment, such as health monitors, using a large model such as ResNet or a Transformer. Such large models have heavy computational requirements and are difficult to run in real time on Internet of Things (IoT) devices. Snore detection can be performed to some extent by equipping an IoT device with a conventional Digital Signal Processor (DSP), but the accuracy is poor and environmental noise is difficult to identify effectively during detection. Existing snore detection methods therefore suffer from poor detection accuracy on IoT devices.

Disclosure of Invention

The embodiments of the present application provide a snore detection method, device, and related equipment, which address the poor detection accuracy of existing snore detection methods on IoT devices.
In a first aspect, an embodiment of the present application provides a snore detection method, applied to a terminal device, the method comprising: converting audio data acquired in a preset time period to obtain a spectrogram, wherein the audio data comprises at least one of snore audio and noise audio, and the spectrogram represents time-frequency characteristics of the audio data; and inputting the spectrogram into a pre-trained convolutional neural network model to obtain a detection result for the audio data, wherein the convolutional neural network model performs feature extraction on the spectrogram to obtain a first feature and a second feature and determines the detection result based on the first feature and the second feature, the first feature representing local features of the audio data and the second feature representing global features of the audio data. Optionally, converting the audio data acquired in the preset time period to obtain the spectrogram includes: preprocessing the audio data acquired in the preset time period to obtain preprocessed audio data, wherein the preprocessing comprises removing a direct-current component from the audio data and normalizing the audio data; performing a short-time Fourier transform (STFT) on the preprocessed audio data to convert the time-domain signal into a frequency-domain signal, obtaining an original spectrum of the preprocessed audio data; filtering the original spectrum through a Mel filter bank, mapping the frequency-domain signal onto the Mel scale to obtain a Mel spectrum; and applying a logarithmic transformation to the Mel spectrum to obtain the spectrogram, wherein the spectrogram is a Log-Mel spectrogram of size 64×N, N being the number of frames corresponding to the audio data in the preset time period.
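The preprocessing and Log-Mel conversion steps described above can be sketched in plain NumPy. The sample rate, FFT size, hop length, and window choice below are assumptions made for illustration, as the patent does not specify them; only the 64 Mel bins and the log compression come from the text:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=64):
    """Triangular Mel filters mapping an rFFT spectrum onto the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):               # rising slope
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):               # falling slope
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Preprocessing: remove the DC component and normalize the amplitude
    x = audio - np.mean(audio)
    x = x / (np.max(np.abs(x)) + 1e-8)
    # STFT: frame the signal, apply a window, take the rFFT of each frame
    n_frames = 1 + (len(x) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (N, n_fft//2 + 1)
    # Mel filtering and log compression -> 64 x N Log-Mel spectrogram
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T  # (64, N)
    return np.log(mel + 1e-6)
```

With the assumed 16 kHz sample rate, a one-second clip produces N = 1 + (16000 − 512) // 256 = 61 frames, i.e. a 64×61 Log-Mel spectrogram.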
Optionally, inputting the spectrogram into the pre-trained convolutional neural network model to obtain the detection result for the audio data includes: extracting local features of the spectrogram through a first convolution block of the convolutional neural network model to obtain the first feature, which represents spectral details and local voiceprint variations of the audio data; performing global feature extraction on the first feature through a second convolution block of the convolutional neural network model to obtain the second feature, which represents the overall spectral distribution and the inherent snore pattern of the audio data, the second convolution block following the first convolution block; and determining the detection result for the audio data according to the first feature and the second feature. Optionally, determining the detection result for the audio data according to the first feature and the second feature includes: performing fusion and global average pooling on the first feature and the second feature to obtain a fused feature; inputting the fused feature into a classification layer of the convolutional neural network model, and computing, through the fully-connected mapping and sigmoid activation function of the classification layer, a probability value that snoring is present in the preset time period, wherein the probability value is used to determine whether snoring is present in the preset time period.
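The fusion-and-classification step just described can be sketched as follows. The patent does not specify the fusion operator, so this sketch assumes each feature map is globally average-pooled and the pooled vectors are concatenated before the fully-connected mapping; the function names and the weights `w` and `b` are hypothetical placeholders for trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def snore_probability(first_feat, second_feat, w, b):
    """first_feat / second_feat: (C, H, W) local / global feature maps."""
    p1 = first_feat.mean(axis=(1, 2))     # global average pooling, local branch
    p2 = second_feat.mean(axis=(1, 2))    # global average pooling, global branch
    fused = np.concatenate([p1, p2])      # fusion by concatenation (assumed)
    return float(sigmoid(w @ fused + b))  # fully-connected mapping + sigmoid
```

With all-zero features and b = 0 the output is 0.5, i.e. maximal uncertainty; in practice a threshold on this probability would decide whether snoring is present in the preset time period.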