CN-121983060-A - Audio content identification method, device, electronic equipment, storage medium and program product

CN121983060A

Abstract

Embodiments of the present application provide an audio content identification method and apparatus, an electronic device, a storage medium, and a program product. The method comprises: acquiring audio data to be processed, and performing feature processing on the audio data to be processed to obtain acoustic features of the audio; coupling the acoustic features with text features to obtain coupling features; performing order judgment on the coupling features to obtain a matching probability between the text and the acoustic features; and outputting an audio content recognition result according to that matching probability. By extracting the acoustic features of the audio, coupling them with the text features, judging the matching probability of the two, and outputting the recognition result according to that probability, the method avoids the recognition ambiguity caused by directly outputting the text to be recognized, reduces the consumption of computing resources and memory compared with the general speech recognition models of the prior art, and improves the universality of speech recognition.

Inventors

  • Yao Rentian

Assignees

  • UNISOC (Chongqing) Technology Co., Ltd. (紫光展锐(重庆)科技有限公司)

Dates

Publication Date
2026-05-05
Application Date
2026-03-16

Claims (10)

  1. A method of audio content identification, comprising: acquiring audio data to be processed, and performing feature processing on the audio data to be processed to obtain acoustic features of the audio; coupling the acoustic features of the audio with text features to obtain coupling features; performing order judgment on the coupling features to obtain a matching probability between the text and the acoustic features; and outputting an audio content recognition result according to the matching probability between the text and the acoustic features.
  2. The method of claim 1, wherein coupling the acoustic features of the audio with the text features to obtain the coupling features comprises: calculating attention weights between the acoustic features and the text features of the audio through an attention mechanism; performing weighted fusion on the text features according to the attention weights to obtain alignment features; and performing a matrix multiplication operation on the alignment features and the acoustic features to obtain the coupling features.
  3. The method of claim 1, wherein performing feature processing on the audio data to be processed to obtain the acoustic features of the audio comprises: framing and windowing the audio data to be processed to generate an audio frame sequence; performing a Fourier transform on the audio frame sequence to obtain spectral features; performing feature conversion on the spectral features through a filter bank to obtain converted spectral features; and inputting the converted spectral features into a neural network for dimension-reduction processing to obtain the acoustic features of the audio.
  4. The method of claim 1, wherein after performing order judgment on the coupling features to obtain the matching probability between the text and the acoustic features, the method further comprises: performing environment detection on the audio data through an environment sensing module to obtain environment features; adjusting a preset matching threshold according to the environment features to obtain an adjusted threshold; and comparing the matching probability between the text and the acoustic features with the adjusted threshold to generate a comparison result, wherein the comparison result is used for outputting the audio content recognition result.
  5. The method of claim 2, wherein performing weighted fusion on the text features according to the attention weights to obtain the alignment features comprises: splicing the acoustic features and the text features to obtain spliced features; inputting the spliced features into a gating network and generating gating coefficients through an activation function; performing element-wise multiplication on the gating coefficients and the attention weights to obtain adjusted attention weights; and performing weighted fusion on the text features according to the adjusted attention weights to obtain the alignment features.
  6. The method of any one of claims 1 to 5, wherein before coupling the acoustic features of the audio with the text features to obtain the coupling features, the method further comprises: acquiring a text to be recognized corresponding to the audio data; performing word segmentation on the text to be recognized to obtain a text sequence, wherein the text sequence comprises at least one text modeling unit; mapping the text modeling units in the text sequence into corresponding embedding vectors to obtain initial text features; and inputting the initial text features into a text feature extraction network for context modeling to generate the text features.
  7. An audio content recognition apparatus, comprising: a first acquisition module, configured to acquire audio data to be processed and perform feature processing on the audio data to be processed to obtain acoustic features of the audio; a coupling module, configured to couple the acoustic features of the audio with text features to obtain coupling features; a judgment module, configured to perform order judgment on the coupling features to obtain a matching probability between the text and the acoustic features; and an output module, configured to output an audio content recognition result according to the matching probability between the text and the acoustic features.
  8. An electronic device, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the audio content identification method of any one of claims 1 to 6.
  9. A computer-readable storage medium having computer-executable instructions stored therein, wherein the computer-executable instructions, when executed by a processor, implement the audio content identification method of any one of claims 1 to 6.
  10. A computer program product, comprising a computer program which, when executed by a processor, implements the audio content identification method of any one of claims 1 to 6.
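The pipeline of claim 1 can be sketched end to end as follows. This is an illustrative toy, not the patented implementation: the random projection standing in for feature processing, the dot-product attention, the sigmoid used for the "order judgment", and the 0.5 threshold are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def acoustic_features(audio, frame_len=160, dim=8):
    """Toy stand-in for the feature-processing front end: frame the
    signal and project each frame to a low-dimensional vector."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    proj = rng.standard_normal((frame_len, dim)) / np.sqrt(frame_len)
    return frames @ proj                         # (n_frames, dim)

def couple(acoustic, text):
    """Couple acoustic and text features: attention weights, weighted
    fusion of text features, then a matrix multiplication."""
    scores = acoustic @ text.T                   # (n_frames, n_tokens)
    scores -= scores.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    aligned = weights @ text                     # alignment features
    return aligned.T @ acoustic                  # coupling feature (dim, dim)

def matching_probability(coupled):
    """'Order judgment' reduced to a scalar score squashed into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-coupled.mean()))

audio = rng.standard_normal(1600)                # audio data to be processed
text_feats = rng.standard_normal((5, 8))         # pretend text features
p = matching_probability(couple(acoustic_features(audio), text_feats))
result = "match" if p > 0.5 else "no match"      # preset matching threshold
```

With trained networks in place of the random weights, `p` would be compared against the (possibly environment-adjusted) threshold to produce the recognition result.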

Description

Audio content identification method, device, electronic equipment, storage medium and program product

Technical Field

The present application relates to the field of audio identification technologies, and in particular to an audio content identification method and apparatus, an electronic device, a storage medium, and a program product.

Background

Voice wake-up is widely applied in fields such as mobile phones, televisions, and automobiles, allowing a user to wake up a device through a wake-up word and execute voice operation instructions. In the prior art, audio content is translated into text by a general speech recognition model, and whether target content is contained is judged through text matching. However, such a general speech recognition model is highly complex and requires a large amount of computing resources and memory, so it cannot be applied to low-power devices or scenarios with high real-time requirements, which reduces the universality of speech recognition.

Disclosure of Invention

Embodiments of the present application provide an audio content recognition method and apparatus, an electronic device, a storage medium, and a program product, to solve the technical problem of low universality of speech recognition in the prior art.
In a first aspect, an embodiment of the present application provides an audio content identification method, including: acquiring audio data to be processed, and performing feature processing on the audio data to be processed to obtain acoustic features of the audio; coupling the acoustic features of the audio with text features to obtain coupling features; performing order judgment on the coupling features to obtain a matching probability between the text and the acoustic features; and outputting an audio content recognition result according to the matching probability between the text and the acoustic features.

In one possible implementation, coupling the acoustic features of the audio with the text features to obtain the coupling features includes: calculating attention weights between the acoustic features and the text features through an attention mechanism; performing weighted fusion on the text features according to the attention weights to obtain alignment features; and performing a matrix multiplication operation on the alignment features and the acoustic features to obtain the coupling features.

In one possible implementation, performing feature processing on the audio data to be processed to obtain the acoustic features of the audio includes: framing and windowing the audio data to be processed to generate an audio frame sequence; performing a Fourier transform on the audio frame sequence to obtain spectral features; performing feature conversion on the spectral features through a filter bank to obtain converted spectral features; and inputting the converted spectral features into a neural network for dimension-reduction processing to obtain the acoustic features of the audio.
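The feature-processing front end described above (framing and windowing, Fourier transform, filter-bank conversion, neural dimension reduction) can be sketched roughly as follows. All parameters (frame length, hop, filter count, output dimension) are hypothetical, the triangular filters are linearly spaced for brevity where a real system would likely use mel spacing, and a single random linear layer stands in for the trained dimension-reduction network.

```python
import numpy as np

def extract_acoustic_features(audio, frame_len=400, hop=160,
                              n_filters=26, out_dim=13):
    """Sketch of the claimed front end; parameters are illustrative."""
    # 1. Framing and windowing -> audio frame sequence
    n_frames = 1 + (len(audio) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(frame_len)
    # 2. Fourier transform -> spectral features (magnitude spectrum)
    spec = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, bins)
    # 3. Filter-bank conversion -> converted spectral features
    n_bins = spec.shape[1]
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    fbank = np.zeros((n_filters, n_bins))
    bins = np.arange(n_bins)
    for m in range(1, n_filters + 1):
        lo, c, hi = centers[m - 1], centers[m], centers[m + 1]
        fbank[m - 1] = np.clip(np.minimum((bins - lo) / (c - lo),
                                          (hi - bins) / (hi - c)), 0, None)
    fb_feats = np.log(spec @ fbank.T + 1e-8)
    # 4. 'Neural network' dimension reduction: one random linear layer + tanh
    rng = np.random.default_rng(0)
    w = rng.standard_normal((n_filters, out_dim)) / np.sqrt(n_filters)
    return np.tanh(fb_feats @ w)                      # (n_frames, out_dim)

# One second of a 440 Hz tone at a 16 kHz sampling rate
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = extract_acoustic_features(tone)
```

Here `feats` has shape `(98, 13)`: 98 overlapping frames, each reduced to a 13-dimensional acoustic feature vector.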
In one possible implementation, after performing order judgment on the coupling features to obtain the matching probability between the text and the acoustic features, the method further includes: performing environment detection on the audio data through an environment sensing module to obtain environment features; adjusting a preset matching threshold according to the environment features to obtain an adjusted threshold; and comparing the matching probability between the text and the acoustic features with the adjusted threshold to generate a comparison result, wherein the comparison result is used for outputting the audio content recognition result.

In one possible implementation, performing weighted fusion on the text features according to the attention weights to obtain the alignment features includes: splicing the acoustic features and the text features to obtain spliced features; inputting the spliced features into a gating network and generating gating coefficients through an activation function; performing element-wise multiplication on the gating coefficients and the attention weights to obtain adjusted attention weights; and performing weighted fusion on the text features according to the adjusted attention weights to obtain the alignment features.

In one possible implementation, before coupling the acoustic features of the audio with the text features to obtain the coupling features, the method further includes: acquiring a text to be recognized corresponding to the audio data; performing word segmentation on the text to be recognized to obtain a text sequence, wherein the text sequence comprises at least one text modeling unit; mapping the text modeling units in the text sequence into corresponding embedding vectors to obtain initial text features; and inputting the initial text features into a text feature extraction network for context modeling to generate the text features.
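The gated attention coupling and the environment-adjusted threshold can be sketched as follows. The gating-network weights are random placeholders (they would be learned in practice), the gate is simplified to a scalar over pooled features, and the noise-sensitivity constant `k` is a hypothetical parameter, not one named in the application.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_coupling(acoustic, text, rng):
    """Attention weights, gated adjustment, weighted fusion, then a
    matrix multiplication to produce the coupling feature."""
    attn = softmax(acoustic @ text.T, axis=1)          # (T_a, T_t)
    # Splice (concatenate) pooled acoustic and text features, then a
    # linear layer + sigmoid activation produces the gating coefficient
    spliced = np.concatenate([acoustic.mean(0), text.mean(0)])
    w = rng.standard_normal(spliced.shape[0]) / np.sqrt(spliced.shape[0])
    gate = 1.0 / (1.0 + np.exp(-(spliced @ w)))
    adj_attn = gate * attn                             # element-wise product
    aligned = adj_attn @ text                          # alignment features
    return aligned.T @ acoustic                        # coupling feature

def adjusted_threshold(base, noise_level, k=0.1):
    """Environment-aware threshold: raise the bar in noisier scenes."""
    return min(base + k * noise_level, 0.99)

rng = np.random.default_rng(1)
ac = rng.standard_normal((20, 8))                      # 20 acoustic frames
tx = rng.standard_normal((5, 8))                       # 5 text tokens
coupled = gated_coupling(ac, tx, rng)
prob = 1.0 / (1.0 + np.exp(-coupled.mean()))           # matching probability
thr = adjusted_threshold(0.5, noise_level=0.8)         # noisy environment
result = "match" if prob > thr else "no match"
```

Raising the threshold with the detected noise level makes the comparison stricter in adverse environments, which is one plausible reading of the environment-sensing step.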