CN-115910062-B - Audio identification method, device, equipment and storage medium

Abstract

The disclosure relates to an audio recognition method, device, equipment and storage medium in the technical field of computers, and addresses the low processing efficiency of audio recognition models in conventional techniques. The audio recognition method comprises: obtaining audio data to be recognized; and inputting the audio data to be recognized into a pre-trained target recognition model to obtain a recognition result. The target recognition model comprises a plurality of target audio recognition modules. Each target audio recognition module verifies its input audio data using its corresponding target verification unit to obtain a target verification result, which indicates whether the module's recognition processing of the input audio data is skipped. The input audio data is audio feature data based on the audio data to be recognized, and the target verification unit verifies whether the audio signal features of the input audio data fall within a target feature range.

Inventors

YAO PENG
HUANG JINWEN
TAN JIANCHAO
DENG FENG
WANG XIAORUI
SONG CHENGRU

Assignees

Beijing Dajia Internet Information Technology Co., Ltd. (北京达佳互联信息技术有限公司)

Dates

Publication Date: 20260512
Application Date: 20221125

Claims (11)

1. An audio recognition method, comprising: acquiring audio data to be recognized; and inputting the audio data to be recognized into a pre-trained target recognition model to obtain a recognition result, wherein the target recognition model comprises a plurality of target audio recognition modules; the target audio recognition modules are modules that are repeatedly stacked in the intermediate layers of the target recognition model and do not change the size or dimensions of the feature map; the types of the target audio recognition modules are a convolution type, an activation type and a residual type; each target audio recognition module is configured to verify input audio data based on the target verification unit corresponding to that module to obtain a target verification result; the target verification result indicates whether recognition processing of the input audio data by the target audio recognition module is skipped; the input audio data is audio feature data based on the audio data to be recognized; and the target verification unit is configured to verify whether the audio signal features of the input audio data fall within a target feature range.
2. The audio recognition method according to claim 1, wherein the target audio recognition module is further configured to: when the target verification result indicates that recognition processing of the input audio data by the target audio recognition module is skipped, input the input audio data to the next audio recognition module that is adjacent to and follows the target audio recognition module; or, when the target verification result indicates that recognition processing of the input audio data by the target audio recognition module is not skipped, perform recognition processing on the input audio data to obtain output audio data, and input the output audio data to the next audio recognition module that is adjacent to and follows the target audio recognition module.
3. The audio recognition method of claim 1, further comprising: obtaining a plurality of sample audio data and an initial recognition model, wherein the initial recognition model comprises a plurality of initial audio recognition modules, and different initial audio recognition modules are used for executing different audio recognition tasks; performing an update operation on the preset-type modules among the plurality of initial audio recognition modules to obtain an updated recognition model, wherein the update operation adds an initial verification unit to each preset-type module to obtain an initial target audio recognition module; and training the updated recognition model according to a preset loss function and the plurality of sample audio data to obtain the target recognition model, wherein the preset loss function is generated based on an expected passing rate, and the expected passing rate represents the proportion of preset-type modules expected to be skipped among the plurality of preset-type modules included in the updated recognition model.
4. The method of claim 3, wherein the preset-type module comprises at least one of a convolution-type module, an activation-type module and a residual-type module, and wherein performing the update operation on the preset-type modules among the plurality of initial audio recognition modules to obtain the updated recognition model comprises: determining each preset-type module among the plurality of initial audio recognition modules as a recognition module to be updated, to obtain a plurality of recognition modules to be updated; and updating the processing logic of each recognition module to be updated so that the initial verification unit and the audio recognition task corresponding to that module are executed in parallel, to obtain the updated recognition model comprising a plurality of initial target audio recognition modules; wherein the initial target audio recognition module is configured to verify the audio data to be processed based on the initial verification unit to obtain an initial verification result, perform recognition processing on the audio data to be processed to obtain processed audio data, compute a weighted sum of the audio data to be processed and the processed audio data based on the initial verification result to obtain an output result, and input the output result to the next audio recognition module that is adjacent to and follows the initial target audio recognition module.
5. The audio recognition method of claim 4, wherein training the updated recognition model according to the preset loss function and the plurality of sample audio data to obtain the target recognition model comprises: obtaining a sample subset comprising a preset number of sample audio data, wherein the sample subset is obtained by dividing the plurality of sample audio data; inputting the sample subset into the updated recognition model for recognition processing, and determining the number of initial target audio recognition modules skipped by the sample audio data in the sample subset during the recognition processing; determining a loss value corresponding to the sample subset based on the number of initial target audio recognition modules skipped by the sample audio data in the sample subset during the recognition processing, the total number of initial target audio recognition modules, and the expected passing rate; and when the loss value is less than or equal to a preset threshold, updating the processing logic of each initial target audio recognition module so that the initial verification unit and the audio recognition task corresponding to that module are executed in series, to obtain the target recognition model comprising a plurality of target audio recognition modules.
6. The audio recognition method according to claim 1, wherein acquiring the audio data to be recognized comprises: receiving content data to be recognized sent by a terminal, wherein the content data to be recognized comprises at least one segment of voice content; and splicing the at least one segment of voice content in the content data to be recognized to obtain the audio data to be recognized.
7. The audio recognition method according to claim 6, wherein splicing the at least one segment of voice content in the content data to be recognized to obtain the audio data to be recognized comprises: determining a start time and an end time corresponding to each of the at least one segment of voice content; and splicing the segments based on the start time and the end time corresponding to each segment of voice content to obtain the audio data to be recognized.
8. The audio recognition method according to claim 1, further comprising, after inputting the audio data to be recognized into the pre-trained target recognition model to obtain the recognition result: determining content data to be recommended corresponding to the recognition result; and sending the content data to be recommended to a terminal.
9. An audio recognition device, comprising an acquisition unit and a processing unit, wherein the acquisition unit is configured to acquire audio data to be recognized; and the processing unit is configured to input the audio data to be recognized into a pre-trained target recognition model to obtain a recognition result, wherein the target recognition model comprises a plurality of target audio recognition modules; the target audio recognition modules are modules that are repeatedly stacked in the intermediate layers of the target recognition model and do not change the size or dimensions of the feature map; the types of the target audio recognition modules are a convolution type, an activation type and a residual type; each target audio recognition module is configured to verify input audio data based on the target verification unit corresponding to that module to obtain a target verification result; the target verification result indicates whether recognition processing of the input audio data by the target audio recognition module is skipped; the input audio data is audio feature data based on the audio data to be recognized; and the target verification unit is configured to verify whether the audio signal features of the input audio data fall within a target feature range.
10. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the audio recognition method of any one of claims 1-8.
11. A computer readable storage medium having instructions stored thereon, which, when executed by a processor of an electronic device, enable the electronic device to perform the audio recognition method of any one of claims 1-8.
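
The training scheme of claims 3 to 5 can be illustrated with a minimal Python sketch. The claims state only that the output is a weighted sum of the unprocessed and processed data, and that the loss depends on the number of skipped modules, the total number of modules and the expected passing rate; they do not give formulas. The sketch therefore assumes a scalar gate in [0, 1] as the verification result and a squared difference between the observed and expected skip proportions as the loss. The names `soft_gate_output`, `skip_rate_loss`, `gate` and `expected_pass_rate` are hypothetical.

```python
from typing import List, Sequence


def soft_gate_output(x: Sequence[float], processed: Sequence[float],
                     gate: float) -> List[float]:
    """Weighted sum of unprocessed and processed features (claim 4).

    Assumption: `gate` is a scalar in [0, 1] produced by the initial
    verification unit, where 1.0 means "skip entirely" (pass x through)
    and 0.0 means "use the processed output only".
    """
    if len(x) != len(processed):
        # The claimed modules preserve the feature-map shape.
        raise ValueError("module must not change the feature size")
    return [gate * a + (1.0 - gate) * b for a, b in zip(x, processed)]


def skip_rate_loss(skipped_per_sample: Sequence[int], total_modules: int,
                   expected_pass_rate: float) -> float:
    """Loss for one sample subset (claim 5).

    Assumption: the loss is the squared gap between the observed
    proportion of skipped modules and the expected passing rate.
    The patent text in this section does not specify the formula.
    """
    if not skipped_per_sample or total_modules <= 0:
        raise ValueError("need at least one sample and one module")
    possible = len(skipped_per_sample) * total_modules
    observed = sum(skipped_per_sample) / possible
    return (observed - expected_pass_rate) ** 2


# Two samples in the subset, 8 gated modules, half expected to be skipped.
loss = skip_rate_loss([2, 4], total_modules=8, expected_pass_rate=0.5)
blended = soft_gate_output([1.0, 2.0], [3.0, 4.0], gate=0.5)
```

Under these assumptions, the parallel (soft) form keeps the gate usable during training, and once the loss falls to the preset threshold the gate can be turned into a hard skip decision executed in series, as claim 5 describes.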

Description

Audio identification method, device, equipment and storage medium

Technical Field

The disclosure relates to the field of computer technology, and in particular to an audio recognition method, device, equipment and storage medium.

Background

Dialect recognition refers to identifying a specific dialect category from human speech. It is commonly used in the front end of speech processing systems such as automatic speech recognition (ASR), multilingual translation systems and biometric authentication.

Currently, mainstream dialect recognition technology trains a deep learning model in a supervised manner and then determines the language or dialect of the audio using the trained model. Such models typically process every piece of audio data through the same fixed set of processing modules.

However, the quality of audio data is uneven: some audio is clear and easy to distinguish, while other audio has heavy background noise and is difficult to distinguish. If the same processing flow is applied to all data, computing, storage and other resources are easily wasted, and efficiency is low.

Disclosure of Invention

The disclosure provides an audio recognition method, device, equipment and storage medium to address the low efficiency of recognition processing by audio recognition models in conventional techniques.
The technical solutions of the embodiments of the disclosure are as follows. According to a first aspect of the embodiments of the disclosure, an audio recognition method is provided, comprising: obtaining audio data to be recognized; and inputting the audio data to be recognized into a pre-trained target recognition model to obtain a recognition result. The target recognition model includes a plurality of target audio recognition modules. Each target audio recognition module verifies the input audio data based on the target verification unit corresponding to that module to obtain a target verification result, which indicates whether recognition processing of the input audio data by the target audio recognition module is skipped. The input audio data is audio feature data based on the audio data to be recognized, and the target verification unit verifies whether the audio signal features of the input audio data fall within a target feature range.

Optionally, the target audio recognition module is further configured to: when the target verification result indicates that recognition processing of the input audio data by the target audio recognition module is skipped, input the input audio data to the next audio recognition module that is adjacent to and follows the target audio recognition module; or, when the target verification result indicates that recognition processing is not skipped, perform recognition processing on the input audio data to obtain output audio data, and input the output audio data to the next audio recognition module that is adjacent to and follows the target audio recognition module.
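
The skip-or-process behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: the verification check is assumed to compare the mean absolute feature value against a `[low, high]` range, and a value inside the range is assumed to mean "skip". The text does not specify which signal feature is checked or which outcome triggers a skip. `GatedModule`, `run_model`, `low` and `high` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Features = List[float]


@dataclass
class GatedModule:
    """A shape-preserving module paired with its verification unit."""
    process: Callable[[Features], Features]  # hypothetical recognition step
    low: float   # assumed lower bound of the target feature range
    high: float  # assumed upper bound of the target feature range

    def should_skip(self, x: Features) -> bool:
        # Assumed signal feature: mean absolute value of the input.
        energy = sum(abs(v) for v in x) / len(x)
        return self.low <= energy <= self.high


def run_model(modules: Sequence[GatedModule],
              x: Features) -> Tuple[Features, int]:
    """Pass x through the modules, skipping those whose check says so."""
    skipped = 0
    for module in modules:
        if module.should_skip(x):
            skipped += 1       # input goes unchanged to the next module
            continue
        x = module.process(x)  # otherwise the processed output is forwarded
    return x, skipped


modules = [
    GatedModule(process=lambda x: [v * 2 for v in x], low=0.0, high=1.0),
    GatedModule(process=lambda x: [v + 1 for v in x], low=5.0, high=10.0),
]
output, skipped = run_model(modules, [0.5, -0.5])
```

Because each gated module preserves the feature-map size, its input can be handed directly to the next module when the module is skipped, which is what makes the skip safe.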
The audio recognition method further comprises: obtaining a plurality of sample audio data and an initial recognition model, wherein the initial recognition model comprises a plurality of initial audio recognition modules, and different initial audio recognition modules are used for executing different audio recognition tasks; updating the preset-type modules among the plurality of initial audio recognition modules to obtain an updated recognition model, wherein the updating adds an initial verification unit to each preset-type module to obtain an initial target audio recognition module, and the initial verification unit verifies whether the audio data input to the initial target audio recognition module falls within an initial feature range; and training the updated recognition model according to a preset loss function and the plurality of sample audio data to obtain the target recognition model, wherein the preset loss function is generated based on an expected passing rate, and the expected passing rate represents the proportion of preset-type modules expected to be skipped among the plurality of preset-type modules included in the updated recognition model. The method comprises determining each preset-type module among the plurality of initial audio recognition modules as a recognition module to be updated, to obtain a plurality of recognition modules to be updated, and updating the processing logic of each recognition module to be updated to execute processing of audio recognition tasks corresponding to the initial verifi