CN-121983035-A - Semantic recognition method, device, equipment and medium for fusing audio and video

CN121983035A

Abstract

The invention provides a semantic recognition method and device that fuse audio and video. The method comprises: performing feature extraction on audio information to obtain audio features; performing feature extraction on video information to obtain lip features; aligning and splicing the audio features and the lip features to obtain joint features; performing signal-to-noise ratio estimation on the Mel spectrum feature map of the audio to obtain a signal-to-noise ratio, from which an audio weight and a video weight are derived; splitting the joint features based on the audio weight and the video weight to obtain audio branch features and video branch features; fusing the audio branch features, the video branch features and the joint features to obtain fusion features; and performing semantic recognition on the fusion features to obtain a recognition result. The method and device are intended to solve the technical problem of poor speech recognition performance in complex acoustic environments.
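The "alignment and splicing" step in the abstract can be illustrated with a minimal sketch. This is not the patent's disclosed implementation; the resampling strategy, feature dimensions, and function names are assumptions chosen for clarity.

```python
import numpy as np

def align_and_splice(audio_feat, video_feat):
    """Align two feature sequences to a common number of time steps by
    nearest-neighbor resampling, then splice (concatenate) along the
    feature dimension to form joint features. Illustrative only: the
    patent does not specify the alignment scheme."""
    t = max(audio_feat.shape[0], video_feat.shape[0])

    def resample(x):
        # map t target steps onto the source timeline and round to indices
        idx = np.round(np.linspace(0, x.shape[0] - 1, t)).astype(int)
        return x[idx]

    return np.concatenate([resample(audio_feat), resample(video_feat)], axis=1)

audio = np.random.randn(100, 80)   # 100 audio frames, 80-dim features (assumed sizes)
video = np.random.randn(25, 64)    # 25 video frames, 64-dim lip features
joint = align_and_splice(audio, video)   # joint.shape == (100, 144)
```

Audio is typically framed faster than video, so the shorter lip-feature stream is upsampled to the audio frame rate before concatenation.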

Inventors

  • Wu Haining
  • Zhao Sizhe
  • Wang Hongxin
  • Xiao Yifan
  • Sun Huailang
  • Yang Juan
  • Li Shujie
  • Xue Feng

Assignees

  • Hefei University of Technology (合肥工业大学)

Dates

Publication Date
2026-05-05
Application Date
2026-04-07

Claims (10)

  1. A semantic recognition method for fusing audio and video, comprising: performing Mel spectrum conversion on audio information to obtain a Mel spectrum feature map, and performing feature extraction on the Mel spectrum feature map to obtain audio features; performing feature extraction on video information to obtain lip features, wherein the video information and the audio information are collected simultaneously, and the video information comprises lip information; aligning and splicing the audio features and the lip features to obtain joint features; performing signal-to-noise ratio estimation on the Mel spectrum feature map to obtain a signal-to-noise ratio, and obtaining an audio weight and a video weight according to the signal-to-noise ratio and a preset weight distribution function; performing feature splitting on the joint features based on the audio weight and the video weight through a pre-trained feature fusion model to obtain audio branch features and video branch features, and performing feature fusion on the audio branch features, the video branch features and the joint features to obtain fusion features; and performing semantic recognition on the fusion features to obtain a recognition result.
  2. The semantic recognition method for fusing audio and video according to claim 1, wherein performing feature extraction on the Mel spectrum feature map to obtain audio features comprises: calculating frequency-domain attention weights through a preset frequency-domain attention function and the Mel spectrum feature map; performing element-level weighting on the Mel spectrum feature map based on the frequency-domain attention weights to obtain a weighted feature map; and performing feature extraction on the weighted feature map to obtain the audio features.
  3. The semantic recognition method for fusing audio and video according to claim 1, wherein performing feature extraction on the video information to obtain lip features comprises: extracting spatio-temporal features of the video information to obtain basic lip features; calculating a lip mask through a preset lip-region attention function and the basic lip features; performing element-level multiplication on the lip mask and the basic lip features to obtain weighted lip features; and performing position-information enhancement on the weighted lip features to obtain the lip features.
  4. The semantic recognition method for fusing audio and video according to claim 1, wherein performing feature splitting on the joint features based on the audio weight and the video weight through the pre-trained feature fusion model to obtain audio branch features and video branch features, and performing feature fusion on the audio branch features, the video branch features and the joint features to obtain fusion features, comprises: performing relative-position multi-head attention calculation on the joint features through the pre-trained feature fusion model to obtain attention output features; performing feature splitting on the attention output features based on the audio weight to obtain the audio branch features; performing feature splitting on the attention output features based on the video weight to obtain the video branch features; calculating a modal difference perception coefficient based on the audio weight and the video weight; and performing feature fusion on the audio branch features, the video branch features and the attention output features according to the modal difference perception coefficient and a preset modal fusion coefficient, followed by feature refining, to obtain the fusion features.
  5. The semantic recognition method for fusing audio and video according to claim 4, wherein performing feature splitting on the attention output features based on the audio weight to obtain the audio branch features comprises: calculating an audio modality mask based on the audio weight and the attention output features; performing element-level multiplication on the audio modality mask and the attention output features to obtain initial audio branch features; and performing a downsampling linear transformation on the initial audio branch features to reduce the feature dimension, applying a nonlinear activation function, and restoring the feature dimension through an upsampling linear transformation to obtain the audio branch features.
  6. The semantic recognition method for fusing audio and video according to claim 4, wherein performing feature splitting on the attention output features based on the video weight to obtain the video branch features comprises: calculating a video modality mask based on the video weight and the attention output features; performing element-level multiplication on the video modality mask and the attention output features to obtain initial video branch features; and performing a downsampling linear transformation on the initial video branch features to reduce the feature dimension, applying a nonlinear activation function, and restoring the feature dimension through an upsampling linear transformation to obtain the video branch features.
  7. The semantic recognition method for fusing audio and video according to claim 1, wherein performing semantic recognition on the fusion features to obtain a recognition result comprises: performing linear transformation and normalization on the fusion features to generate a character probability distribution for each time step, and selecting the character with the highest probability at each time step; splicing all selected characters in temporal order and performing de-duplication to obtain a predicted character sequence; sequentially performing an encoding operation and semantic optimization on the predicted character sequence to obtain encoded features; and decoding the encoded features into natural-language text and performing format adjustment to obtain the recognition result.
  8. A semantic recognition device for fusing audio and video, comprising: an audio feature acquisition module, configured to perform Mel spectrum conversion on audio information to obtain a Mel spectrum feature map, and to perform feature extraction on the Mel spectrum feature map to obtain audio features; a lip feature acquisition module, configured to perform feature extraction on video information to obtain lip features, wherein the video information and the audio information are collected simultaneously, and the video information comprises lip information; a feature splicing module, configured to align and splice the audio features and the lip features to obtain joint features; a weight calculation module, configured to perform signal-to-noise ratio estimation on the Mel spectrum feature map to obtain a signal-to-noise ratio, and to obtain an audio weight and a video weight according to the signal-to-noise ratio and a preset weight distribution function; a feature fusion module, configured to perform feature splitting on the joint features based on the audio weight and the video weight through a pre-trained feature fusion model to obtain audio branch features and video branch features, and to perform feature fusion on the audio branch features, the video branch features and the joint features to obtain fusion features; and a semantic recognition module, configured to perform semantic recognition on the fusion features to obtain a recognition result.
  9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the semantic recognition method for fusing audio and video according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the semantic recognition method for fusing audio and video according to any one of claims 1-7.
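The decoding procedure of claim 7 (per-step character distribution, argmax selection, temporal splicing, and de-duplication) resembles greedy CTC decoding. A minimal sketch follows; the character set, the blank symbol, and its index are assumptions, as the claim does not fix them.

```python
import numpy as np

def greedy_decode(logits, charset, blank=0):
    """Sketch of claim 7's first steps: normalize per-step scores into a
    character probability distribution, take the highest-probability
    character at each time step, then splice in temporal order while
    collapsing consecutive repeats and dropping the blank symbol."""
    # softmax normalization per time step
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    best = probs.argmax(axis=1)           # highest-probability index per step
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:  # de-duplication
            chars.append(charset[idx])
        prev = idx
    return "".join(chars)

charset = ["-", "a", "b", "c"]            # "-" is the assumed blank symbol
logits = np.array([[0.0, 5, 0, 0],        # step 1 -> "a"
                   [0.0, 5, 0, 0],        # step 2 -> "a" (repeat, collapsed)
                   [5.0, 0, 0, 0],        # step 3 -> blank (dropped)
                   [0.0, 0, 5, 0]])       # step 4 -> "b"
result = greedy_decode(logits, charset)   # → "ab"
```

The subsequent encoding and semantic-optimization steps of claim 7 would operate on this predicted character sequence and are not sketched here.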

Description

Semantic recognition method, device, equipment and medium for fusing audio and video

Technical Field

The invention relates to the field of speech recognition, and in particular to a semantic recognition method and device for fusing audio and video.

Background

In complex acoustic environments such as car cabins and industrial control centers, audio-only speech recognition systems degrade drastically when background noise is high. To improve recognition robustness, the prior art often introduces video information for assistance: the lip-movement features of a speaker are extracted and fused with audio features to help improve recognition accuracy in noisy environments. The current mainstream fusion approach is to simply splice or weight the audio and video features and feed the result into the recognition model. However, existing methods still rely on a static, fixed fusion strategy. Lacking dynamic perception of audio quality, they cannot adaptively adjust the audio and video weights as the ambient signal-to-noise ratio changes. Under severe noise interference, over-weighting low-quality audio suppresses the complementary effect of the video information, so the fusion performs poorly and the accuracy and stability of overall semantic recognition suffer. There is therefore a need for improvement.

Disclosure of Invention

The invention provides a semantic recognition method and device for fusing audio and video, which are used to solve the technical problem of poor speech recognition performance in complex acoustic environments.
The invention provides a semantic recognition method for fusing audio and video, comprising the following steps: performing Mel spectrum conversion on audio information to obtain a Mel spectrum feature map, and performing feature extraction on the Mel spectrum feature map to obtain audio features; performing feature extraction on video information to obtain lip features, wherein the video information and the audio information are collected simultaneously, and the video information comprises lip information; aligning and splicing the audio features and the lip features to obtain joint features; performing signal-to-noise ratio estimation on the Mel spectrum feature map to obtain a signal-to-noise ratio, and obtaining an audio weight and a video weight according to the signal-to-noise ratio and a preset weight distribution function; performing feature splitting on the joint features based on the audio weight and the video weight through a pre-trained feature fusion model to obtain audio branch features and video branch features, and performing feature fusion on the audio branch features, the video branch features and the joint features to obtain fusion features; and performing semantic recognition on the fusion features to obtain a recognition result. In an embodiment of the invention, performing feature extraction on the Mel spectrum feature map to obtain audio features includes: calculating frequency-domain attention weights through a preset frequency-domain attention function and the Mel spectrum feature map; performing element-level weighting on the Mel spectrum feature map based on the frequency-domain attention weights to obtain a weighted feature map; and performing feature extraction on the weighted feature map to obtain the audio features.
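The SNR estimation and the "preset weight distribution function" above can be sketched as follows. The patent does not disclose either function's form, so both the percentile-based SNR heuristic and the logistic weight curve below are assumptions chosen to show the adaptive behavior: high SNR shifts weight toward audio, low SNR toward video.

```python
import numpy as np

def estimate_snr_db(mel_db):
    """Crude SNR estimate from a Mel spectrogram in dB (assumed estimator):
    treat the 90th-percentile frame energy as signal and the 10th
    percentile as the noise floor."""
    frame_energy = mel_db.mean(axis=0)      # mean energy per time frame
    return np.percentile(frame_energy, 90) - np.percentile(frame_energy, 10)

def weight_distribution(snr_db, midpoint=10.0, slope=0.5):
    """Illustrative weight distribution function: a logistic curve maps
    SNR to the audio weight; the video weight is its complement, so the
    two always sum to 1."""
    audio_w = 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))
    return audio_w, 1.0 - audio_w

# synthetic spectrogram: 50 silent frames, then 50 frames 20 dB louder
mel = np.concatenate([np.zeros((80, 50)), 20.0 * np.ones((80, 50))], axis=1)
snr = estimate_snr_db(mel)                  # ≈ 20 dB for this input
audio_w, video_w = weight_distribution(snr) # clean audio: audio weight dominates
```

Under heavy noise (SNR near or below the midpoint) the same function pushes weight toward the video branch, which is the adaptive behavior the disclosure contrasts with static fusion.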
In an embodiment of the invention, performing feature extraction on the video information to obtain lip features includes: extracting spatio-temporal features of the video information to obtain basic lip features; calculating a lip mask through a preset lip-region attention function and the basic lip features; performing element-level multiplication on the lip mask and the basic lip features to obtain weighted lip features; and performing position-information enhancement on the weighted lip features to obtain the lip features. In an embodiment of the invention, performing feature splitting on the joint features based on the audio weight and the video weight through the pre-trained feature fusion model to obtain audio branch features and video branch features, and performing feature fusion on the audio branch features, the video branch features and the joint features to obtain fusion features, includes: performing relative-position multi-head attention calculation on the joint features through the pre-trained feature fusion model to obtain attention output features; performing feature splitting on the attention output features based on the audio weight to obtain the audio branch features; performing feature splitting on the attention output features based on the video weight
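The lip-mask step above (a preset lip-region attention function followed by element-level multiplication) can be sketched minimally. The sigmoid gate below is an assumption; the patent does not disclose the attention function's form, and the feature shapes are illustrative.

```python
import numpy as np

def lip_region_mask(base_feat):
    """Illustrative lip-region attention function: a sigmoid gate over
    the basic lip features, producing a soft mask with values in (0, 1)
    that emphasizes strongly activated (lip-region) positions."""
    return 1.0 / (1.0 + np.exp(-base_feat))

def weighted_lip_features(base_feat):
    # element-level multiplication of the mask with the basic lip features
    return lip_region_mask(base_feat) * base_feat

rng = np.random.default_rng(0)
base = rng.standard_normal((25, 64))    # 25 video frames, 64-dim basic lip features
weighted = weighted_lip_features(base)  # same shape, lip regions emphasized
```

A trained model would learn this mask rather than derive it from a fixed sigmoid, and the subsequent position-information enhancement (e.g. a positional encoding) is omitted here.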