CN-121528248-B - Audio and video semantic analysis method and system for multimodal contrastive learning
Abstract
The application discloses an audio and video semantic analysis method and system for multimodal contrastive learning, relating to the technical field of audio and video processing. The method comprises: acquiring audio and video data and extracting an audio signal and a video frame sequence; obtaining audio emotion features and audio content features from the audio signal, and video emotion features and video content features from the video frame sequence; calculating an emotion intensity difference coefficient between the audio emotion features and the video emotion features and performing intensity normalization on both; performing weighted fusion of the normalized audio emotion features and the normalized video emotion features to obtain cross-modal emotion features; and jointly encoding the cross-modal emotion features with the audio and video content features to generate a unified audio-video semantic representation. The method addresses the problem that existing AI video generation systems output audio and video content directly, with no mechanism for detecting and optimizing the degree of emotional matching between audio and video in the generated result, which leads to uncoordinated emotional expression.
Inventors
- SHU LEI
Assignees
- 北京流金岁月科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-11-17
Claims (9)
- 1. An audio and video semantic analysis method for multimodal contrastive learning, characterized by comprising the following steps: acquiring audio and video data, and extracting an audio signal and a video frame sequence respectively; obtaining audio emotion features and audio content features from the audio signal, and obtaining video emotion features and video content features from the video frame sequence; calculating an emotion intensity difference coefficient between the audio emotion features and the video emotion features, and performing intensity normalization on the audio emotion features and the video emotion features according to the emotion intensity difference coefficient to obtain normalized audio emotion features and normalized video emotion features; performing weighted fusion of the normalized audio emotion features and the normalized video emotion features to obtain cross-modal emotion features; and jointly encoding the cross-modal emotion features, the audio content features and the video content features to generate a unified audio-video semantic representation.
- 2. The method of claim 1, wherein obtaining the audio emotion features and audio content features from the audio signal, and the video emotion features and video content features from the video frame sequence, comprises: processing the audio signal through an audio decoupling network to obtain the audio emotion features and the audio content features, wherein the audio decoupling network comprises an audio emotion feature extraction branch and an audio content feature extraction branch; and processing the video frame sequence through a video decoupling network to obtain the video emotion features and the video content features, wherein the video decoupling network comprises a video emotion feature extraction branch and a video content feature extraction branch.
- 3. The method of claim 2, wherein constructing the audio decoupling network comprises: retrieving a plurality of sample audio signals through big data to construct a sample audio signal set; labeling each sample audio signal in the sample audio signal set with an audio emotion feature tag to construct a sample audio emotion feature tag set; labeling each sample audio signal in the sample audio signal set with an audio content feature tag to construct a sample audio content feature tag set; constructing the audio emotion feature extraction branch with the sample audio signal set as input and the sample audio emotion feature tag set as supervision tags; constructing the audio content feature extraction branch with the sample audio signal set as input and the sample audio content feature tag set as supervision tags; and integrating the input ends of the audio emotion feature extraction branch and the audio content feature extraction branch to obtain the audio decoupling network.
- 4. The method of claim 2, wherein constructing the video decoupling network comprises: retrieving a plurality of sample video frame sequences through big data to construct a sample video frame sequence set; labeling each sample video frame sequence in the sample video frame sequence set with a video emotion feature tag to construct a sample video emotion feature tag set; labeling each sample video frame sequence in the sample video frame sequence set with a video content feature tag to construct a sample video content feature tag set; constructing the video emotion feature extraction branch with the sample video frame sequence set as input and the sample video emotion feature tag set as supervision tags; constructing the video content feature extraction branch with the sample video frame sequence set as input and the sample video content feature tag set as supervision tags; and integrating the input ends of the video emotion feature extraction branch and the video content feature extraction branch to obtain the video decoupling network.
- 5. The method of claim 1, wherein calculating the emotion intensity difference coefficient between the audio emotion features and the video emotion features, and performing intensity normalization on them according to the coefficient, comprises: calculating the feature intensity values of the audio emotion features and the video emotion features respectively to obtain an audio emotion intensity value and a video emotion intensity value; calculating the emotion intensity difference coefficient based on the audio emotion intensity value and the video emotion intensity value; and adjusting the intensity of the audio emotion features according to the emotion intensity difference coefficient to obtain the normalized audio emotion features, and adjusting the intensity of the video emotion features according to the emotion intensity difference coefficient to obtain the normalized video emotion features.
- 6. The method of claim 5, wherein calculating the emotion intensity difference coefficient based on the audio emotion intensity value and the video emotion intensity value comprises: taking the audio emotion intensity value as a reference, calculating the relative ratio of the video emotion intensity value to the audio emotion intensity value; and taking the relative ratio as the emotion intensity difference coefficient.
- 7. The method of claim 5, wherein adjusting the intensities of the audio and video emotion features according to the emotion intensity difference coefficient comprises: multiplying the audio emotion features by the square root of the emotion intensity difference coefficient to obtain the normalized audio emotion features; and dividing the video emotion features by the square root of the emotion intensity difference coefficient to obtain the normalized video emotion features (a numerical sketch of this adjustment follows the claims list).
- 8. The method of claim 1, wherein performing weighted fusion of the normalized audio emotion features and the normalized video emotion features to obtain the cross-modal emotion features comprises: identifying a current scene type based on the audio and video data; obtaining a corresponding preset audio fusion weight and a corresponding preset video fusion weight according to the current scene type; and weighting and combining the normalized audio emotion features and the normalized video emotion features according to the preset audio fusion weight and the preset video fusion weight to obtain the cross-modal emotion features.
- 9. An audio and video semantic analysis system for multimodal contrastive learning, characterized by performing the method of any one of claims 1 to 8, and comprising: an information acquisition module for acquiring audio and video data and extracting an audio signal and a video frame sequence respectively; a feature acquisition module for obtaining audio emotion features and audio content features from the audio signal, and video emotion features and video content features from the video frame sequence; a coefficient calculation module for calculating an emotion intensity difference coefficient between the audio emotion features and the video emotion features, and performing intensity normalization on the audio emotion features and the video emotion features according to the emotion intensity difference coefficient to obtain normalized audio emotion features and normalized video emotion features; a feature processing module for performing weighted fusion of the normalized audio emotion features and the normalized video emotion features to obtain cross-modal emotion features; and a semantic generation module for jointly encoding the cross-modal emotion features, the audio content features and the video content features to generate a unified audio-video semantic representation.
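The square-root adjustment of claims 5 to 7 splits the intensity gap symmetrically between the two modalities. Below is a minimal numerical sketch in Python, assuming the "feature intensity value" is the L2 norm of each emotion feature vector; the claims do not fix a particular intensity measure, so the choice of norm here is illustrative:

```python
import numpy as np

def normalize_emotion_features(audio_emo: np.ndarray, video_emo: np.ndarray):
    """Intensity normalization per claims 5-7 (L2 norm assumed as intensity)."""
    audio_intensity = np.linalg.norm(audio_emo)  # audio emotion intensity value
    video_intensity = np.linalg.norm(video_emo)  # video emotion intensity value

    # Claim 6: relative ratio of video to audio intensity, with audio as the
    # reference (assumes a non-zero audio intensity).
    diff_coeff = video_intensity / audio_intensity

    # Claim 7: scale audio up and video down by the square root of the coefficient.
    root = np.sqrt(diff_coeff)
    return audio_emo * root, video_emo / root
```

After this adjustment both vectors have the same intensity, namely the geometric mean of the two original intensities, which is why the square root appears: neither modality is forced all the way onto the other's scale.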
Description
Audio and video semantic analysis method and system for multimodal contrastive learning

Technical Field

The application relates to the technical field of audio and video processing, and in particular to an audio and video semantic analysis method and system for multimodal contrastive learning.

Background

With the rapid development of artificial intelligence technology, remarkable progress has been made in the field of audio and video processing. However, traditional post-generation processing mainly focuses on technical indicators such as definition and fluency, and lacks dedicated detection and optimization of consistency at the emotional-semantic level; it concentrates on extracting content features and does not process emotion features accurately and finely, so the emotion feature intensities of different modalities become unbalanced.

Disclosure of Invention

By providing an audio and video semantic analysis method and system for multimodal contrastive learning, the embodiments of the application solve the technical problem that existing AI video generation systems output audio and video content directly, with no mechanism for detecting and optimizing the degree of emotional matching between audio and video in the generated result, which leads to uncoordinated emotional expression. The technical scheme is as follows.

In a first aspect, the application provides an audio and video semantic analysis method for multimodal contrastive learning, the method comprising: acquiring audio and video data, and extracting an audio signal and a video frame sequence respectively; obtaining audio emotion features and audio content features from the audio signal, and video emotion features and video content features from the video frame sequence; calculating an emotion intensity difference coefficient between the audio emotion features and the video emotion features, and performing intensity normalization on the audio emotion features and the video emotion features according to the emotion intensity difference coefficient to obtain normalized audio emotion features and normalized video emotion features; performing weighted fusion of the normalized audio emotion features and the normalized video emotion features to obtain cross-modal emotion features; and jointly encoding the cross-modal emotion features, the audio content features and the video content features to generate a unified audio-video semantic representation.
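To make the decoupling networks of claims 2 to 4 concrete, the following is a minimal PyTorch sketch of the audio decoupling network. The patent only specifies two separately supervised branches joined at an integrated input end, so the MLP branch architecture and layer shapes here are illustrative assumptions; a video decoupling network would mirror this structure over frame-sequence features:

```python
import torch
import torch.nn as nn

class AudioDecouplingNetwork(nn.Module):
    """Two supervised branches joined at the input end (claims 2 and 3)."""

    def __init__(self, in_dim: int = 128, emo_dim: int = 64, content_dim: int = 64):
        super().__init__()
        # Branch trained against the sample audio emotion feature tag set.
        self.emotion_branch = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emo_dim))
        # Branch trained against the sample audio content feature tag set.
        self.content_branch = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, content_dim))

    def forward(self, audio_signal: torch.Tensor):
        # Both branches consume the same input, i.e. the integrated input end.
        return self.emotion_branch(audio_signal), self.content_branch(audio_signal)
```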
In a second aspect, the application provides an audio and video semantic analysis system for multimodal contrastive learning, comprising: an information acquisition module for acquiring audio and video data and extracting an audio signal and a video frame sequence respectively; a feature acquisition module for obtaining audio emotion features and audio content features from the audio signal, and video emotion features and video content features from the video frame sequence; a coefficient calculation module for calculating an emotion intensity difference coefficient between the audio emotion features and the video emotion features, and performing intensity normalization on the audio emotion features and the video emotion features according to the emotion intensity difference coefficient to obtain normalized audio emotion features and normalized video emotion features; a feature processing module for performing weighted fusion of the normalized audio emotion features and the normalized video emotion features to obtain cross-modal emotion features; and a semantic generation module for jointly encoding the cross-modal emotion features, the audio content features and the video content features to generate a unified audio-video semantic representation.

The one or more technical schemes provided by the application have at least the following technical effects or advantages. By acquiring audio and video data and extracting audio and video features separately, performing intensity normalization and weighted fusion on the emotion features, and finally performing scene-type-based adaptive fusion to generate an optimized unified semantic representation, the method and system can effectively detect and optimize the degree of emotional matching between audio and video, solving the problem of uncoordinated emotional expression in existing AI video generation systems. This scheme not only allows audio and video data to be processed and analyzed accurately, but can also bring innovative solutions to a number of fields.
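As an illustration of how the feature processing and semantic generation modules could fit together, here is a minimal sketch of the scene-dependent fusion of claim 8 followed by a stand-in joint encoding. The scene taxonomy, the weight values, and the use of concatenation as the joint coding step are all assumptions: the patent specifies only that the fusion weights are preset per scene type, and does not specify the encoder.

```python
import numpy as np

# Hypothetical scene-to-weight table; the patent does not give the scene
# taxonomy or the actual preset values.
SCENE_WEIGHTS = {
    "dialogue": (0.6, 0.4),  # (audio fusion weight, video fusion weight)
    "action":   (0.3, 0.7),
    "default":  (0.5, 0.5),
}

def fuse_and_encode(norm_audio_emo, norm_video_emo,
                    audio_content, video_content, scene_type="default"):
    """Claim 8 weighted fusion, then a placeholder joint encoding."""
    w_audio, w_video = SCENE_WEIGHTS.get(scene_type, SCENE_WEIGHTS["default"])
    # Cross-modal emotion feature as a preset-weight combination.
    cross_modal_emo = w_audio * norm_audio_emo + w_video * norm_video_emo
    # Stand-in for joint coding: concatenate emotion and content features
    # into the unified audio-video semantic representation.
    return np.concatenate([cross_modal_emo, audio_content, video_content])
```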