CN-122027793-A - Fine-grained bit-rate control method based on audio-visual multi-modal perception
Abstract
The invention relates to the technical field of video coding and discloses a fine-grained bit-rate control method based on audio-visual multi-modal perception, comprising: step 1, extracting and normalizing auditory perception features; step 2, computing a frame-level auditory saliency score S_aud; step 3, fusing an audio-visual saliency map with the frame-level auditory saliency score S_aud to obtain a bit-rate adjustment factor matrix F; step 4, obtaining from F a local adjustment factor value F_CU for each coding unit, comparing F_CU with an adjustment-factor reference value F_ref to obtain a relative deviation, and introducing this deviation into the QP offset control logic to dynamically correct the encoder's original QP offset. The method improves the model's interpretability with respect to audio detail, enhances the spatio-temporal resolution of the saliency-guided adjustment factor matrix F, offers strong flexibility, and suits task scenarios involving audio-video fusion.
Inventors
- YIN HAIBING
- CHEN QI
- WANG LEYANG
- HUANG XIAOFENG
- WANG HONGKUI
- WANG XIA
- WANG JUN
- SUN YAOQI
- XIE YUN
- FENG GAO
Assignees
- Hangzhou Dianzi University (杭州电子科技大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-10-17
Claims (9)
- 1. A fine-grained bit-rate control method based on audio-visual multi-modal perception, comprising the following steps: Step 1: based on human auditory perception, extract frame-by-frame features of the audio signal to obtain auditory perception features. Step 2: normalize the auditory perception features, fuse them using the NAMM model, introduce a voice-activity-detection probability P_VAD for weighted adjustment, and compute a frame-level auditory saliency score S_aud. Step 3: fuse the audio-visual saliency map generated by the STAViS model with the frame-level auditory saliency score S_aud, letting S_aud apply global amplitude modulation to the audio-visual saliency map, so as to obtain a bit-rate adjustment factor matrix F. Step 4: apply the adjustment factor matrix F to the bit-rate control process; for each coding unit, extract the corresponding local adjustment factor value F_CU, compare F_CU with an adjustment-factor reference value F_ref to obtain a relative offset value, and introduce this offset as an increment into the QP offset control logic, correcting the encoder's original quantization offset ΔQP_base(k) according to the formula (a minimal numerical sketch of this correction appears after the claims): ΔQP_new(k) = ΔQP_base(k) + λ·(F_ref − F_CU(k)), where λ controls the strength of the perceptual offset, k is the coding-unit index, and ΔQP_new(k) is the new quantization-parameter offset dynamically adjusted under multi-modal perceptual guidance.
- 2. The fine-grained bit-rate control method based on audio-visual multi-modal perception according to claim 1, wherein the auditory perception features of step 1 comprise a loudness dimension, a pitch dimension and a timbre dimension, and wherein, before feature extraction, the continuous audio signal is sliced into time windows synchronized with the video frames, so that audio features are extracted frame by frame in alignment with the video frames.
- 3. The fine-grained bit-rate control method based on audio-visual multi-modal perception according to claim 2, wherein a loudness perception feature is extracted per frame in the loudness dimension, using a linear loudness scale established by psychoacoustic experiments, i.e. a loudness measure in sone units; with the i-th frame audio signal denoted x_i, the sampling frequency f_s and the calibration factor C_mic, the loudness L_sone(i) of the i-th frame is expressed as: L_sone(i) = L_ISO532B(x_i, f_s, C_mic), where L_ISO532B(·) denotes a loudness calculation function conforming to the ISO 532-B standard (see the feature-extraction sketch following the claims).
- 4. The fine-grained bit-rate control method based on audio-visual multi-modal perception according to claim 2, wherein an inter-frame fundamental-frequency difference is extracted in the pitch dimension: each frame of the audio signal x_i is input to a fundamental-frequency estimation model based on periodicity detection, the model adopting the probabilistic pYIN algorithm, with the formula: F_0(i) = (1/|T_i|) · Σ_{t∈T_i} F_0(t), where t is a sample point, T_i = { t | voiced(t) = 1 }, voiced(t) = 1 indicates that sample point t lies in a voiced region, F_0(t) is the estimated fundamental frequency at each sample point, and F_0(i) is the average fundamental frequency over the voiced sample points of the i-th frame; if no voiced region is detected in a frame, F_0(i) = 0 is used as filling; the inter-frame fundamental-frequency difference is then calculated as ΔF_0(i) = |F_0(i) − F_0(i−1)| (see the feature-extraction sketch following the claims).
- 5. The fine-grained bit-rate control method based on audio-visual multi-modal perception according to claim 2, wherein the timbre dimension uses a timbre perception model combining static and dynamic cues: on one hand, the Mel spectrum is selected as the static timbre feature MS(i), describing the energy distribution structure of the audio signal in a perceptual frequency space; on the other hand, the spectral flux is extracted as the dynamic timbre feature SF(i), characterizing the degree to which the spectrum changes over time and reflecting transient properties of the sound; the timbre feature T(i) is obtained through the NAMM model (see the timbre sketch following the claims): T(i) = MS(i) + SF(i) − β·min(MS(i), SF(i)), where β ∈ [0, 1] is an aliasing suppression coefficient.
- 6. The fine-grained bit-rate control method based on audio-visual multi-modal perception according to claim 5, wherein the Mel spectrum is based on an equal-loudness-curve model fitted to the frequency sensitivity of the human ear, weighting the spectral energy of the different frequency bands so as to dynamically adjust the perceptual weight of each band.
- 7. The method of claim 1, wherein in step 2 a threshold-type voice-activation detection model based on adaptive background-energy estimation is adopted for non-speech audio, and a non-speech activation probability is calculated from the comparison between the intra-frame energy and the background energy E_bg: the mean-square energy of the i-th frame is set as E(i) = (1/N_frame) · Σ_{t=1}^{N_frame} x_i(t)², where x_i(t) is the t-th sample of the i-th frame and N_frame is the number of audio samples per frame; the background energy E_bg is estimated as the median of the energies of the past several frames, and the non-speech activation probability is defined by comparing E(i) against the adaptive threshold γ·E_bg, where γ is an empirically set threshold adjustment coefficient (an assumed soft-threshold form is sketched after the claims).
- 8. The fine-grained bit-rate control method based on audio-visual multi-modal perception according to claim 1, wherein in step 3 the audio-visual saliency map is first normalized and passed through a nonlinear inversion mapping using a Sigmoid function to generate an audio-visual saliency weight map W_a, which is then fused with the frame-level auditory saliency score S_aud by pixel-wise multiplication, with the specific formula: F = (1 + S_aud)·W_a.
- 9. The fine-grained bit-rate control method based on audio-visual multi-modal perception according to claim 1, wherein in step 4 the local adjustment factor value F_CU is extracted by a region-averaging method: for each coding unit, the spatial region of F corresponding to the coding unit is denoted Ω_CU, and F_CU = (1/N_CU) · Σ_{(x,y)∈Ω_CU} F(x, y), where N_CU is the number of pixels of the current coding unit and (x, y) are coordinates within the spatial region (see the fusion sketch following the claims).
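The sketches below are illustrative only and are not part of the claims. This first one is a minimal sketch of the feature extraction of claims 2-4: audio windows synchronized with video frames, per-frame pYIN fundamental frequency (librosa's pyin implements the pYIN algorithm named in claim 4), and per-frame ISO 532-B loudness. The use of MOSQITO's loudness_zwst interface is an assumption; any ISO 532-B routine could be substituted.

```python
import numpy as np
import librosa

def auditory_features(audio, sr, fps, c_mic=1.0):
    """Per-video-frame loudness L_sone(i) and F0 difference dF0(i)."""
    hop = int(round(sr / fps))  # audio samples per video frame (claim 2)

    # Pitch dimension (claim 4): pYIN, one estimate per video frame.
    f0, voiced, _ = librosa.pyin(
        audio, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"), sr=sr, hop_length=hop)
    f0 = np.nan_to_num(np.where(voiced, f0, 0.0))  # unvoiced -> F0(i) = 0
    df0 = np.abs(np.diff(f0, prepend=f0[:1]))      # inter-frame F0 difference

    # Loudness dimension (claim 3): ISO 532-B loudness in sone per window.
    # Assumption: MOSQITO's loudness_zwst(signal, fs) returns (N, N_spec, bark).
    from mosqito.sq_metrics import loudness_zwst
    l_sone = np.array([
        loudness_zwst(c_mic * audio[i:i + hop], sr)[0]
        for i in range(0, len(audio) - hop + 1, hop)])
    return l_sone, df0
```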
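Next, a sketch of the timbre dimension of claim 5: static Mel spectrum MS(i), dynamic spectral flux SF(i), and the NAMM-style fusion T(i) = MS(i) + SF(i) − β·min(MS(i), SF(i)). Reducing the Mel spectrum to one scalar per frame (the mean) and the [0, 1] normalization before fusion are assumptions, not details from the patent.

```python
import numpy as np
import librosa

def timbre_feature(audio, sr, hop, beta=0.5):
    # Mel spectrogram, shape [n_mels, n_frames].
    M = librosa.feature.melspectrogram(y=audio, sr=sr, hop_length=hop)
    ms = M.mean(axis=0)                                  # static timbre MS(i)
    # Spectral flux: positive spectral change between consecutive frames.
    flux = np.maximum(0.0, np.diff(M, axis=1, prepend=M[:, :1]))
    sf = flux.mean(axis=0)                               # dynamic timbre SF(i)
    # Normalize both to [0, 1] so min() in the fusion is meaningful (assumed).
    ms = (ms - ms.min()) / (np.ptp(ms) + 1e-9)
    sf = (sf - sf.min()) / (np.ptp(sf) + 1e-9)
    return ms + sf - beta * np.minimum(ms, sf)           # T(i), beta in [0, 1]
```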
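The following sketch corresponds to the threshold-type voice-activity model of claim 7. The claim defines the mean-square frame energy and a background estimate from the median of recent frame energies, but the exact probability mapping is not reproduced in the source text; the soft threshold below (a sigmoid on the energy-to-threshold gap) is an assumption, not the patented formula.

```python
import numpy as np

def vad_probability(frames, gamma=1.5, history=20):
    """Non-speech/speech activation probability per frame (claim 7 sketch)."""
    e = np.array([np.mean(x.astype(np.float64) ** 2) for x in frames])  # E(i)
    p = np.zeros_like(e)
    for i in range(len(e)):
        past = e[max(0, i - history):i]
        e_bg = np.median(past) if len(past) else e[i]  # background energy E_bg
        # Assumed soft comparison of E(i) against the threshold gamma * E_bg.
        p[i] = 1.0 / (1.0 + np.exp(-(e[i] - gamma * e_bg) / (e_bg + 1e-12)))
    return p   # used as P_VAD to weight the frame-level score S_aud (step 2)
```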
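The fusion sketch below covers claims 8-9: building F = (1 + S_aud)·W_a from a STAViS-style saliency map after normalization and a Sigmoid inversion mapping, then extracting the local factor F_CU for one coding unit by region averaging. The sigmoid slope k and midpoint used for the inversion are placeholder assumptions.

```python
import numpy as np

def adjustment_matrix(sal_map, s_aud, k=10.0, mid=0.5):
    """F = (1 + S_aud) * W_a for one frame (claim 8 sketch)."""
    s = (sal_map - sal_map.min()) / (np.ptp(sal_map) + 1e-9)  # normalize
    w_a = 1.0 / (1.0 + np.exp(k * (s - mid)))  # inverted-sigmoid weight map W_a
    return (1.0 + s_aud) * w_a                 # globally modulated by S_aud

def local_factor(F, x0, y0, w, h):
    """Region average over the CU footprint Omega_CU (claim 9 sketch)."""
    region = F[y0:y0 + h, x0:x0 + w]
    return region.mean()                       # F_CU = (1/N_CU) * sum F(x, y)
```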
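Finally, a minimal numerical sketch of the QP-offset correction of step 4 in claim 1, ΔQP_new(k) = ΔQP_base(k) + λ·(F_ref − F_CU(k)). The values of F_ref and λ here are placeholders, not values taken from the patent, and rounding the result to an integer offset is an assumption about the encoder interface.

```python
import numpy as np

def corrected_qp_offsets(dqp_base, f_cu, f_ref=1.0, lam=2.0):
    """Apply the multi-modal perceptual increment to per-CU QP offsets.

    dqp_base : array of the encoder's original offsets dQP_base(k)
    f_cu     : array of local adjustment factors F_CU(k), one per CU
    """
    dqp_new = dqp_base + lam * (f_ref - f_cu)
    # CUs with F_CU above the reference (perceptually important) receive a
    # negative increment, i.e. a lower QP and finer quantization.
    return np.rint(dqp_new).astype(int)
```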
Description
Fine-grained bit-rate control method based on audio-visual multi-modal perception

Technical Field

The invention relates to the technical field of video coding, and in particular to a fine-grained bit-rate control method based on audio-visual multi-modal perception.

Background

With the wide application of ultra-high-definition video, immersive media and streaming technologies, and with video content growing ever richer, users' demands on the subjective perceptual quality of video keep rising, and video coding faces the twin challenges of efficient compression and subjective quality assurance under limited bandwidth. Conventional video coding methods rely mainly on image features for compression optimization. Mainstream video coding standards (including H.264/AVC, HEVC and the latest VVC) adopt adaptive quantization parameter (QP) control based on structural image features, determining bit allocation by analyzing low-level visual features of a video frame such as texture complexity, luminance variance and motion vectors. The latest standard, Versatile Video Coding (VVC), introduces a range of new tools, such as quadtree-plus-multi-type-tree (QTMT) coding-unit partitioning, Intra Sub-Partitions (ISP) and the Cross-Component Linear Model (CCLM), which markedly improve coding efficiency. However, alongside the higher coding complexity, such low-level features still struggle to reflect human perceptual saliency accurately, so bit allocation becomes misaligned with regions of visual attention and resources are easily wasted; traditional coding strategies that depend only on these simple visual features increasingly expose their limitations for today's video content.

Some enhancement methods take human visual perception into account: saliency-based coding techniques simulate the spatial attention characteristics of the human eye and preferentially allocate more coding resources to highly salient regions, improving subjective video quality at the same bit rate. For example, Chen et al. proposed a video coding method using motion-assisted saliency analysis, and Chai et al. integrated visual saliency maps into a video encoder and used 360-degree video saliency maps to predict the bit allocation of CTUs. However, most of these methods use the visual channel as the sole information source and fail to account for the important guiding role of the human multi-modal perceptual system, especially of auditory signals, in video perception.

In recent years, studies in psychology and cognitive science have shown that audio signals can significantly affect human visual attention. Auditory events such as speech conversations, sudden changes in ambient sound, or changes in musical tempo often draw the viewer's attention to a particular time period or spatial region, thereby influencing the selective processing of visual information. Inspired by this mechanism, recent work has begun to introduce audio information into saliency modeling to simulate the human attention distribution more realistically. For example, Tsiami et al. proposed the STAViS (Spatio-Temporal Audio-Visual Saliency) model, which uses a deep neural network to predict salient regions in video by fusing visual and audio features.
Xie et al. proposed a multi-perception fusion framework combining audio, motion and image saliency features, extracting multi-modal information through a three-stream encoder and integrating it effectively to predict human attention distribution in audiovisual scenes. Current multi-modal saliency modeling methods mostly adopt deep convolutional networks or attention mechanisms for high-dimensional feature fusion; while this improves saliency prediction, the model structures are generally black boxes with poor interpretability as to how salient regions are identified, and such research focuses mostly on the multi-modal saliency prediction task itself, lacking a systematic method and practical application that couples audio-visual saliency modeling results to video coding and bit-allocation optimization. In recent years, some efforts have applied saliency maps to coding optimization: for example, Li et al. proposed a fast coding-unit partitioning scheme based on a visual saliency model, and Zhu et al. constructed a video coding framework incorporating a multi-scale temporal saliency detection model (SALDPC). These methods achieve a degree of perception-based allocation of bit resources, but they still take a single visual saliency map as input and lack the integration of audio information. In summary, although existing video coding optimization methods have attempted to introduce attention mechanisms to improve the percept