CN-121985141-A - Video conference audio and video quality comprehensive detection system integrating deep learning
Abstract
The application relates to the technical field of multimedia communication and discloses a video conference audio and video quality comprehensive detection system integrating deep learning, comprising a demultiplexing module, a video processing module, an audio processing module, an alignment buffer module, an analysis module and a judgment module. The demultiplexing module separates the audio and video code streams and parses the time stamps; the video processing module parses the code stream syntax structure and, without pixel reconstruction, aggregates motion vectors using weighting coefficients based on spatial partition size and residual coding bit count to generate a video feature sequence; the audio processing module computes a per-frame energy index to generate an audio feature sequence; the alignment buffer module maps both feature sequences to a unified time axis; the analysis module computes a cross-correlation function to determine the peak value and the optimal time lag; and the judgment module determines the playback quality state through activity gating, hysteresis comparison and an asymmetric synchronization tolerance interval. By exploiting compressed-domain features and a hierarchical decision mechanism, the application reduces computational cost and improves detection accuracy and robustness.
Inventors
- WANG XIQIAN
- YANG YIXI
- SI JIA
- LI YAXI
- QIU YAJUN
- HE YUZE
- SU TONG
- ZHOU QIAN
- LV JIANSHU
Assignees
- Information and Communication Center of State Grid Co., Ltd. (国家电网有限公司信息通信中心)
Dates
- Publication Date
- 20260505
- Application Date
- 20260204
Claims (10)
- 1. A video conference audio and video quality comprehensive detection system integrating deep learning, characterized by comprising: a demultiplexing module (110) for receiving multimedia data packets, separating them into a video elementary stream and an audio elementary stream, and parsing the time stamps; a video processing module (120) for parsing the syntax structure of the video elementary stream without pixel reconstruction, extracting motion vectors, spatial partition sizes and residual coding bit counts, and aggregating the motion vectors using weighting coefficients calculated from the spatial partition size and residual coding bit count to generate a video feature sequence; an audio processing module (130) for decoding the audio elementary stream, framing it in alignment with the video frames, and calculating an energy index to generate an audio feature sequence; an alignment buffer module (140) for mapping the video feature sequence and the audio feature sequence to the same time axis using the time stamps and storing them in alignment within a sliding window; an analysis module (150) for calculating a cross-correlation function of the video feature sequence and the audio feature sequence within the sliding window and determining the cross-correlation peak and the optimal time lag; and a judgment module (160) for determining the playback quality state, including video freezing and audio-video asynchrony, according to the cross-correlation peak and the optimal time lag.
- 2. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 1, wherein the video processing module (120), when generating the video feature sequence, calculates a weighting coefficient for each coding unit within the current video frame; the value of the weighting coefficient is positively correlated with the residual coding bit count and negatively correlated with the spatial partition size; the video processing module (120) performs a weighted summation of the motion vector magnitudes of the coding units using the weighting coefficients to obtain the raw weighted aggregation feature of the current video frame.
- 3. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 2, wherein the video processing module (120), when processing the video elementary stream, reuses the video feature value of the previous frame, or uses the average of the feature values of the preceding and following predicted frames, as the feature value of the current frame if the current frame is identified as an intra-coded frame containing no motion vectors.
- 4. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 1, wherein the audio processing module (130), when calculating the energy index, obtains the normalized amplitude values of all sampling points in an audio frame, calculates the root mean square of the normalized amplitude values, and converts the root mean square into a logarithmic energy value in decibels; the conversion uses a 20·log10 calculation to maintain dimensional consistency with physical field quantities.
- 5. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 1, wherein the alignment buffer module (140) is configured with a dual-channel synchronous ring buffer whose physical capacity is set to be greater than the sum of the detection window length and a preset network jitter tolerance; the detection system also uses the sampling clock frequency of the current session to uniformly convert the raw time stamps of the video elementary stream and the audio elementary stream into display time stamps in milliseconds, which are used to establish the time-sequence mapping in the ring buffer.
- 6. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 1, wherein the analysis module (150), when calculating the cross-correlation function, calculates normalized cross-correlation coefficients of the video feature sequence and the audio feature sequence over a preset integer search space; the optimal time lag is the relative time offset at which the normalized cross-correlation coefficient reaches its maximum, and the cross-correlation peak is the maximum of the normalized cross-correlation coefficient over the search space.
- 7. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 1, wherein the judgment module (160) executes a hierarchical decision logic comprising activity gating detection, a content consistency decision and a time alignment decision; the activity gating detection checks, before the cross-correlation analysis is performed, whether the amplitudes of the current audio feature sequence and video feature sequence are respectively above a preset effective-speech threshold and a preset minimum-motion threshold.
- 8. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 7, wherein the judgment module (160) performs the content consistency decision by evaluating the cross-correlation peak with hysteresis comparison logic: when the current state is asynchronous, the contents are judged to match only when the cross-correlation peak is greater than a set high threshold; when the current state is synchronous, a content mismatch is determined only when the cross-correlation peak is less than a set low threshold.
- 9. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 7, wherein the judgment module (160), when performing the time alignment decision, converts the optimal time lag into a delay value in milliseconds and determines whether that value falls within a preset synchronization tolerance interval; the synchronization tolerance interval is asymmetric, with the maximum allowed absolute audio lead being smaller than the maximum allowed absolute audio lag.
- 10. The video conference audio and video quality comprehensive detection system integrating deep learning according to claim 1, wherein the judgment module (160) is further configured to perform state smoothing: it maintains a state-confirmation sliding window and outputs a final playback quality state detection result only when the proportion of frames judged to be in the same abnormal state within the sliding window exceeds a preset ratio.
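The compressed-domain weighting of claims 2 and 3 can be illustrated with a minimal sketch. The function name, the input structure, and the exact weight formula are assumptions: the patent only requires the weight to grow with the residual coding bit count and shrink with the spatial partition size.

```python
def frame_motion_feature(coding_units):
    """Aggregate motion-vector magnitudes for one frame (sketch of claim 2).

    Each coding unit is a dict with 'mv' (dx, dy) in pixels, 'part_size'
    (e.g. 8, 16, 64) and 'residual_bits'. The weight formula below is an
    assumption; the patent only fixes the direction of the correlations.
    """
    total = 0.0
    for cu in coding_units:
        dx, dy = cu["mv"]
        magnitude = (dx * dx + dy * dy) ** 0.5
        # Positively correlated with residual bits, negatively with
        # the (squared) spatial partition size.
        weight = cu["residual_bits"] / (cu["part_size"] ** 2)
        total += weight * magnitude
    return total
```

For an intra-coded frame with no motion vectors, claim 3 would bypass this function and reuse the previous frame's feature value (or average the neighboring predicted frames).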
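The energy index of claim 4 is a standard RMS-to-decibel conversion. A minimal sketch follows; the `eps` guard against log(0) on silent frames is an assumption not stated in the claim.

```python
import math

def frame_energy_db(samples, eps=1e-9):
    """Log energy of one audio frame in dB (sketch of claim 4).

    `samples` are amplitudes normalized to [-1, 1]. The 20*log10 form
    keeps the result dimensionally consistent with field quantities.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms + eps)
```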
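The normalized cross-correlation search of claim 6 can be sketched as follows. The sign convention (positive lag meaning the video feature sequence is delayed relative to the audio) and the function names are assumptions.

```python
def best_lag(video, audio, max_lag):
    """Search an integer lag range for the peak normalized cross-correlation
    between two feature sequences (sketch of claim 6). Returns (lag, peak)."""
    def ncc(a, b):
        n = min(len(a), len(b))
        a, b = a[:n], b[:n]
        ma, mb = sum(a) / n, sum(b) / n
        num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        den = (sum((x - ma) ** 2 for x in a) *
               sum((y - mb) ** 2 for y in b)) ** 0.5
        return num / den if den else 0.0

    peak, lag_opt = -1.0, 0
    for lag in range(-max_lag, max_lag + 1):
        # Assumed convention: positive lag drops leading video samples,
        # i.e. the video is treated as delayed relative to the audio.
        r = ncc(video[lag:], audio) if lag >= 0 else ncc(video, audio[-lag:])
        if r > peak:
            peak, lag_opt = r, lag
    return lag_opt, peak
```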
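The hysteresis comparison of claim 8 and the asymmetric tolerance interval of claim 9 can be combined into one decision sketch. All threshold values are assumptions: the claims only require the high threshold to exceed the low one and the allowed audio lead to be tighter than the allowed audio lag.

```python
def judge_sync(peak, lag_ms, was_sync,
               hi=0.6, lo=0.3, lead_max_ms=45, lag_max_ms=125):
    """Hysteresis + asymmetric tolerance decision (sketch of claims 8-9).

    `lag_ms` < 0 is taken to mean the audio leads the picture (assumed
    convention). Returns True when playback is judged synchronous.
    """
    # Content consistency with hysteresis: leaving the current state
    # requires crossing the farther threshold, which suppresses flapping.
    content_ok = (peak >= lo) if was_sync else (peak > hi)
    # Asymmetric interval: audio lead is tolerated less than audio lag.
    time_ok = -lead_max_ms <= lag_ms <= lag_max_ms
    return content_ok and time_ok
```

Claim 10's state smoothing would then only report an abnormal state after it dominates a confirmation window of consecutive decisions.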
Description
Video conference audio and video quality comprehensive detection system integrating deep learning
Technical Field
The invention relates to the technical field of multimedia communication, in particular to a video conference audio and video quality comprehensive detection system integrating deep learning.
Background
With the popularization of remote working and online collaboration, video conferencing has become a core tool for cross-regional communication, and users' audiovisual quality requirements for conferences have steadily increased. Among the quality-of-experience indicators, the fluency of the video picture and the degree of audio-video synchronization are key factors determining the user experience. Once picture freezing or audio-video asynchrony occurs, the efficiency of information transfer drops markedly, and participants may even suffer visual fatigue and discomfort. In existing video conference quality monitoring systems, detection of picture freezing relies mainly on pixel-domain analysis. Such methods typically require the received compressed video stream to be fully decoded into the original image sequence, after which stillness is determined by computing the sum of absolute pixel differences (SAD) between adjacent frames or by extracting dense optical-flow features. However, video decoding and pixel-level computation are computationally intensive, placing extremely high demands on processor compute and memory bandwidth. When a media server must handle hundreds to thousands of concurrent sessions, or runs on compute-limited edge nodes and mobile terminals, the full-decoding approach struggles to meet real-time requirements and may even preempt resources from normal audio and video processing, affecting conference stability.
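The pixel-domain freeze test described above, which the invention avoids, amounts to the following sketch. The threshold value and function names are illustrative assumptions.

```python
def sad(frame_a, frame_b):
    """Sum of absolute pixel differences between two equally sized
    grayscale frames (the prior-art pixel-domain stillness measure)."""
    return sum(abs(a - b)
               for row_a, row_b in zip(frame_a, frame_b)
               for a, b in zip(row_a, row_b))

def is_frozen(frame_a, frame_b, threshold=100):
    """A frame pair is flagged as still when SAD falls below a threshold
    (threshold value assumed for illustration)."""
    return sad(frame_a, frame_b) < threshold
```

The cost lies upstream of this computation: every frame must first be fully decoded to pixels, which is what the compressed-domain approach eliminates.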
In addition, existing solutions have notable limitations in detecting audio-video synchronization problems. Conventional synchronization detection is often based on comparing the Presentation Time Stamps (PTS) in transport protocol headers. Although simple to implement, this approach only reflects the timing relationship of data packets at the transport level and cannot perceive the actual playback time difference caused by differing decoding times, rendering queue blocking, or playback device clock drift. Some schemes employing signal processing techniques attempt to exploit the cross-correlation between audio waveforms and video time-domain features, but typically use symmetric time thresholds in the decision logic, i.e., they treat sound arriving earlier than the picture and sound arriving later than the picture as equivalent in their effect on the user experience. This is inconsistent with human psycho-acoustic and visual-perceptual characteristics: human sensitivity to "sound before picture" is much higher than to "sound after picture", so symmetric thresholds tend to produce detection results inconsistent with users' subjective perception. Meanwhile, in low-information scenes common in conferences, such as background silence or a static solid-color slide, existing cross-correlation detection algorithms lack effective signal gating and state smoothing mechanisms, so the computed correlation coefficient fluctuates severely and causes frequent false alarms.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video conference audio and video quality comprehensive detection system integrating deep learning, which solves the prior-art problems of the large computational resource consumption of full-decoding analysis, the failure of audio-video synchronization detection to accommodate human perceptual asymmetry, and inaccurate detection caused by network environment interference. The object of the invention is achieved by the following technical scheme: the video conference audio and video quality comprehensive detection system integrating deep learning comprises a demultiplexing module, a video processing module, an audio processing module, an alignment buffer module, an analysis module and a judgment module. The demultiplexing module is used for receiving multimedia data packets, separating them into a video elementary stream and an audio elementary stream, and parsing the time stamps. The video processing module is used for parsing the syntax structure of the video elementary stream and extracting motion vectors, spatial partition sizes and residual coding bit counts. The module requires no pixel reconstruction; instead, it aggregates the motion vectors using weighting coefficients calculated from the spatial partition size and residual coding bit count
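The timestamp normalization that maps both elementary streams onto one time axis (claim 5) reduces to dividing the raw stamp by the session's sampling clock. A minimal sketch, assuming the usual MPEG 90 kHz video clock as the default:

```python
def pts_to_ms(pts, clock_hz=90_000):
    """Convert a raw presentation time stamp, counted in sampling-clock
    ticks, to display milliseconds. 90 kHz is the conventional MPEG video
    clock (an assumption here); an audio stream would pass its own
    sample rate, e.g. 48_000."""
    return pts * 1000.0 / clock_hz
```

After this conversion, video and audio feature values share one millisecond axis and can be stored at aligned positions in the ring buffer.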