Search

CN-122027788-A - Machine-perspective video quality evaluation system based on human-machine dual-system perception

CN122027788A

Abstract

The invention relates to a machine-perspective video quality evaluation system based on human-machine dual-system perception, belonging to the field of video quality evaluation. The system comprises a human perception evaluation subsystem, a machine perception evaluation subsystem and a comprehensive evaluation result generation subsystem. The human perception evaluation subsystem acquires videos captured by external equipment, performs multi-dimensional quality scoring and distortion type recognition on them, generates the individual scores and overall quality evaluation of the human-perception multi-dimensional assessment, and outputs a distortion type report. The machine perception evaluation subsystem, targeting suitability for machine tasks, executes multiple task types through a large model and task-specific models to obtain the large model's human-perception and machine-perception evaluation results for the video, together with machine-perception scores under specific task scenarios. The comprehensive evaluation result generation subsystem dynamically adjusts the weights of human perception and machine perception according to the application scenario and fuses these results into a comprehensive evaluation. By constructing a dual-system collaborative assessment framework, the system achieves comprehensive and accurate assessment of video quality.
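The fusion step described in the abstract, in which human- and machine-perception weights are adjusted per application scenario, can be sketched as a simple convex combination. The function name, scenario names and weight values below are all illustrative assumptions; the patent does not specify concrete weights:

```python
def fuse_scores(human_score: float, machine_score: float,
                scenario: str, scenario_weights: dict) -> float:
    """Fuse human- and machine-perception scores with scenario-specific weights.

    scenario_weights maps a scenario name to a (w_human, w_machine) pair,
    assumed to sum to 1. All names and values here are illustrative only.
    """
    w_h, w_m = scenario_weights[scenario]
    return w_h * human_score + w_m * machine_score

# Hypothetical weights: machine perception dominates a surveillance scenario,
# human perception dominates an entertainment-viewing scenario.
WEIGHTS = {"surveillance": (0.3, 0.7), "viewing": (0.8, 0.2)}

print(fuse_scores(80.0, 60.0, "surveillance", WEIGHTS))  # 66.0
print(fuse_scores(80.0, 60.0, "viewing", WEIGHTS))       # 76.0
```

The same video thus receives a higher comprehensive score in the viewing scenario, where the stronger human-perception score carries more weight.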

Inventors

  • Ying Jiangyong
  • Zhang Yini
  • Li Yitong
  • Jin Jianing
  • Zhai Guangtao
  • Zheng Yuli
  • Duan Huiyu

Assignees

  • 天翼视联科技股份有限公司 (Tianyi Shilian Technology Co., Ltd.)
  • 上海交通大学 (Shanghai Jiao Tong University)

Dates

Publication Date
2026-05-12
Application Date
2026-01-29

Claims (10)

  1. A machine-perspective video quality evaluation system based on human-machine dual-system perception, characterized by comprising a human perception evaluation subsystem, a machine perception evaluation subsystem and a comprehensive evaluation result generation subsystem, wherein: the human perception evaluation subsystem is used for acquiring videos captured by external equipment, performing multi-dimensional quality scoring and distortion type identification on the videos according to their visual properties and human subjective experience, generating the individual scores and overall quality evaluation of the human-perception multi-dimensional assessment, and outputting a distortion type report so as to fit human subjective quality-judgment logic; the machine perception evaluation subsystem is used for acquiring videos captured by external equipment and, targeting suitability for machine tasks, executing multiple task types through a large model and task-specific models to obtain the large model's human-perception and machine-perception evaluation results for the videos and the machine-perception scores under specific task scenarios, so as to evaluate the influence of the videos on machine task scenarios; the comprehensive evaluation result generation subsystem is used for dynamically adjusting the weights of human perception and machine perception according to the application scenario, fusing the individual scores of the human-perception multi-dimensional assessment, the large model's human-perception and machine-perception evaluation results, and the machine-perception scores under specific task scenarios to generate a comprehensive evaluation result, and feeding it back to a video generation end or a machine task end.
  2. The machine-perspective video quality evaluation system based on human-machine dual-system perception according to claim 1, wherein the external equipment is an autonomous camera system, and the video it collects comprises first-person-view video captured by a robot, an unmanned aerial vehicle, a surveillance camera, a robotic arm or a vehicle.
  3. The machine-perspective video quality evaluation system based on human-machine dual-system perception according to claim 1, wherein the human perception evaluation subsystem comprises a quality scoring module and a large-model distortion type evaluation module; the quality scoring module is used for extracting multi-dimensional visual features of the video and generating the individual scores and overall quality score of the human-perception multi-dimensional assessment; the large-model distortion type evaluation module is used for identifying distortions present in each dimension of the video by means of a large model and outputting a distortion type report.
  4. The system of claim 3, wherein the quality scoring module extracts visual features of the video along the dimensions of color, noise, artifacts, blur and temporal consistency, scores each dimension quantitatively, and combines the quantitative scores of the five dimensions to obtain an overall human-perception quality score for the video.
  5. The machine-perspective video quality evaluation system based on human-machine dual-system perception according to claim 1, wherein the machine perception evaluation subsystem comprises a large-model task evaluation module and a specific-model task evaluation module; the large-model task evaluation module is used for driving a video-text multimodal large model to execute tasks and obtain task results, analyzing the deviation of the task results to quantify the influence of video quality on machine understanding, synthesizing the distortion type report of the large-model distortion type evaluation module, simulating human subjective judgment in combination with actual task requirements to output the large model's human-perception evaluation, and outputting the large model's machine-perception evaluation result in combination with the task deviation; the specific-model task evaluation module is used for obtaining the machine-perception score under specific task scenarios by invoking dedicated models, and evaluating the degree to which video quality interferes with the execution accuracy of machine vision tasks.
  6. The machine-perspective video quality evaluation system based on human-machine dual-system perception according to claim 5, wherein the large-model task evaluation module drives the multimodal large model to execute a regression task, a classification task, a visual question-answering task and a video description task, and evaluates video quality according to the deviation between the task results and standard values.
  7. The machine-perspective video quality evaluation system based on human-machine dual-system perception according to claim 6, wherein the large-model task evaluation module designs true/false and multiple-choice questions related to the video content, and computes the deviation between predicted and true values in the regression task, the semantic similarity in the visual question-answering task and the text evaluation metrics in the video description task, thereby quantifying the influence of video quality on machine understanding.
  8. The system of claim 6, wherein the large-model task evaluation module is configured to train large multimodal models (LMMs) with a multi-task joint loss function so as to drive the multimodal large model to perform the regression, classification, visual question-answering and video description tasks.
  9. The system of claim 5, wherein the specific-model task evaluation module invokes dedicated models for segmentation, detection and retrieval to perform the corresponding tasks, and the resulting metrics of mask accuracy, detection accuracy and recall rate are used to quantify the influence of the video on the machine tasks.
  10. The machine-perspective video quality evaluation system based on human-machine dual-system perception according to claim 9, wherein the specific-model task evaluation module executes a segmentation task by segmenting the video content with a segmentation model, computing the intersection-over-union between the segmentation result and a standard mask, and evaluating the influence of video quality on the segmentation task; executes a detection task by detecting targets in the video with a detection model, computing the detection accuracy and miss rate, and evaluating the influence of video quality on the detection task; and executes a retrieval task by retrieving video content with a retrieval model, computing the retrieval recall rate, and evaluating the influence of video quality on the retrieval task.
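The task metrics named in claims 9 and 10 (intersection-over-union for segmentation, precision and miss rate for detection, recall for retrieval) are standard quantities. A minimal self-contained sketch follows; all function names and sample data are invented for illustration and are not taken from the patent:

```python
def mask_iou(pred, gt):
    """Intersection-over-union between two equal-length binary masks
    (flattened 0/1 lists), comparing a segmentation result to a
    standard (ground-truth) mask."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0

def precision_and_miss_rate(tp, fp, fn):
    """Detection precision and miss (omission) rate from true-positive,
    false-positive and false-negative match counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    miss_rate = fn / (tp + fn) if (tp + fn) else 0.0
    return precision, miss_rate

def retrieval_recall(retrieved, relevant):
    """Fraction of relevant items that a retrieval model returned."""
    relevant = set(relevant)
    return len(relevant & set(retrieved)) / len(relevant) if relevant else 1.0

print(mask_iou([1, 1, 0, 0], [1, 0, 0, 0]))      # 0.5
print(precision_and_miss_rate(8, 2, 2))          # (0.8, 0.2)
print(retrieval_recall(["a", "b"], ["a", "c"]))  # 0.5
```

In the claimed system, a drop in these metrics on a degraded video relative to a clean reference run is what quantifies how much the distortion interferes with the machine task.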

Description

Machine-perspective video quality evaluation system based on human-machine dual-system perception

Technical Field

The invention belongs to the field of video quality evaluation, and particularly relates to a machine-perspective video quality evaluation system based on human-machine dual-system perception.

Background

In recent years, the popularity of robotics and smart devices has spawned a large amount of Robot-Generated Content (RGC). RGC is content shot from a robot's viewpoint: video or image content automatically collected and generated by machines with sensing and generating capabilities, typically from the first-person view of a machine such as a robot, a surveillance camera or an autonomous vehicle, and widely used in scenarios such as embodied intelligence, autonomous driving and intelligent manufacturing. Examples include first-person-view video captured by devices such as unmanned aerial vehicles, wheeled robots or robotic arms. Such videos have unique motion patterns (prone to strong vibration and mechanical shake), device-induced distortions (e.g. sensor noise, self-occlusion) and task-oriented characteristics (e.g. remote operation, surveillance scenarios), and differ essentially from traditional User-Generated Content (UGC) and Professionally-Generated Content (PGC). UGC is content created by ordinary users and uploaded to network platforms, such as short videos, comments and pictures; it is an important component of the social-media content ecosystem and is characterized by broad participation and diversity. PGC is content created and produced by professional teams or institutions; it generally has a higher production standard, is widely applied in fields such as news, film and education, and is common on streaming platforms and in traditional media.
However, existing Video Quality Assessment (VQA) techniques have not established a specialized assessment framework for the characteristics of RGC, and therefore cannot accurately quantify the quality of RGC video in robotic task scenarios. VQA is a technique for objectively or subjectively evaluating video quality, often combining a human-eye perception model with deep learning methods. Meanwhile, traditional video quality assessment is mainly based on human subjective preference: it builds a human-centered evaluation system by simulating how the Human Visual System (HVS) perceives low-level attributes such as color, texture and structure. The HVS is the perceptual system in which the human eyes and brain cooperatively process visual information; it is sensitive to characteristics such as color, texture and structure, and is the theoretical basis of most image and video quality evaluation methods. Such methods have some applicability in human-oriented viewing scenarios, but as machines become the primary consumers of visual data (e.g. autonomous driving, industrial inspection, multimodal-large-model-driven intelligent systems), their limitations become increasingly prominent. The prior art has obvious deficiencies in machine-perspective video quality evaluation: 1. The evaluation dimension is single: it focuses only on human visual subjective experience and neglects the machine's preferences regarding video quality, which are determined by downstream task performance such as target detection and semantic segmentation.
For example, in industrial vision, fine edge blurring has little influence on humans but markedly reduces the recognition accuracy of a machine detection model, while color deviations that humans are sensitive to have limited influence on some machine tasks. Human-centered logic therefore cannot reflect how a video performs in machine task scenarios, and in particular it is difficult to meet special RGC-scenario requirements such as the temporal consistency needed for autonomous robot navigation. 2. Human and machine perception mechanisms differ significantly, and this difference is not addressed. Human vision focuses on the overall appeal of the video and on low-level features such as brightness and contrast, whereas machine vision focuses on task-result consistency, such as detection-box accuracy. The Machine Vision System (MVS) is a machine perception framework composed of sensors, image-processing algorithms and artificial intelligence; it has target detection, recognition and analysis capabilities and is widely applied in industrial automation, security monitoring, intelligent driving and other fields. Machines are highly sensitive to distortions such as lens blur but robust to average brightness variations, whereas humans are more sensitive to the latter. The prior art