CN-122023369-A - 4K face quality analysis method based on artificial intelligence

CN122023369A

Abstract

The invention relates to the field of artificial intelligence and technical quality inspection of 4K ultra-high-definition programs, and in particular provides an artificial-intelligence-based 4K face quality analysis method. The method comprises: constructing a dataset and using it to train a face detection model based on an improved YOLOv; extracting and analyzing HSY color space features of the frontal-face region; adding an RGB channel proportion constraint term to the distance formula to improve the K-means clustering algorithm; constructing a frontal-face multi-color-space conversion algorithm; integrating and improving ResNet and Vision Transformer on the JzAzBz output of that algorithm to construct and train a 4K image quality assessment visual model; integrating the output results through a large model to generate optimization suggestions that meet 4K ultra-high-definition program production specifications; and performing multi-dimensional visual output.

Inventors

  • HAN GUODONG
  • GUAN ZHIPENG
  • QIU JIANPENG
  • DU HUI
  • ZHANG ZHICHEN
  • YIN XIAOYANG
  • ZHU XUN

Assignees

  • 山东广播电视台
  • 山东广电信通网络运营有限公司
  • 海看网络科技(山东)股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-02-06

Claims (9)

  1. A 4K face quality analysis method based on artificial intelligence, the method comprising: S1, obtaining original images and producing a dataset; S2, constructing and training a face detection model based on an improved YOLOv using the dataset; S3, extracting and analyzing HSY color space features of the frontal-face region according to the face detection model; S4, adding an RGB channel proportion constraint term to the distance formula to improve the K-means clustering algorithm; S5, based on the improved K-means clustering algorithm, building a frontal-face multi-color-space conversion algorithm based on the BT.2020 color gamut, completing conversion from RGB to the perceptually uniform color space JzAzBz, the luma/color-difference color space YCbCr, xy chromaticity coordinates and HSV, and calculating the ideal display luminance Lv; S6, integrating and improving ResNet and Vision Transformer on the JzAzBz output of the frontal-face multi-color-space conversion algorithm, and building and training a 4K image quality assessment visual model; S7, integrating the HSY color space features, the output of the frontal-face multi-color-space conversion algorithm and the output of the 4K image quality assessment visual model through a Deepseek large model, judging whether the original image meets 4K production specifications, and generating optimization suggestions; S8, performing multi-dimensional visual output based on the HSY color space features and the output of the frontal-face multi-color-space conversion algorithm.
  2. The method according to claim 1, wherein step S1 specifically comprises: S11, recording N presenters in the upgraded 4K ultra-high-definition studio, and extracting frames from the segments of the different presenters' videos that contain key features to obtain original images, the original images containing different backgrounds, standard frontal faces under different lighting, slight head turns, and presenter poses with different expressions; S12, annotating the key features of the original images to obtain the positions, widths, types and labels of the key features, adding a frontal-face confidence label to the labels, wherein 1 denotes a standard frontal face and 0 denotes a non-frontal face, and dividing the data into a training set and a validation set at an 8:2 ratio to complete production of the dataset (see the annotation/split sketch after the claims).
  3. The method according to claim 1, wherein step S2 specifically comprises: S21, replacing the convolutional neural network CNN with an SCNN, a convolutional neural network based on a self-normalizing neural network, and adding a lightweight self-attention module after the second SCNN layer of the Bottleneck module, the lightweight self-attention module attending to the spatial dependencies of the forehead, eyes, nose, left cheek, right cheek and mouth in the feature map; adding a gated recurrent unit GRU after the self-attention layer to convert spatial features into sequence features for processing, flattening the feature map output by the lightweight self-attention module into sequence form as the input of the GRU, each spatial position being one time step; attaching a 1×1 SCNN to restore the channel count, and residually connecting the 1×1 SCNN output with the second SCNN layer of the Bottleneck; S22, improving the A2C2f module and defining the improved A2C2f as SRA2C2f: replacing the single-path convolution with a dual-branch residual convolution, first performing feature splitting, using the original branch for local feature extraction, adding a 1×3 CNN to capture horizontally symmetric features and a 3×1 CNN to capture vertically symmetric features, adding the outputs of the two branches element-wise, and then compressing the channels through a 1×1 CNN (see the PyTorch sketch after the claims); S23, improving the YOLOv network structure by adding a frontal-face pose pre-judgment branch after the first CNN of the input layer, the branch consisting of two SCNN layers, the SCNN self-normalizing the network's internal features through the SELU activation function; the third layer is a parallel structure of AGSC3k2 and a side-face sample filtering module, the internal logic of which is that when the predicted face deflection angle exceeds a preset threshold, its weight in subsequent feature extraction is attenuated by a mask; replacing the original C3k2 module with AGSC3k2, the original CNN module with SCNN, and the original A2C2f with SRA2C2f; adding a frontal-feature attention module in the Backbone network, which uses the positions of the frontal facial organs as a prior to distribute the learned attention weights, presetting the coordinate ranges of the frontal facial organs and then generating attention masks of the corresponding regions through the SCNN to increase the feature weights of those regions; S24, expanding the detection branches of YOLOv: adding a layer of AGSC3k2, SCNN and SRA2C2f at the tail of the Backbone network, adding a layer of SRA2C2f at the tail of the left branch of the Neck network followed by Concat splicing and Upsample operations, adding SRA2C2f and SCNN above the last AGSC3k2 layer of the right branch followed by a Concat splicing operation, and adding one detection layer in the Head network; S25, freezing the first 50% of the Backbone layers and training SRA2C2f and the Head with a learning rate of 1e-5 using the SGD optimizer so that SRA2C2f learns the basic distribution patterns, then unfreezing all network layers, adjusting the learning rate to 5e-5, introducing mixed-precision training to improve efficiency, improving model generalization through data augmentation so that the model better captures symmetric features and key regions of the face, performing hyperparameter tuning based on the validation-set results, adjusting the face screening threshold and the NMS IoU threshold, adopting an early-stopping strategy, and finally saving model weights that balance detection accuracy and real-time performance, with a frontal-face average precision AP ≥ 86%.
  4. The method according to claim 1, wherein step S3 specifically comprises: S31, performing HSY color space analysis on the face region detected by the face detection model, and extracting the hue H, saturation S and luminance Y feature parameters, wherein the luminance Y is calculated using the human-eye visual sensitivity weighting formula Y = 0.299×R + 0.587×G + 0.114×B; S32, performing HSY color statistical analysis of the frontal-face region, including a hue distribution histogram, saturation mean calculation and luminance range evaluation, providing color feature data for skin-tone quality evaluation and face display-effect optimization (see the sketch after the claims).
  5. The method according to claim 1, wherein step S4 specifically comprises: improving the K-means clustering algorithm by adding an RGB channel proportion constraint term to the distance formula, the formula being: D_total = Σ_{c∈{R,G,B}} w_c × |x_c − y_c|² + λ × Σ_{c∈{R,G,B}} |p_c − q_c|; wherein D_total denotes the total distance; x_c and y_c denote the normalized RGB values of the two pixels on color channel c; w_c denotes the channel weight; Σ_{c∈{R,G,B}} denotes a loop traversing the three color channels R, G and B; |x_c − y_c|² denotes the square of the absolute difference of the two pixels on channel c; λ × Σ_{c} |p_c − q_c| is the proportion constraint term, with λ a balance coefficient that measures the difference in RGB channel ratios of the two pixels and adjusts the weight of the constraint term; p_c = x_c/(x_R + x_G + x_B) is the proportion of the first pixel on channel c, where x_R + x_G + x_B is the total brightness of the first pixel; q_c = y_c/(y_R + y_G + y_B) is the proportion of the second pixel on channel c, where y_R + y_G + y_B is the total brightness of the second pixel; and |p_c − q_c| is the difference in the proportions of the two pixels on channel c (see the sketch after the claims).
  6. The method according to claim 1, wherein step S5 specifically comprises: S51, receiving the RGB values of the color regions detected by the improved K-means clustering algorithm, and adopting a parallel conversion strategy in which several independent conversion paths run on the RGB simultaneously: S511, converting RGB to YCbCr by the following formulas: Y = 0.2627×R + 0.6780×G + 0.0593×B; Cb = −0.1396×R − 0.3604×G + 0.5000×B; Cr = 0.5000×R − 0.4598×G − 0.0402×B; wherein Y is the luma, Cb is the blue color difference and Cr is the red color difference; S512, converting RGB to XYZ using the BT.2020 white point and a 3×3 conversion matrix, then to CIE 1931 xy chromaticity coordinates by x = X/(X+Y+Z), y = Y/(X+Y+Z); converting the CIE 1931 xy chromaticity coordinates to the perceptually uniform color space JzAzBz, multiplying the XYZ values by 1000 during the JzAzBz conversion to adjust the luminance range; S513, converting RGB to the HSV color space; S514, obtaining the ideal display luminance Lv from the received YCbCr input values, where Y ∈ [64, 940], Cb ∈ [64, 960] and Cr ∈ [64, 960], and normalizing the YCbCr: Yf = (Y − 64)/(940 − 64); Cbf = (Cb − 512)/448.0; Crf = (Cr − 512)/448.0; wherein in the Y range [64, 940], 64 is the black level and 940 is the peak white level; in the Cb and Cr ranges [64, 960], 512 is the zero color-difference center value; Yf denotes the normalized luma component; Cbf denotes the normalized blue color-difference component, range [−1, 1]; Crf denotes the normalized red color-difference component, range [−1, 1]; 940 − 64 = 876 is the effective quantization range of the luma component in the BT.2020 standard, and 448 is the quantized amplitude coefficient of the color-difference components in the BT.2020 standard; the encoded RGB values are then calculated using the BT.2020-standard YCbCr-to-RGB conversion matrix: R = Yf + 1.4746×Crf; G = Yf − 0.16455×Cbf − 0.57135×Crf; B = Yf + 1.8814×Cbf; wherein R, G and B denote the converted encoded RGB values, range [0, 1]; 1.4746 is the conversion coefficient of the red color difference to the red component, 0.16455 the influence coefficient of the blue color difference on the green component, 0.57135 the influence coefficient of the red color difference on the green component, and 1.8814 the conversion coefficient of the blue color difference to the blue component; S52, processing luminance information using the HLG inverse opto-electronic transfer function, calculating the linear luminance E and the scaling coefficient K = E/Ep using the exponential slope parameter a = 0.17883277, the exponential offset parameter b = 0.28466892 and the segment offset parameter c = 0.55991073: when Ep ≤ 0.5, E = Ep²/3.0; when Ep > 0.5, E = (exp((Ep − c)/a) + b)/12.0; wherein Ep denotes the normalized luminance input value, equal to Yf; E denotes the linear luminance output value; 3.0 is the quadratic coefficient of the low-luminance segment; 12.0 is the normalization coefficient of the high-luminance segment; and K = E/Ep is the scaling factor between linear luminance and encoded luminance; the true linear RGB values are obtained via linear RGB = K × encoded RGB, the linear RGB being computed by element-wise multiplication of the scaling coefficient K with the encoded RGB, after which the conversion from the BT.2020 color space to XYZ is performed (see the sketch after the claims); S53, after all conversions are completed, the system integrates the results of all the color spaces into a unified dictionary structure, including the display RGB, YCbCr, xy chromaticity coordinates, JzAzBz, HSV and the ideal display luminance Lv, wherein JzAzBz comprises the perceived lightness Jz, the red-green perceptual color difference Az and the yellow-blue perceptual color difference Bz, and HSV comprises hue, saturation and value.
  7. The method according to claim 1, wherein step S6 specifically comprises: S61, collecting pairs of undistorted original 4K images and defective 4K images covering different scenes, illumination conditions and content types, and converting all images to the JzAzBz color space as model input; S62, constructing the 4K image quality assessment visual model by integrating and improving ResNet and Vision Transformer; S63, in a transfer-learning manner, converting a 4K image quality assessment dataset to the JzAzBz color space and pre-training with an improved perceptually weighted mean squared error PW-MSE to initialize the model parameters, inputting the 4K image quality assessment dataset to the model, setting the initial learning rate to 0.0005 with a lower limit of 5e-7 under a cosine annealing strategy, setting the batch size to 64, training with the Adam optimizer and adjusting the model parameters by backpropagation, setting the number of training iterations to 100 epochs, computing the PW-MSE on the validation set after each epoch, and stopping training if the loss value shows no obvious decrease for 10 consecutive epochs (see the training-loop sketch after the claims); S64, during training, combining data augmentation with channel weighting by performing color jittering in the JzAzBz space, the color jittering comprising expanding the offset range of the Jz channel by 6% and reducing that of the Az and Bz channels by 3%, together with random cropping; after training, obtaining the model's PW-MSE on the test set and checking the model's convergence.
  8. The method according to claim 1, wherein step S7 specifically comprises: S71, performing parameter optimization on the Deepseek model, adjusting the model parameters by analyzing the complexity of the input color features: if the features are simple, setting the temperature to 0.2-0.4 and top_p to 0.91-0.95; if the features are complex, adjusting the temperature to 0.6-0.8 and top_p to 0.79-0.87; the criteria for judging simple versus complex being whether the parameters lie within the single BT.2020 color gamut, whether the JzAzBz channel parameters lie within their conventional effective ranges, and whether the gamut conversion mode is recorded; if all hold, the parameters are simple, otherwise they are complex (see the sketch after the claims); S72, integrating, with Deepseek, the HSY color space features, the output of the frontal-face multi-color-space conversion algorithm and the output of the 4K image quality assessment visual model, and generating, in combination with a 4K HDR knowledge base, a specialized analysis report based on the EDID metadata of the HDR monitor, the report comprising lighting angles, camera parameters, parameter optimization schemes for different HDR standards, 4K production-specification conformity scores and deduction-item descriptions.
  9. The method according to claim 1, wherein step S8 specifically comprises: generating a color cluster distribution pie chart according to the improved K-means clustering algorithm, and marking the share of each cluster region within the BT.2020 color gamut (see the plotting sketch after the claims).
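
For claim 2, the following is a minimal, non-limiting Python sketch of the S12 annotation schema and the 8:2 training/validation split; the field names and the split helper are illustrative assumptions, not taken from the patent.

import random
from dataclasses import dataclass

@dataclass
class FaceAnnotation:
    image_path: str
    bbox: tuple          # (x, y, w, h): position and width of the key feature
    feature_type: str    # e.g. "forehead", "eyes", "nose", "cheek", "mouth"
    frontal_conf: int    # 1 = standard frontal face, 0 = non-frontal

def split_dataset(annotations, train_ratio=0.8, seed=42):
    """Shuffle and split the annotations into training and validation sets (8:2)."""
    rng = random.Random(seed)
    items = list(annotations)
    rng.shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]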
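For claim 3, a minimal PyTorch sketch of the S22 dual-branch residual convolution inside SRA2C2f: feature splitting, a local branch, 1×3 and 3×1 symmetric branches, element-wise addition and 1×1 channel compression. The class name, channel sizes and layer choices are assumptions for illustration only.

import torch
import torch.nn as nn

class DualBranchResidualConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2  # assumes an even channel count
        # Original branch: local feature extraction (3x3 conv).
        self.local = nn.Conv2d(half, half, kernel_size=3, padding=1)
        # Symmetric branches: 1x3 captures horizontal, 3x1 vertical symmetry.
        self.horiz = nn.Conv2d(half, half, kernel_size=(1, 3), padding=(0, 1))
        self.vert = nn.Conv2d(half, half, kernel_size=(3, 1), padding=(1, 0))
        # 1x1 conv compresses channels after the two branches are fused.
        self.compress = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)       # feature splitting
        local = self.local(a)                 # original branch
        sym = self.horiz(b) + self.vert(b)    # element-wise addition of dual branch
        out = self.compress(torch.cat([local, sym], dim=1))
        return out + x                        # residual connection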
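For claim 4, a minimal sketch of the S31/S32 HSY feature extraction, assuming an 8-bit RGB face crop supplied as a NumPy array; the statistic layout of the returned dictionary is illustrative.

import numpy as np
import colorsys

def hsy_features(face_rgb: np.ndarray):
    """Return hue histogram, mean saturation and luminance range of a face crop."""
    rgb = face_rgb.astype(np.float64) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Luminance via the human-eye sensitivity weights quoted in the claim.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in rgb.reshape(-1, 3)])
    hue_hist, _ = np.histogram(hsv[:, 0], bins=36, range=(0.0, 1.0))
    return {
        "hue_histogram": hue_hist,
        "saturation_mean": float(hsv[:, 1].mean()),
        "luminance_range": (float(y.min()), float(y.max())),
    }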
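For claim 5, a minimal sketch of the distance formula with the RGB channel proportion constraint term as reconstructed above; the channel weights w and the balance coefficient lam shown here are example values, not values fixed by the patent.

import numpy as np

def constrained_distance(x, y, w=(1.0, 1.0, 1.0), lam=0.5, eps=1e-8):
    """Weighted squared channel difference plus the RGB-ratio constraint term."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    base = sum(w[c] * abs(x[c] - y[c]) ** 2 for c in range(3))
    p = x / (x.sum() + eps)   # per-channel proportion of the first pixel
    q = y / (y.sum() + eps)   # per-channel proportion of the second pixel
    constraint = lam * np.abs(p - q).sum()
    return base + constraint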
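For claim 6, a minimal sketch of the S511 RGB-to-YCbCr conversion and the S52 HLG inverse opto-electronic transfer function, using the coefficients quoted in the claim; the helper names are assumptions.

import numpy as np

def rgb_to_ycbcr_bt2020(r, g, b):
    """BT.2020 luma and color-difference components from normalized RGB."""
    y  =  0.2627 * r + 0.6780 * g + 0.0593 * b
    cb = -0.1396 * r - 0.3604 * g + 0.5000 * b
    cr =  0.5000 * r - 0.4598 * g - 0.0402 * b
    return y, cb, cr

def hlg_inverse_oetf(ep):
    """HLG inverse OETF: encoded luminance Ep in [0,1] -> linear luminance E."""
    a, b, c = 0.17883277, 0.28466892, 0.55991073
    ep = np.asarray(ep, dtype=np.float64)
    low = (ep ** 2) / 3.0                      # segment for Ep <= 0.5
    high = (np.exp((ep - c) / a) + b) / 12.0   # segment for Ep > 0.5
    return np.where(ep <= 0.5, low, high)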
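For claim 7, a minimal sketch of the S63 training schedule: Adam, an initial learning rate of 0.0005 with cosine annealing to a 5e-7 floor, and an early stop after 10 epochs without improvement. Here model, the data loaders and the PW-MSE loss are placeholders for components the patent defines elsewhere.

import torch

def train(model, train_loader, val_loader, pw_mse_loss, epochs=100, patience=10):
    opt = torch.optim.Adam(model.parameters(), lr=5e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=5e-7)
    best, stale = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for x, target in train_loader:          # batch size 64 per the claim
            opt.zero_grad()
            loss = pw_mse_loss(model(x), target)
            loss.backward()                     # backpropagation
            opt.step()
        sched.step()
        model.eval()
        with torch.no_grad():                   # PW-MSE on the validation set
            val = sum(pw_mse_loss(model(x), t).item() for x, t in val_loader)
        if val < best - 1e-6:
            best, stale = val, 0
        else:
            stale += 1
            if stale >= patience:               # no clear decrease for 10 epochs
                break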
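For claim 8, a minimal sketch of the S71 parameter selection; the boolean complexity test and the exact values chosen inside the quoted ranges are illustrative assumptions.

def select_llm_params(in_bt2020_gamut: bool, jzazbz_in_range: bool,
                      gamut_conversion_recorded: bool) -> dict:
    """Pick temperature/top_p per the simple-vs-complex criteria of S71."""
    simple = in_bt2020_gamut and jzazbz_in_range and gamut_conversion_recorded
    if simple:
        return {"temperature": 0.3, "top_p": 0.93}   # within 0.2-0.4 / 0.91-0.95
    return {"temperature": 0.7, "top_p": 0.83}       # within 0.6-0.8 / 0.79-0.87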
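For claim 9, a minimal matplotlib sketch of the S8 color-cluster pie chart; the cluster labels and shares are example data only.

import matplotlib.pyplot as plt

def plot_cluster_pie(labels, shares, path="cluster_distribution.png"):
    """Render each cluster region's share within the BT.2020 gamut as a pie chart."""
    fig, ax = plt.subplots()
    ax.pie(shares, labels=labels, autopct="%1.1f%%")
    ax.set_title("Color cluster distribution (BT.2020 gamut)")
    fig.savefig(path)

plot_cluster_pie(["skin highlight", "skin midtone", "shadow"], [0.42, 0.38, 0.20])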

Description

4K face quality analysis method based on artificial intelligence

Technical Field

The invention relates to the field of artificial intelligence and technical quality inspection of 4K ultra-high-definition programs, and in particular to an artificial-intelligence-based 4K face quality analysis method.

Background

At present, mainstream broadcast-television studios produce and broadcast programs at 1080P high definition, and upgrading studio image quality from 1080P to 4K ultra-high definition has become an inevitable trend for improving program production quality and broadcast effect. In a 4K ultra-high-definition image quality evaluation system, the image quality of the frontal-face region of the studio presenter is one of the core evaluation indexes; parameters such as sharpness, color reproduction and detail richness directly determine whether the overall program image quality meets the 4K ultra-high-definition production specification.

However, existing face quality analysis techniques have obvious shortcomings when adapting to the 4K upgrade requirements of a studio. First, the accuracy of acquiring frontal-face images is insufficient. The industry mostly uses basic YOLO-series algorithms or Haar feature cascade classifiers for face detection, but these methods are easily disturbed by factors such as lighting changes and presenter body movement in the dynamic scene of a studio, and the detection results are often mixed with side-face, half-side-face or pose-offset images, so the frontal-face region meeting the 4K quality inspection standard cannot be accurately extracted and subsequent manual screening is still required. For example, in a live news scene, a small turn of the presenter's head may cause the detection system to misjudge and include a non-frontal image in the analysis scope, directly affecting the accuracy of the quality evaluation.

Second, the quality parameter analysis process is tedious and the labor cost is high. After the preliminary facial image is acquired, the prior art needs to combine manual work with professional software and equipment to analyze the quality parameters, for example manually extracting parameters such as the luma component and chromaticity deviation of the YCbCr color space through tools such as Photoshop and Matlab, and then checking them one by one against the 4K standard. The process is complex to operate, time-consuming and dependent on the experience of professional technicians, and manual operation errors easily distort the analysis results. Taking a single 30-minute program as an example, 2-3 technicians need more than one hour just to complete the facial quality parameter analysis step, which greatly reduces program production efficiency.

In addition, the efficiency of searching for technical improvements and specifications is low. When the existing methods cannot meet the 4K analysis requirements, technicians must look for improvement ideas by consulting a large volume of industry literature, ultra-high-definition production specifications and patent data. This process consumes a great deal of manpower for information screening and technical verification, and it is difficult to quickly combine the latest technical achievements with the actual studio scene, so the technology upgrade cycle is long and the adaptation cost is high.

Finally, the parameter dimensions are narrow and intelligent analysis capability is lacking. Existing analysis methods focus on basic parameters such as brightness and contrast; high-precision color parameters conforming to the BT.2020 color gamut standard, such as JzAzBz, are difficult to extract, so the quality characteristics of a 4K face image cannot be comprehensively reflected. Meanwhile, the ability to fuse with a professional knowledge base and a large model is missing, targeted improvement suggestions cannot be given for detected quality problems, and technicians must formulate optimization schemes on their own, further increasing the difficulty of quality improvement.

Disclosure of Invention

In view of the above, the invention provides an artificial-intelligence-based 4K face quality analysis method to improve face detection accuracy, efficiently acquire multi-dimensional image information, realize intelligent analysis and optimization-suggestion output, replace manual operation and reduce labor cost.

In a first aspect, the present invention provides an artificial-intelligence-based 4K face quality analysis method, the method comprising: S1, obtaining original images and producing a dataset; S2, constructing and training a face detection model based on an improved YOLOv using the dataset