CN-121998875-A - Multimodal face restoration and expression recognition system and method based on semantic guidance of facial action units
Abstract
The invention belongs to the technical fields of image restoration and computer vision, and particularly relates to a multimodal face restoration and expression recognition system and method based on semantic guidance of facial action units (AUs). The system extracts multi-scale features through visual encoding and uses a graph neural network to detect AU activation probabilities; a semantic conversion module converts the numerical probabilities into interpretable, biomechanically structured text; a multimodal reasoning module fuses the visual and textual information, introduces the common-sense reasoning capability of a multimodal large language model, and improves the student network's performance through knowledge distillation; finally, a conditional generation module performs image restoration under the guidance of the semantic features. The invention uses facial action units as a biological intermediary to convert multimodal reasoning capability into restoration constraints, addressing three defects of the prior art: physiologically distorted restorations, recognition that depends heavily on image quality, and the isolation of the two tasks. It thereby achieves cooperative enhancement of face restoration and expression recognition, ensuring that restoration results are visually clear and conform to the physiological rules of facial muscles.
Inventors
- XIE JINGYUAN
- TIAN CHUNWEI
Assignees
- Yangtze River Delta Research Institute of Northwestern Polytechnical University, Taicang (西北工业大学太仓长三角研究院)
- Shenzhen Research Institute of Northwestern Polytechnical University (西北工业大学深圳研究院)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-04-10
Claims (10)
- 1. A multimodal face restoration and expression recognition system based on semantic guidance of facial action units, characterized by comprising a core processing module for cooperative work, a visual encoding and AU detection module, a semantic conversion and lifting module, a multimodal reasoning and guidance generation module, and a conditional generation module; the visual encoding and AU detection module extracts multi-scale features from a face image to be restored and outputs a multi-scale visual feature representation, while modeling the relevance of facial action units with a graph neural network and outputting an AU activation probability vector representing facial muscle movement probabilities; the semantic conversion and lifting module is connected to the visual encoding and AU detection module and converts the AU activation probability vector, through predefined biomechanical mapping rules, into a structured guide text containing determined actions and possible actions, realizing the conversion from numerical features to high-level interpretable semantic features; the multimodal reasoning and guidance generation module is connected to both the visual encoding and AU detection module and the semantic conversion and lifting module, performs multimodal fusion of the multi-scale visual feature representation, the AU activation probability vector, and the structured guide text, uses multimodal reasoning capability to complete potential action units and infer muscle movement trends, and generates a semantic guiding feature vector and an expression label; the conditional generation module is connected to both the visual encoding and AU detection module and the multimodal reasoning and guidance generation module, receives the multi-scale visual feature representation and the semantic guiding feature vector, injects the semantic guiding feature vector as a conditioning signal into a convolutional generation network to realize guided image restoration, and outputs a restored face image that conforms to the physiological rules of facial muscle movement; the system performs multi-task joint optimization through a pixel reconstruction loss, an AU consistency loss, and an expression classification loss based on biomechanical priors.
- 2. The system of claim 1, wherein the visual encoding and AU detection module comprises a multi-scale visual encoding unit and a facial action unit detection branch connected in cascade; the multi-scale visual encoding unit adopts a Transformer architecture based on a shifted-window self-attention mechanism and comprises four feature extraction blocks in sequence, each of which comprises, in order, a downsampling layer, a window multi-head self-attention layer, and a shifted-window multi-head self-attention layer, and is used to extract multi-scale visual feature representations from local texture to global structure; the facial action unit detection branch is connected to the outputs of the second and third feature extraction blocks of the multi-scale visual encoding unit and outputs the AU activation probability vector.
- 3. The system of claim 2, wherein the facial action unit detection branch comprises a learnable graph-structure network and a fully connected layer; the nodes of the graph network correspond to predefined facial action units, the edge weights are initialized by and dynamically updated through a trainable dependency matrix, the graph network performs message passing and feature aggregation among the nodes through a graph attention mechanism, and the aggregated features are converted into the AU activation probability vector through the linear mapping and nonlinear activation of the fully connected layer.
- 4. The system of claim 1, wherein the semantic conversion and lifting module comprises a semantic rule mapping library and a text generation unit; the semantic rule mapping library stores mappings from single facial action units to corresponding biomechanical semantic phrases, from facial action unit combinations to corresponding biomechanical semantic phrases, and from facial action unit combinations to compound expression or muscle-coordination pattern descriptions, which together realize the mapping from single facial action units and their combinations to interpretable semantic descriptions; the text generation unit receives the AU activation probability vector and applies a three-segment judgment mechanism, executing independent activation-degree judgment rules on single facial action units and on facial action unit combinations against a preset high threshold and a preset low threshold, partitioning the AUs into three activation sets: a "determined activation" set for AUs whose activation probability is greater than or equal to the high threshold, a "possible activation" set for AUs whose activation probability lies between the low threshold and the high threshold, and a "non-activation" set for AUs whose activation probability is below the low threshold, with analogous rules applied to facial action unit combinations; the text generation unit generates the corresponding structured guide text for the determined activation set and the possible activation set according to the mappings in the semantic rule mapping library, the non-activation set does not participate in text generation, and the text generation unit outputs a descriptive text representing a neutral expression when the AU activation probabilities of all facial action units are below the preset high threshold.
- 5. The system of claim 1, wherein, during the reasoning phase, the multimodal reasoning and guidance generation module includes a multimodal reasoning sub-network, a student text encoder, and a guiding feature fusion layer; the multimodal reasoning sub-network converts the AU activation probability vector into an AU semantic embedding vector through a semantic projector, maps the multi-scale visual feature representation into an AU visual embedding vector through a visual projection layer, and, after concatenating the two embedding vectors along the channel dimension, obtains a multimodal fusion feature through a feature fusion layer; the student text encoder encodes the structured guide text into a fixed-length text semantic embedding vector through a character-level embedding layer, a bidirectional gated recurrent unit, and a fully connected projection layer; the guiding feature fusion layer concatenates the multimodal fusion feature and the text semantic embedding vector along the feature dimension, compresses the result into a semantic guiding feature vector of a preset dimension through linear mapping, and supplies the semantic guiding feature vector as the control signal of the conditional generation module.
- 6. The system of claim 5, wherein, in the training stage, the multimodal reasoning and guidance generation module adopts a two-tier teacher-student architecture: a multimodal large language model and a chain-of-thought reasoning control unit are added on top of the reasoning-stage architecture, and a knowledge distillation mechanism enables the student network to learn the common-sense reasoning capability and semantic expression accuracy of the large model.
- 7. The system of claim 6, wherein the multimodal large language model is a Transformer-based vision-language joint encoding architecture that integrates a vision encoder and a text encoder, aligns images and texts in a unified semantic space, and outputs teacher semantic embedding vectors and guiding natural-language descriptions; the chain-of-thought reasoning control unit constructs and injects multi-step reasoning instructions to control the multimodal large language model to perform two-stage reasoning, wherein the first stage completes AU states that were not detected due to occlusion or degradation, and the second stage infers the expression category and explains the reasons based on the completed AU states; and the multimodal reasoning sub-network is optimized end to end under the joint constraint of the pixel reconstruction loss, the AU consistency loss, and the expression classification loss.
- 8. The system of claim 1, wherein the conditional generation module adopts a progressive upsampling decoding structure centered on a decoding generation network with a built-in modulation mechanism, comprising in sequence three levels of upsampling units, convolution blocks and conditional adaptive normalization subunits in one-to-one correspondence with the upsampling units, and an output convolution layer; after each upsampling unit recovers the resolution of the feature map, the corresponding convolution block extracts local texture information and the corresponding conditional adaptive normalization subunit performs affine modulation using the semantic guiding feature vector as the control signal; the conditional adaptive normalization subunit takes the semantic guiding feature vector as the control signal, generates channel-wise scale coefficients and bias terms at the corresponding scale in real time, performs affine modulation on the current-layer feature map, and the final restoration result is obtained by passing the output of the output convolution layer through a normalizing nonlinear function.
- 9. A multimodal face restoration and expression recognition method based on semantic guidance of facial action units, characterized by comprising the following steps: S1, acquiring a face image to be restored, extracting its multi-scale features to obtain a multi-scale visual feature representation, and simultaneously modeling the relevance of facial action units with a graph neural network to output an AU activation probability vector representing facial muscle movement probabilities; S2, performing semantic conversion and lifting on the AU activation probability vector through predefined biomechanical mapping rules, generating a structured guide text containing determined actions and possible actions, and realizing the conversion from numerical features to high-level interpretable semantic features; S3, performing multimodal fusion of the multi-scale visual feature representation, the AU activation probability vector, and the structured guide text, using multimodal reasoning capability to complete potential action units and infer muscle movement trends, and generating a semantic guiding feature vector and an expression label; S4, receiving the multi-scale visual feature representation and the semantic guiding feature vector, injecting the semantic guiding feature vector as a condition signal into a convolutional generation network to perform guided image restoration, and outputting a restored face image conforming to the physiological rules of facial muscle movement; the method realizes cooperative enhancement of face restoration and expression recognition through multi-task joint optimization with a pixel reconstruction loss, an AU consistency loss, and an expression classification loss based on biomechanical priors.
- 10. The method according to claim 9, wherein the method is implemented using the multimodal face restoration and expression recognition system based on semantic guidance of facial action units according to any one of claims 1 to 8.
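As an illustration of the three-segment judgment mechanism described in claim 4, the following Python sketch partitions AU activation probabilities into determined, possible, and non-activation sets against two thresholds and emits a structured guide text. The AU names, threshold values, and phrase table are hypothetical examples, not values specified in the patent.

```python
# Hypothetical AU-to-biomechanical-phrase mapping (illustrative only).
AU_PHRASES = {
    "AU4": "brow lowerer (corrugator contraction)",
    "AU6": "cheek raiser (orbicularis oculi tightening)",
    "AU12": "lip corner puller (zygomaticus major contraction)",
}

def partition_aus(probs, low=0.3, high=0.7):
    """Split AUs into determined / possible / non-activation sets by two thresholds."""
    determined = {au for au, p in probs.items() if p >= high}
    possible = {au for au, p in probs.items() if low <= p < high}
    inactive = {au for au, p in probs.items() if p < low}
    return determined, possible, inactive

def generate_guide_text(probs, low=0.3, high=0.7):
    """Build the structured guide text; the non-activation set is ignored."""
    determined, possible, _ = partition_aus(probs, low, high)
    if not determined and not possible:  # no salient activation -> neutral text
        return "Neutral expression: no salient facial muscle movement."
    parts = []
    if determined:
        parts.append("Determined actions: "
                     + "; ".join(AU_PHRASES[a] for a in sorted(determined)))
    if possible:
        parts.append("Possible actions: "
                     + "; ".join(AU_PHRASES[a] for a in sorted(possible)))
    return ". ".join(parts) + "."
```

For example, with probabilities `{"AU6": 0.82, "AU12": 0.91, "AU4": 0.12}`, AU6 and AU12 fall in the determined set and AU4 in the non-activation set, yielding a "determined actions" sentence only.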
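The graph-attention message passing of claim 3 can be sketched in NumPy as a single attention hop over a trainable AU dependency graph, followed by a fully connected layer with a sigmoid that produces the AU activation probability vector. The dimensions, single-head design, and random initialization below are assumptions for illustration, not details fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gat_hop(h, W, a_src, a_dst, adj):
    """One graph-attention hop: pairwise attention logits, masked by the
    dependency matrix, then softmax-weighted neighbour aggregation."""
    z = h @ W                                    # (N, d) projected node features
    s = z @ a_src                                # (N,) source attention scores
    t = z @ a_dst                                # (N,) destination attention scores
    logits = leaky_relu(s[:, None] + t[None, :])
    logits = np.where(adj > 0, logits, -1e9)     # keep only edges of the AU graph
    alpha = softmax(logits, axis=1)              # attention over neighbours
    return alpha @ z                             # aggregated node features

# Illustrative sizes: 5 predefined AUs with 8-dimensional node features.
N, d = 5, 8
h = rng.normal(size=(N, d))
W = rng.normal(size=(d, d))
a_src, a_dst = rng.normal(size=d), rng.normal(size=d)
adj = np.ones((N, N))                            # fully connected dependency init
out = gat_hop(h, W, a_src, a_dst, adj)

# Fully connected layer + sigmoid -> AU activation probability vector.
w_fc = rng.normal(size=d)
au_probs = sigmoid(out @ w_fc)
```

In the claimed system `adj` would be the trainable dependency matrix, updated during training rather than fixed to all-ones as here.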
Description
Technical Field
The invention belongs to the technical fields of image restoration and computer vision, and particularly relates to a multimodal face restoration and expression recognition system and method based on semantic guidance of facial action units.
Background
Face image restoration and expression recognition are two key tasks in computer vision, with wide application in security monitoring, telemedicine, human-computer interaction, and other fields. The prior art generally treats the two tasks as independent problems. In image restoration, existing methods are mainly based on generative adversarial networks or diffusion models, aiming to recover visually clear facial features and textures from low-quality, occluded, or noise-contaminated images. However, most of these methods rely on pixel-level visual feature learning and lack explicit modeling of the physiological structure of the human face, so they are prone to introduce deformations that violate the rules of muscle movement during restoration, for example generating a "fake smile" image in which the mouth corners are lifted without the accompanying eye tightening, which easily causes physiological distortion of the recovered result. In expression recognition, deep learning methods for high-quality images have matured, but their performance degrades drastically when image quality is severely compromised. Although some studies have attempted to introduce facial action units (AUs) as an intermediate representation for expression analysis to improve robustness, AU detection itself is highly prone to failure under severely degraded conditions. In addition, these methods are mainly limited to mining low-level visual features and fail to make effective use of the rich psychological and emotional common-sense knowledge underlying AUs.
In summary, the main drawbacks of the prior art are that: 1) the restoration process is disconnected from biomechanical constraints, which may lead to anatomically unreasonable results; 2) the recognition module is highly dependent on image quality; and 3) the restoration and recognition tasks are performed in isolation, forming no synergy.
Disclosure of the Invention
Aiming at the defects of the prior art, the invention provides a multimodal face restoration and expression recognition system and method based on semantic guidance of facial action units. The invention aims to provide a novel scheme capable of uniformly handling the restoration and recognition of low-quality faces while ensuring results that are credible both visually and physiologically. The invention provides a multimodal face restoration and expression recognition system based on semantic guidance of facial action units, which comprises a core processing module for cooperative work, a visual encoding and AU detection module, a semantic conversion and lifting module, a multimodal reasoning and guidance generation module, and a conditional generation module; the visual encoding and AU detection module extracts multi-scale features from a face image to be restored and outputs a multi-scale visual feature representation, while modeling the relevance of facial action units with a graph neural network and outputting an AU activation probability vector representing facial muscle movement probabilities; the semantic conversion and lifting module is connected to the visual encoding and AU detection module and converts the AU activation probability vector, through predefined biomechanical mapping rules, into a structured guide text containing determined actions and possible actions, realizing the conversion from numerical features to high-level interpretable semantic features; the multimodal reasoning and guidance generation module is connected to both the visual encoding and AU detection module and the semantic conversion and lifting module, performs multimodal fusion of the multi-scale visual feature representation, the AU activation probability vector, and the structured guide text, uses multimodal reasoning capability to complete potential action units and infer muscle movement trends, and generates a semantic guiding feature vector and an expression label; the conditional generation module is connected to both the visual encoding and AU detection module and the multimodal reasoning and guidance generation module, receives the multi-scale visual feature representation and the semantic guiding feature vector, injects the semantic guiding feature vector as a conditioning signal into a convolutional generation network to realize guided image restoration, and outputs a restored face image that conforms to the physiological rules of facial muscle movement.
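The conditional adaptive normalization used by the conditional generation module (claim 8) can be sketched as follows: the semantic guiding feature vector is linearly mapped to per-channel scale and bias terms, which affinely modulate a normalized feature map. The instance-style statistics, the plain linear mappings, and all dimensions are illustrative assumptions rather than details the patent specifies.

```python
import numpy as np

def conditional_adaptive_norm(feat, guide, W_gamma, W_beta, eps=1e-5):
    """feat: (C, H, W) feature map; guide: (D,) semantic guiding feature vector.
    Returns the affinely modulated feature map."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True)
    normed = (feat - mean) / (std + eps)        # normalize each channel
    gamma = W_gamma @ guide                     # (C,) channel scales from the guide
    beta = W_beta @ guide                       # (C,) channel biases from the guide
    return (1.0 + gamma)[:, None, None] * normed + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16                        # illustrative sizes
feat = rng.normal(size=(C, H, W))
guide = rng.normal(size=D)
Wg, Wb = rng.normal(size=(C, D)), rng.normal(size=(C, D))
modulated = conditional_adaptive_norm(feat, guide, Wg, Wb)
```

With a zero guide vector the modulation reduces to plain normalization, one common way such conditional layers are initialized so that guidance is learned gradually.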
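The multi-task joint optimization named throughout (pixel reconstruction loss, AU consistency loss, expression classification loss) could be realized as a weighted sum of three standard losses, sketched below. The L1, binary cross-entropy, and cross-entropy forms, and the weights, are assumptions for illustration; the patent does not fix these choices here.

```python
import numpy as np

def pixel_reconstruction_loss(pred, target):
    """L1 distance between restored and ground-truth images (one common choice)."""
    return np.abs(pred - target).mean()

def au_consistency_loss(p_pred, p_true, eps=1e-7):
    """Binary cross-entropy between predicted and reference AU activations."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -(p_true * np.log(p) + (1.0 - p_true) * np.log(1.0 - p)).mean()

def expression_classification_loss(logits, label):
    """Cross-entropy over expression classes from raw logits."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def joint_loss(pred, target, p_pred, p_true, logits, label,
               w_pix=1.0, w_au=0.5, w_expr=0.5):
    """Weighted combination used for multi-task joint optimization."""
    return (w_pix * pixel_reconstruction_loss(pred, target)
            + w_au * au_consistency_loss(p_pred, p_true)
            + w_expr * expression_classification_loss(logits, label))
```

In practice the three terms would be backpropagated jointly through the shared encoder, which is what couples the restoration and recognition branches.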