CN-121982755-A - Method and device for generating multitasking face information, edge computing equipment and medium

CN 121982755 A

Abstract

The invention provides a method, an apparatus, an edge computing device, and a medium for generating multi-task face information. The method inputs a video containing a face into a multi-task model, which outputs multiple kinds of face information corresponding to the multiple tasks. The multi-task model extracts features from the video frames with a common video feature extraction network module to obtain a first feature map sequence. A face position detection branch module performs face feature extraction and weight matrix generation on each first feature map to obtain the face features and weight matrix of each first feature map, and outputs the face position information of each video frame based on those face features. A face emotion detection branch module generates face emotion information based on the first feature map sequence and the weight matrix of each first feature map, and a face action unit detection branch module generates a face local action unit detection result based on the first feature map sequence and the weight matrix of each first feature map.
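Read as an architecture, the abstract describes one shared backbone feeding a detection branch that also produces a spatial weight matrix, which the emotion and action unit branches then consume. The following PyTorch sketch illustrates that data flow only; the layer sizes, head designs, and class counts are invented for illustration and are not taken from the patent.

```python
# Minimal sketch of the multi-task layout described in the abstract.
# Backbone depth, head shapes, and class counts are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskFaceModel(nn.Module):
    def __init__(self, num_emotions=7, num_aus=12):
        super().__init__()
        # Common video feature extraction network module (shared backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Face position branch: emits face features plus a spatial weight matrix.
        self.face_feat = nn.Conv2d(128, 128, 3, padding=1)
        self.weight_head = nn.Sequential(nn.Conv2d(128, 1, 1), nn.Sigmoid())
        self.box_head = nn.Conv2d(128, 4, 1)          # per-location box regression
        # Emotion and AU branches reuse the weight matrix to focus on the face.
        self.emotion_head = nn.Linear(128, num_emotions)
        self.au_head = nn.Linear(128, num_aus)

    def forward(self, frames):                        # frames: (T, 3, H, W)
        fmap = self.backbone(frames)                  # first feature map sequence
        feats = self.face_feat(fmap)
        weights = self.weight_head(feats)             # weight matrix per frame
        boxes = self.box_head(feats)                  # face position information
        attended = (fmap * weights).mean(dim=(2, 3))  # face-focused pooling
        return boxes, self.emotion_head(attended), self.au_head(attended)
```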

Inventors

  • Request for anonymity
  • Request for anonymity
  • Request for anonymity
  • Request for anonymity

Assignees

  • 北京津发科技股份有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-19

Claims (10)

  1. A method for generating multi-task face information, the method comprising: inputting a video containing a face into a pre-trained multi-task model, performing multi-task face information generation processing on the video by the multi-task model, and outputting multiple kinds of face information corresponding to the multiple tasks, wherein the multi-task model comprises a common video feature extraction network module and a plurality of face information task processing branch modules, the plurality of face information task processing branch modules comprising at least one of a face position detection branch module for detecting a face position, a face emotion detection branch module for generating face emotion information, and a face action unit detection branch module for generating a face local action unit detection result, and wherein the multi-task model performs the following processing steps on the video: performing feature extraction on the video frames in the video based on the common video feature extraction network module to obtain a first feature map sequence; performing face feature extraction and weight matrix generation on each first feature map in the first feature map sequence based on the face position detection branch module to obtain the face features and weight matrix of each first feature map, and outputting the face position information of each video frame in the video based on the face features of each first feature map; performing, by the face emotion detection branch module, face emotion detection based on the first feature map sequence and the weight matrix of each first feature map to generate face emotion information; performing, by the face action unit detection branch module, face local action unit detection based on the first feature map sequence and the weight matrix of each first feature map to generate a face local action unit detection result; and outputting the face information generated by each face information task processing branch module.
  2. The method of claim 1, wherein the plurality of face information task processing branch modules further comprises a face key point detection branch module, and wherein the processing steps performed by the multi-task model on the video further comprise: performing key point feature enhancement on the face features of each first feature map in the first feature map sequence based on the face key point detection branch module, and outputting the face key point information of each video frame in the video based on the enhanced feature maps.
  3. The method of claim 1, wherein the face position detection branch module comprises a face feature extraction network and a face position detection network, the face feature extraction network being configured to perform face feature extraction and weight matrix generation on each first feature map in the first feature map sequence, and the face position detection network being configured to output the face position information of each video frame based on the face features of each first feature map, wherein the face feature extraction network comprises a first convolution module and N first feature extraction modules, and wherein performing face feature extraction and weight matrix generation on each first feature map in the first feature map sequence based on the face position detection branch module comprises: performing feature extraction on each first feature map in the first feature map sequence by the first convolution module to obtain a second feature map sequence; performing, by each of the N sequentially connected first feature extraction modules based on the second feature map sequence, the following operation steps: extracting features from each first input feature map in a first input feature map sequence to obtain a first output feature map sequence, wherein the first input feature map sequence is the second feature map sequence or the output feature map sequence of the preceding first feature extraction module; and generating, from each first output feature map in the first output feature map sequence, a weight matrix corresponding to that first output feature map to obtain a weight matrix sequence.
  4. The method of claim 3, wherein the face emotion detection branch module comprises N second feature extraction modules and a first classification module connected in sequence, the N second feature extraction modules being in one-to-one correspondence with the N first feature extraction modules, and each second feature extraction module comprising a first convolution sub-module and a first channel attention module, and wherein performing, by the face emotion detection branch module, face emotion detection based on the first feature map sequence and the weight matrix of each first feature map to generate face emotion information comprises: processing each second input feature map in a second input feature map sequence by the first convolution sub-module to obtain a first processing result sequence, wherein the second input feature map sequence is the first feature map sequence or the output feature map sequence of the preceding second feature extraction module; multiplying, element by element, each first processing result in the first processing result sequence by the corresponding weight matrix in the weight matrix sequence generated by the first feature extraction module corresponding to the second feature extraction module to obtain a first multiplication result sequence; and generating, by the first classification module, the face emotion information based on the output feature map sequence generated by the last second feature extraction module.
  5. The method of claim 3, wherein the face action unit detection branch module comprises N third feature extraction modules, a spatio-temporal feature extraction module, and a second classification module connected in sequence, the N third feature extraction modules being in one-to-one correspondence with the N first feature extraction modules, and each third feature extraction module comprising a second convolution sub-module and a second channel attention module, and wherein performing, by the face action unit detection branch module, face local action unit detection based on the first feature map sequence and the weight matrix of each first feature map to generate a face local action unit detection result comprises: processing each third input feature map in a third input feature map sequence by the second convolution sub-module to obtain a second processing result sequence, wherein the third input feature map sequence is the first feature map sequence or the output feature map sequence of the preceding third feature extraction module; multiplying, element by element, each second processing result in the second processing result sequence by the corresponding weight matrix in the weight matrix sequence generated by the first feature extraction module corresponding to the third feature extraction module to obtain a second multiplication result sequence; performing, by the spatio-temporal feature extraction module, spatio-temporal feature extraction based on the output feature map sequence generated by the last third feature extraction module to obtain spatio-temporal features; and generating, by the second classification module, the face local action unit detection result based on the spatio-temporal features.
  6. The method of claim 1, wherein the multi-task model is trained by: acquiring a training sample set, wherein each training sample comprises a sample video and corresponding multiple kinds of sample face information; inputting the sample video into the multi-task model, and outputting multiple kinds of predicted face information by the plurality of task branch modules of the multi-task model; determining a difference loss between the multiple kinds of predicted face information and the multiple kinds of sample face information based on a preset loss function, wherein the loss function comprises the loss corresponding to each task branch module in the plurality of face information task processing branch modules and a learnable uncertainty parameter of each task branch module, and different task branch modules use different losses; and adjusting the parameters of the multi-task model with the aim of minimizing the difference loss.
  7. The method of claim 6, wherein the loss used by the face position detection branch module comprises a classification loss and a bounding box loss, the loss used by the face emotion detection branch module comprises a cross-entropy loss, and the loss used by the face action unit detection branch module comprises a binary cross-entropy loss with logits, and wherein the loss function comprises a weighted loss term and a regularization term corresponding to each task branch module, the weighted loss term of each task branch module weighting the loss of that task branch module based on the learnable uncertainty parameter corresponding to that task branch module, and the regularization term of each task branch module being based on the learnable uncertainty parameter corresponding to that task branch module.
  8. An apparatus for generating multi-task face information, comprising: an input module configured to input a video containing a face; a common video feature extraction network module configured to perform feature extraction on the video frames in the video to obtain a first feature map sequence; a plurality of face information task processing branch modules comprising a face position detection branch module and at least one of a face emotion detection branch module and a face action unit detection branch module, wherein the face position detection branch module is configured to perform face feature extraction and weight matrix generation on each first feature map in the first feature map sequence to obtain the face features and weight matrix of each first feature map, and to output the face position information of each video frame in the video based on the face features of each first feature map, the face emotion detection branch module is configured to perform face emotion detection based on the first feature map sequence and the weight matrix of each first feature map to generate face emotion information, and the face action unit detection branch module is configured to perform face local action unit detection based on the first feature map sequence and the weight matrix of each first feature map to generate a face local action unit detection result; and an output module configured to output the face information generated by each face information task processing branch module.
  9. An edge computing device comprising a processor, a memory, and a computer program/instructions stored on the memory, wherein the processor is configured to execute the computer program/instructions, and the computer program/instructions, when executed, implement the steps of the method of any one of claims 1 to 7.
  10. A computer-readable storage medium on which a computer program/instructions are stored, wherein the computer program/instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
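Claims 3 through 5 describe a per-stage pattern in the emotion and AU branches: each stage convolves its input, multiplies the result element-wise by the weight matrix produced by the matching stage of the face position branch, and applies channel attention. The PyTorch sketch below illustrates one such stage; the squeeze-and-excitation style attention and all dimensions are assumptions for illustration, not taken from the patent.

```python
# One branch stage in the spirit of claims 4/5: convolution -> element-wise
# multiplication by the weight matrix from the matching position-branch stage
# -> channel attention. The SE-style attention and all sizes are assumed.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed design)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        scale = self.fc(x.mean(dim=(2, 3)))     # global average pool -> (B, C)
        return x * scale[:, :, None, None]      # re-weight channels

class BranchStage(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # convolution sub-module
        self.attn = ChannelAttention(channels)

    def forward(self, x, weight_matrix):
        # weight_matrix: (B, 1, H, W) spatial weights from the position branch.
        processed = self.conv(x)                # "processing result"
        focused = processed * weight_matrix     # element-wise multiplication
        return self.attn(focused)               # stage output feature map
```

Stacking N such stages, each consuming the weight matrix of the position-branch stage it mirrors, keeps the emotion and AU branches concentrated on the detected face region.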
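Claim 5 additionally routes the AU branch output through a spatio-temporal feature extraction module before classification. One common realization, assumed here rather than specified by the patent, is a 3D convolution over the stacked per-frame feature maps:

```python
# Assumed spatio-temporal module for the AU branch of claim 5: a 3D convolution
# over the (time, height, width) stack of per-frame feature maps.
import torch
import torch.nn as nn

spatiotemporal = nn.Sequential(
    nn.Conv3d(128, 128, kernel_size=(3, 3, 3), padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),      # -> (B, 128) spatio-temporal feature
)
au_classifier = nn.Linear(128, 12)              # second classification module (12 AUs assumed)

frame_feats = torch.randn(1, 128, 16, 14, 14)   # (batch, C, T, H, W)
au_logits = au_classifier(spatiotemporal(frame_feats))
```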
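Claims 6 and 7 combine per-task losses using a learnable uncertainty parameter per branch plus a regularization term derived from that parameter. This matches the widely used homoscedastic-uncertainty weighting of Kendall et al. (2018); the sketch below assumes that formulation (each loss scaled by exp(-s) with s itself as the regularizer), which is an interpretation rather than the patent's exact formula.

```python
# Uncertainty-weighted multi-task loss in the spirit of claims 6/7.
# The exp(-s) weighting with s as regularizer follows Kendall et al. (2018);
# the patent's exact parameterization is not given, so this is an assumption.
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, num_tasks=3):
        super().__init__()
        # One learnable log-variance per task branch (position, emotion, AU).
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # task_losses: list of scalar losses, one per branch, e.g.
        # [cls_loss + box_loss, emotion_ce_loss, au_bce_with_logits_loss]
        total = torch.zeros((), device=self.log_vars.device)
        for loss, s in zip(task_losses, self.log_vars):
            total = total + torch.exp(-s) * loss + s  # weighted term + regularizer
        return total
```

During training, the optimizer updates the log-variances together with the model weights, so branches whose losses are noisier are automatically down-weighted.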

Description

Technical Field

The present invention relates to the field of computer vision, and in particular to a method and apparatus for generating multi-task face information, an edge computing device, and a medium.

Background

With the development of deep learning, face information generation methods based on deep learning (generating, e.g., face position information and face emotion information) have gradually replaced traditional methods; their core advantage is that features are learned automatically by a neural network instead of being designed by hand. However, existing deep-learning-based face information generation methods still have several shortcomings. For example, the tasks are decoupled: for multiple tasks such as face position detection and AU (Action Unit) detection, a dedicated model must be deployed separately for each task, so multi-task face detection is inefficient.

Disclosure of Invention

One technical problem to be solved by the present disclosure is how to generate multiple kinds of face information efficiently and accurately. In view of this, the invention provides a method for generating multi-task face information that produces multiple kinds of face information with a single end-to-end multi-task model, without deploying multiple models, thereby improving generation efficiency; in addition, by introducing a weight matrix, the target branches can pay more attention to the features of the face region, which improves the accuracy of the information those branches generate.

To solve the above technical problems, the present disclosure adopts the following scheme.

According to a first aspect, a method for generating multi-task face information is provided. The method inputs a video containing a face into a pre-trained multi-task model; the multi-task model performs multi-task face information generation processing on the video and outputs multiple kinds of face information corresponding to the multiple tasks. The multi-task model comprises a common video feature extraction network module and a plurality of face information task processing branch modules, the latter comprising at least one of a face position detection branch module for detecting a face position, a face emotion detection branch module for generating face emotion information, and a face action unit detection branch module for generating a face local action unit detection result. The multi-task model performs the following processing steps on the video: performing feature extraction on the video frames in the video based on the common video feature extraction network module to obtain a first feature map sequence; performing face feature extraction and weight matrix generation on each first feature map in the first feature map sequence based on the face position detection branch module to obtain the face features and weight matrix of each first feature map, and outputting the face position information of each video frame based on the face features of each first feature map; performing face emotion detection by the face emotion detection branch module based on the first feature map sequence and the weight matrix of each first feature map to generate face emotion information; performing face local action unit detection by the face action unit detection branch module based on the first feature map sequence and the weight matrix of each first feature map to generate a face local action unit detection result; and outputting the face information generated by each face information task processing branch module.

In some embodiments, the plurality of face information task processing branch modules further includes a face key point detection branch module, and the processing steps performed by the multi-task model on the video further include performing key point feature enhancement on the face features of each first feature map in the first feature map sequence based on the face key point detection branch module, and outputting the face key point information of each video frame in the video based on the enhanced feature maps.

In some embodiments, the face position detection branch module includes a face feature extraction network and a face position detection network; the face feature extraction network is used to perform face feature extraction and weight matrix generation on each first feature map in the first feature map sequence, and the face position detection network is used to output the face position information of each video frame based on the face features of each first feature map. The face feature extraction network includes a first convolution module and N first feature extraction modules.
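The final paragraph above (and claim 3) has each first feature extraction module in the position branch emit, alongside its output feature maps, a spatial weight matrix per map. A simple way to realize this, sketched below under the assumption of a 1x1 convolution followed by a sigmoid, is:

```python
# Sketch of one "first feature extraction module" from claim 3: it extracts
# features and derives a spatial weight matrix from its own output.
# The 1x1-conv + sigmoid weight head is an assumed design, not the patent's.
import torch
import torch.nn as nn

class PositionStage(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.extract = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.weight_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):                  # x: (B, C, H, W) input feature maps
        out = self.extract(x)              # first output feature map sequence
        weights = self.weight_head(out)    # weight matrix in [0, 1] per location
        return out, weights

# Chaining N stages yields one weight matrix sequence per stage, which the
# emotion and AU branch stages of the same depth consume (claims 4 and 5).
stages = nn.ModuleList(PositionStage() for _ in range(3))
x = torch.randn(2, 128, 56, 56)
weight_matrices = []
for stage in stages:
    x, w = stage(x)
    weight_matrices.append(w)
```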