CN-116129008-B - Tone-aware mouth shape simulation method, medium and system for a three-dimensional avatar
Abstract
The invention provides a tone-aware mouth shape simulation method, medium and system for a three-dimensional avatar, belonging to the technical field of three-dimensional avatar simulation. In the method, a tester reads a text with tone change marks while video of the tester reading is recorded; a three-dimensional coordinate system is established and images of the tester's face shot at tone change moments are acquired; the changes in the facial images are determined, and a Gaussian mixture background model distinguishes the changed foreground points from the background points; a three-dimensional avatar mouth shape model is established, all key points whose Gaussian category is foreground are taken as tone mouth shape key points, and the tone mouth shape key points are used to perform tone adjustment of the model; finally, the text with tone change marks is read with the required three-dimensional avatar, and a phoneme-driven mouth shape method generates and outputs the avatar's tone-aware mouth shape sequence from the tone-adjusted three-dimensional avatar mouth shape model.
Inventors
- ZHOU ANBIN
- YAN WUZHI
- LI XIN
- PAN JIANJIAN
- PENG CHEN
Assignees
- 山东金东数字创意股份有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20221208
Claims (9)
- 1. A tone-aware mouth shape simulation method for a three-dimensional avatar, comprising the steps of: S10, a tester reads a text with tone change marks while video of the tester reading is recorded; S20, establishing a three-dimensional coordinate system, and acquiring a first image of the tester's face shot at a tone change moment and a second image of the tester's face shot at a tone-stable moment; S30, determining a first change detection area of the first image and a second change detection area of the second image, the first change detection area being the lip area in the first image and the second change detection area being the lip area in the second image; S40, acquiring all first patches of the first change detection area and all second patches of the second change detection area; S50, performing deformation matching of all the first patches and all the second patches against preset representative patches to obtain a first patch matching result for the first change detection area and a second patch matching result for the second change detection area; S60, performing local feature matching between the first patch matching result and the second patch matching result to obtain a change value of the first patch matching result relative to the second patch matching result; S70, converting the first image into a mask image and inputting the mask image into a Gaussian mixture background model to obtain the Gaussian category, foreground or background, of each mouth shape key point in the first change detection area, as output by the Gaussian mixture background model; S80, establishing a three-dimensional avatar mouth shape model, taking all key points whose Gaussian category is foreground as tone mouth shape key points, and performing tone adjustment of the three-dimensional avatar mouth shape model using the tone mouth shape key points; S90, reading the text with tone change marks with the required three-dimensional avatar, and generating, by a phoneme-driven mouth shape method and the tone-adjusted three-dimensional avatar mouth shape model, a mouth shape sequence that serves as the tone-aware mouth shape sequence of the three-dimensional avatar. Step S60 specifically comprises: generating all first feature points of the first patch matching result and all second feature points of the second patch matching result with a Hessian matrix; convolving the first patch matching result and the second patch matching result with box filters of different sizes to obtain a first scale space of the first patch matching result and a second scale space of the second patch matching result; locating feature points from all the first feature points and the first scale space to obtain a first stable feature point set, and locating feature points from all the second feature points and the second scale space to obtain a second stable feature point set; taking each point of the first and second stable feature point sets in turn as a circle centre, accumulating Haar wavelet responses within a preset radius to obtain a first main direction set of the first stable feature point set and a second main direction set of the second stable feature point set; generating a first feature point descriptor set from the first main direction set and a second feature point descriptor set from the second main direction set; computing, by least squares over the two descriptor sets, the matching degree between any point of the first stable feature point set and all points of the second stable feature point set, and obtaining from the matching degree the corresponding point of that point in the second stable feature point set; forming matching point pairs from each point of the first stable feature point set and its corresponding point in the second stable feature point set; and computing, from the matching point pairs, the change value of the first patch matching result relative to the second patch matching result.
- 2. The tone-aware mouth shape simulation method for a three-dimensional avatar according to claim 1, wherein step S20 specifically comprises: establishing the three-dimensional coordinate system according to the MPEG-4 standard; acquiring an original shot image of the tester's face at the tone change moment and an original shot image of the tester's face at the tone-stable moment; extracting features, comprising facial key points and texture features, from the original shot image at the tone change moment and the original shot image at the tone-stable moment; judging, from the features, the tone of the tester's face at the tone change moment and at the tone-stable moment; applying the corresponding defogging or denoising image enhancement to the original shot image at the tone change moment according to the judged tone, to obtain the first image of the tester's face shot at the tone change moment; and applying the corresponding defogging or denoising image enhancement to the original shot image at the tone-stable moment according to the judged tone, to obtain the second image of the tester's face shot at the tone-stable moment.
- 3. The tone-aware mouth shape simulation method for a three-dimensional avatar according to claim 1, wherein step S30 specifically comprises: performing image enhancement on the first change detection area and the second change detection area respectively to obtain an enhanced image of each; filtering the enhanced image of the first change detection area and the enhanced image of the second change detection area with a mean filter respectively to obtain a filtered image of each; and performing edge detection and patch searching on the filtered image of the first change detection area and the filtered image of the second change detection area respectively to obtain the first patches of the first change detection area and the second patches of the second change detection area.
- 4. The tone-aware mouth shape simulation method for a three-dimensional avatar according to claim 1, wherein step S50 comprises: computing the first normalized central moments of all the first patches and the second normalized central moments of all the second patches respectively; computing first patch features from the first normalized central moments and second patch features from the second normalized central moments, the first and second patch features comprising centre, arc length and area; computing the first similarity of all the first patches to the representative patches from the first normalized central moments and the first patch features, and the second similarity of all the second patches to the representative patches from the second normalized central moments and the second patch features; and obtaining the first patch matching result of the first change detection area from the first similarity and the second patch matching result of the second change detection area from the second similarity.
- 5. The tone-aware mouth shape simulation method for a three-dimensional avatar according to claim 1, wherein step S80 specifically comprises: first, building a three-dimensional avatar mouth shape model with blendshapes according to the MPEG-4 standard; second, selecting the basic key points in the three-dimensional avatar mouth shape model that correspond to the tone mouth shape key points, and updating the coordinates of the basic key points using the coordinates of the tone mouth shape key points; and third, replacing the coordinates of the corresponding key points in the three-dimensional avatar mouth shape model with the updated basic key point coordinates.
- 6. The method of claim 5, wherein updating the coordinates of the basic key points using the coordinates of the tone mouth shape key points comprises: first, computing the similarity between the coordinates of each tone mouth shape key point and the coordinates of the corresponding basic key point; second, for each tone mouth shape key point whose similarity is less than 0.618, updating the basic key point by taking the midpoint between the tone mouth shape key point coordinates and the corresponding basic key point coordinates as the new basic key point coordinates; and third, recomputing the similarity with the updated basic key point coordinates and, while the result is still less than 0.618, iterating the step of taking the midpoint between the tone mouth shape key point coordinates and the corresponding basic key point coordinates as the new basic key point coordinates, until the similarity is greater than or equal to 0.618.
- 7. The tone-aware mouth shape simulation method for a three-dimensional avatar according to claim 6, wherein step S90 specifically comprises: first, acquiring the text with tone change marks that the three-dimensional avatar is required to read; second, building a phoneme set from the phonemes corresponding to the characters in the text; third, searching a preset three-dimensional mouth shape library for the mouth shapes of the corresponding three-dimensional avatar according to the phoneme set and preloading them as basic mouth shapes; fourth, replacing the mouth shapes corresponding to the tone-marked text with the tone-adjusted three-dimensional avatar mouth shape model; and fifth, obtaining the tone-aware mouth shape sequence of the three-dimensional avatar.
- 8. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the tone-aware mouth shape simulation method for a three-dimensional avatar according to any one of claims 1 to 7.
- 9. A tone-aware mouth shape simulation system for a three-dimensional avatar, comprising the computer-readable storage medium of claim 8.
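The final step of claim 1 derives a change value from the matched key point pairs. The patent does not fix the metric, so the sketch below assumes the mean Euclidean displacement over all pairs; `points_a`/`points_b` are hypothetical (N, 2) arrays of paired coordinates.

```python
import numpy as np

def change_value(points_a, points_b):
    """Change value of the first patch matching result relative to the
    second, computed from matched key point pairs (a[i] pairs with b[i]).

    Sketch only: mean Euclidean displacement is an assumption; the
    patent does not specify the metric.
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Average the per-pair displacement magnitudes.
    return float(np.linalg.norm(a - b, axis=1).mean())
```

A larger change value then indicates a stronger lip movement between the tone change moment and the tone-stable moment.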
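Claim 3 filters each enhanced change detection area with a mean filter before edge detection. A minimal sketch, assuming a 3x3 kernel with edge padding (the patent gives neither kernel size nor border handling):

```python
import numpy as np

def mean_filter(img, k=3):
    """k x k mean filter with edge padding (kernel size k=3 is an
    assumption; the patent does not specify it)."""
    img = np.asarray(img, dtype=float)
    r = k // 2
    pad = np.pad(img, r, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img)
    # Sum the k*k shifted views, then normalize to the window mean.
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + h, dx:dx + w]
    return out / (k * k)
```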
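Claim 4 builds patch features from normalized central moments. These are the standard image moments eta_pq; a sketch for a patch given as a 2-D intensity or mask array:

```python
import numpy as np

def normalized_central_moment(patch, p, q):
    """Normalized central moment eta_pq of a patch (standard image
    moment definition, which claim 4 uses for patch features and
    similarity)."""
    patch = np.asarray(patch, dtype=float)
    ys, xs = np.indices(patch.shape)
    m00 = patch.sum()                      # zeroth moment (mass/area)
    cx = (xs * patch).sum() / m00          # centroid x
    cy = (ys * patch).sum() / m00          # centroid y
    mu_pq = ((xs - cx) ** p * (ys - cy) ** q * patch).sum()
    # Normalization makes eta_pq scale-invariant.
    return mu_pq / m00 ** (1 + (p + q) / 2)
```

By construction eta_00 is always 1 and eta_10, eta_01 are always 0, so the informative features start at second order.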
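Claim 6 iterates a midpoint update until the similarity between a basic key point and its tone ("mood") mouth shape key point reaches 0.618. The patent fixes only the threshold, not the similarity measure, so the sketch assumes 1 / (1 + Euclidean distance); each step halves the distance, so the loop terminates:

```python
import numpy as np

def similarity(a, b):
    """Assumed similarity measure: 1 / (1 + Euclidean distance).
    The patent specifies only the 0.618 threshold, not the measure."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def update_basic_key_point(basic, tone, threshold=0.618):
    """Move the basic key point to the midpoint between it and the
    tone mouth shape key point until similarity >= threshold (claim 6)."""
    basic = np.asarray(basic, dtype=float)
    tone = np.asarray(tone, dtype=float)
    while similarity(basic, tone) < threshold:
        basic = (basic + tone) / 2.0   # halve the remaining distance
    return basic
```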
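The sequence generation of claim 7 amounts to a per-phoneme lookup with overrides for tone-marked text. A minimal sketch in which the three-dimensional mouth shape library is stood in for by string labels; all names (`MOUTH_LIBRARY`, the phoneme keys, `"neutral"`) are illustrative, not from the patent:

```python
# Hypothetical basic mouth shape library keyed by phoneme; the real
# library holds preloaded 3-D mouth shapes rather than labels.
MOUTH_LIBRARY = {"a": "open", "m": "closed", "o": "rounded"}

def mouth_sequence(phonemes, tone_marked, tone_adjusted_shape):
    """Build the mouth shape sequence (claim 7): look up a basic shape
    per phoneme, then replace the shapes at tone-marked positions with
    the tone-adjusted mouth shape model."""
    sequence = []
    for i, ph in enumerate(phonemes):
        if i in tone_marked:
            sequence.append(tone_adjusted_shape)        # tone-marked text
        else:
            sequence.append(MOUTH_LIBRARY.get(ph, "neutral"))
    return sequence
```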
Description
Tone-aware mouth shape simulation method, medium and system for a three-dimensional avatar
Technical Field
The invention belongs to the technical field of three-dimensional avatar simulation, and particularly relates to a tone-aware mouth shape simulation method, medium and system for a three-dimensional avatar.
Background
In everyday life people often speak with a tone, but most current three-dimensional avatar speech does not consider tone and directly applies mouth shape driving, so the generated speaking process of the avatar looks unrealistic. Chinese invention patent CN111081270B (application number CN201911314031.3) discloses a real-time audio-driven method for synchronizing a virtual character's mouth shape. That method recognizes viseme probabilities from a real-time voice stream, filters them, resamples them to the same rate as the virtual character's rendering frame rate, converts them into standard mouth shape configurations and renders the mouth shapes. It avoids the need to transmit a phoneme sequence or mouth shape sequence alongside the audio stream, noticeably reduces system complexity, coupling and implementation difficulty, and suits various applications that render virtual characters on display devices. However, while that invention can synchronize a virtual character's mouth shape from real-time audio, it does not solve the technical problem of controlling the mouth shape with tone-bearing audio.
Disclosure of Invention
In view of the above, the present invention provides a method, medium and system for controlling the mouth shape of a virtual character with tone-bearing audio. The invention is realized as follows. The first aspect of the invention provides a tone-aware mouth shape simulation method for a three-dimensional avatar, comprising the following steps: S10, a tester reads a text with tone change marks while video of the tester reading is recorded; S20, establishing a three-dimensional coordinate system, and acquiring a first image of the tester's face shot at a tone change moment and a second image of the tester's face shot at a tone-stable moment; S30, determining a first change detection area of the first image and a second change detection area of the second image, the first change detection area being the lip area in the first image and the second change detection area being the lip area in the second image; S40, acquiring all first patches of the first change detection area and all second patches of the second change detection area; S50, performing deformation matching of all the first patches and all the second patches against preset representative patches to obtain a first patch matching result for the first change detection area and a second patch matching result for the second change detection area; S60, performing local feature matching between the first patch matching result and the second patch matching result to obtain a change value of the first patch matching result relative to the second patch matching result; S70, converting the first image into a mask image and inputting the mask image into a Gaussian mixture background model to obtain the Gaussian category, foreground or background, of each mouth shape key point in the area to be detected, as output by the Gaussian mixture background model; S80, establishing a three-dimensional avatar mouth shape model, taking all key points whose Gaussian category is foreground as tone mouth shape key points, and performing tone adjustment of the three-dimensional avatar mouth shape model using the tone mouth shape key points; and S90, reading the text with tone change marks with the required three-dimensional avatar, and generating, by a phoneme-driven mouth shape method and the tone-adjusted three-dimensional avatar mouth shape model, a mouth shape sequence that serves as the tone-aware mouth shape sequence of the three-dimensional avatar. On the basis of the above technical scheme, the tone-aware mouth shape simulation method for a three-dimensional avatar can be further improved as follows. Step S20 specifically comprises: establishing the three-dimensional coordinate system according to the MPEG-4 standard. The method for obtaining the first image of the tester's face shot at the tone change moment and the second image of the face of the