CN-116129012-B - Virtual image lip driving method, device, medium and electronic equipment
Abstract
The present disclosure relates to an avatar lip driving method, apparatus, medium, and electronic device. The method comprises: determining a target IPA sequence corresponding to a target text and the pronunciation duration of each phonetic symbol in the sequence; performing duration expansion on the target IPA sequence according to the pronunciation durations to obtain an extended sequence; for each phonetic symbol in the extended sequence, extracting an N-gram phonetic symbol string containing that phonetic symbol from the extended sequence to obtain a target phonetic symbol sequence; determining lip parameters matching the target phonetic symbol sequence according to a correspondence between reference phonetic symbol sequences and lip parameters; and performing lip rendering on a target avatar based on the lip parameters corresponding to the phonetic symbols. Because the lip parameters for each target phonetic symbol in the extended sequence are obtained with both the phonetic symbol and its context information in view, the lip parameters under different IPA combinations better match real facial movement, the lip shape of the target avatar is as lifelike as possible, and the animation effect is improved.
Inventors
- BI CHENG
- MA ZEJUN
Assignees
- Beijing Youzhuju Network Technology Co., Ltd. (北京有竹居网络技术有限公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-01-31
Claims (12)
- 1. An avatar lip driving method, comprising: determining a target IPA sequence corresponding to a target text and a target pronunciation duration of each phonetic symbol in the target IPA sequence; performing duration expansion on the target IPA sequence according to each target pronunciation duration to obtain an extended sequence; for each target phonetic symbol in the extended sequence, extracting an N-gram phonetic symbol string containing the target phonetic symbol from the extended sequence to obtain a target phonetic symbol sequence, wherein N ≥ 2 and the target phonetic symbol is located at a preset position of the target phonetic symbol sequence; if no reference phonetic symbol sequence completely matching the target phonetic symbol sequence exists in a pre-established correspondence between reference phonetic symbol sequences and lip parameters, determining, from the reference phonetic symbol sequences containing a preset character, a candidate phonetic symbol sequence with the greatest degree of match to the target phonetic symbol sequence, wherein the preset character is not an International Phonetic Alphabet symbol and is located at a position other than the preset position in the reference phonetic symbol sequence, and the phonetic symbols at positions other than that of the preset character in the candidate phonetic symbol sequence are identical to the phonetic symbols at the corresponding positions in the target phonetic symbol sequence; determining lip parameters matching the target phonetic symbol sequence according to the lip parameters corresponding to the candidate phonetic symbol sequence, wherein the lip parameters corresponding to a reference phonetic symbol sequence represent the lip parameters of the phonetic symbol at the preset position in that reference phonetic symbol sequence; and performing lip rendering on a target avatar based on the lip parameters corresponding to the target phonetic symbol.
- 2. The method according to claim 1, further comprising: if a reference phonetic symbol sequence completely matching the target phonetic symbol sequence exists in the correspondence, determining the lip parameters corresponding to the completely matched reference phonetic symbol sequence as the lip parameters matching the target phonetic symbol sequence.
- 3. The method of claim 1, wherein determining lip parameters matching the target phonetic symbol sequence based on the lip parameters corresponding to the candidate phonetic symbol sequence comprises: if there are multiple candidate phonetic symbol sequences, determining the mean or median of the lip parameters corresponding to the candidate phonetic symbol sequences as the lip parameters matching the target phonetic symbol sequence; and if there is one candidate phonetic symbol sequence, determining the lip parameters corresponding to that candidate phonetic symbol sequence as the lip parameters matching the target phonetic symbol sequence.
- 4. The method of claim 1, wherein the correspondence is constructed by: acquiring a plurality of face videos; for each face video: extracting lip parameters from each video frame of the face video; determining a sample IPA sequence corresponding to the audio in the face video and a sample pronunciation duration of each sample phonetic symbol in the sample IPA sequence; determining, from the face video, a video segment corresponding to each sample phonetic symbol according to the sample IPA sequence and the sample pronunciation durations; and, for each sample phonetic symbol, extracting an N-gram phonetic symbol string containing the sample phonetic symbol from the sample IPA sequence to obtain a reference phonetic symbol sequence, and determining the parameter sequence formed by the lip parameters of the video frames in the video segment corresponding to the sample phonetic symbol as a first candidate parameter sequence corresponding to the reference phonetic symbol sequence; and, for each reference phonetic symbol sequence, determining the lip parameters corresponding to the reference phonetic symbol sequence according to at least one first candidate parameter sequence corresponding to that reference phonetic symbol sequence.
- 5. The method of claim 4, wherein determining the lip parameters corresponding to the reference phonetic symbol sequence based on the at least one first candidate parameter sequence comprises: for each first candidate parameter sequence corresponding to the reference phonetic symbol sequence, determining a reference lip parameter from the first candidate parameter sequence; and determining the mean or median of the reference lip parameters as the lip parameters corresponding to the reference phonetic symbol sequence.
- 6. The method of claim 5, wherein determining a reference lip parameter from the first candidate parameter sequence comprises: computing the absolute value of the difference between adjacent lip parameters in the first candidate parameter sequence to obtain a gradient sequence; and determining the lip parameter at the n-th position in the first candidate parameter sequence as the reference lip parameter, wherein the maximum value of the gradient sequence is located at the n-th position of the gradient sequence.
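The selection in claim 6 amounts to taking the lip parameter at the position where the frame-to-frame change peaks. A sketch with scalar lip parameters (an illustrative simplification; the patent's lip parameters may be vectors):

```python
def reference_lip_param(candidate_seq):
    """Pick the lip parameter where the frame-to-frame change is largest.

    candidate_seq: lip parameters of consecutive video frames (scalars).
    """
    # gradient sequence: absolute differences between adjacent parameters
    grads = [abs(b - a) for a, b in zip(candidate_seq, candidate_seq[1:])]
    n = grads.index(max(grads))  # n-th position of the gradient maximum
    return candidate_seq[n]      # lip parameter at the n-th position
```

For `[0.1, 0.2, 0.8, 0.9]` the gradient sequence peaks between the second and third frames, so the parameter at that n-th position (`0.2`) is selected.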
- 7. The method of claim 5, wherein, before determining the mean or median of the reference lip parameters as the lip parameters corresponding to the reference phonetic symbol sequence, determining the lip parameters corresponding to the reference phonetic symbol sequence from the at least one first candidate parameter sequence further comprises: removing outliers from the reference lip parameters corresponding to the reference phonetic symbol sequence; and the determining of the mean or median comprises: determining the mean or median of the reference lip parameters remaining after outlier removal as the lip parameters corresponding to the reference phonetic symbol sequence.
- 8. The method according to any one of claims 4-7, wherein constructing the correspondence further comprises: setting k = 1; for each sample phonetic symbol, extracting an (N−k)-gram phonetic symbol string containing the sample phonetic symbol from the sample IPA sequence to obtain a sample phonetic symbol sequence, and determining the parameter sequence formed by the lip parameters of the video frames in the video segment corresponding to the sample phonetic symbol as a second candidate parameter sequence corresponding to the sample phonetic symbol sequence; for each sample phonetic symbol sequence, determining the lip parameters corresponding to the sample phonetic symbol sequence according to at least one second candidate parameter sequence corresponding to the sample phonetic symbol sequence, and expanding the sample phonetic symbol sequence into an N-gram phonetic symbol string by adding preset characters to it, thereby obtaining a reference phonetic symbol sequence; and incrementing k by 1 and returning to the step of extracting the (N−k)-gram phonetic symbol string from the sample IPA sequence, until N − k = 1.
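Claim 8 describes a back-off: because full N-gram contexts are sparse in training data, shorter (N−k)-grams are also collected and padded with the preset character up to length N so that they share one table with the full N-grams. A rough sketch of the loop structure; the names, the end-padding placement, and the `params_for` callback are illustrative assumptions.

```python
def build_backoff_entries(sample_seq, params_for, N=3, pad="*"):
    """For k = 1 .. N-1, collect every (N-k)-gram of sample_seq and pad it
    with `pad` back up to length N, mirroring claim 8's expansion step.

    params_for(gram): illustrative callback returning the lip parameters
    aggregated from the gram's second candidate parameter sequences.
    """
    table = {}
    for k in range(1, N):            # loop runs until N - k == 1
        m = N - k                    # current gram length
        for i in range(len(sample_seq) - m + 1):
            gram = tuple(sample_seq[i:i + m])
            # expand the (N-k)-gram into an N-gram with preset characters
            ref = gram + (pad,) * k
            table.setdefault(ref, params_for(gram))
    return table
```

In practice the preset characters could occupy any non-preset position, not only the tail; the padding here is a simplification for readability.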
- 9. The method according to any one of claims 1-7, wherein performing lip rendering on the target avatar based on the lip parameters corresponding to the target phonetic symbol comprises: filtering the lip parameters corresponding to the target phonetic symbol according to the lip parameters corresponding to adjacent phonetic symbols of the target phonetic symbol, wherein an adjacent phonetic symbol is a phonetic symbol whose positional distance from the target phonetic symbol in the extended sequence is smaller than a preset distance threshold; and inputting the filtered lip parameters corresponding to the target phonetic symbol into a preset rendering engine to perform lip rendering on the target avatar.
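Claim 9's smoothing over neighbouring phonetic symbols can be sketched as a simple moving-average filter. This is one of several possible filters; the patent does not fix the filter type, and the mapping of the "preset distance threshold" to a window size is an assumption.

```python
def smooth_lip_params(params, window=1):
    """Average each lip parameter with its neighbours within `window`
    positions in the extended sequence (claim 9's preset distance
    threshold corresponds to window + 1 here)."""
    out = []
    for i in range(len(params)):
        lo, hi = max(0, i - window), min(len(params), i + window + 1)
        out.append(sum(params[lo:hi]) / (hi - lo))
    return out
```

This suppresses single-frame jumps in the lip parameter track before it is handed to the rendering engine.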
- 10. An avatar lip driving apparatus, comprising: a first determining module for determining a target IPA sequence corresponding to a target text and a target pronunciation duration of each phonetic symbol in the target IPA sequence; a first expansion module for performing duration expansion on the target IPA sequence according to each target pronunciation duration to obtain an extended sequence; a first extraction module for extracting, for each target phonetic symbol in the extended sequence, an N-gram phonetic symbol string containing the target phonetic symbol from the extended sequence to obtain a target phonetic symbol sequence, wherein N ≥ 2 and the target phonetic symbol is located at a preset position of the target phonetic symbol sequence; a second determining module for determining, when no reference phonetic symbol sequence completely matching the target phonetic symbol sequence exists in a pre-established correspondence between reference phonetic symbol sequences and lip parameters, a candidate phonetic symbol sequence with the greatest degree of match to the target phonetic symbol sequence from the reference phonetic symbol sequences containing a preset character, wherein the preset character is not an International Phonetic Alphabet symbol and is located at a position other than the preset position in the reference phonetic symbol sequence, and the phonetic symbols at positions other than that of the preset character in the candidate phonetic symbol sequence are identical to the phonetic symbols at the corresponding positions in the target phonetic symbol sequence; and a rendering module for performing lip rendering on a target avatar based on the lip parameters corresponding to the target phonetic symbol.
- 11. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processing device, carries out the steps of the method according to any one of claims 1-9.
- 12. An electronic device, comprising: a storage device having at least one computer program stored thereon; and at least one processing device for executing the at least one computer program in the storage device to carry out the steps of the method according to any one of claims 1-9.
Description
Virtual image lip driving method, device, medium and electronic equipment

Technical Field

The present disclosure relates to the field of computer vision, and in particular, to an avatar lip driving method, apparatus, medium, and electronic device.

Background

With the rapid development of artificial intelligence (AI) and big data technology, AI has penetrated many aspects of life. Virtual object technology is an important sub-field of AI: it can construct an avatar through AI techniques and drive the avatar's facial expressions to simulate human speech. One application of facial expression driving is lip driving of an avatar from input text. Making the lip shape of the avatar approach that of a real person as closely as possible is important for improving the animation effect.

Disclosure of Invention

This section is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This section is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides an avatar lip driving method, comprising: determining a target IPA sequence corresponding to a target text and a target pronunciation duration of each phonetic symbol in the target IPA sequence; performing duration expansion on the target IPA sequence according to each target pronunciation duration to obtain an extended sequence; for each target phonetic symbol in the extended sequence, extracting an N-gram phonetic symbol string containing the target phonetic symbol from the extended sequence to obtain a target phonetic symbol sequence, wherein N ≥ 2 and the target phonetic symbol is located at a preset position of the target phonetic symbol sequence; determining lip parameters matching the target phonetic symbol sequence according to a pre-established correspondence between reference phonetic symbol sequences and lip parameters, wherein the lip parameters corresponding to a reference phonetic symbol sequence represent the lip parameters of the phonetic symbol at the preset position in that reference phonetic symbol sequence; and performing lip rendering on a target avatar based on the lip parameters corresponding to the target phonetic symbols.
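The duration-expansion step above can be read as repeating each phonetic symbol once per rendered video frame. A sketch assuming a fixed frame rate; the frame rate, the rounding rule, and the function name are illustrative assumptions, not fixed by the disclosure.

```python
def expand_by_duration(ipa_seq, durations_s, fps=25):
    """Repeat each phonetic symbol for as many video frames as its
    pronunciation duration covers at `fps` frames per second."""
    extended = []
    for sym, dur in zip(ipa_seq, durations_s):
        frames = max(1, round(dur * fps))  # at least one frame per symbol
        extended.extend([sym] * frames)
    return extended
```

The extended sequence then has one entry per output frame, so each frame's target phonetic symbol can be looked up directly during rendering.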
In a second aspect, the present disclosure provides an avatar lip driving apparatus, comprising: a first determining module for determining a target IPA sequence corresponding to a target text and a target pronunciation duration of each phonetic symbol in the target IPA sequence; a first expansion module for performing duration expansion on the target IPA sequence according to each target pronunciation duration to obtain an extended sequence; a first extraction module for extracting, for each target phonetic symbol in the extended sequence, an N-gram phonetic symbol string containing the target phonetic symbol from the extended sequence to obtain a target phonetic symbol sequence, wherein N ≥ 2 and the target phonetic symbol is located at a preset position of the target phonetic symbol sequence; a second determining module for determining lip parameters matching the target phonetic symbol sequence according to a pre-established correspondence between reference phonetic symbol sequences and lip parameters, wherein the lip parameters corresponding to a reference phonetic symbol sequence represent the lip parameters of the phonetic symbol at the preset position in that reference phonetic symbol sequence; and a rendering module for performing lip rendering on a target avatar based on the lip parameters corresponding to the target phonetic symbols. In a third aspect, the present disclosure provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the avatar lip driving method provided in the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising: a storage device having at least one computer program stored thereon; and at least one processing device for executing the at least one computer program in the storage device to implement the steps of the avatar lip driving method provided in the first aspect of the present disclosure. In the above technical solution, after the extended sequence corresponding to the target text is obtained, the lip parameters for each target phonetic symbol in the extended sequence are acquired with not only the target phonetic symbol itself but also its context information in view, so that the lip parameters under different IPA combinations better match real facial movement, the lip shape of the target avatar is as lifelike as possible, and the animation effect is improved.