
CN-121999777-A - Digital human interaction method, device, equipment and program product

CN121999777A

Abstract

The application discloses a digital human interaction method, device, equipment, and program product, aimed at solving the problem of misaligned interaction content caused by inaccurate recognition of professional-domain terms in existing digital human systems. The method comprises: performing text transcription on user voice data during human-computer interaction to obtain at least two candidate texts; determining an acoustic probability score and a knowledge enhancement score for each candidate text, where the acoustic probability score characterizes the confidence that the corresponding candidate text is recognized correctly at the acoustic level, and the knowledge enhancement score characterizes the degree of semantic association between the corresponding candidate text and a knowledge graph of the target domain; determining a comprehensive score for each candidate text based on its acoustic probability score and knowledge enhancement score; selecting the candidate text whose comprehensive score meets a preset condition as the target text; and generating and outputting a digital human video stream that answers the target text.
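As a rough illustration of the reranking idea in the abstract (not the patented implementation), the sketch below combines an acoustic score with a knowledge-graph similarity score to pick among ASR candidates. The entity strings, the toy embeddings, the example graph, and the weighting constants are all hypothetical placeholders.

```python
# Hypothetical sketch of knowledge-enhanced candidate reranking.
# Embeddings, the toy knowledge graph, and weights are placeholders,
# not the patent's actual trained models.
from math import sqrt

# Toy embeddings for content/knowledge entities (normally from an encoder).
EMB = {
    "carbon sink": [0.9, 0.1, 0.0],
    "carbon sing": [0.6, 0.2, 0.5],  # a plausible ASR mis-recognition
    "forest":      [0.8, 0.3, 0.1],
    "emission":    [0.7, 0.2, 0.2],
}

# Toy domain knowledge graph: entity -> adjacent (second-type) entities.
GRAPH = {"carbon sink": ["forest", "emission"]}

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def knowledge_score(content_entity):
    """Similarity to matched (first-type) and neighbor (second-type) entities."""
    best = 0.0
    for kg_entity, neighbors in GRAPH.items():
        s1 = cos(EMB[content_entity], EMB[kg_entity])            # first-type
        s2 = max((cos(EMB[content_entity], EMB[n]) for n in neighbors),
                 default=0.0)                                     # second-type
        best = max(best, 0.7 * s1 + 0.3 * s2)  # illustrative weighting
    return best

def composite(acoustic, knowledge, alpha=0.5):
    """Simple fixed-weight fusion of acoustic and knowledge scores."""
    return alpha * acoustic + (1 - alpha) * knowledge

# Two ASR candidates: the acoustically favoured one loses after knowledge
# enhancement, because "carbon sink" fits the domain graph far better.
cands = {"carbon sing": 0.80, "carbon sink": 0.75}  # acoustic scores
scored = {t: composite(a, knowledge_score(t)) for t, a in cands.items()}
target = max(scored, key=scored.get)
print(target)  # → carbon sink
```

The point of the example is the abstract's central claim: a candidate that wins acoustically can still lose the comprehensive score when its entities have weak semantic ties to the target-domain knowledge graph.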

Inventors

  • LI YUNXIAO
  • JIANG WENLONG
  • KONG LINGJUN
  • XIANG XIAOMING
  • YIN YAOYAO
  • TU ZHENGYANG
  • ZHANG YUDONG
  • LI YONG

Assignees

  • China Mobile IoT Co., Ltd. (中移物联网有限公司)
  • China Mobile Communications Group Co., Ltd. (中国移动通信集团有限公司)

Dates

Publication Date
2026-05-08
Application Date
2026-04-09

Claims (10)

  1. A digital human interaction method, comprising: performing text transcription on user voice data in the human-computer interaction process to obtain at least two candidate texts; determining an acoustic probability score and a knowledge enhancement score for each candidate text, wherein the acoustic probability score characterizes the confidence that the corresponding candidate text is recognized correctly at the acoustic level, and the knowledge enhancement score characterizes the degree of semantic association between the corresponding candidate text and a knowledge graph of the target domain; wherein the knowledge enhancement score of any target candidate text among all the candidate texts is determined by determining a basic knowledge score of the target candidate text based on the semantic similarity between a content entity in the target candidate text and a first-type knowledge entity in the knowledge graph and the semantic similarity between the content entity and a second-type knowledge entity in the knowledge graph; wherein the knowledge graph comprises knowledge entities of the target domain, attributes of the knowledge entities, and association relations between the knowledge entities, the first-type knowledge entity refers to a knowledge entity matched with the content entity in the target candidate text, and the second-type knowledge entity refers to a knowledge entity adjacent to the first-type knowledge entity; determining a comprehensive score for each candidate text based on its acoustic probability score and knowledge enhancement score, and selecting the candidate text whose comprehensive score meets a preset condition as the target text; and generating a digital human video stream for answering the target text, and outputting the digital human video stream.
  2. The method of claim 1, wherein performing text transcription on user voice data in the human-computer interaction process to obtain at least two candidate texts comprises: inputting the user voice data into a speech recognition model, and performing speech recognition on the user voice data by the speech recognition model to obtain the at least two candidate texts; wherein the speech recognition model is obtained by training on sample voice data labeled with ground-truth text.
  3. The method of claim 2, wherein determining an acoustic probability score for each candidate text comprises: for any target candidate text among all the candidate texts, determining the acoustic probability score of the target candidate text based on the probability distribution of the target candidate text in the speech recognition model.
  4. The method of claim 1, further comprising: capturing a user facial video corresponding to the user voice data; extracting mouth-shape motion features from the user facial video; and determining a mouth-shape matching degree score for each candidate text, the mouth-shape matching degree score characterizing the degree of matching between the pronunciation mouth shape of the corresponding candidate text and the mouth-shape motion features; wherein determining a comprehensive score for each candidate text based on the acoustic probability score and the knowledge enhancement score of each candidate text comprises: determining the comprehensive score of each candidate text based on the acoustic probability score, the knowledge enhancement score, and the mouth-shape matching degree score of each candidate text.
  5. The method of claim 1, wherein determining a comprehensive score for each candidate text based on the acoustic probability score and the knowledge enhancement score of each candidate text comprises: determining a gating coefficient for any target candidate text among all the candidate texts based on semantic features of the target candidate text, combined with signal-to-noise ratio features of the user voice data and semantic features of context information in the human-computer interaction process; and performing a weighted sum of the acoustic probability score and the knowledge enhancement score of the target candidate text based on the gating coefficient of the target candidate text to obtain the comprehensive score of the target candidate text.
  6. The method of claim 1, wherein generating a digital human video stream for answering the target text comprises: generating an answer text corresponding to the target text; converting the answer text into voice audio, and inputting the voice audio into a voice-driven portrait generation model to obtain an initial digital portrait frame sequence generated by the portrait generation model in synchronization with the voice audio; and inputting the initial digital portrait frame sequence into a temporal compensation feature prediction network for temporal optimization to obtain an optimized digital portrait video stream; wherein, for a target frame to be optimized in the initial digital portrait frame sequence, the temporal compensation feature prediction network is configured to: extract an image feature representation of the target frame; fuse the image feature representation of the target frame with a pre-stored historical compensation feature based on an attention mechanism to obtain a predicted compensation feature of the target frame; generate a spatial weight mask matching the size of the target frame; perform weighted fusion of the predicted compensation feature and the image feature representation of the target frame based on the spatial weight mask to obtain an optimized target frame; and store the predicted compensation feature as a new historical compensation feature; and wherein the temporal compensation feature prediction network is trained using video sequences containing continuous speaker images as training data, with differences between adjacent frames in the video sequence as one of the optimization targets.
  7. The method according to any one of claims 1 to 6, wherein the target domain includes the ecological environment domain, and the user voice data corresponds to an ecological-environment-related question.
  8. A digital human interaction device, comprising: a transcription module, configured to perform text transcription on user voice data in the human-computer interaction process to obtain at least two candidate texts; a scoring module, configured to determine an acoustic probability score and a knowledge enhancement score for each candidate text, wherein the acoustic probability score characterizes the confidence that the corresponding candidate text is recognized correctly at the acoustic level, and the knowledge enhancement score characterizes the degree of semantic association between the corresponding candidate text and a knowledge graph of the target domain; wherein the knowledge enhancement score of any target candidate text among all the candidate texts is determined by determining a basic knowledge score of the target candidate text based on the semantic similarity between a content entity in the target candidate text and a first-type knowledge entity in the knowledge graph and the semantic similarity between the content entity and a second-type knowledge entity in the knowledge graph; wherein the knowledge graph comprises knowledge entities of the target domain, attributes of the knowledge entities, and association relations between the knowledge entities, the first-type knowledge entity refers to a knowledge entity matched with the content entity in the target candidate text, and the second-type knowledge entity refers to a knowledge entity adjacent to the first-type knowledge entity; a screening module, configured to determine a comprehensive score for each candidate text based on its acoustic probability score and knowledge enhancement score, and to select the candidate text whose comprehensive score meets a preset condition as the target text; and a generation module, configured to generate a digital human video stream for answering the target text and to output the digital human video stream.
  9. An electronic device comprising a processor and a memory arranged to store computer-executable instructions, wherein the executable instructions, when executed, cause the processor to perform the method of any one of claims 1 to 7.
  10. A computer program product comprising a computer-readable storage medium storing a computer program, characterized in that the computer program is operable to cause a computer to perform the method of any one of claims 1 to 7.
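Claims 4 and 5 describe fusing the scores through a context-dependent gating coefficient, optionally mixed with a mouth-shape matching score. A minimal sketch of that gated weighted sum follows; the logistic gating function, its coefficients, the feature values, and the mouth-shape weight are all illustrative stand-ins, not the patent's trained components.

```python
# Hypothetical sketch of the gated score fusion from claims 4-5.
# The gating function and all numeric constants are illustrative only.
import math

def gate(semantic_feat, snr_db, context_feat):
    """Map semantic, SNR, and context features to a coefficient in (0, 1).

    Higher SNR -> trust the acoustic score more. This simple logistic
    combination stands in for a learned gating network.
    """
    z = 0.4 * semantic_feat + 0.1 * snr_db + 0.2 * context_feat - 1.0
    return 1.0 / (1.0 + math.exp(-z))

def fuse(acoustic, knowledge, g, mouth=None, mouth_w=0.2):
    """Gated weighted sum; optionally mixes in a mouth-shape match score."""
    score = g * acoustic + (1.0 - g) * knowledge
    if mouth is not None:
        score = (1.0 - mouth_w) * score + mouth_w * mouth
    return score

g = gate(semantic_feat=0.8, snr_db=12.0, context_feat=0.5)        # clean audio
noisy_g = gate(semantic_feat=0.8, snr_db=-2.0, context_feat=0.5)  # noisy audio

# In noisy audio the gate shrinks, shifting weight from the acoustic
# score toward the knowledge enhancement score.
print(round(g, 3), round(noisy_g, 3))
print(round(fuse(0.9, 0.6, g), 3), round(fuse(0.9, 0.6, g, mouth=0.7), 3))
```

The design point mirrors claim 5: rather than fixing the acoustic/knowledge weighting, the gate lets the system lean on domain knowledge exactly when the acoustic evidence (low SNR) is least trustworthy.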

Description

Digital human interaction method, device, equipment and program product

Technical Field

The present application relates to the field of human-computer interaction technologies, and in particular to a digital human interaction method, apparatus, device, and program product.

Background

In human-computer interaction in professional domains, traditional approaches often rely on structured text queries or pre-recorded video. Text queries require users to be familiar with technical terms and expression conventions, and the returned results are mostly technical documents or data reports, which is unfriendly to ordinary users; fixed video content is monolithic, lacks interactivity, and can hardly adapt to dynamically changing, personalized question answering. With the development of digital human technology, its anthropomorphic image and multi-modal interaction capability provide a new path for improving user experience. However, in scenarios with dense terminology and strong specialization, existing systems still have obvious shortcomings: most digital human schemes focus on appearance rendering and motion simulation, while their core voice interaction still uses general-purpose speech recognition, which cannot deeply fuse domain knowledge; recognition accuracy for professional terms is insufficient, semantic understanding is weak, interaction results easily deviate, and it is difficult to support accurate and reliable professional services and question answering.

Disclosure of Invention

The application provides a digital human interaction method, device, equipment, and program product, aimed at solving the problem of misaligned interaction content caused by inaccurate recognition of professional-domain terms in existing digital human systems.
Correspondingly, the technical scheme of the application is as follows. In a first aspect, a digital human interaction method is provided, including: performing text transcription on user voice data in the human-computer interaction process to obtain at least two candidate texts; determining an acoustic probability score and a knowledge enhancement score for each candidate text, wherein the acoustic probability score characterizes the confidence that the corresponding candidate text is recognized correctly at the acoustic level, and the knowledge enhancement score characterizes the degree of semantic association between the corresponding candidate text and a knowledge graph of the target domain; wherein the knowledge enhancement score of any target candidate text among all the candidate texts is determined by determining a basic knowledge score of the target candidate text based on the semantic similarity between a content entity in the target candidate text and a first-type knowledge entity in the knowledge graph and the semantic similarity between the content entity and a second-type knowledge entity in the knowledge graph; wherein the knowledge graph comprises knowledge entities of the target domain, attributes of the knowledge entities, and association relations between the knowledge entities, the first-type knowledge entity refers to a knowledge entity matched with the content entity in the target candidate text, and the second-type knowledge entity refers to a knowledge entity adjacent to the first-type knowledge entity; determining a comprehensive score for each candidate text based on its acoustic probability score and knowledge enhancement score, and selecting the candidate text whose comprehensive score meets a preset condition as the target text; and generating a digital human video stream for answering the target text, and outputting the digital human video stream. In a second aspect, there is provided a digital human interaction device, comprising: a transcription module, configured to perform text transcription on user voice data in the human-computer interaction process to obtain at least two candidate texts; a scoring module, configured to determine an acoustic probability score and a knowledge enhancement score for each candidate text, wherein the acoustic probability score characterizes the confidence that the corresponding candidate text is recognized correctly at the acoustic level, and the knowledge enhancement score characterizes the degree of semantic association between the corresponding candidate text and a knowledge graph of the target domain; wherein the knowledge enhancement score of any target candidate text among all candidate texts is determined based on the semantic similarity between a content entity in the target candidate text and a first-type knowledge entity in the knowledge graph and semantic