
CN-121999099-A - Multi-modal smart speaker control method based on a scholar large model, and related device

CN121999099A

Abstract

The application discloses a multi-modal smart speaker control method based on a scholar large model, and a related device, applicable to the technical field of artificial intelligence. According to the application, after real-time input text is generated from real-time voice data, real-time reply text corresponding to the real-time input text is generated through the scholar large model, which can provide high-quality, in-depth question-answer information in academic vertical domains and thereby improves the professionalism of the question answering. Then, after a target voice stream is generated from the real-time reply text, the reference pictures in a source face picture list and the target voice stream are input into a digital human driving module to perform real-time short-frame inference and generate a lip driving sequence synchronized with the target voice stream, which improves hardware applicability. Finally, a target face picture is acquired from the source face picture list according to the lip driving sequence and the time step, and the output audio and picture of the smart speaker are rendered, so that the source-face-picture transfer mechanism makes the head movements of the smart speaker's digital human more natural and enhances the realism of the multi-modal interaction.
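The end-to-end flow summarized in the abstract (voice → input text → scholar-model reply → voice stream → lip driving sequence → rendered audio and picture) can be sketched as below. Every function is a placeholder standing in for a real component (ASR, the scholar large model, TTS, the SadTalker-based driver); all names are illustrative assumptions, not the patent's API.

```python
# Hedged sketch of the patent's control pipeline; each stage is a stub.

def speech_to_text(audio):           # ASR: real-time voice -> input text
    return f"text({audio})"

def scholar_llm_reply(text):         # scholar large model: input -> reply text
    return f"reply({text})"

def text_to_speech(text):            # TTS: reply text -> target voice stream
    return f"voice({text})"

def lip_drive_sequence(voice, reference_faces):
    # digital-human driver: short-frame inference, one lip pose per face
    return [f"lip({voice},{i})" for i, _ in enumerate(reference_faces)]

def control_speaker(audio, source_face_list):
    """Run the full chain and pair each source face picture, per time
    step, with its lip pose for rendering on the speaker's display."""
    voice = text_to_speech(scholar_llm_reply(speech_to_text(audio)))
    lips = lip_drive_sequence(voice, source_face_list)
    return list(zip(source_face_list, lips))
```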

Inventors

  • Yuan Chengzhe
  • Wu Shijie
  • Shen Weiqiang
  • Kuang Mingchen
  • Li Qing
  • Wu Junxuan

Assignees

  • Guangdong Polytechnic Normal University (广东技术师范大学)

Dates

Publication Date
2026-05-08
Application Date
2025-12-24

Claims (10)

  1. A multi-modal smart speaker control method based on a scholar large model, characterized by comprising the following steps: constructing a source face picture list, wherein the source face picture list comprises a plurality of source face pictures and head pose reference data; acquiring real-time voice data; generating real-time input text from the real-time voice data; generating, through a scholar large model, real-time reply text corresponding to the real-time input text; generating a target voice stream from the real-time reply text; inputting the reference pictures in the source face picture list and the target voice stream into a digital human driving module to perform real-time short-frame inference and generate a lip driving sequence synchronized with the target voice stream; and acquiring a target face picture from the source face picture list according to the lip driving sequence and the time step, and rendering the output audio and picture of the smart speaker.
  2. The method of claim 1, wherein constructing the source face picture list comprises: acquiring an original digital human picture and head pose reference data; transferring the head pose reference data into the original digital human picture through a motion transfer algorithm to generate a silent long video, wherein the digital human in the silent long video exhibits preset slight head movements; extracting frames from the silent long video frame by frame to obtain digital human pictures to be cropped; and cropping the face region from each digital human picture to be cropped to obtain all source face pictures corresponding to the digital human pictures to be cropped, forming the source face picture list.
  3. The method of claim 1, wherein generating real-time input text from the real-time voice data comprises: recognizing and transcribing the real-time voice data to obtain the real-time input text.
  4. The method of claim 1, wherein the scholar large model includes an input encoding layer, a domain feature extraction layer, a retrieval enhancement layer, an inference generation layer, an answer optimization layer, and an output layer; the input encoding layer is configured to perform word segmentation, vectorization, and positional encoding on the real-time input text to obtain an encoding vector; the domain feature extraction layer is configured to perform entity recognition, keyword extraction, and domain matching on the real-time input text to obtain a target entity, target keywords, and target domain features; the retrieval enhancement layer is configured to perform retrieval enhancement on the target entity, target keywords, and target domain features to obtain a retrieval enhancement vector; the inference generation layer is configured to fuse the encoding vector and the retrieval enhancement vector to obtain professional reply text; the answer optimization layer is configured to verify, optimize, and correct the professional reply text to obtain the real-time reply text; and the output layer is configured to output the real-time reply text.
  5. The method of claim 1, wherein the digital human driving module includes a SadTalker model, and wherein inputting the reference pictures in the source face picture list and the target voice stream into the digital human driving module to perform real-time short-frame inference and generate a lip driving sequence synchronized with the target voice stream comprises: framing the target voice stream to obtain a plurality of short speech batches; extracting features from each short speech batch to obtain time-frequency signals; and inputting the time-frequency signals and the reference pictures in the source face picture list into the SadTalker model to obtain the lip driving sequence corresponding to each short speech batch.
  6. The method of claim 5, wherein framing the target voice stream to obtain a plurality of short speech batches comprises: calculating the inference execution time of the previous short speech batch; adjusting the number of inference frames of the next short speech batch according to the inference execution time; and framing the target voice stream according to the adjusted number of inference frames to obtain the plurality of short speech batches.
  7. The method of claim 1, wherein acquiring a target face picture from the source face picture list according to the lip driving sequence and the time step, and rendering the output audio and picture of the smart speaker, comprises: acquiring target face pictures frame by frame from the source face picture list according to the time step; and rendering the lip driving sequence onto the target face picture of the corresponding image frame, and synchronously matching it with the target voice stream to control a display screen and a loudspeaker in the smart speaker to output audio-video-synchronized digital broadcast content.
  8. A multi-modal smart speaker control device based on a scholar large model, the device comprising: a first module, configured to construct a source face picture list, wherein the source face picture list comprises a plurality of source face pictures and head pose reference data; a second module, configured to acquire real-time voice data; a third module, configured to generate real-time input text from the real-time voice data; a fourth module, configured to generate, through a scholar large model, real-time reply text corresponding to the real-time input text; a fifth module, configured to generate a target voice stream from the real-time reply text; a sixth module, configured to input the reference pictures in the source face picture list and the target voice stream into a digital human driving module to perform real-time short-frame inference and generate a lip driving sequence synchronized with the target voice stream; and a seventh module, configured to acquire a target face picture from the source face picture list according to the lip driving sequence and the time step, and render the output audio and picture of the smart speaker.
  9. An electronic device, comprising: at least one processor; and at least one memory for storing at least one program; wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1 to 7.
  10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
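The adaptive framing of claims 5 and 6 — tuning the number of inference frames in the next short speech batch from the measured inference time of the previous batch — can be sketched as below. The tuning rule, the 25 fps frame budget, and all names are illustrative assumptions for low-power hardware (e.g. an Orange Pi board), not the patent's concrete algorithm.

```python
# Hedged sketch of claim 6: adapt the per-batch inference frame count
# so lip-sync inference keeps pace with the audio it covers.

FRAME_DURATION_S = 0.04   # assumed 25 fps video -> 40 ms of audio per frame
MIN_FRAMES, MAX_FRAMES = 2, 16

def adjust_frame_count(prev_frames: int, prev_inference_s: float) -> int:
    """Return the frame count for the next short speech batch: shrink
    when the previous batch overran its real-time budget, grow when it
    finished with comfortable headroom, clamp to [MIN, MAX]."""
    budget_s = prev_frames * FRAME_DURATION_S
    if prev_inference_s > budget_s:          # falling behind real time
        nxt = prev_frames - 1
    elif prev_inference_s < 0.5 * budget_s:  # plenty of slack
        nxt = prev_frames + 1
    else:
        nxt = prev_frames
    return max(MIN_FRAMES, min(MAX_FRAMES, nxt))

def frame_batches(total_frames: int, measured_times) -> list:
    """Split a voice stream of `total_frames` frames into batches,
    re-tuning the batch size from each measured inference time."""
    batches, n, i = [], 4, 0                 # assumed initial batch size
    for t in measured_times:
        if i >= total_frames:
            break
        take = min(n, total_frames - i)
        batches.append(take)
        i += take
        n = adjust_frame_count(take, t)
    return batches
```

For example, with a 4-frame batch (0.16 s budget), a 0.30 s inference time shrinks the next batch to 3 frames, while 0.05 s grows it to 5.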

Description

Multi-modal smart speaker control method based on a scholar large model, and related device

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a multi-modal smart speaker control method based on a scholar large model and a related device.

Background

In the related art, a smart speaker is an intelligent device integrating artificial intelligence, speech recognition, and network connection technologies, and can realize diversified functions through voice interaction. Current smart speakers rely primarily on speech for single-mode interaction, making it difficult to provide an immersive experience. In addition, for voice-driven digital humans, directly deploying the related models (such as SadTalker) on low-power embedded hardware platforms such as the Orange Pi leads to high inference latency and heavy resource consumption, and the digital human's head movements appear stiff and unnatural when a single static picture is used. Furthermore, existing smart speakers cannot locally provide multi-modal visual feedback, and in complex semantic scenarios such as academic question answering they suffer from insufficient knowledge depth, response latency, and monotonous interaction. In summary, these technical problems in the related art remain to be addressed.

Disclosure of Invention

The embodiments of the application mainly aim to provide a multi-modal smart speaker control method based on a scholar large model, and a related device, which can effectively improve the professionalism, interactivity, and hardware adaptability of a smart speaker.
In order to achieve the above objective, one aspect of the embodiments of the present application provides a multi-modal smart speaker control method based on a scholar large model, the method comprising the following steps: constructing a source face picture list, wherein the source face picture list comprises a plurality of source face pictures and head pose reference data; acquiring real-time voice data; generating real-time input text from the real-time voice data; generating, through a scholar large model, real-time reply text corresponding to the real-time input text; generating a target voice stream from the real-time reply text; inputting the reference pictures in the source face picture list and the target voice stream into a digital human driving module to perform real-time short-frame inference and generate a lip driving sequence synchronized with the target voice stream; and acquiring a target face picture from the source face picture list according to the lip driving sequence and the time step, and rendering the output audio and picture of the smart speaker. In some embodiments, constructing the source face picture list includes: acquiring an original digital human picture and head pose reference data; transferring the head pose reference data into the original digital human picture through a motion transfer algorithm to generate a silent long video, wherein the digital human in the silent long video exhibits preset slight head movements; extracting frames from the silent long video frame by frame to obtain digital human pictures to be cropped; and cropping the face region from each digital human picture to be cropped to obtain all source face pictures corresponding to the digital human pictures to be cropped, forming the source face picture list.
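The source-face-picture-list construction described above can be sketched with a minimal, dependency-free example. A real system would use a motion-transfer model plus OpenCV or ffmpeg for frame extraction and a face detector for the crop box; here frames are plain 2-D lists and the "face region" is a fixed crop box, both assumptions made purely for illustration.

```python
# Hedged sketch: silent long video -> per-frame face crops -> source list.

def extract_frames(video):
    """Step 1: read the silent long video frame by frame.
    `video` is assumed to already be a sequence of frames."""
    return list(video)

def crop_face(frame, box):
    """Step 2: cut the face region (top, bottom, left, right) out of
    one frame, yielding one source face picture."""
    top, bottom, left, right = box
    return [row[left:right] for row in frame[top:bottom]]

def build_source_face_list(video, box, head_pose_ref):
    """Step 3: pair every cropped face with the shared head-pose
    reference data to form the source face picture list."""
    faces = [crop_face(f, box) for f in extract_frames(video)]
    return {"faces": faces, "head_pose_ref": head_pose_ref}
```

Because the crops come from a video that already contains slight head movements, stepping through this list at playback time is what lets the rendered digital human avoid the stiffness of a single static picture.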
In some embodiments, generating real-time input text from the real-time voice data includes: recognizing and transcribing the real-time voice data to obtain the real-time input text. In some embodiments, the scholar large model includes an input encoding layer, a domain feature extraction layer, a retrieval enhancement layer, an inference generation layer, an answer optimization layer, and an output layer; the input encoding layer is configured to perform word segmentation, vectorization, and positional encoding on the real-time input text to obtain an encoding vector; the domain feature extraction layer is configured to perform entity recognition, keyword extraction, and domain matching on the real-time input text to obtain a target entity, target keywords, and target domain features; the retrieval enhancement layer is configured to perform retrieval enhancement on the target entity, target keywords, and target domain features to obtain a retrieval enhancement vector; the inference generation layer is configured to fuse the encoding vector and the retrieval enhancement vector to obtain professional reply text; the answer optimization layer is configured to verify, optimize, and correct the professional reply text to obtain the real-time reply text; and the output layer is configured to output the real-time reply text. In some embodiments, the digital human driving module includes a SadTalker model, the inputtin