CN-121978974-A - Scene control method and device based on multi-mode data and electronic equipment
Abstract
The application belongs to the technical field of smart homes, and particularly relates to a scene control method and device based on multi-modal data, and an electronic device. The method comprises: obtaining multi-modal information, the multi-modal information comprising physiological information and/or voice information of a user; determining a corresponding scene control instruction set according to the multi-modal information, wherein the scene control instruction set comprises at least one target instruction whose type is at least one of device control, information query, and message pushing; and executing at least one target instruction in the scene control instruction set. The method significantly improves the accuracy and robustness of human-machine interaction in complex environments. At the same time, deep understanding of user intent and cross-device cooperative control are achieved: a single interaction instruction can be automatically expanded into a complete service chain comprising device control, information query, and other operations, greatly improving the continuity and proactivity of intelligent scene services.
Inventors
- LI SHILONG
- HUANG TAO
- YIN FEI
- WANG XIANQING
Assignees
- 青岛海尔科技有限公司
- 海尔优家智能科技(北京)有限公司
- 青岛海尔智能家电科技有限公司
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-19
Claims (10)
- 1. A scene control method based on multi-modal data, applied to a server, the method comprising: acquiring multi-modal information, wherein the multi-modal information comprises physiological information and/or voice information of a user; extracting features of the multi-modal information to obtain at least one multi-modal feature; determining a target instruction associated with the at least one multi-modal feature to generate a scene control instruction set, wherein the multi-modal features and the target instructions have a correspondence, and the scene control instruction set comprises at least one target instruction; and executing at least one of the target instructions in the scene control instruction set.
- 2. The method of claim 1, wherein the determining the target instruction associated with the at least one multi-modal feature to generate the scene control instruction set comprises: determining a candidate instruction set corresponding to each multi-modal feature, wherein the candidate instruction set comprises at least one candidate instruction, and the multi-modal features and the candidate instruction sets have a correspondence; and determining at least one target instruction according to the candidate instruction set corresponding to each multi-modal feature, and generating the scene control instruction set based on the at least one target instruction.
- 3. The method of claim 2, wherein the determining at least one target instruction according to the candidate instruction set corresponding to each multi-modal feature comprises: acquiring scene information, wherein the scene information comprises at least one dimension of information among life scene type, user state, and environment parameters; determining a first weight list corresponding to each multi-modal feature based on a preset mapping relation, wherein the mapping relation indicates the correspondence between the multi-modal features and the first weight lists, and the first weight list comprises a first weight corresponding to each dimension of information in the scene information; for each multi-modal feature, performing weighted summation on the scene information and the corresponding first weight list to obtain a corresponding first sum value, and taking the first sum value as a second weight corresponding to the multi-modal feature; and determining at least one target instruction according to the candidate instruction set corresponding to each multi-modal feature and the second weight corresponding to the multi-modal feature.
- 4. The method of claim 3, wherein the candidate instruction set further comprises a probability corresponding to each candidate instruction, and the determining at least one target instruction according to the candidate instruction set corresponding to each multi-modal feature and the second weight corresponding to the multi-modal feature comprises: performing de-duplication and merging on the candidate instruction sets corresponding to the multi-modal features to obtain a target task set, wherein the target task set comprises at least one target candidate instruction and the probabilities of each target candidate instruction under the different multi-modal features; for each target candidate instruction, performing weighted summation on the probabilities of the target candidate instruction under the different multi-modal features and the second weights corresponding to the multi-modal features to obtain a corresponding second sum value, and taking the second sum value as a score of the target candidate instruction; and determining at least one target instruction according to the at least one target candidate instruction in the target task set and the score corresponding to each target candidate instruction.
- 5. The method of claim 2, wherein the multi-modal features comprise a consciousness feature, an eye movement feature, and a voice feature, and wherein, before the determining the candidate instruction set corresponding to each multi-modal feature, the method further comprises: mapping each multi-modal feature into the same common semantic subspace, wherein the common semantic subspace indicates a vector space in which the multi-modal features have the same dimension and comparable semantics; in the common semantic subspace, taking the mapped consciousness feature as a query vector, taking the spliced vector of the mapped eye movement feature and the mapped voice feature as a key vector and a value vector, and performing multi-head cross-attention calculation on the query vector, the key vector, and the value vector to obtain attention weights; and performing weighted summation on the value vector according to the attention weights to obtain a third sum value, and taking the third sum value as a joint feature representation; correspondingly, the determining the candidate instruction set corresponding to each multi-modal feature comprises: determining the candidate instruction set corresponding to each multi-modal feature according to each multi-modal feature and the joint feature representation.
- 6. The method of claim 1, wherein the executing at least one of the target instructions in the scene control instruction set comprises: performing semantic analysis on at least one target instruction to obtain corresponding semantic information; determining, from a scene graph according to the semantic information, a service sequence corresponding to the semantic information, wherein the service sequence comprises at least one subtask and the target device associated with each subtask; for each subtask, determining an execution strategy corresponding to the subtask according to real-time state information of the corresponding target device and a preset rule, wherein the execution strategy comprises immediate execution and delayed execution; and issuing the subtasks to the corresponding target devices according to the execution strategy corresponding to each subtask, so that the target devices execute the subtasks.
- 7. The method of claim 3, further comprising, after obtaining the scene control instruction set: judging whether at least one group of conflicting instructions exists among the at least one target instruction in the scene control instruction set; if so, determining the confidence corresponding to each multi-modal feature; determining the target multi-modal feature with the highest confidence; upwardly correcting the second weight corresponding to the target multi-modal feature to obtain an updated second weight; and re-determining at least one target instruction based on the updated second weight corresponding to the target multi-modal feature.
- 8. The method of claim 1, wherein the multi-modal information is collected by smart glasses, the method further comprising: judging whether a special target instruction for performing a display operation exists in the scene control instruction set; if so, determining a corresponding augmented reality display mode according to the special target instruction and the real-time state of the user, wherein the augmented reality display mode comprises at least one of a driving navigation mode, a home control mode, and a virtual input mode; generating a virtual interactive interface corresponding to the augmented reality display mode, wherein the virtual interactive interface comprises at least one of a navigation map, a device control panel, or a virtual input method keyboard; and sending the virtual interactive interface to the smart glasses, so that the smart glasses display the virtual interactive interface.
- 9. A scene control device based on multi-modal data, applied to a server, the device comprising: an acquisition module, configured to acquire multi-modal information, wherein the multi-modal information comprises physiological information and/or voice information of a user; an extraction module, configured to extract features of the multi-modal information to obtain at least one multi-modal feature; a determining module, configured to determine a target instruction associated with the at least one multi-modal feature to generate a scene control instruction set, wherein the multi-modal features and the target instructions have a correspondence, and the scene control instruction set comprises at least one target instruction; and an execution module, configured to execute at least one target instruction in the scene control instruction set.
- 10. An electronic device, comprising at least one processor and a memory, wherein the memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the method of any one of claims 1 to 8.
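The weighting step of claim 3 can be sketched as follows. The scene-dimension names, weight values, and modality labels are illustrative assumptions, not values from the application: each modality's preset "first weight" list is dotted with the current scene-dimension values to produce that modality's "second weight".

```python
SCENE_DIMS = ["life_scene_type", "user_state", "environment"]  # dimension order (assumed)

# Assumed preset mapping relation: modality -> first-weight list (one weight per dimension)
FIRST_WEIGHTS = {
    "voice": [0.5, 0.2, 0.3],
    "eye_movement": [0.3, 0.4, 0.3],
    "consciousness": [0.2, 0.5, 0.3],
}

def second_weight(scene_values, first_weights):
    """First sum value: weighted sum of the scene-dimension values with the
    modality's first weights; the claim takes this as the second weight."""
    return sum(v * w for v, w in zip(scene_values, first_weights))

scene = [0.8, 0.6, 0.4]  # normalized values for the three scene dimensions (assumed)
second_weights = {m: second_weight(scene, fw) for m, fw in FIRST_WEIGHTS.items()}
```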
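The scoring step of claim 4 (de-duplicate and merge the per-modality candidate sets, then weight each instruction's per-modality probabilities by the second weights) can be sketched as below; the threshold-based selection rule and all instruction names are assumptions for illustration.

```python
def score_instructions(candidate_sets, second_weights):
    """candidate_sets: {modality: {instruction: probability}};
    second_weights: {modality: second weight}. Merging the dictionaries
    de-duplicates instructions; each instruction's score (second sum value)
    is sum over modalities of probability * second weight."""
    scores = {}
    for modality, candidates in candidate_sets.items():
        w = second_weights[modality]
        for instruction, prob in candidates.items():
            scores[instruction] = scores.get(instruction, 0.0) + prob * w
    return scores

def select_targets(scores, threshold=0.5):
    """Assumed selection rule: keep instructions scoring above a threshold,
    highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [instruction for instruction, s in ranked if s >= threshold]
```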
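The fusion step of claim 5 can be sketched with NumPy as follows: each modality is projected into a common subspace, the consciousness feature serves as the query, and the eye-movement and voice features form the key/value set for multi-head cross attention. The subspace dimension, head count, and random projection matrices are stand-ins for the learned parameters the application would use.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(consciousness, eye, voice, d=8, n_heads=2, seed=0):
    """Joint feature representation: query = mapped consciousness feature,
    keys/values = the stacked (spliced) eye-movement and voice features."""
    rng = np.random.default_rng(seed)

    def project(x):  # random projection standing in for a learned mapping
        w = rng.standard_normal((x.shape[-1], d)) / np.sqrt(x.shape[-1])
        return x @ w

    q = project(consciousness)[None, :]            # (1, d) query vector
    kv = np.stack([project(eye), project(voice)])  # (2, d) key/value vectors
    dh = d // n_heads
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        attn = softmax(q[:, s] @ kv[:, s].T / np.sqrt(dh))  # attention weights
        heads.append(attn @ kv[:, s])              # weighted sum of values (third sum value)
    return np.concatenate(heads, axis=-1)[0]       # joint feature representation
```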
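The dispatch step of claim 6 can be sketched as below. The concrete preset rule (execute immediately when the target device is idle, defer otherwise) and the device-state labels are assumptions; the claim only fixes the two strategies, immediate and delayed execution.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str    # subtask from the service sequence
    device: str  # associated target device

def plan_execution(subtasks, device_states):
    """For each subtask, choose an execution strategy from the target
    device's real-time state: 'immediate' when the device is idle,
    'delayed' when it is busy or offline."""
    plan = []
    for task in subtasks:
        state = device_states.get(task.device, "offline")
        strategy = "immediate" if state == "idle" else "delayed"
        plan.append((task.name, task.device, strategy))
    return plan
```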
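The conflict-resolution step of claim 7 can be sketched as follows: when conflicting target instructions are detected, the second weight of the most confident modality is raised ("upwardly corrected") so that a re-scoring pass can break the tie. The multiplicative boost factor is an assumed parameter; the claim does not specify the correction rule.

```python
def resolve_conflict(second_weights, confidences, boost=1.5):
    """Pick the modality with the highest confidence and return it together
    with an updated copy of the second weights in which that modality's
    weight is boosted."""
    best = max(confidences, key=confidences.get)
    updated = dict(second_weights)
    updated[best] *= boost
    return best, updated
```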
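The mode selection of claim 8 amounts to a lookup from the user's real-time state to an augmented reality display mode and its virtual interface element; a minimal sketch follows, with the state labels and the default fallback as assumptions.

```python
# Hypothetical mapping: user state -> (AR display mode, virtual interface element)
AR_MODES = {
    "driving": ("driving_navigation", "navigation_map"),
    "at_home": ("home_control", "device_control_panel"),
    "typing": ("virtual_input", "virtual_keyboard"),
}

def pick_ar_interface(user_state, default="at_home"):
    """Choose the AR mode and virtual interactive interface to send to the
    smart glasses; unknown states fall back to the assumed default."""
    mode, widget = AR_MODES.get(user_state, AR_MODES[default])
    return {"mode": mode, "interface": widget}
```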
Description
Scene control method and device based on multi-modal data and electronic device
Technical Field
The application belongs to the technical field of smart homes, and particularly relates to a scene control method and device based on multi-modal data, and an electronic device.
Background
With the rapid development of artificial intelligence and Internet of Things technology, intelligent terminals have gradually become the core entrance of human-machine interaction and are widely applied in complex scenes such as driving, home, and travel. In the prior art, human-machine interaction on an intelligent terminal mainly depends on a single modality. For example, some products implement functional control via voice commands only, and their effectiveness may be reduced in noisy environments. Meanwhile, the interaction logic of existing schemes is typically limited to a single device, and the system cannot understand and execute user instructions involving multiple devices. In addition, existing systems typically interact by responding to a single, explicit instruction from the user: the system completes the corresponding single operation according to the instruction and cannot autonomously identify and respond to other related needs the user may have in a complex scene, so such systems are limited. Therefore, an intelligent method for scene adaptation is needed to improve user experience and address the fragmentation problems of the prior art.
Disclosure of Invention
The application provides a scene control method and device based on multi-modal data, and an electronic device.
In a first aspect, the present application provides a scene control method based on multi-modal data, applied to a server, the method comprising: acquiring multi-modal information, wherein the multi-modal information comprises physiological information and/or voice information of a user; extracting features of the multi-modal information to obtain at least one multi-modal feature; determining a target instruction associated with the at least one multi-modal feature to generate a scene control instruction set, wherein the multi-modal features and the target instructions have a correspondence, and the scene control instruction set comprises at least one target instruction; and executing at least one of the target instructions in the scene control instruction set. In one possible implementation, the determining the target instruction associated with the at least one multi-modal feature to generate the scene control instruction set includes: determining a candidate instruction set corresponding to each multi-modal feature, wherein the candidate instruction set comprises at least one candidate instruction, and the multi-modal features and the candidate instruction sets have a correspondence; and determining at least one target instruction according to the candidate instruction set corresponding to each multi-modal feature, and generating the scene control instruction set based on the at least one target instruction.
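The acquire-extract-determine-execute flow of the first aspect can be sketched end to end as below. The feature labels, the feature-to-instruction table, and the stub extraction logic are illustrative assumptions, not the patented implementation.

```python
def extract_features(multimodal_info):
    """Stub feature extraction: map raw input channels to modality labels."""
    return [m for m in ("voice", "eye_movement") if m in multimodal_info]

# Assumed correspondence between multi-modal features and target instructions
FEATURE_TO_INSTRUCTION = {
    "voice": "turn_on_ac",
    "eye_movement": "dim_lights",
}

def run_scene_control(multimodal_info, execute):
    """Acquire multi-modal information, extract features, determine the scene
    control instruction set, and execute each target instruction in it."""
    features = extract_features(multimodal_info)
    instruction_set = [FEATURE_TO_INSTRUCTION[f] for f in features
                       if f in FEATURE_TO_INSTRUCTION]
    for instruction in instruction_set:
        execute(instruction)
    return instruction_set
```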
In a possible implementation manner, the determining at least one target instruction according to the candidate instruction set corresponding to each multi-modal feature includes: acquiring scene information, wherein the scene information comprises at least one dimension of information among life scene type, user state, and environment parameters; determining a first weight list corresponding to each multi-modal feature based on a preset mapping relation, wherein the mapping relation indicates the correspondence between the multi-modal features and the first weight lists, and the first weight list comprises a first weight corresponding to each dimension of information in the scene information; for each multi-modal feature, performing weighted summation on the scene information and the corresponding first weight list to obtain a corresponding first sum value, and taking the first sum value as a second weight corresponding to the multi-modal feature; and determining at least one target instruction according to the candidate instruction set corresponding to each multi-modal feature and the second weight corresponding to the multi-modal feature. In one possible implementation manner, the candidate instruction set further includes a probability corresponding to each candidate instruction, and the determining at least one target instruction according to the candidate instruction set corresponding to each multi-modal feature and the second weight corresponding to the multi-modal feature includes: performing de-duplication and merging on the candidate instruction sets corresponding to the multi-modal features to obtain a target task set, wherein the target task set comprises at least one target candidate instruction and the probabilities of each target candidate instruction under the different multi-modal features; for each target candidate instruction,