CN-121455345-B - Man-machine interaction testing method and device for intelligent equipment
Abstract
The application provides a method and a device for testing human-computer interaction of an intelligent device, relating to the technical field of human-computer interaction. The method comprises: collecting original multi-modal data of a preset test target and performing unified processing to generate multi-modal fusion semantic features; determining a multi-task prediction intermediate result according to the multi-modal fusion semantic features; generating, in a virtual environment, a scene description of a high-fidelity interaction scene corresponding to the test target according to the multi-modal fusion semantic features and the test target; and determining and outputting a comprehensive quantification result of key evaluation indexes, together with a corresponding interpretable text description, according to the multi-task prediction intermediate result, the scene description, and the operation data collected during scene operation. The application thus provides a novel human-computer interaction testing system for intelligent devices based on a native multi-modal unified large model, which fundamentally improves multi-modal understanding, scene generation, collaborative evaluation, and real-time processing capabilities, so as to meet the demanding requirements of interaction testing in highly complex task environments.
Inventors
- MA XULONG
- HE YING
- PING SHANTAO
- CHEN YAN
- LIN JIAZHEN
- ZHAO QIANCHUAN
Assignees
- 启元实验室 (Qiyuan Laboratory)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-06
Claims (8)
- 1. A method for testing human-computer interaction of an intelligent device, characterized by comprising the following steps: collecting original multi-modal data of a preset test target and performing unified processing to generate multi-modal fusion semantic features; determining a multi-task prediction intermediate result according to the multi-modal fusion semantic features; generating, in a virtual environment, a scene description of a high-fidelity interaction scene corresponding to the test target according to the multi-modal fusion semantic features and the test target; and determining and outputting a comprehensive quantification result of key evaluation indexes and a corresponding interpretable text description according to the multi-task prediction intermediate result, the scene description, and operation data collected during scene operation; wherein generating, in the virtual environment, the scene description of the high-fidelity interaction scene corresponding to the test target according to the multi-modal fusion semantic features and the test target comprises: inputting the multi-modal fusion semantic features into a preset scene generator to determine a corresponding task type and interaction mode; selecting a scene template according to the task type and the interaction mode, and determining task parameters; establishing the high-fidelity interaction scene in the virtual environment based on the scene template and the task parameters; and dynamically adjusting the task parameters based on a reinforcement learning mechanism to generate a scene description of adaptive difficulty; and wherein determining and outputting the comprehensive quantification result of the key evaluation indexes and the corresponding interpretable text description according to the multi-task prediction intermediate result, the scene description, and the operation data collected during scene operation comprises: using a preset multi-dimensional evaluation engine to evaluate the multi-task prediction intermediate result according to the operation data, so as to determine the key evaluation indexes in the multi-task prediction intermediate result and the comprehensive quantification result; and generating the interpretable text description from the scene description and the comprehensive quantification result by using a preset large language model.
- 2. The method of claim 1, wherein determining the multi-task prediction intermediate result according to the multi-modal fusion semantic features comprises: inputting the multi-modal fusion semantic features into a preset native multi-modal unified large model to output the multi-task prediction intermediate result, wherein the multi-task prediction intermediate result comprises an intention category, an emotion label, a task state score, and an interaction quality level.
- 3. The method of claim 2, wherein inputting the multi-modal fusion semantic features into the preset native multi-modal unified large model to output the multi-task prediction intermediate result comprises: inputting the multi-modal fusion semantic features into the preset native multi-modal unified large model; and performing multi-task analysis on the multi-modal fusion semantic features by means of the multi-task learning heads of the native multi-modal unified large model, so as to determine the multi-task prediction intermediate result.
- 4. The method of claim 1, wherein collecting the original multi-modal data of the preset test target and performing unified processing to generate the multi-modal fusion semantic features comprises: collecting the original multi-modal data corresponding to the test target; extracting features from the original multi-modal data to determine initial features; processing the initial features for temporal and semantic consistency to determine unified features; and performing deep fusion on the unified features to generate the multi-modal fusion semantic features.
- 5. The method of any one of claims 1-4, further comprising: adjusting custom difficulty information according to the comprehensive quantification result by using a reinforcement learning mechanism, wherein the custom difficulty information comprises a task density, an interference intensity, and an information masking ratio.
- 6. A device for testing human-computer interaction of an intelligent device, characterized by comprising: a data acquisition and processing module, configured to collect original multi-modal data of a preset test target and perform unified processing to generate multi-modal fusion semantic features; an intermediate result determining module, configured to determine a multi-task prediction intermediate result according to the multi-modal fusion semantic features; a scene generation module, configured to generate, in a virtual environment, a scene description of a high-fidelity interaction scene corresponding to the test target according to the multi-modal fusion semantic features and the test target; and an evaluation module, configured to determine and output a comprehensive quantification result of key evaluation indexes and a corresponding interpretable text description according to the multi-task prediction intermediate result, the scene description, and operation data collected during scene operation; wherein the scene generation module is specifically configured to: input the multi-modal fusion semantic features into a preset scene generator to determine a corresponding task type and interaction mode; select a scene template according to the task type and the interaction mode, and determine task parameters; establish the high-fidelity interaction scene in the virtual environment based on the scene template and the task parameters; and dynamically adjust the task parameters based on a reinforcement learning mechanism to generate a scene description of adaptive difficulty; and wherein the evaluation module is specifically configured to: use a preset multi-dimensional evaluation engine to evaluate the multi-task prediction intermediate result according to the operation data, so as to determine the key evaluation indexes in the multi-task prediction intermediate result and the comprehensive quantification result; and generate the interpretable text description from the scene description and the comprehensive quantification result by using a preset large language model.
- 7. An electronic device, comprising: a processor; and a memory storing a computer program which, when executed by the processor, causes the processor to perform the method of any one of claims 1-5.
- 8. A non-transitory computer-readable storage medium having stored thereon computer-readable instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1-5.
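The scene-generation step recited in claim 1 (fused features, then task type and interaction mode, then template and parameters) can be illustrated in code. The rule-based generator below is a minimal sketch only; every name, threshold, and classification rule is an assumption standing in for the patent's preset scene generator, not its actual model.

```python
# Illustrative sketch of claim 1's scene-generation step. The rule-based
# "generator" and all names/thresholds here are assumptions, not the patent's model.
from dataclasses import dataclass


@dataclass
class SceneDescription:
    task_type: str          # e.g. command decision vs. device control
    interaction_mode: str   # dominant perception channel for the scene
    task_params: dict       # tunable difficulty knobs of the chosen template


def generate_scene(fused_features: dict, test_target: str) -> SceneDescription:
    """Map fused semantic features and the test target to a scene template."""
    # Toy stand-in for the scene generator's task-type/mode classification.
    task_type = "command_decision" if "command" in test_target else "device_control"
    interaction_mode = "voice" if fused_features.get("speech", 0.0) > 0.5 else "gesture"
    # Template defaults; claim 5 names these three difficulty knobs.
    task_params = {"task_density": 0.5, "interference_intensity": 0.2,
                   "masking_ratio": 0.1}
    return SceneDescription(task_type, interaction_mode, task_params)
```

The reinforcement learning mechanism of claim 1 would then perturb `task_params` episode by episode rather than leaving the template defaults fixed.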
Description
Man-machine interaction testing method and device for intelligent equipment
Technical Field
The application relates to the technical field of human-computer interaction, and in particular to a method and a device for testing human-computer interaction of intelligent devices.
Background
Smart devices are electronic devices that can be networked, have data computing and processing capabilities, and can sense their environment and interact intelligently with humans or other devices. In recent years, artificial intelligence technology has developed rapidly in the directions of intelligent equipment, intelligent algorithm systems, intelligent command-and-control systems, and the like, which places higher requirements on the real-time performance, accuracy, and reliability of human-computer interaction. In tasks such as intelligent algorithm verification, equipment control evaluation, and command decision analysis in particular, the testing technology must not only understand complex multi-modal information with high precision, but also carry out real and effective verification in dynamically changing scenes. At present, research on and application of human-computer interaction testing span many industries, including command-and-control testing of intelligent equipment, control and evaluation of industrial robots, and verification of intelligent traffic control systems. These systems often rely on multiple perception channels, such as voice, vision, gestures, and physiological signals, for information interaction. However, in multi-modal data processing, existing testing technologies lack the capability of unified understanding and fusion of multi-modal data, and therefore cannot fully meet the requirements that highly complex tasks place on interaction testing.
Disclosure of Invention
The application aims to provide a method and a device for testing human-computer interaction of an intelligent device. According to one aspect of the present application, a method for testing human-computer interaction of an intelligent device is provided, comprising: collecting original multi-modal data of a preset test target and performing unified processing to generate multi-modal fusion semantic features; determining a multi-task prediction intermediate result according to the multi-modal fusion semantic features; generating, in a virtual environment, a scene description of a high-fidelity interaction scene corresponding to the test target according to the multi-modal fusion semantic features and the test target; and determining and outputting a comprehensive quantification result of key evaluation indexes and a corresponding interpretable text description according to the multi-task prediction intermediate result, the scene description, and the operation data collected during scene operation.
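A minimal sketch of the collect-and-fuse step summarized above (and detailed in claim 4): per-modality feature extraction, alignment to a common dimension, and deep fusion. The simple resizing, normalization, and averaging below are assumed stand-ins for the real extraction and fusion models, and all function names are illustrative.

```python
# Illustrative sketch of collect -> extract -> align -> fuse (claim 4).
# The resize/normalize/average operations are assumptions, not the patent's models.
import numpy as np


def extract_initial_features(raw: dict) -> dict:
    """Per-modality feature extraction: map each raw signal to a vector."""
    return {m: np.asarray(x, dtype=float) for m, x in raw.items()}


def align(features: dict, dim: int = 4) -> dict:
    """Temporal/semantic consistency stand-in: project every modality to a
    common dimension and normalize, yielding the 'unified features'."""
    out = {}
    for m, v in features.items():
        v = np.resize(v, dim)           # pad/tile to the shared dimension
        norm = np.linalg.norm(v)
        out[m] = v / norm if norm else v
    return out


def fuse(aligned: dict) -> np.ndarray:
    """Deep-fusion stand-in: mean of the aligned unit vectors gives one
    multi-modal fusion semantic feature vector."""
    return np.mean(list(aligned.values()), axis=0)


raw = {"speech": [0.2, 0.8], "vision": [1.0, 0.0, 0.5, 0.5], "gesture": [0.3]}
fused = fuse(align(extract_initial_features(raw)))
```

In the patent's system this fused vector is what feeds both the native multi-modal unified large model and the scene generator.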
According to one aspect of the present application, a device for testing human-computer interaction of an intelligent device is provided, comprising: a data acquisition and processing module, configured to collect original multi-modal data of a preset test target and perform unified processing to generate multi-modal fusion semantic features; an intermediate result determining module, configured to determine a multi-task prediction intermediate result according to the multi-modal fusion semantic features; a scene generation module, configured to generate, in a virtual environment, a scene description of a high-fidelity interaction scene corresponding to the test target according to the multi-modal fusion semantic features and the test target; and an evaluation module, configured to determine and output a comprehensive quantification result of key evaluation indexes and a corresponding interpretable text description according to the multi-task prediction intermediate result, the scene description, and the operation data collected during scene operation. According to one aspect of the application, an electronic device is presented, comprising a processor and a memory storing a computer program which, when executed by the processor, causes the processor to perform the method described above. According to one aspect of the application, a non-transitory computer-readable medium is presented, having stored thereon readable instructions which, when executed by a processor, cause the processor to perform the method described above. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Beneficial effects: according to the embodiments provided by the application, the original multi-modal data generated during interaction are collected and uniformly processed to generate the fusion semantic features, which overcomes the limitation of traditional single-modality testing and allows the complex information of the real interaction process to be captured.
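The reinforcement learning mechanism of claim 5, which tunes the custom difficulty (task density, interference intensity, information masking ratio) from the comprehensive quantification result, can be sketched as a one-step adjustment rule. The target score, step size, and hill-climbing policy below are illustrative assumptions, not the patent's actual learning algorithm.

```python
# Sketch of the claim-5 difficulty-adjustment loop. The target score, step
# size, and simple hill-climbing policy are assumptions for illustration only.
def rl_adjust(difficulty: dict, composite_score: float,
              target: float = 0.7, step: float = 0.05) -> dict:
    """Raise every difficulty knob when the tester scores above the target,
    lower them when below, clamping each knob to [0, 1]."""
    delta = step if composite_score > target else -step
    return {k: min(1.0, max(0.0, v + delta)) for k, v in difficulty.items()}


difficulty = {"task_density": 0.5, "interference_intensity": 0.3,
              "masking_ratio": 0.2}
# High performance: the scene becomes harder on the next episode.
difficulty = rl_adjust(difficulty, composite_score=0.9)
```

Iterating this rule per test episode yields the adaptive-difficulty scene descriptions that claims 1 and 6 describe.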