
CN-121979969-A - Interaction method and device with machine model, electronic equipment and storage medium

CN121979969A

Abstract

The application relates to an interaction method and device for a machine model, an electronic device, and a storage medium. The method comprises: collecting multi-modal information associated with a user, wherein the multi-modal information represents information associated with the environment where the user is located and the user's own attributes; normalizing all information in the multi-modal information to obtain corresponding feature vectors; extracting image space features and time sequence features from the feature vectors and determining an interaction modality of the user with the machine model based on those features, wherein the image space features represent illumination intensity and scene complexity, and the time sequence features represent noise level; and extracting multi-modal emotion features from the feature vectors and determining the interaction response of the machine model to the user by combining the multi-modal emotion features with the historical interaction voice between the user and the machine model. The application solves the problem of the poor companionship effect of companion robots in the prior art.

Inventors

  • Request for anonymity
  • CHEN FANGXIONG

Assignees

  • 深圳市广和通无线股份有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-10

Claims (10)

  1. A method of interacting with a machine model, comprising: acquiring multi-modal information associated with a user, wherein the multi-modal information characterizes information associated with the environment where the user is located and the user's own attributes; normalizing all information in the multi-modal information to obtain corresponding feature vectors; extracting image space features and time sequence features from the feature vectors, and determining an interaction modality of the user with the machine model based on the image space features and the time sequence features, wherein the image space features characterize illumination intensity and scene complexity, and the time sequence features characterize noise level; and extracting multi-modal emotion features from the feature vectors, and determining the interaction response of the machine model to the user by combining the multi-modal emotion features with the historical interaction voice of the user with the machine model.
  2. The method of claim 1, wherein collecting multi-modal information associated with a user comprises: acquiring facial expressions, limb actions, and two-dimensional visual characteristics of environmental objects of the user through a camera to obtain corresponding image data, and generating corresponding visual modality information from the image data; generating corresponding depth visual modality information from skeletal point coordinates representing the user's limb actions, facial depth profiles, and point cloud representations of environmental objects acquired by a depth sensor; generating corresponding auditory modality information from user voice and environmental sound collected by a microphone array; generating corresponding tactile modality information according to user touch information detected by a touch sensor; and generating corresponding environmental physical modality information according to illumination intensity monitored by an ambient light sensor.
  3. The method according to claim 2, wherein normalizing all information in the multi-modal information to obtain corresponding feature vectors comprises: determining the mean value of each modality's information in the multi-modal information; and determining the difference between each modality's information and the corresponding mean value, and determining the ratio of that difference to the standard deviation as the feature vector corresponding to the modality information.
  4. The method of claim 1, wherein determining an interaction modality of the user with the machine model based on the image space features and the time sequence features comprises: separately pooling the image space features and the time sequence features; fusing the pooled image space features and time sequence features through an attention mechanism to obtain an intermediate environment description vector; mapping the intermediate environment description vector to three components through a regression function, the three components being illumination intensity, scene complexity, and noise level, respectively; comparing the sum obtained by multiplying the three components by corresponding preset weights with an interaction modality threshold; and determining the interaction modality according to the comparison result, wherein the interaction modality represents the form of interaction between the machine model and the user.
  5. The method of claim 1, wherein extracting multi-modal emotion features from the feature vectors comprises: extracting target vectors associated with the user's facial expression, voice intonation, and limb actions from the feature vectors, and determining all the target vectors as the multi-modal emotion features.
  6. The method of claim 5, wherein determining the interaction response of the machine model to the user by combining the multi-modal emotion features and the historical interaction voice of the user with the machine model comprises: determining the value of each target vector in the multi-modal emotion features; comparing the sum obtained by multiplying each value by a corresponding weight against a preset mapping table to determine the emotion type of the user, wherein the preset mapping table represents the mapping relationship between emotion types and values; and determining the interaction response by combining the emotion type and the historical interaction voice.
  7. The method according to claim 1, further comprising: continuously optimizing and adjusting the machine model according to the user's satisfaction with the interaction response and the interaction duration of the interaction response.
  8. An interaction apparatus for interacting with a machine model, comprising: an acquisition module configured to collect multi-modal information associated with a user, wherein the multi-modal information characterizes information associated with the environment where the user is located and the user's own attributes; a first processing module configured to normalize all information in the multi-modal information to obtain corresponding feature vectors; a second processing module configured to extract image space features and time sequence features from the feature vectors and to determine an interaction modality of the user with the machine model based on the image space features and the time sequence features, wherein the image space features characterize illumination intensity and scene complexity, and the time sequence features characterize noise level; and a third processing module configured to extract multi-modal emotion features from the feature vectors and to determine the interaction response of the machine model to the user by combining the multi-modal emotion features with the historical interaction voice of the user with the machine model.
  9. An electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method of interacting with a machine model according to any one of claims 1-7 when executing the computer program.
  10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the method of interacting with a machine model according to any one of claims 1-7.
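The per-modality normalization in claim 3 is a standard z-score: subtract the mean of a modality's readings, then divide by the standard deviation. A minimal sketch, assuming each modality arrives as a one-dimensional array of raw sensor readings; the modality names and sample values below are illustrative, not taken from the patent:

```python
import numpy as np

def normalize_modalities(modalities):
    """Z-score-normalize each modality's raw readings into a feature
    vector: (value - mean) / standard deviation, per claim 3."""
    features = {}
    for name, values in modalities.items():
        values = np.asarray(values, dtype=float)
        mean = values.mean()
        std = values.std()
        # Guard against constant signals, where the deviation is zero.
        features[name] = (values - mean) / std if std > 0 else np.zeros_like(values)
    return features

# Hypothetical raw readings from sensors like those listed in claim 2.
raw = {
    "ambient_light": [120.0, 130.0, 110.0, 140.0],
    "audio_level":   [0.2, 0.4, 0.3, 0.5],
}
vectors = normalize_modalities(raw)
```

After normalization, each modality's vector has zero mean and unit variance, which keeps later fusion steps from being dominated by whichever sensor happens to have the largest raw scale.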

Description

Interaction method and device with machine model, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a method and apparatus for interacting with a machine model, an electronic device, and a storage medium.

Background

Currently, companion robot technology focuses on multi-modal perception and interaction, aiming at improving the naturalness and intelligence of communication with users. However, current robots adapt poorly to environmental changes (such as noise, illumination, and cluttered scenes), are easily disturbed in voice recognition and visual perception, and lack the capability to dynamically switch interaction modalities, so interaction reliability is low. Moreover, current emotion recognition by robots is limited to basic emotions; it is difficult to capture complex emotions or to infer their causes from context, so responses remain superficial and lack empathy. Therefore, robots in the prior art have a poor companionship effect and cannot meet users' requirements. No effective solution to these technical problems currently exists.

Disclosure of Invention

The application provides an interaction method and device with a machine model, an electronic device, and a storage medium, and aims to solve the problem that companion robots in the prior art have a poor companionship effect.
In a first aspect, the application provides an interaction method with a machine model, comprising: collecting multi-modal information associated with a user, wherein the multi-modal information represents information associated with the environment where the user is located and the user's own attributes; normalizing all information in the multi-modal information to obtain corresponding feature vectors; extracting image space features and time sequence features from the feature vectors and determining an interaction modality of the user with the machine model based on them, wherein the image space features represent illumination intensity and scene complexity, and the time sequence features represent noise level; and extracting multi-modal emotion features from the feature vectors and determining the interaction response of the machine model to the user by combining the multi-modal emotion features with the historical interaction voice of the user with the machine model.
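The modality-selection step above (pooling, attention fusion, regression to three components, weighted comparison against a threshold) can be sketched as follows. This is an illustrative stand-in, not the patented implementation: mean pooling substitutes for the unspecified pooling, a norm-based softmax substitutes for the attention mechanism, a fixed random projection substitutes for the trained regression function, and the weights, threshold, and modality names are all hypothetical:

```python
import numpy as np

def choose_interaction_modality(spatial_feats, temporal_feats,
                                weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Decide the interaction modality from image-space and time-sequence
    features (both 2-D arrays with the same feature dimensionality)."""
    # 1. Pool each feature stream (mean pooling as a stand-in).
    s = spatial_feats.mean(axis=0)
    t = temporal_feats.mean(axis=0)

    # 2. Fuse via a simple attention mechanism: a softmax over the two
    #    pooled vectors' magnitudes weights their contributions.
    scores = np.array([np.linalg.norm(s), np.linalg.norm(t)])
    attn = np.exp(scores) / np.exp(scores).sum()
    env = attn[0] * s + attn[1] * t  # intermediate environment descriptor

    # 3. Map the descriptor to three components (illumination intensity,
    #    scene complexity, noise level). A trained regressor would
    #    replace this fixed random projection in practice.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, env.size))
    illumination, complexity, noise = 1 / (1 + np.exp(-W @ env))

    # 4. Compare the weighted sum of the components to the threshold.
    score = np.dot(weights, [illumination, complexity, noise])
    return "voice" if score < threshold else "touch_or_visual"

# Example: 4 spatial frames and 6 time steps of 8-dimensional features.
mode = choose_interaction_modality(np.ones((4, 8)), np.zeros((6, 8)))
```

The key design point is that the decision is made from the fused environment descriptor rather than from any single sensor, so a noisy or dark scene can push the robot toward a more robust interaction form.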
In a second aspect, the application provides an interaction device with a machine model, comprising an acquisition module, a first processing module, a second processing module, and a third processing module. The acquisition module is used for collecting multi-modal information associated with a user, the multi-modal information representing information associated with the environment where the user is located and the user's own attributes. The first processing module is used for normalizing all information in the multi-modal information to obtain corresponding feature vectors. The second processing module is used for extracting image space features and time sequence features from the feature vectors and determining an interaction modality of the user with the machine model based on those features, the image space features representing illumination intensity and scene complexity and the time sequence features representing noise level. The third processing module is used for extracting multi-modal emotion features from the feature vectors and determining the interaction response of the machine model to the user by combining the multi-modal emotion features with the historical interaction voice of the user with the machine model.

In a third aspect, the application provides an electronic device comprising at least one communication interface, at least one bus connected to the at least one communication interface, at least one processor connected to the at least one bus, and at least one memory connected to the at least one bus, wherein the processor is configured to perform the method of interacting with a machine model according to the first aspect of the application.

In a fourth aspect, the present application also provides a computer storage medium storing computer-executable instructions for performing the method of interacting with a machine model according to the first aspect of the present application.
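The third processing module's emotion step (claim 6) reduces each target vector to a value, takes a weighted sum, and looks the result up in a preset mapping table. A minimal sketch, where the per-vector "value" is taken to be its mean and the table entries, weights, and emotion labels are hypothetical:

```python
def classify_emotion(target_vectors, weights, mapping):
    """Weighted-sum emotion lookup: each target vector (e.g. facial
    expression, voice intonation, limb action) contributes its mean,
    weighted, and the total is matched against interval upper bounds
    in a preset (upper_bound, emotion) mapping table."""
    score = sum(w * sum(v) / len(v) for w, v in zip(weights, target_vectors))
    for upper_bound, emotion in mapping:
        if score <= upper_bound:
            return emotion
    return mapping[-1][1]  # fall back to the last entry

# Hypothetical mapping table: (upper bound of weighted score, emotion type).
table = [(0.3, "calm"), (0.6, "happy"), (1.0, "excited")]
emotion = classify_emotion(
    target_vectors=[[0.2, 0.4], [0.5, 0.5], [0.1, 0.3]],
    weights=[0.5, 0.3, 0.2],
    mapping=table,
)
```

Here the weighted score is 0.5 * 0.3 + 0.3 * 0.5 + 0.2 * 0.2 = 0.34, which falls in the (0.3, 0.6] interval of the table, so the classifier returns "happy"; the resulting emotion type would then be combined with the historical interaction voice to shape the response.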
Compared with the prior art, the method provided by the embodiment of the application first collects the multi-modal information associated with the user, the multi-modal information characterizing the information associated with the environment where the user is located and the user's own attributes, then normalizes all the information in the multi-modal information to obtain the corresponding feature vectors, and then extracts the image space features