CN-121983037-A - Man-machine interaction method, device, electronic equipment, storage medium and program product
Abstract
The application relates to the technical field of human-machine interaction, and discloses a human-machine interaction method, a device, an electronic device, a storage medium, and a program product. The method comprises: obtaining an original voice signal of a driver; determining the intent category of a target voice text according to the original voice signal, wherein the target voice text is the recognition result of the original voice signal; extracting key information from the target voice text; determining a user profile of the driver according to the driver's historical operation data, the key information, and the intent category; and outputting a multi-modal feedback signal to the driver according to the user profile. The technical scheme of the application addresses the problems that the speech recognition accuracy of AI customer service is low and that the driver's personalized instructions cannot be answered efficiently, improving both the speech recognition accuracy and the response speed to the driver's personalized instructions.
Inventors
- XUE YANWEN
Assignees
- 北京白龙马云行科技有限公司
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-17
Claims (10)
- 1. A human-machine interaction method, characterized in that the method comprises: acquiring an original voice signal of a driver; determining the intent category of a target voice text according to the original voice signal of the driver, wherein the target voice text is the recognition result of the original voice signal; extracting key information from the target voice text; determining a user profile of the driver according to the driver's historical operation data, the key information, and the intent category; and outputting a multi-modal feedback signal to the driver according to the user profile.
- 2. The method of claim 1, wherein determining the intent category of the target voice text according to the original voice signal of the driver comprises: preprocessing the original voice signal to obtain a target voice signal; performing voice recognition on the target voice signal to obtain the target voice text and a recognition confidence; and determining the intent category of the target voice text according to the target voice text and the recognition confidence.
- 3. The method of claim 2, wherein preprocessing the original voice signal to obtain the target voice signal comprises: extracting effective voice content from the original voice signal to obtain a first voice signal; performing noise reduction on the first voice signal to obtain a second voice signal; performing signal enhancement on the second voice signal to obtain a third voice signal; and segmenting the third voice signal into the target voice signal according to a preset voice length.
- 4. The method of claim 2, wherein performing voice recognition on the target voice signal to obtain the target voice text and the recognition confidence comprises: converting acoustic features of the target voice signal into a time-varying probability sequence over basic pronunciation units based on a first acoustic model, wherein the first acoustic model is used to compute the probability that an audio signal is recognized as a basic pronunciation unit and is built from the driver's historical voice data; converting the probability sequence into a first text sequence according to preset grammatical and semantic rules based on a first language model, wherein the first language model is used to convert the probabilities of basic pronunciation units into corresponding text and is obtained by optimizing a locally deployed preset voice recognition model with the driver's historical voice data; taking the first text sequence as the target voice text; and determining the recognition confidence according to the probability sequence corresponding to the first text sequence.
- 5. The method of claim 2, wherein determining the intent category of the target voice text according to the target voice text and the recognition confidence comprises: judging whether the recognition confidence is greater than a confidence threshold; and if the recognition confidence is greater than the confidence threshold, performing intent classification on the target voice text based on a natural language model to obtain the intent category of the target voice text.
- 6. The method of claim 1, wherein extracting key information from the target voice text comprises: extracting candidate texts of preset types from the target voice text through a natural language model; if a candidate text is of an entity type, adding the candidate text to the key information; and if a candidate text is of a time type, converting the candidate text into an intermediate text in a time format and adding the intermediate text to the key information.
- 7. The method of claim 1, wherein outputting a multi-modal feedback signal to the driver according to the user profile comprises: determining, according to the user profile, the information receiving type of the user and a target output text for the current scene; if the information receiving type is audio, generating a voice reply signal for the target output text according to the complexity of the target output text, and controlling a loudspeaker to play the voice reply signal; and if the information receiving type is vibration, determining the target urgency degree of the target output text, and outputting a vibration signal whose intensity corresponds to the target urgency degree.
- 8. A human-machine interaction device, the device comprising: an acquisition module, used for acquiring an original voice signal of a driver; an intent determining module, used for determining the intent category of a target voice text according to the original voice signal of the driver, wherein the target voice text is the recognition result of the original voice signal; a key information extraction module, used for extracting key information from the target voice text; a user profile determining module, used for determining a user profile of the driver according to the driver's historical operation data, the key information, and the intent category; and a feedback module, used for outputting a multi-modal feedback signal to the driver according to the user profile.
- 9. An electronic device, comprising: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the human-machine interaction method of any of claims 1 to 7.
- 10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the human-machine interaction method of any one of claims 1 to 7.
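The modality selection of claim 7 could be sketched as follows. The profile keys, the length threshold used as a proxy for "complexity of the target output text", and the urgency keywords are all illustrative assumptions, not details taken from the patent:

```python
def select_feedback(profile, text):
    """Claim-7 sketch: choose a feedback modality from the user profile.
    All keys, thresholds, and keywords below are illustrative."""
    mode = profile.get("receiving_type")
    if mode == "audio":
        # Use reply length as a crude proxy for output-text complexity.
        return {"modality": "audio", "text": text,
                "rate": "slow" if len(text) > 80 else "normal"}
    if mode == "vibration":
        # Map an estimated urgency of the text onto vibration intensity.
        urgent = any(w in text for w in ("urgent", "immediately", "now"))
        return {"modality": "vibration", "intensity": 3 if urgent else 1}
    return {"modality": "text", "text": text}  # fallback for unknown profiles
```

A production system would drive a text-to-speech engine and a haptic actuator from these dictionaries; the sketch only shows the branching the claim describes.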
Description
Man-machine interaction method, device, electronic equipment, storage medium and program product

Technical Field

The application relates to the technical field of human-machine interaction, and in particular to a human-machine interaction method, a human-machine interaction device, an electronic device, a storage medium, and a program product.

Background

In recent years, with the development of technologies such as artificial intelligence (AI) and speech recognition, more and more platforms have begun using artificial intelligence to improve customer-service efficiency and service quality. In interactions between a driver and an AI customer service, however, the AI customer service suffers from low speech recognition accuracy and cannot respond efficiently to the driver's personalized instructions.

Disclosure of Invention

The application provides a human-machine interaction method, a human-machine interaction device, an electronic device, a storage medium, and a program product, which are intended to solve the problems that, in the interaction between a driver and an AI customer service, the AI customer service has low speech recognition accuracy and cannot respond efficiently to the driver's personalized instructions.

In a first aspect, the present application provides a human-machine interaction method, which includes: acquiring an original voice signal of a driver; determining the intent category of a target voice text according to the original voice signal, wherein the target voice text is the recognition result of the original voice signal; extracting key information from the target voice text; determining a user profile of the driver according to the driver's historical operation data, the key information, and the intent category; and outputting a multi-modal feedback signal to the driver according to the user profile.
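The five steps of the first aspect can be sketched end to end. Every helper below is a hypothetical toy stand-in for the trained acoustic, language, and NLU models the application actually describes:

```python
# Hypothetical stand-ins for each claimed step; a real system would use
# trained models, not these toy rules.
def preprocess(signal):
    # Crude voice-activity stand-in for the claim-3 preprocessing chain.
    return [s for s in signal if abs(s) > 0.01]

def recognize(signal):
    # Returns (target voice text, recognition confidence).
    return ("navigate home", 0.9) if signal else ("", 0.0)

def classify_intent(text, confidence, threshold=0.5):
    if confidence <= threshold:
        return "unknown"
    return "navigation" if "navigate" in text else "other"

def extract_key_info(text):
    # Keep everything except toy "function words" as key information.
    return [w for w in text.split() if w not in {"navigate", "to"}]

def build_profile(history, key_info, intent):
    return {"history": history, "key_info": key_info, "intent": intent}

def interact(raw_signal, history):
    """End-to-end flow of the first aspect: signal -> intent -> key info
    -> user profile -> multi-modal feedback."""
    text, confidence = recognize(preprocess(raw_signal))
    intent = classify_intent(text, confidence)
    profile = build_profile(history, extract_key_info(text), intent)
    return {"modality": "audio", "profile": profile}  # feedback placeholder
```

The point of the sketch is only the data flow between the claimed steps; each stand-in is expanded in later embodiments.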
In an alternative embodiment, determining the intent category of the target voice text according to the original voice signal of the driver includes: preprocessing the original voice signal to obtain a target voice signal; performing voice recognition on the target voice signal to obtain the target voice text and a recognition confidence; and determining the intent category of the target voice text according to the target voice text and the recognition confidence.

In an alternative embodiment, preprocessing the original voice signal to obtain the target voice signal includes: extracting effective voice content from the original voice signal to obtain a first voice signal; performing noise reduction on the first voice signal to obtain a second voice signal; performing signal enhancement on the second voice signal to obtain a third voice signal; and segmenting the third voice signal into the target voice signal according to a preset voice length.

In an alternative embodiment, performing voice recognition on the target voice signal to obtain the target voice text and the recognition confidence includes: converting acoustic features of the target voice signal into a time-varying probability sequence over basic pronunciation units based on a first acoustic model, wherein the first acoustic model is used to compute the probability that an audio signal is recognized as a basic pronunciation unit and is built from the driver's historical voice data; converting the probability sequence into a first text sequence according to preset grammatical and semantic rules based on a first language model, wherein the first language model is used to convert the probabilities of basic pronunciation units into corresponding text and is obtained by optimizing a locally deployed preset voice recognition model with the driver's historical voice data; taking the first text sequence as the target voice text; and determining the recognition confidence according to the probability sequence corresponding to the first text sequence.

In an alternative embodiment, determining the intent category of the target voice text according to the target voice text and the recognition confidence includes: judging whether the recognition confidence is greater than a confidence threshold; and if the recognition confidence is greater than the confidence threshold, performing intent classification on the target voice text based on a natural language model to obtain the intent category of the target voice text.

In an alternative embodiment, extracting key information from the target voice text includes: extracting candidate texts of preset types from the target voice text through a natural language model; if a candidate text is of an entity type, adding the candidate text to the key information; and if a candidate text is of a time type, converting the candidate text into an intermediate text in a time format and adding the intermediate text to the key information.

In an alternative embodiment, outputting
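The preprocessing chain of the embodiment above (effective-content extraction, noise reduction, signal enhancement, fixed-length segmentation) might look like this minimal sketch. The energy threshold, the moving-average filter, and peak normalization are illustrative choices standing in for whichever VAD, denoising, and enhancement methods an implementation actually uses:

```python
def preprocess(signal, frame=4, seg_len=8):
    """Claim-3 pipeline sketch: VAD -> denoise -> enhance -> segment.
    All thresholds and window sizes are illustrative."""
    # 1. Effective-content extraction: drop frames with near-zero mean energy.
    frames = [signal[i:i + frame] for i in range(0, len(signal), frame)]
    voiced = [x for f in frames
              if sum(abs(s) for s in f) / len(f) > 0.05 for x in f]
    # 2. Noise reduction: 3-point moving average (first voice signal -> second).
    denoised = [sum(voiced[max(0, i - 1):i + 2]) /
                len(voiced[max(0, i - 1):i + 2])
                for i in range(len(voiced))]
    # 3. Signal enhancement: peak normalization to [-1, 1] (-> third signal).
    peak = max((abs(s) for s in denoised), default=1.0) or 1.0
    enhanced = [s / peak for s in denoised]
    # 4. Segmentation into target voice signals of a preset length.
    return [enhanced[i:i + seg_len] for i in range(0, len(enhanced), seg_len)]
```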
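One plausible reading of how the recognition confidence is derived from the probability sequence, together with the claim-5 threshold check, is sketched below. The patent fixes neither the confidence formula nor the classifier, so the geometric mean and the keyword rule are assumptions:

```python
import math

def decode_with_confidence(prob_seq, units):
    """Pick the most likely pronunciation unit per time step, and take the
    geometric mean of the winning probabilities as the recognition
    confidence (one plausible reading of claim 4; the patent does not fix
    the exact formula)."""
    picks, log_p = [], 0.0
    for step in prob_seq:                       # step: one probability per unit
        best = max(range(len(step)), key=step.__getitem__)
        picks.append(units[best])
        log_p += math.log(step[best])
    return "".join(picks), math.exp(log_p / len(prob_seq))

def intent_category(text, confidence, threshold=0.7):
    """Claim 5: classify only when the confidence clears the threshold;
    the keyword rule stands in for the claimed natural language model."""
    if confidence <= threshold:
        return "low-confidence"                 # e.g. ask the driver to repeat
    return "greeting" if "hao" in text else "other"
```

A real decoder would search over whole unit sequences (e.g. beam search) rather than taking per-step maxima, but the confidence bookkeeping is the same.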
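The key-information extraction of claim 6 distinguishes entity-type candidates (kept verbatim) from time-type candidates (normalized to a time format). A sketch under the assumption that an upstream NLU model has already emitted (text, type) pairs, with HH:MM as the hypothetical intermediate format:

```python
import re

def extract_key_info(candidates):
    """Claim-6 sketch: candidates are (text, type) pairs as a hypothetical
    NLU model might emit them; time expressions are normalized to HH:MM."""
    key_info = []
    for text, kind in candidates:
        if kind == "entity":
            key_info.append(text)               # entities kept verbatim
        elif kind == "time":
            # Toy normalizer for "3 pm" / "9:30 am" style expressions.
            m = re.match(r"(\d{1,2})(?::(\d{2}))?\s*(am|pm)?", text.lower())
            if m:
                hour = int(m.group(1)) % 12 + (12 if m.group(3) == "pm" else 0)
                key_info.append(f"{hour:02d}:{m.group(2) or '00'}")
    return key_info
```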