CN-121983048-A - Voice interaction method and device based on graphic element context information and time sequence alignment

Abstract

The application relates to the technical field of human-computer interaction and artificial intelligence, and in particular discloses a voice interaction method and device based on graphic element context information and time sequence alignment. When an interaction trigger event is detected, user voice is collected in real time and the context information of the target graphic element is obtained. Once the time sequence overlap rate reaches the preset threshold, the voice text and the context information are semantically bound to generate fusion data. A target agent processes the fusion data to construct a prompt word, which is passed to a large language model to generate a structured interaction response instruction; the instruction is then executed to complete the interaction. By acquiring the context information of the target graphic element in real time, the method addresses the missing-context problem of traditional voice interaction. When the user triggers an interaction event, the context information and semantic environment of the target graphic element can be accurately identified, so that the voice command is associated with the interaction focus, a high-quality interaction response prompt word is constructed, and an accurate interaction response instruction is generated, improving interaction efficiency and accuracy.

Inventors

WEI RONGJIE

Assignees

深圳市瑞尔麦斯科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260202

Claims (10)

1. A voice interaction method based on primitive context information and time sequence alignment, comprising: when an interaction triggering event is detected, collecting user voice and obtaining context information of a target primitive corresponding to the interaction triggering event; when the time sequence overlapping rate of the user voice and the interaction triggering event is greater than or equal to a preset overlapping rate threshold, performing semantic binding on a voice text corresponding to the user voice and the context information to obtain fusion data; processing the fusion data based on a target agent to obtain an interaction response prompt word; analyzing the interaction response prompt word based on a large language model to generate an interaction response instruction; and executing the interaction response instruction and displaying an interaction response result to the user.
2. The voice interaction method based on primitive context information and time sequence alignment according to claim 1, wherein the obtaining context information of a target primitive corresponding to the interaction triggering event comprises: acquiring an information structure tree of the window in which the target primitive is located, traversing the information structure tree, and extracting node information to obtain first context information and a structure tree traversal result; when the structure tree traversal result indicates a node with missing information, cropping the trigger area corresponding to the interaction triggering event to obtain an image to be recognized; performing image recognition and text recognition on the image to be recognized based on a target detection model and a text recognition model, respectively, to obtain second context information; and integrating the first context information and the second context information to generate the context information.
3. The voice interaction method based on primitive context information and time sequence alignment according to claim 1, wherein before the processing the fusion data based on the target agent to obtain the interaction response prompt word, the method further comprises: extracting the primitive type of the target primitive from the context information, and performing semantic feature extraction on the voice text to obtain a semantic feature vector; matching against a preset agent set based on the primitive type to obtain an initial agent list; and performing similarity matching between the semantic feature vector and the intent feature vector of each agent in the initial agent list to obtain the target agent.
4. The voice interaction method based on primitive context information and time sequence alignment according to claim 1, wherein the processing the fusion data based on the target agent to obtain the interaction response prompt word comprises: processing the fusion data based on the target agent to generate a search vector; searching a preset database according to the search vector to obtain at least one item of interaction association data and a target prompt word template; and filling the target prompt word template based on the fusion data and the interaction association data to generate the interaction response prompt word.
5. The voice interaction method based on primitive context information and time sequence alignment according to claim 1, wherein before the processing the fusion data based on the target agent to obtain the interaction response prompt word, the method further comprises: performing sensitive data identification on the fusion data to obtain data to be desensitized, and desensitizing the data to be desensitized to obtain desensitized fusion data; when the data to be desensitized comprises a sensitive control, triggering identity authentication to authenticate the user and obtain an identity authentication result; and when the identity authentication result is a pass, transmitting the desensitized fusion data to the target agent.
6. The voice interaction method based on primitive context information and time sequence alignment according to claim 1, wherein the executing the interaction response instruction and displaying an interaction response result to the user comprises: matching an interaction display template based on the interaction response instruction, and determining response content and rendering parameters of a user interface; and rendering based on the interaction display template, the response content and the rendering parameters to obtain and display a response content display interface.
7. The voice interaction method based on primitive context information and time sequence alignment according to any one of claims 1 to 6, wherein before the performing semantic binding on the voice text corresponding to the user voice and the context information to obtain fusion data when the time sequence overlapping rate of the user voice and the interaction triggering event is greater than or equal to the preset overlapping rate threshold, the method further comprises: timestamping the interaction triggering event and the user voice based on a monotonic clock, and determining a time sequence overlapping period according to the timestamps; and obtaining the time sequence overlapping rate based on the time sequence overlapping period and the total voice duration of the user voice.
8. A voice interaction device based on primitive context information and time sequence alignment, comprising: an information acquisition module, configured to collect user voice and obtain context information of a target primitive corresponding to an interaction triggering event when the interaction triggering event is detected; an information fusion module, configured to perform semantic binding on a voice text corresponding to the user voice and the context information to obtain fusion data when the time sequence overlapping rate of the user voice and the interaction triggering event is greater than or equal to a preset overlapping rate threshold; a prompt word obtaining module, configured to process the fusion data based on a target agent to obtain an interaction response prompt word; an instruction generation module, configured to analyze the interaction response prompt word based on a large language model to generate an interaction response instruction; and an instruction execution module, configured to execute the interaction response instruction and display an interaction response result to a user.
9. A computer device, comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program and, when executing the computer program, implement the voice interaction method based on primitive context information and time sequence alignment according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the voice interaction method based on primitive context information and time sequence alignment according to any one of claims 1 to 7.
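
As an illustration of the overlap-rate calculation recited in claim 7, the following is a minimal Python sketch, not the applicant's implementation. The `Interval` type, the function names, and the 0.6 threshold are assumptions made for the example; the claims only require timestamps from a monotonic clock, an overlapping period, the total voice duration, and a preset threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A time span in seconds, e.g. captured with time.monotonic()."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        # Guard against inverted spans so the duration is never negative.
        return max(0.0, self.end - self.start)


def overlap_rate(voice: Interval, trigger: Interval) -> float:
    """Overlapping period divided by the total voice duration."""
    if voice.duration == 0.0:
        return 0.0
    overlap = min(voice.end, trigger.end) - max(voice.start, trigger.start)
    return max(0.0, overlap) / voice.duration


def should_bind(voice: Interval, trigger: Interval, threshold: float = 0.6) -> bool:
    """True when the overlap rate reaches the (assumed) preset threshold."""
    return overlap_rate(voice, trigger) >= threshold
```

For example, a voice span from 10.0 s to 12.0 s and a trigger span from 10.5 s to 12.5 s overlap for 1.5 s, giving a rate of 0.75, so the two would be bound under the assumed threshold. Spans that do not overlap give a rate of 0 and are not bound.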
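
The two-stage agent selection recited in claim 3 can be sketched in the same hedged way. The `Agent` structure, the agent names, and the use of cosine similarity are assumptions for illustration only; the claim states that candidates are first matched by primitive type and then by similarity between the semantic feature vector and each agent's intent feature vector, without naming a similarity measure.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Agent:
    """Hypothetical agent record: supported primitive types and an intent vector."""
    name: str
    primitive_types: frozenset
    intent_vector: tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_agent(primitive_type: str,
                 semantic_vector: Sequence[float],
                 agents: Sequence[Agent]) -> Optional[Agent]:
    # Stage 1: keep only agents that handle this primitive type.
    candidates = [a for a in agents if primitive_type in a.primitive_types]
    if not candidates:
        return None
    # Stage 2: pick the candidate whose intent vector is most similar.
    return max(candidates, key=lambda a: cosine(semantic_vector, a.intent_vector))
```

Filtering by primitive type first keeps the similarity comparison limited to agents that can act on the selected element, which is the ordering the claim describes.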

Description

Voice interaction method and device based on graphic element context information and time sequence alignment

Technical Field

The application relates to the technical field of human-computer interaction and artificial intelligence, and in particular to a voice interaction method and device based on graphic element context information and time sequence alignment.

Background

Current mainstream voice interaction systems on intelligent terminals rely mainly on a global wake-up mechanism: the user activates a voice assistant with a wake-up word and then speaks a voice command. Such systems can only process global commands, and most adopt a serial "voice wake-up first, touch trigger later" mode that lacks a synchronous cooperation mechanism. Existing interaction modes therefore lack context awareness and have low multi-modal fusion precision: the user's voice command cannot be associated with the user's operation, there is no time sequence alignment mechanism, operation is cumbersome and error-prone, and unrelated touch events and voice commands are easily bound together by mistake, resulting in low interaction efficiency and accuracy. How to improve interaction efficiency and accuracy has thus become a problem to be solved.

Disclosure of Invention

The application provides a voice interaction method and device based on graphic element context information and time sequence alignment, so as to improve interaction efficiency and accuracy.
In a first aspect, the present application provides a voice interaction method based on primitive context information and time sequence alignment, the method comprising: when an interaction triggering event is detected, collecting user voice and obtaining context information of a target primitive corresponding to the interaction triggering event; when the time sequence overlapping rate of the user voice and the interaction triggering event is greater than or equal to a preset overlapping rate threshold, performing semantic binding on a voice text corresponding to the user voice and the context information to obtain fusion data; processing the fusion data based on a target agent to obtain an interaction response prompt word; analyzing the interaction response prompt word based on a large language model to generate an interaction response instruction; and executing the interaction response instruction and displaying an interaction response result to the user.
In a second aspect, the present application further provides a voice interaction device based on primitive context information and time sequence alignment, the device comprising: an information acquisition module, configured to collect user voice and obtain context information of a target primitive corresponding to an interaction triggering event when the interaction triggering event is detected; an information fusion module, configured to perform semantic binding on a voice text corresponding to the user voice and the context information to obtain fusion data when the time sequence overlapping rate of the user voice and the interaction triggering event is greater than or equal to a preset overlapping rate threshold; a prompt word obtaining module, configured to process the fusion data based on a target agent to obtain an interaction response prompt word; an instruction generation module, configured to analyze the interaction response prompt word based on a large language model to generate an interaction response instruction; and an instruction execution module, configured to execute the interaction response instruction and display an interaction response result to a user.
In a third aspect, the present application further provides a computer device comprising a memory and a processor, the memory being configured to store a computer program, and the processor being configured to execute the computer program and, when executing the computer program, implement the voice interaction method based on primitive context information and time sequence alignment described above.
In a fourth aspect, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the voice interaction method based on primitive context information and time sequence alignment described above.
The application discloses a voice interaction method and device based on graphic element context information and time sequence alignment, which are used for collecting user voice and obtaining context information of a target graphic element corresponding to an interaction triggering event when the interaction triggering event is detected; when the time sequence overlapping rate of the user voice and the interaction triggering event is greater than or equal to a preset overlapping rate threshold, carrying out semantic binding on a voice text corresponding to the user voice and the context i