
CN-122019714-A - Multi-mode scene interaction control method and device based on pre-generated resource map


Abstract

The invention provides a multi-modal scene interaction control method and device based on a pre-generated resource map, belonging to the technical field of computer interaction. The method loads a pre-constructed scene interaction map built from pre-generated resources, which reduces the high latency of real-time generation during training: intent recognition takes only milliseconds, and video playback is instantaneous and highly immersive. At the same time, because the progression between training nodes is strictly constrained by the directed edges of the map, the risk of incorrect guidance caused by hallucinations of the artificial intelligence model is reduced, so that high-fidelity immersion is maintained while teaching/training logic is strictly followed.
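As a rough illustration of the map structure described above, the following minimal sketch shows one possible in-memory representation of state nodes, directed edges and their pre-generated resource identifiers; all class and field names are assumptions made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class DirectedEdge:
    """Transition whose traversal is gated by a preset standard intent."""
    target_node_id: str
    standard_intent: str                              # label of the preset standard intent
    reference_utterances: list[str]                   # reference corpus for similarity matching
    keywords: set[str] = field(default_factory=set)   # strong-association keyword set

@dataclass
class StateNode:
    """Scene state bound to a pre-generated visual resource."""
    node_id: str
    visual_resource_id: str                           # e.g. CDN URL of a pre-rendered clip
    out_edges: list[DirectedEdge] = field(default_factory=list)

@dataclass
class SceneInteractionMap:
    """Pre-constructed graph loaded at session start; a current-state pointer walks it."""
    nodes: dict[str, StateNode]
    start_node_id: str
```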

Inventors

  • Cheng Yuan
  • Yang Zhi
  • Hu Shenan

Assignees

  • 上海凌极信息技术有限公司 (Shanghai Lingji Information Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-01-22

Claims (10)

  1. A multi-modal scene interaction control method based on a pre-generated resource map, characterized by comprising the following steps: loading a pre-constructed scene interaction map, wherein the scene interaction map comprises a plurality of state nodes and directed edges connecting the state nodes, each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature; in response to a start instruction of an interaction session, activating a first state node according to a current state pointer, and rendering the visual resource associated with the first state node at a user terminal; collecting a multi-modal input signal from the user during playback of the first state node, and converting the multi-modal input signal into a text sequence; performing hybrid intent recognition on the text sequence, calculating the degree of match against the standard intent features associated with each directed edge emanating from the current state node, and determining a target intent and a target directed edge; and updating the current state pointer based on a second state node pointed to by the target directed edge, triggering state transition logic, and switching to and rendering the visual resource associated with the second state node at the user terminal.
  2. The multi-modal scene interaction control method based on a pre-generated resource map according to claim 1, wherein the pre-generated visual resources are constructed by the following process: receiving structured scenario data comprising character settings, dialogue scripts and action descriptions; parsing the structured scenario data and, for each script fragment, calling a generative artificial intelligence interface to generate corresponding digital-human action video or three-dimensional animation data; storing the generated video or animation data on a content distribution network and generating a unique resource uniform resource locator; and establishing a mapping relationship between the resource uniform resource locator and the corresponding state node in the scene interaction map.
  3. The multi-modal scene interaction control method based on a pre-generated resource map according to claim 1, further comprising: monitoring, in real time, the user's intent recognition match scores and response durations over N consecutive state nodes; and if the average match score is lower than a first threshold or the average response duration is higher than a second threshold, automatically adjusting the interaction parameters of subsequent state nodes, wherein adjusting the interaction parameters comprises lowering the matching threshold for subsequent intent recognition or switching to a simplified scene sub-map containing more auxiliary prompt information.
  4. The method of claim 1, wherein the pre-constructed scene interaction map further comprises a global variable set, and triggering the state transition logic comprises: parsing an attribute change instruction preset on the target directed edge; updating values in the global variable set according to the attribute change instruction; judging whether the updated global variable values satisfy the admission condition of the second state node; and if not, redirecting to a preset default branch node or an error prompt node (an illustrative sketch of this variable update and admission check follows the claims).
  5. The multi-modal scene interaction control method based on a pre-generated resource map according to claim 1, wherein the target intent is determined by: converting the text sequence into a user input vector using a pre-trained sentence embedding model; obtaining reference corpus vectors for all standard intents associated with the current state node; calculating the cosine similarity between the user input vector and each reference corpus vector to obtain an initial similarity score; applying a weighted correction to the initial similarity score based on a preset keyword library to obtain a final matching score; and selecting, as the target intent, the standard intent whose final matching score is the highest and exceeds a preset threshold (an illustrative sketch of this matching flow follows the claims).
  6. The multi-modal scene interaction control method based on a pre-generated resource map according to claim 5, wherein applying the weighted correction to the initial similarity score based on the preset keyword library specifically comprises: performing word segmentation and lemmatization on the text sequence and extracting the user's content-word set; looking up the strong-association keyword set preset for the current standard intent; judging whether the user's content-word set contains elements of the strong-association keyword set; and if it does not, keeping the initial similarity score unchanged or applying a penalty attenuation.
  7. The multi-modal scene interaction control method based on a pre-generated resource map according to claim 6, wherein performing the lemmatization on the text sequence comprises: performing part-of-speech tagging on the text sequence using a natural language processing tool and identifying inflected verb forms and plural noun forms; and restoring inflected verbs to their base form and plural nouns to their singular form, to generate a normalized morphological sequence for keyword matching.
  8. The multi-modal scene interaction control method based on a pre-generated resource map according to claim 1, wherein before collecting the multi-modal input signals of the user during playback of the first state node, the method further comprises: determining the playback progress of the visual resource of the current first state node; and triggering a timeout handling mechanism when playback has finished and no user input has been detected within a preset time threshold, wherein the timeout handling mechanism comprises prompting timeout information, or automatically playing a prompt audio resource associated with the current state node, or automatically performing a state transition along a preset default silent path.
  9. A multi-modal scene interaction control device based on a pre-generated resource map, comprising: a loading module, configured to load a pre-constructed scene interaction map, wherein the scene interaction map comprises a plurality of state nodes and directed edges connecting the state nodes, each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature; a response module, configured to, in response to a start instruction of an interaction session, activate a first state node according to a current state pointer and render the visual resource associated with the first state node at a user terminal; an acquisition module, configured to collect multi-modal input signals of the user during playback of the first state node and convert them into a text sequence; a first processing module, configured to perform hybrid intent recognition on the text sequence, calculate the degree of match against the standard intent features associated with each directed edge emanating from the current state node, and determine a target intent and a target directed edge; and a second processing module, configured to update the current state pointer based on a second state node pointed to by the target directed edge, trigger state transition logic, and switch to and render the visual resource associated with the second state node at the user terminal.
  10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the multi-modal scene interaction control method based on a pre-generated resource map according to any one of claims 1 to 8.
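To make the hybrid intent recognition of claims 5 and 6 and the state transition of claim 1 concrete, here is a minimal sketch that reuses the illustrative data classes shown after the abstract. The sentence-transformers model, the token-level keyword check, and the bonus/penalty/threshold values are assumptions standing in for the unnamed pre-trained sentence embedding model, the segmentation/lemmatization step and the preset thresholds; this is not the patent's implementation.

```python
from __future__ import annotations
import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in for the pre-trained sentence embedding model

_model = SentenceTransformer("all-MiniLM-L6-v2")

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_intent(user_text: str, node: "StateNode",
                 threshold: float = 0.60,
                 keyword_bonus: float = 0.15,
                 penalty: float = 0.10) -> "DirectedEdge | None":
    """Hybrid intent recognition: embedding similarity corrected by keyword hits (claims 5-6)."""
    user_vec = _model.encode(user_text)
    user_tokens = set(user_text.lower().split())             # crude stand-in for segmentation + lemmatization
    best_edge, best_score = None, -1.0
    for edge in node.out_edges:                               # only edges leaving the current node are considered
        ref_vecs = _model.encode(edge.reference_utterances)
        base = max(_cosine(user_vec, rv) for rv in ref_vecs)  # initial similarity score
        if edge.keywords & user_tokens:
            score = min(1.0, base + keyword_bonus)            # weighted correction when a strong keyword is present
        else:
            score = base - penalty                            # otherwise keep or attenuate (claim 6)
        if score > best_score:
            best_edge, best_score = edge, score
    return best_edge if best_score >= threshold else None     # must also exceed the preset threshold

def step(scene_map: "SceneInteractionMap", current_id: str, user_text: str) -> str:
    """Advance the current state pointer along the matched directed edge (claim 1)."""
    edge = match_intent(user_text, scene_map.nodes[current_id])
    if edge is None:
        return current_id                                     # no confident match: stay put (or replay a prompt)
    # render(scene_map.nodes[edge.target_node_id].visual_resource_id)  # switch the pre-generated video
    return edge.target_node_id
```

Because every candidate edge comes from the current node, a recognized intent can never steer the session off the authored graph, which is the controllability property the abstract attributes to the directed edges.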
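Claim 4 layers a global variable set and admission conditions onto the same transition logic. The sketch below shows one way this might look, with attribute change instructions modelled as key/delta pairs and admission conditions as minimum required values; these representations are assumptions for illustration, not specified by the patent.

```python
from __future__ import annotations

def apply_transition(global_vars: dict[str, float],
                     attribute_changes: dict[str, float],
                     admission_condition: dict[str, float],
                     target_node_id: str,
                     default_branch_id: str) -> str:
    """Update global variables, then admit the target node or fall back (claim 4)."""
    for key, delta in attribute_changes.items():            # apply the edge's attribute change instruction
        global_vars[key] = global_vars.get(key, 0.0) + delta
    admitted = all(global_vars.get(k, 0.0) >= v             # admission condition of the second state node
                   for k, v in admission_condition.items())
    return target_node_id if admitted else default_branch_id  # redirect to the default branch if not met

# Example: a "trust" variable must reach 2.0 before the debrief node is admitted.
state = {"trust": 1.5}
next_node = apply_transition(state, {"trust": 0.4}, {"trust": 2.0},
                             "node_debrief", "node_error_prompt")
# next_node == "node_error_prompt" because trust is only 1.9 after the update
```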

Description

Multi-mode scene interaction control method and device based on pre-generated resource map

Technical Field

The invention relates to the technical field of computer interaction, and in particular to a multi-modal scene interaction control method and device based on a pre-generated resource map.

Background

With the rapid development of artificial intelligence and multimedia technology, scenario-based interactive training has been widely used in fields such as language learning, professional skill training and psychological counselling simulation. Conventional interactive training systems rely mainly on preset static options (such as ABCD multiple-choice questions) or simple keyword matching. When facing a virtual character on the screen, the user can only choose among limited options; this rote, mechanical interaction mode cannot simulate the openness and uncertainty of real dialogue, so the training effect is seriously distorted and the user finds it difficult to feel immersed. In recent years, with the rise of large language models (LLM) and generative artificial intelligence (AIGC), interaction schemes based on fully real-time generation have emerged. Such schemes allow the user to input freely, generate reply text in real time with an LLM, and give real-time feedback through TTS (speech synthesis). However, this technical route has significant drawbacks in practical deployment. Generation that relies purely on large models often lacks constraints. In professional skill training such as air traffic control phraseology or medical emergency procedures, specific standard operating procedures must be strictly followed. LLMs are prone to hallucination, generating instructional information that deviates from the teaching goals or is simply wrong, leading to training failure. The existing technical schemes therefore cannot resolve the contradiction between high-fidelity immersion and strict compliance with teaching/training logic. The industry needs a new technical solution that can exploit the efficiency and immersion brought by generative artificial intelligence while guaranteeing the controllability of the logic.

Disclosure of the Invention

The invention provides a multi-modal scene interaction control method and device based on a pre-generated resource map, which are used to overcome the defect in the prior art that training logic is difficult to control when artificial intelligence technology is adopted, and to achieve the effect of maintaining high-fidelity immersion while strictly following teaching/training logic.
The invention provides a multi-modal scene interaction control method based on a pre-generated resource map, comprising the following steps: loading a pre-constructed scene interaction map, wherein the scene interaction map comprises a plurality of state nodes and directed edges connecting the state nodes, each state node is associated with a pre-generated visual resource identifier, and each directed edge is associated with a preset standard intent feature; in response to a start instruction of an interaction session, activating a first state node according to a current state pointer, and rendering the visual resource associated with the first state node at a user terminal; collecting a multi-modal input signal from the user during playback of the first state node, and converting the multi-modal input signal into a text sequence; performing hybrid intent recognition on the text sequence, calculating the degree of match against the standard intent features associated with each directed edge emanating from the current state node, and determining a target intent and a target directed edge; and updating the current state pointer based on a second state node pointed to by the target directed edge, triggering state transition logic, and switching to and rendering the visual resource associated with the second state node at the user terminal.

According to the multi-modal scene interaction control method based on a pre-generated resource map, the pre-generated visual resources are constructed through the following process: receiving structured scenario data comprising character settings, dialogue scripts and action descriptions; parsing the structured scenario data and, for each script fragment, calling a generative artificial intelligence interface to generate corresponding digital-human action video or three-dimensional animation data; storing the generated video or animation data on a content distribution network and generating a unique resource uniform resource locator; and establishing a mapping relationship between the resource uniform resource locator and the corresponding state node in the scene interaction map.

According to the multi-mode scene interaction control method based on the pre-generated resourc
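The resource pre-generation process summarized above (and in claim 2) might be organized roughly as follows; `generate_video` and `upload_to_cdn` are hypothetical placeholders for the generative AI interface and the content-distribution-network storage, neither of which the patent names, and the `ScriptFragment` fields are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ScriptFragment:
    """One fragment of the structured scenario data (illustrative fields)."""
    node_id: str      # state node the resulting clip will be bound to
    character: str    # character setting
    dialogue: str     # dialogue script
    action: str       # action description

def generate_video(fragment: ScriptFragment) -> bytes:
    """Placeholder for the generative-AI interface that renders a digital-human clip."""
    raise NotImplementedError

def upload_to_cdn(data: bytes, name: str) -> str:
    """Placeholder that stores the clip on a CDN and returns its unique resource URL."""
    raise NotImplementedError

def build_resources(fragments: list[ScriptFragment]) -> dict[str, str]:
    """Pre-generate every visual resource offline and map each state node id to its resource URL."""
    node_to_url: dict[str, str] = {}
    for frag in fragments:
        clip = generate_video(frag)                       # happens before any interaction session starts
        url = upload_to_cdn(clip, f"{frag.node_id}.mp4")
        node_to_url[frag.node_id] = url                   # this mapping is attached to the scene interaction map
    return node_to_url
```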