CN-122021873-A - Agent dialogue method, system and related device based on multi-round reinforcement learning

Abstract

The application discloses an agent dialogue method, system and related device based on multi-round reinforcement learning. The method comprises: obtaining a dialogue action path of a reference agent and a plurality of dialogue sets between the reference agent and a target agent under the dialogue action path, wherein the reference agent is configured with content to be recalled; determining a whole-round reward of each word element in the target content in the dialogue sets based on the multiple rounds of target content in the dialogue sets and the recall rate of the multiple rounds of reference content compared with the content to be recalled; determining a single-round reward of each word element in the target content based on the target content and the dialogue content preceding it in the dialogue sets; determining a target reward of each word element based on its whole-round reward and single-round reward; and adjusting the target agent by using the target rewards. With this scheme, the accuracy of agent dialogue can be improved.

Inventors

PENG QIYU
LIU CONG
HU GUOPING
DENG CHENGHAO
DU QIANYUN
HU JIAXUE
ZHAO JINGHE
HE ZHIYANG
LU XIAOLIANG
WEI SI
WANG SHIJIN

Assignees

讯飞医疗科技股份有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2025-12-22

Claims (11)

1. An agent dialogue method based on multi-round reinforcement learning, characterized by comprising the following steps: obtaining a dialogue action path of a reference agent and a plurality of dialogue sets between the reference agent and a target agent under the dialogue action path, wherein the reference agent is configured with content to be recalled, and each dialogue set comprises multiple rounds of target content of the target agent and reference content of the reference agent; determining a whole-round reward of each word element in the target content in a dialogue set based on the multiple rounds of target content in the dialogue set and the recall rate of the multiple rounds of reference content compared with the content to be recalled, and determining a single-round reward of each word element in the target content based on the target content and the dialogue content preceding the target content in the dialogue set; and determining a target reward of each word element based on the whole-round reward and the single-round reward of that word element, and adjusting the target agent by using the target rewards, wherein the adjusted target agent is used for dialogue with the reference agent or with a target object.
2. The multi-round reinforcement learning-based agent dialogue method according to claim 1, wherein the dialogue action path of the reference agent includes a plurality of types, and the dialogue action path is obtained by the following steps: acquiring a plurality of dialogue action labels, and setting one dialogue action label for each of at least some rounds in the dialogue flow of the reference agent; and taking dialogue flows whose dialogue action labels differ from each other in at least some rounds as respective types of dialogue action path.
3. The multi-round reinforcement learning-based agent dialogue method according to claim 2, wherein at least some of the dialogue action labels include corresponding sub-labels for adjusting the output proportion of the content to be recalled in a single round; and taking dialogue flows whose dialogue action labels differ from each other in at least some rounds as respective types of dialogue action path comprises: determining a round comparison result for each round based on the dialogue action label of that round in the dialogue flow and the sub-label it includes, and taking dialogue flows whose round comparison results differ in at least some rounds as respective types of dialogue action path.
4. The multi-round reinforcement learning-based agent dialogue method according to claim 1, wherein determining the whole-round reward of each word element in the target content in the dialogue set based on the multiple rounds of target content in the dialogue set and the recall rate of the multiple rounds of reference content compared with the content to be recalled comprises: determining a dialogue reward of each dialogue set under the dialogue action path based on the total number of dialogue rounds in that dialogue set, the content redundancy rate of the multiple rounds of target content, and the recall rate of the multiple rounds of reference content compared with the content to be recalled; determining, based on the dialogue rewards of the dialogue sets under the dialogue action path, a dialogue comparison reward corresponding to all the dialogue sets under the dialogue action path; and determining the whole-round reward of each word element in the target content in the dialogue set based on the dialogue reward of the dialogue set and the dialogue comparison reward.
5. The multi-round reinforcement learning-based agent dialogue method according to claim 1, wherein determining the single-round reward of each word element in the target content based on the target content and the dialogue content preceding it in the dialogue set comprises: obtaining a current word element in the target content, and estimating the reward of the current word element based on the current word element and the dialogue content preceding it, to obtain a word element comparison reward of the current word element, wherein the current word element is obtained recursively starting from the end of the target content; performing content evaluation and fluency evaluation on the current word element based on the current word element and the dialogue content preceding it to obtain a current reward of the current word element, obtaining an expected reward matched with the current word element, and generating a word element reward of the current word element by using the current reward and the expected reward, wherein the expected reward is related to the word elements already processed in the recursion; and determining the single-round reward of the current word element based on its word element reward and word element comparison reward, until the recursion has covered all word elements of the target content, thereby obtaining the single-round reward of each word element in the target content.
6. The multi-round reinforcement learning-based agent dialogue method according to claim 5, wherein estimating the reward of the current word element based on the current word element and the dialogue content preceding it to obtain the word element comparison reward of the current word element comprises: inputting the current word element and the dialogue content preceding it into a value model to obtain the word element comparison reward of the current word element, wherein the expected reward of the current word element is determined based on the word element comparison reward of an already-processed word element and the interval between that word element and the current word element, and the value model is adjusted together with the target agent.
7. The multi-round reinforcement learning-based agent dialogue method according to claim 6, wherein the value model is adjusted together with the target agent after a preset training round is reached; before the preset training round, the word element comparison reward of each word element in the target content is obtained based on the target content and its matched reference content, the reference content being generated based on the target content and the dialogue content preceding it.
8. The multi-round reinforcement learning-based agent dialogue method according to any one of claims 1-7, wherein when the target agent is one of a task agent and an interactive agent, the reference agent is the other of the task agent and the interactive agent; when the target agent is the interactive agent and the reference agent is the task agent, the adjusted interactive agent is used for collecting the dialogue sets with the task agent; and when the target agent is the task agent and the reference agent is the interactive agent, the adjusted task agent is used for collecting the dialogue sets with the interactive agent or with the target object.
9. An agent dialogue system based on multi-round reinforcement learning, comprising: an acquisition module, used for acquiring a dialogue action path of a reference agent and a plurality of dialogue sets between the reference agent and a target agent under the dialogue action path, wherein the reference agent is configured with content to be recalled, and each dialogue set comprises multiple rounds of target content of the target agent and reference content of the reference agent; a confirmation module, used for determining a whole-round reward of each word element in the target content in a dialogue set based on the multiple rounds of target content in the dialogue set and the recall rate of the multiple rounds of reference content compared with the content to be recalled, and determining a single-round reward of each word element in the target content based on the target content and the dialogue content preceding the target content in the dialogue set; and an adjustment module, used for determining a target reward of each word element based on the whole-round reward and the single-round reward of that word element and adjusting the target agent by using the target rewards, wherein the adjusted target agent is used for dialogue with the reference agent or with a target object.
10. An electronic device comprising a memory and a processor coupled to each other, wherein the memory stores program data and the processor invokes the program data to perform the method of any of claims 1-8.
11. A computer readable storage medium having stored thereon program data, which when executed by a processor, implements the method of any of claims 1-8.
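
The backward recursion over word elements described in claims 5 and 6 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not specified by the claims: the function names, the discount factor `gamma`, the additive form "current reward plus discounted expected reward", the subtraction of the comparison reward, and the fixed interval of 1 between adjacent word elements. The `current_reward` callable stands in for the content and fluency evaluation, and `value_model` stands in for the value model.

```python
from typing import Callable, List, Sequence


def single_round_rewards(
    context: Sequence[str],
    target: Sequence[str],
    current_reward: Callable[[Sequence[str], str], float],
    value_model: Callable[[Sequence[str], str], float],
    gamma: float = 0.9,  # hypothetical discount applied per interval
) -> List[float]:
    """Return one single-round reward per word element of `target`."""
    rewards = [0.0] * len(target)
    later_comparison = None  # comparison reward of the word element handled just before
    # Recurse from the end of the target content toward its start.
    for t in reversed(range(len(target))):
        preceding = list(context) + list(target[:t])
        token = target[t]
        # Word element comparison reward: the value model's estimate.
        comparison = value_model(preceding, token)
        # Current reward: placeholder for content and fluency evaluation.
        current = current_reward(preceding, token)
        # Expected reward: discounted comparison reward of the already-processed
        # word element (assumed interval of 1 between adjacent word elements).
        interval = 1
        if later_comparison is None:
            expected = 0.0  # nothing follows the last word element
        else:
            expected = (gamma ** interval) * later_comparison
        token_reward = current + expected
        # Assumed combination: word element reward minus comparison reward.
        rewards[t] = token_reward - comparison
        later_comparison = comparison
    return rewards


# Toy usage with stand-in scorers (both are hypothetical).
keywords = {"fever"}
example = single_round_rewards(
    context=["How", "are", "you", "?"],
    target=["any", "fever"],
    current_reward=lambda prev, tok: 1.0 if tok in keywords else 0.0,
    value_model=lambda prev, tok: 0.5,
)
```

Under these assumptions the quantity computed per word element resembles a one-step temporal-difference residual; the claims themselves only state which inputs each reward is based on, so other combinations would be equally consistent with the text.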

Description

Agent dialogue method, system and related device based on multi-round reinforcement learning

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to an agent dialogue method, system and related device based on multi-round reinforcement learning.

Background

With the development of artificial intelligence, agents are increasingly widely applied, and their use in interactive dialogue has become an important branch. However, conventional agent dialogue still depends on a preset flow and cannot adapt to the dialogue partner, so it is difficult to collect the required information accurately, which lowers the accuracy of agent dialogue. In view of this, how to improve the accuracy of agent dialogue is a problem that urgently needs to be solved.

Disclosure of Invention

The technical problem mainly solved by the application is to provide an agent dialogue method, system and related device based on multi-round reinforcement learning that can improve the accuracy of agent dialogue.
In order to solve the above technical problem, a first aspect of the application provides an agent dialogue method based on multi-round reinforcement learning, comprising: obtaining a dialogue action path of a reference agent and a plurality of dialogue sets between the reference agent and a target agent under the dialogue action path, wherein the reference agent is configured with content to be recalled, and each dialogue set comprises multiple rounds of target content of the target agent and reference content of the reference agent; determining a whole-round reward of each word element in the target content in a dialogue set based on the multiple rounds of target content in the dialogue sets and the recall rate of the multiple rounds of reference content compared with the content to be recalled; determining a single-round reward of each word element in the target content based on the target content and the dialogue content preceding it in the dialogue set; and determining a target reward of each word element based on its whole-round reward and single-round reward, and adjusting the target agent by using the target rewards, wherein the adjusted target agent is used for dialogue with the reference agent or with a target object.
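
As a rough illustration of the first aspect, the following Python sketch computes a recall rate for each dialogue set, turns it into a whole-round reward by comparing it with the other dialogue sets under the same dialogue action path, and combines that with single-round rewards into a target reward per word element. The substring-based recall check, the use of the mean as the comparison baseline, and the weighted sum with weights `w_whole` and `w_single` are all assumptions chosen for illustration; the text above does not fix any of them, and the sketch omits the round-count and redundancy terms mentioned in claim 4.

```python
from statistics import mean
from typing import List, Sequence, Set


def recall_rate(to_recall: Set[str], reference_turns: Sequence[str]) -> float:
    """Fraction of to-be-recalled items found in the reference agent's turns."""
    if not to_recall:
        return 0.0
    spoken = " ".join(reference_turns).lower()
    hits = sum(1 for item in to_recall if item.lower() in spoken)
    return hits / len(to_recall)


def whole_round_rewards(dialogue_rewards: Sequence[float]) -> List[float]:
    """Compare each dialogue set with the mean over the same path (assumed baseline)."""
    baseline = mean(dialogue_rewards)
    return [r - baseline for r in dialogue_rewards]


def target_rewards(
    whole: float,
    single: Sequence[float],
    w_whole: float = 1.0,   # hypothetical weight
    w_single: float = 1.0,  # hypothetical weight
) -> List[float]:
    """Assumed combination: weighted sum, with the whole-round reward shared by all word elements."""
    return [w_whole * whole + w_single * s for s in single]


# Toy usage: two dialogue sets under one dialogue action path.
to_recall = {"fever", "cough"}
dialogues = [
    ["I have a fever.", "Yes, also a cough."],
    ["I have a fever."],
]
recalls = [recall_rate(to_recall, turns) for turns in dialogues]
whole = whole_round_rewards(recalls)
per_token = target_rewards(whole[0], [0.5, -0.05])
```

Here every word element in a dialogue set receives the same whole-round reward, while the single-round reward differs per word element, which is one plausible reading of how the two signals are meant to complement each other.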
In order to solve the above technical problem, a second aspect of the application provides an agent dialogue system based on multi-round reinforcement learning, comprising an acquisition module, a confirmation module and an adjustment module. The acquisition module is used for acquiring a dialogue action path of a reference agent and a plurality of dialogue sets between the reference agent and a target agent under the dialogue action path, wherein the reference agent is configured with content to be recalled, and each dialogue set comprises multiple rounds of target content of the target agent and reference content of the reference agent. The confirmation module is used for determining a whole-round reward of each word element in the target content in a dialogue set based on the multiple rounds of target content in the dialogue set and the recall rate of the multiple rounds of reference content compared with the content to be recalled, and for determining a single-round reward of each word element in the target content based on the target content and the dialogue content preceding it in the dialogue set. The adjustment module is used for determining a target reward of each word element based on its whole-round reward and single-round reward, and for adjusting the target agent by using the target rewards, wherein the adjusted target agent is used for dialogue with the reference agent or with a target object.
In order to solve the above technical problem, a third aspect of the application provides an electronic device comprising a memory and a processor coupled to each other, wherein the memory stores program data and the processor calls the program data to execute the method of the first aspect.
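
The division of work among the acquisition, confirmation and adjustment modules of the second aspect can be pictured with a small Python skeleton. The class names, field names and callable signatures below are hypothetical and exist only to show the data flow; the stub functions in the usage part are placeholders, not the reward computations of the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DialogueSet:
    target_tokens: List[str]    # word elements of the target agent's content
    reference_turns: List[str]  # content produced by the reference agent


# (whole-round reward, single-round reward per word element)
RewardPair = Tuple[float, List[float]]


@dataclass
class AgentDialogueSystem:
    # Acquisition module: yields dialogue sets under a dialogue action path.
    acquisition: Callable[[], List[DialogueSet]]
    # Confirmation module: yields whole-round and single-round rewards.
    confirmation: Callable[[List[DialogueSet]], List[RewardPair]]
    # Adjustment module: derives target rewards and adjusts the target agent.
    adjustment: Callable[[List[DialogueSet], List[RewardPair]], List[List[float]]]

    def run_once(self) -> List[List[float]]:
        """Run one acquire, confirm, adjust cycle and return the target rewards."""
        sets = self.acquisition()
        pairs = self.confirmation(sets)
        return self.adjustment(sets, pairs)


# Toy usage with placeholder modules.
system = AgentDialogueSystem(
    acquisition=lambda: [DialogueSet(["any", "fever"], ["I have a fever."])],
    confirmation=lambda sets: [(0.25, [0.0] * len(s.target_tokens)) for s in sets],
    adjustment=lambda sets, pairs: [[w + s for s in single] for w, single in pairs],
)
result = system.run_once()
```

Keeping the three modules as separate callables mirrors the claim structure, so each one could be replaced independently, for example by swapping the placeholder adjustment for an actual policy update.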
In order to solve the above technical problem, a fourth aspect of the application provides a computer-readable storage medium having stored thereon program data which, when executed by a processor, implements the method of the first aspect.
The beneficial effects of the application are as follows: unlike the prior art, the application acquires the dialogue action path of the reference agent and a plurality of dialogue sets produced when the reference agent and the target agent converse under the corresponding dialogue action path, so that different types of dialogue modes are constrained through the dialogue action path, and the target agent can learn and adapt to di