
CN-121981073-A - Multi-mode agent collaboration system and method

CN121981073A

Abstract

The invention discloses a multi-modal intelligent agent collaboration system and method, belonging to the technical field of artificial intelligence. The method comprises the following steps: converting multi-modal input information provided by a user into a unified coding format through an encoding method and outputting standardized data; configuring a parsing instruction for a multi-modal large model based on the format requirements of the application programming interface parameters of the target business system, and outputting a text instruction and a structured data set; driving the multi-modal large model to perform information fusion and logic operations through tool calls and format conventions, and outputting a formatted operation instruction; and parsing and executing the formatted operation instruction by calling the application programming interface of the service execution platform, and outputting an execution result. Through multi-modal agent collaboration and structured output, the invention realizes end-to-end automatic conversion from complex natural interaction to accurate system operation, and remarkably improves the reliability, accuracy and efficiency of task execution.
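The four-step pipeline summarized above can be sketched as follows. This is a minimal illustrative sketch in Python: the function names, the base64 unified encoding, and the stock-order field names (`symbol`, `quantity`) are assumptions for illustration only; the patent does not specify a concrete implementation, and the large-model and business-system calls are mocked as plain functions.

```python
import base64
import json

def encode_inputs(audio_bytes: bytes, image_bytes: bytes) -> dict:
    """Step 1: convert multimodal input into a unified (here: base64) encoding."""
    return {
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }

def configure_parse_instruction(api_schema: dict) -> str:
    """Step 2: build a parsing instruction from the target API's parameter schema."""
    fields = ", ".join(f"{k} ({v})" for k, v in api_schema.items())
    return f"Extract the following fields and return JSON: {fields}"

def fuse_and_format(text_instruction: str, structured_data: dict,
                    api_schema: dict) -> dict:
    """Step 3: fuse the text instruction and structured data into an operation
    instruction whose keys match the target API's parameter format (stands in
    for the constrained-output call to a multimodal large model)."""
    op = {k: structured_data.get(k) for k in api_schema}
    op["action"] = text_instruction.split()[0].lower()
    return op

def execute(op: dict) -> dict:
    """Step 4: pass the formatted instruction to the business system's API
    (mocked here) and return the execution result."""
    return {"status": "ok", "echo": op}

# Minimal end-to-end run with dummy input bytes.
encoded = encode_inputs(b"\x00\x01", b"\xff\xd8")
schema = {"symbol": "string", "quantity": "integer"}
prompt = configure_parse_instruction(schema)
op = fuse_and_format("Buy 100 shares", {"symbol": "AAPL", "quantity": 100}, schema)
result = execute(op)
print(json.dumps(result))
```

In a real deployment, `fuse_and_format` would be replaced by a model call whose response format is constrained to the API parameter schema, and `execute` by the business system's actual interface.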

Inventors

  • LU YI
  • XU FAN
  • SHI TONG

Assignees

  • Nanjing Audit University (南京审计大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-19

Claims (10)

  1. A multi-modal intelligent agent collaboration method, characterized by comprising the steps of: converting a voice instruction provided by a user and an associated image into a unified coding format through an encoding method, and outputting standardized data that can be parsed by a multi-modal large model; configuring a parsing instruction for the multi-modal large model based on the format requirements of the application programming interface parameters of the target business system, and outputting a text instruction and a structured instruction set matched with the application programming interface parameter format; based on the text instruction and the structured instruction set, driving the multi-modal large model to perform information fusion and logic operations through tool calls and format conventions, and outputting a formatted operation instruction directly compatible with the application programming interface of the target business system; and parsing and executing the formatted operation instruction by calling the application programming interface of the target business system, and outputting a result.
  2. The method of claim 1, wherein driving the multi-modal large model to perform information fusion and logic operations through tool calls and format conventions based on the text instruction and the structured instruction set, and outputting the formatted operation instruction directly compatible with the application programming interface of the target business system, comprises: inputting the text instruction into the multi-modal large model configured with an image analysis instruction; providing callable tools to the multi-modal large model through a model context protocol; setting response format parameters for the multi-modal large model to restrict its output data structure; and having the multi-modal large model integrate multi-source information and call the tools to generate the formatted operation instruction matched with the application programming interface parameter format.
  3. The multi-modal agent collaboration method of claim 2, wherein the model context protocol supports dynamic tool extension and call prioritization, and adaptive tools are loaded autonomously according to instruction complexity.
  4. The multi-modal agent collaboration method of claim 1, wherein parsing and executing the formatted operation instruction by invoking an application programming interface of the target business system and outputting a result comprises: invoking the application programming interface of the target business system, passing the formatted operation instruction as an invocation parameter to the application programming interface, executing the formatted operation instruction via the application programming interface to complete the operation, and returning the execution result.
  5. A multi-modal intelligent agent collaboration system, characterized by being applied to the multi-modal intelligent agent collaboration method of any one of claims 1 to 4, and comprising an input information processing agent, an instruction generation agent and an execution agent that are connected in sequence, wherein the input information processing agent comprises an encoding component and a voice-to-text component, the instruction generation agent comprises a picture information extraction component, a semantic understanding component and an instruction generation component, and the execution agent comprises a platform interface calling component.
  6. The multi-modal agent collaboration system of claim 5, wherein the voice-to-text component is configured with a first prompt word that directs the multi-modal large model to convert encoded audio data into a text instruction, the voice-to-text component invoking the multi-modal large model, reading the encoded audio data and outputting the text instruction.
  7. The multi-modal agent collaboration system of claim 5, wherein the picture information extraction component is configured with a second prompt word that directs the multi-modal large model to extract stock information from the encoded image data, the picture information extraction component invoking the multi-modal large model to output the structured instruction set in a predetermined format.
  8. The multi-modal intelligent agent collaboration system of claim 5, wherein the instruction generation component further comprises a cross-source information verification mechanism of the multi-modal large model for performing consistency verification between the text instruction obtained by voice-to-text conversion and the structured information extracted from pictures; when a key information conflict is detected, the multi-modal large model is automatically triggered to re-analyze the conflicting source data, and a uniquely compliant formatted operation instruction is output in combination with preset priority rules.
  9. The multi-modal agent collaboration system of claim 5, wherein the model context protocol supports dynamic tool extension and call prioritization, wherein adaptive tools can be loaded autonomously according to instruction complexity, and tool call prioritization is ordered as compute-class tools, query-class tools and check-class tools, ensuring the efficiency and accuracy of complex instruction processing, wherein compute-class tools have the first priority, query-class tools the second priority and check-class tools the third priority.
  10. The multi-modal agent collaboration system of claim 5, wherein the multi-modal agent collaboration system comprises one or more agents.
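The tool prioritization described in claims 3 and 9 (compute-class tools first, then query-class, then check-class, with adaptive loading by instruction complexity) can be sketched as follows. The registry structure, tool names and the complexity threshold are hypothetical, since the claims do not fix a concrete representation.

```python
# Priority ranks from claim 9: compute = first, query = second, check = third.
TOOL_PRIORITY = {"compute": 1, "query": 2, "check": 3}

def order_tool_calls(requested_tools):
    """Return tool calls sorted by class priority (lower rank first)."""
    return sorted(requested_tools, key=lambda t: TOOL_PRIORITY[t["class"]])

def load_adaptive_tools(instruction_complexity, registry):
    """Load tools autonomously according to instruction complexity:
    simple instructions load only check-class tools, complex ones load all.
    The 0.5 threshold is an illustrative assumption."""
    if instruction_complexity < 0.5:
        return [t for t in registry if t["class"] == "check"]
    return list(registry)

# Hypothetical tool registry for a stock-trading scenario.
registry = [
    {"name": "validate_balance", "class": "check"},
    {"name": "fetch_quote", "class": "query"},
    {"name": "price_calculator", "class": "compute"},
]

ordered = order_tool_calls(load_adaptive_tools(0.9, registry))
print([t["name"] for t in ordered])  # compute, then query, then check class
```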

Description

Multi-mode agent collaboration system and method

Technical Field

The present invention relates to the field of artificial intelligence, and more particularly, to a transaction agent system and method based on a multi-modal large model.

Background

With the breakthrough of large language models and multi-modal understanding capability, agent-based automated task execution systems have become a research hotspot in the field of artificial intelligence. Agents are typically given the ability to perceive, plan, use tools and execute, aiming to autonomously invoke resources and cooperatively complete complex tasks according to a given goal. This architecture provides a new technological paradigm for building intelligent interfaces that understand natural intent and operate digital systems. However, existing agent systems still face significant technical challenges in actual scenarios that demand high reliability and high accuracy from complex and varied input information. Firstly, it is often difficult for a single model or agent to accurately process multi-modal information such as voice, image and text simultaneously, and to perform strict cross-modal logic fusion and reasoning. Secondly, the conversion chain from free-form, multi-modal natural interactive input to strictly formatted instructions that can be directly executed by a system is long, involves many links, accumulates errors easily, and preserves user intent poorly. Furthermore, existing architectures generally lack efficient detection and robustness-checking mechanisms for multi-source information conflicts, which constitutes a key drawback in high-risk or high-value decision scenarios. These technical challenges are particularly prominent in typical application scenarios such as the generation of financial transaction instructions.
Prior art exploration has concentrated on a single modality or depends on preset simple instruction templates, without fully utilizing the comprehensive situation understanding and reasoning capability of multi-modal large models. More importantly, there is currently a lack of an elaborately designed, end-to-end technical framework in which multiple agents interoperate. Such a framework needs to integrate core capabilities such as self-verification and dynamic tool invocation to truly enable reliable, automated conversion from a user's complex, fuzzy, even conflicting natural interaction into accurate, compliant and safely executable system operations.

Disclosure of Invention

The invention aims to provide a multi-modal intelligent agent collaboration system and method for solving the technical problem of reliable and automatic conversion from complex multi-modal natural interaction to accurate structured system instructions.
The technical scheme of the invention provides a multi-modal intelligent agent collaboration method comprising the following steps: converting a voice instruction and an associated image provided by a user into a unified coding format through an encoding method, and outputting standardized data that can be parsed by a multi-modal large model; configuring a parsing instruction for the multi-modal large model based on the format requirements of the application programming interface parameters of the target business system, and outputting a text instruction and a structured instruction set matched with the application programming interface parameter format; based on the text instruction and the structured instruction set, driving the multi-modal large model to perform information fusion and logic operations through tool calls and format conventions, and outputting a formatted operation instruction directly compatible with the application programming interface of the target business system; and parsing and executing the formatted operation instruction by calling the application programming interface of the target business system, and outputting a result. Driving the multi-modal large model comprises: inputting the text instruction into the multi-modal large model configured with an image analysis instruction, providing callable tools to the multi-modal large model through a model context protocol, setting response format parameters for the multi-modal large model to restrict its output data structure, and having the multi-modal large model integrate multi-source information and call the tools to generate the formatted operation instruction matched with the application programming interface parameter format. Optionally, the model context protocol supports dynamic tool extension and call prioritization, and adaptive tools may be loaded autonomously according to instruction complexity.
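The step of setting response format parameters to restrict the model's output data structure can be illustrated with a minimal validation sketch: the model's reply is required to be JSON matching the target interface's parameter schema and is checked before being forwarded. The field names and schema shape are assumptions for illustration, and the model reply is simulated as a literal string; the patent does not specify a concrete schema.

```python
import json

# Hypothetical parameter schema of the target business system's interface.
API_PARAM_SCHEMA = {
    "symbol": str,
    "quantity": int,
    "side": str,
}

def validate_operation(raw_reply: str) -> dict:
    """Parse the model's reply and check it against the interface parameter
    format; reject structurally incompatible output instead of forwarding it."""
    op = json.loads(raw_reply)
    for field, ftype in API_PARAM_SCHEMA.items():
        if field not in op:
            raise ValueError(f"missing field: {field}")
        if not isinstance(op[field], ftype):
            raise ValueError(f"bad type for field: {field}")
    return op

# Simulated model reply that conforms to the required structure.
reply = '{"symbol": "AAPL", "quantity": 100, "side": "buy"}'
op = validate_operation(reply)
print(op["side"])
```

In practice the same schema would also be passed to the model as a response format parameter, so that conforming output is produced in the first place and validation only acts as a safety net.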
Optionally, the formatting operation instruction is analyzed and executed by calling an application programming interface of the target service system, and a result is output, wherein the method comprises the steps of calli