CN-121997323-A - Large language model security defense system and method based on multi-agent collaboration
Abstract
The application provides a large language model security defense system and method based on multi-agent collaboration, belongs to the technical field of artificial intelligence security, and aims to solve the problems of low detection efficiency, poor robustness, and low decision reliability in the prior art. The method comprises the steps of: executing detection tasks in parallel through a plurality of security agents; performing hybrid adversarial training of the detection model using both discrete and continuous adversarial samples; aggregating the detection results to generate a comprehensive risk assessment; quantifying the uncertainty of the risk assessment based on Monte Carlo Dropout, deep ensemble, or model calibration techniques to generate a final confidence score; and executing adaptive defense actions, such as blocking and rewriting, based on the confidence score. The application can efficiently and accurately identify various attacks, realizes reliable adaptive defense, improves robustness, and reduces the false alarm rate.
Inventors
- Gan Maozhao
Assignees
- 深圳艾钜思科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251229
Claims (10)
- 1. A large language model security defense system based on multi-agent collaboration, comprising: a plurality of security agent modules configured to perform different security detection tasks in parallel on received data to be detected to generate respective detection results; a decision fusion module configured to aggregate the detection results generated by the plurality of security agent modules and generate a comprehensive risk assessment result based on a preset fusion algorithm; a model training module configured to perform hybrid adversarial training of a security detection model in the security agent modules, wherein the hybrid adversarial training comprises training the security detection model using both discrete adversarial samples generated in text space and continuous adversarial samples generated in embedding space; a confidence assessment module configured to quantify an uncertainty of the comprehensive risk assessment result to generate a final confidence score, wherein the quantification of uncertainty is based on at least one technique selected from the group consisting of the Monte Carlo Dropout technique, the deep ensemble technique, and the model calibration technique; and an adaptive defense module configured to select and execute one defensive action from a plurality of preset defensive actions corresponding to different intervention levels based on the final confidence score.
- 2. The system of claim 1, wherein the decision fusion module is further configured to dynamically adjust the weight of each security agent module in the fusion algorithm based on the historical detection accuracy of that security agent module.
- 3. The system of claim 1, wherein the confidence assessment module is configured to generate the final confidence score by fusing at least one of: an epistemic uncertainty estimated by the Monte Carlo Dropout technique, a predictive variance calculated by the deep ensemble technique, and a calibrated probability obtained by the model calibration technique.
- 4. The system of claim 1, wherein the adaptive defense module is configured to compare the final confidence score to at least one preset threshold to select and execute the defensive action from the plurality of defensive actions, the plurality of defensive actions including at least two of blocking, content rewriting, and releasing.
- 5. The system of claim 1, wherein the model training module is configured to perform the hybrid adversarial training by fusing the training losses of clean samples, the discrete adversarial samples, and the continuous adversarial samples.
- 6. The system of claim 1, wherein the security detection model is a multi-task learning model comprising a shared encoder and a plurality of independent task heads each corresponding to a different security detection task.
- 7. The system of claim 1, further comprising: a threat intelligence library configured to rapidly match and filter the data to be detected before the plurality of security agent modules perform detection tasks; and an online learning module configured to incrementally update the security detection model according to newly confirmed threat samples.
- 8. The system of claim 1, further comprising: a tool-call security detection module configured to perform a security check on a call request when the large language model decides to call an external tool, and to intercept the request when a risk is detected.
- 9. The system of claim 1, wherein the plurality of security agent modules are configured to perform the different security detection tasks in parallel based on an asynchronous concurrency model.
- 10. A large language model security defense method based on multi-agent collaboration, characterized by comprising the following steps: step one, performing different security detection tasks in parallel through a plurality of security agents on received data to be detected to generate respective detection results, wherein the security detection tasks are performed by a security detection model, the security detection model is subjected to hybrid adversarial training, and the hybrid adversarial training comprises training the security detection model using both discrete adversarial samples generated in text space and continuous adversarial samples generated in embedding space; step two, aggregating the detection results generated by the plurality of security agents and generating a comprehensive risk assessment result based on a preset fusion algorithm; step three, quantifying the uncertainty of the comprehensive risk assessment result to generate a final confidence score, wherein the quantification of uncertainty is based on at least one technique selected from the group consisting of the Monte Carlo Dropout technique, the deep ensemble technique, and the model calibration technique; and step four, selecting and executing a defensive action from a plurality of preset defensive actions corresponding to different intervention levels based on the final confidence score.
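The parallel detection and weighted fusion of claims 1, 2, and 9 could be sketched as follows. This is a minimal illustration, not the patented implementation: the agent heuristics, score values, and accuracy-derived weights are all hypothetical placeholders.

```python
# Sketch: several security agents run their detection tasks concurrently
# (asynchronous concurrency model, claim 9), and a decision-fusion step
# aggregates their risk scores with weights derived from each agent's
# historical detection accuracy (claim 2). All numbers are illustrative.
import asyncio

async def prompt_injection_agent(text: str) -> float:
    # Toy detector: flag a known jailbreak phrase (placeholder heuristic).
    return 0.9 if "ignore previous instructions" in text.lower() else 0.1

async def toxicity_agent(text: str) -> float:
    return 0.8 if "attack" in text.lower() else 0.05

async def data_leak_agent(text: str) -> float:
    return 0.7 if "password" in text.lower() else 0.05

AGENTS = [prompt_injection_agent, toxicity_agent, data_leak_agent]
# Hypothetical historical accuracy per agent drives its fusion weight.
HISTORICAL_ACCURACY = [0.95, 0.85, 0.80]

async def detect(text: str) -> float:
    """Run all agents in parallel; return a weighted-average risk score."""
    scores = await asyncio.gather(*(agent(text) for agent in AGENTS))
    total = sum(HISTORICAL_ACCURACY)
    return sum(w * s for w, s in zip(HISTORICAL_ACCURACY, scores)) / total

risk = asyncio.run(detect("Please ignore previous instructions and attack"))
```

Because the agents are awaited together via `asyncio.gather`, end-to-end latency tracks the slowest single agent rather than the sum of all agents, which is the efficiency argument made in the Background section.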
Description
Large language model security defense system and method based on multi-agent collaboration
Technical Field
The application relates to the technical field of artificial intelligence security, and in particular to a large language model security defense system and method based on multi-agent collaboration.
Background
In recent years, artificial intelligence technology centered on large language models has been widely deployed, but its security risks have become increasingly prominent. Existing large language model security defense techniques face multiple challenges. In terms of processing efficiency, some defense systems employ a serial detection architecture in which multiple detection modules execute sequentially; this architecture causes processing delays to accumulate linearly as the number of modules increases, making it difficult to meet the real-time interaction requirements of high concurrency and low latency. Even where some systems introduce the concept of multiple agents, the latency problem is not fundamentally solved because an efficient parallel collaboration mechanism is lacking. In terms of defense robustness, the prior art mostly relies on simple rule filtering or on detection models trained for specific attack patterns. When confronted with discrete adversarial samples constructed in text space by means of synonym substitution, character perturbation, and the like, or with continuous adversarial samples generated by applying minute perturbations in the model's internal embedding space, the detection performance of these models may degrade significantly, and the defense success rate is low. Although the concept of adversarial training exists in the prior art, training is usually performed for only a single type of adversarial sample; the lack of a hybrid training mechanism capable of simultaneously resisting attacks from both text space and embedding space results in insufficient model robustness.
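The hybrid adversarial training idea described above (fusing losses over clean samples, text-space discrete adversarial samples, and embedding-space continuous adversarial samples, as in claim 5) could be sketched as follows. The toy linear detector, embedding table, synonym map, and loss weights are all assumptions for illustration; the continuous perturbation uses an FGSM-style signed gradient step as one standard choice.

```python
# Sketch of hybrid adversarial training: total loss fuses a clean sample,
# a discrete adversarial sample built in text space (toy synonym
# substitution), and a continuous adversarial sample built in embedding
# space (FGSM-style signed perturbation). All values are hypothetical.
import math

EMBED = {"ignore": [0.9, 0.1], "disregard": [0.85, 0.15], "hello": [0.1, 0.9]}
SYNONYM = {"ignore": "disregard"}   # discrete text-space perturbation

W = [1.0, -1.0]                     # toy linear detector: score = w . x

def score(x):
    return sum(wi * xi for wi, xi in zip(W, x))

def bce(x, label):
    """Binary cross-entropy on sigmoid(score(x))."""
    p = 1.0 / (1.0 + math.exp(-score(x)))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def fgsm(x, label, eps=0.1):
    """Continuous adversarial sample: step along sign of the input gradient."""
    p = 1.0 / (1.0 + math.exp(-score(x)))
    grad = [(p - label) * wi for wi in W]      # dL/dx for the linear model
    return [xi + eps * math.copysign(1.0, g) for xi, g in zip(x, grad)]

def hybrid_loss(token, label, a=0.5, b=0.25, c=0.25):
    """Fused loss: a*clean + b*discrete + c*continuous (weights assumed)."""
    x = EMBED[token]
    l_clean = bce(x, label)
    l_disc = bce(EMBED[SYNONYM.get(token, token)], label)
    l_cont = bce(fgsm(x, label), label)
    return a * l_clean + b * l_disc + c * l_cont
```

Minimizing this fused loss pushes the detector to stay correct on the clean input, its synonym-substituted variant, and its embedding-space perturbation simultaneously, which is the robustness gap the Background section attributes to single-type adversarial training.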
In terms of decision reliability and flexibility, existing systems generally lack a mechanism for evaluating the reliability of their own detection results. Deep learning models often suffer from "overconfidence": their output probability values are not directly equivalent to true confidence. As a result, defense strategies are often based on fixed, empirically chosen thresholds that are difficult to adjust flexibly to the specific confidence of an individual detection, and are prone to "over-defense" against normal users or "under-defense" against high-risk attacks. Even where the prior art proposes the idea of utilizing uncertainty, it fails to provide a systematic confidence quantification scheme based on the fusion of multiple mature techniques to solve this problem. Therefore, a new large language model security defense technology is needed that can efficiently and accurately identify various types of attacks, has strong adversarial robustness, and can realize adaptive defense strategies based on reliable confidence assessment.
Disclosure of Invention
The application aims to solve the technical problems of the existing large language model defense technology, including low detection efficiency, poor adversarial robustness, the lack of a reliable confidence assessment mechanism, rigid and inflexible defense strategies, and difficulty in cooperatively coping with multiple attack types, and provides an efficient and robust comprehensive security defense system and method with adaptive defense capability.
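Two of the uncertainty-quantification techniques named in the claims could be sketched as follows: Monte Carlo Dropout keeps dropout active at inference and uses the spread of repeated stochastic passes as an uncertainty estimate, while temperature scaling (a model-calibration technique) softens an overconfident raw probability. The toy network weights, dropout rate, and temperature value are assumptions for illustration.

```python
# Sketch: Monte Carlo Dropout uncertainty plus temperature-scaling
# calibration over a toy 4-feature linear risk model (values assumed).
import math
import random

random.seed(0)                       # deterministic toy example
WEIGHTS = [0.8, 0.6, 0.9, 0.4]

def mc_dropout_predict(x, passes=100, drop_p=0.5):
    """Return (mean risk, predictive std) over stochastic dropout passes."""
    preds = []
    for _ in range(passes):
        # Inverted dropout: randomly zero features, rescale the survivors.
        kept = [xi / (1 - drop_p) if random.random() > drop_p else 0.0
                for xi in x]
        logit = sum(w * k for w, k in zip(WEIGHTS, kept))
        preds.append(1.0 / (1.0 + math.exp(-logit)))
    mean = sum(preds) / passes
    var = sum((p - mean) ** 2 for p in preds) / passes
    return mean, math.sqrt(var)

def calibrate(p, temperature=2.0):
    """Temperature scaling: divide the logit by T to soften overconfidence."""
    logit = math.log(p / (1 - p))
    return 1.0 / (1.0 + math.exp(-logit / temperature))

mean_risk, uncertainty = mc_dropout_predict([1.0, 0.5, 1.0, 0.2])
confidence = calibrate(mean_risk)    # one variant of a final confidence score
```

A high predictive standard deviation signals that the model's risk estimate is unreliable, which is exactly the information a fixed-threshold defense strategy lacks; the calibrated probability can then drive the adaptive thresholds described in the claims.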
In order to achieve the above object, the present application provides a large language model security defense system based on multi-agent collaboration, comprising: a plurality of security agent modules configured to perform different security detection tasks in parallel on received data to be detected to generate respective detection results; a decision fusion module configured to aggregate the detection results generated by the plurality of security agent modules and generate a comprehensive risk assessment result based on a preset fusion algorithm; a model training module configured to perform hybrid adversarial training of the security detection models in the security agent modules, wherein the hybrid adversarial training comprises training the security detection models using both discrete adversarial samples generated in text space and continuous adversarial samples generated in embedding space; a confidence assessment module configured to quantify the uncertainty of the comprehensive risk assessment result to generate a final confidence score, wherein the quantification of the uncertainty is based on at least one technique selected from the group consisting of the Monte Carlo Dropout technique, the deep ensemble technique, and the model calibration technique; and an adaptive defense module configured to select and execute one defensive action from a plurality of preset defensive actions corresponding to different intervention levels based on the final confidence score. Further, the decisio
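The adaptive defense module's threshold comparison (claim 4) could be sketched as follows; the specific threshold values and action names are illustrative assumptions, since the patent leaves the presets to the implementer.

```python
# Sketch of claim 4: compare the final confidence score to preset
# thresholds and select one defensive action per intervention level
# (blocking, content rewriting, releasing). Thresholds are assumed.
def select_defense(confidence: float,
                   block_threshold: float = 0.8,
                   rewrite_threshold: float = 0.5) -> str:
    """Map a final confidence score to an intervention level."""
    if confidence >= block_threshold:
        return "block"      # high-confidence attack: refuse the request
    if confidence >= rewrite_threshold:
        return "rewrite"    # ambiguous: sanitize/rewrite the content
    return "release"        # low risk: pass the content through unchanged
```

Because the score fed in is a calibrated confidence rather than a raw model probability, the same thresholds behave more consistently across inputs, which is the "over-defense"/"under-defense" problem the Background section raises.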