CN-121998025-A - Intelligent decision method based on reinforcement learning large model
Abstract
The invention discloses an intelligent decision method based on a reinforcement learning large model. The method comprises: obtaining environment state data, performing feature extraction and encoding fusion to form a unified semantic state representation; inputting the semantic state representation into the reinforcement learning large model to generate a structured decision plan and a decision certificate in parallel; performing constraint-aware reconstruction based on the plan and the certificate, constructing sub-decision units and generating a joint action space; evaluating joint actions with a hierarchical QMIX network, computing joint value and repair cost to determine an optimal action; verifying the optimal action through a constraint shield and, if a constraint is violated, performing minimal repair to produce the execution action; and obtaining feedback data from the executed action to update model parameters and to generate constraint increments that update the constraint shield. Through the joint modeling of decision certificates with hierarchical QMIX and counterexample-driven constraint updating, the invention achieves verifiable collaborative optimization and safe decision-making.
Inventors
- HUANG AMIN
- Dou Shiwen
- WANG SHIBIN
Assignees
- 江苏鼎峯云计算有限公司 (Jiangsu Dingfeng Cloud Computing Co., Ltd.)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-04-10
Claims (8)
- 1. An intelligent decision method based on a reinforcement learning large model, characterized by comprising the following steps: acquiring environment state data, and processing the environment state data to obtain a unified semantic state representation; inputting the semantic state representation into a reinforcement learning large model, and outputting a structured decision plan and a corresponding decision certificate in parallel through a plan generation head and a certificate generation head; based on the mapping relation between the structured decision plan and the decision certificate, performing constraint-aware reconstruction on each sub-decision step to form sub-decision units, generating an action candidate set according to the action type, parameter range and association constraints of each sub-decision unit, and establishing a dependency graph among the sub-decision units to form a joint action space; taking candidate action combinations in the joint action space as evaluation objects, computing the joint value and predicted repair cost of each candidate action combination through a certificate-conditioned hierarchical QMIX network based on the semantic state representation, the structured decision plan, the decision certificate and the dependency graph, and determining an optimal action combination; inputting the optimal action combination and the decision certificate into a constraint shield for consistency verification, executing the optimal action combination when verification passes, and performing minimal repair on the optimal action combination according to the verification result when verification fails, thereby obtaining a repair action and generating corresponding counterexample information; and executing the optimal action combination or the repair action, acquiring environment feedback data, updating the reinforcement learning large model and the hierarchical QMIX network according to the environment feedback data, and generating constraint increments according to the counterexample information to update the constraint shield.
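The closed-loop flow of claim 1 can be sketched in simplified form. The following toy Python is an illustrative reading only; every name is hypothetical, and the generation, evaluation and shield components of the claim are stubbed out as plain callables rather than the claimed networks:

```python
def decision_cycle(state, generate, joint_space, value, repair_cost, verify, repair):
    """One decision cycle in the spirit of claim 1 (toy form): generate a plan
    and certificate in parallel, enumerate the joint action space, rank by
    joint value minus predicted repair cost, then shield-verify the winner."""
    plan, cert = generate(state)                      # plan head + certificate head
    candidates = joint_space(plan, cert)              # joint action space
    best = max(candidates,
               key=lambda a: value(state, a, cert) - repair_cost(a, cert))
    if verify(best, cert):                            # constraint shield check
        return best, None
    fixed, counterexample = repair(best, cert)        # minimal repair on failure
    return fixed, counterexample
```

The counterexample returned on the failure path is what claim 8 later turns into a constraint increment.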
- 2. The intelligent decision method based on a reinforcement learning large model according to claim 1, wherein the environment state data comprises structured data and unstructured data, wherein the structured data comprises system operating parameters, resource state information, task attribute information and time-series data, and the unstructured data comprises text data, log data and event description data.
- 3. The intelligent decision method based on a reinforcement learning large model according to claim 1, wherein processing the environment state data to obtain a unified semantic state representation comprises: performing numerical normalization and feature vectorization on the structured data, performing text encoding on the unstructured data, fusing the processed structured features and unstructured features, and producing a unified representation through an encoding network to obtain the semantic state representation.
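The fusion step of claim 3 can be illustrated with a deliberately minimal sketch. Min-max normalization and a hashed bag-of-words stand in for the claimed feature vectorization and text encoding network; these concrete choices are assumptions for illustration only:

```python
import math

def unify_state(structured, texts, text_dim=8):
    """Toy version of claim 3: normalize numeric features, encode text via
    hashed token counts, and concatenate into one semantic state vector."""
    lo, hi = min(structured), max(structured)
    span = (hi - lo) or 1.0
    numeric = [(x - lo) / span for x in structured]   # min-max normalization
    text_vec = [0.0] * text_dim
    for doc in texts:
        for tok in doc.lower().split():
            text_vec[hash(tok) % text_dim] += 1.0     # hashed bag-of-words
    norm = math.sqrt(sum(v * v for v in text_vec)) or 1.0
    text_vec = [v / norm for v in text_vec]           # unit-normalize text part
    return numeric + text_vec                         # fused semantic state
```

A real system would replace both halves with learned encoders; the point here is only the fuse-then-unify shape of the claim.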
- 4. The intelligent decision method based on a reinforcement learning large model according to claim 1, wherein outputting the structured decision plan and the corresponding decision certificate in parallel through the plan generation head and the certificate generation head comprises: inputting the semantic state representation into a large-model backbone network of the reinforcement learning large model for feature extraction to obtain a state feature representation for decision generation, wherein the backbone network adopts a Transformer architecture; inputting the state feature representation into a plan generation head connected to the backbone network to generate the structured decision plan, wherein the structured decision plan comprises a plurality of sequentially arranged sub-decision steps, the action type and parameter range corresponding to each sub-decision step, and the dependency relations among the steps; inputting the state feature representation into a certificate generation head connected to the backbone network to generate a decision certificate corresponding to the structured decision plan, wherein the decision certificate comprises the pre-condition, invariant, post-assertion and constraint reference corresponding to each sub-decision step; establishing, according to the structured decision plan and the decision certificate, a mapping relation between each sub-decision step and its corresponding pre-condition, invariant and post-assertion, and storing the mapping relation, the structured decision plan and the decision certificate in association; and updating the parameters of the reinforcement learning large model based on environment feedback data through a reinforcement learning update unit connected to the backbone network, wherein the reinforcement learning update unit performs actor-critic updates and the parameter update mode is low-rank adaptation (LoRA).
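A toy rendering of claim 4's dual-head backbone with a low-rank parameter update. The claimed backbone is a Transformer; here it is reduced to a single linear map so that the additive low-rank correction (W + AB) and the parallel plan/certificate heads are visible. All names are hypothetical simplifications:

```python
def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class DualHeadModel:
    """Frozen weight W plus LoRA factors A (d x r) and B (r x d); two heads
    read the same backbone features in parallel, as in claim 4."""
    def __init__(self, W, A, B, plan_head, cert_head):
        self.W, self.A, self.B = W, A, B
        self.plan_head, self.cert_head = plan_head, cert_head

    def backbone(self, x):
        base = matvec(self.W, x)
        lora = matvec(self.A, matvec(self.B, x))   # (A @ B) x as A (B x)
        return [b + l for b, l in zip(base, lora)]

    def __call__(self, x):
        h = self.backbone(x)
        return self.plan_head(h), self.cert_head(h)  # parallel heads
```

Only A and B would be trained during the actor-critic update, which is the point of the low-rank adaptation named in the claim.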
- 5. The intelligent decision method based on a reinforcement learning large model according to claim 1, wherein generating the action candidate set, establishing a dependency graph among the sub-decision units and forming a joint action space comprises: reading the mapping relation of each sub-decision step in the structured decision plan and the decision certificate, and extracting the action type, parameter range, pre-condition, invariant, post-assertion and constraint reference corresponding to each sub-decision step; generating constraint description fragments for each sub-decision step based on its corresponding pre-conditions, invariants and constraint references, and binding the constraint description fragments with the action type, parameter range and execution sequence of the corresponding sub-decision step to form a step-level constraint semantic unit; performing constraint-aware reconstruction on each step-level constraint semantic unit, which comprises: splitting the action elements, parameter elements and execution conditions of the original sub-decision step according to the constraint type, constraint strength and constraint target object in the constraint description fragments, recombining action elements and parameter elements that share the same constraint target object and satisfy compatibility conditions, isolating action elements and parameter elements with constraint conflicts, and generating corresponding sub-decision units from the recombined action elements, parameter elements and execution conditions; for each sub-decision unit, generating an action candidate set from the reconstructed action elements, parameter elements and execution conditions, filtering the action candidate set according to the corresponding pre-conditions and invariants, and deleting candidate actions that fail a pre-condition or violate an invariant; and establishing a dependency graph among the sub-decision units according to the step sequence relation in the structured decision plan, the constraint transfer relations among the step-level constraint semantic units and the constraint conflict relations among the sub-decision units, specifically: extracting the input constraints, output constraints and assertion trigger conditions corresponding to each sub-decision unit; matching the output constraints of one sub-decision unit with the input constraints of another sub-decision unit and establishing a constraint transfer edge; matching the post-assertion of one sub-decision unit with the pre-condition of another sub-decision unit and establishing an assertion trigger edge; and superposing the constraint transfer edges and the assertion trigger edges to form a dependency graph among the sub-decision units comprising sequence dependencies, constraint dependencies and assertion dependencies, and generating the joint action space based on the action candidate set of each sub-decision unit and the dependency graph.
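The edge-building rules at the end of claim 5 (constraint transfer edges from output/input constraint matches, assertion trigger edges from post-assertion/pre-condition matches) can be sketched directly. Representing each matching rule as a set intersection is a simplifying assumption:

```python
def build_dependency_graph(units):
    """Claim 5 sketch: `units` maps a unit name to a dict with 'in'/'out'
    constraint sets and 'pre'/'post' assertion sets. An edge (u, v, kind)
    is added when u's output constraints feed v's input constraints
    (constraint transfer) or u's post-assertions satisfy v's
    pre-conditions (assertion trigger)."""
    edges = []
    for u, cu in units.items():
        for v, cv in units.items():
            if u == v:
                continue
            if cu["out"] & cv["in"]:
                edges.append((u, v, "constraint-transfer"))
            if cu["post"] & cv["pre"]:
                edges.append((u, v, "assertion-trigger"))
    return edges
```

Superposing both edge kinds in one edge list mirrors the claim's superposition of constraint transfer and assertion trigger edges into a single dependency graph.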
- 6. The intelligent decision method based on a reinforcement learning large model according to claim 1, wherein computing the joint value and predicted repair cost of each candidate action combination through the certificate-conditioned hierarchical QMIX network and determining the optimal action combination comprises: taking each candidate action combination in the joint action space as an evaluation object and extracting the local observation information of the sub-decision units, the semantic state representation, the structured decision plan, the decision certificate and the adjacency information in the sub-decision-unit dependency graph corresponding to that candidate action combination, wherein the hierarchical QMIX network comprises a certificate-conditioned local value network, an intra-group mixing layer, a global mixing layer and a repair cost prediction branch; fusing, through the certificate-conditioned local value network, the local observation information of each sub-decision unit, the semantic state representation, the pre-conditions and invariants bound to the corresponding sub-decision unit in the decision certificate, and the adjacency information, to obtain the conditional local value of each sub-decision unit under the corresponding candidate action combination; grouping sub-decision units with strong dependencies through the intra-group mixing layer according to the dependency strength, constraint transfer relations and assertion trigger relations in the dependency graph, and performing first-layer mixing on the conditional local values of the sub-decision units within each group to obtain a group-level value for each group; performing second-layer mixing on the group-level values of all groups, the semantic state representation, the plan encoding of the structured decision plan and the certificate encoding of the decision certificate through the global mixing layer to obtain the joint value of each candidate action combination, and outputting a predicted repair cost for each candidate action combination through the repair cost prediction branch according to the matching results between the candidate action combination and the pre-conditions, invariants, post-assertions and constraint references in the decision certificate; and performing joint ranking according to the joint value and predicted repair cost of each candidate action combination, and determining the candidate action combination with the highest joint value and the lowest predicted repair cost as the optimal action combination.
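Claim 6's two-level mixing and joint ranking can be illustrated as follows: per-unit values are mixed within groups and then globally with non-negative weights (preserving the monotonicity characteristic of QMIX-style mixing), and candidates are ranked by joint value minus predicted repair cost. The linear mixing and the plain callables are illustrative assumptions, not the claimed networks:

```python
def rank_candidates(candidates, local_q, groups, group_w, global_w, repair_cost):
    """Claim 6 sketch: two-level monotonic mixing of per-unit values.
    local_q(unit, action) gives the certificate-conditioned local value;
    groups maps group name -> member units; all weights are non-negative,
    so higher local values never lower the joint value (QMIX monotonicity).
    Candidates are jointly ranked by joint value minus predicted repair cost."""
    def joint_value(action):
        group_vals = {g: group_w[g] * sum(local_q(u, action) for u in members)
                      for g, members in groups.items()}          # intra-group mixing
        return sum(global_w[g] * v for g, v in group_vals.items())  # global mixing
    return max(candidates, key=lambda a: joint_value(a) - repair_cost(a))
```

Subtracting the repair cost is one simple way to realize the claim's "highest joint value and lowest predicted repair cost" joint ranking.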
- 7. The intelligent decision method based on a reinforcement learning large model according to claim 1, wherein inputting the optimal action combination and the decision certificate into the constraint shield for consistency verification comprises: constructing the constraint shield based on the decision certificate and the current constraint set, specifically: extracting the pre-conditions, invariants, post-assertions and constraint references in the decision certificate, and combining the pre-condition, invariant and post-assertion associated with the same sub-decision unit into a local constraint fragment according to the mapping relation between each sub-decision step and its constraints; connecting and hierarchically encapsulating the local constraint fragments according to the constraint transfer relations and assertion trigger relations among the sub-decision units in the dependency graph, to form local shield units in one-to-one correspondence with the sub-decision units and a global shield unit coordinating all local shield units; inputting the optimal action combination into the constraint shield, each local shield unit performing constraint matching verification on the action content, parameter values and execution sequence of its corresponding sub-decision unit, wherein each local shield unit performs pre-execution verification according to its pre-conditions, execution-process verification according to its invariants, and execution-result verification according to its post-assertions, and transmits its verification result to the global shield unit; the global shield unit performing consistency aggregation verification on the optimal action combination according to the verification results of the local shield units and the constraint transfer, assertion trigger and constraint conflict relations among the sub-decision units, identifying the sub-decision units with constraint conflicts together with the conflicting action content, parameter values, execution sequence and corresponding constraint references, and generating counterexample information; when the optimal action combination fails the consistency aggregation verification, the constraint shield performing minimal repair on the optimal action combination according to the counterexample information, and after each repair re-invoking the corresponding local shield units and the global shield unit for verification until a repair action satisfying the current constraint set is generated; and outputting the optimal action combination as the execution action when it passes the consistency aggregation verification, and outputting the repair action as the execution action when it fails.
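A minimal sketch of claim 7's local/global shield verification and repair loop. Each local shield unit is reduced to a (pre, invariant, post) triple of predicates, and "minimal repair" is reduced to re-fixing only the unit named by the counterexample; both reductions are illustrative assumptions:

```python
def shield_verify(actions, local_shields, global_check):
    """Claim 7 sketch: local shield units check pre-condition, invariant
    and post-assertion per sub-decision unit; the global shield aggregates
    and the first failure yields counterexample information."""
    for unit, action in actions.items():
        pre, inv, post = local_shields[unit]
        for name, check in (("pre", pre), ("inv", inv), ("post", post)):
            if not check(action):
                return False, {"unit": unit, "check": name, "action": action}
    if not global_check(actions):
        return False, {"unit": None, "check": "global", "action": dict(actions)}
    return True, None

def minimal_repair(actions, fix, verify):
    """Repair only the violating unit, then re-verify, as in claim 7's loop."""
    repaired = dict(actions)
    ok, cex = verify(repaired)
    while not ok:
        repaired[cex["unit"]] = fix(repaired[cex["unit"]], cex)
        ok, cex = verify(repaired)
    return repaired
```

In the claimed method the repair would be guided by the predicted repair cost of claim 6 rather than a fixed `fix` callable.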
- 8. The intelligent decision method based on a reinforcement learning large model according to claim 1, wherein generating constraint increments from the counterexample information to update the constraint shield comprises: executing the optimal action combination or the repair action, and collecting environment feedback data after execution, wherein the environment feedback data comprises execution result data, state change data, reward feedback data and constraint trigger data; constructing an update sample based on the environment feedback data, wherein the update sample comprises the semantic state representation, the structured decision plan, the decision certificate, the optimal action combination or repair action, the environment feedback data and the counterexample information corresponding to the current decision process; inputting the update sample into a reinforcement learning update unit, performing parameter updates on the reinforcement learning large model, and collaboratively updating the backbone network, the plan generation head and the certificate generation head; inputting the update sample into the hierarchical QMIX network, and updating the parameters of the certificate-conditioned local value network, the intra-group mixing layer, the global mixing layer and the repair cost prediction branch; and extracting, according to the counterexample information, the violating sub-decision unit, the conflicting constraint references, the trigger conditions, the violating action content, the violating parameter values and the violating execution sequence, generating constraint increments, and writing the constraint increments into the current constraint set to update the constraint shield.
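The counterexample-to-constraint-increment step of claim 8 can be sketched as turning each counterexample into a new predicate appended to the current constraint set. Representing an increment as a closure that forbids the exact violating assignment is an illustrative assumption; the claimed increments may be richer:

```python
def constraint_increment(counterexample):
    """Claim 8 sketch: derive a reusable constraint from a counterexample.
    Here the increment simply forbids the exact violating assignment
    for the violating sub-decision unit."""
    unit, bad = counterexample["unit"], counterexample["action"]
    return lambda actions: actions.get(unit) != bad

def update_constraint_set(constraints, counterexample):
    """Append the increment so the shield rejects this failure mode next time."""
    return constraints + [constraint_increment(counterexample)]
```

This closes the decision-verification-repair-constraint-evolution loop described in the specification: each counterexample permanently strengthens the shield.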
Description
Intelligent decision method based on reinforcement learning large model

Technical Field

The invention relates to the technical field of reinforcement learning, and in particular to an intelligent decision method based on a reinforcement learning large model.

Background

With the development of artificial intelligence technology, intelligent decision methods based on reinforcement learning are widely applied in scenarios such as task scheduling, resource allocation, intelligent control, automated operation and maintenance, and risk decision-making. The prior art generally constructs a state, an action and a reward mechanism so that a decision model continuously optimizes its policy while interacting with the environment, thereby realizing automated decisions for complex tasks. With the development of deep learning, using neural networks to represent high-dimensional state information and combining them with reinforcement learning algorithms to improve decision performance has become an important direction in the field of intelligent decision-making. In some complex applications, techniques also attempt to introduce large models for semantic understanding of environment states or for generating preliminary decision schemes, enhancing the decision system's ability to process unstructured data and complex context. However, the prior art still has significant shortcomings in complex, multi-dimensionally coupled decision scenarios. On one hand, traditional reinforcement learning methods generally model the whole decision as a single action or simply split it into multiple sub-tasks, and lack a unified model of the dependency, constraint transfer and assertion trigger relations among multiple sub-decisions, so the joint optimization effect is limited and a globally optimal decision result is difficult to obtain.
On the other hand, most decision generation processes in the prior art are black-box outputs: even when a deep neural network or a large model is introduced, only actions or policy results are generated, without the pre-conditions, invariants and post-assertions bound to the sub-decision steps, so the decision results are difficult to verify and no consistency check or safety constraint control can be performed before execution. Most constraint handling in the prior art relies on static rule filtering or fixed penalty terms, which are difficult to adapt to the violations and failure modes exposed during actual execution. When a decision result violates a constraint, existing methods generally can only report failure or re-search for actions; they lack a mechanism that precisely locates the violating decision unit, performs local repair, and generates constraint increments from counterexample information, so the system cannot form a closed loop of decision, verification, repair and constraint evolution. The prior art thus struggles to simultaneously achieve policy optimality, constraint controllability and decision interpretability, and cannot meet the requirements of verifiable collaborative optimization and adaptive safety control in complex scenarios with coupled sub-decisions. Therefore, how to provide such an intelligent decision method based on a reinforcement learning large model is a problem to be solved by those skilled in the art.
Disclosure of Invention

The invention aims to provide an intelligent decision method based on a reinforcement learning large model which, by introducing a decision certificate generation mechanism, a certificate-conditioned hierarchical QMIX joint value model and a counterexample-driven dynamic constraint shield update mechanism, realizes collaborative optimization and verifiable control for complex problems with coupled sub-decisions, and constructs a closed-loop decision flow from decision generation, joint evaluation and consistency verification to adaptive constraint updating, with strong decision interpretability, high constraint control capability, good joint optimization effect, and strong system safety and stability. According to an embodiment of the invention, an intelligent decision method based on a reinforcement learning large model comprises the following steps: acquiring environment state data, and processing the environment state data to obtain a unified semantic state representation; inputting the semantic state representation into a reinforcement learning large model, and outputting a structured decision plan and a corresponding decision certificate in parallel through a plan generation head and a certificate generation head; based on the mapping rel