CN-121981159-A - Reasoning and training method for a symbol-guided neural-symbolic reinforcement learning agent and hardware architecture thereof

CN 121981159 A

Abstract

The invention provides a reasoning and training method for a symbol-guided neural-symbolic reinforcement learning agent, together with a hardware architecture for implementing it. A symbol guidance system judges the validity and safety of each action in the current environment according to internal symbolic rules, generates an action mask, and applies it to the neural network's output as a probability bias or an equivalent operation, thereby imposing interpretable and verifiable symbolic constraints on the policy space before the final action is selected. During training, the same symbolic constraint mechanism is introduced: action selection in both the exploration and exploitation phases is first pruned or biased by the action_mask, which reduces dangerous or meaningless interaction samples at the source, improves the quality and validity of the samples in the experience replay pool, and improves training stability and convergence efficiency during back-propagation updates.
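To make the masking step concrete, below is a minimal sketch of how an action_mask could bias a policy network's output probabilities before sampling. This is an illustration, not the patent's implementation: the bias strength, the mask encoding (+1 recommended / 0 neutral / -1 not recommended), and the function names are assumptions.

```python
import numpy as np

# Assumed bias strength; the patent does not fix a value.
ACTION_MASK_BIAS = 4.0

def apply_action_mask(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Bias neural-network logits with the symbolic action mask, then normalize.

    Unsafe actions (mask == -1) could instead be hard-pruned by setting their
    logits to -inf; the abstract allows either pruning or biasing.
    """
    biased = logits + ACTION_MASK_BIAS * action_mask
    probs = np.exp(biased - biased.max())        # numerically stable softmax
    return probs / probs.sum()

# Usage: 4 actions; the symbol system forbids action 2 and recommends action 0.
logits = np.array([0.2, 0.9, 1.5, 0.1])
mask = np.array([+1, 0, -1, 0])
probs = apply_action_mask(logits, mask)
action = np.random.choice(len(probs), p=probs)   # finally selected action
```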

Inventors

  • YAN BONAN
  • ZHAO HONGXIAO
  • GUO FENGYUAN

Assignees

  • Peking University (北京大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-07

Claims (8)

  1. A reasoning method for a symbol-guided neural-symbolic reinforcement learning agent, characterized in that the agent comprises a neural network and a symbol guidance system, the symbol guidance system extracts environment information and delivers the extracted information to the neural network for processing, and the symbol guidance system finally judges whether an action is safe, the reasoning method comprising the following steps: (1) inputting environment information obs into the symbol guidance system; (2) the symbol guidance system completes symbolic understanding and information extraction according to its internal rules and outputs a high-dimensional state vector nn_input and an action mask action_mask, wherein nn_input contains the result of the symbolic understanding and information extraction that the symbol guidance system performs on obs according to its internal rules, and action_mask contains the internal rules' preference toward every action in the current environment and whether each action is safe; (3) the high-dimensional state vector nn_input is input into the neural network, which forward-propagates it through one or more of an embedding layer, a convolution layer, a fully connected layer, an attention mechanism, or an activation function, finally obtaining the probability of executing each action under the given environment input; (4) the symbol guidance system changes, according to action_mask, the probabilities with which the actions output by the neural network in step (3) are executed, and samples from those probabilities to obtain the finally selected action.
  2. The reasoning method of claim 1, wherein in step (2) the information extraction performed by the symbol guidance system is rule matching based on user-written rules.
  3. The reasoning method of claim 1, wherein step (4) specifically comprises: if action_mask marks an action as recommended/neutral/not recommended, the probability of that action is increased/maintained/decreased, respectively; the probabilities obtained after this processing fuse the neural network's and the symbol guidance system's understanding of the current environment.
  4. A training method for a symbol-guided neural-symbolic reinforcement learning agent, characterized in that during training an action executed by the agent is implemented only after the symbol guidance system has judged it safe, and during the reinforcement learning training process, action selection in both the exploration mode and the policy-exploitation mode first has the symbol guidance system generate an action mask, after which the probability of each action is changed according to that mask, the training method comprising the following steps (a training-loop sketch follows the claims): (1) the symbol guidance system performs symbolic understanding and information extraction on the current environment information obs and outputs a high-dimensional state vector nn_input and an action mask action_mask; (2) if the policy-exploitation branch is entered, the neural network computes an action probability vector from nn_input, the vector is then changed according to action_mask, and the selected action is finally sampled from the changed action probability vector; (3) after the environment executes the action, it returns new environment information obs', a reward rwd, and an episode-termination signal done; the symbol guidance system generates a new high-dimensional state vector nn_input' from the new environment information, the tuple (nn_input, action, rwd, nn_input', done) is stored in the experience replay pool, and samples are then drawn from the replay pool to perform back-propagation on the neural network and update its weights.
  5. A neural-symbolic reinforcement learning hardware architecture for implementing the reasoning method of claim 1, comprising three tightly coupled components, namely a cognitive processing unit, a neural processing unit, and an on-chip shared memory, wherein the neural network computation is mapped to the neural processing unit and the symbol guidance system is mapped to the cognitive processing unit for execution. The cognitive processing unit serves as a symbolic reasoning engine optimized for symbolic rule matching; it receives external environment information, performs rule matching using an internal rule base, and completes two key outputs: it abstracts the external environment information into a high-dimensional state vector nn_input, and it generates an action mask action_mask that applies constraints to the action space through symbolic reasoning; the generated nn_input is stored in the on-chip shared memory, and the generated action_mask is passed to a decision module. In one inference pass, the neural processing unit first reads the neural network weights and configuration from the on-chip shared memory, then reads the high-dimensional state vector nn_input output by the cognitive processing unit, performs neural network forward propagation using its internal array of parallel processing elements, writes the resulting action probability vector back to the shared memory, and sends an inference-completion signal to the cognitive processing unit to mark the end of one neural inference. The on-chip shared memory serves as a unified storage hub that minimizes data movement and provides a communication path between the cognitive processing unit and the neural processing unit; its contents comprise at least the neural network weights and configuration and the data exchanged between the two units during inference.
  6. The neural-symbolic reinforcement learning hardware architecture of claim 5, wherein the on-chip shared memory is implemented with high-speed static random-access memory (SRAM).
  7. The neural-symbolic reinforcement learning hardware architecture of claim 5, wherein the cognitive processing unit performs parallel rule matching by integrating a plurality of rule-matching circuits that operate in parallel.
  8. The neural-symbolic reinforcement learning hardware architecture of claim 5, wherein inference is carried out by the cognitive processing unit uniformly scheduling the components, one inference pass comprising four phases (a dataflow sketch follows the claims): 1) external environment information first enters the system through the cognitive processing unit, and the system starts the inference pass upon receiving it; 2) the cognitive processing unit performs rule matching on the input environment information against a preloaded rule base and generates nn_input and action_mask; the cognitive processing unit then writes nn_input into the on-chip shared memory, forwards the address at which nn_input is stored to the neural processing unit, and triggers the subsequent neural inference read; 3) neural network forward propagation: triggered by the cognitive processing unit, the neural processing unit reads nn_input from the shared memory, performs efficient forward inference using its parallel processing array, writes the resulting probabilities of all actions back to the shared memory when inference finishes, and sends a completion signal to the cognitive processing unit; 4) upon receiving the completion signal from the neural processing unit, the cognitive processing unit retrieves the action probabilities output by the neural network from the shared memory via the decision module, fuses them with action_mask by applying action_mask to the probabilities output by the neural network, and samples from the new action probabilities to obtain and output the selected action, ending the inference pass.
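As a concrete illustration of the training method of claim 4, the following minimal sketch shows one interaction step with mask-constrained exploration/exploitation and experience replay. It reuses apply_action_mask from the sketch after the Abstract; the interfaces (env.step, symbol_system.extract, policy_net.loss) and all hyperparameters are assumptions for illustration, not the patent's specification.

```python
import random
from collections import deque

import numpy as np

BUFFER = deque(maxlen=10_000)   # experience replay pool (capacity assumed)
EPSILON = 0.1                   # exploration rate (assumed)
BATCH = 64                      # replay batch size (assumed)

def select_action(nn_input, action_mask, policy_net):
    """Mask-constrained action selection (claim 4, step (2))."""
    if random.random() < EPSILON:                    # exploration branch:
        allowed = np.flatnonzero(action_mask >= 0)   # prune unsafe actions
        return int(random.choice(allowed))
    logits = policy_net(nn_input)                    # exploitation branch:
    probs = apply_action_mask(logits, action_mask)   # bias, then sample
    return int(np.random.choice(len(probs), p=probs))

def train_step(env, symbol_system, policy_net, optimizer, obs):
    """One interaction step following claim 4; returns the next observation."""
    nn_input, action_mask = symbol_system.extract(obs)         # step (1)
    action = select_action(nn_input, action_mask, policy_net)  # step (2)
    obs2, rwd, done = env.step(action)                         # step (3)
    nn_input2, _ = symbol_system.extract(obs2)                 # nn_input'
    BUFFER.append((nn_input, action, rwd, nn_input2, done))    # store tuple
    if len(BUFFER) >= BATCH:                     # sample and back-propagate
        batch = random.sample(BUFFER, BATCH)
        loss = policy_net.loss(batch)      # assumed torch-style loss object
        loss.backward(); optimizer.step(); optimizer.zero_grad()
    return obs2
```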
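Claim 8's four inference phases can likewise be summarized as host-side pseudocode mirroring the hardware dataflow. A minimal sketch, again reusing apply_action_mask; every interface name (receive, match_rules, trigger, forward, wait_for_completion_signal) is a hypothetical stand-in for the corresponding hardware handshake, not an identifier from the patent.

```python
import numpy as np

def hardware_inference(cpu_unit, npu, shared_mem, obs):
    """Simulate the four-phase inference pass of claim 8 (names assumed)."""
    # Phase 1: environment information enters through the cognitive
    # processing unit, which starts the inference pass on receipt.
    cpu_unit.receive(obs)
    # Phase 2: rule matching against the preloaded rule base produces
    # nn_input and action_mask; nn_input goes to on-chip shared memory
    # and its address is forwarded to the neural processing unit.
    nn_input, action_mask = cpu_unit.match_rules(obs)
    addr = shared_mem.write(nn_input)
    npu.trigger(addr)
    # Phase 3: the NPU reads nn_input, runs forward propagation on its
    # parallel array, writes action probabilities back, and signals done.
    probs_addr = npu.forward()
    cpu_unit.wait_for_completion_signal()
    # Phase 4: the decision module fuses the probabilities with the mask
    # (log turns probabilities back into logits for the bias) and samples.
    probs = shared_mem.read(probs_addr)
    fused = apply_action_mask(np.log(probs + 1e-9), action_mask)
    return int(np.random.choice(len(fused), p=fused))
```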

Description

Reasoning and training method for a symbol-guided neural-symbolic reinforcement learning agent and hardware architecture thereof

Technical Field

The invention relates to the fields of reinforcement learning acceleration, neural-symbolic computation, and hardware-software co-design, and in particular to a reasoning and training method for a symbol-guided neural-symbolic reinforcement learning agent and a hardware architecture thereof.

Background

Deep reinforcement learning learns complex strategies for a specific task through large-scale trial-and-error sampling; the neural network inside the agent automatically extracts features from a high-dimensional environment and learns complex optimal policies. Because it relies on trial-and-error sampling and back-propagation to update the policy, deep reinforcement learning has strong expressive power in complex decision tasks, but it generally suffers from low sample efficiency and poor adaptability to slight environmental changes. Its training effect depends on high-quality exploration samples, yet in real environments such samples are hard to obtain efficiently, and a learned policy easily suffers a performance drop after the environment map is slightly modified. A symbol guidance system is built around explicit symbolic rules. Symbolic rules in "if-then" form can guarantee the safety and robustness of the agent, and a cognitive processing unit (a hardware accelerator built around the symbol guidance system) provides highly parallel symbolic reasoning, reducing the agent's running latency. The cognitive processing unit internally contains multiple rule-matching circuits for parallel symbol matching, as well as a decision module whose decision scheme can be defined as required. At runtime, the input of the cognitive processing unit is environment information, and its output is the result obtained by reasoning over the internal symbolic rules. The advantages of a symbol guidance system include parallelizable rule matching, interpretable behavior, and rule constraints that avoid dangerous actions; for example, in a grid-world task, a rule such as "do not walk into traps/holes" remains valid after the environment changes, which improves robustness. However, pure symbolic reasoning struggles to express complex policies, and the cognitive-processing-unit architecture is incompatible with modern numerical reinforcement learning algorithms, while graphics processing units and neural processing units (NPUs) are good at numerical computation but poor at symbolic rule matching; consequently, neural-symbolic reinforcement learning on existing heterogeneous platforms suffers from high data movement cost, high end-to-end latency, and unsatisfactory energy consumption. A software-hardware integrated scheme for neural-symbolic reinforcement learning is therefore needed, one that can both constrain the action space with symbolic rules to improve the quality and safety of training samples and efficiently execute neural network inference/training computation, reducing end-to-end latency and energy consumption through an on-chip tightly coupled dataflow.
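To make the "if-then" rule matching described above concrete, here is a minimal sketch of how user-written symbolic rules could be evaluated against a grid-world observation to produce an action mask. The observation encoding, rule format, and grid semantics are all assumptions for illustration, not the patent's specification.

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

# Hypothetical user-written "if-then" rules: a condition on the observation,
# then an (action, verdict) pair; verdict is +1 recommend / -1 forbid.
RULES = [
    (lambda obs: obs["tile_ahead"]["up"] == "trap", ("up", -1)),
    (lambda obs: obs["tile_ahead"]["right"] == "goal", ("right", +1)),
]

def match_rules(obs: dict) -> np.ndarray:
    """Evaluate every rule against obs (in hardware, the rule-matching
    circuits would do this in parallel) and accumulate an action mask."""
    mask = np.zeros(len(ACTIONS), dtype=int)          # 0 = neutral
    for condition, (action, verdict) in RULES:
        if condition(obs):
            mask[ACTIONS.index(action)] = verdict
    return mask

obs = {"tile_ahead": {"up": "trap", "down": "floor",
                      "left": "floor", "right": "goal"}}
print(match_rules(obs))   # -> [-1  0  0  1]: "up" forbidden, "right" recommended
```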
Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a reasoning and training method for a symbol-guided neural-symbolic reinforcement learning agent and a hardware architecture thereof; it designs a method that combines neural-network-based deep reinforcement learning with a rule-based symbol guidance system, and provides a hardware architecture for implementing the method. The method and the hardware architecture are applicable to a wide variety of reinforcement learning problems rather than to one particular application. Addressing the problems that deep reinforcement learning has low sample efficiency and adapts poorly to slight environmental changes, while a symbol guidance system struggles to express complex policies, the invention fuses the two so that the combined method can express complex policies while retaining high sample efficiency and strong adaptability; and it provides a hardware architecture that resolves the inability of existing heterogeneous platforms to perform neural-symbolic reinforcement learning efficiently, so that the deep reinforcement learning system and the symbol guidance system can run efficiently within the same hardware architecture. The technical scheme of the invention is as follows: a reasoning method of a symbol-guided neural-symbolic reinforcement learning agent. The method fuses deep reinforcement learning and the symbol guidance system.