CN-122024553-A - Electric power operation verification multi-mode training system based on reinforcement learning


Abstract

The invention discloses a reinforcement-learning-based multi-mode practical training system for electric power operation verification. The system comprises a multi-source data acquisition module, a robust state generation module, a flow state machine module, an action and reward construction module, an action acceptance judging module, an offline reinforcement learning training module, and a practical training reasoning and verification module. The multi-source data acquisition module acquires multi-source data and generates unified time stamps. The robust state generation module constructs multi-source samples and generates a robust state sequence through time offset processing, state coding and lower bound aggregation. The flow state machine module constructs a flow state machine and determines the action set allowed to be executed in each flow state. The action and reward construction module generates an action sequence and a reward labeling sequence. The action acceptance judging module determines an action acceptance set. The offline reinforcement learning training module obtains an operation strategy model through training with an improved IQL algorithm. The practical training reasoning and verification module outputs recommended operation actions and generates operation verification results and step-level error correction instructions. The system thereby provides intelligent verification and error correction assistance for electric power operation practical training.

Inventors

ZHANG BO
LIU YANG
LIU WENZHAO
ZHAO YUEQUN
HOU TIANWEI
CHEN ZHONGKAI

Assignees

山东网瑞物产有限公司

Dates

Publication Date: 20260512
Application Date: 20260312

Claims (9)

1. A reinforcement-learning-based multi-mode practical training system for electric power operation verification, characterized by comprising:
a multi-source data acquisition module, used for acquiring operation multi-source data in the practical training process and generating a multi-source sample sequence according to time stamps;
a robust state generation module, used for generating a plurality of groups of time-aligned samples within a preset time offset range, performing state coding on the time-aligned samples, and performing lower bound aggregation on the plurality of groups of state coding results to generate a robust state sequence;
a flow state machine module, used for constructing a flow state machine based on operation bill structured field data and determining the action set allowed to be executed in each flow state;
an action and reward construction module, used for generating an action sequence according to the historical operation behaviors corresponding to the multi-source samples, and associating the action sequence with the flow state machine to generate a reward labeling sequence;
an action acceptance judging module, used for constructing an action scoring model based on historical operation data in the operation multi-source data, calibrating the action scoring results with a calibration data set, determining a first consistency ratio threshold and a second consistency ratio threshold corresponding to the operation step criticality level, and generating an action acceptance set in each state;
an offline reinforcement learning training module, used for training an operation strategy model with an improved IQL algorithm, based on the robust state sequence, the action sequence and the reward labeling sequence, in a limited action space formed by the intersection of the allowed action set and the action acceptance set; and
a practical training reasoning and verification module, used for outputting recommended operation actions based on the operation strategy model in the practical training process and generating operation verification results and step-level error correction instructions.
2. The reinforcement learning-based power operation verification multi-mode practical training system according to claim 1, wherein the modules are realized by the following method:
S1, acquiring operation multi-source data in a practical training scene, and generating a multi-source sample sequence according to time stamps;
S2, for each multi-source sample, generating a plurality of groups of time-aligned samples within a preset time offset range, performing state coding on the time-aligned samples, and performing lower bound aggregation on the plurality of groups of state coding results to generate a robust state sequence;
S3, constructing an operation step transfer relation based on operation bill structured field data, mapping the operation step transfer relation into a flow state machine, and determining, according to the flow state machine, the action set allowed to be executed in each state of the robust state sequence;
S4, generating an action sequence according to the historical operation behaviors corresponding to the multi-source samples, and associating the action sequence with the flow state machine to generate a reward labeling sequence;
S5, constructing an action scoring model based on historical operation data in the operation multi-source data, calibrating the action scoring results with a calibration data set, determining a first consistency ratio threshold and a second consistency ratio threshold corresponding to the operation step criticality level, and generating an action acceptance set in each state;
S6, based on the robust state sequence, the action sequence and the reward labeling sequence, taking the intersection of the allowed action set and the action acceptance set as a limited action space, and executing offline reinforcement learning training with an improved IQL algorithm to obtain an operation strategy model; and
S7, outputting an operation verification result and a step-level error correction instruction by using the operation strategy model, based on the robust state generated in real time in the practical training process.
3. The reinforcement learning-based power operation verification multi-mode practical training system according to claim 2, wherein S1 specifically comprises:
S11, acquiring operation multi-source data corresponding to a practical training scene, wherein the operation multi-source data comprises operation video data, operation voice data, speech-to-text data, operation bill structured field data, equipment running state data, alarm event data and operation event log data;
S12, extracting time identification information from each data record in the operation multi-source data, and converting the time identification information into a time stamp under a unified time reference;
S13, dividing the operation multi-source data into a plurality of time windows of a preset window length according to the unified time stamps;
S14, combining the operation video segment, voice segment, text segment, equipment state record, alarm record and operation log record in the same time window to form the multi-source sample corresponding to that time window;
S15, sorting the multi-source samples in time stamp order to generate a multi-source sample sequence consistent with the time sequence of the power operation steps.
4. The reinforcement learning-based power operation verification multi-mode practical training system according to claim 2, wherein S2 specifically comprises:
S21, for each multi-source sample, generating a plurality of groups of time-aligned samples within a preset time offset range, wherein the time offset range extends forwards and backwards, centered on the time stamp of the multi-source sample;
S22, executing state coding on each time-aligned sample, wherein the state coding maps the operation video segments, voice segments, text segments, equipment state records, alarm records and operation log records in the time-aligned sample into a state vector of unified dimension;
S23, dividing the state vector by operation semantics into an equipment state sub-vector, an operation behavior sub-vector and an environment state sub-vector corresponding to the operation bill, according to the operation object, operation type and operation sequence information defined in the operation bill structured field data;
S24, executing lower bound aggregation separately on each group of equipment state sub-vectors, operation behavior sub-vectors and environment state sub-vectors corresponding to the same multi-source sample, wherein the lower bound aggregation compares the values of corresponding dimensions within the same sub-vector group and selects the minimum value;
S25, splicing the lower-bound-aggregated equipment state sub-vector, operation behavior sub-vector and environment state sub-vector in a preset order to form the hierarchical lower bound aggregation state vector corresponding to the multi-source sample;
S26, arranging the hierarchical lower bound aggregation state vectors corresponding to the multi-source samples in time stamp order to generate a robust state sequence.
5. The reinforcement learning-based power operation verification multi-mode practical training system according to claim 2, wherein S3 specifically comprises:
S31, reading an operation step number, an operation object identifier, an operation type identifier and step time constraint information from the operation bill structured field data, wherein the step time constraint information is the minimum time interval and maximum time interval allowed between adjacent operation steps;
S32, constructing an ordered step sequence according to the operation step numbers and the step order, and constructing a flow state machine by taking the operation step numbers as state nodes and taking adjacent step pairs that satisfy the step order and the step time constraint information as state transition relations;
S33, for each state in the robust state sequence, determining a candidate step set in the ordered step sequence according to the time stamp corresponding to the state, wherein the candidate step set consists of the operation step numbers for which the time stamp falls into the step time constraint interval determined by the step time constraint information;
S34, when the candidate step set comprises a plurality of operation step numbers, comparing the operation object identifier and operation type identifier recorded in the operation event log corresponding to the state against the operation object identifiers and operation type identifiers in the operation bill structured field data for consistency, and screening out the uniquely matched current step number;
S35, determining the state node corresponding to the current step number in the flow state machine, and extracting the operation object identifiers and operation type identifiers recorded in the successor state nodes of that state node to form the action set allowed to be executed under that state node.
6. The reinforcement learning-based power operation verification multi-mode practical training system according to claim 2, wherein S4 specifically comprises:
S41, reading an operation time stamp, an operation object identifier and an operation type identifier from the operation event log data corresponding to a multi-source sample, and arranging them in operation time stamp order to generate an operation behavior record sequence;
S42, combining the operation object identifiers of adjacent operation behavior records with the operation type identifiers according to the operation behavior record sequence to form an action sequence arranged in time order;
S43, for each action in the action sequence, selecting from the robust state sequence a plurality of states whose time stamps fall within a preset time neighborhood range as an associated state set, wherein the time neighborhood range extends forwards and backwards, centered on the action time stamp;
S44, matching the action against the allowed action set of the flow state machine corresponding to each state in the associated state set, to obtain the matching results of the action in the plurality of associated states;
S45, generating reward labeling data for the action according to the consistency of the plurality of matching results, wherein positive reward labeling data is generated when the action matches in at least a preset proportion of the associated state set, and negative reward labeling data is generated otherwise; and
S46, associating the reward labeling data with the corresponding actions in the time order of the action sequence to form a reward labeling sequence.
7. The reinforcement learning-based power operation verification multi-mode practical training system according to claim 2, wherein S5 specifically comprises:
S51, based on historical operation data in the operation multi-source data, constructing an action scoring model over states and actions, wherein the action scoring model takes a state and an action as input and outputs an action score;
S52, for each action in the historical operation data, selecting from the robust state sequence a plurality of states whose time stamps fall within a preset time neighborhood range to form the associated state set of that action;
S53, inputting the action combined with each state in the associated state set into the action scoring model to obtain the set of action scores of that action in the plurality of associated states;
S54, performing statistical processing on the set of action scores, and calculating the ratio of the number of states whose action score is not lower than a preset scoring reference to the total number of states in the associated state set, to obtain an action consistency ratio;
S55, reading the operation step number corresponding to the action from the operation bill structured field data, and classifying the action as a high-criticality action or a non-high-criticality action according to the step criticality level marked in the operation bill for that operation step number;
S56, judging the action consistency ratio against a first consistency ratio threshold for high-criticality actions, and against a second consistency ratio threshold for non-high-criticality actions, wherein the first consistency ratio threshold is higher than the second consistency ratio threshold;
S57, generating an action acceptance mark for the action according to the result of comparing the action consistency ratio with the corresponding consistency ratio threshold, and forming the actions marked as acceptable into the action acceptance set of the corresponding state.
8. The reinforcement learning-based power operation verification multi-mode practical training system according to claim 2, wherein S6 specifically comprises:
S61, constructing an offline training sample set based on the robust state sequence, the action sequence and the reward labeling sequence;
S62, for each training sample, screening the actions that fall within the intersection of the action set allowed by the flow state machine and the action acceptance set, to form a limited action training sample set;
S63, constructing a value evaluation network on the limited action training sample set, and updating the value evaluation results corresponding to states and actions in the value evaluation network based on the current robust state and the corresponding action in each training sample;
S64, calculating the advantage value of the corresponding action in each training sample based on the output of the value evaluation network, wherein the advantage value is obtained by comparing the value evaluation result of the action with the value evaluation results of the other limited actions in the same state;
S65, weighting the training samples according to the magnitude of the advantage value, wherein training samples with positive advantage values participate in strategy updating, and training samples with negative advantage values do not participate in strategy updating; and
S66, in the strategy updating process, updating the strategy network parameters based only on training samples in the limited action training sample set, to obtain an operation strategy model.
9. The reinforcement learning-based power operation verification multi-mode practical training system according to claim 2, wherein S7 specifically comprises:
S71, acquiring operation multi-source data in real time in the practical training process, generating the corresponding multi-source samples based on time stamps, and executing time offset processing, state coding and lower bound aggregation on the multi-source samples to generate a current robust state;
S72, acquiring, according to the flow state machine, the action set allowed to be executed in the flow state corresponding to the current robust state from the combinations of operation object identifier and operation type identifier recorded by the flow state machine, and screening the allowed action set according to the action acceptance set to form a limited action set;
S73, inputting the current robust state together with the operation object identifier and operation type identifier of each action in the limited action set into the operation strategy model to obtain the strategy output result corresponding to each action;
S74, sorting the actions in the limited action set according to the strategy output results, and selecting the first-ranked action as the current recommended operation action;
S75, comparing the recommended operation action with the actual operation action acquired in the practical training process to generate an operation verification result; and
S76, outputting a step-level error correction instruction corresponding to the recommended operation action when the actual operation action is inconsistent with the recommended operation action.
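
The lower bound aggregation described in claim 4 (S23 to S25) can be illustrated with a minimal Python sketch. The sub-vector layout (`SUBVECTOR_SLICES`), the vector dimension and the sample values are assumptions made for illustration only; the claims do not specify the encoder or any dimensions.

```python
import numpy as np

# Hypothetical layout of the unified state vector. The claims only say the
# vector is split into equipment, behavior and environment sub-vectors.
SUBVECTOR_SLICES = {
    "equipment": slice(0, 3),
    "behavior": slice(3, 5),
    "environment": slice(5, 6),
}
SPLICE_ORDER = ("equipment", "behavior", "environment")  # assumed preset order


def lower_bound_aggregate(encoded_offsets):
    """Aggregate the state vectors of one multi-source sample.

    encoded_offsets has shape (num_time_offsets, state_dim): one encoded
    state vector per time-aligned copy of the same multi-source sample.
    """
    encoded = np.asarray(encoded_offsets, dtype=float)
    parts = []
    for name in SPLICE_ORDER:
        group = encoded[:, SUBVECTOR_SLICES[name]]  # S23: split by semantics
        parts.append(group.min(axis=0))             # S24: element-wise minimum
    return np.concatenate(parts)                    # S25: splice in preset order


# Three time-aligned copies of one sample, e.g. offsets of -1 s, 0 s, +1 s.
encoded = [
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    [0.8, 0.9, 0.6, 0.7, 0.4, 0.5],
    [1.0, 0.7, 0.8, 0.5, 0.6, 0.3],
]
robust = lower_bound_aggregate(encoded)
```

Taking the minimum over the offset copies means a state feature is only as strong as its weakest reading within the offset range, which is one way to read the "robust" state of the claims.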

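The action acceptance judgment of claim 7 (S54 to S57), the limited action space of claims 1 and 8, and the sample weighting of S65 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the threshold values, the `>=` comparison, the treatment of a zero advantage value, and the exponential weight form (borrowed from standard advantage-weighted policy extraction) are not given in the claims.

```python
import math
from typing import Callable, Hashable, Iterable, Set

State = Hashable
Action = Hashable


def consistency_ratio(action: Action,
                      neighbor_states: Iterable[State],
                      score_fn: Callable[[State, Action], float],
                      score_ref: float) -> float:
    """S54: share of associated states whose score reaches the reference."""
    states = list(neighbor_states)
    if not states:
        return 0.0
    hits = sum(1 for s in states if score_fn(s, action) >= score_ref)
    return hits / len(states)


def is_accepted(ratio: float, high_criticality: bool,
                first_threshold: float = 0.9,
                second_threshold: float = 0.6) -> bool:
    """S56/S57: high-criticality actions face the stricter threshold.

    The threshold values and the >= comparison are assumptions.
    """
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    threshold = first_threshold if high_criticality else second_threshold
    return ratio >= threshold


def limited_actions(allowed: Set[Action], accepted: Set[Action]) -> Set[Action]:
    """Limited action space: intersection of allowed and accepted actions."""
    return set(allowed) & set(accepted)


def policy_weight(advantage: float, beta: float = 1.0) -> float:
    """S65: only positive-advantage samples take part in strategy updating.

    exp(beta * advantage) is an assumed weight form; a zero advantage is
    treated as non-participating, which the claims leave unspecified.
    """
    return math.exp(beta * advantage) if advantage > 0 else 0.0


# Hypothetical scores of one action in four associated states.
scores = {"s1": 0.9, "s2": 0.8, "s3": 0.4, "s4": 0.7}
ratio = consistency_ratio("open_breaker", scores,
                          lambda s, a: scores[s], score_ref=0.5)
accepted_if_critical = is_accepted(ratio, high_criticality=True)
accepted_if_routine = is_accepted(ratio, high_criticality=False)
space = limited_actions({"open_breaker", "close_switch"},
                        {"close_switch", "check_voltage"})
```

In this example the action reaches the scoring reference in three of four associated states, so it would be accepted for a non-high-criticality step but rejected for a high-criticality one under the assumed thresholds.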
Description

Electric power operation verification multi-mode training system based on reinforcement learning

Technical Field

The invention relates to the technical field of reinforcement learning and power system operation, and in particular to a reinforcement-learning-based multi-mode practical training system for electric power operation verification.

Background

With the continuous expansion of power systems and the increasing complexity of their operating modes, the standardization and safety requirements placed on power equipment operation keep rising. Practical training technology for operation rule verification, operator skill training and error prevention has gradually become an important part of the informatization and intelligent construction of the power industry. Current electric power operation training and verification systems usually combine operation bills, a simulation environment and monitoring data to record, replay and manually check the operation process. Some systems introduce expert rules or a simple state machine model to constrain and prompt the operation sequence, helping operators complete a standardized operation flow.

In practical application, the prior art still has obvious limitations. When multi-source heterogeneous data related to the power operation process (video, voice, text, equipment state, alarm information, operation logs and the like) are collected and processed, time stamp precision is often inconsistent, data arrival delays differ, and cross-modal alignment is difficult. Existing systems process multi-source data with fixed time alignment or single time window segmentation, so they cope poorly with the timing uncertainty introduced by differences in human operation rhythm, equipment response delay and similar factors during practical training. The resulting state description is unstable, which affects the accuracy of subsequent operation verification.

In recent years, some studies have attempted to introduce reinforcement learning to model electric power operation strategies. However, existing schemes mostly focus on strategy optimization through online learning or in idealized simulation environments, and are difficult to apply directly to offline practical training scenarios that rely mainly on historical operation data. Existing offline reinforcement learning methods do not fully incorporate the business semantics and step criticality differences of the power operation flow into action space constraint, reward construction and strategy updating. They are therefore prone to strategy deviation or to recommending actions that violate operation rules, which limits their practicability and safety.

Therefore, how to provide a reinforcement-learning-based multi-mode practical training system for electric power operation verification is a problem that those skilled in the art need to solve.
Disclosure of Invention

The invention aims to provide a reinforcement-learning-based multi-mode practical training system for electric power operation verification. The system synchronously models electric power operation multi-source data through unified time stamps, and introduces time offset expansion and hierarchical lower bound aggregation to construct a robust operation state. It generates action and reward information under the constraints of a flow state machine and operation semantics, and combines time neighborhood consistency with step criticality hierarchical constraints to construct a limited action space. An improved IQL algorithm is used to train an operation strategy model under offline conditions. The system thereby achieves stable verification of operation behaviors and step-level error correction output in the practical training process, and improves the safety and reliability of electric power operation practical training.

According to an embodiment of the invention, a reinforcement-learning-based multi-mode practical training system for electric power operation verification comprises:
a multi-source data acquisition module, used for acquiring operation multi-source data in the practical training process and generating a multi-source sample sequence according to time stamps;
a robust state generation module, used for generating a plurality of groups of time-aligned samples within a preset time offset range, performing state coding on the time-aligned samples, and performing lower bound aggregation on the plurality of groups of state coding results to generate a robust state sequence;
a flow state machine module, used for constructing a flow state machine based on operation bill structured field data and determining the action set allowed to be executed in each flow state;
an action and reward construction module, used for generating an action sequence according to the historical operation