CN-121996460-A - SQL sequence anomaly determination method based on Top-K hit

CN121996460ACN 121996460 ACN121996460 ACN 121996460ACN-121996460-A

Abstract

The invention provides a Top-K hit based SQL sequence anomaly judgment method, which models keywords in an SQL operation sequence in a generated sequence modeling mode, introduces a Top-K hit based anomaly judgment mechanism on the basis of predicting an SQL keyword candidate set which possibly appears in each position, and analyzes the occurrence condition of the keywords in the SQL operation position by position. The method comprises the following steps of 1, constructing an ID mapping dataset, 2, carrying out grammar analysis on an original SQL sentence by utilizing an analysis script, 3, carrying out normalization processing on a structured SQL module, 4, carrying out ID replacement on the normalized structured JSON, 5, splicing IDs of all modules according to a fixed sequence on the structured JSON subjected to ID replacement to generate a standardized token sequence, 6, carrying out pretraining on InitGPT, 7, carrying out reinforcement learning fine tuning on LogGPT, 8, optimizing model parameters through embedding fusion and autoregressive learning, and 9, carrying out LogGPT reinforcement learning training.

Inventors

Cao Maifu
Cao Feimei

Assignees

浙江工商大学

Dates

Publication Date: 20260508
Application Date: 20260123

Claims (10)

1.A SQL sequence anomaly judgment method based on Top-K hit is characterized by comprising the following steps: Step 1, constructing an ID mapping data set of SQL keywords, table names and column names, and providing basic dictionary support for subsequent normalization; Step 2, carrying out grammar analysis on the original SQL sentence by utilizing analysis scripts, extracting structural information and converting the structural information into a JSON format; step 3, normalization processing of the structured SQL module; carrying out module level normalization on the structured JSON output in the step 2, eliminating grammar difference and redundant information, and unifying expression formats; Step 4, using the mapping file generated in the step 1 to perform ID replacement on the normalized structured JSON, and converting the table name, column name and SQL keyword of the text type into corresponding digital ID; Step 5, splicing the IDs of the modules according to a fixed sequence to the structural JSON with the replaced IDs to generate a standardized token sequence; step 6, pre-training InitGPT; Step 7, performing reinforcement learning fine adjustment on LogGPT; Step 8, optimizing model parameters through embedding fusion and autoregressive learning; and 9, performing LogGPT reinforcement learning training, and completing parameter optimization around sequence cutting, multi-round sampling, rewarding loss calculation and gradient updating.
2. The method for determining SQL sequence anomalies based on Top-K hits according to claim 1, wherein in step 1, the ID mapping dataset comprises the following three types of core mapping files: (1) The mapping file of the SQL key word and the conditional logic operator defines the ID mapping rule of the SQL key word and the conditional logic operator, writes the mapping relation into the CSV file through the script in the coreWordTranslete. Txt; (2) The table name mapping file is used for storing the corresponding relation between the table names and the table IDs of the database; (3) Column name mapping file, storing the corresponding relation between the column name and column ID of the database.
3. The method for determining SQL sequence anomalies based on Top-K hits according to claim 1, wherein the step 2 comprises the steps of: (1) Inputting an original SQL sentence, and analyzing the original SQL sentence into a grammar tree object through a sqlparse library; (2) The method comprises the steps of (21) extracting column names, aggregation functions AND wildcards in a SELECT clause, (22) extracting table names AND aliases in a FROM clause, (23) extracting types, associated table names AND ON conditions of the JOIN clause, (24) extracting conditional expressions in a WHERE clause, removing redundant spaces AND punctuations, splitting AND retaining connectors according to AND/OR logic to form a condition list, (25) respectively extracting ordering fields OR grouping fields in GROUPBY AND ORDERBY clauses, (26) extracting offset AND record number in the LIMIT clause, AND (27) splitting a plurality of sub-queries AND respectively analyzing if SQL contains UNION keywords AND storing the sub-queries in the UNION field list.
4. The SQL sequence anomaly determination method based on Top-K hit of claim 1, wherein step 3 comprises the steps of (1) SELECT_COLS normalization, (2) WHERE/JOIN condition normalization, (3) ORDER_BY normalization, and (4) JOIN type normalization.
5. The method for determining SQL sequence anomalies based on Top-K hits according to claim 1, wherein the step 4 comprises the steps of: (1) Loading a mapping file, namely constructing a mapping dictionary in a memory; (2) Splitting the table name and the alias, capitalizing the table name to match the table_name in the tables.csv, replacing the table name with the corresponding table_id, and ignoring the alias to only reserve the ID; (3) Processing the JOIN clause, namely executing the same replacement logic as the FROM clause on the associated table name in the JOIN; (4) Constructing alias mapping, namely establishing an alias-table name mapping relation based on the table names and aliases of the FROM clauses; (5) The complete column name matching, namely, if the column names in the select_cols, the group_by and the ORDER_by contain aliases, the column names without the aliases are converted into complete table names and column names through the alias mapping, and the column names without the aliases are directly matched and replaced according to the column names; (6) Processing a function sequence; (7) Keyword and condition type replacement; (8) Structured SQL keywords in JSON; (9) And (3) carrying out unmatched processing, namely directly reserving an original value or marking the original value as a default ID if the table name, the column name and the key word do not find a matched item in the mapping file.
6. The method for determining SQL sequence anomalies based on Top-K hits according to claim 1, wherein the step 5 comprises the steps of: (1) Splicing according to a fixed logic sequence of SELECT, FROM, JOIN, WHERE, group, ORDERBY and LIMIT; (2) Sequence format processing, namely separating all the spliced IDs by commas to form a token sequence of a character string type; (3) And outputting a result, namely associating the standardized token sequence with the corresponding user identifier and the abnormal label to form a final model input data set.
7. The method for determining SQL sequence anomalies based on Top-K hits according to claim 6, wherein in the step 5 (1), each module splicing rule is as follows: (11) A SELECT module, which is to splice SELECT key IDs first and splice each row ID in select_cols sequentially; (12) The FROM module is used for firstly splicing the FROM keyword IDs and then splicing all the table IDs in the FROM in sequence; (13) A JOIN module, which is used for splicing the JOIN type ID, the association table ID, the ON keyword ID and the association condition ID for each JOIN item in sequence; (14) The WHERE module is used for splicing WHERE keyword IDs firstly and then splicing all condition IDs in WHERE in sequence; (15) The GROUPBY and ORDERBY modules are respectively spliced with the GROUPBY and ORDERBY keyword IDs, and then the corresponding field IDs are spliced in sequence; (16) And the LIMIT module is used for firstly splicing the ID of the LIMIT key words and then splicing the classified IDs of the LIMIT.
8. The method for judging the abnormal SQL sequence based on the Top-K hit according to claim 1, wherein in the step 6, the length of the generated token sequence is adapted based on a max_lens parameter configured by a model, if the sequence length exceeds the max_lens length, the sequence is truncated from the end to the max_lens length, if the sequence length is not enough, an ID corresponding to < pad > is supplemented at the end of the sequence, and if serious errors occur in the SQL statement analysis or ID mapping process, the sequence is marked as an invalid sequence, and the token sequence is uniformly set as all < pad > IDs.
9. The method of claim 1, wherein in step 7, the GPT2Config object is constructed based on the parameters of the options dictionary loading model structure, wherein the vocabulary size is set to the total length of the vocabulary, the special symbol IDs are mapped to the IDs corresponding to < bos >, < eos >, < pad >, < unk >, < mask >, and the maximum sequence length is bound to the max_lens parameter.
10. The method for determining SQL sequence anomalies based on Top-K hits of claim 1, wherein a verification and early-stop mechanism is introduced in said LogGPT reinforcement learning fine-tuning process.

Description

SQL sequence anomaly determination method based on Top-K hit Technical Field The invention relates to a SQL sequence anomaly determination method based on Top-K hit, which is suitable for anomaly detection and analysis of database access behaviors, application logs or other structured instruction sequences. Background With the wide application of database systems in key fields such as finance, government affairs, the internet and the like, the safety and reliability of database operation have important significance for protecting data assets. The database log is used as an important component of the database system, records the inquiry, update and management operations of a user on the database, and is a main basis for carrying out security audit, access control and operation and maintenance monitoring. By analyzing the database log, potential abnormal operation behaviors such as unauthorized access, abnormal data modification or illegal deletion and the like can be found, so that support is provided for guaranteeing the safety and the integrity of the database system. For anomaly detection of database logs, the prior art mainly comprises a method based on statistical analysis, a method based on traditional machine learning and a method based on deep learning. In the technical schemes based on statistical analysis and traditional machine learning, common methods include Principal Component Analysis (PCA), isolated forests (IsolationForest), and a class of support vector machines (OCSVM), etc. Such methods typically utilize manual design features to vectorize log entries and implement anomaly determination based on feature distribution or anomaly scores. However, because SQL operations in database logs have high dimensionality, diversity and strong context correlation, the method often relies on preset assumptions or linear relation modeling, is difficult to accurately describe complex patterns of database access behaviors, and has weak adaptability to log format changes and access pattern evolution. In recent years, the deep learning method is gradually applied to the field of database log anomaly detection. Part of the prior art refers to the sequence modeling thought in natural language processing, takes a database log as an operation sequence, and utilizes a cyclic neural network (such as LSTM) to construct a log language model so as to predict whether the next SQL operation accords with a normal behavior mode. Representative methods include LSTM based DeepLog, logAnomaly and the like. The method improves the anomaly detection effect to a certain extent, is limited by a circulating neural network structure, is easy to have the problems of gradient disappearance or information forgetting when processing a long sequence, and is difficult to fully capture the long-range dependency relationship in the SQL operation sequence. To overcome the above-mentioned shortcomings, some of the prior art use a transducer structure to construct a log language model, and model context relationships in a log sequence by a self-attention mechanism to enhance the processing capability for long sequences. However, the existing method based on the log language model usually uses whether the prediction of the complete SQL operation or the template thereof occurs in the first K candidate results as the basis of anomaly judgment, the training target mainly focuses on the whole sequence prediction probability, and the normal occurrence rules of each keyword in the SQL sentence at different positions of the sequence are difficult to be described in fine granularity. In terms of anomaly determination policies, existing log language model methods typically employ a Top-K prediction mechanism, i.e., when an actually observed SQL operation or template does not appear in the Top K candidate results of model prediction, that operation is determined to be anomalous. However, this manner of determination typically takes a complete SQL statement or template as the smallest unit of determination, without location-by-location analysis of the SQL keyword or markup level. When the SQL sentence is long, the structure is complex or the parameter changes frequently, even if the appearance sequence or the combination mode of part of keywords in the SQL sentence is abnormal, the whole prediction result can hit Top-K and can not be identified, so that the fineness and the accuracy of the abnormality detection are affected. In addition, in the prior art, the training objective of the log language model is usually focused on minimizing the prediction error or improving the overall sequence generation probability, and a certain deviation exists between the training objective and the requirement of performing anomaly judgment based on the keyword-level behavior in the actual service scene, so that the model training process cannot be directly optimized for anomaly detection decision, and the suitability and the robustness of the model in the actual applic