CN-122020166-A - File data reconstruction model training method, file data anomaly detection method, device, equipment and medium


Abstract

The application provides a training method for an archive data reconstruction model, an archive data anomaly detection method, a device, equipment and a medium. The detection method comprises: acquiring an archive record to be detected; vectorizing the archive record to be detected to obtain a corresponding input vector; inputting the input vector into an archive data reconstruction model obtained by unsupervised training on normal archive records to obtain a corresponding reconstruction vector; calculating a reconstruction error between the input vector and the reconstruction vector; and determining the archive record to be detected to be abnormal data when the reconstruction error is larger than a preset error threshold. Through the pre-trained archive data reconstruction model, the application can identify the internal associations and nonlinear relationships among the fields of an archive record, discover novel anomalies of unknown type for which no rule has been defined, overcome the limitations of traditional hard-coded rules, and achieve accurate detection of deep, hidden defects in archive data.

Inventors

Ma Chunchuo
YU DEMING
LIU HONG

Assignees

北京合思信息技术有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-26

Claims (10)

1. A method of training an archive data reconstruction model, comprising: acquiring a plurality of verified normal archive records from a historical archive database; vectorizing each field in each normal archive record to obtain input vectors, and forming a training sample set from all the input vectors; performing unsupervised iterative training on the training sample set based on a pre-constructed archive data reconstruction model until a preset iteration stop condition is reached; wherein, in each training iteration, the archive data reconstruction model learns the nonlinear dependency relationships among the fields in the normal archive records and implicitly models their joint distribution characteristics, and the parameters of the archive data reconstruction model are updated based on the reconstruction loss between the reconstruction vector and the input vector.
2. The method of claim 1, wherein the types of the fields include at least a category field and a numeric field, and wherein vectorizing each field in each normal archive record to obtain an input vector comprises: performing one-hot encoding on the category field to obtain a category field vector, wherein the dimension of the category field vector is the number of all category-type fields; normalizing the numeric field by mapping each original value to a preset interval to obtain a numeric field vector; and concatenating all the category field vectors and all the numeric field vectors in a preset order to obtain the input vector.
3. The method of claim 1, wherein the archive data reconstruction model comprises an encoder and a decoder, the encoder comprising an input layer, a first hidden layer and a second hidden layer, and the decoder comprising a third hidden layer and an output layer, wherein: the input layer is used for receiving the vectorized normal archive record as an input vector; the first hidden layer is used for performing preliminary compression and feature abstraction on the input vector through a nonlinear activation function so as to capture local association patterns between fields; the second hidden layer serves as a bottleneck layer for generating a low-dimensional latent representation, the low-dimensional latent representation implicitly encoding the global nonlinear dependency relationships and joint distribution characteristics among the fields in a normal archive record; the third hidden layer is used for symmetrically expanding the low-dimensional latent representation and gradually recovering a high-dimensional feature structure through a nonlinear activation function; and the output layer is used for outputting a reconstruction vector with the same dimension as the input vector, the reconstruction vector representing the optimal reconstruction of the normal archive data distribution under the current model parameters.
4. The method of claim 1, wherein updating the parameters of the archive data reconstruction model based on the reconstruction loss between the reconstruction vector and the input vector comprises: calculating the reconstruction error between the input vector and the reconstruction vector in the dimensions corresponding to each field; determining the mean square error between the input vector and the reconstruction vector based on the reconstruction errors of all dimensions, and using the mean square error as the reconstruction loss; and adjusting the network parameters of the archive data reconstruction model through back propagation according to the reconstruction loss, so as to minimize the reconstruction loss.
5. A method for detecting archive data anomalies, comprising: acquiring an archive record to be detected; vectorizing the archive record to be detected to obtain a corresponding input vector; inputting the input vector into an archive data reconstruction model trained by the method of any one of claims 1 to 4 to obtain a corresponding reconstruction vector; calculating a reconstruction error between the input vector and the reconstruction vector; and, when the reconstruction error is larger than a preset error threshold, determining the archive record to be detected to be abnormal data.
6. The method of claim 5, wherein the preset error threshold is determined by: obtaining a reconstruction error distribution generated by normal archive records; determining a statistical feature of the reconstruction error distribution; and determining the preset error threshold according to the statistical feature; wherein the statistical feature includes any one of the following: an error value corresponding to a preset high percentile in the reconstruction error distribution; and the mean and standard deviation of the reconstruction error distribution.
7. The method of claim 5, wherein after determining the archive record to be detected to be abnormal data, the method further comprises: calculating the reconstruction error between the input vector and the reconstruction vector in the dimensions corresponding to each original field, and determining at least one field whose reconstruction error is larger than a preset error threshold as an abnormal field position; based on the abnormal field position, applying inverse normalization to the vector of the corresponding field in the reconstruction vector to generate a repair suggestion value consistent with the data format of the original field; and, in a data quality auditing interface, highlighting the archive record corresponding to the abnormal data and visually presenting the original field value and the repair suggestion value in the form of a comparison chart.
8. An archive data anomaly detection device, comprising: an acquisition unit, configured to acquire an archive record to be detected; a vectorization processing unit, configured to vectorize the archive record to be detected to obtain a corresponding input vector; a reconstruction unit, configured to input the input vector into an archive data reconstruction model trained by the method according to any one of claims 1 to 4, to obtain a corresponding reconstruction vector; a calculation unit, configured to calculate a reconstruction error between the input vector and the reconstruction vector; and a determining unit, configured to determine the archive record to be detected to be abnormal data when the reconstruction error is larger than a preset error threshold.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus; the memory is used for storing a computer program; and the processor is configured to implement the method of any one of claims 1-7 when executing the program stored in the memory.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-7.
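
As an illustration only, the detection steps recited in claims 5 to 7 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the vector layout (`FIELD_SLICES`), the amount range, the per-field threshold and all numeric values are hypothetical, and the reconstruction vector `x_hat` is a hand-written stand-in for the output of a trained model.

```python
import math
import statistics

def reconstruction_error(x, x_hat):
    """Mean squared error between the input vector and the reconstruction vector."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def percentile_threshold(errors, q=0.99):
    """Error value at a preset high percentile (nearest-rank) of normal-record errors."""
    ordered = sorted(errors)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def mean_std_threshold(errors, k=3.0):
    """Mean plus k standard deviations; the multiplier k is an assumption."""
    return statistics.mean(errors) + k * statistics.pstdev(errors)

def is_anomalous(x, x_hat, threshold):
    """A record is abnormal when its reconstruction error exceeds the threshold."""
    return reconstruction_error(x, x_hat) > threshold

def field_errors(x, x_hat, field_slices):
    """Squared error per original field, summed over the dimensions it occupies."""
    return {name: sum((a - b) ** 2 for a, b in zip(x[s], x_hat[s]))
            for name, s in field_slices.items()}

def denormalize(value, lo, hi):
    """Inverse of min-max scaling: map a [0, 1] value back to the original range."""
    return lo + value * (hi - lo)

# Made-up reconstruction errors observed on normal records.
normal_errors = [0.001, 0.002, 0.0015, 0.003, 0.0025]
threshold = percentile_threshold(normal_errors, q=0.99)

# Hypothetical layout: dims 0-2 are a one-hot category, dim 3 is the scaled amount.
FIELD_SLICES = {"category": slice(0, 3), "amount": slice(3, 4)}
AMOUNT_MIN, AMOUNT_MAX = 0.0, 10000.0   # assumed normalization range
FIELD_THRESHOLD = 0.05                  # assumed per-field threshold

x = [0.0, 1.0, 0.0, 0.9]         # record under test: category 2, amount 9000
x_hat = [0.05, 0.9, 0.05, 0.04]  # stand-in for the model's reconstruction

flagged = is_anomalous(x, x_hat, threshold)
per_field = field_errors(x, x_hat, FIELD_SLICES)
abnormal_fields = [f for f, e in per_field.items() if e > FIELD_THRESHOLD]
suggested_amount = denormalize(x_hat[3], AMOUNT_MIN, AMOUNT_MAX)
```

In this sketch the overall error decides whether the record is flagged, the per-field errors locate the suspicious field, and the repair suggestion is read from the reconstruction vector and mapped back to the original units.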

Description

File data reconstruction model training method, file data anomaly detection method, device, equipment and medium

Technical Field

The invention relates to the technical field of data processing, and in particular to a training method for an archive data reconstruction model, an archive data anomaly detection method, an archive data anomaly detection device, equipment and a medium.

Background

In the daily operation of enterprises and organizations, massive amounts of archive data are generated, such as financial vouchers, personnel files, purchase orders and customer information. These data are typically imported from different sources, either manually or through automated means such as OCR, so data quality varies. Errors of many kinds are common, such as manual entry mistakes (e.g., an amount with one zero too many or too few), OCR recognition errors, inconsistent units (e.g., mixing "yuan" and "ten thousand yuan"), implausible values (e.g., an employee age of 200 years), and logical contradictions between fields (e.g., a reimbursement amount far greater than the invoice amount). Such erroneous data seriously affect the accuracy of subsequent data analysis, report generation and intelligent decision-making. In the prior art, a series of hard-coded rules is typically predefined to verify the data, such as "the amount field must be a number", "the age must be between 18 and 65", "the phone number must be 11 digits", etc.
Although such rules can effectively intercept some explicit format errors or boundary violations, they are essentially static checks based on explicit business rules and have obvious limitations: on the one hand, they can hardly cover complex, dynamic or context-dependent anomaly patterns; on the other hand, traditional rules cannot model or identify nonlinear logical contradictions hidden in the joint distribution of multiple fields (for example, "the accommodation expense of a mid-level employee in a first-tier city does not exceed 12,000 yuan").

Disclosure of Invention

Accordingly, the present invention aims to provide a training method for an archive data reconstruction model, and an archive data anomaly detection method, apparatus, device and medium, so as to achieve accurate detection of deep, hidden defects in archive data.

In a first aspect, a training method for an archive data reconstruction model is provided, including: acquiring a plurality of verified normal archive records from a historical archive database; vectorizing each field in each normal archive record to obtain an input vector, and forming a training sample set from all the input vectors; performing unsupervised iterative training on the training sample set based on a pre-constructed archive data reconstruction model until a preset iteration stop condition is reached; wherein, in each training iteration, the archive data reconstruction model learns the nonlinear dependency relationships among the fields in the normal archive records and implicitly models their joint distribution characteristics, and the parameters of the archive data reconstruction model are updated based on the reconstruction loss between the reconstruction vector and the input vector.
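
The training procedure of the first aspect might be sketched as below. This is a minimal pure-Python sketch under stated assumptions rather than the patented implementation: the record schema (`CATEGORIES`, the amount range), the synthetic "normal" records, the layer sizes 4-3-2-3-4, the tanh and sigmoid activations, the learning rate and the fixed epoch count used as the stop condition are all hypothetical choices made for illustration. The one-hot width here is the number of category values, which is the usual convention.

```python
import math
import random

CATEGORIES = ["travel", "meals", "office"]   # assumed category values
AMOUNT_MIN, AMOUNT_MAX = 0.0, 10000.0        # assumed numeric range

def vectorize(record):
    """One-hot encode the category, min-max scale the amount, then concatenate."""
    one_hot = [1.0 if record["category"] == c else 0.0 for c in CATEGORIES]
    scaled = (record["amount"] - AMOUNT_MIN) / (AMOUNT_MAX - AMOUNT_MIN)
    return one_hot + [scaled]

def init_layer(n_in, n_out, rng):
    """Small random weights (n_out rows of n_in values) and zero biases."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def forward(x, layers):
    """Activations of every layer: tanh on hidden layers, sigmoid on the output."""
    acts = [x]
    for i, (w, b) in enumerate(layers):
        z = [sum(wi * ai for wi, ai in zip(row, acts[-1])) + bi
             for row, bi in zip(w, b)]
        last = i == len(layers) - 1
        acts.append([1 / (1 + math.exp(-v)) if last else math.tanh(v) for v in z])
    return acts

def mse(x, y):
    """Mean squared error, used as the reconstruction loss."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def train_step(x, layers, lr):
    """One gradient step on the reconstruction loss via back propagation."""
    acts = forward(x, layers)
    out = acts[-1]
    # Gradient at the output pre-activation: dMSE/dout times sigmoid derivative.
    delta = [2 * (o - t) / len(x) * o * (1 - o) for o, t in zip(out, x)]
    for i in reversed(range(len(layers))):
        w, b = layers[i]
        a_prev = acts[i]
        # Gradient w.r.t. the previous activations, taken before updating w.
        grad_prev = [sum(w[j][k] * delta[j] for j in range(len(w)))
                     for k in range(len(a_prev))]
        for j in range(len(w)):
            for k in range(len(a_prev)):
                w[j][k] -= lr * delta[j] * a_prev[k]
            b[j] -= lr * delta[j]
        if i > 0:  # previous layer is a tanh hidden layer: derivative is 1 - a^2
            delta = [g * (1 - a * a) for g, a in zip(grad_prev, a_prev)]
    return mse(x, out)

def mean_loss(samples, layers):
    return sum(mse(x, forward(x, layers)[-1]) for x in samples) / len(samples)

rng = random.Random(0)
# Synthetic stand-ins for verified normal records: each category has a typical range.
RANGES = {"travel": (3000, 6000), "meals": (100, 500), "office": (500, 1500)}
records = [{"category": c, "amount": rng.uniform(*RANGES[c])}
           for c in CATEGORIES for _ in range(10)]
samples = [vectorize(r) for r in records]

# Encoder: input(4) -> hidden(3) -> bottleneck(2); decoder: hidden(3) -> output(4).
sizes = [4, 3, 2, 3, 4]
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

initial_loss = mean_loss(samples, layers)
for epoch in range(300):  # a fixed epoch count stands in for the stop condition
    rng.shuffle(samples)
    for x in samples:
        train_step(x, layers, lr=0.5)
final_loss = mean_loss(samples, layers)
```

Because the network is trained only on normal records, a record whose fields fit together in an unusual way would be expected to reconstruct poorly, which is what the detection method relies on.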
Optionally, the field types include at least category fields and numeric fields, and vectorizing each field in each normal archive record to obtain an input vector comprises: performing one-hot encoding on the category field to obtain a category field vector, wherein the dimension of the category field vector is the number of all category-type fields; normalizing the numeric field by mapping each original value to a preset interval to obtain a numeric field vector; and concatenating all the category field vectors and all the numeric field vectors in a preset order to obtain the input vector.

Optionally, the archive data reconstruction model comprises an encoder and a decoder, the encoder comprising an input layer, a first hidden layer and a second hidden layer, and the decoder comprising a third hidden layer and an output layer, wherein: the input layer is used for receiving the vectorized normal archive record as an input vector; the first hidden layer is used for performing preliminary compression and feature abstraction on the input vector through a nonlinear activation function so as to capture local association patterns between fields; the second hidden layer serves as a bottleneck layer for generating a low-dimensional latent representation, and the latent representation implicitly encodes the global nonlinear dependencies and joint distribution characteristics among the fields in the normal archive record; the third hidden layer is used for symmetrically expanding the low-dimensional latent representation and gradually recovering the high