CN-122020123-A - File data structured extraction method based on machine learning

Abstract

The invention discloses a machine-learning-based method for structured extraction of archival data, comprising the following steps: performing standardization, noise suppression and sequence segmentation on archival text to obtain a processable text sequence; generating a plurality of candidate field segments in the text sequence and extracting multi-level features for each segment; jointly evaluating the boundary combinations and feature representations of the field segments under the minimum description length criterion, determining an optimal segment division result and generating compressed feature representations; taking the field segments as integral units to construct the sequence input, and performing holistic modeling and joint decoding of variable-length fields with a semi-Markov conditional random field model to obtain the correspondence between field labels and field segments; and finally generating a structured archival data result. The invention can reduce dependence on manual rules and improve the accuracy and adaptability of structured extraction of archival data.

Inventors

CHEN GUODONG
ZHAO XIAOJU
LI HUIYUAN
Zhou Diefeng
CUI SHUO

Assignees

山东通航信息技术有限公司

Dates

Publication Date: 20260512
Application Date: 20260203

Claims (8)

1. A method for structured extraction of archival data based on machine learning, characterized by comprising the following steps: acquiring archival text data, and performing character normalization, noise character suppression and serialization segmentation on the archival text data to obtain a text sequence to be processed; generating a candidate field segment set based on the text sequence to be processed, wherein the candidate field segment set is composed of a plurality of candidate field segments covering a preset length range, and each candidate field segment has a segment start-stop boundary; extracting a multi-granularity feature set for each candidate field segment in the candidate field segment set; taking the minimum description length criterion as a unified objective function, performing joint coding evaluation on the segment boundary combinations formed by the segment start-stop boundaries of the candidate field segments together with the corresponding multi-granularity feature sets, outputting the segment division result with the minimum description length, and generating a compressed feature representation for each field segment in the segment division result; constructing the segment sequence input of a semi-Markov conditional random field model based on the field segments and compressed feature representations in the segment division result, and defining a field label set to form a segment-level sequence labeling task; training the semi-Markov conditional random field model to obtain model parameters, performing segment cascade decoding on the segment sequence input, and outputting the correspondence between the field label set and the field segments; and mapping the field segments into structured field values according to the correspondence between the field label set and the field segments, and generating a structured archival data result.
2. The method of claim 1, wherein the generation of the text sequence to be processed comprises: acquiring the archival text data; performing character normalization on the archival text data character by character to form a normalized character sequence; performing noise character suppression on the normalized character sequence to form a denoised character sequence; performing serialization segmentation on the denoised character sequence, in which an initial segmentation position set is determined according to line breaks, page breaks, tabs and punctuation separators, and the initial segmentation position set is adjusted in combination with continuity constraints on number strings and date strings; and organizing the denoised character sequence, in its original order, into the text sequence to be processed based on the adjusted segmentation position set.
3. The machine learning based archival data structured extraction method of claim 1, wherein the generation of the candidate field segment set comprises: establishing character position identifiers based on the text sequence to be processed, and assigning consecutive position identifiers to the characters in the text sequence to be processed according to their order of appearance, to obtain a character sequence with position identifiers; setting the preset length range of the candidate field segments based on the character sequence with position identifiers, wherein the preset length range is bounded by a minimum length value and a maximum length value, the minimum length value requires a candidate field segment to contain at least one character, and the maximum length value requires the number of characters contained in a candidate field segment not to exceed a preset upper limit; based on the preset length range, taking each position identifier in the character sequence with position identifiers in turn as a segment start position, selecting consecutive characters onward from the segment start position and determining a segment end position, provided that the preset length range is satisfied and the end of the character sequence with position identifiers is not exceeded, and generating a candidate field segment consisting of the characters from the segment start position to the segment end position; after a candidate field segment is generated, recording a segment start-stop boundary for the candidate field segment, wherein the segment start-stop boundary consists of the segment start position and the segment end position, and establishing a correspondence between the segment start-stop boundary and the content of the candidate field segment; and repeating the segment start position determination, segment end position determination, candidate field segment generation and segment start-stop boundary recording for all position identifiers in the character sequence with position identifiers, and collecting all candidate field segments and their segment start-stop boundaries to form the candidate field segment set.
4. The method for structured extraction of archive data based on machine learning of claim 1 wherein the extracting of the multi-granularity feature set comprises, for each candidate field segment in the candidate field segment set, reading segment start and stop boundaries corresponding to the candidate field segment and determining segment internal character sequences and left and right adjacent context windows thereof, extracting character-level features based on the segment internal character sequences, extracting position-level features based on the positions of the segment start and stop boundaries in the character sequences with position identifiers, extracting context relation features based on the left and right adjacent context windows, and indexing and encoding the character-level features, the position-level features and the context relation features according to a preset feature template to obtain the multi-granularity feature set of the candidate field segment.
5. The machine learning based archival data structured extraction method of claim 1, wherein the modified minimum description length criterion comprises: constructing a segment boundary combination set based on the candidate field segment set, wherein the segment boundary combination set is composed of a plurality of segment boundary combinations, and each segment boundary combination is composed of segment start-stop boundaries of candidate field segments and corresponds to a group of field segments that are arranged in order in the text sequence to be processed and do not overlap one another; for each segment boundary combination in the segment boundary combination set, reading from the candidate field segment set the candidate field segment corresponding to each field segment in the segment boundary combination, and index-encoding the multi-granularity feature set of each field segment according to a preset feature template to form a segment-level feature sequence corresponding to the segment boundary combination; based on the segment-level feature sequence, determining the character position range covered by each field segment from its start-stop boundary in the segment boundary combination, performing noise mask inference on the characters within the covered range and on their feature dimensions in the segment-level feature sequence to obtain a noise mask, and dividing the segment-level feature sequence into a content feature sequence and a noise feature sequence according to the noise mask; constructing a joint coding structure and its mapping based on the content feature sequence and the noise feature sequence, wherein the joint coding structure comprises a content coding structure and a noise coding structure, and the mapping encodes the content feature sequence into a content coding sequence and the noise feature sequence into a noise coding sequence; calculating, based on the content coding sequence and the noise coding sequence, the total description length corresponding to the segment boundary combination and the local description length of each field segment; constructing a dynamic programming structure based on the local description lengths, taking the character position identifiers in the text sequence to be processed as dynamic programming nodes, the candidate field segments as transitions between nodes, and the local description length of the field segment corresponding to each candidate field segment as the transition cost, and solving by dynamic programming for the transition path with the minimum total description length among all legal transition paths corresponding to the segment boundary combination set, so as to obtain the segment boundary combination with the minimum total description length as the segment division result; and outputting the segment division result together with the content coding sequence and noise coding sequence corresponding to each field segment in the segment division result, and forming the compressed feature representation of each field segment in the segment division result from its content coding sequence and noise coding sequence.
6. The machine learning based archival data structured extraction method of claim 1, wherein the construction of the segment sequence input comprises: based on the segment division result, reading the segment start-stop boundary of each field segment in the segment division result, and sorting the field segments by the order of their segment start positions in the text sequence to be processed to form an ordered sequence of field segments; for each field segment in the ordered sequence of field segments, reading the compressed feature representation corresponding to the field segment, and binding the compressed feature representation to the segment start-stop boundary of the field segment to form a segment-level observation item with the field segment as its basic unit; arranging the segment-level observation items in the order of the ordered sequence of field segments to construct a segment sequence structure; and using the segment sequence structure as the segment sequence input of the semi-Markov conditional random field model, wherein each sequence element in the segment sequence input includes the segment start-stop boundary of the corresponding field segment and its compressed feature representation.
7. The machine learning based archival data structured extraction method of claim 1, wherein the generation of the correspondence between the field label set and the field segments comprises: based on the segment sequence input, reading the segment-level observation items arranged in the segment sequence input according to the ordered sequence of field segments, wherein each segment-level observation item comprises the segment start-stop boundary of the corresponding field segment and its compressed feature representation; after the segment-level observation items are read, setting a field label set, assigning a label identifier to each field label in the field label set, and determining the allowed transition relations between the field labels in the field label set; constructing a semi-Markov conditional random field model based on the segment-level observation items and the field label set, setting the field label set as the state set and the segment-level observation items as the observation sequence, determining the character position range covered by each field segment from the segment start-stop boundary in its segment-level observation item, determining the duration of the field segment from the covered character position range, and writing the duration of the field segment into the model structure as a state duration constraint; constructing observation-related feature functions based on the compressed feature representations in the segment-level observation items, and constructing state-transition-related feature functions based on adjacent field labels and their corresponding durations under the allowed transition relations; constructing supervised annotation information for model training based on the segment sequence input and the field label annotations corresponding to the segment sequence input; constructing a training objective function based on the supervised annotation information, and optimizing and updating the training objective function to obtain the trained parameters of the semi-Markov conditional random field model, wherein the model parameters correspond to the state-transition-related feature functions and the observation-related feature functions; feeding the segment sequence input into the trained semi-Markov conditional random field model, and performing segment cascade decoding on the segment sequence input based on the model parameters and the state duration constraints, wherein the segment cascade decoding comprises recursively calculating, in the order of the segment-level observation items, the accumulated path score of each field label under each duration, and recording the corresponding optimal predecessor field label during the recursion; and backtracking from the recorded optimal predecessor field labels to obtain a field label sequence, and generating the correspondence between the field label set and the field segments from the correspondence between the field label sequence and the segment-level observation items.
8. The machine learning based archival data structured extraction method of claim 1, wherein the generation of the structured archival data result comprises: based on the correspondence between the field label set and the field segments, reading the field segments corresponding to each field label while preserving the order in which the field segments appear in the text sequence to be processed; for the field segments corresponding to each field label, extracting the corresponding field segment content from the text sequence to be processed according to the segment start-stop boundary of each field segment to form a field segment content set corresponding to the field label; concatenating the field segment contents belonging to the same field label in their order within the field segment content set to generate the structured field value corresponding to the field label; and establishing a mapping between each field label in the field label set and its structured field value, and collecting the mappings between all field labels and structured field values to generate the structured archival data result.
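
The labeling and assembly steps recited in claims 7 and 8 can be illustrated with a small Python sketch. It is a simplified illustration, not the claimed implementation: it assumes the segment division is already fixed, runs a Viterbi-style recursion over the segment-level observation items with a duration-aware score, and then concatenates the segments of each label into a field value. The names `decode_labels`, `build_record`, `emit`, `trans`, the label set and the hand-written scores are all hypothetical; in the method itself the scores would come from the trained semi-Markov conditional random field parameters.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[int, int]  # half-open [start, end) character positions


def decode_labels(
    segments: Sequence[Segment],
    labels: Sequence[str],
    emit: Callable[[int, str, int], float],
    trans: Callable[[str, str], float],
) -> List[str]:
    """Viterbi-style recursion over segment-level observations.

    emit(i, label, duration) scores a label for segment i of the given length;
    trans(prev, label) scores a label transition. Both are hypothetical hooks.
    """
    if not segments:
        return []
    durations = [end - start for start, end in segments]
    score = [{y: emit(0, y, durations[0]) for y in labels}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(segments)):
        row, ptr = {}, {}
        for y in labels:
            # Best predecessor label for label y at segment i.
            prev = max(labels, key=lambda p: score[i - 1][p] + trans(p, y))
            row[y] = score[i - 1][prev] + trans(prev, y) + emit(i, y, durations[i])
            ptr[y] = prev
        score.append(row)
        back.append(ptr)
    # Backtrack from the best final label through the recorded predecessors.
    last = max(labels, key=lambda y: score[-1][y])
    path = [last]
    for i in range(len(segments) - 1, 0, -1):
        last = back[i][last]
        path.append(last)
    return path[::-1]


def build_record(
    text: str,
    segments: Sequence[Segment],
    path: Sequence[str],
    skip: Tuple[str, ...] = ("O",),  # assumed "not a field" label
) -> Dict[str, str]:
    """Concatenate, in text order, the segments that share a field label."""
    parts: Dict[str, List[str]] = {}
    for (start, end), label in zip(segments, path):
        if label not in skip:
            parts.setdefault(label, []).append(text[start:end])
    return {label: "".join(chunks) for label, chunks in parts.items()}


# Toy input; the scores below are hand-written stand-ins for trained parameters.
text = "No.A-17 2024-03-05"
segments = [(0, 3), (3, 7), (7, 8), (8, 18)]
labels = ["O", "ID", "DATE"]


def emit(i: int, label: str, duration: int) -> float:
    chunk = text[segments[i][0]:segments[i][1]]
    if label == "DATE":
        return 2.0 if duration == 10 and chunk[4] == "-" else -2.0
    if label == "ID":
        return 2.0 if duration <= 6 and any(c.isdigit() for c in chunk) else -2.0
    return 0.5


def trans(prev: str, label: str) -> float:
    # Discourage two adjacent segments carrying the same field label.
    return -1.0 if prev == label and label != "O" else 0.0


path = decode_labels(segments, labels, emit, trans)
record = build_record(text, segments, path)
```

With these toy scores the sketch labels the four segments `O`, `ID`, `O`, `DATE` and yields a record mapping `ID` to `A-17` and `DATE` to `2024-03-05`. A full semi-Markov decoder would additionally search over segment durations rather than take the boundaries as given.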

Description

File data structured extraction method based on machine learning

Technical Field

The invention relates to the technical field of computer information processing, and in particular to a method for structured extraction of archival data based on machine learning.

Background

With the advancement of archive digitization and intelligent archive management, massive volumes of historical archives, business archives and electronic documents are increasingly stored and used centrally in the form of scanned or electronic text, and the automatic arrangement, automatic cataloging and structured extraction of archive contents have become important application scenarios in archive informatization. At present, archival data mostly exists as unstructured text, and key information in the text must be identified, split and classified to form retrievable and analyzable structured archival data.

Existing techniques for structuring archival data still have obvious shortcomings. On the one hand, traditional methods rely on manual rules, template matching or fixed field patterns; the rules are costly to design and maintain, and when the type, format or historical period of the archives changes, field identification errors, misplaced boundaries or missed extractions readily occur, so that such methods are difficult to adapt to complex and diverse archival texts.
On the other hand, automatic extraction methods based on conventional sequence labeling models usually take a single character or a fixed window as the modeling unit and have difficulty handling variable-length fields directly. Field boundary recognition and feature construction are separated from each other, so errors tend to accumulate across the multi-stage processing, and the completeness and consistency of the structured result are insufficient. In addition, noise such as optical character recognition errors, irregular typesetting and mixed symbols is common in archival texts; existing methods lack an effective mechanism for distinguishing and suppressing noise features and are easily disturbed by irrelevant information, which further reduces the accuracy of structured extraction.

Therefore, how to provide a method for structured extraction of archival data based on machine learning is a problem to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a method for structured extraction of archival data based on machine learning, which automatically converts unstructured archival content into structured archival data through segment-level modeling and sequence analysis of archival text. The invention combines text feature modeling, an information-theoretic criterion and sequence modeling to process the field segments, field boundaries and feature representations in the archival text in a unified manner and to complete field identification and field value generation. It enables structured extraction from complex archival text with reduced dependence on manual rules, and offers strong adaptability, high structural consistency and high processing accuracy.
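
To make the idea of choosing field boundaries with an information-theoretic criterion more concrete, the following Python sketch casts segmentation as a minimum-cost path over character positions: every candidate span up to a maximum length is a possible transition, each span carries a local cost, and dynamic programming returns the partition with the smallest total cost. The function names and the toy cost `toy_description_length` are assumptions introduced purely for illustration; the invention's actual description-length coding, with separate content and noise codes, is not reproduced here.

```python
import math
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character positions


def candidate_spans(n: int, max_len: int) -> List[Span]:
    """Enumerate every span of length 1..max_len that stays inside the text."""
    return [
        (s, s + d)
        for s in range(n)
        for d in range(1, max_len + 1)
        if s + d <= n
    ]


def toy_description_length(text: str, span: Span) -> float:
    """Hypothetical local cost in 'bits': a fixed boundary cost plus a
    per-character cost that is cheaper when the span has one character class."""
    chunk = text[span[0]:span[1]]
    classes = {"d" if c.isdigit() else "a" if c.isalpha() else "o" for c in chunk}
    per_char = 1.0 if len(classes) == 1 else 4.0
    return 2.0 + per_char * len(chunk)


def min_dl_segmentation(
    text: str,
    max_len: int,
    cost: Callable[[str, Span], float] = toy_description_length,
) -> Tuple[List[Span], float]:
    """Return the non-overlapping, gap-free partition with minimum total cost."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[j]: cheapest way to cover text[:j]
    back = [-1] * (n + 1)        # back[j]: start of the last span ending at j
    best[0] = 0.0
    # Spans are ordered by start position, so best[s] is final when it is read.
    for s, e in candidate_spans(n, max_len):
        total = best[s] + cost(text, (s, e))
        if total < best[e]:
            best[e] = total
            back[e] = s
    spans: List[Span] = []
    pos = n
    while pos > 0:
        spans.append((back[pos], pos))
        pos = back[pos]
    return spans[::-1], best[n]


spans, total = min_dl_segmentation("2024ABC", max_len=4)
```

For the toy input `"2024ABC"` with a maximum span length of 4, the cheapest partition under this cost is `[(0, 4), (4, 7)]`, that is, `2024` followed by `ABC`, with a total cost of 11.0. Any other cost function with the same signature can be passed in through the `cost` argument.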
According to an embodiment of the invention, the method for structured extraction of archival data based on machine learning comprises the following steps: acquiring archival text data, and performing character normalization, noise character suppression and serialization segmentation on the archival text data to obtain a text sequence to be processed; generating a candidate field segment set based on the text sequence to be processed, wherein the candidate field segment set is composed of a plurality of candidate field segments covering a preset length range, and each candidate field segment has a segment start-stop boundary; extracting a multi-granularity feature set for each candidate field segment in the candidate field segment set, wherein the multi-granularity feature set comprises character-level features, position-level features and context relation features; taking the minimum description length criterion as a unified objective function, performing joint coding evaluation on the segment boundary combinations formed by the segment start-stop boundaries of the candidate field segments together with the corresponding multi-granularity feature sets, outputting the segment division result with the minimum description length, and generating a compressed feature representation for each field segment in the segment division result; constructing the segment sequence input of a semi-Markov conditional random field model based on the field segments and compressed feature representations in the segment division result, and defining a field label set to form a segment-level sequence labeling task; training the semi-Markov conditional random field model to obtain model parameters, performing segment cascade decoding on the segment sequence input, and outputting the correspondence between the field label set and the field segments;