CN-122021620-A - Text information extraction method and system based on artificial intelligence

Abstract

The invention relates to the technical field of text information extraction and discloses a text information extraction method and system based on artificial intelligence. The method comprises the following steps: S1, acquiring a text to be processed and a historical text, setting a structured output field set, and establishing consistency constraint conditions; S2, establishing an extraction model and an evidence model based on the historical text; S3, inputting the text to be processed and the structured output field set into the extraction model to obtain a field candidate set for each field in the structured output field set; S4, constructing trunk evidence and branch evidence for each candidate field value, and calling the evidence model to calculate the evidence credibility of the candidate field value; and S5, obtaining text information according to all the field candidate sets, the evidence credibility and the consistency constraint conditions. By constructing positive evidence and negative evidence from the trunk evidence and the branch evidence and deriving the evidence credibility from them, the invention significantly reduces wrong selection and wrong assignment in text scenarios where a single field has multiple coexisting candidates or values.

Inventors

HUANG YI

Assignees

湖南人文科技学院

Dates

Publication Date: 20260512
Application Date: 20260130

Claims (7)

1. A text information extraction method based on artificial intelligence, characterized by comprising the following steps: S1, acquiring a text to be processed and a historical text, setting a structured output field set, and establishing consistency constraint conditions based on the structured output field set; S2, establishing an extraction model and an evidence model based on the historical text; S3, inputting the text to be processed and the structured output field set into the extraction model to obtain a field candidate set for each field in the structured output field set, wherein the field candidate set comprises candidate field values and original text fragment positions; S4, constructing trunk evidence and branch evidence for each candidate field value, calling the evidence model, constructing positive evidence and negative evidence according to the trunk evidence and the branch evidence, and calculating the evidence credibility of the candidate field values; and S5, obtaining text information according to all the field candidate sets, the evidence credibility and the consistency constraint conditions.
2. The text information extraction method based on artificial intelligence according to claim 1, wherein the structured output field set is configured for a preset text information extraction format and comprises field names and field meta information; and establishing the consistency constraint conditions comprises: defining each field in the structured output field set using an XML Schema, and generating structural constraints according to the field meta information; establishing rule constraints for the structured output field set using a rule-based XML validation language; and integrating the structural constraints and the rule constraints into the consistency constraint conditions.
3. The text information extraction method based on artificial intelligence according to claim 2, wherein the extraction model is composed of an input module, a tokenizer, a text encoder, a span extraction head and a mapping module; the input module is used for generating an input text sequence of the extraction model, the input text sequence being obtained by splicing a field name in the structured output field set with the text to be processed; the tokenizer is used for segmenting the input text sequence into a subword sequence and outputting an alignment relation between the subword sequence and the character positions of the original text; the text encoder adopts a Transformer-based bidirectional pre-trained language model to encode the subword sequence and output subword-level representation vectors; the span extraction head is used for predicting, on the subword-level representation vectors, the start subword position and the end subword position of a candidate field value; the mapping module is used for mapping the start subword position and the end subword position to an original character start position and an original character end position according to the alignment relation output by the tokenizer, thereby obtaining an original text fragment position, intercepting the candidate field value from the text to be processed based on the original text fragment position, and forming and outputting the field candidate set; and the extraction model is trained using historical text with annotation information: an input text sequence is constructed based on the historical text, the original text fragment position corresponding to each field in the annotation information is taken as supervision information, the extraction model is trained under supervision, and the extraction model parameters are then fixed.
4. The text information extraction method based on artificial intelligence according to claim 3, wherein the evidence model is composed of an input construction module, a text encoding module, a relation determination head, an evidence determination module and a contrastive aggregation module; the input construction module is used for generating input pairs of the evidence model by splicing the field name and the candidate field value into a candidate field value statement and pairing the candidate field value statement with each trunk evidence and each branch evidence respectively; the text encoding module adopts a Transformer-based bidirectional pre-trained language model and is used for encoding the input pairs and outputting semantic representation vectors of the input pairs; the relation determination head is used for outputting a relation determination result from the semantic representation vector, wherein the relation determination result represents the relation between the trunk evidence or the branch evidence and the candidate field value and is one of a supporting relation, an opposing relation and an irrelevant relation; the evidence determination module is used for constructing positive evidence and negative evidence based on the relation determination result and discarding irrelevant evidence, wherein: for trunk evidence, if the relation determination result is a supporting relation, the trunk evidence is taken as positive evidence, if the relation determination result is an opposing relation, the trunk evidence is taken as negative evidence, and if the relation determination result is an irrelevant relation, the trunk evidence is discarded; for branch evidence, negative evidence is generated by a contrastive trigger mechanism, in which a current support probability and a trunk support probability are calculated for the same branch evidence, the current support probability is subtracted from the trunk support probability to obtain a branch substitution trigger value, the branch evidence is taken as negative evidence when the branch substitution trigger value is greater than a trigger threshold, and the branch evidence is discarded when the branch substitution trigger value is less than or equal to the trigger threshold; the contrastive aggregation module is used for calculating the evidence credibility of the candidate field value by determining evidence relevance weights based on the distances between the positions of the positive evidence and the negative evidence and the original text fragment position, performing weighted aggregation on the positive evidence and on the negative evidence respectively to obtain a positive evidence overall weight and a negative evidence overall weight, and outputting the evidence credibility according to the difference between the positive evidence overall weight and the negative evidence overall weight; the evidence model is trained using historical text with annotation information: input pairs of the evidence model are constructed based on the historical text, and the relation determination head is trained under supervision with the annotated field value determined by the annotation information as supervision information; contrastive training samples of branch evidence are constructed based on the competition between the annotated field value of a field and the other candidate field values of the same field, so that the evidence determination module learns the contrastive trigger mechanism during training; and the evidence model parameters are fixed after training is finished and are used for outputting the evidence credibility of the candidate field values in the text to be processed.
5. The text information extraction method based on artificial intelligence according to claim 4, wherein constructing the trunk evidence and the branch evidence of each candidate field value comprises: for a candidate field value A in the field candidate set, taking the original text fragment position as an anchor point and searching the text to be processed leftwards for the nearest sentence boundary and rightwards for the nearest sentence boundary to obtain a core sentence; acquiring the preceding adjacent sentence and the following adjacent sentence of the core sentence, and taking the core sentence and the two adjacent sentences as the sentences of the trunk evidence; if the number of characters of the sentences of the trunk evidence is less than or equal to a character-count threshold, taking the sentences of the trunk evidence as the trunk evidence; if the number of characters of the sentences of the trunk evidence is greater than the character-count threshold, cutting the sentences of the trunk evidence with the original text fragment position as the center and the character-count threshold as the length to obtain the trunk evidence; and taking the trunk evidence of the candidate field values other than the candidate field value A in the field candidate set as the branch evidence of the candidate field value A.
6. The text information extraction method based on artificial intelligence according to claim 5, wherein step S5 specifically comprises: combining the candidate field values of all fields in the field candidate sets with their corresponding evidence credibility, and sorting the candidate field values within each field from high to low evidence credibility; under the consistency constraint conditions, expanding the fields one by one to generate candidate text information, and performing a consistency check on the candidate text information obtained by each expansion, wherein only candidate text information satisfying the consistency constraint conditions is retained; and organizing the retained candidate text information into text information according to the fields of the structured output field set and outputting the text information.
7. A text information extraction system based on artificial intelligence, applied to the text information extraction method based on artificial intelligence according to any one of claims 1 to 6, comprising: an acquisition module, used for acquiring a text to be processed and a historical text, setting a structured output field set, and establishing consistency constraint conditions based on the structured output field set; a modeling module, used for establishing an extraction model and an evidence model based on the historical text; an extraction module, used for inputting the text to be processed and the structured output field set into the extraction model to obtain a field candidate set for each field in the structured output field set, wherein the field candidate set comprises candidate field values and original text fragment positions; an evidence module, used for constructing trunk evidence and branch evidence for each candidate field value, calling the evidence model, constructing positive evidence and negative evidence according to the trunk evidence and the branch evidence, and calculating the evidence credibility of the candidate field values; and an output module, used for obtaining text information according to all the field candidate sets, the evidence credibility and the consistency constraint conditions.
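
For illustration only, the trunk and branch evidence construction of claim 5 can be sketched in Python as follows. The set of sentence-boundary characters, the default character-count threshold of 200, and all function names are assumptions made for this sketch; the claims do not specify them. Treating every "." as a sentence boundary is a simplification that would also split decimal numbers.

```python
SENTENCE_END = "。！？.!?"  # assumed sentence-boundary characters


def split_sentences(text):
    """Return (start, end) character spans of sentences; end is exclusive."""
    spans, start = [], 0
    for i, ch in enumerate(text):
        if ch in SENTENCE_END:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(text):  # trailing text without a closing boundary
        spans.append((start, len(text)))
    return spans


def trunk_evidence(text, span_start, span_end, max_chars=200):
    """Core sentence plus its two neighbours, cut to max_chars if too long."""
    spans = split_sentences(text)
    # Core sentence: the sentence that contains the start of the extracted span.
    core = next(i for i, (s, e) in enumerate(spans) if s <= span_start < e)
    lo = max(core - 1, 0)
    hi = min(core + 1, len(spans) - 1)
    ev_start, ev_end = spans[lo][0], spans[hi][1]
    if ev_end - ev_start <= max_chars:
        return text[ev_start:ev_end]
    # Too long: keep a window of max_chars centred on the extracted span.
    center = (span_start + span_end) // 2
    win_start = max(ev_start, center - max_chars // 2)
    win_end = min(ev_end, win_start + max_chars)
    win_start = max(ev_start, win_end - max_chars)
    return text[win_start:win_end]


def branch_evidence(trunks, candidate):
    """Branch evidence of a candidate: the trunk evidence of every other candidate."""
    return [ev for cand, ev in trunks.items() if cand != candidate]
```

Here `trunks` is a hypothetical dictionary mapping each candidate field value of one field to its trunk evidence, so that `branch_evidence(trunks, "A")` returns the trunk evidence of all competitors of candidate "A".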
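
As a minimal sketch of the evidence determination and aggregation described in claim 4, the following Python code takes relation labels and support probabilities as given inputs in place of a trained relation determination head. Several points are assumptions of this sketch rather than statements of the claims: the trunk support probability is read as the probability that the branch evidence supports the competing candidate it was built for; the relevance weight is taken as 1 / (1 + distance / scale); the trigger threshold defaults to 0.3; and the credibility is the plain difference of the two overall weights.

```python
def classify_trunk(relation):
    """Trunk evidence: 'support' -> positive, 'oppose' -> negative, otherwise discard (None)."""
    return {"support": "positive", "oppose": "negative"}.get(relation)


def branch_is_negative(p_current, p_trunk, threshold=0.3):
    """Branch evidence becomes negative evidence only if the trigger value exceeds the threshold."""
    trigger = p_trunk - p_current  # branch substitution trigger value
    return trigger > threshold


def distance_weight(evidence_pos, span_pos, scale=100.0):
    """Assumed weighting: evidence closer to the extracted span counts more."""
    return 1.0 / (1.0 + abs(evidence_pos - span_pos) / scale)


def credibility(span_pos, positive_positions, negative_positions):
    """Difference between the overall positive weight and the overall negative weight."""
    pos_total = sum(distance_weight(p, span_pos) for p in positive_positions)
    neg_total = sum(distance_weight(p, span_pos) for p in negative_positions)
    return pos_total - neg_total
```

With these assumed weights, one positive evidence located at the extracted span and one negative evidence 100 characters away give a credibility of 1.0 - 0.5 = 0.5.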

Description

Text information extraction method and system based on artificial intelligence

Technical Field

The invention relates to the technical field of text information extraction, and in particular to a text information extraction method and system based on artificial intelligence.

Background

In business scenarios such as bills, contracts, reimbursement and business trips, a large amount of information exists in the form of natural language text. The wording is flexible and the fields are scattered: the same field may appear in several places, and a field value may be supplemented or corrected later in the text. A single extraction pass therefore rarely yields stable, usable text information. In particular, when multiple fields are output together, the fields often have dependency, mutual exclusion and temporal relationships; without cross-field consistency checks, problems such as inconsistent field value combinations, conflicting combinations and untraceable results easily occur.

In the prior art, one class of methods relies on an extraction model to output field values directly, but is prone to mismatches when a field has multiple candidates, the same content is expressed in several places, and cross-field logical constraints are strong. Another class relies on rule checking, but when the source of the candidate field values is unstable and the supporting evidence is unclear, it is difficult to screen candidate field values reliably or to produce locatable error information and consistency information. Therefore, a text information extraction method is needed that uses the field candidate sets, the evidence credibility and the consistency constraint conditions together to verify, screen and combine candidate field values and to output a traceable result.

Disclosure of Invention

The present invention has been made in view of the above-described problems. To solve the above technical problems, the invention provides a text information extraction method based on artificial intelligence, comprising the following steps: S1, acquiring a text to be processed and a historical text, setting a structured output field set, and establishing consistency constraint conditions based on the structured output field set; S2, establishing an extraction model and an evidence model based on the historical text; S3, inputting the text to be processed and the structured output field set into the extraction model to obtain a field candidate set for each field in the structured output field set, wherein the field candidate set comprises candidate field values and original text fragment positions; S4, constructing trunk evidence and branch evidence for each candidate field value, calling the evidence model, constructing positive evidence and negative evidence according to the trunk evidence and the branch evidence, and calculating the evidence credibility of the candidate field values; and S5, obtaining text information according to all the field candidate sets, the evidence credibility and the consistency constraint conditions.

As a preferred scheme of the text information extraction method based on artificial intelligence, the structured output field set is configured for a preset text information extraction format and comprises field names and field meta information.
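
For illustration only, the way step S5 combines the field candidate sets, the evidence credibility and the consistency constraint conditions can be sketched in Python as below. The function name `select_records`, the representation of candidates as (value, credibility) pairs and the `is_consistent` callback are hypothetical; the callback stands in for the consistency constraint conditions, and field meta information is omitted for brevity.

```python
def select_records(fields, candidates, is_consistent):
    """Expand fields one by one, keeping only partial records that pass the check.

    fields:        ordered list of field names from the structured output field set
    candidates:    dict mapping field name -> list of (value, credibility) pairs
    is_consistent: callable taking a partial record (dict) and returning bool
    """
    partials = [{}]
    for name in fields:
        # Within each field, try candidates from high to low evidence credibility.
        ranked = sorted(candidates.get(name, []), key=lambda c: c[1], reverse=True)
        expanded = []
        for partial in partials:
            for value, _score in ranked:
                trial = {**partial, name: value}
                if is_consistent(trial):  # discard inconsistent expansions early
                    expanded.append(trial)
        partials = expanded
    return partials


def dates_ordered(record):
    """Example constraint: a start date must not be later than the end date."""
    if "start_date" in record and "end_date" in record:
        return record["start_date"] <= record["end_date"]
    return True


records = select_records(
    ["start_date", "end_date"],
    {
        "start_date": [("2026-01-05", 0.4), ("2026-01-03", 0.9)],
        "end_date": [("2026-01-04", 0.8), ("2026-01-02", 0.6)],
    },
    dates_ordered,
)
```

In this sketch the retained records are ordered so that combinations built from higher-credibility values of earlier fields come first. The number of partial records can grow quickly with many fields and candidates, so a practical implementation would likely cap the number kept after each expansion; that cap is not part of this sketch.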
Establishing the consistency constraint conditions comprises: defining each field in the structured output field set using an XML Schema, and generating structural constraints according to the field meta information; establishing rule constraints for the structured output field set using a rule-based XML validation language; and integrating the structural constraints and the rule constraints into the consistency constraint conditions.

As a preferred scheme of the text information extraction method based on artificial intelligence, the extraction model comprises an input module, a tokenizer, a text encoder, a span extraction head and a mapping module; the input module is used for generating an input text sequence of the extraction model, the input text sequence being obtained by splicing a field name in the structured output field set with the text to be processed; the tokenizer is used for segmenting the input text sequence into a subword sequence and outputting an alignment relation between the subword sequence and the character positions of the original text; the text encoder adopts a Transformer-based bidirectional pre-trained language model to encode the subword sequence and output subword-level representation vectors; the span extraction head is used for predicting, on the subword-level representation vectors, the start subword position and the end subword position of a candidate field value; The mapping module is used for mapping the initial