CN-121579993-B - NLP-based customs receipt text target feature extraction method and system


Abstract

The application relates to the technical field of data processing and discloses an NLP-based method and system for extracting target features from customs document text. The method comprises: performing sequence labeling on the customs document text to obtain an initial field set and a field missing vector; matching the field missing vector against structure type feature vectors to determine a field spatial attention template; screening candidate text blocks according to the attention template to generate completion fields; constructing a semantic association graph from the initial fields and the completion fields; and outputting a target field set after constraint verification and iterative correction. The application improves field extraction accuracy and logical consistency for customs documents with variable and complex formats.

Inventors

ZHANG CHENGZHE
WANG YUNAN
YIN XIAOPING
LIU KAIXI
ZHANG CHENG
CAO WEI
YANG LU
GUO ZHENTONG

Assignees

天津易泰科技发展有限公司
中国电子口岸数据中心天津分中心

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-26

Claims (9)

1. An NLP-based method for extracting target features from customs document text, characterized by comprising the following steps: Step S1, performing sequence labeling on the customs document text to obtain an initial field set and an extraction confidence for each field, and generating a field missing vector according to whether each required field is present among the high-confidence fields; Step S2, performing similarity matching between the field missing vector and pre-stored structure type feature vectors, calculating a comprehensive score by combining the similarity with the degree to which the spatial positions of the extracted fields match the distribution of the structure template, and determining a field spatial attention template from the structure type with the highest comprehensive score; Step S3, screening candidate text blocks according to the probability distribution heat map of each missing field in the field spatial attention template, calculating for each candidate text block a completion score as a weighted combination of the spatial response value, the format matching degree and the semantic matching degree, and generating a completion field from the text block with the highest completion score; Step S4, constructing a semantic association graph from the initial field set and the completion fields, calculating constraint violation marks between fields based on knowledge graph embeddings, performing context expansion or keyword correction on the field node with the lowest confidence according to the constraint violation type, and outputting a target field set after iterative verification, wherein step S4 comprises: constructing the semantic association graph with the initial field set and the completion fields as nodes; establishing a semantic consistency constraint edge between the commodity name and the HS code, a numerical logic constraint edge among the quantity, unit and amount, a trade agreement constraint edge between the origin and the tax rate, and a rule constraint edge between the receiver and the trade mode supervision mode; calculating the relation distance between the commodity name vector and the HS code vector based on a pre-constructed customs commodity knowledge graph, and marking a constraint edge violation when the relation probability is lower than 0.15; calculating the deviation of the unit price implied by the quantity and amount based on the historical average unit price and standard deviation; selecting the field nodes with the lowest confidence degree as a re-extraction target for the commodity name, performing constraint edge violation by syntax analysis in the context, performing iteration correction on the iteration relation calculation for the HS code vectors when the number of the constraint vectors is lower than 2.5 times, and performing iteration correction on the iteration correction target vectors when the number of iteration correction is completed by means of the iteration correction vectors is calculated for the iteration vectors, and the iteration correction target number is reached after the iteration correction is completed by the iteration relation is calculated for the node.
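As an illustration of the constraint checks in step S4, the following minimal Python sketch flags a semantic violation when the relation probability between commodity name and HS code falls below 0.15, and a numeric violation when the implied unit price deviates strongly from the historical average. Several parts are assumptions rather than the claimed implementation: `relation_prob` is a hypothetical callable standing in for the knowledge graph lookup, the role of the figure 2.5 (used here as a z-score limit) is a guess because the translated claim text around it is garbled, and choosing the re-extraction target among the two endpoints of the violated edge is one possible reading of "lowest confidence".

```python
REL_PROB_MIN = 0.15  # threshold stated in claim 1
Z_MAX = 2.5          # ASSUMPTION: treated as a z-score limit; the source text is unclear


def find_violations(fields, relation_prob, price_stats):
    """Return a list of (violation_type, field_a, field_b) tuples.

    fields: dict mapping field type -> {"text": str, "confidence": float}
    relation_prob: hypothetical callable (name_text, hs_text) -> probability
    price_stats: (historical mean unit price, standard deviation)
    """
    violations = []
    name, hs = fields["commodity_name"], fields["hs_code"]
    if relation_prob(name["text"], hs["text"]) < REL_PROB_MIN:
        violations.append(("semantic", "commodity_name", "hs_code"))

    qty = float(fields["quantity"]["text"])
    amount = float(fields["amount"]["text"])
    mean, std = price_stats
    if qty > 0 and std > 0 and abs(amount / qty - mean) / std > Z_MAX:
        violations.append(("numeric", "quantity", "amount"))
    return violations


def reextraction_target(fields, violation):
    """Pick the lower-confidence endpoint of the violated constraint edge."""
    _, a, b = violation
    return min((a, b), key=lambda k: fields[k]["confidence"])
```

In a full pipeline the returned target would be re-extracted (context expansion or keyword correction) and the checks repeated until no violation remains or an iteration limit is hit.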
2. The NLP-based method for extracting target features from customs document text according to claim 1, wherein step S1 comprises: performing OCR recognition on the customs document image to convert the image into a text sequence, performing Chinese word segmentation, and mapping the word segmentation sequence into a text embedding matrix based on a BERT model; inputting the text embedding matrix into a BiLSTM-CRF sequence labeling model for labeling, wherein the BiLSTM-CRF sequence labeling model comprises a bidirectional LSTM layer and a CRF output layer, the label set comprises a field start position label, a field internal position label and a non-target field label, and the optimal label sequence is obtained by decoding with the Viterbi algorithm; extracting the initial field set according to the optimal label sequence, wherein each field in the initial field set comprises a field type, text content and an extraction confidence, and the extraction confidence is calculated as the product of the label probabilities of the words belonging to the field; and marking the fields in the initial field set whose extraction confidence is greater than 0.7 as a high-confidence field set, determining whether each field of the required field set is present in the high-confidence field set, and generating the field missing vector, in which a vector element value of 0 indicates that the corresponding required field is missing.
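The last two operations of claim 2 (confidence as a product of label probabilities, then a 0/1 vector over required fields) can be sketched as follows. This is a minimal illustration, not the patented implementation: the list `REQUIRED_FIELDS` and the dictionary layout of a field are hypothetical, since the claim does not enumerate the required fields or specify a data structure.

```python
from math import prod

# Hypothetical list of required field types; not enumerated in the claim.
REQUIRED_FIELDS = ["commodity_name", "hs_code", "quantity", "unit", "amount", "origin"]
HIGH_CONFIDENCE = 0.7  # threshold stated in claim 2


def field_confidence(tag_probs):
    """Extraction confidence: product of the label probabilities of the field's words."""
    return prod(tag_probs)


def missing_vector(fields):
    """Build the field missing vector.

    fields: list of dicts with keys "type", "text", "tag_probs".
    Returns one element per required field: 1 if the field is present in the
    high-confidence set, 0 if it is missing.
    """
    high = {
        f["type"]
        for f in fields
        if field_confidence(f["tag_probs"]) > HIGH_CONFIDENCE
    }
    return [1 if name in high else 0 for name in REQUIRED_FIELDS]
```

Note that a field extracted with low confidence counts as missing here, which is what lets the later completion step revisit it.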
3. The NLP-based method for extracting target features from customs document text according to claim 1, wherein in step S2, matching the field missing vector with the pre-stored structure type feature vectors comprises: reading 15 typical document structure types and their corresponding feature vectors from a customs document structure knowledge base, wherein each dimension of a feature vector represents the historical missing probability of the corresponding required field under that structure type; calculating the similarity between the field missing vector and each structure type feature vector based on a weighted cosine similarity formula, wherein the weight factor in the weighted cosine similarity formula is calculated from the degree of overlap between the actually missing fields and the fields with a high missing probability; and selecting the 3 structure types with the highest similarity as candidate structures, wherein the corresponding similarities are in decreasing order and the highest similarity is greater than 0.6.
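A weighted cosine match of the kind described in claim 3 might look like the sketch below. The claim does not give the weight formula, so the weighting used here (doubling dimensions where a field is actually missing and the template's missing probability exceeds 0.5) is purely an assumption, as are the function names and the choice to convert the 0-means-missing vector into a 1-means-missing indicator before comparing it with missing probabilities.

```python
import math


def weighted_cosine(a, b, w):
    """Cosine similarity with a non-negative weight per dimension."""
    num = sum(wi * ai * bi for ai, bi, wi in zip(a, b, w))
    na = math.sqrt(sum(wi * ai * ai for ai, wi in zip(a, w)))
    nb = math.sqrt(sum(wi * bi * bi for bi, wi in zip(b, w)))
    return num / (na * nb) if na and nb else 0.0


def top_candidates(missing_vec, templates, k=3, min_best=0.6):
    """Rank structure types by similarity and keep the best k.

    missing_vec: 0/1 vector where 0 marks a missing required field.
    templates: dict mapping structure name -> historical missing probabilities.
    Returns [(similarity, name), ...] in decreasing order, or [] when the best
    similarity does not exceed min_best.
    """
    indicator = [1 - v for v in missing_vec]  # 1 now marks a missing field
    scored = []
    for name, probs in templates.items():
        # ASSUMED weighting: emphasise overlap of actual and likely-missing fields.
        w = [2.0 if m == 1 and p > 0.5 else 1.0 for m, p in zip(indicator, probs)]
        scored.append((weighted_cosine(indicator, probs, w), name))
    scored.sort(reverse=True)
    top = scored[:k]
    return top if top and top[0][0] > min_best else []
```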
4. The NLP-based method for extracting target features from customs document text according to claim 3, wherein in step S2, calculating the comprehensive score by combining the spatial positions of the extracted fields with the distribution matching degree of the structure template comprises: extracting the normalized spatial coordinates of the successfully extracted fields in the initial field set, wherein the normalized spatial coordinates comprise the abscissa and ordinate of the center point of the field bounding box and the width and height of the bounding box; for each candidate structure, reading the corresponding field spatial distribution template from the customs document structure knowledge base, wherein the field spatial distribution template uses a Gaussian mixture model to represent the expected position distribution of each field type on the page; calculating the probability density value of the actual position of each field in the high-confidence field set on the corresponding field spatial distribution template, and summing the probability density values of all fields to obtain the matching degree of each candidate structure; and calculating the comprehensive score of each candidate structure as 0.6 times the similarity plus 0.4 times the normalized matching degree, and selecting the candidate structure with the highest comprehensive score as the inferred structure type.
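The scoring in claim 4 can be illustrated with a small pure-Python sketch. The 0.6 and 0.4 weights come from the claim; everything else is an assumption made for illustration: each mixture component is an axis-aligned 2D Gaussian given as a `(weight, mean_x, mean_y, sigma_x, sigma_y)` tuple, only the bounding-box center is used, and the matching degree is normalized by dividing by the largest value among the candidates (the claim does not say how normalization is done).

```python
import math


def gauss(x, mu, sigma):
    """Density of a 1D normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def gmm_density(x, y, components):
    """Density of an axis-aligned 2D Gaussian mixture at (x, y)."""
    return sum(
        w * gauss(x, mx, sx) * gauss(y, my, sy)
        for w, mx, my, sx, sy in components
    )


def matching_degree(positions, template):
    """Sum of densities of the observed field centers under one template."""
    return sum(
        gmm_density(x, y, template[ftype])
        for ftype, (x, y) in positions.items()
        if ftype in template
    )


def pick_structure(similarities, positions, templates):
    """Return (best structure name, all comprehensive scores)."""
    raw = {name: matching_degree(positions, templates[name]) for name in similarities}
    top = max(raw.values()) or 1.0  # avoid division by zero when nothing matches
    scores = {
        name: 0.6 * similarities[name] + 0.4 * raw[name] / top
        for name in similarities
    }
    return max(scores, key=scores.get), scores
```

In the test data below, structure "B" has the higher vector similarity, but the observed field position only fits template "A", so "A" wins on the combined score.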
5. The NLP-based method for extracting target features from customs document text according to claim 4, wherein in step S2, determining the field spatial attention template from the structure type with the highest comprehensive score comprises: reading, from the customs document structure knowledge base, the field spatial attention template corresponding to the inferred structure type, wherein the field spatial attention template comprises a probability density heat map of each required field type over the positions of the page, and the heat map values range from 0 to 1 and represent the probability that the required field appears at the corresponding position.
6. The NLP-based method for extracting target features from customs document text according to claim 1, wherein step S3 comprises: for each missing required field, reading the corresponding probability distribution heat map from the field spatial attention template, calculating the response value on the heat map of the center point of each text block saved in the OCR recognition stage, and selecting the text blocks whose response value is greater than 0.5 to form a candidate text block set; performing field type discrimination on each text block in the candidate text block set, calculating the format matching degree based on regular expression matching, and calculating the semantic matching degree based on a pre-trained entity recognition model or on the cosine similarity of BERT vectors; and calculating the completion score of each candidate text block as 0.3 times the spatial response value plus 0.4 times the format matching degree plus 0.3 times the semantic matching degree, selecting the text block with the highest completion score as the completion content of the missing field, and generating a completion field containing the field type, the text content and the completion score.
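The completion step of claim 6 can be sketched as below, using the stated 0.5 response threshold and the 0.3 / 0.4 / 0.3 weights. The heat map (called a "thermodynamic diagram" in some renderings of the translation) is modelled as a plain grid of values indexed by normalized coordinates. The regular expression for HS codes, the binary format score, and the `semantic_score` callable (standing in for an entity recognition model or BERT cosine similarity) are all hypothetical placeholders.

```python
import re

# Hypothetical format patterns; real customs formats are not given in the source.
PATTERNS = {"hs_code": re.compile(r"\d{8}(\d{2})?")}


def heat_response(heatmap, cx, cy):
    """Look up the heat map cell containing the normalized point (cx, cy)."""
    rows, cols = len(heatmap), len(heatmap[0])
    r = min(int(cy * rows), rows - 1)
    c = min(int(cx * cols), cols - 1)
    return heatmap[r][c]


def complete_field(field_type, blocks, heatmap, semantic_score):
    """Return the best completion field, or None if no block passes the filter.

    blocks: list of dicts with "text", "cx", "cy" (normalized block center).
    semantic_score: hypothetical callable (field_type, text) -> value in [0, 1].
    """
    best = None
    for block in blocks:
        response = heat_response(heatmap, block["cx"], block["cy"])
        if response <= 0.5:  # keep only blocks with response greater than 0.5
            continue
        fmt = 1.0 if PATTERNS[field_type].fullmatch(block["text"]) else 0.0
        sem = semantic_score(field_type, block["text"])
        score = 0.3 * response + 0.4 * fmt + 0.3 * sem
        if best is None or score > best["score"]:
            best = {"type": field_type, "text": block["text"], "score": score}
    return best
```

Filtering by spatial response first keeps the more expensive semantic scoring limited to blocks that sit where the template expects the field.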
7. An NLP-based system for extracting target features from customs document text, for implementing the NLP-based method for extracting target features from customs document text of any one of claims 1 to 6, the system comprising: a generation module, configured to perform sequence labeling on the customs document text to obtain an initial field set and an extraction confidence for each field, and to generate a field missing vector according to whether each required field is present among the high-confidence fields; a matching module, configured to perform similarity matching between the field missing vector and pre-stored structure type feature vectors, to calculate a comprehensive score by combining the similarity with the degree to which the spatial positions of the extracted fields match the distribution of the structure template, and to determine a field spatial attention template from the structure type with the highest comprehensive score; a weighting module, configured to screen candidate text blocks according to the probability distribution heat map of each missing field in the field spatial attention template, to calculate for each candidate text block a completion score as a weighted combination of the spatial response value, the format matching degree and the semantic matching degree, and to generate a completion field from the text block with the highest completion score; and a correction module, configured to construct a semantic association graph from the initial field set and the completion fields, to calculate constraint violation marks between fields based on knowledge graph embeddings, to perform context expansion or keyword correction on the field node with the lowest confidence according to the constraint violation type, and to output a target field set after iterative verification, including: constructing the semantic association graph with the initial field set and the completion fields as nodes; establishing a semantic consistency constraint edge between the commodity name and the HS code, a numerical logic constraint edge among the quantity, unit and amount, a trade agreement constraint edge between the origin and the tax rate, and a rule constraint edge between the receiver and the trade mode supervision mode; calculating the relation distance between the commodity name vector and the HS code vector based on a pre-constructed customs commodity knowledge graph, and marking a constraint edge violation when the relation probability is lower than 0.15; calculating the deviation of the unit price implied by the quantity and amount based on the historical average unit price and standard deviation; selecting the field nodes with the lowest confidence degree as a re-extraction target for the commodity name, performing constraint edge violation by syntax analysis in the context, performing iteration correction on the iteration correction vectors when the number of the iteration vectors is lower than 2.5 times, and stopping iteration correction of the iteration vectors is performed for the iteration correction target vectors when the number of the iteration correction vectors is calculated by means of the iteration correction vectors in the iteration correction method, and the iteration correction target number is reached after the iteration correction has reached to the iteration correction vectors.
8. The system of claim 7, wherein performing sequence labeling on the customs document text to obtain an initial field set and an extraction confidence for each field, and generating a field missing vector according to whether each required field is present among the high-confidence fields, comprises: performing OCR recognition on the customs document image to convert the image into a text sequence, performing Chinese word segmentation, and mapping the word segmentation sequence into a text embedding matrix based on a BERT model; inputting the text embedding matrix into a BiLSTM-CRF sequence labeling model for labeling, wherein the BiLSTM-CRF sequence labeling model comprises a bidirectional LSTM layer and a CRF output layer, the label set comprises a field start position label, a field internal position label and a non-target field label, and the optimal label sequence is obtained by decoding with the Viterbi algorithm; extracting the initial field set according to the optimal label sequence, wherein each field in the initial field set comprises a field type, text content and an extraction confidence, and the extraction confidence is calculated as the product of the label probabilities of the words belonging to the field; and marking the fields in the initial field set whose extraction confidence is greater than 0.7 as a high-confidence field set, determining whether each field of the required field set is present in the high-confidence field set, and generating the field missing vector, in which a vector element value of 0 indicates that the corresponding required field is missing.
9. The system of claim 8, wherein screening candidate text blocks according to the probability distribution heat map of each missing field in the field spatial attention template, calculating for each candidate text block a completion score as a weighted combination of the spatial response value, the format matching degree and the semantic matching degree, and generating a completion field from the text block with the highest completion score, comprises: for each missing required field, reading the corresponding probability distribution heat map from the field spatial attention template, calculating the response value on the heat map of the center point of each text block saved in the OCR recognition stage, and selecting the text blocks whose response value is greater than 0.5 to form a candidate text block set; performing field type discrimination on each text block in the candidate text block set, calculating the format matching degree based on regular expression matching, and calculating the semantic matching degree based on a pre-trained entity recognition model or on the cosine similarity of BERT vectors; and calculating the completion score of each candidate text block as 0.3 times the spatial response value plus 0.4 times the format matching degree plus 0.3 times the semantic matching degree, selecting the text block with the highest completion score as the completion content of the missing field, and generating a completion field containing the field type, the text content and the completion score.

Description

NLP-based customs receipt text target feature extraction method and system

Technical Field

The application relates to the technical field of data processing, and in particular to an NLP-based method and system for extracting target features from customs document text.

Background

Target feature extraction from customs document text is a key technical link in customs clearance automation. It involves automatically identifying and extracting key business fields such as commodity name, HS code, quantity, unit, amount and origin from unstructured or semi-structured text documents such as customs declarations, commercial invoices, packing lists and bills of lading.

In the prior art, customs document feature extraction methods based on natural language processing mainly use a sequence labeling model to label the text by type: a deep learning model such as BiLSTM-CRF maps the segmented text sequence to a label sequence, from which the business fields are extracted. By training the sequence labeling model on large-scale annotated data, the model learns the contextual features and boundary patterns of the fields, achieves high extraction accuracy on documents in standard formats, and provides basic data support for automated customs review.

However, customs documents are extremely diverse in format. Documents from different countries, enterprises and business types differ greatly in layout, field order, nested table structure and so on. As a result, a sequence labeling model trained on a fixed training set generalizes poorly to documents in a new format, and field extraction accuracy drops sharply; in particular, under non-standard formats such as multi-column layouts and nested tables, the missing rate of key fields is as high as 30-50%.

Secondly, existing methods extract each field as an independent target and do not model the business logic relationships among fields. They cannot use domain knowledge, such as the semantic consistency between commodity name and HS code, the numerical logic relationship among quantity, unit and amount, or the trade agreement match between origin and tax rate, for cross-verification. Therefore, even when the extraction confidence of each individual field is high, the combination of fields may still contradict business logic, and such logic errors cannot be automatically discovered and corrected before manual verification.

Disclosure of Invention

The application provides an NLP-based method and system for extracting target features from customs document text, to solve the technical problems that existing customs document feature extraction methods cannot effectively complete missing fields for documents with diverse formats, and cannot automatically correct erroneous results because logical consistency verification among the extracted fields is lacking.
In a first aspect, the present application provides an NLP-based method for extracting target features from customs document text, the method comprising: Step S1, performing sequence labeling on the customs document text to obtain an initial field set and an extraction confidence for each field, and generating a field missing vector according to whether each required field is present among the high-confidence fields; Step S2, performing similarity matching between the field missing vector and pre-stored structure type feature vectors, calculating a comprehensive score by combining the similarity with the degree to which the spatial positions of the extracted fields match the distribution of the structure template, and determining a field spatial attention template from the structure type with the highest comprehensive score; Step S3, screening candidate text blocks according to the probability distribution heat map of each missing field in the field spatial attention template, calculating for each candidate text block a completion score as a weighted combination of the spatial response value, the format matching degree and the semantic matching degree, and generating a completion field from the text block with the highest completion score; and Step S4, constructing a semantic association graph from the initial field set and the completion fields, calculating constraint violation marks between fields based on knowledge graph embeddings, performing context expansion or keyword correction on the field node with the lowest confidence according to the constraint violation type, and outputting a target field set after iterative verification. In a second aspect, the present application provides an NLP-based system for extracting target features from customs document text, the system comprising: The generation module is used for carryin