
CN-122021615-A - Method and system for extracting structured information of business trip certificate based on hybrid model


Abstract

The invention relates to the technical field of data processing, and in particular to a method and a system for extracting structured information from business trip certificates based on a hybrid model. The method comprises: splitting a text sequence in the business trip certificate according to attribute names and a word segmentation algorithm to obtain a segmented text sequence containing attribute segments; determining the core word of the current attribute segment; determining the structural priority of each effective word combination in the current attribute segment; taking the effective word combination corresponding to the maximum structural priority as a reference phrase; and, where the core word of a remaining effective word combination appears in the reference phrase, taking the corresponding effective word combination as a linked phrase of the reference phrase, thereby obtaining the structured information of the business trip certificate. The invention improves the accuracy and reliability of information extraction from business trip certificates.

Inventors

  • WEI YANLI
  • HAN CHUNTAO

Assignees

  • 德迅科技有限公司

Dates

Publication Date
20260512
Application Date
20251210

Claims (10)

  1. A method for extracting structured information from a business trip certificate based on a hybrid model, characterized by comprising the following steps: splitting a text sequence in the business trip certificate according to attribute names and a word segmentation algorithm to obtain a segmented text sequence containing attribute segments, wherein each attribute segment contains at least one word; determining the core word of the current attribute segment according to the cosine similarity between the current word and the remaining words in the current attribute segment and in the other attribute segments; determining the structural priority of each effective word combination in the current attribute segment according to the distance between the current word and the core word in the current attribute segment, the first length of the current attribute segment, the first number of core words in the current attribute segment, and the second number of all words in the current attribute segment; and taking the effective word combination corresponding to the maximum structural priority as a reference phrase and, where the core word of a remaining effective word combination appears in the reference phrase, taking the corresponding effective word combination as a linked phrase of the reference phrase, to obtain the structured information of the business trip certificate.
  2. The method for extracting structured information from a business trip certificate based on a hybrid model according to claim 1, wherein determining the core word of the current attribute segment according to the cosine similarity between the current word and the remaining words in the current attribute segment and in the other attribute segments comprises: determining an effective correlation coefficient of the current word in the current attribute segment according to the cosine similarity between the word vector of the current word and the word vectors of the remaining words in the current attribute segment, the maximum value of the cosine similarity of the current word in the current attribute segment and in the other attribute segments, the first length of the current attribute segment containing the current word, and the second length of the other attribute segments in which the current word is located; and taking the word corresponding to the maximum effective correlation coefficient in the current attribute segment as the core word of the current attribute segment.
  3. The method for extracting structured information from a business trip certificate based on a hybrid model according to claim 2, wherein determining the effective correlation coefficient of the current word in the current attribute segment according to the cosine similarity between the word vector of the current word and the word vectors of the remaining words in the current attribute segment, the maximum value of the cosine similarity of the current word in the current attribute segment and in the other attribute segments, the first length of the current attribute segment containing the current word, and the second length of the other attribute segments in which the current word is located comprises: calculating the average value of the cosine similarity between the word vector of the current word and the word vectors of the remaining words in the current attribute segment, and a first ratio between the average value and the maximum value of the cosine similarity; calculating a second ratio between the first length and the second length; and determining the effective correlation coefficient of the current word in the current attribute segment according to the first ratio and the second ratio.
  4. The method for extracting structured information from a business trip certificate based on a hybrid model according to claim 2, wherein determining the structural priority of the effective word combination in the current attribute segment according to the distance between the current word and the core word in the current attribute segment, the first length of the current attribute segment, the first number of core words in the current attribute segment, and the second number of all words in the current attribute segment comprises: determining a structural coefficient of the current word with respect to the core word in the current attribute segment according to the distance between the current word and the core word in the current attribute segment, the first length of the current attribute segment, the effective correlation coefficient of the current word, and the effective correlation coefficient of the core word; arranging the structural coefficients of the words with respect to the core word in the current attribute segment in descending order to obtain a structural coefficient sequence; performing first-order differencing on the structural coefficient sequence to obtain a coefficient difference sequence; extracting the adjacent structural coefficients in the structural coefficient sequence that correspond to the position of the maximum value in the coefficient difference sequence; taking the larger of the adjacent structural coefficients as one interval endpoint and the largest value in the structural coefficient sequence as the other interval endpoint; extracting the interval between the two endpoints from the structural coefficient sequence and taking the words in the interval as the effective word combination of the core word of the current attribute segment; and determining the structural priority of the effective word combination in the current attribute segment according to the first number of core words in the current attribute segment, the second number of all words in the current attribute segment, and the structural coefficient, with respect to the core word, of each word in the effective word combination in the current attribute segment.
  5. The method for extracting structured information from a business trip certificate based on a hybrid model according to claim 4, wherein determining the structural coefficient of the current word with respect to the core word in the current attribute segment according to the distance between the current word and the core word in the current attribute segment, the first length of the current attribute segment, the effective correlation coefficient of the current word, and the effective correlation coefficient of the core word comprises: calculating a third ratio between the first length and the distance, and a fourth ratio between the effective correlation coefficient of the current word and the effective correlation coefficient of the core word; and determining the structural coefficient of the current word with respect to the core word in the current attribute segment according to the third ratio and the fourth ratio.
  6. The method for extracting structured information from a business trip certificate based on a hybrid model according to claim 4, wherein determining the structural priority of the effective word combination in the current attribute segment according to the first number of core words in the current attribute segment, the second number of all words in the current attribute segment, and the structural coefficient, with respect to the core word, of each word in the effective word combination in the current attribute segment comprises: calculating a fifth ratio between the first number and the second number; determining a first mean value of the structural coefficients of the words in the effective word combination in the current attribute segment with respect to the current core word, a second mean value of the structural coefficients of the words in the effective word combination in the current attribute segment with respect to all core words in the current attribute segment, and the standard deviation of all structural coefficients in the current attribute segment; and determining the structural priority of the effective word combination in the current attribute segment according to the fifth ratio, the first mean value, the second mean value, and the standard deviation.
  7. The method for extracting structured information from a business trip certificate based on a hybrid model according to claim 1, wherein after the effective word combination whose core word appears in the reference phrase is taken as a linked phrase of the reference phrase, the method further comprises: receiving a keyword input by a user; and, where the keyword is a core word, displaying all the structured information corresponding to that core word.
  8. The method for extracting structured information from a business trip certificate based on a hybrid model according to claim 1, wherein splitting the text sequence in the business trip certificate according to the attribute names and the word segmentation algorithm to obtain the segmented text sequence containing attribute segments comprises: acquiring the text sequence in the business trip certificate; splitting the text sequence according to the attribute names to obtain a plurality of attribute segments; and splitting each attribute segment based on the word segmentation algorithm to obtain the segmented text sequence containing attribute segments, wherein each attribute segment contains at least one word.
  9. A system for extracting structured information from a business trip certificate based on a hybrid model, comprising: a splitting module, configured to split a text sequence in the business trip certificate according to attribute names and a word segmentation algorithm to obtain a segmented text sequence containing attribute segments, wherein each attribute segment contains at least one word; and a determining module, configured to determine the core word of the current attribute segment according to the cosine similarity between the current word and the remaining words in the current attribute segment and in the other attribute segments; the determining module being further configured to determine the structural priority of each effective word combination in the current attribute segment according to the distance between the current word and the core word in the current attribute segment, the first length of the current attribute segment, the first number of core words in the current attribute segment, and the second number of all words in the current attribute segment; and the determining module being further configured to take the effective word combination corresponding to the maximum structural priority as a reference phrase and, where the core word of a remaining effective word combination appears in the reference phrase, to take the corresponding effective word combination as a linked phrase of the reference phrase, to obtain the structured information of the business trip certificate.
  10. A system for extracting structured information from a business trip certificate based on a hybrid model, comprising a processor and a memory, wherein the memory is configured to store a computer program executable on the processor, and the processor is configured to execute the program stored in the memory to implement the steps of the method for extracting structured information from a business trip certificate based on a hybrid model as claimed in any one of claims 1 to 8.
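
The selection step in claim 4 (descending sort, first-order differencing, and a cut at the largest gap) can be illustrated with a short sketch. This is only one possible reading, not the applicant's implementation: it assumes the "maximum value in the difference sequence" means the largest drop between neighbouring coefficients, and the function name `effective_combination` and the sample coefficients are hypothetical.

```python
from typing import Dict, List, Tuple


def effective_combination(coeffs: Dict[str, float]) -> List[Tuple[str, float]]:
    """Keep the words ranked above the largest drop in structural coefficient.

    `coeffs` maps each word of one attribute segment to its structural
    coefficient with respect to the segment's core word (assumed precomputed).
    """
    # Descending structural coefficient sequence.
    ranked = sorted(coeffs.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return ranked  # nothing to difference

    values = [value for _, value in ranked]
    # First-order difference, expressed as the drop between neighbours
    # (assumption: the claim's "maximum" refers to the largest drop).
    drops = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    # Position of the largest drop; ties resolve to the earliest position.
    cut = max(range(len(drops)), key=drops.__getitem__)
    # values[cut] is the larger of the two adjacent coefficients. The interval
    # runs from the sequence maximum (index 0) down to it, inclusive.
    return ranked[: cut + 1]


sample = {"passport": 0.95, "number": 0.90, "issued": 0.40, "by": 0.35}
print(effective_combination(sample))
```

With the sample coefficients, the largest drop lies between 0.90 and 0.40, so only "passport" and "number" are kept. How the structural coefficients themselves are computed from the third and fourth ratios is not fixed by the claims, so the sketch takes them as given.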

Description

Method and system for extracting structured information of business trip certificate based on hybrid model

Technical Field

The invention relates to the technical field of data processing, and in particular to a method and a system for extracting structured information from business trip certificates based on a hybrid model.

Background

As global exchange and business travel become increasingly frequent, business trip certificates such as passports, visas, identity cards and travel itineraries serve as key credentials for identity verification and travel confirmation, so extracting and managing their information efficiently is very important. Structuring certificate information means extracting the key data fields from the various certificates and storing them in a standard, organized, machine-readable format, which effectively avoids the registration delays caused by the slow circulation of paper records. However, business trip certificates of different types, or issued by different countries or institutions, differ significantly in layout: field positions are not fixed, text arrangement varies (for example, mixed horizontal and vertical text), and forms are mixed with free text. This diversity defeats traditional information extraction methods that depend on fixed templates or rules: techniques based on preset template matching and fixed field positioning generalize poorly and cannot adapt to the differing formats of the various certificates, whereas preliminary information recognition can be achieved by relying on adaptive layout analysis technology.
Because different business trip certificates differ in their information registration formats and field association logic, the extracted information is prone to crossing and mixing; in particular, the recognition accuracy for attributes of the same kind, such as dates and nationality codes, drops sharply. When the extracted information is split into independent structures, information with structural differences often overlaps with other fields, which produces wrong splitting results. As a result, the accuracy and reliability of the information extracted from business trip certificates are low and fail to meet the requirements that business travel services place on certificate information.

Disclosure of Invention

To solve the technical problems of low accuracy and low reliability in extracting information from business trip certificates, the invention aims to provide a method and a system for extracting structured information from business trip certificates based on a hybrid model.
To solve the above technical problems, the following technical solutions are adopted: In a first aspect, an embodiment of the invention provides a method for extracting structured information from a business trip certificate based on a hybrid model, comprising: splitting a text sequence in the business trip certificate according to attribute names and a word segmentation algorithm to obtain a segmented text sequence containing attribute segments; determining the core word of the current attribute segment according to the cosine similarity between the current word and the remaining words in the current attribute segment and in the other attribute segments; determining the structural priority of each effective word combination in the current attribute segment according to the distance between the current word and the core word in the current attribute segment, the first length of the current attribute segment, the first number of core words in the current attribute segment, and the second number of all words in the current attribute segment; and taking the effective word combination corresponding to the maximum structural priority as a reference phrase and, where the core word of a remaining effective word combination appears in the reference phrase, taking the corresponding effective word combination as a linked phrase of the reference phrase, to obtain the structured information of the business trip certificate.
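
As an illustration of the first step (splitting by attribute name, then segmenting each attribute segment into words), a minimal sketch follows. It rests on several assumptions that the text does not state: attribute names are known in advance and appear literally in the recognized text, a simple regular-expression tokenizer stands in for the unspecified word segmentation algorithm, and segments without any word are dropped. The function name `split_into_attribute_segments` and the sample text are hypothetical.

```python
import re
from typing import Dict, List


def split_into_attribute_segments(text: str, attribute_names: List[str]) -> Dict[str, List[str]]:
    """Split text at each attribute name, then tokenize each segment body."""
    if not attribute_names:
        return {}
    # Try longer names first so a long name is not shadowed by a shorter prefix.
    names = sorted(attribute_names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    matches = list(pattern.finditer(text))

    segments: Dict[str, List[str]] = {}
    for i, match in enumerate(matches):
        # A segment body runs from the end of this name to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        # Stand-in word segmentation (assumption): runs of word characters.
        words = re.findall(r"\w+", body)
        if words:  # each attribute segment must contain at least one word
            segments[match.group(0)] = words
    return segments


sample_text = "Name: ZHANG SAN Nationality: CHN Date of birth: 1990-01-02"
print(split_into_attribute_segments(sample_text, ["Name", "Nationality", "Date of birth"]))
```

For the sample text this yields three attribute segments keyed by "Name", "Nationality" and "Date of birth". A repeated attribute name would overwrite the earlier segment in this sketch; a real system would need a policy for that case, and for Chinese text a dedicated word segmenter would replace the regular expression.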
Preferably, determining the core vocabulary of the current attribute section according to the cosine similarity of the current vocabulary and the rest vocabulary in the current attribute section comprises determining the effective correlation coefficient of the current vocabulary in the current attribute section according to the cosine similarity between the vocabulary vector of the current vocabulary and the vocabulary vector of the rest vocabulary in the current attribute section, the maximum value of the cosine similarity of the current vocabulary in the current attribute section and the rest vocabulary in the other attribute section, the first len