CN-121982732-A - Image recognition-based scan file digital fine classification method
Abstract
The invention discloses a method for fine-grained digital classification of scanned archives based on image recognition. The method comprises: acquiring scanned archive images and performing layout correction and image enhancement to obtain normalized images; performing text recognition and layout structure understanding on the normalized images to form text results and structure results; constructing the corresponding generation prototype spaces by combining a category hierarchy and a business attribute system; performing category-constrained consistency scoring in the prototype spaces and generating evidence packages; and determining the category hierarchy and business attributes from the scores and evidence and outputting the fine classification result. By using reversible generation consistency evaluation based on image recognition and hierarchy constraints, the invention achieves fine-grained digital classification of scanned archives, with high classification stability, strong result traceability, and strong adaptability to business needs.
Inventors
- WAN FENG
Assignees
- 江西优樟生物科技股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260224
Claims (6)
- 1. An image recognition-based method for fine-grained digital classification of scanned archives, characterized by comprising the following steps: acquiring a scanned archive image, performing layout correction and image enhancement, and outputting a normalized scanned archive image; performing text recognition and structure understanding on the normalized scanned archive image to generate a text recognition result set and a structure understanding result set; constructing a category hierarchy generation prototype space set and a business attribute generation prototype space set based on the text recognition result set, the structure understanding result set, a preset category hierarchy, and a business attribute system; inputting the text recognition result set and the structure understanding result set into the category hierarchy generation prototype space set and the business attribute generation prototype space set, performing hierarchy-constrained reversible generation consistency scoring, and outputting a generation consistency score set and an evidence package set; and determining a category hierarchy result based on the generation consistency score set, determining a business attribute result based on the evidence package set, and generating a digital fine classification result.
- 2. The method for fine-grained digital classification of scanned archives based on image recognition according to claim 1, wherein generating the normalized scanned archive image specifically comprises: performing combined processing of page boundary determination, layout orientation correction, geometric distortion and deformation correction, and image enhancement on the acquired scanned archive image; eliminating imaging deviations caused by scanning background interference, layout skew, perspective distortion, and uneven brightness; and fixing the effective page area under the constraints of a uniform output scale and a uniform color space to form the normalized scanned archive image.
- 3. The method for fine-grained digital classification of scanned archives based on image recognition according to claim 1, wherein generating the text recognition result set and the structure understanding result set specifically comprises: performing text region localization on the normalized scanned archive image to obtain a text region set; performing text recognition on the text region set, decoding the character sequences in each text region using combined character-level and line-level modeling, and outputting a text recognition result set that corresponds to page spatial positions; performing layout structure understanding on the normalized scanned archive image to form the structure understanding result set; performing structural relation modeling on the structure understanding result set, constructing a structural relation set from the relative spatial positions, containment relations, and contextual continuation relations among page structural units, and aligning the structural relation set with the text line order relations in the text recognition result set to form a text-structure alignment result set; performing a consistency check on the text recognition result set and the structure understanding result set based on the text-structure alignment result set, and merging the text recognition results and structure understanding results that pass the check to form a fused feature set; and calculating a structural integrity score and a text coverage score for each page structural unit of the text-structure alignment result set to form a page-level structure abstract result set.
- 4. The method for fine-grained digital classification of scanned archives based on image recognition according to claim 1, wherein generating the category hierarchy generation prototype space set and the business attribute generation prototype space set specifically comprises: extracting, based on the text recognition result set, the structure understanding result set, and the text-structure alignment result set, the semantic elements and structural elements corresponding to the category hierarchy to form a writing element set of the category hierarchy generation prototype space set; generating credibility constraint labels for the structural elements in the category hierarchy generation prototype space set based on the page-level structure abstract result set; constructing the category hierarchy generation prototype space set based on its writing element set and the credibility constraint labels; extracting, based on the text recognition result set, the structure understanding result set, and the page-level structure abstract result set, the business attribute evidence elements corresponding to the business attribute system to form a writing element set of the business attribute generation prototype space set; constructing the business attribute generation prototype space set based on its writing element set; and index-binding the category hierarchy generation prototype space set and the business attribute generation prototype space set to establish a prototype space index mapping set.
- 5. The method for fine-grained digital classification of scanned archives based on image recognition according to claim 1, wherein generating the generation consistency score set and the evidence package set specifically comprises: constructing a consistency scoring input unit set based on the text recognition result set and the structure understanding result set, in combination with the prototype space index mapping set; performing prototype space forward reversible generation computation on the consistency scoring input unit set, based on the category hierarchy generation prototype space set and the business attribute generation prototype space set, to obtain a forward generation consistency intermediate result set; performing prototype space reverse reversible generation computation on the consistency scoring input unit set, based on the category hierarchy generation prototype space set and the business attribute generation prototype space set, to obtain a reverse generation consistency intermediate result set; generating the generation consistency score set based on the forward generation consistency intermediate result set and the reverse generation consistency intermediate result set; applying a hierarchy constraint to the generation consistency score set and performing path rollback or score suppression to obtain a hierarchy-constrained generation consistency score set; and generating the evidence package set based on the hierarchy-constrained generation consistency score set.
- 6. The method for fine-grained digital classification of scanned archives based on image recognition according to claim 1, wherein generating the digital fine classification result comprises: constructing a candidate judgment unit set based on the generation consistency score set and the evidence package set; performing category hierarchy candidate path screening on the candidate judgment unit set to obtain a category hierarchy candidate path result set; performing a hierarchy closure check on the category hierarchy candidate path result set to determine the category hierarchy result; performing business attribute evidence aggregation on the candidate judgment unit set to obtain a business attribute candidate result set; performing an evidence sufficiency judgment on the business attribute candidate result set to determine the business attribute result; and generating the digital fine classification result based on the category hierarchy result and the business attribute result.
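The hierarchy constraint of claim 5 (path rollback or score suppression) and the hierarchy closure check of claim 6 can be illustrated with a small sketch. The claims do not give concrete rules, so everything below is an assumption made for illustration: the category tree `PARENT`, the suppression rule (a child's score is capped by its lowest-scoring ancestor), the rollback rule (back off to the parent until a threshold is met), and the threshold value 0.6.

```python
# Hypothetical category hierarchy: child -> parent (None marks a root).
PARENT = {
    "personnel": None,
    "personnel/contract": "personnel",
    "personnel/contract/fixed_term": "personnel/contract",
    "finance": None,
    "finance/invoice": "finance",
}


def path_to_root(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def suppress_scores(scores):
    """Score suppression (assumed rule): cap each node's score by the
    lowest score on its path to the root, so a child can never outscore
    one of its ancestors."""
    return {
        node: min(scores.get(n, 0.0) for n in path_to_root(node))
        for node in scores
    }


def is_closed(node, scores):
    """Hierarchy closure check: every ancestor of the node was scored."""
    return all(n in scores for n in path_to_root(node))


def select_with_rollback(scores, threshold=0.6):
    """Path rollback (assumed rule): start at the best-scoring leaf and
    back off to the parent until the constrained score meets the
    threshold. Returns None when no node on the path qualifies."""
    constrained = suppress_scores(scores)
    leaves = [n for n in scores if n not in PARENT.values()]
    if not leaves:
        return None
    node = max(leaves, key=lambda n: scores[n])
    while node is not None:
        if constrained.get(node, 0.0) >= threshold and is_closed(node, scores):
            return node
        node = PARENT[node]
    return None


# Illustrative raw consistency scores (made-up numbers).
raw_scores = {
    "personnel": 0.9,
    "personnel/contract": 0.55,
    "personnel/contract/fixed_term": 0.85,
    "finance": 0.2,
    "finance/invoice": 0.3,
}
constrained_scores = suppress_scores(raw_scores)
selected = select_with_rollback(raw_scores)
```

In this example the leaf `personnel/contract/fixed_term` has a high raw score, but its parent scores only 0.55, so the leaf is suppressed to 0.55 and the selection rolls back to `personnel`, the deepest node whose whole path satisfies the threshold.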
Description
Image recognition-based scan file digital fine classification method
Technical Field
The invention relates to the technical field of archive digitization, and in particular to a method for fine-grained digital classification of scanned archives based on image recognition.
Background
Existing digital classification of scanned archives mostly combines image preprocessing with text recognition: text content is obtained through layout detection, text region localization, and optical character recognition; structural elements such as titles, tables, and signatures are extracted with the help of layout structure analysis or template rules; and the archive type is judged and filed according to those elements. Some schemes introduce confidence thresholds, region type constraints, or field rule checks to reduce the influence of noise and layout distortion on the recognition result, thereby enabling automated collection, retrieval, and management of paper archives. However, these techniques generally treat text recognition and structure understanding as parallel outputs and lack a unified organization and joint judgment mechanism across the category hierarchy dimension and the business attribute dimension. As a result, they are prone to parent-child conflicts in the category hierarchy, and to errors and instability caused by mutually contradictory evidence from multiple regions of the same page. Especially with low-quality scans, complex layouts, or structures that continue across pages, the prior art often lacks an executable mechanism in which hierarchy constraints trigger rollback and score suppression, and also lacks an evidence package structure with well-defined field closure that can be used for review and tracing, so the sufficiency of evidence behind a business attribute judgment is difficult to quantify and audit. Therefore, how to provide a method for fine-grained digital classification of scanned archives based on image recognition is a problem that those skilled in the art need to solve.
Disclosure of Invention
The invention aims to provide a method for fine-grained digital classification of scanned archives based on image recognition. The method uses reversible generation consistency evaluation based on image recognition and hierarchy constraints to achieve fine-grained digital classification of scanned archives, and has the advantages of high classification stability, strong result traceability, and strong adaptability to business needs. According to an embodiment of the invention, the method comprises the following steps: acquiring a scanned archive image, performing layout correction and image enhancement, and outputting a normalized scanned archive image; performing text recognition and structure understanding on the normalized scanned archive image to generate a text recognition result set and a structure understanding result set; constructing a category hierarchy generation prototype space set and a business attribute generation prototype space set based on the text recognition result set, the structure understanding result set, a preset category hierarchy, and a business attribute system; inputting the text recognition result set and the structure understanding result set into the category hierarchy generation prototype space set and the business attribute generation prototype space set, performing hierarchy-constrained reversible generation consistency scoring, and outputting a generation consistency score set and an evidence package set; and determining a category hierarchy result based on the generation consistency score set, determining a business attribute result based on the evidence package set, and generating a digital fine classification result.
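The scoring step above pairs a forward and a reverse check against each prototype and emits an evidence package alongside the score. The text here does not define how either direction is computed, so the following is only a sketch under stated assumptions: a prototype is modeled as a set of expected elements, the forward score is the fraction of the prototype found in the page, the reverse score is the fraction of the page explained by the prototype, and the two are combined with a harmonic mean. The element names and the field names of the evidence package are hypothetical.

```python
def consistency_score(observed, prototype):
    """Score one page against one prototype and build an evidence package.

    observed:  elements extracted from the page (text and structure).
    prototype: elements the category or business attribute is expected to show.
    All formulas here are illustrative assumptions, not the patented method.
    """
    observed, prototype = set(observed), set(prototype)
    matched = observed & prototype

    # Forward direction (assumed): how much of the prototype the page supports.
    forward = len(matched) / len(prototype) if prototype else 0.0
    # Reverse direction (assumed): how much of the page the prototype explains.
    backward = len(matched) / len(observed) if observed else 0.0

    # Harmonic mean: high only when both directions agree.
    total = forward + backward
    score = 0.0 if total == 0 else 2 * forward * backward / total

    # Evidence package: the fields a reviewer would need to retrace the score.
    evidence = {
        "matched": sorted(matched),
        "missing": sorted(prototype - observed),
        "unexplained": sorted(observed - prototype),
        "forward": forward,
        "backward": backward,
    }
    return score, evidence


# Hypothetical elements for a labor contract page and its prototype.
page_elements = {"title:labor contract", "field:employee name", "signature", "table"}
contract_prototype = {"title:labor contract", "field:employee name", "signature", "seal"}

score, evidence = consistency_score(page_elements, contract_prototype)
```

Keeping the missing and unexplained elements in the package is what would make a later evidence sufficiency judgment auditable: a reviewer can see which expected element (here the seal) was not found and which page element (the table) the prototype does not account for.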
Optionally, generating the normalized scanned archive image specifically comprises: performing combined processing of page boundary determination, layout orientation correction, geometric distortion and deformation correction, and image enhancement on the acquired scanned archive image; eliminating imaging deviations caused by scanning background interference, layout skew, perspective distortion, and uneven brightness; and fixing the effective page area under the constraints of a uniform output scale and a uniform color space to form the normalized scanned archive image. Optionally, generating the text recognition result set and the structure understanding result set specifically comprises: performing text region localization on the normalized scanned archive image to obtain a text region set; performing text recognition on the text region set, decoding the character sequences in each text region using combined character-level and line-level modeling, and outputting a text recognition result set that corresponds to page spatial positions; performing layout structure understanding on the normalized scanned archive image to form a structure un