CN-121982738-A - Hybrid-expert multi-modal form identification method
Abstract
The invention discloses a hybrid-expert multi-modal form identification method, belonging to the field of computer vision and data processing. The method comprises an expert-model establishment stage, a training stage and an application stage. The expert model comprises an OCR character recognition model, a table structure segmentation model and a picture target detection model. The training stage sequentially performs data preparation, data preprocessing and parallel training of the expert models; the application stage sequentially performs input of the to-be-processed form picture into the trained expert models, parallel inference by the mixed expert models, a post-processing and fusion engine, and digital form output. The method requires neither multi-tool step-by-step processing nor manual supplementary entry and correction, can accurately restore the multi-modal content of semiconductor quality report forms, offers high processing efficiency and recognition precision, and effectively supports quality-report analysis needs in the digital transformation of semiconductor enterprises.
Inventors
- LIN SHANGJUN
- PAN BINGZHEN
- BAI XIAOYAN
- ZHAO YONGKAI
- GUAN JIAN
Assignees
- 无锡智现未来科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-17
Claims (10)
- 1. A hybrid-expert multi-modal form identification method, characterized by comprising an expert-model establishment stage, a training stage and an application stage; the expert model comprises three independent models, namely an OCR character recognition model, a table structure segmentation model and a picture target detection model; the training stage sequentially performs data preparation, data preprocessing and parallel training of the expert models, and the application stage sequentially performs input of the to-be-processed form picture into the trained expert models, parallel inference by the mixed expert models, a post-processing and fusion engine, and digital form output.
- 2. The hybrid-expert multi-modal form identification method according to claim 1, wherein the OCR character recognition model is built with a CNN convolutional encoder to extract image features and an attention-based sequence decoder to generate the text sequence; the table structure segmentation model is built as a UNet++-based multi-task segmentation network, in which the encoder outputs multi-scale features and the decoder fuses them via skip connections; the picture target detection model is built on a YOLOv-series base architecture and consists of a backbone network, a feature fusion network and a detection head network.
- 3. The hybrid-expert multi-modal form identification method according to claim 2, wherein the data preparation in the training stage extracts tables from semiconductor historical reports by a combination of manual work and tools to obtain raw image data and constructs a corresponding annotation data set; the annotation data comprise table structure information, text content information, text position information and cell picture position information; the table structure information comprises the number of table rows and columns, table line frames, row-column separators and merged-cell structures; the text position information comprises the coordinates of the text blocks and of their cells; and the cell picture position information comprises the bounding-box range of pictures inside cells.
- 4. The hybrid-expert multi-modal form identification method according to claim 3, wherein the data preprocessing step of the training stage comprises cleaning, enhancing and formatting the raw data: cleaning eliminates blurred or incomplete invalid form data; enhancement expands data diversity through image rotation, scaling and noise reduction; and formatting converts the data into a unified format adapted to expert-model training.
- 5. The hybrid-expert multi-modal form identification method according to claim 4, wherein the parallel training of the expert models in the training stage uses a distributed training framework and the preprocessed training data set to train the three expert models independently and simultaneously, specifically: the OCR character recognition model learns to recognize the character content in the form and its corresponding coordinate information; the table structure segmentation model learns to identify table grid frames, row-column separators and merged-cell structural features; and the picture target detection model learns to locate the position coordinates of pictures in table cells.
- 6. The hybrid-expert multi-modal form identification method according to claim 5, wherein the OCR character recognition model is trained with a three-stage learning strategy: the first stage uses a larger learning rate for fast convergence, the second stage uses a reduced learning rate for fine-tuning, and the third stage uses a small learning rate with cosine-annealing learning-rate scheduling to stabilize the model parameters.
- 7. The hybrid-expert multi-modal form identification method according to claim 6, wherein the table structure segmentation model is trained with a weighted combination loss function and a four-stage training strategy; the weighted combination loss uses a Dice loss to guarantee the integrity of segmented regions, a Focal loss to address class imbalance, a binary cross-entropy loss to provide stable gradients, and a Lovász loss to optimize segmentation boundary quality; the four-stage training strategy trains the basic feature extraction network in the first stage, focusing on table region detection, unfreezes part of the network layers in the second stage and adds the table-line and cell detection tasks, adds the row-column structure identification task in the third stage, and jointly fine-tunes all tasks in the fourth stage, with different learning rates and data-augmentation strengths set for each stage.
- 8. The hybrid-expert multi-modal form identification method according to claim 7, wherein the picture target detection model is trained with a classification loss that uses the Varifocal Loss function in place of the conventional cross-entropy loss function.
- 9. The hybrid-expert multi-modal form identification method according to claim 8, wherein the parallel inference of the expert models in the application stage inputs the to-be-processed picture-format form into the three trained expert models simultaneously: the OCR character recognition model outputs the recognized character content and corresponding coordinates, the table structure segmentation model outputs the row-column distribution and merged-cell structure information of the table, and the picture target detection model outputs the position coordinate information of pictures in cells.
- 10. The hybrid-expert multi-modal form identification method according to claim 9, wherein the processing steps of the post-processing and fusion engine in the application stage comprise alignment and matching, structure reconstruction, and picture embedding: alignment and matching accurately fills the text content into the corresponding table cells based on the coordinate information output by each model, by computing the coordinate overlap between text blocks and cells; structure reconstruction combines the row, column and merged-cell information, adjusts the grid structure of the table, and restores the complete table structure; and picture embedding establishes the association between detected pictures and their corresponding positions in the table.
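The coordinate-overlap matching described in claim 10 can be sketched as follows. This is a minimal illustration, not the patent's implementation: box format `(x0, y0, x1, y1)` and the helper names are assumptions, and overlap is measured as intersection area over the text block's own area.

```python
# Hypothetical sketch of claim-10 alignment: fill each recognized text
# block into the table cell whose bounding box overlaps it most.
# Box format (x0, y0, x1, y1) is an assumption, not from the patent.

def overlap_ratio(text_box, cell_box):
    """Intersection area divided by the text box's own area."""
    tx0, ty0, tx1, ty1 = text_box
    cx0, cy0, cx1, cy1 = cell_box
    ix0, iy0 = max(tx0, cx0), max(ty0, cy0)
    ix1, iy1 = min(tx1, cx1), min(ty1, cy1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    text_area = (tx1 - tx0) * (ty1 - ty0)
    return inter / text_area if text_area else 0.0

def assign_text_to_cells(text_blocks, cells, min_overlap=0.5):
    """Map each (text, box) pair to the index of the best-overlapping cell."""
    assignment = {}
    for text, tbox in text_blocks:
        scores = [overlap_ratio(tbox, cbox) for cbox in cells]
        best = max(range(len(cells)), key=lambda i: scores[i])
        if scores[best] >= min_overlap:
            assignment.setdefault(best, []).append(text)
    return assignment
```

A text block fully inside a cell scores 1.0 for that cell and 0.0 for non-overlapping cells, so ties and partial overlaps are resolved toward the cell that contains most of the text.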
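The three-stage learning-rate strategy of claim 6 can be sketched as a piecewise schedule. The concrete rate values and equal stage lengths here are assumptions for illustration only; the patent does not specify them.

```python
import math

# Illustrative three-stage schedule (claim 6): a large rate for fast
# convergence, a reduced rate for fine-tuning, then a small rate with
# cosine annealing. Rate values and stage boundaries are assumed.

def lr_at(step, total_steps, stage_lrs=(1e-3, 1e-4, 1e-5)):
    """Learning rate at a given step over three equal-length stages."""
    stage_len = total_steps // 3
    if step < stage_len:                 # stage 1: fast convergence
        return stage_lrs[0]
    if step < 2 * stage_len:             # stage 2: fine-tuning
        return stage_lrs[1]
    # stage 3: cosine annealing from stage_lrs[2] toward zero
    t = (step - 2 * stage_len) / max(1, total_steps - 2 * stage_len)
    return stage_lrs[2] * 0.5 * (1 + math.cos(math.pi * t))
```

In stage 3 the rate starts at the small base value and decays smoothly to near zero, which matches the claim's goal of stabilizing the final model parameters.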
Description
Mixed expert multi-modal form identification method
Technical Field
The invention relates to the field of computer vision and data processing, and in particular to a hybrid-expert multi-modal table recognition method.
Background
In recent years, the "AI + semiconductor" wave has driven semiconductor enterprises to accelerate full-link digital transformation. Quality control is the core link guaranteeing production precision, and the semiconductor quality reports it produces have become the key data carrier for enterprises to build internal knowledge bases, trace defects and optimize production flows. More than 80% of the key information in these reports (such as wafer numbers, test parameters, defect types and corresponding micrographs) is presented in multi-modal tables in picture format; only when such a table carries text data and defect pictures simultaneously can it fully support quality analysis. Converting these tables into editable, retrievable digital tables is therefore a core application requirement for semiconductor enterprises to realize intelligent quality control.
However, existing schemes for processing such tables have obvious defects. Traditional OCR technology can only recognize the text content in a table; it often misjudges the defect pictures in cells as blurred garbled characters or misses them entirely, so the visual basis for quality tracing is lost. Mainstream open-source table recognition tools can segment the basic table structure but lack a cell-picture detection module, and because their training data consist mainly of general-purpose document tables, their suitability for merged cells and industry terminology in semiconductor scenes (such as Bonding Defect and Wafer ID) is extremely poor: merged-cell recognition accuracy is below 60%, and the terminology recognition error rate exceeds 15%. The manual entry and manual mapping approach adopted by some enterprises is extremely inefficient and prone to human input errors, such as a Wafer ID being recorded as 517 or pictures pasted into the wrong cells, which distort historical data and seriously hinder enterprise digital transformation. Existing schemes therefore cannot meet the semiconductor industry's strict requirements for table recognition accuracy and processing efficiency, and a breakthrough in accurate table recognition is needed.
Disclosure of Invention
The invention aims to overcome the defects in the background art and provides a hybrid-expert multi-modal table recognition method with the following advantages: it does not rely on traditional OCR (optical character recognition), which is limited to recognizing table text only; it requires no additional manual correction of the table structure; each recognition model is focused, with clear logic; processing speed is high and multi-modal recognition precision is high; it can accurately recognize and restore the structure of a table in a semiconductor quality report, the text content in the table and the picture information in cells; after training on a dedicated semiconductor-domain data set, the model generalizes well in this scene; and problems such as incomplete table information recognition and structure restoration errors are effectively avoided. To solve the above technical problem, the invention adopts the following technical scheme: a hybrid-expert multi-modal form identification method comprising an expert-model establishment stage, a training stage and an application stage; the expert model comprises three independent models, namely an OCR character recognition model, a table structure segmentation model and a picture target detection model; the training stage sequentially performs data preparation, data preprocessing and parallel training of the expert models, and the application stage sequentially performs input of the to-be-processed form picture into the trained expert models, parallel inference by the mixed expert models, a post-processing and fusion engine, and digital form output.
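The application-stage flow above can be sketched as three experts running in parallel on the same form image, with their outputs merged by a fusion step. The expert functions below are stand-in stubs returning fixed sample outputs; the patent's actual models are an OCR recognizer, a UNet++-style table segmenter, and a YOLO-style picture detector, and the output field names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub experts standing in for the three trained models. The returned
# dictionaries and their keys are illustrative assumptions.

def ocr_expert(image):
    return {"texts": [("Wafer ID", (10, 10, 60, 30))]}

def structure_expert(image):
    return {"cells": [(0, 0, 100, 50)], "merges": []}

def picture_expert(image):
    return {"pictures": [(120, 5, 190, 45)]}

def infer_form(image):
    """Run all three experts in parallel and merge their outputs."""
    experts = (ocr_expert, structure_expert, picture_expert)
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(expert, image) for expert in experts]
        results = [f.result() for f in futures]
    fused = {}
    for result in results:   # post-processing/fusion engine merges outputs
        fused.update(result)
    return fused
```

In a real deployment each expert would be a model inference call, and the fusion step would be the alignment, structure reconstruction and picture embedding described in claim 10 rather than a plain dictionary merge.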
Further, the OCR character recognition model is built with a CNN convolutional encoder that extracts image features and an attention-based sequence decoder that generates the text sequence; the table structure segmentation model is built as a UNet++-based multi-task segmentation network in which the encoder outputs multi-scale features and the decoder fuses them via skip connections; the picture target detection model is built on a YOLOv-series base architecture and consists of a backbone network, a feature fusion network and a detection head network. Further, the data preparation method in the training stage is to extract a table from a semiconductor historical report in a mode of combi