CN-121615613-B - Medicine production data form processing method based on document analysis and HTML rendering
Abstract
The invention discloses a drug production data table processing method based on document parsing and HTML rendering, and relates to the technical field of document data processing. The method comprises: S1, collecting and preprocessing table identification data, position mapping data and expressive data to construct a standardized table state data set; S2, analyzing the closeness of the semantic association between nested tables and the paragraphs to which they belong, and adjusting the table attribution marking strategy; S3, evaluating the degree of structuring of the cells and reconstructing the logical hierarchical relationship of the table structure; S4, evaluating the information importance of the cells and adjusting their layout priority; and S5, generating structured records and docking them to the business processes of a manufacturing execution platform. The method addresses the lack of binding between document structure and business fields, which seriously hinders the construction of a unified data main line and business-driven flows.
Inventors
- TANG HONG
- LI JUN
- LI HAO
- ZHONG QIN
- JIANG SHUAI
- LUO WEI
Assignees
- 成都泓睿科技有限责任公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-30
Claims (7)
- 1. A drug production data table processing method based on document parsing and HTML rendering, characterized by comprising the following steps: S1, collecting table identification data, position mapping data and expressive data during table processing, preprocessing the collected table identification data, position mapping data and expressive data, and constructing a standardized table state data set; wherein the table identification data, position mapping data and expressive data are collected as follows: table identification data is collected during hierarchical division of the document structure, the table identification data comprising the row number, the column number, the horizontal column span, the vertical row span, the nesting layer number, the number of nested sub-table rows in the cell and the nested sub-table height; position mapping data is collected during construction of the table structure, the position mapping data comprising the table sequence number of the table to which a cell belongs, the paragraph number of the paragraph to which the table belongs, the section number of the section to which the paragraph belongs, the starting row number and ending row number of a merged cell in the vertical direction, and the starting column number and ending column number in the horizontal direction; expressive data is collected during style feature extraction, the expressive data comprising, for each cell, the font thickness, font size level, italic status, underline status, text foreground color and cell background color; S2, based on the standardized table state data set, performing association analysis on the structural attribution consistency and semantic association closeness between nested tables and the sections to which they belong, and dynamically adjusting the attribution marking strategy of tables within document sections based on the association analysis result; wherein the association analysis comprises: adding the number of nested sub-table rows in the current cell to the nested sub-table height, and dividing the result by the sum of the horizontal column span and one, to obtain the nested structure complexity; multiplying the nesting layer number of the cell at the left side of the current cell by one and by the nested structure complexity to obtain a hierarchical structure correction value; adding one to the paragraph number of the paragraph to which the current table belongs, and dividing the result by the sum of one and the section number of the section containing the current paragraph, to obtain the section attribution reliability ratio; adding the hierarchical structure correction value to the section attribution reliability ratio to obtain the nested paragraph association strength value of the current cell; S3, based on the standardized table state data set, evaluating the strength of the degree of structuring of the cells from the row and column positioning of the cells in the main table structure, and reconstructing the logical hierarchical relationship of the table structure based on the strength evaluation result; wherein the strength evaluation comprises: dividing the sum of the cell's row number and column number by the sum of the cell's vertical row span, its horizontal column span and one, to obtain a normalized position expression value; adding one to the table sequence number of the table to which the cell belongs and taking the natural logarithm of the result to obtain the text structure complexity; subtracting the starting row number from the ending row number of the merged cell in the vertical direction and adding one to obtain the row-direction merging span; subtracting the starting column number from the ending column number of the merged cell in the horizontal direction and adding one to obtain the column-direction merging span; adding the row-direction merging span to the column-direction merging span and dividing the result by two to obtain the merging structure complexity; multiplying the text structure complexity by the merging structure complexity and adding the normalized position expression value to obtain the structured attribute intensity value; S4, taking the association analysis result and the strength evaluation result as input, evaluating the priority of the information importance of the cells from the structural cascade relationship and the visual salient features, and adjusting the layout priority of the cells based on the priority evaluation result; S5, generating structured records based on the rendered table and docking them to the business processes of the manufacturing execution platform.
- 2. The drug production data table processing method based on document parsing and HTML rendering of claim 1, wherein preprocessing the collected table identification data, position mapping data and expressive data to construct the standardized table state data set comprises the following specific steps: for the table identification data, uniformly standardizing the row number and column number of each cell into integers and sorting them in order, normalizing the horizontal column span and vertical row span into positive integers, representing nesting relationships with a uniform level identifier in combination with the nesting layer number, and compressing the number of nested sub-table rows and the nested sub-table height into a fixed scale range by linear stretching; for the position mapping data, encoding and checking the table sequence number, paragraph number and section number, eliminating duplicate and missing values so that they become comparable integer values in a unified format, checking the validity of the interval formed by the starting and ending row numbers of each merged cell and of the interval formed by its starting and ending column numbers in the horizontal direction, and, if the starting number equals the ending number, retaining the merged cell as a single cell; for the expressive data, standardizing the font thickness information into a numerical expression through a mapping function to obtain the visual stroke thickness of the text, recorded as the font thickness intensity value; inputting the font size, font size level, italic status and underline status into a weight superposition model, dynamically assigning contribution weights using a multidimensional feature attention fusion mechanism, extracting the most discriminative visual combination expression using a sparsity-constrained saliency projection algorithm, applying linear embedding and output reconstruction to the fused visual combination expression via a generalized weighted transformation, and finally outputting the text strength scheduling value; and normalizing the standardized table identification data, position mapping data and expressive data to construct the standardized table state data set.
- 3. The drug production data table processing method based on document parsing and HTML rendering of claim 1, wherein dynamically adjusting the attribution marking strategy of tables within document sections based on the association analysis result comprises the following specific steps: comparing the nested paragraph association strength value of the current cell with association strength thresholds in real time, the association strength thresholds comprising a first strength threshold and a second strength threshold, the first strength threshold being higher than the second strength threshold; when the nested paragraph association strength value is greater than the first strength threshold, marking the current cell as a structure anchor point and injecting the structural meta-information of the current cell during HTML rendering; when the nested paragraph association strength value is greater than the second strength threshold and less than or equal to the first strength threshold, recording the paragraph number and structure path corresponding to the current cell, generating a unique anchor field that marks the cell position, and providing a field mapping reference for docking with LIMS and MES systems; when the nested paragraph association strength value is less than the second strength threshold, generating a warning that the position structure is uncertain, adding a structure-to-be-confirmed prompt in the rendering, marking the table sequence number, paragraph number and nesting layer number of the table containing the cell, and backtracking the table identification data of the cells adjacent to the current cell to determine the paragraph attribution of the current cell based on spatial adjacency and header context; averaging the nested paragraph association strength values of all cells across the whole table to obtain a mean nested paragraph association strength value; if the mean is higher than the first strength threshold, adopting a nesting-priority rendering template to strengthen the independent presentation of sub-tables; if the mean is higher than the second strength threshold and lower than or equal to the first strength threshold, adopting a mixed-structure rendering template to balance stability between the master-slave structure and the field layout; and if the mean is lower than the second strength threshold, retaining the table as a rendering of the original image.
- 4. The drug production data table processing method based on document parsing and HTML rendering of claim 1, wherein reconstructing the logical hierarchical relationship of the table structure based on the strength evaluation result comprises the following specific steps: obtaining the structured attribute intensity values of all cells, arranging the cells in descending order of structured attribute intensity value, and constructing a structured attribute intensity value sequence; in the structure restoration stage, processing the cell with the largest structured attribute intensity value first and then processing the remaining cells in descending order of structured attribute intensity value; in the compression and attribute injection stage, keeping cells whose structured attribute intensity value is greater than or equal to the attribute intensity threshold uncompressed and injecting their complete merge boundary attributes, while scaling cells whose structured attribute intensity value is below the attribute intensity threshold proportionally and injecting only their basic coordinate attributes; in the map rendering stage, mapping the structured attribute intensity values to grayscale pixel brightness values through a normalized linear algorithm and performing visual image enhancement based on the grayscale pixel brightness values.
- 5. The drug production data table processing method based on document parsing and HTML rendering of claim 1, wherein evaluating the priority of the information importance of the cells from the structural cascade relationship and the visual salient features, with the association analysis result and the strength evaluation result as inputs, comprises the following specific steps: adding the nested paragraph association strength value to the structured attribute intensity value and dividing the result by two to obtain the structure level basic intensity value; dividing the sum of the font thickness intensity value, the text strength scheduling value and the color significance intensity value by three and adding one to obtain the visual expression weight value; multiplying the structure level basic intensity value by the visual expression weight value to obtain the rearrangement priority value.
- 6. The drug production data table processing method based on document parsing and HTML rendering of claim 1, wherein adjusting the layout priority of the cells based on the priority evaluation result comprises the following steps: based on the calculated rearrangement priority values, rendering and arranging the cells in descending order of their rearrangement priority values; during arrangement, if the target position of the current cell is occupied by a cell with a lower rearrangement priority value, withdrawing the lower-priority cell from the arrangement sequence and returning it to the arrangement flow after the current cell has been inserted; retaining the cell expressive data for cells that meet any one of the following conditions: the font thickness value is greater than the standard thickness threshold, the italic status is marked as italic, or the RGB color value difference between the text foreground color and the cell background color exceeds the color value threshold; if the rearrangement priority value of a cell is lower than the rearrangement threshold, placing the cell in the structure buffer area; after the rendering arrangement is completed, determining whether any position deviation exists by comparing the position of each cell in the actual rendering result with its original row number and column number; if the rearrangement priority value of a deviated cell is greater than the mean rearrangement priority value of all cells in the same row, preferentially executing a position correction operation and re-rendering the cell in the corresponding vacant area in front of the target coordinate; if the rearrangement priority value of a deviated cell is less than or equal to the mean rearrangement priority value of all cells in the same row, filling in sequentially according to the original order; and finally generating an HTML table with a complete two-dimensional structure.
- 7. The drug production data table processing method based on document parsing and HTML rendering of claim 1, wherein generating structured records based on the rendered table and docking them to the business processes of the manufacturing execution platform comprises the following specific steps: using the HTML table with a complete two-dimensional structure as the structured data source, identifying the row number and column number information attached to each cell, establishing the mapping between field coordinates and cell content from the row and column numbers, and forming an original value extraction table indexed by field position; extracting the logical row structure of the table from the original value extraction table, combining the fields in the same row into a structured record, and generating a data format for the business system according to field meaning conversion rules; after the structured records are generated, invoking the field mapping template and finally docking the structured records to the relevant business process nodes in the enterprise-level manufacturing execution platform.
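The arithmetic recited in claims 1 and 5 can be sketched in a few lines of Python. This is an illustrative reading only, not the patentee's implementation: the `Cell` container and all field and function names are hypothetical, the hierarchical structure correction value follows the literal wording of claim 1 (layer number times one times complexity), which may not be the intended formula, and the column-direction span is assumed to use the starting and ending column numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical container; field names are illustrative, not from the patent."""
    row: int                  # row number
    col: int                  # column number
    row_span: int             # vertical row span
    col_span: int             # horizontal column span
    left_nesting_layers: int  # nesting layer number of the cell at the left side
    nested_rows: int          # number of nested sub-table rows in the cell
    nested_height: float      # nested sub-table height
    table_seq: int            # table sequence number
    paragraph_no: int         # paragraph number of the owning paragraph
    section_no: int           # section number of the owning section
    start_row: int            # merged cell: starting row number
    end_row: int              # merged cell: ending row number
    start_col: int            # merged cell: starting column number
    end_col: int              # merged cell: ending column number


def association_strength(c: Cell) -> float:
    """S2: nested paragraph association strength value of a cell."""
    nested_complexity = (c.nested_rows + c.nested_height) / (c.col_span + 1)
    # Assumption: literal reading of the claim (layer number x 1 x complexity).
    correction = c.left_nesting_layers * 1 * nested_complexity
    section_ratio = (c.paragraph_no + 1) / (c.section_no + 1)
    return correction + section_ratio


def structured_intensity(c: Cell) -> float:
    """S3: structured attribute intensity value of a cell."""
    position = (c.row + c.col) / (c.row_span + c.col_span + 1)
    text_complexity = math.log(c.table_seq + 1)  # natural logarithm
    row_merge_span = c.end_row - c.start_row + 1
    col_merge_span = c.end_col - c.start_col + 1
    merge_complexity = (row_merge_span + col_merge_span) / 2
    return text_complexity * merge_complexity + position


def rearrangement_priority(assoc: float, intensity: float,
                           font_thickness: float, text_strength: float,
                           color_significance: float) -> float:
    """Claim 5: rearrangement priority value from structure and style scores."""
    base = (assoc + intensity) / 2
    visual_weight = (font_thickness + text_strength + color_significance) / 3 + 1
    return base * visual_weight
```

Because the visual expression weight value is at least one whenever the three style scores are non-negative, style can only raise a cell's priority relative to its structure level basic intensity value, never lower it.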
Description
Medicine production data form processing method based on document analysis and HTML rendering
Technical Field
The invention relates to the technical field of document data processing, and in particular to a drug production data table processing method based on document parsing and HTML rendering.
Background
As digital office scenarios become more widespread, structured documents play an increasingly prominent role in drug production management, quality control and batch traceability. In particular, during the generation, editing and transmission of table data, accurately identifying key fields, preserving their structural attributes and achieving high-quality rendering has become a key link in building an electronic drug file system. Current mainstream table processing approaches generally rely on a general-purpose text parsing engine or template-matching recognition logic, use OCR or rule-based extraction algorithms to recognize cell content and locate table outlines, and then reconstruct and render the table from row and column index information. Some more advanced schemes also incorporate machine learning to improve the model's adaptability to different table styles, and thereby the accuracy and generalization of structure recognition. For example, the invention patent with publication number CN105630916B relates to a method for extracting and organizing unstructured table document data in a big data environment.
That method first analyzes the structural characteristics and data flow characteristics of unstructured table documents and defines data extraction rules; second, it provides a data extraction flow and extraction algorithm for unstructured table documents; third, it provides an organization method for converting the extraction result into structured data; and finally, it provides a method for analyzing the resulting structured data set based on the MapReduce parallel programming model. The method can provide technical support for mining the knowledge hidden in unstructured table documents in a big data environment. The invention patent with announcement number CN103955497B relates to a data collection and summarization method for custom forms, comprising the following steps: the database obtains the attribute data of the custom form and stores it in a first data table of the database; the data that a filling end enters in the data-filling unit corresponding to the custom form is received; the filling record of the filling end is stored in a second data table of the database; the filled-in data is stored on a server as a text file; the relevant filling records are extracted from the second data table of the database; and the contents of the filled-in data are extracted according to the file names of all the filled-in data and processed into a form. That invention can provide users with a data collection and summarization service for custom forms: the initiator only needs to set up the form as required, the fillers fill it in accordingly, and the system can then automatically generate, on request, a summary list of all the data entered by the fillers. The format of the data entered by the fillers is not restricted, so flexibility is high.
However, when existing methods process drug production tables that are highly nested, contain merged cells spanning multiple cells, or carry diverse style attributes, problems such as insufficient parsing granularity, missing structural semantics and distorted style restoration remain. On the one hand, most parsing logic fails to fully consider implicit attributes such as the table's structural level within the document, cross-paragraph attribution and logical binding, so the original expressive intent of the table is difficult to recover during HTML rendering. On the other hand, visual style information such as font thickness, color and border line type is often treated as an additional attribute rather than a structural feature and is therefore ignored or weakened, so the end user gets no visual guidance to structural differences. In addition, existing structure parsing generally adopts fixed rules and static mapping strategies, lacks adaptive judgment and weight modeling of the diverse structural features in a document, and struggles to support dynamic layout, field identification and semantically linked expression in complex drug production scenarios. In view of the above, a drug production data table processing method based on document parsing and HTML rendering is needed.
Disclosure of Invention
Technical problem to be solved
In view of the defects of the prior art, the invention provides a medicine production data table processing meth