CN-116306493-B - Method for extracting and restoring Chinese and English text and Arabic numerals in Uygur language PDF document


Abstract

The invention provides a method for extracting and restoring Chinese and English text and Arabic numerals in Uygur language PDF documents. PDFMiner is used to extract all element information from the PDF document, and all text content is then extracted from that element information. After the text content is grouped into rows, all text is typeset in right-to-left order. All text characters that run from left to right, such as Chinese and English characters and Arabic numerals, are screened out and ordered from left to right. Finally, all characters are inserted into the Word document in the corresponding format, so that the content of the reconstructed Uygur document is more consistent with the original document. The method automatically detects and extracts the left-to-right characters in a Uygur PDF document and inserts them at the corresponding positions in the DOCX file in left-to-right order, so that the typesetting of the target document stays as consistent as possible with the original document.
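The reordering idea summarized above can be illustrated with a minimal Python sketch. It is one reading of the method, not the patented implementation: the function names `is_ltr` and `restore_row` are hypothetical, the code point ranges are the ones listed in the description, and the input is assumed to be the characters of a single row already sorted by ascending x0. Each consecutive left-to-right run is reversed first, so that the later reversal of the whole row restores it to its normal order.

```python
# Code point ranges treated as left-to-right, taken from the description.
LTR_RANGES = [
    ("\u4e00", "\u9fff"),  # Chinese characters
    ("\u0021", "\u007e"),  # English (printable ASCII) characters
    ("\u00c0", "\u02af"),  # Latin characters
    ("\u1e00", "\u1eff"),  # Latin Extended Additional block
]


def is_ltr(c):
    """Return True if character c is displayed from left to right."""
    if c in "0123456789,.?":  # Arabic numerals and listed punctuation
        return True
    return any(lo <= c <= hi for lo, hi in LTR_RANGES)


def restore_row(visual_chars):
    """Convert one row from x0 (visual) order to reading order.

    Assumption: visual_chars holds the row's characters sorted by x0.
    """
    chars = list(visual_chars)
    i = 0
    while i < len(chars):
        if is_ltr(chars[i]):
            j = i
            while j < len(chars) and is_ltr(chars[j]):
                j += 1
            # Pre-reverse the left-to-right run (hypothetical text_unit).
            chars[i:j] = chars[i:j][::-1]
            i = j
        else:
            i += 1
    # Reverse the whole row: right-to-left characters end up in reading
    # order, and each pre-reversed run is restored to its normal order.
    return "".join(reversed(chars))


row = ["\u062a", "\u0627", "\u0628", "P", "D", "F", "\u0631", "\u062f"]
print(restore_row(row))
```

Note that the space character is not in any of the listed ranges, so in this sketch a space ends a left-to-right run.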

Inventors

DENG BIAO
ZHAI FEIFEI

Assignees

北京中科凡语科技有限公司

Dates

Publication Date: 20260505
Application Date: 20221228

Claims (5)

1. A method for extracting and restoring Chinese and English text and Arabic numerals in a Uygur language PDF document, comprising the following steps:
S1) inputting a PDF file and parsing its content with PDFMiner to obtain the information of all elements in the PDF;
S2) filtering the information of all elements in the PDF and extracting the information corresponding to text content, wherein an element whose element type is "char" corresponds to text content;
S3) after the text elements of all pages of the PDF document are obtained, merging them into text lines;
S4) if the characters are Uygur characters, reversing the arrangement order of all the Uygur characters; otherwise, the arrangement order of the Chinese characters, English characters and Arabic numerals is consistent with their x0 coordinate values in the PDF document, that is, these characters are arranged in their normal left-to-right order;
S41) detecting all characters in each row and finding the left-to-right characters that occur consecutively in each row;
S42) combining the consecutive left-to-right characters in each row into a text block text_unit, the text of the text block being the combination of all of its left-to-right characters;
S43) after the character content and the coordinate position of the text block are obtained, reversing the order of all characters in the text block;
S5) after the left-to-right characters have been combined into text blocks and the characters in each text block have been reversed, combining all the characters in each row to obtain the row text of each row and its coordinate position, wherein the coordinate position (l_x0, l_y0, l_x1, l_y1) of the row text is calculated as follows: the leftmost coordinate l_x0 of the row text is the x0 coordinate of the leftmost character; the bottom coordinate l_y0 is the most frequent y0 value among all characters in the row; the rightmost coordinate l_x1 is the x1 coordinate of the rightmost character in the row; and the top coordinate l_y1 is the most frequent y1 value among all characters in the row;
S6) after the row text of each row is obtained, reversing the order of all characters in each row text to obtain text lines in which all characters are ordered from right to left, these being the text lines of the Uygur document;
S7) inserting the text lines obtained in step S6) at the corresponding row coordinate positions in the DOCX document in right-to-left format;
wherein, when the text is merged in step S3), all characters in the different pages are first sorted by the coordinate y0 of the character bottom and are then divided into rows using the threshold alpha = 6, that is, two characters are in the same row if the difference between their y0 coordinates is less than 6;
the Uygur characters in each row parsed by PDFMiner in step S4) are ordered by their x0 values;
and the coordinate position (u_x0, u_y0, u_x1, u_y1) of a text block in step S42) is calculated as follows: the leftmost coordinate u_x0 is the x0 coordinate of the leftmost character in the text block; the bottom coordinate u_y0 is the most frequent y0 value among all characters in the text block; the rightmost coordinate u_x1 is the x1 coordinate of the rightmost character in the text block; and the top coordinate u_y1 is the most frequent y1 value among all characters in the text block.
2. The method for extracting and restoring Chinese and English text and Arabic numerals in a Uygur language PDF document according to claim 1, wherein step S1) uses a PDF file parsing system based on the PDFMiner library; the system parses all pictures, tables and text in the PDF document together with their coordinate positions on the page, and for text it obtains the size, color, font type and bold information.
3. The method for extracting and restoring Chinese and English text and Arabic numerals in a Uygur language PDF document according to claim 1, wherein in step S2) the following are obtained for all characters in the text: the character content, the coordinate position (x0, y0, x1, y1) of the character in the document, the font size, the color, whether the character is italic, and whether it is bold; x0 is the leftmost coordinate of the character, x1 the rightmost coordinate, y0 the bottom coordinate and y1 the top coordinate, and the coordinates are expressed in pixels.
4. The method for extracting and restoring Chinese and English text and Arabic numerals in a Uygur language PDF document according to claim 1, wherein, after the text block is reversed in step S43), all characters that are displayed from left to right in the Uygur PDF document are also ordered from right to left.
5. The method for extracting and restoring Chinese and English text and Arabic numerals in a Uygur language PDF document according to claim 1, wherein, when the characters are combined in step S5), each text block text_unit is treated as a single independent character.
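The row grouping and coordinate rules recited in claim 1 can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the patented implementation: the `Char` record and the functions `group_rows`, `most_common` and `bbox` are hypothetical names, the input is assumed to be the characters of a single page, each character is compared with the previously sorted character when the threshold is applied, and rows are sorted by descending y0 so that they come out top to bottom (PDF coordinates grow upward).

```python
from collections import Counter
from typing import NamedTuple


class Char(NamedTuple):
    """Hypothetical record for one extracted character."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


ALPHA = 6  # row threshold given in claim 1


def group_rows(chars, alpha=ALPHA):
    """Split the characters of one page into rows by their y0 values."""
    rows = []
    # Assumption: descending y0, so the top row of the page comes first.
    for ch in sorted(chars, key=lambda c: c.y0, reverse=True):
        # Assumption: compare with the previous character in sorted order.
        if rows and abs(ch.y0 - rows[-1][-1].y0) < alpha:
            rows[-1].append(ch)
        else:
            rows.append([ch])
    # Within each row, order the characters by x0 as in step S4.
    return [sorted(row, key=lambda c: c.x0) for row in rows]


def most_common(values):
    """Return the most frequent value (first seen wins on a tie)."""
    return Counter(values).most_common(1)[0][0]


def bbox(chars):
    """Coordinates of a row or text block: (x0, y0, x1, y1)."""
    return (
        min(c.x0 for c in chars),          # x0 of the leftmost character
        most_common(c.y0 for c in chars),  # most frequent bottom value
        max(c.x1 for c in chars),          # x1 of the rightmost character
        most_common(c.y1 for c in chars),  # most frequent top value
    )


page = [
    Char("a", 10, 100, 15, 110),
    Char("b", 16, 100, 21, 110),
    Char("c", 22, 102, 27, 112),
    Char("d", 10, 80, 15, 90),
]
rows = group_rows(page)
print([[c.text for c in row] for row in rows])
print(bbox(rows[0]))
```

Using the most frequent y0 and y1 rather than the minimum and maximum keeps a single raised or lowered character from stretching the box of the whole row.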

Description

Method for extracting and restoring Chinese and English text and Arabic numerals in Uygur language PDF document

Technical Field

The invention relates to computer algorithms and to the content analysis and reconstruction of PDF documents, and in particular to a method for extracting and restoring Chinese and English text and Arabic numerals in Uygur language PDF documents.

Background

PDF is one of the most widely used document formats at present. It is mainly used for file exchange, printing and the like, and cannot interact with other computer programs. With the wide application of PDF in fields such as finance, scientific research and education, automatically recognizing PDF documents, extracting useful data from them and reconstructing them as easily editable Word documents has become a focus of attention. A PDF document mainly comprises text, images, tables, formulas and similar content. Text is the main form of expression, so the quality with which it is restored has an important influence on the overall restoration of the PDF document. However, unlike Chinese and English documents, in which characters are arranged from left to right, the text characters in a Uygur PDF are arranged from right to left. Uygur PDF documents sometimes also contain characters that run from left to right, such as Chinese and English characters and Arabic numerals, and these cause some characters to appear out of order in the restored Word document, which reduces its readability. To address these problems, the invention focuses on how to extract text that runs from left to right in a Uygur PDF document, such as Chinese and English characters and Arabic numerals, and restore it in the Word document, so that the typesetting of the target document stays as consistent as possible with that of the original document.
Disclosure of Invention

In view of the above drawbacks of the prior art, the method of the invention addresses the problem that text such as Chinese and English characters and Arabic numerals cannot be effectively restored from Uygur language PDF documents. First, all element information in the Uygur PDF is extracted using PDFMiner; all text information is then extracted, and all Uygur text characters are typeset in right-to-left order. After typesetting, text characters that run from left to right, such as Chinese and English characters and Arabic numerals, are screened out and rearranged in left-to-right order. Finally, all characters are inserted into the Word document, so that the reconstructed Uygur document reproduces the original more faithfully.

To achieve the above and other related objects, the invention provides a method for extracting and restoring Chinese and English text and Arabic numerals in a Uygur language PDF document, comprising the following steps:
S1) inputting a PDF file and parsing its content with PDFMiner to obtain the information of all elements in the PDF;
S2) filtering the information of all elements in the PDF and extracting the information corresponding to text content, wherein an element whose element type is "char" corresponds to text content;
S3) after the text elements of all pages of the PDF document are obtained, merging them into text lines;
S4) if the characters are Uygur characters, reversing the arrangement order of all the Uygur characters; otherwise, the arrangement order of characters such as Chinese characters, English characters and Arabic numerals is consistent with their x0 coordinate values in the PDF document, that is, such characters are all arranged in the left-to-right direction and can be kept in their normal order;
S41) detecting all characters in each row and finding the left-to-right characters that occur consecutively in each row, wherein a character c is judged to be displayed from left to right if its encoding falls within one of the following ranges:
\u4e00 <= c <= \u9fff  # Chinese characters
\u0021 <= c <= \u007e  # English characters
\u00c0 <= c <= \u02af  # Latin characters
\u1e00 <= c <= \u1eff
If the character c is an Arabic numeral or is one of the characters in ",. ?", it is likewise a character displayed from left to right;
S42) combining the consecutive left-to-right characters in each row into a text block text_unit, the text of the text block being the combination of all of its left-to-right characters;
S43) after the character content and the coordinate position of the text block are obtain