CN-122019482-A - Document classification method and system based on Transformer model, and electronic device

Abstract

The invention relates to a document classification method and system based on a Transformer model, and to an electronic device. In the method, classified document samples are first obtained, a folder is created for each journal name, and the documents are stored in the corresponding folders. Each document is then converted to images, regions are cropped from it, and the cropped regions are integrated into a third picture. Visual features are extracted from the third pictures to construct a visual feature data set and a journal category label set, which are used to train the classification model. Finally, a document to be classified undergoes the same imaging, cropping, and integration operations, its features are extracted, the model classifies it, and the document is archived to the corresponding journal folder. Compared with the prior art, the method offers advantages such as high classification accuracy and strong robustness.
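
As a non-limiting illustration, the imaging, cropping, and integration operations might be sketched as follows. The sketch assumes each page has already been rasterized into a grayscale grid of 0-255 values; PDF rendering itself is not shown. The 30% to 50% crop range and the stitch, resize, and normalize sequence follow claims 3 and 4. The function names, the default fraction of 0.4, the 224 x 224 target size, and the nearest-neighbor resize are assumptions made for illustration and are not specified by the source.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale raster: rows of 0-255 pixel values


def crop_top(page: Image, fraction: float) -> Image:
    """Return the top `fraction` of the page rows (header region)."""
    rows = max(1, round(len(page) * fraction))
    return [row[:] for row in page[:rows]]


def crop_bottom(page: Image, fraction: float) -> Image:
    """Return the bottom `fraction` of the page rows (footer region)."""
    rows = max(1, round(len(page) * fraction))
    return [row[:] for row in page[-rows:]]


def stitch_vertical(top: Image, bottom: Image) -> Image:
    """Place `top` directly above `bottom`; both must share a width."""
    if len(top[0]) != len(bottom[0]):
        raise ValueError("pictures must have the same width")
    return top + bottom


def resize_nearest(img: Image, height: int, width: int) -> Image:
    """Nearest-neighbor resize (an assumed, deliberately simple method)."""
    src_h, src_w = len(img), len(img[0])
    return [
        [img[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def normalize(img: Image) -> List[List[float]]:
    """Scale pixel values from 0-255 into the range 0.0-1.0."""
    return [[pixel / 255.0 for pixel in row] for row in img]


def build_third_picture(
    first_page: Image,
    last_page: Image,
    fraction: float = 0.4,              # assumed default inside the 30%-50% range
    size: Tuple[int, int] = (224, 224),  # assumed model input size (height, width)
) -> List[List[float]]:
    """Crop header and footer regions, stitch them, resize, and normalize."""
    if not 0.3 <= fraction <= 0.5:
        raise ValueError("fraction must lie between 0.3 and 0.5")
    first_picture = crop_top(first_page, fraction)
    second_picture = crop_bottom(last_page, fraction)
    third_picture = stitch_vertical(first_picture, second_picture)
    return normalize(resize_nearest(third_picture, size[0], size[1]))
```

Applying the same function to training samples and to documents awaiting classification keeps the two paths consistent, which is what the method requires of the final classification step.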

Inventors

Cai Pengqing
Zhao Li

Assignees

Shanghai University of Engineering Science (上海工程技术大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-12

Claims (10)

1. A document classification method based on a Transformer model, the method comprising the steps of:
   S1, acquiring classified document sample files, and storing each document sample file in a folder corresponding to the journal to which it belongs, wherein the name of the folder corresponds to the name of the journal;
   S2, performing imaging processing on each document sample file, cropping the part of the first page containing page header features as a first picture, cropping the part of the last page containing page footer features as a second picture, and integrating the first picture and the second picture to generate a third picture;
   S3, extracting visual feature information from all the third pictures to form a visual feature data set, wherein the journal categories corresponding to the third pictures form a category label set;
   S4, training a Transformer classification model using the visual feature data set and the category label set;
   S5, acquiring a document file to be classified, generating a corresponding third picture using the same imaging processing, cropping, and integration as in S2, extracting visual feature information from that third picture, classifying the document according to the extracted visual feature information using the trained Transformer classification model, outputting the journal to which the document file belongs, and archiving the classified document file to the folder corresponding to that journal.
2. The document classification method based on a Transformer model according to claim 1, wherein the first picture and the second picture are captured in S2 as follows: the upper half of the first page is cropped as the first picture, and the lower half of the last page is cropped as the second picture.
3. The document classification method based on a Transformer model according to claim 2, wherein cropping the upper half of the first page as the first picture and cropping the lower half of the last page as the second picture specifically means: starting from the top edge of the first page, taking a region whose height is 30% to 50% of the total page height as the first picture; and starting from the bottom edge of the last page, taking a region whose height is 30% to 50% of the total page height as the second picture.
4. The document classification method based on a Transformer model according to claim 1, wherein integrating the first picture and the second picture in S2 to generate the third picture specifically comprises: stitching the first picture and the second picture vertically, with the first picture directly above the second picture; standardizing the size of the stitched picture by resizing it to the preset pixel size required by the input layer of the Transformer classification model; and normalizing the pixel values of the standardized picture.
5. The method according to claim 1, wherein the visual feature information in S3 includes the text layout of the header, the text layout of the footer, font style, layout structure, and identifying elements.
6. The document classification method based on a Transformer model according to claim 1, wherein step S3 further comprises performing data augmentation on the third pictures before extracting visual feature information from all the third pictures; the data augmentation includes horizontal flipping, random rotation, and color jitter.
7. The document classification method based on a Transformer model according to claim 6, wherein the process of training the Transformer classification model further comprises introducing data augmentation ablation experiments and strategy optimization, specifically: constructing multiple groups of data augmentation ablation experiments, with the complete strategy containing all data augmentation operations set as the baseline group, and with experimental groups that remove horizontal flipping, random rotation, or color jitter in turn; comparing the convergence of the loss curves and the trends of the accuracy curves of each experimental group and the baseline group during training and validation, evaluating the classification effect of each experimental group, quantifying the contribution of each data augmentation operation to the classification effect, and obtaining a contribution ranking; and executing an optimization strategy according to the contribution ranking and the classification effect, wherein augmentation operations whose contribution is high and meets a preset contribution threshold are retained or strengthened, and, for journal categories with poor classification effect, operations are executed that include supplementing document sample files, adjusting the attention mechanism weights of the Transformer classification model, or optimizing the picture cropping region.
8. The document classification method based on a Transformer model according to claim 7, wherein the contribution of different visual features to the classification effect is evaluated using multidimensional evaluation indices; the multidimensional evaluation indices comprise accuracy, recall, and the F1 value, and the process of training the Transformer classification model iterates parameters with maximizing the F1 value on the validation set as the objective.
9. A document classification system based on a Transformer model, characterized in that the system operates by applying the document classification method based on a Transformer model according to any one of claims 1-8, and the system comprises a sample management module, an image generation module, a feature construction module, a model training module, and a classification and archiving module; the sample management module is used for acquiring classified document sample files and storing each document sample file in the folder corresponding to its journal; the image generation module is used for performing imaging processing on each document sample file, cropping the part of the first page containing page header features as a first picture, cropping the part of the last page containing page footer features as a second picture, and integrating the first picture and the second picture to generate a third picture; the feature construction module is used for extracting visual feature information from all the third pictures to form a visual feature data set; the model training module is used for training a Transformer classification model using the visual feature data set and the category label set; the classification and archiving module is used for taking the output obtained after a document file to be classified has been processed in turn by the image generation module and the feature construction module, inputting that output into the trained Transformer classification model for document classification, outputting the journal to which the document file belongs, and archiving the classified document file to the folder corresponding to that journal.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-8 when executing the program.
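
As a non-limiting illustration of the ablation comparison in claims 7 and 8, the sketch below scores a set of predictions with a macro-averaged F1 value and ranks each augmentation by how much the score drops when that augmentation is removed. The source does not say how the contribution is quantified or how F1 is averaged across journals, so the difference "baseline F1 minus ablated F1", the macro averaging, the threshold comparison, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Average the per-journal F1 values over all journals seen (assumed macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def rank_contributions(
    baseline_f1: float, ablated_f1: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Rank augmentations by the assumed contribution: baseline F1 minus ablated F1.

    `ablated_f1` maps the name of the removed augmentation to the validation
    F1 value of the experimental group trained without it.
    """
    contributions = {name: baseline_f1 - f1 for name, f1 in ablated_f1.items()}
    return sorted(contributions.items(), key=lambda item: item[1], reverse=True)


def augmentations_to_keep(
    ranking: List[Tuple[str, float]], threshold: float
) -> List[str]:
    """Keep augmentations whose contribution meets the preset threshold."""
    return [name for name, contribution in ranking if contribution >= threshold]
```

In use, `macro_f1` would be computed on the validation set for the baseline group and for each experimental group, and the resulting values passed to `rank_contributions`. An augmentation whose removal barely changes the score ranks last and falls below the threshold.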

Description

Document classification method and system based on Transformer model, and electronic device

Technical Field

The invention relates to the technical field of document classification, and in particular to a document classification method and system based on a Transformer model, and to an electronic device.

Background

As the volume of digitized academic literature continues to grow, large numbers of journal papers in PDF format are stored in scattered locations. Manual classification and organization is inefficient and error-prone, and it struggles to meet the demands of efficient retrieval and management in scenarios such as academic research and document management. Existing document classification methods mostly rely on keyword matching against text content or on extraction of a single feature, and they ignore the distinctive visual features of journal paper formats (for example, on the first or last page), so classification accuracy is insufficient and generalization is weak. For example, Chinese patent application publication No. CN119760138A discloses an automatic classification method for journal documents based on random forests. That method extracts the text of the first page of a PDF document, converts the text into numerical features with a TF-IDF vectorizer, and feeds the features to a random forest model for classification.
However, that technical scheme has obvious limitations. First, its core depends on extracting and semantically analyzing text content, so for scanned, image-only PDFs or documents whose text layer is garbled, effective features are difficult to extract, which limits the method's range of application. Second, it attends only to the text statistics of the first page and completely ignores the document's visual format information (such as a journal's distinctive page header design, font style, and layout structure) and the important information on the last page (which usually includes key features such as copyright information and the publication date). Because the first-page and last-page visual features of a document are not modeled comprehensively, when documents come from different journals in similar subject areas, processing text features with a random forest alone often cannot distinguish them with high precision. In summary, current document classification techniques rely too heavily on text semantic extraction, neglect the distinctive first-page and last-page visual layout features of journals, and have difficulty handling documents without a text layer, which results in insufficient classification accuracy and generalization.

Disclosure of Invention

The invention aims to overcome the above defects of the prior art by providing a document classification method and system based on a Transformer model, and an electronic device.
The aim of the invention can be achieved by the following technical scheme:

According to one aspect of the present invention, there is provided a document classification method based on a Transformer model, wherein the method comprises the steps of:
   S1, acquiring classified document sample files, and storing each document sample file in a folder corresponding to the journal to which it belongs, wherein the name of the folder corresponds to the name of the journal;
   S2, performing imaging processing on each document sample file, cropping the part of the first page containing page header features as a first picture, cropping the part of the last page containing page footer features as a second picture, and integrating the first picture and the second picture to generate a third picture;
   S3, extracting visual feature information from all the third pictures to form a visual feature data set, wherein the journal categories corresponding to the third pictures form a category label set;
   S4, training a Transformer classification model using the visual feature data set and the category label set;
   S5, acquiring a document file to be classified, generating a corresponding third picture using the same imaging processing, cropping, and integration as in S2, extracting visual feature information from that third picture, classifying the document according to the extracted visual feature information using the trained Transformer classification model, outputting the journal to which the document file belongs, and archiving the classified document file to the folder corresponding to that journal.

As a preferred technical solution, the first picture and the second picture are captured in S2 as follows: the upper half of the first page is cropped as the first picture, and the lower half of the last page is cropped as the second picture. As an optimiz