CN-121999504-A - Information extraction method, device, equipment, storage medium and computer product
Abstract
The application discloses an information extraction method, apparatus, device, storage medium and computer product, relating to the technical field of information processing. The method comprises: performing format conversion on a file to be sorted to obtain target image information; and acquiring text information and document layout information of the target image information, and importing the text information and the document layout information into an information extraction large model to obtain extracted information. In this method, a layout-aware visual-text encoder is pre-trained to fuse heterogeneous information from the text and image modalities, which strengthens the model's layout perception of document images, avoids the loss of document layout information that occurs with OCR, and thereby improves model performance. Because the large model perceives the document layout structure information, an information extraction large model with heterogeneous-information perception is obtained, and the overall information extraction performance of the fine-tuned large model can be improved end to end.
Inventors
- CHANG HONGYU
Assignees
- 北京奇虎科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20241101
Claims (10)
- 1. An information extraction method, characterized in that the method comprises: performing format conversion on a file to be sorted to obtain target image information; and acquiring text information and document layout information of the target image information, and importing the text information and the document layout information into an information extraction large model to obtain extracted information.
- 2. The method of claim 1, wherein the step of performing format conversion on the file to be sorted to obtain the target image information comprises: acquiring file format information of the file to be sorted; determining a document-type information file and a picture-type information file according to the file format information; and converting the document-type information file and the picture-type information file into document image files to obtain the target image information.
- 3. The method of claim 1, wherein, before the step of importing the text information and the document layout information into an information extraction large model to obtain extracted information, the method further comprises: constructing a layout-aware encoder according to pre-training data; constructing an information extraction decoder based on a preset large language model; and generating the information extraction large model according to the layout-aware encoder and the information extraction decoder.
- 4. The method of claim 3, wherein, before the step of constructing a layout-aware encoder according to pre-training data, the method further comprises: acquiring a historical information document, and extracting text information and text layout information of the historical information document; and determining the pre-training data according to the text information and the text layout information.
- 5. The method of claim 3, wherein the step of constructing a layout-aware encoder according to pre-training data comprises: acquiring an initial visual document understanding model and a preset training mode; and training the initial visual document understanding model according to the pre-training data and the preset training mode to obtain the layout-aware encoder.
- 6. The method of claim 3, wherein the step of constructing an information extraction decoder based on a preset large language model comprises: acquiring the preset large language model, and performing instruction fine-tuning on the preset large language model to obtain an instruction fine-tuned model; and performing information-domain fine-tuning on the instruction fine-tuned model to obtain the information extraction decoder.
- 7. An information extraction apparatus, characterized in that the apparatus comprises: a format conversion module, configured to perform format conversion on a file to be processed to obtain target image information; and an information extraction module, configured to acquire text information and document layout information of the target image information, and to import the text information and the document layout information into an information extraction large model to obtain extracted information.
- 8. An information extraction apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the information extraction method according to any one of claims 1 to 6.
- 9. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the information extraction method according to any one of claims 1 to 6.
- 10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the steps of the information extraction method according to any one of claims 1 to 6.
Description
Information extraction method, device, equipment, storage medium and computer product
Technical Field
The present application relates to the field of information processing technologies, and in particular to an information extraction method, apparatus, device, storage medium, and computer product.
Background
In modern intelligence analysis, the acquisition and processing of information face increasingly complex challenges. In practice, information files often exist in unstructured form, including documents, pictures and the like; although rich in content, they lack an explicit structure, which makes information extraction and analysis difficult.
The foregoing is provided merely to facilitate understanding of the technical solutions of the present application and is not an admission that it constitutes prior art.
Disclosure of Invention
The main object of the application is to provide an information extraction method, apparatus, device, storage medium and computer product, aiming to solve the technical problem that the original layout structure information is always lost when text is extracted through OCR technology.
To achieve the above object, the present application provides an information extraction method, which includes: performing format conversion on the file to be sorted to obtain target image information; and acquiring text information and document layout information of the target image information, and importing the text information and the document layout information into an information extraction large model to obtain extracted information.
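The second step of the method above, passing text and layout to the model together, can be illustrated with a minimal sketch. The text does not specify how the layout information is represented or fed to the model, so everything here is an assumption made for illustration: the `TextBlock` type, the bounding-box format, the serialization scheme, and the `model` callable standing in for the information extraction large model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class TextBlock:
    """One recognized text span plus its bounding box on the page image (hypothetical type)."""
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; assumed format


def serialize_blocks(blocks: List[TextBlock]) -> str:
    """Order blocks top-to-bottom, then left-to-right, keeping each box next to its text."""
    ordered = sorted(blocks, key=lambda b: (b.box[1], b.box[0]))
    return "\n".join(
        f"[{b.box[0]},{b.box[1]},{b.box[2]},{b.box[3]}] {b.text}" for b in ordered
    )


def extract_information(blocks: List[TextBlock], model: Callable[[str], str]) -> str:
    """Hand text and layout to the extraction model together (model is a stand-in callable)."""
    return model(serialize_blocks(blocks))


blocks = [
    TextBlock("Total: 42", (40, 300, 200, 320)),
    TextBlock("Report", (40, 20, 160, 50)),
]
serialized = serialize_blocks(blocks)
```

The point of the sketch is only that the position of each text span travels with the text, rather than being discarded as in a plain OCR transcript.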
Optionally, the step of performing format conversion on the file to be sorted to obtain the target image information includes: acquiring file format information of the file to be sorted; determining a document-type information file and a picture-type information file according to the file format information; and converting the document-type information file and the picture-type information file into document image files to obtain the target image information.
Optionally, before the step of importing the text information and the document layout information into the information extraction large model to obtain extracted information, the method further includes: constructing a layout-aware encoder according to pre-training data; constructing an information extraction decoder based on a preset large language model; and generating the information extraction large model according to the layout-aware encoder and the information extraction decoder.
Optionally, before the step of constructing the layout-aware encoder according to the pre-training data, the method further includes: acquiring a historical information document, and extracting text information and text layout information of the historical information document; and determining the pre-training data according to the text information and the text layout information.
Optionally, the step of constructing the layout-aware encoder according to the pre-training data includes: acquiring an initial visual document understanding model and a preset training mode; and training the initial visual document understanding model according to the pre-training data and the preset training mode to obtain the layout-aware encoder.
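The format-conversion step described above first sorts incoming files into document-type and picture-type files based on their format information. A minimal sketch of that classification follows, assuming the format information is simply the file extension. The extension sets and the function names `classify_file` and `split_by_type` are hypothetical and not taken from the application; the later conversion to document image files is not shown.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed extension sets; the application does not list the supported formats.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".txt"}
PICTURE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff"}


def classify_file(path: str) -> str:
    """Return 'document', 'picture', or 'unsupported' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in DOCUMENT_EXTENSIONS:
        return "document"
    if suffix in PICTURE_EXTENSIONS:
        return "picture"
    return "unsupported"


def split_by_type(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group paths by file type so each group can be converted to document images."""
    groups: Dict[str, List[str]] = {"document": [], "picture": [], "unsupported": []}
    for path in paths:
        groups[classify_file(path)].append(path)
    return groups


groups = split_by_type(["report.PDF", "scan.jpeg", "notes"])
```

Both groups would then be rendered to a common document-image form, so that the downstream text and layout acquisition handles a single input type.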
Optionally, the step of constructing the information extraction decoder based on the preset large language model includes: acquiring the preset large language model, and performing instruction fine-tuning on the preset large language model to obtain an instruction fine-tuned model; and performing information-domain fine-tuning on the instruction fine-tuned model to obtain the information extraction decoder.
Optionally, the step of performing instruction fine-tuning on the preset large language model to obtain the instruction fine-tuned model includes: collating an instruction fine-tuning data set corresponding to the preset large language model; and performing instruction fine-tuning on the preset large language model according to the instruction fine-tuning data set to obtain the instruction fine-tuned model.
Optionally, the step of collating the instruction fine-tuning data set corresponding to the preset large language model includes: acquiring a preset fine-tuning data set, and determining extraction information type information of the preset fine-tuning data set; and collating, according to the preset fine-tuning data set and the extraction information type information, the instruction fine-tuning data set corresponding to the preset large language model.
Optionally, the step of collating the instruction fine-tuning data set corresponding to the preset large language model according to the preset fine-tuning data set and the extraction information type information includes: determining a plurality of pieces of planned extraction information and target output formats according to the extraction information type information; and according to the