CN-122019674-A - Processing method and system of professional knowledge corpus


Abstract

The invention provides a method and a system for processing a professional knowledge corpus. The method comprises: acquiring a professional knowledge corpus to be processed; performing structuring processing on the corpus to convert unstructured professional knowledge content into structured knowledge units; and verifying and assembling the structured knowledge content to form a standardized knowledge corpus. The structuring processing comprises a content identification step, which identifies and classifies the content of the corpus page by page to determine the types and positions of the content elements in each page, and a content conversion and recombination step, which converts and recombines the classified content elements according to preset structuring rules, based on the result of the content identification step, to generate an intermediate document with unified structure marks.

Inventors

GUO LI

Assignees

强魏飚唐(上海)智能科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260209

Claims (10)

1. A method for processing a professional knowledge corpus, characterized by comprising the following steps: acquiring a professional knowledge corpus to be processed; performing structuring processing on the professional knowledge corpus to convert unstructured professional knowledge content into structured knowledge units, wherein the structuring processing comprises: a content identification step of identifying and classifying the content of the professional knowledge corpus page by page and determining the types and positions of the content elements in each page; and a content conversion and recombination step of converting and recombining the classified content elements according to preset structuring rules, based on the result of the content identification step, to generate an intermediate document with unified structure marks; and verifying and assembling the knowledge content that has undergone the structuring processing to form a standardized knowledge corpus.
2. The method for processing a professional knowledge corpus according to claim 1, wherein the content identification step specifically comprises: dividing the professional knowledge corpus by page; identifying the content elements in each page, wherein the content elements comprise at least one of text, tables, pictures, titles, headers, and footers; and dividing the content elements into batches and identifying their positions according to the identified types of the content elements.
3. The method for processing a professional knowledge corpus according to claim 2, wherein the content elements in each page are identified by optical character recognition.
4. The method for processing a professional knowledge corpus according to claim 2, wherein the content identification step further comprises identifying the document structure of the professional knowledge corpus, the document structure comprising at least one of a table of contents, a main body, and an appendix.
5. The method for processing a professional knowledge corpus according to claim 1, wherein the content conversion and recombination step comprises: arranging the identified unstructured content elements into predefined formatted content blocks according to page and content type; and logically organizing and marking the formatted content blocks according to the document structure to generate the intermediate document.
6. The method for processing a professional knowledge corpus according to claim 5, wherein the intermediate document is a document with structured marks.
7. The method for processing a professional knowledge corpus according to any one of claims 1 to 6, characterized in that the structuring processing further comprises, before the content identification step: an initialization step of checking the access rights and time synchronization status of a storage container, creating an independent storage directory according to the processing item, and starting a process for monitoring task status; and a verification step of verifying the identity and authority of the processing request.
8. The method for processing a professional knowledge corpus according to claim 7, wherein the identity and authority verification is implemented by verifying a JSON Web Token.
9. The method for processing a professional knowledge corpus according to any one of claims 1 to 6, characterized in that the structuring processing further comprises a task execution and monitoring step of executing processing tasks concurrently and recording their status.
10. A system for processing a professional knowledge corpus, comprising: a corpus acquisition module for acquiring a professional knowledge corpus to be processed; a structuring processing module for performing structuring processing on the professional knowledge corpus to convert unstructured professional knowledge content into structured knowledge units, the structuring processing module comprising: a content identification unit for identifying and classifying the content of the professional knowledge corpus page by page and determining the types and positions of the content elements in each page; and a content conversion and recombination unit for converting and recombining the classified content elements according to preset structuring rules, based on the result of the content identification unit, to generate an intermediate document with unified structure marks; and a quality control module for verifying and assembling the knowledge content processed by the structuring processing module to form a standardized knowledge corpus.
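
As an illustration of the identity and authority verification named in claims 7 and 8, the following is a minimal sketch in Python using only the standard library. The claims do not specify a signing algorithm or a claim layout, so HS256, the shared key `SECRET`, the space-separated `scope` claim, and the function names `sign_hs256` and `verify_request` are all assumptions made for this example, not details taken from the patent.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    # JWT segments omit base64 padding, so restore it before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 token (only needed here to produce demo input)."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256)
    return f"{header}.{payload}.{b64url_encode(mac.digest())}"


def verify_request(token: str, secret: bytes, required_scope: str) -> dict:
    """Check identity (signature, expiry) and authority (scope) of a request."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise PermissionError("unsupported algorithm")
    mac = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256)
    if not hmac.compare_digest(mac.digest(), b64url_decode(signature_b64)):
        raise PermissionError("invalid signature")
    claims = json.loads(b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise PermissionError("token expired or missing exp")
    # Assumed convention: authority is a space-separated "scope" claim.
    if required_scope not in claims.get("scope", "").split():
        raise PermissionError("insufficient authority")
    return claims


SECRET = b"demo-secret"  # hypothetical shared key, for illustration only
token = sign_hs256(
    {"sub": "user-1", "scope": "corpus:process", "exp": int(time.time()) + 60},
    SECRET,
)
claims = verify_request(token, SECRET, "corpus:process")
print(claims["sub"])
```

In this sketch a request is rejected with `PermissionError` before any content identification begins, which matches the ordering in claim 7, where verification precedes the content identification step.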

Description

Processing method and system of professional knowledge corpus

Technical Field

The invention relates to the technical field of corpus processing, and in particular to a method and a system for processing a professional knowledge corpus.

Background

In the field of artificial intelligence, a high-quality professional knowledge corpus is key to training a domain-specific model. A professional knowledge corpus is domain-specific language material that has been systematically organized and annotated, and it supports research, teaching, or technical application in the professional domain. Such a corpus is drawn from specific industries such as medicine and law, and contains a large number of technical terms, complex logic, and multi-modal content (such as text, tables, and charts). For example, in a medical corpus, the term "three-in-one soup" refers to a mixed medication of antibiotics and the like.

At present, automatic processing of general-purpose text is mature, but it still shows clear shortcomings when applied to professional knowledge corpora.

First, integrity is hard to guarantee. Professional documents have complex structures and often contain elements such as tables of contents, appendices, and cross-page tables. General processing methods tend to destroy this inherent structure, leading to missing elements or broken context.

Second, consistency is hard to maintain. Professional fields require strictly uniform expression of concepts, but existing methods lack the ability to deeply recognize and preserve the internal logical structure of documents (such as chapter hierarchy and chart reference relationships), which easily causes inconsistent terminology or misplaced references.

Third, the processing is prone to distortion. Preprocessing based on simple rules or formats struggles to identify the semantic boundaries and functional attributes of content (for example, distinguishing a title from body text, or judging whether a table spans pages), so the generated knowledge fragments are semantically incomplete and cannot meet the data fidelity requirements of professional model training.

Therefore, a systematic method for extracting professional knowledge corpora is needed, one that achieves accurate parsing, lossless conversion, and standardized recombination of multi-modal professional documents, thereby ensuring the integrity, consistency, and correctness of the resulting corpus and providing a reliable basis for building high-quality domain knowledge bases and models.

Disclosure of Invention

In view of the defects in the prior art, the object of the invention is to provide a method and a system for processing a professional knowledge corpus.

The method for processing a professional knowledge corpus provided by the invention comprises the following steps: acquiring a professional knowledge corpus to be processed; performing structuring processing on the professional knowledge corpus to convert unstructured professional knowledge content into structured knowledge units, wherein the structuring processing comprises a content identification step of identifying and classifying the content of the professional knowledge corpus page by page and determining the types and positions of the content elements in each page, and a content conversion and recombination step of converting and recombining the classified content elements according to preset structuring rules, based on the result of the content identification step, to generate an intermediate document with unified structure marks; and verifying and assembling the knowledge content that has undergone the structuring processing to form a standardized knowledge corpus.
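
The flow described above, from classified content elements through conversion and recombination to verification and assembly, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Element` type, the `RULES` table, the Markdown-like marks, and the choice to drop headers and footers are all hypothetical, and the content identification step is represented only by hand-written input rather than real page analysis.

```python
from dataclasses import dataclass


@dataclass
class Element:
    page: int     # 1-based page number
    kind: str     # "title", "text", "table", "picture", "header", "footer"
    top: int      # vertical position on the page (hypothetical unit)
    content: str


# Hypothetical structuring rules: how each element type is marked up.
RULES = {
    "title": lambda e: f"# {e.content}",
    "text": lambda e: e.content,
    "table": lambda e: f"[TABLE p{e.page}] {e.content}",
    "picture": lambda e: f"[PICTURE p{e.page}] {e.content}",
}
DROPPED = {"header", "footer"}  # assumed to be page furniture in this sketch


def convert(elements):
    """Convert and recombine classified elements into marked-up blocks."""
    ordered = sorted(elements, key=lambda e: (e.page, e.top))
    return [RULES[e.kind](e) for e in ordered if e.kind not in DROPPED]


def verify_and_assemble(elements, blocks):
    """Check that no retained element was lost, then join the blocks."""
    expected = sum(1 for e in elements if e.kind not in DROPPED)
    if len(blocks) != expected:
        raise ValueError(f"expected {expected} blocks, got {len(blocks)}")
    return "\n\n".join(blocks)


# Stand-in for the output of the content identification step.
elements = [
    Element(1, "header", 0, "Internal use"),
    Element(1, "text", 120, "Dosage depends on body weight."),
    Element(1, "title", 40, "1 Dosage"),
    Element(2, "table", 30, "weight | dose"),
]
document = verify_and_assemble(elements, convert(elements))
print(document)
```

Sorting by page and then by position restores reading order before marking, so a title recognized after its paragraph still precedes it in the intermediate document. The count check is a deliberately simple stand-in for the verification stage.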
Preferably, the content identification step specifically comprises: dividing the professional knowledge corpus by page; identifying the content elements in each page, wherein the content elements comprise at least one of text, tables, pictures, titles, headers, and footers; and dividing the content elements into batches and identifying their positions according to the identified types of the content elements.

Preferably, the content elements in each page are identified by optical character recognition.

Preferably, the content identification step further comprises identifying the document structure of the professional knowledge corpus, the document structure comprising at least one of a table of contents, a main body, and an appendix.

Preferably, the content conversion and recombination step specifically comprises: arranging the identified unstructured content elements into predefined formatted content blocks according to page and content type; and logically organizing and marking the formatted content blocks according to the document structure to generate the intermediate document.

Preferably, the intermediate document is a docum