CN-122019537-A - Intelligent processing method for ship industry standard document

Abstract

An embodiment of the invention provides an intelligent processing method for ship industry standard documents, comprising: S1, document parsing, in which the structure of a ship industry standard document is parsed and restored to generate structured text that supports vectorized storage and semantic retrieval; S2, database construction, in which a document vector index library is built from the structured text to form an efficient data structure for semantic retrieval tasks; and S3, content retrieval, in which a semantic retrieval mechanism is built for user queries to match natural-language questions to structured document content with high precision and return the results. The embodiment restores document structure accurately, segments content more reasonably, matches queries more precisely, returns results with richer context, and follows a clear technical path that is easy to put into engineering practice.

Inventors

  • Dong Baiting
  • Zhou Zepeng
  • Wu Zihao
  • Zhang Yanchang
  • Li Siyuan
  • Zhang Hengxi

Assignees

  • 上海外高桥造船有限公司 (Shanghai Waigaoqiao Shipbuilding Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-01-20

Claims (8)

  1. An intelligent processing method for ship industry standard documents, characterized by comprising the following steps: step S1, document parsing, namely parsing and restoring the structure of a ship industry standard document to generate structured text that supports vectorized storage and semantic retrieval; step S2, database construction, namely building a document vector index library from the structured text to form an efficient data structure for semantic retrieval tasks; and step S3, content retrieval, namely building a semantic retrieval mechanism for user queries, matching natural-language questions to structured document content with high precision, and returning the results.
  2. The intelligent processing method for ship industry standard documents according to claim 1, wherein step S1 comprises: step S11, OCR document recognition, namely performing preliminary structural parsing on a standard PDF document to obtain a chapter hierarchy file that preliminarily reflects the chapter hierarchy of the document, a page layout file that records text content and text block positions, a content block file that contains structured text and page numbers, and chart screenshot files; step S12, document structure restoration, namely using a large language model to optimize the structure of the titles in the preliminarily generated chapter hierarchy file so as to reconstruct the accurate parent-child levels of the document; and step S13, content segmentation and metadata annotation, namely segmenting the document content into semantic units based on the optimized chapter hierarchy file and annotating metadata related to the original document structure, so as to construct structured text that supports vectorized storage and semantic retrieval.
  3. The intelligent processing method for ship industry standard documents according to claim 2, wherein in step S11 the OCR document recognition is implemented using the MinerU tool.
  4. The intelligent processing method for ship industry standard documents according to claim 2, wherein in step S12 the document structure restoration is implemented using the DeepSeek-Chat large language model.
  5. The intelligent processing method for ship industry standard documents according to claim 2, wherein step S12 comprises: step S121, extracting all initial title lines using a regular expression; step S122, using a large language model together with a preset prompt template to understand the logical relationships between the titles; step S123, outputting standardized objects, each of which comprises an original item and a modified item; and step S124, generating a chapter hierarchy file containing title levels from the output standardized objects.
  6. The intelligent processing method for ship industry standard documents according to claim 2, wherein step S13 comprises: step S131, title hierarchy segmentation; step S132, page mapping and original text restoration; step S133, content segmentation and segment generation; step S134, structuring of chart information; and step S135, metadata integration and output structure construction.
  7. The intelligent processing method for ship industry standard documents according to claim 1, wherein step S2 comprises: step S21, vector generation and semantic data structure construction, namely performing semantic vectorization encoding on each structured segment in the structured text, generating a dense vector field that represents deep semantics and a sparse vector field that captures keyword importance, and generating a unique identifier field for each structured segment, so as to form a semantic data structure that unifies semantic content, hierarchical structure, and context association; step S22, vector database construction and field mapping; step S23, hybrid index configuration, namely configuring index structures separately for the different types of vector features; and step S24, writing of vectors and metadata, with the data persisted to disk after writing.
  8. The intelligent processing method for ship industry standard documents according to claim 1, wherein step S3 comprises: step S31, hybrid vector retrieval; step S32, reciprocal rank fusion, namely fusing and ranking the hybrid vector retrieval results using a reciprocal rank fusion strategy and selecting the final Top-K candidate results; step S33, context sliding window expansion, namely taking each hit core semantic segment as the center and the sliding window size as the range to extract several semantically continuous segments that form a context window; and step S34, returning the Top-K semantic segments, namely returning the Top-K segments after fusion ranking, together with their corresponding context window content, as the retrieval result.
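
Steps S32 and S33 of claim 8 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that each retriever (dense and sparse) returns a ranked list of segment identifiers, that those identifiers are integer positions in document order, and that the fusion score follows the commonly used reciprocal rank fusion form 1 / (k + rank). The function names, the constant k = 60, and the window size are hypothetical choices made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=3):
    """Fuse several ranked lists of segment ids with reciprocal rank fusion.

    Each segment scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by segment id for determinism.
    ordered = sorted(scores, key=lambda seg_id: (-scores[seg_id], seg_id))
    return ordered[:top_k]


def context_window(hit_index, n_segments, window=1):
    """Return the indices of the segments around a hit, clipped to the document.

    Assumes segments are stored in document order, so neighbouring indices
    are semantically continuous.
    """
    lo = max(0, hit_index - window)
    hi = min(n_segments - 1, hit_index + window)
    return list(range(lo, hi + 1))


# Hypothetical ranked results from the dense and sparse retrievers.
dense_hits = [3, 1, 7]
sparse_hits = [1, 5, 3]

top_hits = rrf_fuse([dense_hits, sparse_hits], top_k=2)
windows = {hit: context_window(hit, n_segments=10, window=1) for hit in top_hits}
```

In this sketch a segment that ranks well in both lists outranks one that ranks first in only one list, which is the behavior the fusion step relies on. The window is clipped at the document boundaries so that a hit in the first or last segment still yields a valid context.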

Description

Intelligent processing method for ship industry standard document

Technical Field

The invention relates to the technical field of document structuring and semantic information management in the ship industry, and in particular to an intelligent processing method for ship industry standard documents.

Background

In the ship industry, standard documents play an important role in regulating production and design workflows, safeguarding product quality requirements, and advancing product inspection standards. As ship manufacturers transform and upgrade toward digital and intelligent operation, the ways in which ship-related standard documents are used, managed, and retrieved keep changing, and the current situation is typically as follows:

1. Standard documents come from many sources and cover a wide range of business areas. National standards, industry standards, and enterprise technical documents are widely used in product research and development, production and manufacturing, inspection and testing, and other stages. These documents are usually formulated by authorities, industry leaders, or the enterprises themselves. They follow industry development trends, are highly normative and binding, and are an important basis for guiding the direction of the industry.

2. The document format is uniform, but the structure is opaque. Most standard documents are released in PDF format. Their content is rigorous and their typesetting is uniform, but they lack semantic structure information that a program can recognize. The logical levels of chapters, clauses, appendices, and so on are conveyed mainly through visual formatting, without structural tags that directly support machine parsing.

3. Enterprise customization is common, and version differences are pronounced.
In actual production, a ship manufacturing enterprise often customizes general standards according to its own situation and cutting-edge industry specifications, for example by supplementing terms, deleting content, and refining wording, which produces multiple versions of internal enterprise standards. While inheriting the structure of the original document, these enterprise standards add individual clauses, so document content exists in many versions with differing structures.

4. The number of standard documents is growing rapidly, and management pressure keeps increasing. As the industry continues to develop and mature, the standard documents that a shipbuilding enterprise must manage keep growing in number and update frequency. Traditional manual classification, manual naming, and keyword retrieval can hardly meet practical needs such as frequent access, fast locating, and version tracing of standard documents.

5. Information needs are diverse, and retrieval granularity keeps getting finer. When using standard documents, shipbuilding enterprises need to browse a class of specifications as a whole, and they often also need to quickly locate specific clauses, term definitions, technical parameters, or annex content. As business scenarios diversify, user demands on document retrieval extend from "find the file" to "find the content", "view the context", and "compare versions".

6. Structured management and semantic utilization are the trend.
As standard documents grow in number and their use becomes more complex, more and more shipbuilding enterprises are paying attention to structured storage, semantic organization, and intelligent retrieval of standard documents, in order to support business scenarios such as internal system integration, knowledge management platform construction, and intelligent question answering.

With respect to problems such as disordered structure and inaccurate retrieval results in the digital processing and intelligent retrieval of standard PDF documents, the prior art faces the following five core technical problems:

1. PDF document structure recognition is difficult. Standard PDF documents lack semantic tags, and traditional OCR or template-based methods struggle to restore the hierarchical structure accurately; even with custom development, they lack generality and robustness.

2. Converted content is decoupled from page numbers, and structural information is missing. Although the Markdown files generated by open-source PDF conversion tools have the advantage of being editable, they lose metadata such as page positions, so they cannot be mapped one-to-one to the original document, which hinders manually locating content in the original.

3. Lacks deep semantic understanding cap