CN-122019566-A - Large language model RAG application method for extracting national standard document knowledge in ship field

Abstract

The embodiment of the invention provides a large language model RAG application method for extracting national standard document knowledge in the field of ships. The method comprises: establishing a knowledge base, in which the relevant national standard documents are classified and organized, each document is stored independently, and a text hierarchy tree adapted to the writing structure of standard documents and to the requirements of ship design business is constructed by combining text recognition with custom table and text segmentation processing; and building a retrieval-augmented generation business flow, in which an intention recognition question-answer mechanism is constructed by combining prompt engineering with a large language model, multi-path information recall is carried out according to the recognized intention and task allocation, a comprehensive reranking mechanism is introduced, and the large model is guided to generate retrieval replies by prompt engineering and chain-of-thought technology. Aimed at national standard knowledge extraction in the field of ships, the embodiment uses a large language model and RAG technology to implement a national standard document knowledge extraction application that supports the design work of business personnel.

Inventors

  • Zhou Zepeng
  • Wu Zihao
  • Dong Baiting
  • Yuan Feihui
  • Zhang Yanchang
  • Li Siyuan
  • Liu Menzheng
  • Xi Bing

Assignees

  • Shanghai Waigaoqiao Shipbuilding Co., Ltd. (上海外高桥造船有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-20

Claims (9)

  1. A large language model RAG application method for extracting national standard document knowledge in the field of ships, characterized by comprising the following steps: Step S1, establishing a knowledge base: the relevant national standard documents are classified and organized, each document is stored independently by class, and a text hierarchy tree adapted to the writing structure of standard documents and to the requirements of ship design business is constructed by combining text recognition, custom tables, and text segmentation processing; and Step S2, building a retrieval-augmented generation business flow: an intention recognition question-answer mechanism is constructed by combining prompt engineering with a large language model, multi-path information recall is carried out according to the recognized intention and task allocation, a comprehensive reranking mechanism is introduced, and the large model is guided to generate retrieval replies by prompt engineering and chain-of-thought technology.
  2. The large language model RAG application method for extracting national standard document knowledge in the field of ships as set forth in claim 1, wherein step S1 comprises: Step S11, national standard file format conversion and database resource request establishment; Step S12, PDF text parsing and document content classification processing; Step S13, document content classification reprocessing; and Step S14, vectorized storage of the text.
  3. The large language model RAG application method for extracting national standard document knowledge in the field of ships as set forth in claim 1, wherein step S2 comprises: Step S21, intention recognition for national standard material queries; Step S22, multi-path information recall; Step S23, multi-path information reranking; and Step S24, multi-path information fusion and text generation.
  4. The large language model RAG application method for extracting national standard document knowledge in the field of ships as claimed in claim 2, wherein in step S11, three vector database requirements are declared for each PDF file, namely a text vector database, a table vector database, and a picture vector database for the current document, all of which use the HNSW index mode.
  5. The large language model RAG application method for extracting national standard document knowledge in the field of ships as set forth in claim 2, wherein step S12 comprises: Step S121, standard PDF text parsing request construction: MinerU processing parameters adapted to the business processing requirements are set, and the corresponding MinerU functional module is called to obtain PDF document parsing content that meets the national standard processing requirements; and Step S122, standard PDF text parsing polling and decompression: the processing state on the MinerU server side is confirmed, and the result information is fetched promptly.
  6. The large language model RAG application method for extracting national standard document knowledge in the field of ships as claimed in claim 4, wherein in step S13, the parsed content is further processed to realize plain text information processing, table information processing, and picture information processing.
  7. The large language model RAG application method for extracting national standard document knowledge in the field of ships as claimed in claim 3, wherein in step S21, based on the DeepSeek large language model and by means of prompt engineering, specific prompts and case descriptions are formulated, national standard business scenario constraints are superimposed, and the keywords of the question input by the user are extracted by way of question supplementation, so as to form the input text for the downstream Chroma vector database and to give the query type and guidance information for the national standard files.
  8. The large language model RAG application method for extracting national standard document knowledge in the field of ships as claimed in claim 7, wherein the reranking strategies adopted in step S23 comprise a reranking strategy based on the BM25 algorithm and a reranking strategy based on the BGE-Reranker model.
  9. The large language model RAG application method for extracting national standard document knowledge in the field of ships as set forth in claim 8, wherein step S24 comprises: by editing summarization prompts, using the DeepSeek large language model to briefly summarize the reranking results by text, picture, and table category; combining the recalled text, picture, and table knowledge with the respective summaries; based on the provided prompt template, fusing the large language model prompts in the strategy formulation and generation stage, including query intention recognition, key judgment conditions, content matching checks, content analysis, table processing, and output format constraints; and the large language model generating text according to the multi-path fused information and the constraint prompts.
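
As an illustration of the BM25-based reranking strategy named in claim 8, the following is a minimal sketch of Okapi BM25 scoring over a small set of recalled passages. It is not the applicant's implementation: the function name `bm25_rerank`, the parameter defaults `k1=1.5` and `b=0.75`, and the sample passages are assumptions made for this example, and the passages are assumed to be tokenized already (Chinese text would need a word segmenter first). The BGE-Reranker strategy also named in the claim is not shown.

```python
import math
from collections import Counter


def bm25_rerank(query_tokens, candidates, k1=1.5, b=0.75):
    """Rank tokenized candidate passages against a tokenized query with Okapi BM25.

    Returns (order, scores): candidate indices sorted by descending score,
    and the raw score of each candidate in its original position.
    """
    n = len(candidates)
    if n == 0:
        return [], []
    # Average passage length; fall back to 1.0 if every passage is empty.
    avgdl = sum(len(c) for c in candidates) / n or 1.0
    # Document frequency: number of candidates that contain each term.
    df = Counter()
    for c in candidates:
        df.update(set(c))
    scores = []
    for c in candidates:
        tf = Counter(c)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(c) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical recalled passages, already tokenized (illustrative data only).
recalled = [
    ["anchor", "chain", "diameter", "table"],
    ["fire", "damper", "ventilation", "duct", "cargo", "hold"],
    ["hull", "plate", "thickness", "welding"],
]
order, scores = bm25_rerank(["fire", "damper"], recalled)
print(order[0])  # index of the passage that best matches the query terms
```

In a two-stage arrangement such as the one the claim describes, a lexical score of this kind could be combined with, or followed by, a model-based reranker; how the two are weighted is not specified in the claim.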

Description

Large language model RAG application method for extracting national standard document knowledge in ship field

Technical Field

The invention relates to the technical field of ship informatization data analysis and generation, and in particular to a large language model RAG application method for extracting national standard document knowledge in the ship field.

Background

With the continuous improvement of large language models in semantic understanding, function calling, and context window capacity, intelligent applications that use a large language model as the core engine have attracted wide attention. Existing implementations include natural language processing tools, represented by intelligent question-answering systems and voice interaction, and artificial intelligence agents that support assisted code generation and automatic tool calling. Notably, the specification documents of the shipbuilding industry (such as national standards, technical specifications, and design drawings) are highly specialized, complex in format, and dynamically updated. Conventional large language models often face significant limitations when processing such documents directly. In particular, a general large language model lacks a deep understanding of terms specific to the ship field and is prone to semantic confusion and misunderstanding. In terms of knowledge timeliness, the ship industry standard system evolves dynamically, whereas current general large language models are commonly limited by the cutoff of their training data; in offline operation the model cannot obtain subsequently updated ship specification files in real time, which leaves a significant knowledge blind spot.
In terms of domain expertise, the specification and standard documents used by ship enterprises contain a large number of technical terms, unique business logic, and enterprise-specific standards. The training data of general large models comes from widely available public network content and lacks specialized data from the private domains of ship enterprises, so such models have difficulty deeply understanding and accurately answering questions about, for example, the design principles of complex ship structures or the details of the building process for a specific ship type. In addition, data security and privacy are key factors that prevent general large models from being widely applied in ship enterprises. The specification and standard documents of ship enterprises typically contain sensitive information such as trade secrets and core technical material.
In view of these limitations of general large language models in processing ship enterprise specification documents, a retrieval-augmented generation (Retrieval-Augmented Generation, RAG) system offers ship enterprises a new approach to overcoming the information utilization dilemma by integrating external knowledge sources, guaranteeing data security, updating knowledge in real time, and deeply understanding professional content. RAG technology improves the accuracy and controllability of a large language model by combining an information retrieval module with a text generation module. Its core flow comprises three stages: retrieval (Retrieval), augmentation (Augmentation), and generation (Generation).
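The three-stage core flow can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method of the invention: the knowledge base entries, the word-overlap retriever, the prompt wording, and the `echo_llm` stand-in for a real large language model are all hypothetical.

```python
from collections import Counter

# Hypothetical knowledge base; real entries would come from parsed standard documents.
KNOWLEDGE_BASE = [
    "Clause 4.2: hull plate thickness shall be verified before welding.",
    "Clause 5.1: ventilation ducts in cargo holds require fire dampers.",
    "Clause 6.3: anchor chain diameter is selected from Table 3.",
]


def tokenize(text):
    """Lowercase whitespace tokenization with surrounding punctuation stripped."""
    words = (w.strip(".,:;?").lower() for w in text.split())
    return [w for w in words if w]


def retrieve(query, docs, top_k=2):
    """Retrieval: rank documents by word overlap with the query (toy scoring)."""
    q = Counter(tokenize(query))
    scored = []
    for doc in docs:
        d = Counter(tokenize(doc))
        overlap = sum(min(q[t], d[t]) for t in q)
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def augment(query, passages):
    """Augmentation: place the retrieved passages into the prompt as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


def generate(prompt, llm):
    """Generation: hand the augmented prompt to a language model callable."""
    return llm(prompt)


def echo_llm(prompt):
    """Stand-in for a real model: reports the size of the prompt it received."""
    return f"[stub reply to a prompt of {len(prompt)} characters]"


question = "What fire dampers are required for ventilation ducts?"
passages = retrieve(question, KNOWLEDGE_BASE)
answer = generate(augment(question, passages), echo_llm)
print(passages)
print(answer)
```

Because the model only sees what the retriever returns, updating the knowledge base changes the answers without retraining, which is the property the paragraph above relies on.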
Existing mature open source solution platforms include RAGFlow, Dify, n8n, and the like. These platforms can implement RAG applications based on text information; however, because of the specialized knowledge and the particular text and table structures of standard documents in the field of ships, the platforms still have considerable room for improvement in fine-grained document information extraction and high-precision document parsing. A RAG application mainly involves two aspects of business content: establishing the knowledge base and establishing the retrieval-augmented generation business flow. For building a ship standard document knowledge base, the existing open source platforms offer good generality, but their main problems are low document recognition accuracy, loss of table semantics during parsing, and loss of table and text retrieval information. Taking RAGFlow and Dify as examples, parsing the heterogeneous tables in ship documents commonly leads to missing or damaged list structure, loss of information associated with paragraphs, and damage to the context structure caused by document segmentation. In the retrieval-augmented generation stage, because tools such as RAGFlow and Dify limit the customization of the business flow, they are difficult to adapt directly to the knowledge extraction requirements of ship industry standard documents, and their business logic is relatively rigid.

Disclosure of Invention

I