
CN-121998054-A - Knowledge management method and system based on document similarity detection and multi-modal knowledge base


Abstract

A knowledge management method and system based on document similarity detection and a multi-modal knowledge base. A target document is acquired. When the target document is ingested, a similarity detection method based on a resource-adaptive mechanism screens existing documents from the multi-modal knowledge base to form a candidate document set. Using the candidate document set, a similarity detection method based on an optimal precision-efficiency mechanism either deduplicates the target document or stores it as a new document. A question asked by a user is acquired, and multi-path hybrid retrieval is performed on the question against the deduplicated or updated multi-modal knowledge base. The multi-path hybrid retrieval results undergo two-stage ranking to produce a re-ranked result, from which reference data are selected. The question, prompt words, context and reference data are assembled into a large-model input instruction according to a preset prompt template, and the answer generated by the large model is returned to the user. Knowledge utilization, retrieval precision and efficiency are thereby improved together.

Inventors

  • GUO YINGYING
  • ZHAI FENGZHANG
  • ZHANG LIQIANG
  • LOU XIAONAN
  • XU YANMING
  • LI XINYU
  • XU GANG
  • HAN MINGLEI
  • LI WENBIN
  • LI NINGDE

Assignees

  • 北京四方继保工程技术有限公司
  • 北京四方继保自动化股份有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-12-31

Claims (11)

  1. A knowledge management method based on document similarity detection and a multi-modal knowledge base, the multi-modal knowledge base comprising documents, question-answer pairs, pictures, audio and video, characterized by comprising the following steps: acquiring a target document; when the target document is ingested, screening existing documents from the multi-modal knowledge base as a candidate document set using a similarity detection method based on a resource-adaptive mechanism; using the candidate document set, deduplicating the target document, or storing a target document judged to be a new document, using a similarity detection method based on an optimal precision-efficiency mechanism; acquiring a question asked by a user; and performing multi-path hybrid retrieval on the question based on the deduplicated or updated multi-modal knowledge base, obtaining a re-ranked result after two-stage ranking of the multi-path hybrid retrieval results, selecting reference data from the re-ranked result, assembling the question, the prompt words, the context and the reference data into a large-model input instruction according to a preset prompt template, and obtaining the answer generated by the large model and returning it to the user.
  2. The knowledge management method based on document similarity detection and multi-modal knowledge base according to claim 1, wherein: when the target document is ingested, all text in the target document is segmented into words; a MinHash algorithm is used to generate a MinHash signature vector for each word segment of the target document and for each word segment of the existing documents in the library, so as to calculate the Jaccard similarity between the target document and the existing documents in the library; a Jaccard similarity threshold is set based on a resource-adaptive mechanism; the target document and the existing library documents whose Jaccard similarity is not smaller than the Jaccard similarity threshold are mapped into the same hash bucket, and the existing documents in the hash bucket form the candidate document set; if the hash bucket contains no existing library documents, the target document is a new document and is stored in the library.
  3. The knowledge management method based on document similarity detection and multi-modal knowledge base according to claim 2, wherein the Jaccard similarity of the target document $D_t$ and an existing document $D_e$ in the library is given by the following formula: $J(D_t, D_e) = \frac{1}{n}\sum_{i=1}^{n} I\big(h_i(D_t) = h_i(D_e)\big)$, where $J(D_t, D_e)$ is the Jaccard similarity of the target document $D_t$ and the existing document $D_e$ in the library; $h_i(D)$ is the MinHash signature value of the $i$-th word segment of document $D$; $n$ is the number of word segments; and $I(\cdot)$ is the indicator function, which takes the value 1 when $h_i(D_t) = h_i(D_e)$ and 0 otherwise.
  4. The knowledge management method based on document similarity detection and multi-modal knowledge base according to claim 3, wherein: system resource data are acquired, including the average pressure values of CPU, memory and I/O, $P_{cpu}$, $P_{mem}$ and $P_{io}$; the pressure index PSI of the system resources is calculated as: $PSI = w_1 P_{cpu} + w_2 P_{mem} + w_3 P_{io}$, where $w_1$, $w_2$ and $w_3$ are the weight coefficients of CPU, memory and I/O respectively and satisfy $w_1 + w_2 + w_3 = 1$; the Jaccard similarity threshold is set based on the pressure index of the system resources, specifically: the threshold is set to 0.6 when $PSI < 0.7$; the threshold is set to 0.7 when $0.7 \le PSI \le 0.85$; and the threshold is set to 0.8 when $PSI > 0.85$.
  5. The knowledge management method based on document similarity detection and multi-modal knowledge base according to claim 2, wherein: each word segment of the target document and each word segment of the existing documents in the hash bucket is converted into a binary vector by a hash algorithm; using a SimHash algorithm, the segmented-word vector of the target document is generated from the binary vectors of all word segments of the target document, the segmented-word vector of each existing document in the hash bucket is generated from the binary vectors of all word segments of that document, and the Hamming distance between the segmented-word vectors of the target document and of the existing document in the hash bucket is calculated; based on an optimal precision-efficiency mechanism, the ratio of the Hamming distance to the number of bits of the binary vector is taken as the similarity; and when the similarity is not greater than the set threshold, the target document is stored as a new document.
  6. The knowledge management method based on document similarity detection and multi-modal knowledge base according to claim 5, wherein the corresponding bit positions of the binary vectors of all word segments are summed, and the segmented-word vector $V$ of the target document is obtained by setting each bit position whose sum is greater than or equal to 1 to 1 and each bit position whose sum is less than 1 to 0, as shown in the following formula: $V = (v_1, \dots, v_m)$, with $v_j = 1$ if $\sum_{i=1}^{n} b_{i,j} \ge 1$ and $v_j = 0$ otherwise, where $b_{i,j}$ is the $j$-th binary bit of the $i$-th word segment, $m$ is the number of bits of the binary vector, and $n$ is the number of word segments.
  7. The knowledge management method based on document similarity detection and multi-modal knowledge base according to claim 6, wherein the Hamming distance $d_H$ is given by the following formula: $d_H(V_t, V_e) = \sum_{j=1}^{m} \big(v_{t,j} \oplus v_{e,j}\big)$, where $V_t$ is the segmented-word vector of the target document, $V_e$ is the segmented-word vector of the existing document in the hash bucket, $v_{t,j}$ is the $j$-th binary bit of the segmented-word vector of the target document, $v_{e,j}$ is the $j$-th binary bit of the segmented-word vector of the existing document in the hash bucket, and $\oplus$ is the bitwise exclusive-or operator.
  8. The knowledge management method based on document similarity detection and multi-modal knowledge base according to claim 1, wherein: a question asked by a user is acquired, and vectorized encoding and word segmentation are performed on the question; the deduplicated or updated multi-modal knowledge base comprises a Milvus database and a MongoDB database; based on the vectorized encoding, semantic similarity retrieval is performed in the Milvus database to obtain the M vectors with the highest similarity, and the IDs associated with the M vectors are returned as a list; according to the word segmentation result, keyword matching retrieval is performed in the MongoDB database to obtain the N best-matching similar texts, and the Q similar questions most similar to the question are obtained from the MongoDB database; the question, the M original texts, the N similar texts and the Q similar questions are input into a ranking model for first-stage ranking and deduplication to obtain a first-stage re-ranking result; second-stage ranking and deduplication are performed with a ranking fusion algorithm to obtain a second-stage re-ranking result, and reference data are selected from the second-stage re-ranking result; according to a preset prompt template, the question, the prompt words, the context and the reference data are assembled into a large-model input instruction, and the answer generated by the large model is obtained and returned to the user; M, N and Q are all positive integers set by the user.
  9. A knowledge management system based on document similarity detection and a multi-modal knowledge base, for implementing the knowledge management method based on document similarity detection and multi-modal knowledge base according to any one of claims 1 to 8, comprising a target document acquisition module, a deduplication or update module and a question-answering module, wherein: a similarity detection method based on a resource-adaptive mechanism is used to screen existing documents from the multi-modal knowledge base as a candidate document set; and the question-answering module is used for acquiring a question asked by a user, performing multi-path hybrid retrieval on the question based on the deduplicated or updated multi-modal knowledge base, obtaining a re-ranked result after two-stage ranking of the multi-path hybrid retrieval results, selecting reference data from the re-ranked result, assembling the question, the prompt words, the context and the reference data into a large-model input instruction according to a preset prompt template, and obtaining the answer generated by the large model and returning it to the user.
  10. A terminal comprising a processor and a storage medium, characterized in that: the storage medium is used for storing instructions; and the processor operates according to the instructions to perform the steps of the method according to any one of claims 1 to 8.
  11. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
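
The two-stage duplicate check described in claims 2 to 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: every function name, the BLAKE2b-based hash, the 128-value signature length, the 64-bit fingerprint, the example weights (0.4, 0.3, 0.3) and the conventional +1/-1 bit voting inside `simhash` are assumptions made for the example. Only the threshold schedule is taken directly from claim 4.

```python
import hashlib


def pressure_index(cpu, mem, io, weights=(0.4, 0.3, 0.3)):
    """Weighted pressure index PSI; the weights are illustrative and must sum to 1."""
    w_cpu, w_mem, w_io = weights
    return w_cpu * cpu + w_mem * mem + w_io * io


def jaccard_threshold(psi):
    """Threshold schedule stated in claim 4."""
    if psi < 0.7:
        return 0.6
    if psi <= 0.85:
        return 0.7
    return 0.8


def _hash64(token, seed=0):
    """Seeded 64-bit hash of a token (assumed hash choice)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """Standard MinHash: keep the minimum hash value per seeded hash function.

    Expects a non-empty token list.
    """
    token_set = set(tokens)
    return [min(_hash64(t, seed) for t in token_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def simhash(tokens, bits=64):
    """Fingerprint from per-bit votes: +1 for a 1 bit, -1 for a 0 bit (assumption)."""
    votes = [0] * bits
    for token in tokens:
        h = _hash64(token)
        for j in range(bits):
            votes[j] += 1 if (h >> j) & 1 else -1
    fingerprint = 0
    for j, vote in enumerate(votes):
        if vote >= 1:  # positions whose sum is at least 1 become 1
            fingerprint |= 1 << j
    return fingerprint


def hamming_ratio(a, b, bits=64):
    """Hamming distance between two fingerprints divided by the bit width."""
    return bin(a ^ b).count("1") / bits


# Stage 1: screen candidates with a load-dependent Jaccard threshold.
threshold = jaccard_threshold(pressure_index(cpu=0.9, mem=0.8, io=0.7))
new_doc = "relay protection device commissioning guide".split()
old_doc = "relay protection device maintenance guide".split()
is_candidate = estimate_jaccard(
    minhash_signature(new_doc), minhash_signature(old_doc)
) >= threshold

# Stage 2: compare SimHash fingerprints of the candidates only.
ratio = hamming_ratio(simhash(new_doc), simhash(old_doc))
```

Raising the threshold under load shrinks the candidate set, so fewer documents reach the fingerprint comparison.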

Description

Knowledge management method and system based on document similarity detection and multi-modal knowledge base

Technical Field

The invention belongs to the field of large-model applications, and particularly relates to a knowledge management method and system based on document similarity detection and a multi-modal knowledge base.

Background

Traditional enterprise knowledge management systems rely mainly on search technology and suffer from pain points such as a high document duplication rate, knowledge silos, low retrieval efficiency, insufficient knowledge utilization and insufficient generalization.

In the prior art, "a document intelligent analysis and question answering method and system based on RAG and multi-modal knowledge graph" mainly parses PDF, txt and jpg information, converts text information into text knowledge graph information, converts image-modality information into image knowledge graph information, fuses the text knowledge graph and the image knowledge graph to obtain a multi-modal knowledge graph, and builds a database based on the multi-modal knowledge graph. "An image-text question answering method, system, device and storage medium based on multi-modal RAG" mainly extracts multi-modal information from PDF documents, stores it in a text vector database and an image vector database, retrieves the multi-modal information in the PDF documents separately, and generates the final answer with a multi-modal large model; it mainly addresses multi-modal data in long documents. However, the former focuses mainly on building the knowledge graph, ignores the original text information, and its multi-modal knowledge graph construction is complex. The latter focuses mainly on extracting the multi-modal information in PDF documents, generates final answers based on a pure video document, and does not refer to multi-modal language models.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a knowledge management method and system based on document similarity detection and a multi-modal knowledge base. It aims to solve the technical problems of redundant duplicate documents, knowledge silos, inefficient retrieval, insufficient knowledge utilization and poor cross-modal understanding and generalization that arise because traditional enterprise knowledge management systems depend on a single search technology.

The invention adopts the following technical scheme.

The invention provides a knowledge management method based on document similarity detection and a multi-modal knowledge base, wherein the multi-modal knowledge base comprises documents, question-answer pairs, pictures, audio and video. The method comprises the following steps: acquiring a target document; when the target document is ingested, screening existing documents from the multi-modal knowledge base as a candidate document set using a similarity detection method based on a resource-adaptive mechanism; using the candidate document set, deduplicating the target document, or storing a target document judged to be a new document, using a similarity detection method based on an optimal precision-efficiency mechanism; acquiring a question asked by a user; and performing multi-path hybrid retrieval on the question based on the deduplicated or updated multi-modal knowledge base, obtaining a re-ranked result after two-stage ranking of the multi-path hybrid retrieval results, selecting reference data from the re-ranked result, assembling the question, the prompt words, the context and the reference data into a large-model input instruction according to a preset prompt template, and obtaining the answer generated by the large model and returning it to the user.
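
The final retrieval steps, fusing several ranked result lists and assembling the large-model input, can be sketched as follows. The source names only a "ranking fusion algorithm", so reciprocal rank fusion is used here as an assumed stand-in. The function names, the constant `k=60` and the template wording are hypothetical, and the first-stage ranking model is omitted.

```python
from collections import defaultdict

# Hypothetical prompt template; the actual preset template is not given in the source.
PROMPT_TEMPLATE = (
    "{instructions}\n\n"
    "Context:\n{context}\n\n"
    "Reference data:\n{references}\n\n"
    "Question: {question}"
)


def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d).

    Each input list holds document IDs ordered best first. Repeated IDs inside
    one list are counted once, and the fused output contains each ID once.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        seen = set()
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_prompt(question, instructions, context, references):
    """Fill the template with the question, prompt words, context and references."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(references, start=1)
    )
    return PROMPT_TEMPLATE.format(
        instructions=instructions,
        context=context,
        references=numbered,
        question=question,
    )


# Three retrieval paths: vector search, keyword search, similar-question search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_a"]
question_hits = ["doc_b", "doc_d"]

fused = rrf_fuse([vector_hits, keyword_hits, question_hits])
texts = {
    "doc_a": "Text A",
    "doc_b": "Text B",
    "doc_c": "Text C",
    "doc_d": "Text D",
}
prompt = build_prompt(
    question="How is the device commissioned?",
    instructions="Answer using only the reference data.",
    context="(no previous turns)",
    references=[texts[d] for d in fused[:2]],
)
```

A document returned by several retrieval paths accumulates score from each list, so agreement between paths moves it towards the top without requiring the paths to share a common score scale.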
Preferably, when the target document is ingested, all text in the target document is segmented into words; a MinHash algorithm is used to generate a MinHash signature vector for each word segment of the target document and for each word segment of the existing documents in the library, so as to calculate the Jaccard similarity between the target document and the existing documents in the library; a Jaccard similarity threshold is set based on a resource-adaptive mechanism; the target document and the existing library documents whose Jaccard similarity is not smaller than the Jaccard similarity threshold are mapped into the same hash bucket, and the existing documents in the hash bucket form the candidate document set; if the hash bucket contains no existing library documents, the target document is a new document and is stored in the library.

Preferably, the Jaccard similarity of the target document $D_t$ and an existing document $D_e$ in the library is given by the following formula: $J(D_t, D_e) = \frac{1}{n}\sum_{i=1}^{n} I\big(h_i(D_t) = h_i(D_e)\big)$, where $J(D_t, D_e)$ is the Jaccard similarity of the target document $D_t$ and the existing document $D_e$ in the library; $h_i(D)$ is the MinHash signature value of the $i$-th word segment of document $D$; $n$ is the number of the word segmentat