
CN-121996817-A - Multi-mode embedding-based hydropower equipment knowledge graph retrieval enhancement method and device

CN121996817A

Abstract

The invention provides a multi-modal embedding-based method and device for enhancing knowledge graph retrieval of hydroelectric equipment, and belongs to the technical field of intelligent perception and multi-modal semantic understanding in water conservancy and hydropower engineering. The method comprises: constructing a multi-modal knowledge graph from related documents and pictures of hydroelectric equipment by using a large language model; vectorizing the pictures and text information in the multi-modal knowledge graph and mapping them to a unified semantic space to generate a vector database for retrieving the multi-modal knowledge graph; and retrieving the multi-modal knowledge graph by using the vector database. The invention achieves deep fusion and semantic alignment of multi-source heterogeneous data such as images and texts, can be widely applied to fields such as the overhaul of electromechanical equipment of hydropower stations, the safety inspection of hydraulic structures, and dispatching operation and maintenance knowledge management, and can significantly improve equipment identification precision, retrieval response speed and the intelligence level of knowledge services.

Inventors

  • WAN KE
  • DENG WEIQIANG
  • HE RUI
  • LI TIANZHI
  • PENG CHEN
  • SUN YU
  • LUO WEI
  • XIE YING
  • ZHANG FENG

Assignees

  • 国能大渡河猴子岩发电有限公司 (Guoneng Dadu River Houziyan Power Generation Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2025-12-10

Claims (10)

  1. A multi-modal embedding-based hydropower equipment knowledge graph retrieval enhancement method, characterized by comprising the following steps: constructing a multi-modal knowledge graph by using a large language model based on related documents and pictures of the hydroelectric equipment; vectorizing the picture and text information in the multi-modal knowledge graph and mapping the vectorized representations to a unified semantic space to generate a vector database for retrieving the multi-modal knowledge graph; and retrieving the multi-modal knowledge graph by using the vector database.
  2. The method of claim 1, wherein the constructing a multi-modal knowledge graph comprises: 1) acquiring related documents of the hydroelectric equipment, and extracting entities and related attributes from the documents by using a large language model to construct an initial text knowledge graph; 2) acquiring pictures of the hydroelectric equipment and preprocessing the pictures to generate a picture set; 3) carrying out semantic matching between the pictures in the picture set obtained in step 2) and the entities in the initial text knowledge graph obtained in step 1), and constructing the multi-modal knowledge graph.
  3. The method of claim 2, wherein the constructing an initial text knowledge graph comprises: 1-1) acquiring a hydropower equipment related document and extracting paragraph text units; 1-2) extracting entities and related attributes from the paragraph text units obtained in step 1-1) by using a large language model according to a pre-defined graph ontology, and associating the extracted entities to form relations, yielding initial triples, wherein each triple describes the semantic relation between two entities or an attribute of an entity, and the extracted entities form an entity set E = {e_1, e_2, …, e_n}, where e_i represents the i-th entity and n represents the number of entities; 1-3) generating a confidence c for each initial triple obtained in step 1-2), and then judging: if c ≥ θ_1, the triple is retained; if θ_2 ≤ c < θ_1, the content of the triple is checked, and the triple is retained after the check is passed; if c < θ_2, the triple is not retained; wherein θ_1 is a preset first triple confidence threshold, θ_2 is a preset second triple confidence threshold, and θ_2 < θ_1; 1-4) based on the results of steps 1-2) and 1-3), composing the entities, the attributes and the retained triples into the initial text knowledge graph.
  4. The method according to claim 3, further comprising: 3-1) generating a content description for each picture in the picture set using a visual language model: d_j = VLM(p_j), where d_j represents the description text generated for the image, VLM represents the image description generation function of the visual language model, and p_j represents the j-th picture in the picture set; 3-2) generating a corresponding natural language description for each entity in the initial text knowledge graph; wherein the attribute set of each entity e_i in the initial text knowledge graph is A_i = {a_{i,1}, a_{i,2}, …, a_{i,m}}, where a_{i,j} represents the j-th attribute of entity e_i and m represents the number of attributes of entity e_i; a natural language description is constructed for each entity: D_i = Format(e_i, A_i), where D_i represents the description of entity e_i and Format represents a natural language generation function; 3-3) respectively carrying out vector representation on the results of step 3-1) and step 3-2), and then matching the two groups of vectors by text semantic similarity; let v_j = Enc_text(d_j) and u_i = Enc_text(D_i), where v_j is the vectorized representation of d_j, u_i is the vectorized representation of D_i, and Enc_text is a text encoder for converting text into vectors; the similarity of image p_j and entity e_i is then calculated as sim(p_j, e_i) = cos(v_j, u_i), where cos represents the cosine similarity; then, the entity with the highest similarity is selected for each picture: e*_j = argmax_i sim(p_j, e_i), which finds, among all entities, the entity with the highest similarity to picture p_j, and the corresponding similarity score s_j is recorded; 3-4) based on the result of step 3-3), mounting the pictures in the picture set under the corresponding entities in the initial text knowledge graph to form matched entity-picture pairs, thereby generating the multi-modal knowledge graph; wherein: if the similarity s_j of any picture p_j is larger than a preset first similarity threshold, the picture p_j is mounted on entity e*_j, forming a matched entity-picture pair which is added to the knowledge graph; if s_j lies between a preset second similarity threshold and the first similarity threshold, the picture and the entity are checked, and the matched entity-picture pair is added to the knowledge graph after the check is passed, wherein the second similarity threshold is smaller than the first similarity threshold; if s_j is smaller than the second similarity threshold, the picture is not mounted and is not added to the knowledge graph; after all pictures in the picture set are traversed, the multi-modal knowledge graph is generated.
  5. The method of claim 4, wherein the generating a vector database for retrieving the multi-modal knowledge graph comprises: 1) aligning the cross-modal vectors of the entity-picture pairs matched in the multi-modal knowledge graph; wherein, for any entity e_i in a matched entity-picture pair, the natural language description D_i corresponding to the entity is input into a text encoding model to obtain a text vector t_i = Enc_text(D_i), and the picture p_i of the entity-picture pair is input into a picture encoding model to obtain a picture vector v_i = Enc_img(p_i), where Enc_img represents an image encoder for converting a picture into a vector; the picture vector v_i and the text vector t_i compose a picture-text vector pair; 2) based on the result of step 1), using a contrastive learning method to make the distance between the picture vector and the text vector in each matched picture-text vector pair as close as possible in the vector space, and the distance between unmatched picture vectors and text vectors as far as possible, so as to obtain optimized picture vectors and text vectors; wherein the contrastive learning loss function is: L = -(1/N) Σ_{i=1}^{N} log( exp(cos(v_i, t_i)/τ) / Σ_{j=1}^{N} exp(cos(v_i, t_j)/τ) ), where cos represents the cosine similarity, τ is a temperature parameter, and N is the number of picture-text pairs in a single batch of samples; the optimization objective of contrastive learning is to minimize the contrastive loss, i.e., to bring semantically similar picture vectors and text vectors close together in the vector space: θ* = argmin_θ L(θ), where θ represents the set of all learnable parameters in contrastive training, including the weight parameters of the image encoder and the text encoder, and θ* represents the optimal parameter set at which the contrastive loss L reaches a minimum after training is completed; after the contrastive learning is completed, optimized picture vectors and text vectors in the vector space are obtained; 3) carrying out semantic space unification on the result of step 2); L2 normalization is respectively applied to the picture vectors and text vectors optimized in step 2): v̂_i = v_i / ‖v_i‖_2 and t̂_i = t_i / ‖t_i‖_2, where v_i and t_i respectively represent the picture vector and text vector of the i-th entity in the vector space, and v̂_i and t̂_i respectively represent the normalized picture vector and text vector of the i-th entity; the picture vector and text vector of each entity are fused to generate a joint vector: f_i = α·t̂_i + (1-α)·v̂_i, where f_i represents the weighted joint vector of the text and picture of the i-th entity, and α is a parameter for scaling the text vector and the picture vector during fusion; the joint vector is normalized: f̂_i = f_i / ‖f_i‖_2, where f̂_i represents the normalized joint vector of the i-th entity in the vector space; all f̂_i are saved to form the vector database for retrieving the multi-modal knowledge graph.
  6. The method as recited in claim 5, further comprising: retrieving the multi-modal knowledge graph by any one of picture-mode retrieval, text-mode retrieval, or joint picture-and-text-mode retrieval.
  7. The method of claim 6, wherein the picture-mode retrieval comprises: 1) a query picture q provided by the user is mapped into a picture vector using the image encoder: v_q = Enc_img(q); 2) the result of step 1) is normalized: v̂_q = v_q / ‖v_q‖_2, where v̂_q is the normalized picture vector of v_q; 3) the similarity between the result of step 2) and each picture vector v̂_i in the vector database is calculated: s_i = cos(v̂_q, v̂_i); the entity in the multi-modal knowledge graph corresponding to the picture vector with the highest similarity score is then selected as the retrieval result: e* = argmax_i s_i.
  8. The method of claim 6, wherein the text-mode retrieval comprises: 1) a query text description q_t provided by the user is mapped into a text vector using the text encoder: t_q = Enc_text(q_t); 2) the result of step 1) is normalized: t̂_q = t_q / ‖t_q‖_2, where t̂_q is the normalized text vector of t_q; 3) the similarity between the result of step 2) and each text vector t̂_i in the vector database is calculated: s_i = cos(t̂_q, t̂_i); the entity in the multi-modal knowledge graph corresponding to the text vector with the highest similarity score is then selected as the retrieval result: e* = argmax_i s_i.
  9. The method of claim 6, wherein the joint picture-and-text retrieval comprises: 1) a query picture q and a query text description q_t provided by the user are mapped into a picture vector and a text vector using the image encoder and the text encoder respectively: v_q = Enc_img(q), t_q = Enc_text(q_t); 2) the results of step 1) are normalized: v̂_q = v_q / ‖v_q‖_2, t̂_q = t_q / ‖t_q‖_2; 3) the normalized picture vector of step 2) is fused with the text vector to generate a joint query vector: f_q = α·t̂_q + (1-α)·v̂_q, which is normalized as f̂_q = f_q / ‖f_q‖_2; 4) the similarity between the result of step 3) and each joint vector f̂_i in the vector database is calculated: s_i = cos(f̂_q, f̂_i); the entity in the multi-modal knowledge graph corresponding to the joint vector with the highest similarity score is then selected as the retrieval result: e* = argmax_i s_i.
  10. A multi-modal embedding-based hydropower equipment knowledge graph retrieval enhancement device, characterized by comprising: a multi-modal knowledge graph construction module for constructing a multi-modal knowledge graph by using a large language model based on related documents and pictures of the hydroelectric equipment; a vector database generation module for vectorizing the pictures and text information in the multi-modal knowledge graph and mapping them to a unified semantic space so as to generate a vector database for retrieving the multi-modal knowledge graph; and a retrieval module for retrieving the multi-modal knowledge graph by using the vector database.
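The two-threshold filtering of candidate triples described in claim 3 can be sketched in Python as follows. The threshold names and values here are illustrative assumptions, since the claim's original symbols did not survive translation; the logic (keep above the first threshold, manually check between the two, discard below the second) follows the claim text.

```python
# Minimal sketch of the two-threshold triple filtering (claim 3).
# THETA1/THETA2 are assumed names and values, not the patent's symbols.
THETA1 = 0.85  # first (higher) confidence threshold
THETA2 = 0.60  # second (lower) confidence threshold

def filter_triple(triple, confidence, manual_check):
    """Return True if the triple should be kept in the initial graph."""
    if confidence >= THETA1:
        return True                  # high confidence: keep directly
    if confidence >= THETA2:
        return manual_check(triple)  # borderline: keep only if check passes
    return False                     # low confidence: discard

# Usage with illustrative triples and a stand-in checker that
# approves every borderline triple.
candidates = [
    (("turbine runner", "part_of", "generating unit"), 0.92),
    (("stator winding", "located_in", "generator pit"), 0.70),
    (("paint color", "relates_to", "vibration"), 0.30),
]
kept = [t for t, c in candidates
        if filter_triple(t, c, manual_check=lambda t: True)]
```

The first triple passes directly, the second survives only because the stand-in check approves it, and the third is discarded outright.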
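The vector-database construction of claim 5 (contrastive alignment, then L2 normalization and weighted fusion into a joint vector) can be sketched with NumPy. The loss shown is an InfoNCE-style form consistent with the claim's description, and the names `alpha` and `tau` are assumptions standing in for symbols lost from the translated claims; real image/text vectors would come from trained encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(img_vecs, txt_vecs, tau=0.07):
    """InfoNCE-style loss over matched (image, text) pairs in one batch."""
    v, t = l2_normalize(img_vecs), l2_normalize(txt_vecs)
    logits = v @ t.T / tau  # cosine similarities scaled by temperature
    idx = np.arange(len(v))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, idx].mean()  # image-to-text direction

def joint_vector(img_vec, txt_vec, alpha=0.5):
    """Weighted fusion of normalized modality vectors, renormalized."""
    f = alpha * l2_normalize(txt_vec) + (1 - alpha) * l2_normalize(img_vec)
    return l2_normalize(f)

# Toy batch of 4 entities with 8-dimensional vectors.
rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = contrastive_loss(img, txt)
db = np.stack([joint_vector(i, t) for i, t in zip(img, txt)])
```

Each row of `db` is a unit-length joint vector, mirroring the claim's requirement that all fused vectors be normalized before being stored in the vector database.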
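The retrieval modes of claims 7-9 all reduce to the same operation: normalize the query vector (picture, text, or fused), score it against the stored vectors by cosine similarity, and return the entity with the highest score. A minimal sketch, with illustrative entity names and a toy database:

```python
import numpy as np

def normalize(x):
    """Scale a vector to unit L2 norm."""
    return x / np.linalg.norm(x)

def retrieve(query_vec, db_vecs, entities):
    """Return (best_entity, score) by cosine similarity against the DB.

    Rows of db_vecs are assumed pre-normalized, so the dot product
    with the normalized query equals cosine similarity.
    """
    q = normalize(query_vec)
    scores = db_vecs @ q
    best = int(np.argmax(scores))
    return entities[best], float(scores[best])

# Toy 3-entity, 3-dimensional database (illustrative names only).
entities = ["turbine runner", "stator winding", "spiral case"]
db = np.stack([normalize(v) for v in np.eye(3)])
entity, score = retrieve(np.array([0.1, 0.9, 0.2]), db, entities)
```

For the joint mode of claim 9, the query vector passed in would itself be the normalized weighted fusion of the picture and text query vectors, scored against the stored joint vectors.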

Description

Multi-mode embedding-based hydropower equipment knowledge graph retrieval enhancement method and device

Technical Field

The invention belongs to the technical field of intelligent perception and multi-modal semantic understanding of hydraulic and hydroelectric engineering, and particularly relates to a multi-modal embedding-based hydropower equipment knowledge graph retrieval enhancement method and device, which are suitable for the intelligent overhaul, state monitoring and knowledge-assisted decision making of hydroelectric equipment, hydraulic structures and the dispatching and operation systems of hydropower stations.

Background

In the operation management of hydraulic and hydroelectric engineering, electromechanical equipment is various, its structure is complex, and the operating environment is changeable. During equipment inspection or emergency maintenance, a maintainer needs to identify equipment types and fault positions by means of professional experience, and manually search technical manuals or operation rules to obtain a treatment method. This process, which relies on manual experience, has the following major problems: 1) Recognition is difficult: the appearance differences between equipment models and structural parts are large, and the manual recognition error rate is high. 2) Knowledge retrieval is inefficient: manually consulting procedures, standards or knowledge bases is time-consuming and can hardly meet emergency response requirements. 3) Multi-modal information is underutilized: images and sensing data acquired on site are not effectively associated with the knowledge system, and knowledge updates are delayed. 4) The knowledge graph lacks visual semantic alignment: existing water conservancy knowledge graphs are mainly constructed from texts and parameters and, lacking an embedding mapping mechanism for visual features, cannot support semantic retrieval from images to knowledge.

In addition, hydraulic engineering equipment generally has similar structures and complex components, and equipment of different models or manufacturers is very similar in appearance, so accurate classification and knowledge retrieval are difficult to achieve with traditional single-modal identification. Meanwhile, a large amount of unstructured data (such as monitoring images, design drawings and operation reports) exists in the water conservancy industry, and this heterogeneous data lacks unified semantic representation and organization, which hinders knowledge fusion and sharing. With the development of Large Language Models (LLMs) and multimodal pre-training models (e.g., CLIP, BLIP, ALIGN), cross-modal semantic embedding techniques can achieve semantic alignment of images and text in a unified vector space. However, these models are mostly trained on natural scene data; when migrated to the field of hydraulic engineering they are limited by image noise, structural complexity and semantic-level differences, and can hardly support component-level semantic recognition and reasoning. In knowledge graph construction, existing research also lacks the capability to model the hierarchical semantics, causal chains and visual association structures specific to the water conservancy industry.

Therefore, a cross-modal embedded retrieval enhancement method fusing image and text multi-source data is needed, so as to achieve semantic mapping and reasoning from visual input to knowledge entities and provide basic support for the intelligent overhaul and knowledge decision making of water conservancy and hydropower equipment.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a multi-modal embedding-based hydropower equipment knowledge graph retrieval enhancement method and device. The method can realize the deep fusion and semantic alignment of multi-source heterogeneous data such as images and texts, can be widely applied to scenes such as the overhaul of electromechanical equipment of a hydropower station, the safety inspection of hydraulic structures and dispatching operation and maintenance knowledge management, can significantly improve equipment identification precision, retrieval response speed and the intelligence level of knowledge services, and provides key technical support for the intelligent operation and maintenance, digital twinning and assisted decision making of hydraulic and hydroelectric engineering. The embodiment of the first aspect of the invention provides a multi-modal embedding-based hydropower equipment knowledge graph retrieval enhancement method, which comprises the following steps: constructing a multi-modal knowledge graph by using a large language model based on related documents and pictures of the hydroelectric equipment; Vectorizing the picture and text info