CN-121981281-A - Large-model-based retrieval-augmented generation method for the glass fiber field
Abstract
The invention discloses a large-model-based retrieval-augmented generation method for the glass fiber field. The method first constructs a multi-source heterogeneous database containing composition-property data, graph structure data, and text data. The multi-modal data are mapped into a unified semantic space by modality-specific encoders and projection matrices. After a query is received, a large model identifies the intent and extracts entities, and multi-path joint retrieval and cross-attention reranking are carried out in the unified space. Finally, a partitioned, structured augmented prompt is constructed and input to the large model to generate an answer, and a self-consistency verification mechanism checks the degree of factual support, which mitigates the hallucination problem. The invention enables unified access to cross-modal knowledge and domain-specific reasoning, and can markedly improve the accuracy of knowledge question answering and design in the glass fiber field.
Inventors
- ZHAO MING
- LIU XIN
- LANG YUDONG
- ZHAO ZIYU
- ZHANG YAN
- ZHAO QIAN
Assignees
- 南京玻璃纤维研究设计院有限公司
Dates
- Publication Date: 20260505
- Application Date: 20260409
Claims (10)
- 1. A large-model-based retrieval-augmented generation method for the glass fiber field, characterized by comprising the following steps: Step 1, constructing a multi-source heterogeneous database for the glass fiber field and establishing an association mapping mechanism among data of different modalities; Step 2, encoding and representing the data of each modality in the multi-source heterogeneous database and mapping them to a unified semantic embedding space; Step 3, receiving a user query request, identifying the query intent, mapping the query to the unified semantic embedding space, and performing joint retrieval and recall in the multi-source heterogeneous database; and Step 4, fusing and reranking the recall results, constructing a structured retrieval-augmented prompt, inputting it to a large language model, and generating a final answer.
- 2. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 1, wherein step 1 specifically comprises: composition-property data construction, namely collecting compositions and corresponding property data of glass fiber systems and standardizing them into a structured data set to obtain a composition-property sub-library; graph structure data construction, namely constructing graph structure data from atomic-scale structure information to obtain a graph structure sub-library, wherein atoms in a glass model are taken as nodes and interatomic bonds as edges, node features comprise atom type, coordination number and electronegativity, and edge features comprise bond length and bond type; text data construction, namely extracting professional literature knowledge in the glass fiber field and constructing a text data set oriented to domain knowledge expression to obtain a text semantic sub-library; and data association, namely establishing a cross-source association set among the composition-property data, the graph structure data and the text data through similarity matching or rule mapping, with the glass composition as the primary index.
- 3. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 2, wherein step 2 specifically comprises: embedding the composition-property data with a multi-layer perceptron to obtain a composition-property embedding vector $h_c$; performing message passing and graph-level readout on the atomic-scale graph structure data with a graph neural network to obtain a graph structure embedding vector $h_g$; embedding the text data with a pre-trained language model to obtain a text embedding vector $h_t$; and projecting the composition-property embedding vector, the graph structure embedding vector and the text embedding vector into a common semantic space through modality-specific linear projection layers to obtain aligned unified embedding vectors of consistent dimension, expressed as: $z_c = W_c h_c + b_c$, $z_g = W_g h_g + b_g$, $z_t = W_t h_t + b_t$, where $W_c$, $W_g$, $W_t$ are the projection matrices of the composition-property, graph structure and text modalities, respectively; $b_c$, $b_g$, $b_t$ are the corresponding bias vectors of the three modalities; and $z_c$, $z_g$, $z_t$ are the aligned unified embedding vectors of the composition-property, graph structure and text modalities, respectively.
- 4. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 1, wherein identifying the query intent in step 3 specifically comprises: inputting the user query q into a large language model, guiding the model through an instruction prompt to output a structured intent classification result, and extracting a set of key entities, wherein the intent categories comprise at least knowledge question answering, property prediction and composition design, and the key entities comprise at least oxide component names, property index names, numerical ranges and constraint conditions.
- 5. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 3, wherein the joint retrieval and recall in step 3 specifically comprises: Step 3-1, converting the user query q and the key entities extracted by intent recognition into a query embedding vector $e_q$ and an auxiliary embedding vector $e_a$, respectively; Step 3-2, based on the query embedding vector $e_q$ and the auxiliary embedding vector $e_a$, searching the composition-property sub-library, the graph structure sub-library and the text semantic sub-library in parallel in the unified embedding space to achieve joint recall of the multi-source heterogeneous data, which specifically comprises: calculating, in the semantic embedding space, the retrieval similarity score between the query and each item in the text semantic sub-library, the composition-property sub-library and the graph structure sub-library, wherein $\cos(\cdot,\cdot)$ is the cosine similarity function; $\alpha$ is the weight balance coefficient between the text embedding and the composition-property embedding; $s^{t}_{j}$, $s^{c}_{j}$, $s^{g}_{j}$ are the retrieval similarity scores between the query and the j-th item in the text sub-library, the composition-property sub-library and the graph structure sub-library, respectively; and $z^{t}_{j}$, $z^{c}_{j}$, $z^{g}_{j}$ are the unified embedding vectors of the j-th item in the text sub-library, the composition-property sub-library and the graph structure sub-library, respectively; and, for each sub-library, sorting the items in descending order of similarity score and forming three recall sets according to the Top-K selection principle: $R_t = \mathrm{TopK}(\{s^{t}_{j}\}, K_t)$, $R_c = \mathrm{TopK}(\{s^{c}_{j}\}, K_c)$, $R_g = \mathrm{TopK}(\{s^{g}_{j}\}, K_g)$, where $K_t$, $K_c$, $K_g$ are the recall numbers of the text sub-library, the composition-property sub-library and the graph structure sub-library, which are dynamically and adaptively adjusted according to the intent category; $R_t$, $R_c$, $R_g$ are the recall candidate sets of the text sub-library, the composition-property sub-library and the graph structure sub-library, respectively; and $\mathrm{TopK}(\cdot, K)$ denotes taking the first K entries.
- 6. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 5, wherein step 4 specifically comprises: Step 4-1, merging the candidate sets recalled from the text sub-library, the composition-property sub-library and the graph structure sub-library to form a candidate set $C$; Step 4-2, scoring and sorting the candidate entries in the candidate set $C$ in a unified manner with a cross-attention-based reranking mechanism to obtain a final retrieval result set; and Step 4-3, constructing a structured retrieval-augmented prompt based on the final retrieval result set and the intent recognition result, inputting it to a large language model, and generating a final answer.
- 7. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 6, wherein step 4-2 specifically comprises: for each candidate entry $c_i$ in the candidate set $C$, inputting its unified embedding vector $z_i$ together with the query embedding vector $e_q$ into a cross-attention scoring network and calculating a fusion relevance score: $s_i = \sigma\left(W_r\,[\,e_q \,\|\, z_i \,\|\, e_q \odot z_i\,] + b_r\right)$, where $\|$ denotes vector concatenation; $\odot$ denotes the element-wise product, used to capture interaction features between the query and the candidate; $W_r$ is the weight matrix of the cross-attention scoring network; $b_r$ is a bias vector; $\sigma$ is the Sigmoid activation function; $s_i$ is the fusion relevance score of candidate entry $c_i$, $i = 1, \dots, M$, and $M$ is the total number of candidate entries in the candidate set $C$; if $c_i$ comes from the composition-property recall set, then $z_i = z^{c}_{i}$; if $c_i$ comes from the graph structure recall set, then $z_i = z^{g}_{i}$; if $c_i$ comes from the text recall set, then $z_i = z^{t}_{i}$; $z^{c}_{i}$, $z^{g}_{i}$, $z^{t}_{i}$ denote the unified embedding vectors of candidate entry $c_i$ in the composition-property, graph structure and text modalities, respectively; and sorting the candidate entries in descending order of fusion relevance score and taking the first $N$ candidate entries to construct the final retrieval result set, expressed as: $R_{final} = \mathrm{TopN}(C, \{s_i\})$, where $R_{final}$ is the final retrieval result set, $N$ is the number of context entries finally input to the large language model, and $\mathrm{TopN}(\cdot)$ denotes taking the first $N$ candidate entries.
- 8. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 7, wherein the structured retrieval-augmented prompt $P$ constructed in step 4-3 follows the template structure: $P = [\,P_{sys};\ P_{query};\ P_{ctx}\,]$, where $P_{sys}$ is the system instruction segment, used to define the role of the model, the answer specification and the output format requirements; $P_{query}$ is the user query segment, containing the original query q and the intent classification result $I$; and $P_{ctx}$ is the retrieval context segment, in which the entries of the final retrieval result set are organized into partitions by modality type, expressed as: $P_{ctx} = [\,D_{cp};\ D_{gs};\ D_{txt}\,]$, where $D_{cp}$ is the composition-property data context area, presenting the retrieved composition vectors and property values in a structured table format; $D_{gs}$ is the structure information context area, in which the graph structure data are converted into a natural language description containing the key structural parameters; and $D_{txt}$ is the literature knowledge context area, which directly quotes the retrieved literature paragraph text.
- 9. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 8, wherein the system instruction segment embeds differentiated reasoning instructions for different intent categories, $I$ being the intent classification result: when $I = I_{QA}$, where $I_{QA}$ denotes knowledge question answering, the instruction requires the model to synthesize literature knowledge and data evidence into an explanatory answer; when $I = I_{PP}$, where $I_{PP}$ denotes property prediction, the instruction requires the model to combine the composition-property data with the graph structure information to reason about and predict properties; and when $I = I_{CD}$, where $I_{CD}$ denotes composition design, the instruction requires the model to generate candidate composition schemes by reasoning over the retrieved similar composition systems and domain knowledge, and to give the design basis.
- 10. The large-model-based retrieval-augmented generation method for the glass fiber field according to claim 6, wherein step 4 further comprises: Step 4-4, self-consistency verification, namely calculating the degree of factual support between each key factual statement in the generated final answer and the retrieved context evidence, and, if the degree of factual support is lower than a preset threshold, flagging the key factual statement as having insufficient evidence.
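The projection, per-sub-library Top-K recall and reranking steps recited in claims 3, 5 and 7 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the toy two-dimensional vectors and the hand-set scoring weights are assumptions made for illustration, the auxiliary entity embedding and the weight balance coefficient of claim 5 are omitted, and the scorer is reduced to a single sigmoid unit over the concatenation of the query, the candidate and their element-wise product.

```python
import math

def project(W, b, h):
    """Modality-specific linear projection z = W h + b (claim 3)."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def top_k(query, library, k):
    """Return the k (item_id, score) pairs with the highest cosine score."""
    scored = [(item_id, cosine(query, z)) for item_id, z in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def rerank_score(query, z, w, b):
    """Assumed scorer: sigmoid(w . [q ; z ; q * z] + b)."""
    features = query + z + [x * y for x, y in zip(query, z)]
    logit = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def retrieve(query, libraries, ks, w, b, n_final):
    """Recall Top-K per sub-library, merge, rerank, keep the first n_final."""
    candidates = {}
    for name, library in libraries.items():
        for item_id, _ in top_k(query, library, ks[name]):
            candidates[(name, item_id)] = library[item_id]
    ranked = sorted(
        candidates,
        key=lambda key: rerank_score(query, candidates[key], w, b),
        reverse=True,
    )
    return ranked[:n_final]

# Toy unified embeddings (hypothetical values) for the three sub-libraries.
libraries = {
    "text": {"t1": [1.0, 0.0], "t2": [0.0, 1.0]},
    "component": {"c1": [0.9, 0.1], "c2": [-1.0, 0.0]},
    "graph": {"g1": [0.5, 0.5]},
}
ks = {"text": 1, "component": 1, "graph": 1}  # per-intent recall sizes
query = [1.0, 0.0]
w = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]  # weight only the interaction term
result = retrieve(query, libraries, ks, w, 0.0, 2)
print(result)
```

In a real system the projection matrices and scoring weights would be learned, and the recall sizes in `ks` would be chosen from the intent category as the claims describe.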
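The partitioned prompt of claims 8 and 9 and the self-consistency check of claim 10 can likewise be sketched. The section labels, the instruction wording and the function names below are hypothetical. The claims do not state how the degree of factual support is computed, so the token-overlap ratio used here is only a stand-in assumption; an entailment model or embedding similarity could serve the same role.

```python
# Hypothetical intent-specific instructions, paraphrasing claim 9.
INTENT_INSTRUCTIONS = {
    "knowledge_qa": "Synthesize literature knowledge and data evidence "
                    "into an explanatory answer.",
    "property_prediction": "Combine composition-property data with graph "
                           "structure information to predict properties.",
    "composition_design": "Propose candidate compositions from similar "
                          "retrieved systems and state the design basis.",
}

def build_prompt(query, intent, results):
    """Assemble system, query and modality-partitioned context segments.

    `results` is a list of (modality, content) pairs, where modality is
    one of "component", "graph" or "text".
    """
    sections = {"component": [], "graph": [], "text": []}
    for modality, content in results:
        sections[modality].append(content)
    parts = [
        "[SYSTEM]",
        "You are a glass fiber domain assistant. " + INTENT_INSTRUCTIONS[intent],
        "[QUERY]",
        f"intent={intent}",
        query,
        "[CONTEXT: COMPOSITION-PROPERTY DATA]",
        *sections["component"],
        "[CONTEXT: STRUCTURE INFORMATION]",
        *sections["graph"],
        "[CONTEXT: LITERATURE KNOWLEDGE]",
        *sections["text"],
    ]
    return "\n".join(parts)

def support_degree(statement, evidence):
    """Assumed proxy: fraction of statement tokens present in the evidence."""
    tokens = set(statement.lower().split())
    if not tokens:
        return 0.0
    evidence_tokens = set(" ".join(evidence).lower().split())
    return len(tokens & evidence_tokens) / len(tokens)

def flag_unsupported(statements, evidence, threshold=0.5):
    """Return the statements whose support falls below the threshold."""
    return [s for s in statements if support_degree(s, evidence) < threshold]

# Illustrative usage with made-up evidence and statements.
evidence = ["adding al2o3 raises the elastic modulus of e-glass"]
statements = ["al2o3 raises the elastic modulus", "boron lowers density"]
prompt = build_prompt(
    "How does Al2O3 affect modulus?",
    "knowledge_qa",
    [("text", evidence[0]), ("component", "SiO2=54, Al2O3=14, modulus=72")],
)
flagged = flag_unsupported(statements, evidence)
print(prompt)
print(flagged)
```

Keeping each modality in its own labeled section lets the system instruction refer to the sections by name, and the flagged statements can be returned to the user with an insufficient-evidence notice.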
Description
Large-model-based retrieval-augmented generation method for the glass fiber field
Technical Field
The invention belongs to the technical field of glass fibers and artificial intelligence, and particularly relates to a large-model-based retrieval-augmented generation method for the glass fiber field.
Background
Glass fiber is an inorganic nonmetallic material with SiO2 as its main component, supplemented by various oxides such as Al2O3, CaO and MgO. It offers high specific strength, corrosion resistance, good electrical insulation and low cost, and is widely used in aerospace, building materials, electronic communication, new energy, transportation and other fields. The properties of glass fibers are determined jointly by their chemical composition and microstructure, and small changes in composition often cause significant changes in structure and properties. Composition design and property control are therefore among the core problems in glass fiber development. Traditional glass fiber composition design relies mainly on the accumulated experience of researchers and on large numbers of trial-and-error experiments; the development cycle is long, the cost is high, and it is difficult to meet the demand for rapid iterative development of high-performance glass fibers. In recent years, with the advancement of materials genome initiatives and the continuous accumulation of materials databases, data-driven methods typified by machine learning have been preliminarily applied to glass property prediction and composition design. Existing research shows that machine learning models trained on composition-property data, such as random forests, support vector machines and artificial neural networks, can predict glass properties rapidly to a certain extent.
However, these methods have the following main shortcomings. First, the predictive capability of a model depends heavily on the quantity and quality of labeled data; high-quality experimental data in the glass fiber field are costly to acquire and limited in scale, so the generalization capability of the models is insufficient. Second, existing methods generally use only composition-property numerical data and cannot effectively integrate atomic-scale structure information or the domain expertise contained in the literature, so knowledge utilization is low. Third, a single model can hardly support knowledge query and reasoning in natural language, so interactive human-machine material design assistance cannot be realized. The rapid development of large language models (Large Language Model, LLM) offers a new approach to these problems. Large language models have strong natural language understanding and generation capabilities and can reason about and solve complex problems. However, applying a general-purpose large language model in a specific professional field has obvious limitations. On the one hand, its training data consist mainly of general corpora, its coverage of professional knowledge in the glass fiber field is insufficient, and it is prone to hallucinations or erroneous reasoning. On the other hand, a large language model cannot directly process and reason over structured numerical data (such as composition-property data) or graph structure data (such as atomic structure graphs), so it is difficult to make full use of the multi-source heterogeneous data of the glass fiber field.
Retrieval-augmented generation (Retrieval-Augmented Generation, RAG) dynamically retrieves from an external knowledge base during generation and fuses the retrieved information into the model input, which effectively alleviates the knowledge limitations of large language models and has achieved notable results in tasks such as open-domain question answering and professional-domain knowledge question answering. However, existing RAG methods are mainly oriented to single-modality text data. They lack the ability to uniformly represent, jointly retrieve and cooperatively use multi-source heterogeneous data such as composition-property numerical data, atomic-scale graph structure data and text data, and are therefore difficult to apply directly to materials research and development scenarios such as glass fibers, which require the fusion of several types of professional data. In summary, there is currently no intelligent material design assistance method that effectively fuses multi-source heterogeneous data in the glass fiber field, makes full use of domain expertise and supports natural language interaction.
Disclosure of Invention
Aiming at the defects or shortcomings of the prior art, the invention provides a large-model-based glass fiber f