CN-122024925-A - High-entropy alloy crystal structure clustering method integrating large language model knowledge


Abstract

The invention discloses a high-entropy alloy crystal structure clustering method integrating large language model knowledge, relating to the technical field of machine learning. The method comprises: cleaning the original high-entropy alloy atomic-scale data by removing administrative fields irrelevant to the crystal structure, retaining core physical attributes such as atomic species, number of atoms, total energy, three-dimensional atomic coordinates and three-dimensional stress, and filling zeros into the fields of missing atomic species to form structured table data; constructing the attribute values of each sample into an original feature vector based on the structured table data, and standardizing all feature dimensions to eliminate the influence of differing units and scales, yielding a numerical embedding representation; using a large language model to analyze the semantic relations between attribute names and attribute values in the numerical embedding representation and generate professional natural language text describing microstructure characteristics; and inputting that text into a dedicated text embedding model to convert it into a semantic text embedding vector for each sample.

Inventors

LI FEIJIANG
LI ZHENXIONG
WANG JIETING
QIAN YUHUA

Assignees

Shanxi University (山西大学)

Dates

Publication Date: 20260512
Application Date: 20260126

Claims (10)

1. A high-entropy alloy crystal structure clustering method integrating large language model knowledge, characterized by comprising the following steps:
cleaning the original high-entropy alloy atomic-scale data by removing administrative fields irrelevant to the crystal structure, retaining core physical attributes such as atomic species, number of atoms, total energy, three-dimensional atomic coordinates and three-dimensional stress, and filling zeros into the fields of missing atomic species, to form structured table data;
based on the structured table data, constructing the attribute values of each sample into an original feature vector and standardizing all feature dimensions to eliminate the influence of differing units and scales, to obtain a numerical embedding representation;
using a large language model to analyze the semantic relations between the attribute names and attribute values in the numerical embedding representation, and generating professional natural language text describing the microstructure characteristics of each sample;
inputting the professional natural language text into a dedicated text embedding model and converting it into a semantic text embedding vector for the corresponding sample;
taking the numerical embedding representation and the semantic text embedding vector as bimodal input, constructing neighborhood distributions in the numerical space and the text space respectively, and establishing an association mapping between the two spaces through a cross-modal mutual distillation mechanism; and
computing a neighborhood distillation loss, a modal consistency loss and an entropy regularization loss from the neighborhood distributions, and jointly optimizing a clustering objective function to output the clustering result of the high-entropy alloy crystal structures.
2. The high-entropy alloy crystal structure clustering method integrating large language model knowledge according to claim 1, characterized in that cleaning the original high-entropy alloy atomic-scale data, removing administrative fields irrelevant to the crystal structure, retaining core physical attributes such as atomic species, number of atoms, total energy, three-dimensional atomic coordinates and three-dimensional stress, and filling zeros into the fields of missing atomic species to form structured table data comprises the following specific steps:
identifying the fields containing sample numbers, test batches and experimenters in the original data, and removing them from the data set;
for each atomic species $k$ present in a sample, extracting the three-dimensional coordinates $(x_{k,i}, y_{k,i}, z_{k,i})$ and three-dimensional stress $(\sigma^{x}_{k,i}, \sigma^{y}_{k,i}, \sigma^{z}_{k,i})$ of all atoms of that species, where $k$ is the index of the atomic species and $i$ denotes the $i$-th atom of that species;
computing the mean of each coordinate component over atoms of the same species: $\bar{x}_k = \frac{1}{N_k}\sum_{i=1}^{N_k} x_{k,i}$, $\bar{y}_k = \frac{1}{N_k}\sum_{i=1}^{N_k} y_{k,i}$, $\bar{z}_k = \frac{1}{N_k}\sum_{i=1}^{N_k} z_{k,i}$;
computing the mean of each stress component in the same way: $\bar{\sigma}^{x}_k = \frac{1}{N_k}\sum_{i=1}^{N_k} \sigma^{x}_{k,i}$, $\bar{\sigma}^{y}_k = \frac{1}{N_k}\sum_{i=1}^{N_k} \sigma^{y}_{k,i}$, $\bar{\sigma}^{z}_k = \frac{1}{N_k}\sum_{i=1}^{N_k} \sigma^{z}_{k,i}$, where $N_k$ is the number of atoms of species $k$ in the sample;
if an atomic species $k$ does not appear in the current sample, setting its atom count $N_k$ to 0 and setting its coordinate means $(\bar{x}_k, \bar{y}_k, \bar{z}_k)$ and stress means $(\bar{\sigma}^{x}_k, \bar{\sigma}^{y}_k, \bar{\sigma}^{z}_k)$ all to 0;
retaining the total energy $E$ of the sample and the total number of atoms $N$ directly as global attribute fields, without aggregation or transformation; and
arranging the fields of all atomic species in a fixed order to form one row of the structured record.
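The per-species aggregation in this claim can be illustrated with a minimal Python sketch. The species list, the field names (`x`, `sx`, `total_energy`, and so on) and the function name `aggregate_sample` are assumptions made for the example and do not come from the patent text.

```python
from collections import defaultdict

# Hypothetical fixed species order; the real order is whatever the data set defines.
SPECIES = ["Co", "Cr", "Fe", "Mn", "Ni"]
# Coordinate components followed by stress components (names are illustrative).
COMPONENTS = ["x", "y", "z", "sx", "sy", "sz"]


def aggregate_sample(atoms, total_energy):
    """Turn one sample's atom list into a flat structured record.

    atoms: list of dicts with the keys 'species', 'x', 'y', 'z', 'sx', 'sy', 'sz'.
    """
    sums = defaultdict(lambda: [0.0] * len(COMPONENTS))
    counts = defaultdict(int)
    for atom in atoms:
        k = atom["species"]
        counts[k] += 1
        for j, name in enumerate(COMPONENTS):
            sums[k][j] += atom[name]

    # Global attributes are kept as-is, without aggregation.
    record = {"total_energy": total_energy, "n_atoms": len(atoms)}
    for k in SPECIES:  # fixed order keeps columns aligned across samples
        n_k = counts.get(k, 0)
        record[f"{k}_count"] = n_k
        for j, name in enumerate(COMPONENTS):
            # Species missing from the sample are zero-filled.
            record[f"{k}_{name}_mean"] = sums[k][j] / n_k if n_k else 0.0
    return record


atoms = [
    {"species": "Fe", "x": 0.0, "y": 0.0, "z": 0.0, "sx": 1.0, "sy": 2.0, "sz": 3.0},
    {"species": "Fe", "x": 2.0, "y": 4.0, "z": 6.0, "sx": 3.0, "sy": 2.0, "sz": 1.0},
    {"species": "Ni", "x": 1.0, "y": 1.0, "z": 1.0, "sx": 0.0, "sy": 0.0, "sz": 0.0},
]
record = aggregate_sample(atoms, total_energy=-12.5)
```

Because every record lists the species in the same order, records from different samples can be stacked into one table even when some species are absent.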
3. The high-entropy alloy crystal structure clustering method integrating large language model knowledge according to claim 2, characterized in that constructing the attribute values of each sample into an original feature vector based on the structured table data, and standardizing all feature dimensions to eliminate the influence of differing units and scales to obtain a numerical embedding representation, comprises the following specific steps:
concatenating all numerical fields of each row of the structured table, in a predefined order, into an original feature vector $\mathbf{x}_i = [x_{i1}, x_{i2}, \dots, x_{id}] \in \mathbb{R}^{d}$, where $d$ is the total number of feature dimensions;
for the $j$-th feature dimension, computing the mean $\mu_j$ and standard deviation $\sigma_j$ over all $n$ samples: $\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$, $\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \mu_j)^2}$;
performing Z-score normalization on the $j$-th feature of every sample: $z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$; and
obtaining the normalized numerical embedding representation $\mathbf{z}_i = [z_{i1}, z_{i2}, \dots, z_{id}]$ for subsequent bimodal modeling.
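The Z-score step of this claim can be sketched in a few lines of standard-library Python. Two choices here are assumptions rather than statements from the patent: the population standard deviation (`pstdev`) is used, and a column with zero variance (for example a species that is zero-filled in every sample) is mapped to 0 instead of being divided by zero.

```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize each feature dimension (column) across all samples (rows)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation is an assumption of this sketch.
    stds = [pstdev(col) for col in columns]
    return [
        # Constant columns get 0.0 to avoid dividing by a zero deviation.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


rows = [[1.0, 10.0, 0.0], [2.0, 20.0, 0.0], [3.0, 30.0, 0.0]]
z = zscore_columns(rows)
```

In this toy input the first two columns differ only by a factor of ten, so they standardize to the same values, which is the scale-removal effect the claim is after.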
4. The high-entropy alloy crystal structure clustering method integrating large language model knowledge according to claim 3, characterized in that using the large language model to analyze the semantic relations between the attribute names and attribute values in the numerical embedding representation, and generating professional natural language text describing the microstructure characteristics of each sample, comprises: taking the column name sequence $\{c_1, c_2, \dots, c_d\}$ of the structured table and the normalized values $\{z_{i1}, z_{i2}, \dots, z_{id}\}$ of the corresponding sample as the prompt content; calling a Llama large language model fine-tuned on a materials science corpus, which reasons over the prompt content and outputs a natural language description $t_i$; the description covers semantic information related to the crystal structure, such as the atomic composition ratio, the typical atomic spacing trend, the overall stress balance state and the energy stability; and the generated text $t_i$ introduces no hypothetical content absent from the original data, performing only logical deduction and terminology-normalized expression based on the input fields.
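As an illustration of how column names and normalized values might be paired into prompt content, the following sketch builds a prompt string only; it does not call any model. The instruction wording, the four-decimal formatting and the function name `build_prompt` are hypothetical choices, not the prompt used by the inventors.

```python
def build_prompt(column_names, values):
    """Pair each attribute name with its value and wrap them in an instruction."""
    if len(column_names) != len(values):
        raise ValueError("column_names and values must have the same length")
    fields = "\n".join(
        f"- {name}: {value:.4f}" for name, value in zip(column_names, values)
    )
    # The instruction text below is illustrative; it mirrors the claim's
    # requirement that no content beyond the input fields is introduced.
    return (
        "You are a materials science assistant. Using ONLY the fields below, "
        "describe the sample's atomic composition, stress state and energy "
        "stability. Do not add facts that are not implied by the fields.\n"
        f"{fields}\n"
        "Description:"
    )


prompt = build_prompt(["Fe_count", "total_energy"], [0.5, -1.25])
```

Keeping the name and value on the same line gives the model the attribute semantics alongside each number, which is the association the claim relies on.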
5. The high-entropy alloy crystal structure clustering method integrating large language model knowledge according to claim 4, characterized in that inputting the professional natural language text into a dedicated text embedding model and converting it into a semantic text embedding vector for the corresponding sample comprises the following specific steps: inputting the generated natural language description $t_i$ into the nomic-embed-text embedding model, whose encoder outputs a dense vector $\mathbf{e}_i$ of fixed dimension $d_t$; the vector characterizes the structural-chemical features of the high-entropy alloy in semantic space; and the vector $\mathbf{e}_i$ and the numerical embedding representation $\mathbf{z}_i$ form a bimodal feature pair for subsequent cross-modal alignment.
6. The high-entropy alloy crystal structure clustering method integrating large language model knowledge according to claim 5, characterized in that taking the numerical embedding representation and the semantic text embedding vector as bimodal input, constructing neighborhood distributions in the numerical space and the text space respectively, and establishing an association mapping between the two spaces through a cross-modal mutual distillation mechanism comprises the following steps: first, for each sample, given the original data vector $\mathbf{x}_i$, the large language model $\mathrm{LLM}$ parses the table header, that is, the attribute names, then associates each attribute name with the corresponding sample attribute value, and finally uses expert knowledge to generate a natural language text describing the characteristics of the sample, according to the formula $t_i = \mathrm{LLM}(\mathbf{x}_i)$, where $\mathbf{x}_i$ is the original vector of the $i$-th sample, $\mathrm{LLM}$ is the large language model, and $t_i$ is the text description of the $i$-th sample generated by large language model inference; then, since the nomic-embed-text model is a model dedicated to converting text into embedding vectors, the text $t_i$ generated by the large language model is taken as input and converted by the pre-trained nomic-embed-text model into the corresponding numerical text embedding vector, according to the formula $\mathbf{e}_i = \mathrm{Embed}(t_i)$, where $\mathrm{Embed}$ is the embedding interface of the nomic-embed-text model and $\mathbf{e}_i$ is the text embedding vector of the $i$-th sample.
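The two-stage mapping in this claim (sample vector to text, then text to embedding) can be sketched with the two models passed in as plain callables. The function `describe_and_embed` and the toy stand-ins `fake_generate` and `fake_embed` are hypothetical; they only make the data flow runnable and are not calls to a Llama model or to nomic-embed-text.

```python
from typing import Callable, List, Sequence, Tuple


def describe_and_embed(
    samples: Sequence[Sequence[float]],
    generate: Callable[[Sequence[float]], str],
    embed: Callable[[str], List[float]],
) -> List[Tuple[List[float], List[float]]]:
    """Return (numerical vector, text embedding) pairs, one per sample."""
    pairs = []
    for x in samples:
        text = generate(x)    # stands in for the large language model
        vector = embed(text)  # stands in for the text embedding model
        pairs.append((list(x), vector))
    return pairs


# Toy stand-ins so the sketch runs without any model; NOT real model calls.
def fake_generate(x):
    return "sample with " + " ".join(f"{v:.1f}" for v in x)


def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]


pairs = describe_and_embed([[0.1, 0.2], [1.0, 2.0, 3.0]], fake_generate, fake_embed)
```

Injecting the two callables keeps the pairing logic independent of whichever model client is actually used.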
7. The high-entropy alloy crystal structure clustering method integrating large language model knowledge according to claim 6, characterized in that computing the neighborhood distillation loss, the modal consistency loss and the entropy regularization loss from the neighborhood distributions, and jointly optimizing the clustering objective function to output the clustering result of the high-entropy alloy crystal structures, comprises the following steps:
for each sample $\mathbf{x}_i$, querying its $k$-nearest-neighbor set with FAISS and defining the table-sample neighborhood distribution $p_{ij}$ by normalizing the similarities over the neighbor set, where $\mathbf{x}_i$ is the raw-data embedding vector of the $i$-th sample, $\mathbf{e}_i$ is the text embedding vector of the $i$-th sample, $\mathrm{sim}(\cdot,\cdot)$ is the similarity function, $\mathcal{N}_i$ is the neighbor set of the $i$-th sample, $j$ is the index of a neighbor sample, and $p_{ij}$ is the similarity probability between the table data sample $\mathbf{x}_i$ and its corresponding text neighbor $\mathbf{e}_j$;
defining the text neighborhood distribution $q_{ij}$ in the same way, where $q_{ij}$ is the similarity probability between the text embedding sample $\mathbf{e}_i$ and its corresponding table-data neighbor $\mathbf{x}_j$;
neighborhood distillation loss: $\mathcal{L}_{nd} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{KL}(\mathbf{q}_i \,\|\, \mathbf{p}_i)$, where $N$ is the total number of samples, $\mathbf{q}_i$ is the neighborhood distribution vector of sample $i$ in the text space, $\mathbf{p}_i$ is the neighborhood distribution vector of sample $i$ in the table data space, $\mathrm{KL}$ is the Kullback-Leibler divergence measuring the difference between the two distributions $\mathbf{q}_i$ and $\mathbf{p}_i$, and $\mathcal{L}_{nd}$ is the distillation loss used to align the neighborhood distributions of the table and the text; the smaller its value, the more consistent the neighborhood distributions of the two modalities;
modal consistency loss $\mathcal{L}_{mc}$, computed over the $N$ samples from $\hat{\mathbf{p}}_i$ and $\hat{\mathbf{q}}_i$, where $\hat{\mathbf{p}}_i$ is the predicted distribution of the $i$-th table data sample, $\hat{\mathbf{q}}_i$ is the predicted distribution of the corresponding text, $N$ is the total number of samples, and $\mathcal{L}_{mc}$ measures the consistency of the table modality and the text modality on the clustering prediction distribution; the smaller its value, the more consistent the predictions of the two modalities;
entropy regularization (collapse prevention), in which an entropy term is designed to encourage overall uniformity and suppress overconfidence: the entropy regularization loss $\mathcal{L}_{ent}$ is computed from the average prediction distribution of the table data samples $\bar{\mathbf{p}} = \frac{1}{N}\sum_{i=1}^{N} \hat{\mathbf{p}}_i$ and the average prediction distribution of the text data samples $\bar{\mathbf{q}} = \frac{1}{N}\sum_{i=1}^{N} \hat{\mathbf{q}}_i$; it measures the uniformity of the prediction distribution and prevents all samples from being assigned to the same class; the larger its value, the higher the entropy and the more uniform the distribution; and
the total loss $\mathcal{L}$ is a weighted combination of the three losses, where $\lambda$ denotes the weight; a smaller value of $\mathcal{L}$ indicates a better overall training effect of the model.
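The three loss terms of this claim can be made concrete with a small pure-Python sketch. The formulas in the source text did not survive extraction, so everything below is an assumed, commonly used form rather than the patented definition: cosine similarity, a softmax over a neighbor set chosen in the numerical space, KL from the text to the table distribution, a symmetric KL for modal consistency, and subtraction of the entropy term in the total. Brute-force search stands in for FAISS, and all function names and weights (`w_con`, `w_ent`) are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def knn(vectors, i, k):
    """Brute-force k nearest neighbours by cosine similarity (FAISS stand-in)."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
    return others[:k]


def neighborhood_distribution(vectors, i, neighbors):
    """Softmax of similarities between sample i and a given neighbour set."""
    return softmax([cosine(vectors[i], vectors[j]) for j in neighbors])


def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    return -sum(pi * math.log(pi + eps) for pi in p)


def mean_distribution(dists):
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]


def total_loss(tab_vecs, txt_vecs, tab_pred, txt_pred, k=2, w_con=1.0, w_ent=1.0):
    n = len(tab_vecs)
    distill = 0.0
    for i in range(n):
        nbrs = knn(tab_vecs, i, k)  # neighbour set taken in the numerical space
        p_tab = neighborhood_distribution(tab_vecs, i, nbrs)
        p_txt = neighborhood_distribution(txt_vecs, i, nbrs)
        distill += kl(p_txt, p_tab)
    distill /= n
    # Symmetric KL between per-sample cluster predictions (assumed form).
    consistency = sum(
        kl(p, q) + kl(q, p) for p, q in zip(tab_pred, txt_pred)
    ) / (2 * n)
    ent = entropy(mean_distribution(tab_pred)) + entropy(mean_distribution(txt_pred))
    # Entropy is subtracted so that minimizing the total favours balanced clusters.
    return distill + w_con * consistency - w_ent * ent, distill, consistency, ent


tab = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pred = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
loss, distill, consistency, ent = total_loss(tab, tab, pred, pred)
```

When both modalities carry identical vectors and predictions, the distillation and consistency terms vanish and only the entropy term remains, which is a quick sanity check for any implementation of these losses.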
8. The high-entropy alloy crystal structure clustering method integrating large language model knowledge, characterized in that the clustering result is used to guide the crystal structure design and performance prediction of new high-entropy alloy materials, specifically comprising: performing structural commonality analysis on the high-entropy alloy samples assigned to the same category in the clustering output, and extracting the statistical regularities of that category in atomic composition ratio, average atomic spacing, total energy distribution and stress balance state; constructing a structure-category mapping rule base from these statistical regularities; when new high-entropy alloy atomic-scale data is input, executing the whole process to obtain the category label of the new material; and feeding the inferred result back as prior knowledge to high-throughput material screening or first-principles calculation workflows, so as to reduce the search space of candidate structures and improve the efficiency of new material research and development.
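The per-category statistics in this claim could be gathered along the following lines. This is a minimal sketch assuming each sample is already a numeric feature row with a cluster label; the choice of mean, minimum and maximum as the summary statistics and the function name `cluster_profiles` are illustrative, and the claim's rule base itself is not specified in enough detail to reproduce.

```python
from collections import defaultdict
from statistics import fmean


def cluster_profiles(features, labels):
    """Summarize each cluster by per-feature mean, minimum and maximum."""
    groups = defaultdict(list)
    for row, label in zip(features, labels):
        groups[label].append(row)
    profiles = {}
    for label, rows in groups.items():
        columns = list(zip(*rows))
        profiles[label] = {
            "size": len(rows),
            "mean": [fmean(col) for col in columns],
            "min": [min(col) for col in columns],
            "max": [max(col) for col in columns],
        }
    return profiles


features = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
labels = [0, 0, 1]
profiles = cluster_profiles(features, labels)
```

Such a profile table is one simple starting point from which category-level regularities could be read off.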
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the high-entropy alloy crystal structure clustering method integrating large language model knowledge according to any one of claims 1-8.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the high-entropy alloy crystal structure clustering method integrating large language model knowledge according to any one of claims 1-8.

Description

High-entropy alloy crystal structure clustering method integrating large language model knowledge

Technical Field

The invention relates to the technical field of machine learning, and in particular to a high-entropy alloy crystal structure clustering method integrating large language model knowledge.

Background

High-entropy alloys are an important breakthrough in materials science in recent years. They are multi-principal-element solid solution systems composed of five or more principal elements in nearly equiatomic proportions. This unique compositional design gives them four effects that are difficult to achieve in conventional alloys, namely the high-entropy effect, the lattice distortion effect, the sluggish diffusion effect and the cocktail effect, and hence remarkable advantages in mechanical properties, corrosion resistance, thermal stability, irradiation resistance and the like. However, such a complex multi-principal-element system also makes the microstructure extremely complex, with interwoven structural features such as local chemical order, nanoclusters and lattice distortion regions, so that conventional structural characterization methods based on one or a few features struggle to reveal the nature of the microstructure fully and accurately. Developing new analysis methods that can resolve the complex microstructure of high-entropy alloys in depth has therefore become a key scientific problem for establishing reliable "composition-structure-property" correlation models and realizing property-oriented material design.

In the field of microstructure characterization, cluster analysis based on atomic-scale simulation data has become an important means of studying the local atomic environment of materials.
Simulation methods such as molecular dynamics and first-principles calculation yield multidimensional physical feature data including atomic species, spatial positions, stress states and energy distributions. The prior art processes these high-dimensional features with dimensionality reduction methods such as principal component analysis and t-distributed stochastic neighbor embedding (t-SNE), and groups and classifies atoms with unsupervised learning algorithms such as K-means clustering and density-based clustering. Although such methods can distinguish different types of local atomic environments to some extent, they remain in essence mathematical classification based on numerical similarity and cannot integrate the physical laws of materials science into the clustering process. When applied to high-entropy alloy systems, conventional numerical clustering faces three core difficulties: first, the purely data-driven paradigm cannot introduce large language model knowledge as guidance, so clustering results become misaligned with the physical mechanisms; second, relying on numerical features alone gives limited discriminative power when describing different phase structures with similar physical states, which reduces clustering accuracy; third, most methods rely on known structure labels for supervised learning, which makes them hard to apply to exploratory research on new materials with unknown composition and structure and restricts their potential in efficient material discovery.

Disclosure of Invention

The present invention has been made in view of the above problems in the prior art.
The invention provides a high-entropy alloy crystal structure clustering method integrating large language model knowledge, which solves the following problems: the purely data-driven paradigm cannot introduce large language model knowledge as guidance, so clustering results become misaligned with the physical mechanisms; relying on numerical features alone gives limited discriminative power when describing different phase structures with similar physical states, which reduces clustering accuracy; and most methods rely on known structure labels for supervised learning, which makes them hard to apply to exploratory research on new materials with unknown composition and structure and restricts their potential in efficient material discovery.

To solve the above technical problems, the invention provides the following technical solution:

In a first aspect, the invention provides a high-entropy alloy crystal structure clustering method integrating large language model knowledge, comprising the following steps: cleaning the original high-entropy alloy atomic-scale data, removing administrative fields irrelevant to the crystal structure, retaining core physical attributes such as atomic species, number of atoms, total