
CN-121303234-B - Industrial cognition base system based on multi-mode contrast learning and execution method

CN121303234B

Abstract

The invention provides an industrial cognitive base system based on multi-modal contrast learning, and an execution method thereof. The system constructs a framework comprising a knowledge graph management module and a cross-modal encoding and fusion module, through which industrial domain knowledge is dynamically injected into the multi-modal feature learning process in the form of structured sub-graphs, and combines this with a contrast learning mechanism that introduces industrial semantic constraints. As a result, the finally trained model not only achieves semantic alignment of multi-modal data, but also ensures that the generated unified semantic representations and reasoning results strictly conform to predefined industrial logic rules. This effectively overcomes the defect of purely data-driven methods in the prior art, which may produce results contrary to industrial common sense, and significantly improves the output reliability, decision reliability and practical application value of the industrial cognitive system in key tasks such as fault diagnosis and condition monitoring.

Inventors

  • WANG HONGMEI
  • LIU CHANG
  • LI JIANHUA
  • LIU HONGYAN
  • JIN XIANG

Assignees

  • 北京易智时代数字科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-12-12

Claims (9)

  1. An industrial cognitive base system based on multi-modal contrast learning, comprising:
a data acquisition and preprocessing module, configured to synchronously acquire data of a text modality, a visual modality, a time-series modality and a structured-knowledge modality from a multi-source system at an industrial site, and to generate standardized sample triplets after alignment, wherein each sample triplet comprises an image, a text description and a candidate identifier set pointing to core entities in a preset target knowledge graph, the candidate identifier set being obtained from the aligned text description by named entity recognition and preliminary entity linking;
a knowledge graph management module, configured to store and maintain the target knowledge graph, to perform semantic analysis and entity linking according to the text description and the identifier set in the sample triplet, and to extract from the stored target knowledge graph an associated sub-graph that is relevant to the current context and rich in semantic relations;
a cross-modal encoding and fusion module, configured to extract visual features from the image, text features from the text description and knowledge features from the associated sub-graph, to fuse the knowledge features with the text features to generate knowledge-enhanced text features, and to map the knowledge-enhanced text features and the visual features to vector representations in a unified semantic space;
a contrast learning training module, configured to calculate a contrast loss and an industrial semantic constraint loss using the vector representations, and to optimize model parameters based on a composite loss function to obtain an optimized industrial cognitive base model, wherein the composite loss function comprises a standard contrast loss and the industrial semantic constraint loss, the industrial semantic constraint loss penalizes, based on a rule consistency discriminator, vector representations that violate predefined industrial logic rules, and is calculated based on a rule violation score given by:
V_i = Σ_{k=1..K} w_k · φ_k( d( P_k^v·v_i , P_k^t·t_i ) )
wherein V_i is the rule violation score, a non-negative scalar quantifying the degree to which the i-th sample violates the industrial rules, a larger value indicating a more serious violation; v_i is the visual feature vector output by the visual encoder of the cross-modal encoding and fusion module, representing the semantic embedding of the image in the i-th sample; t_i is the knowledge-enhanced text feature vector output by the knowledge injection module of the cross-modal encoding and fusion module, representing the semantic embedding of the text in the i-th sample after knowledge enhancement; w_k is the violation penalty weight of the k-th rule, a non-negative scalar parameter used to adjust the importance of that rule in the total violation score; φ_k is the smooth violation intensity function of the k-th rule, used to calculate the violation intensity of the k-th rule at a given distance value, its output value rising smoothly from 0; d is a distance function defined as the squared Euclidean distance, i.e. d(a, b) = ‖a − b‖², used to calculate the difference between two vectors; P_k^v is the visual projection matrix corresponding to the k-th rule, a D_r × D matrix used to project visual features into the semantic subspace of interest to the k-th rule; P_k^t is the text projection matrix corresponding to the k-th rule, a D_r × D matrix used to project knowledge-enhanced text features into the same rule semantic subspace; D denotes the dimension of the original feature vectors, and D_r denotes the dimension of the projected feature vectors;
an online continuous learning module, configured to monitor a stream of new sample data, to detect semantic drift, and to trigger incremental model updating when the semantic drift exceeds a threshold; and
a model service module, configured to package the optimized industrial cognitive base model and to provide embedding computation, similarity query and model update interfaces.
  2. The system of claim 1, wherein the data acquisition and preprocessing module is specifically configured to align visual frames and time-series segments with corresponding text descriptions based on time stamps and device identifiers, and to link industrial terms in the text to knowledge-graph nodes via named entity recognition and entity linking techniques.
  3. The system of claim 1, wherein the target knowledge graph stored by the knowledge graph management module comprises a device knowledge graph, a process parameter library and a historical fault case library, supporting SPARQL queries and sub-graph extraction.
  4. The system of claim 1, wherein the cross-modal encoding and fusion module comprises a visual encoder, a text encoder and a knowledge injection module, wherein the visual encoder extracts image features using a Vision Transformer network, the text encoder extracts text features using a Transformer encoder, and the knowledge injection module encodes the associated sub-graph using a graph attention network to obtain knowledge features and fuses the knowledge features with the text features to generate knowledge-enhanced text features.
  5. The system of claim 1, wherein the smooth violation intensity function φ_k is computed by a formula comprising:
φ_k(d) = ln( 1 + exp( β_k · (d − τ_k) ) )
wherein τ_k is the tolerance threshold of the k-th rule, a non-negative scalar representing the maximum squared distance between the visual and text projection features permitted by the rule, beyond which the degree of violation is considered to increase markedly; and β_k is the sensitivity coefficient of the k-th rule, a positive scalar controlling the rate at which the violation intensity grows as the distance exceeds the threshold.
  6. The system of claim 5, wherein initializing the projection matrices P_k^v and P_k^t comprises: extracting, from the knowledge graph management module, embedded representations of the entities and relations involved in the k-th rule, and mapping them through a linear transformation layer into the D_r-dimensional space to initialize the projection matrices P̂_k^v and P̂_k^t, respectively; and wherein optimizing the projection matrices P_k^v and P_k^t comprises: during training of the contrast learning training module, treating P_k^v and P_k^t as learnable parameters and adding a regularization term to the composite loss function, the regularization term being calculated by a formula comprising:
L_k^reg = ‖ P_k^v − P̂_k^v ‖_F² + ‖ P_k^t − P̂_k^t ‖_F²
wherein L_k^reg is the projection matrix regularization loss term corresponding to the k-th industrial logic rule, P̂_k^v and P̂_k^t are the matrices obtained in the initialization step, and ‖·‖_F denotes the Frobenius norm of a matrix, used to constrain the deviation between the optimized projection matrices and their initializations.
  7. The system of claim 5, wherein the penalty weights w_k are determined by an adaptive weight adjustment mechanism whose calculation formula comprises:
w_k = γ · [ α · σ(s_k) + (1 − α) · tanh( η · Ṽ_k ) ]
wherein γ is the global penalty scaling factor, a positive scalar hyperparameter preset by the trainer, used to adjust the overall strength of the penalty weights of all rules; α is the dynamic-static weight mixing coefficient, an adjustable hyperparameter in the interval [0, 1], used to balance the proportion between the statically preset importance of a rule and the dynamic violation intensity learned during model training; s_k is the static importance weight of the k-th rule, a real number derived from the rule importance score preset by a domain expert in the knowledge graph management module; σ is the sigmoid function, which maps its input into the interval (0, 1) and is used to map the expert-scored static importance into a stable numeric range; tanh is the hyperbolic tangent function, which maps its input into the interval (−1, 1), with an actual output range of [0, 1) for non-negative inputs, and is used to map the scaled dynamic violation intensity into a bounded range; η is the scaling factor of the dynamic violation intensity, a positive scalar hyperparameter preset by the trainer, used to adjust the numeric range of the dynamic violation intensity and to control the sensitivity of the final weight to the dynamic part; and Ṽ_k is the smoothed dynamic violation intensity of the k-th rule at the current training stage, a non-negative real number computed by momentum updating to reflect the average violation intensity of the k-th rule over the current training batches while avoiding batch fluctuation, the calculation of Ṽ_k comprising:
Ṽ_k^{(t)} = μ · Ṽ_k^{(t−1)} + (1 − μ) · (1/B) · Σ_{i=1..B} φ_k( d( P_k^v·v_i , P_k^t·t_i ) )
wherein μ is the smoothing coefficient with value range [0.9, 1), used to control how much historical violation intensity information is retained in the smoothing calculation; B is the batch size; and Ṽ_k^{(t)} denotes the smoothed dynamic violation intensity value at the t-th training iteration.
  8. The system of claim 1, wherein the model service module is packaged as a micro-service and provides services through a plurality of application programming interfaces, the application programming interfaces comprising: a visual embedding interface, for receiving an input image and returning the corresponding visual embedding vector; a knowledge-enhanced text embedding interface, for receiving an input text and returning the corresponding knowledge-enhanced text embedding vector; a similarity calculation interface, for calculating and returning the semantic similarity between an image and a text; and a model update interface, for receiving new samples and triggering the online continuous learning module to perform an incremental update.
  9. A method for performing industrial multi-modal cognition, applied to the industrial cognitive base system based on multi-modal contrast learning of any one of claims 1 to 8, the method comprising:
constructing an industrial multi-modal heterogeneous training data set: synchronously acquiring data of a text modality, a visual modality, a time-series modality and a structured-knowledge modality from a multi-source system at an industrial site, and aligning the data to generate standardized sample triplets, wherein each sample triplet comprises an image, a text description and a candidate identifier set pointing to core entities in a preset target knowledge graph, the candidate identifier set being obtained from the aligned text description by named entity recognition and preliminary entity linking;
constructing an industrial enhanced cross-modal encoder comprising a visual encoder, a text encoder and a knowledge injection module, which extract visual features from the image, text features from the text description and knowledge features from the associated sub-graph, fuse the knowledge features with the text features to generate knowledge-enhanced text features, and map the knowledge-enhanced text features and the visual features to vector representations in a unified semantic space;
designing a composite contrast loss function with industrial semantic constraints, comprising a standard contrast loss and an industrial semantic constraint loss calculated based on a rule violation score that quantifies the degree to which the model output violates the industrial rules;
performing end-to-end contrast learning training: optimizing model parameters using the vector representations and the composite contrast loss function, and outputting an optimized industrial cognitive base model, wherein the composite loss function used by the contrast learning training module comprises the standard contrast loss and the industrial semantic constraint loss, the industrial semantic constraint loss penalizes, based on a rule consistency discriminator, vector representations that violate predefined industrial logic rules, and is calculated based on a rule violation score given by:
V_i = Σ_{k=1..K} w_k · φ_k( d( P_k^v·v_i , P_k^t·t_i ) )
wherein V_i is the rule violation score, a non-negative scalar quantifying the degree to which the i-th sample violates the industrial rules, a larger value indicating a more serious violation; v_i is the visual feature vector output by the visual encoder of the cross-modal encoding and fusion module, representing the semantic embedding of the image in the i-th sample; t_i is the knowledge-enhanced text feature vector output by the knowledge injection module of the cross-modal encoding and fusion module, representing the semantic embedding of the text in the i-th sample after knowledge enhancement; w_k is the violation penalty weight of the k-th rule, a non-negative scalar parameter used to adjust the importance of that rule in the total violation score; φ_k is the smooth violation intensity function of the k-th rule, used to calculate the violation intensity of the k-th rule at a given distance value, its output value rising smoothly from 0; d is a distance function defined as the squared Euclidean distance, i.e. d(a, b) = ‖a − b‖², used to calculate the difference between two vectors; P_k^v is the visual projection matrix corresponding to the k-th rule, a D_r × D matrix used to project visual features into the semantic subspace of interest to the k-th rule; P_k^t is the text projection matrix corresponding to the k-th rule, a D_r × D matrix used to project knowledge-enhanced text features into the same rule semantic subspace; D denotes the dimension of the original feature vectors, and D_r denotes the dimension of the projected feature vectors; and
deploying an online continuous learning and model updating mechanism: monitoring a stream of new sample data, detecting semantic drift, and triggering incremental model updating when the semantic drift exceeds a threshold.
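The formula images of the original publication do not survive in this text-only extraction, so the sketch below is one hypothetical reading of the definitions in claims 1 and 5: a rule violation score formed as a weighted sum, over rules, of a softplus-shaped smooth violation intensity applied to the squared Euclidean distance between rule-specific projections of the visual and knowledge-enhanced text features. All names and the exact softplus form are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def rule_violation_score(v, t, P_vis, P_txt, w, tau, beta):
    """Rule violation score for one sample (illustrative sketch).

    v      : (D,)   visual feature vector from the visual encoder
    t      : (D,)   knowledge-enhanced text feature vector
    P_vis  : list of (D_r, D) per-rule visual projection matrices
    P_txt  : list of (D_r, D) per-rule text projection matrices
    w      : per-rule non-negative violation penalty weights
    tau    : per-rule tolerance thresholds (max allowed squared distance)
    beta   : per-rule sensitivity coefficients (rate of increase)
    """
    score = 0.0
    for k in range(len(w)):
        # squared Euclidean distance in the k-th rule's semantic subspace
        d = float(np.sum((P_vis[k] @ v - P_txt[k] @ t) ** 2))
        # softplus-shaped smooth violation intensity: rises smoothly
        # from 0 as the distance exceeds the rule's tolerance threshold
        phi = float(np.log1p(np.exp(beta[k] * (d - tau[k]))))
        score += w[k] * phi
    return score
```

The score is non-negative by construction and grows as the projected visual and text embeddings drift apart beyond a rule's tolerance, matching the claimed "larger value indicates a more serious violation" behavior.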
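Claim 6 describes a Frobenius-norm regularization term that keeps the learnable per-rule projection matrices close to their knowledge-graph-derived initializations. Since the formula image is not reproduced here, the following is a minimal sketch of that term under the natural reading of the claim; the function name and argument order are assumptions.

```python
import numpy as np

def projection_regularization(P_vis, P_txt, P_vis_init, P_txt_init):
    """Frobenius-norm penalty tying the learnable projection matrices
    of one rule to their initializations, so that optimization does not
    drift far from the knowledge-graph-derived semantic subspace."""
    return (np.linalg.norm(P_vis - P_vis_init, ord="fro") ** 2
            + np.linalg.norm(P_txt - P_txt_init, ord="fro") ** 2)
```

Added to the composite loss with a small coefficient, this term acts like an L2 anchor: the matrices remain trainable, but large deviations from the initialization are penalized quadratically.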
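Claim 7's adaptive weight mechanism mixes a static, expert-scored importance (passed through a sigmoid) with a smoothed dynamic violation intensity (scaled and passed through tanh), maintained by a momentum update over training batches. The original formula images are missing, so this is a hedged sketch of that mechanism; every name and default value is an illustrative assumption.

```python
import math

def adaptive_rule_weight(s_k, v_smooth, gamma=1.0, alpha=0.5, eta=0.1):
    """Adaptive penalty weight for one rule: gamma scales the whole
    weight, alpha in [0, 1] balances static vs dynamic contributions."""
    static = 1.0 / (1.0 + math.exp(-s_k))   # sigmoid: expert score -> (0, 1)
    dynamic = math.tanh(eta * v_smooth)     # non-negative input -> [0, 1)
    return gamma * (alpha * static + (1.0 - alpha) * dynamic)

def update_smooth_violation(v_prev, batch_intensities, mu=0.95):
    """Momentum update of the smoothed violation intensity over a batch,
    damping batch-to-batch fluctuation (mu in [0.9, 1))."""
    batch_mean = sum(batch_intensities) / len(batch_intensities)
    return mu * v_prev + (1.0 - mu) * batch_mean
```

Because both the sigmoid and the tanh terms are bounded, the resulting weight stays within [0, gamma], which keeps a frequently violated rule from dominating the total loss.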

Description

Technical Field

The invention relates to the technical field of industrial artificial intelligence, and in particular to an industrial cognitive base system based on multi-modal contrast learning and an execution method.

Background

With the increasing level of industrial intelligence, industrial sites have deployed large numbers of heterogeneous data acquisition devices, producing multi-modal industrial data including visual images, text records, time-series data and structured knowledge. In the prior art, industrial artificial intelligence systems mainly adopt single-modal or simple multi-modal fusion methods, extracting features from each modality with deep learning models and then training them jointly. However, this approach has an obvious limitation: when processing industrial multi-modal data, such models typically focus on data-driven feature extraction and pattern recognition, and lack an effective mechanism for systematically integrating prior industrial knowledge, such as domain expert experience, equipment operating principles and process specifications, into the model learning process. As a result, the model's reasoning in actual industrial scenarios may violate basic industrial logic and physical laws, and the reliability and credibility of its output struggle to meet the strict safety and accuracy requirements of industrial applications. This defect severely restricts the in-depth application and popularization of artificial intelligence technology in critical industrial fields.

Disclosure of Invention

In view of this, the present invention provides an industrial cognitive base system based on multi-modal contrast learning.
One or more embodiments of the present disclosure are also directed to an industrial cognitive base method based on multi-modal contrast learning, to address the technical deficiencies of the prior art.

According to a first aspect of the present invention, there is provided an industrial cognitive base system based on multi-modal contrast learning, comprising: a data acquisition and preprocessing module for synchronously acquiring data of a text modality, a visual modality, a time-series modality and a structured-knowledge modality from a multi-source system at an industrial site, and generating standardized sample triplets after alignment, wherein each sample triplet comprises an image, a text description and a candidate identifier set pointing to core entities in a preset target knowledge graph, the candidate identifier set being obtained from the aligned text description by named entity recognition and preliminary entity linking; a knowledge graph management module for storing and maintaining the target knowledge graph, performing semantic analysis and entity linking according to the text description and identifier set in the sample triplet, and extracting from the stored target knowledge graph an associated sub-graph relevant to the current context and rich in semantic relations; a cross-modal encoding and fusion module for extracting visual features from the image, text features from the text description and knowledge features from the associated sub-graph, fusing the knowledge features with the text features to generate knowledge-enhanced text features, and mapping the knowledge-enhanced text features and the visual features to vector representations in a unified semantic space; a contrast learning training module for calculating a contrast loss and an industrial semantic constraint loss using the vector representations, and optimizing model parameters based on a composite loss function to obtain an optimized industrial cognitive base model; an online continuous learning module for monitoring a stream of new sample data, detecting semantic drift, and triggering incremental model updating when the semantic drift exceeds a threshold; and a model service module for packaging the optimized industrial cognitive base model and providing embedding computation, similarity query and model update interfaces.

In some embodiments, the data acquisition and preprocessing module is specifically configured to align the visual frames, the time-series segments and the corresponding text descriptions based on time stamps and device identifiers, and to link the industrial terms in the text to knowledge-graph nodes through named entity recognition and entity linking techniques.

In some embodiments, the target knowledge graph stored by the knowledge graph management module includes a device knowledge graph, a process parameter library and a historical fault case library, supporting SPARQL queries and sub-graph extraction.

In some embodiments, the cross-modal encoding and fusion module comprises a visual encoder, a text encoder and a knowledge injection module, wherein the visual encoder adopts a Vision Transformer network