US-20260127855-A1 - MULTIMODAL EMISSIONS MODEL

US 20260127855 A1

Abstract

Emission images and emission records are obtained and paired to obtain image-record pairs. A subset of the image-record pairs is selected as a training dataset. For each image-record pair of the training dataset, a tensor embedding of the image is obtained from a vision encoder of a multi-modal large language model (M-LLM). Further, a tensor embedding of the record is obtained from a transformer-based text encoder of the M-LLM. A contrastive learning engine transforms the tensor embeddings into a common embedding space. Pairs are matched using the image tensor embeddings and record tensor embeddings in the common embedding space to obtain a set of matched image-record pairs and a set of mismatched image-record pairs. The M-LLM is fine-tuned with the set of matched image-record pairs and the set of mismatched image-record pairs.

Inventors

  • Sunil Manikani
  • Gian-Marcio Gey
  • Harshada Shirish Modak

Assignees

  • SCHLUMBERGER TECHNOLOGY CORPORATION

Dates

Publication Date
2026-05-07
Application Date
2025-10-31
Priority Date
2024-11-01

Claims (20)

  1. A method comprising: obtaining a plurality of emission images and a plurality of emission records; pairing emission records of the plurality of emission records with emission images of the plurality of emission images to obtain a plurality of image-record pairs; selecting, from the plurality of image-record pairs, a subset of the plurality of image-record pairs as a training dataset; for each image-record pair of the training dataset, obtaining, from a vision encoder of a multi-modal large language model (M-LLM), a first tensor embedding corresponding to an emission image of the image-record pair, obtaining, from a transformer-based text encoder of the M-LLM, a second tensor embedding corresponding to an emission record of the image-record pair, and transforming, by a contrastive learning engine, the first tensor embedding and the second tensor embedding into a common embedding space; matching pairs using the first tensor embedding and the second tensor embedding in the common embedding space to obtain a set of matched image-record pairs and a set of mismatched image-record pairs; and fine-tuning the M-LLM with the set of matched image-record pairs and the set of mismatched image-record pairs.
  2. The method of claim 1, further comprising: calculating, by the contrastive learning engine, a plurality of similarity scores between the set of matched image-record pairs and the set of mismatched image-record pairs; and applying, by the contrastive learning engine, a contrastive learning function to the plurality of similarity scores to compute a contrastive loss function value, wherein fine-tuning the M-LLM is by backpropagating a gradient of the contrastive loss function value through a plurality of neural network layers of the M-LLM.
  3. The method of claim 1, wherein the M-LLM performs operations comprising: extracting, by the vision encoder, a plurality of features from an emission image; and generating a feature vector of the plurality of features as a tensor embedding of the emission image.
  4. The method of claim 1, further comprising: obtaining, from the M-LLM, a first plurality of tensor embeddings corresponding to the plurality of emission images; storing the first plurality of tensor embeddings in a data repository; generating a second plurality of tensor embeddings, corresponding to the plurality of emission records; and storing the second plurality of tensor embeddings in the data repository, wherein the second plurality of tensor embeddings is paired with the first plurality of tensor embeddings according to the plurality of image-record pairs.
  5. The method of claim 1, further comprising: configuring the vision encoder of the M-LLM by disabling a plurality of parameters of neural network layers of the vision encoder from being updated during fine-tuning, wherein the plurality of parameters comprises weights and biases of the neural network layers of the vision encoder.
  6. The method of claim 1, further comprising: updating a plurality of parameters of a plurality of neural network layers of the transformer-based text encoder of the M-LLM, by backpropagating a gradient of the contrastive loss function value through the plurality of neural network layers of the transformer-based text encoder, wherein the plurality of parameters is not disabled from being updated.
  7. The method of claim 1, further comprising: receiving a new emission image from a client application; generating, by the vision encoder of the fine-tuned M-LLM, a new image embedding corresponding to the new emission image, wherein the new image embedding is a tensor embedding; and comparing the new image embedding to a plurality of image embeddings in a data repository, wherein the plurality of image embeddings are tensor embeddings corresponding to the plurality of emission images.
  8. The method of claim 7, further comprising: identifying a set of image embeddings of the plurality of image embeddings, wherein the set of image embeddings each satisfies a similarity threshold with respect to the new image embedding; selecting a set of emission record embeddings, wherein the set of emission record embeddings are paired with the set of image embeddings; and transmitting the set of emission record embeddings to the M-LLM.
  9. The method of claim 8, further comprising: generating, by the M-LLM, using the set of emission record embeddings, a natural language summary of emission records corresponding to the set of emission record embeddings; and transmitting the natural language summary to the client application.
  10. The method of claim 1, wherein the plurality of emission images is obtained from an emission image history in a data repository.
  11. A system, comprising: at least one computer processor; a multimodal large language model (M-LLM), executing on the at least one computer processor; and a multimodal training application, executing on the at least one computer processor and configured for performing operations comprising: obtaining a plurality of emission images and a plurality of emission records from a data repository, pairing emission records of the plurality of emission records with emission images of the plurality of emission images to obtain a plurality of image-record pairs, selecting, from the plurality of image-record pairs, a subset of the plurality of image-record pairs as a training dataset; for each image-record pair of the training dataset, obtaining, from a vision encoder of the M-LLM, a first tensor embedding corresponding to an emission image of the image-record pair, obtaining, from a transformer-based text encoder of the M-LLM, a second tensor embedding corresponding to an emission record of the image-record pair, and transforming, by a contrastive learning engine of the training application, the first tensor embedding and the second tensor embedding into a common embedding space; matching pairs using the first tensor embedding and the second tensor embedding in the common embedding space to obtain a set of matched image-record pairs and a set of mismatched image-record pairs; and fine-tuning the M-LLM with the set of matched image-record pairs and the set of mismatched image-record pairs.
  12. The system of claim 11, wherein the M-LLM performs operations comprising: extracting, by the vision encoder, a plurality of features from an emission image; and generating a feature vector of the plurality of features as a tensor embedding of the emission image.
  13. The system of claim 11, wherein the operations further comprise: obtaining, from the M-LLM, a first plurality of tensor embeddings corresponding to the plurality of emission images; storing the first plurality of tensor embeddings in a data repository; generating a second plurality of tensor embeddings, corresponding to the plurality of emission records; and storing the second plurality of tensor embeddings in the data repository, wherein the second plurality of tensor embeddings is paired with the first plurality of tensor embeddings according to the plurality of image-record pairs.
  14. The system of claim 11, wherein the operations further comprise: configuring a vision encoder of the M-LLM by disabling a plurality of parameters of neural network layers of the vision encoder from being updated during fine-tuning, wherein the plurality of parameters comprises weights and biases of the neural network layers of the vision encoder.
  15. The system of claim 11, wherein the operations further comprise: updating a plurality of parameters of a plurality of neural network layers of the M-LLM by backpropagating a gradient of the contrastive loss function value through the plurality of neural network layers of the M-LLM, wherein the plurality of parameters is not disabled from being updated.
  16. The system of claim 11, wherein the operations further comprise: receiving a new emission image from a client application; generating, by the vision encoder of the fine-tuned M-LLM, a new image embedding corresponding to the new emission image, wherein the new image embedding is a tensor embedding; and comparing the new image embedding to a plurality of image embeddings in the data repository, wherein the plurality of image embeddings are tensor embeddings corresponding to the plurality of emission images.
  17. The system of claim 16, wherein the operations further comprise: identifying a set of image embeddings of the plurality of image embeddings, wherein the set of image embeddings each satisfies a similarity threshold with respect to the new image embedding; selecting a set of emission record embeddings, wherein the set of emission record embeddings are paired with the set of image embeddings; and transmitting the set of emission record embeddings to the M-LLM.
  18. A non-transitory computer readable medium comprising computer readable program code for causing a computer system to perform operations comprising: obtaining a plurality of emission images and a plurality of emission records; pairing emission records of the plurality of emission records with emission images of the plurality of emission images to obtain a plurality of image-record pairs; selecting, from the plurality of image-record pairs, a subset of the plurality of image-record pairs as a training dataset; for each image-record pair of the training dataset, obtaining, from a vision encoder of a multi-modal large language model (M-LLM), a first tensor embedding corresponding to an emission image of the image-record pair, obtaining, from a transformer-based text encoder of the M-LLM, a second tensor embedding corresponding to an emission record of the image-record pair, and transforming, by a contrastive learning engine, the first tensor embedding and the second tensor embedding into a common embedding space; matching pairs using the first tensor embedding and the second tensor embedding in the common embedding space to obtain a set of matched image-record pairs and a set of mismatched image-record pairs; and fine-tuning the M-LLM with the set of matched image-record pairs and the set of mismatched image-record pairs.
  19. The non-transitory computer readable medium of claim 18, wherein the operations further comprise: receiving a new emission image from a client application; generating, by a vision encoder of the fine-tuned M-LLM, a new image embedding corresponding to the new emission image, wherein the new image embedding is a tensor embedding; comparing the new image embedding to a plurality of image embeddings in a data repository, wherein the plurality of image embeddings are tensor embeddings corresponding to the plurality of emission images; and identifying a set of image embeddings of the plurality of image embeddings, wherein the set of image embeddings each satisfies a similarity threshold with respect to the new image embedding.
  20. The non-transitory computer readable medium of claim 19, wherein the operations further comprise: selecting a set of emission record embeddings from the data repository, wherein the set of emission record embeddings are paired with the set of image embeddings; generating, by the fine-tuned M-LLM, using the set of emission record embeddings, a natural language summary of emission records corresponding to the set of emission record embeddings; and transmitting the natural language summary to the client application.
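The patent publishes no source code, but the training steps of claims 1 and 2 (similarity scores over matched and mismatched pairs, a contrastive loss, gradients backpropagated through the model) follow the general shape of CLIP-style contrastive learning. The following is an illustrative sketch only, not the claimed implementation: a symmetric contrastive (InfoNCE-style) loss in which the diagonal of a batch similarity matrix holds the matched image-record pairs and the off-diagonal entries serve as the mismatched pairs. The function names, temperature value, and embedding sizes are assumptions for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(image_emb, record_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of image-record pairs.

    Row i of each matrix is one image-record pair: the diagonal of the
    similarity matrix holds the matched pairs, and the off-diagonal entries
    act as the mismatched pairs formed within the batch.
    """
    img = l2_normalize(image_emb)
    rec = l2_normalize(record_emb)
    logits = img @ rec.T / temperature      # plurality of similarity scores
    labels = np.arange(len(logits))         # index of the matched record per image

    def cross_entropy(lg, lb):
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the image->record and record->image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy example: 4 pairs of 8-dimensional embeddings, with each record
# embedding lying close to its paired image embedding.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
rec = img + 0.05 * rng.normal(size=(4, 8))
print(contrastive_loss(img, rec))   # low loss when matched pairs align
```

In an actual fine-tuning loop, the gradient of this loss value would be backpropagated through the text encoder layers while the vision encoder parameters are frozen, per claims 5 and 6; that machinery is omitted here.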

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of India application No. 202411083722, filed in India on Nov. 1, 2024, which is incorporated herein by reference.

BACKGROUND

Oil and gas production and downstream ecosystems face ongoing challenges in detecting and fixing emissions and leaks. Emissions and leaks may lead to operational inefficiencies, environmental damage, and financial losses. Emission and leak detection may entail manual labor, fixed sensors, or conventional image processing. Current detection methodologies may be unsuited to scaling up to real-time, automatic leak detection and resolution.

SUMMARY

In general, emission images and emission records are obtained and paired to obtain image-record pairs. A subset of the image-record pairs is selected as a training dataset. For each image-record pair of the training dataset, a tensor embedding of the image is obtained from a vision encoder of a multi-modal large language model (M-LLM). Further, a tensor embedding of the record is obtained from a transformer-based text encoder of the M-LLM. A contrastive learning engine transforms the tensor embeddings into a common embedding space. Pairs are matched using the image tensor embeddings and record tensor embeddings in the common embedding space to obtain a set of matched image-record pairs and a set of mismatched image-record pairs. The M-LLM is fine-tuned with the set of matched image-record pairs and the set of mismatched image-record pairs.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 shows a computing system, in accordance with one or more embodiments. FIG. 2 shows a flowchart of a method, in accordance with one or more embodiments. FIG. 3 shows an example workflow, in accordance with one or more embodiments. FIG. 4 shows an example implementation, in accordance with one or more embodiments. FIG. 5.1 and FIG. 5.2 show a computing system, in accordance with one or more embodiments.
Like elements in the various figures are denoted by like reference numerals for consistency.

DETAILED DESCRIPTION

One or more embodiments are directed to fine-tuning a multimodal large language model (M-LLM) to process two data modalities: a hyperspectral emission image modality and a data serialization language modality. The hyperspectral emission image modality may be used for emission images. The data serialization language modality may be used for emission records. The data serialization language may be any language that supports data serialization, such as a markup language. For example, the data serialization language may be YAML (originally "Yet Another Markup Language," now "YAML Ain't Markup Language"). In another example, the data serialization language may be JavaScript Object Notation (JSON).

The emission records may be metadata descriptions of emission leaks. An emission record may be considered a log of an emission leak and remediation event. The emission record may include details such as the timestamp when the leak was detected or identified, the nature of the leak, the personnel who fixed the leak, how the leak was resolved, and other contextual information.

The M-LLM is fine-tuned on hyperspectral emission images paired with corresponding emission records. The outcome of fine-tuning the M-LLM is that the M-LLM learns the relationship between the visual data (i.e., in the emission images) and the textual descriptions (i.e., in the emission records). At runtime, or in the inferencing phase, the M-LLM may analyze new, previously "unseen" hyperspectral images captured in real time at a facility, and respond with the most similar, or likely associated, emission record(s). The emission record(s) may provide baseline information about how similar emissions were handled in the past.
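The runtime flow just described (embed a new image, compare it against stored image embeddings, and return the emission records paired with the closest matches) can be sketched as a cosine-similarity lookup. This is an illustrative sketch under stated assumptions, not the patented system: the record field names, the threshold value, and the use of raw numpy arrays as the data repository are all hypothetical choices for the example.

```python
import numpy as np

# Hypothetical emission records in a serializable form; these field names
# are illustrative and do not come from the patent.
EMISSION_RECORDS = [
    {"timestamp": "2024-03-01T08:15:00Z", "cause": "flange gasket failure", "fixed_by": "crew A"},
    {"timestamp": "2024-05-12T14:02:00Z", "cause": "compressor seal leak", "fixed_by": "crew B"},
    {"timestamp": "2024-07-30T21:47:00Z", "cause": "valve stem leak", "fixed_by": "crew A"},
]

def retrieve_records(new_image_emb, stored_image_embs, records, threshold=0.8):
    """Compare a new image embedding against stored image embeddings and
    return the records paired with every stored embedding whose cosine
    similarity meets the threshold, best match first."""
    q = new_image_emb / np.linalg.norm(new_image_emb)
    keys = stored_image_embs / np.linalg.norm(stored_image_embs, axis=1, keepdims=True)
    sims = keys @ q                          # cosine similarity per stored image
    hits = np.flatnonzero(sims >= threshold) # indices satisfying the threshold
    hits = hits[np.argsort(-sims[hits])]     # order by decreasing similarity
    return [(records[i], float(sims[i])) for i in hits]

# Toy repository: 3 stored image embeddings; the new image closely
# resembles stored image 2, so its paired record should come back first.
rng = np.random.default_rng(1)
stored = rng.normal(size=(3, 16))
new = stored[2] + 0.1 * rng.normal(size=16)
matches = retrieve_records(new, stored, EMISSION_RECORDS, threshold=0.8)
print(matches[0][0]["cause"])
```

In the described system, the retrieved record embeddings would then be passed to the fine-tuned M-LLM to generate a natural language summary for the client application; the lookup above only covers the similarity-threshold step.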
The baseline information, including personnel, likely cause, the nature of the fix, and other details, may thus streamline the resolution of the emission leak. Further, the baseline information may serve as a suggestion for fixing detected emissions, minimizing downtime and environmental impact.

The M-LLM is fine-tuned to process both images (e.g., hyperspectral emission images) and text in the data serialization language. The M-LLM may perform emission detection and resolution by processing the images and text. By associating specific visual cues from hyperspectral data with detailed text records, the system can identify the nature of an emission leak and how to address it. Thus, the M-LLM may be fine-tuned to analyze input images and, further, natural language descriptions of emission events. The fine-tuned M-LLM may respond with emission records that are similar to the natural language event descriptions or associated with the input images. The M-LLM may intercept live data from on-site cameras to continuously monitor for leaks. Additionally, the M-LLM may be deployed in a cloud-based framework for post-analysis of emission leak detection and repair. Attention is now