US-12620206-B2 - Training data synthesis for machine learning
Abstract
A method can include generating a plurality of synthetic objects and associated labels using a trained first machine learning system that is trained to generate a synthetic object based at least in part on a feature of a labeled object, an assigned label that represents the feature, and stochastic variation input; training a second machine learning model to predict labels for features of objects based at least in part on the plurality of synthetic objects and associated labels; and predicting a label for an unlabeled feature of an object using the second machine learning model.
Inventors
- Atul Laxman Katole
- Purnaprajna Raghavendra Mangsuli
- Hiren Maniar
Assignees
- SCHLUMBERGER TECHNOLOGY CORPORATION
Dates
- Publication Date
- 20260505
- Application Date
- 20220914
Claims (18)
- 1 . A method comprising: generating a plurality of synthetic objects and associated labels using a trained first machine learning system that is trained to generate a synthetic object based at least in part on a feature of a labeled object, an assigned label that represents the feature, and stochastic variation input, wherein the synthetic objects include a stochastic variation output, wherein the stochastic variation input comprises single channel images including uncorrelated Gaussian noise, and wherein the stochastic variation output comprises one or more of: erased or partially erased gridlines; width and intensity variations in the gridlines; noise on the gridlines; intensity variation in curves; and width variation in the curves; training a second machine learning model to predict labels for features of objects based at least in part on the plurality of synthetic objects and associated labels; and predicting a label for an unlabeled feature of an object using the second machine learning model.
- 2 . The method of claim 1 , further comprising: determining an uncertainty associated with labeling the unlabeled feature in the object using the trained second machine learning model; determining that the uncertainty is greater than a predetermined value; in response to determining that the uncertainty is greater than the predetermined value, soliciting an input of one or more training pairs of objects having the unlabeled feature of the object and an assigned label associated therewith; generating a new plurality of synthetic objects and associated labels using the trained first machine learning system and the one or more training pairs of objects; and training the second machine learning model to predict labels for features of objects based at least in part on the plurality of new synthetic objects and associated labels.
- 3 . The method of claim 1 , wherein the labeled object comprises a well log, wherein the feature comprises one or more of a header section, a depth track, and a plot segment, and wherein training the first machine learning model comprises training the first machine learning model to generate synthetic objects having variations of the one or more of the header section, the depth track, and the plot segment based on the labeled object.
- 4 . The method of claim 3 , wherein the variations include different relative locations for one or more of the header section, the depth track, and the plot segment in the individual synthetic objects.
- 5 . The method of claim 1 , wherein the labeled object comprises a plot of a well log or a seismic survey log, wherein the feature comprises one or more of a curve shape, a number of curves, a range of values, and a line style, and wherein training the first machine learning model comprises training the first machine learning model to generate synthetic objects having variations of the one or more of the curve shape, the number of curves, the range of values, and the line style.
- 6 . The method of claim 1 , wherein the labeled object comprises a header section of a well log, wherein the feature comprises a line style, units, or a scale in the header section, and wherein training the first machine learning model comprises training the first machine learning model to generate synthetic objects having variations of the one or more of the line style, the units, or the scale.
- 7 . The method of claim 6 , wherein the variations include different relative locations for display of the line style, the units, or the scale in the individual synthetic objects.
- 8 . The method of claim 1 , wherein the labeled object comprises a natural language search query, wherein the feature comprises one or more of a country, a state, an operator identity, and a field need, and wherein training the first machine learning model comprises training the first machine learning model to generate synthetic objects having variations of the one or more of the country, the state, the operator, and the field need, and wherein training the second machine learning model comprises training the second machine learning model to label natural language search queries as database-specific language search queries.
- 9 . The method of claim 1 , wherein the first machine learning system comprises a generative adversarial network.
- 10 . The method of claim 9 , wherein the generative adversarial network comprises a generator and a discriminator.
- 11 . A non-transitory, computer-readable medium storing instructions that, when executed by at least one processor of a computing system, cause the computing system to perform operations, the operations comprising: generating a plurality of synthetic objects and associated labels using a trained first machine learning system that is trained to generate a synthetic object based at least in part on a feature of a labeled object, an assigned label that represents the feature, and stochastic variation input, wherein the synthetic objects include a stochastic variation output, wherein the stochastic variation input comprises single channel images including uncorrelated Gaussian noise, and wherein the stochastic variation output comprises one or more of: erased or partially erased gridlines; width and intensity variations in the gridlines; noise on the gridlines; intensity variation in curves; and width variation in the curves; training a second machine learning model to predict labels for features of objects based at least in part on the plurality of synthetic objects and associated labels; and predicting a label for an unlabeled feature of an object using the second machine learning model.
- 12 . The medium of claim 11 , wherein the operations further comprise: determining an uncertainty associated with labeling the unlabeled feature in the object using the trained second machine learning model; determining that the uncertainty is greater than a predetermined value; in response to determining that the uncertainty is greater than the predetermined value, soliciting an input of one or more training pairs of objects having the feature of the unlabeled object and an assigned label associated therewith; generating a new plurality of synthetic objects and associated labels using the trained first machine learning system and the one or more training pairs of objects; and training the second machine learning model to predict labels for features of objects based at least in part on the plurality of new synthetic objects and associated labels.
- 13 . The medium of claim 11 , wherein the labeled object comprises a well log, wherein the feature comprises one or more of a header section, a depth track, and a plot segment, and wherein training the first machine learning model comprises training the first machine learning model to generate synthetic objects having variations of the one or more of the header section, the depth track, and the plot segment based on the labeled object.
- 14 . The medium of claim 11 , wherein the labeled object comprises a plot of a well log or a seismic survey log, wherein the feature comprises one or more of a curve shape, a number of curves, a range of values, and a line style, and wherein training the first machine learning model comprises training the first machine learning model to generate synthetic objects having variations of the one or more of the curve shape, the number of curves, the range of values, and the line style.
- 15 . The medium of claim 11 , wherein the labeled object comprises a header section of a well log, wherein the feature comprises a line style, units, or a scale in the header section, wherein training the first machine learning model comprises training the first machine learning model to generate synthetic objects having variations of the one or more of the line style, the units, and the scale, and wherein the variations include different relative locations for display of the line style, the units, and the scale in the individual synthetic objects.
- 16 . The medium of claim 11 , wherein the labeled object comprises a natural language search query, wherein the feature comprises one or more of a country, a state, an operator identity, and a field need, and wherein training the first machine learning model comprises training the first machine learning model to generate synthetic objects having variations of the one or more of the country, the state, the operator, and the field need, and wherein training the second machine learning model comprises training the second machine learning model to label natural language search queries as database-specific language search queries.
- 17 . A computing system, comprising: one or more processors; and a memory system including one or more non-transitory computer-readable media storing instructions that, when executed by at least one of the one or more processors, cause the computing system to perform operations, the operations comprising: generating a plurality of synthetic objects and associated labels using a trained first machine learning system that is trained to generate a synthetic object based at least in part on a feature of a labeled object, an assigned label that represents the feature, and stochastic variation input, wherein the synthetic objects include a stochastic variation output, wherein the stochastic variation input comprises single channel images including uncorrelated Gaussian noise, and wherein the stochastic variation output comprises one or more of: erased or partially erased gridlines; width and intensity variations in the gridlines; noise on the gridlines; intensity variation in curves; and width variation in the curves; training a second machine learning model to predict labels for features of objects based at least in part on the plurality of synthetic objects and associated labels; and predicting a label for an unlabeled feature of an object using the second machine learning model.
- 18 . The computing system of claim 17 , wherein the operations further comprise: determining an uncertainty associated with labeling the unlabeled feature in the object using the trained second machine learning model; determining that the uncertainty is greater than a predetermined value; in response to determining that the uncertainty is greater than the predetermined value, soliciting an input of one or more training pairs of objects having the unlabeled feature of the object and an assigned label associated therewith; generating a new plurality of synthetic objects and associated labels using the trained first machine learning system and the one or more training pairs of objects; and training the second machine learning model to predict labels for features of objects based at least in part on the plurality of new synthetic objects and associated labels.
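Two elements recited in the claims above lend themselves to a short illustration: the stochastic variation input (single channel images of uncorrelated Gaussian noise) and the uncertainty check that triggers a request for additional labeled training pairs. The following is a minimal sketch only. The function names, the use of Shannon entropy as the uncertainty measure, and the threshold value are assumptions made for illustration; the claims do not specify how the uncertainty is computed.

```python
import math
import random


def stochastic_variation_input(height, width, seed=None):
    """Return one single-channel image (rows of floats) of Gaussian noise.

    Each pixel is drawn independently from N(0, 1), so pixels are
    uncorrelated with one another.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector.

    Assumption: entropy stands in for the 'uncertainty' of a predicted label.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_new_labels(probs, threshold):
    """True when the uncertainty is greater than the predetermined value."""
    return predictive_entropy(probs) > threshold


# Hypothetical usage: one noise image and two example predictions.
noise = stochastic_variation_input(8, 8, seed=0)
confident = [0.98, 0.01, 0.01]   # low entropy, label accepted
uncertain = [0.40, 0.35, 0.25]   # high entropy, solicit new training pairs
```

In this sketch, a prediction flagged by `needs_new_labels` would be routed to a user for labeling, and the resulting pairs would be fed back to the first machine learning system to generate a new plurality of synthetic objects.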
Description
RELATED APPLICATIONS
This application is a National Stage Entry of International Patent Application No. PCT/US2022/043485, filed 14 Sep. 2022, which claims priority to and the benefit of a U.S. Provisional Application having Ser. No. 63/261,156, filed 14 Sep. 2021, which is incorporated herein by reference in its entirety.
BACKGROUND
Machine learning models may be trained using training data, with the accuracy of the models generally proportional to the quantity and quality of the training data provided. The training data may be provided as “pairs”, including the raw data (e.g., an image or another object) and one or more labels that the raw data represents. These pairs are employed to form “connections” within the model, and eventually the model may be able to predict a label associated with new data, based on the data itself.
Generally, the data are provided to a machine learning model from manually labeled data sets, which is time intensive. Unsupervised learning methods also exist, but without manual labels to train the machine learning model, unsupervised techniques tend to involve clustering algorithms, which may demand model refinements to provide meaningful clusters.
Various machine learning models find use in computer graphics. In the computer graphics field, a raster graphics or bitmap image is a dot matrix data structure that represents a generally rectangular grid of pixels (points of color, grayscale, black and white), viewable via a bitmapped display (monitor). Raster images can be stored in image files with varying dissemination, production, generation, and acquisition formats. Common pixel formats include monochrome, grayscale, palettized, and full color, where color depth determines the fidelity of the colors represented and color space determines the range of color coverage, which may be less than the full range of human color vision.
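The pixel formats named above can be made concrete with a small sketch. It models each format as a rectangular grid of rows, using only the Python standard library. The image size, the three-entry palette, and the helper names `expand_palette` and `bits_per_pixel` are hypothetical choices made for illustration and do not come from the source.

```python
HEIGHT, WIDTH = 4, 6

# Monochrome: each pixel is either off (0) or on (1).
monochrome = [[0] * WIDTH for _ in range(HEIGHT)]
monochrome[1][2] = 1

# Grayscale: each pixel holds one intensity, here 8 bits (0 to 255).
grayscale = [[0] * WIDTH for _ in range(HEIGHT)]
grayscale[1][2] = 200

# Palettized: each pixel holds an index into a small table of colors.
PALETTE = [(0, 0, 0), (255, 0, 0), (255, 255, 255)]
indices = [[0] * WIDTH for _ in range(HEIGHT)]
indices[1][2] = 1


def expand_palette(index_image, palette):
    """Convert a palettized image into full color (R, G, B) tuples."""
    return [[palette[i] for i in row] for row in index_image]


def bits_per_pixel(channels, bits_per_channel):
    """Color depth: the number of bits used to describe one pixel."""
    return channels * bits_per_channel


# Full color: three channels per pixel.
full_color = expand_palette(indices, PALETTE)
depth = bits_per_pixel(channels=3, bits_per_channel=8)
```

A scanned well log stored in any of these formats carries only pixel values, not the property values or depths that a plotted curve represents.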
Raster images of seismic data and well logs may include segments such as log header segments, curve segments, tables, text blocks, graphs, and/or other segments. Curve segments can represent petrophysical properties of rocks and their contained fluids in the form of graphs, as may be based on sensed data from one or more sensors. Values and meaning of curve segments are generally recognizable using information provided by a log header, text blocks, and other segments.
A “legacy” raster image of seismic data may include images generated prior to digital data acquisition techniques. A legacy raster image may be a scanned image saved as a computer image file. Image files may adequately depict the non-digital log data such that a human user can review and understand the information collected; however, the files may not include the digital data represented by the curve, e.g., the values for the properties and depths that the curve represents.
A machine learning model can be trained to extract information from raster images using training pairs of raster images and labels. However, again, the labeling process is time intensive. Thousands of pairs may be needed to adequately train a model, particularly where images are in a variety of formats. Further, “noise” may be present in scans of images (e.g., artifacts such as smudges that do not contain data that is represented by the curve), which can call for ever-larger training data sets to adequately train a machine learning model to handle such noise.
Another area where machine learning is applied is natural language processing. In particular, a machine learning model may be trained to interpret a natural language query from a user and predict the syntax that is associated with this natural language query for database searching, to name one specific example. Natural language queries may be difficult to predict, as different users may employ different words in different orders.
Moreover, especially in the context of oilfield environments, connections may be made between different types of data that may not be included in the natural language queries, but may assist in providing useful results.
SUMMARY
A method can include generating a plurality of synthetic objects and associated labels using a trained first machine learning system that is trained to generate a synthetic object based at least in part on a feature of a labeled object, an assigned label that represents the feature, and stochastic variation input; training a second machine learning model to predict labels for features of objects based at least in part on the plurality of synthetic objects and associated labels; and predicting a label for an unlabeled feature of an object using the second machine learning model.
A non-transitory, computer-readable medium storing instructions that, when executed by at least one processor of a computing system, can cause the computing system to perform operations, where the operations can include: generating a plurality of synthetic objects and associated labels using a trained first machine learning system that is trained to generate a sy