US-20260127486-A1 - SYNTHETIC DATA TRANSPARENCY
Abstract
This disclosure describes techniques for collecting and storing data about how synthetic data is created by a data generation model. In one example, this disclosure describes a method that includes generating, by a computing system and based on a source dataset, a plurality of synthetic data items; storing, by the computing system, metadata about how the plurality of synthetic data items were generated; outputting, by the computing system, a user interface presenting information about the plurality of synthetic data items; detecting, by the computing system and based on interactions with the user interface, a request to present information about one or more specific synthetic data items included in the plurality of synthetic data items; and outputting, by the computing system based on the metadata and responsive to the request, an updated user interface presenting information about how the one or more specific synthetic data items were generated.
Inventors
- Marco Arriaga
- Jasmine De Gaia
- Qian Cao
Assignees
- WELLS FARGO BANK, N.A.
Dates
- Publication Date: 20260507
- Application Date: 20241105
Claims (20)
- 1 . A method comprising: generating, by a computing system and based on a source dataset, a plurality of synthetic data items; storing, by the computing system, metadata about how the plurality of synthetic data items were generated; outputting, by the computing system, a user interface presenting information about the plurality of synthetic data items; detecting, by the computing system and based on interactions with the user interface, a request to present information about one or more specific synthetic data items included in the plurality of synthetic data items; and outputting, by the computing system based on the metadata and responsive to the request, an updated user interface presenting information about how the one or more specific synthetic data items were generated.
- 2 . The method of claim 1 , wherein the one or more specific synthetic data items is one specific synthetic data item, and wherein detecting the request to present information about the one specific synthetic data item includes: detecting interactions with a listing of at least some of the plurality of synthetic data items, including the one specific synthetic data item.
- 3 . The method of claim 2 , wherein outputting the updated user interface includes: outputting information identifying a model used to generate the one specific synthetic data item and information about the source dataset used by the model to generate the one specific synthetic data item.
- 4 . The method of claim 1 , wherein the one or more specific synthetic data items is a plurality of specific synthetic data items having a common attribute, and wherein detecting the request to present information about the plurality of specific synthetic data items includes: detecting interactions with a line in a graph presenting information about the plurality of synthetic data items.
- 5 . The method of claim 4 , wherein outputting the updated user interface includes: identifying, based on the interactions with the line in the graph and the metadata, the plurality of specific synthetic data items having the common attribute; and outputting information about at least some of the plurality of specific synthetic data items having the common attribute.
- 6 . The method of claim 1 , wherein generating the plurality of synthetic data items includes generating, based on a plurality of source datasets, the plurality of synthetic data items, each of the source datasets including a plurality of source data items; and wherein the user interface further presents information about the plurality of source data items.
- 7 . The method of claim 6 , wherein the updated user interface is a first updated user interface, and wherein the method further comprises: detecting, by the computing system and based on interactions with the user interface, a request to present information about a specific source data item included in the plurality of source data items; and outputting, by the computing system and responsive to the request to present information about the specific source data item, a second updated user interface presenting information about which of the plurality of source datasets includes the specific source data item.
- 8 . The method of claim 1 , wherein the updated user interface is a first updated user interface, and wherein the method further comprises: detecting, by the computing system and based on interactions with the user interface, a request to present information about one or more specific source data items having a common attribute; and outputting, by the computing system based on the metadata and responsive to the request to present information about the one or more specific source data items, a second updated user interface listing at least some of the one or more specific source data items having the common attribute.
- 9 . The method of claim 1 , wherein generating the plurality of synthetic data items includes: generating the plurality of synthetic data items using one or more neural networks.
- 10 . The method of claim 1 , further comprising: training, by the computing system, a machine learning model using the synthetic data; applying, by the computing system, the machine learning model to input data to make a prediction; and sending, by the computing system and based on the prediction, control signals to an external system, instructing the external system to perform an operation.
- 11 . A computing system comprising processing circuitry and a storage device, wherein the processing circuitry has access to the storage device and is configured to: generate, based on a source dataset, a plurality of synthetic data items; store metadata about how the plurality of synthetic data items were generated; output a user interface presenting information about the plurality of synthetic data items; detect, based on interactions with the user interface, a request to present information about one or more specific synthetic data items included in the plurality of synthetic data items; and output, based on the metadata and responsive to the request, an updated user interface presenting information about how the one or more specific synthetic data items were generated.
- 12 . The computing system of claim 11 , wherein the one or more specific synthetic data items is one specific synthetic data item, and wherein to detect the request to present information about the one specific synthetic data item, the processing circuitry is further configured to: detect interactions with a listing of at least some of the plurality of synthetic data items, including the one specific synthetic data item.
- 13 . The computing system of claim 12 , wherein to output the updated user interface, the processing circuitry is further configured to: output information identifying a model used to generate the one specific synthetic data item and information about the source dataset used by the model to generate the one specific synthetic data item.
- 14 . The computing system of claim 11 , wherein the one or more specific synthetic data items is a plurality of specific synthetic data items having a common attribute, and wherein to detect the request to present information about the plurality of specific synthetic data items, the processing circuitry is further configured to: detect interactions with a line in a graph presenting information about the plurality of synthetic data items.
- 15 . The computing system of claim 14 , wherein to output the updated user interface, the processing circuitry is further configured to: identify, based on the interactions with the line in the graph and the metadata, the plurality of specific synthetic data items having the common attribute; and output information about at least some of the plurality of specific synthetic data items having the common attribute.
- 16 . The computing system of claim 11 , wherein to generate the plurality of synthetic data items, the processing circuitry is further configured to generate, based on a plurality of source datasets, the plurality of synthetic data items, each of the source datasets including a plurality of source data items; and wherein the user interface further presents information about the plurality of source data items.
- 17 . The computing system of claim 16 , wherein the updated user interface is a first updated user interface, and wherein the processing circuitry is further configured to: detect, based on interactions with the user interface, a request to present information about a specific source data item included in the plurality of source data items; and output, responsive to the request to present information about the specific source data item, a second updated user interface presenting information about which of the plurality of source datasets includes the specific source data item.
- 18 . The computing system of claim 11 , wherein the updated user interface is a first updated user interface, and wherein the processing circuitry is further configured to: detect, based on interactions with the user interface, a request to present information about one or more specific source data items having a common attribute; and output, based on the metadata and responsive to the request to present information about the one or more specific source data items, a second updated user interface listing at least some of the one or more specific source data items having the common attribute.
- 19 . The computing system of claim 11 , wherein to generate the plurality of synthetic data items, the processing circuitry is further configured to: generate the plurality of synthetic data items using one or more neural networks.
- 20 . Non-transitory computer-readable media comprising instructions that, when executed, cause processing circuitry of a computing system to: generate, based on a source dataset, a plurality of synthetic data items; store metadata about how the plurality of synthetic data items were generated; output a user interface presenting information about the plurality of synthetic data items; detect, based on interactions with the user interface, a request to present information about one or more specific synthetic data items included in the plurality of synthetic data items; and output, based on the metadata and responsive to the request, an updated user interface presenting information about how the one or more specific synthetic data items were generated.
Description
TECHNICAL FIELD
This disclosure relates to data processing, and more specifically, to techniques for managing synthetic data generated by a model.
BACKGROUND
Synthetic data is artificially generated information that mimics real-world data and is generated using a variety of techniques that aim to replicate the statistical properties of the real-world data. For example, synthetic data can be generated using relatively simple and transparent methods, such as rules-based data generation systems. Increasingly, however, synthetic data is generated using more complicated and less transparent techniques, such as through neural networks. Once generated, synthetic data is used for a variety of purposes, including to train machine learning models.
SUMMARY
This disclosure describes techniques for collecting and storing information about how synthetic data is created, thereby enabling a data traceability or data lineage capability that creates transparency around the synthetic data generation process.
The disclosed techniques involve generating metadata about the process by which synthetic data is generated. The metadata identifies attributes of the source data used to generate synthetic data, the model or models used to generate the synthetic data, and other information about the process. Metadata could be generated by the model while synthetic data is created or generated at a different time based on information logged during creation of synthetic data. For example, state information for operations being performed at a record level by a model generating the synthetic data may be collected and logged, and then used to generate metadata.
As described herein, the metadata can be used as the basis for visualizations about the synthetic data (e.g., a chart, distribution, graph, or similar illustration), providing insights into how a given instance or set of instances of synthetic data were generated.
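The record-level logging described above can be pictured with a minimal sketch. It assumes a hypothetical log entry (`GenerationEvent`) and a hypothetical per-item metadata record (`LineageMetadata`); the disclosure does not specify a schema, so every class, field, and function name here is illustrative only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterable, List


@dataclass
class GenerationEvent:
    """Hypothetical record-level log entry captured during generation."""
    synthetic_id: str      # identifier of the synthetic data item produced
    model_name: str        # model that performed the operation
    source_dataset: str    # source dataset the operation drew on
    logged_at: datetime    # when the operation was logged


@dataclass
class LineageMetadata:
    """Hypothetical metadata derived from the log for one synthetic item."""
    synthetic_id: str
    model_names: List[str] = field(default_factory=list)
    source_datasets: List[str] = field(default_factory=list)


def build_metadata(events: Iterable[GenerationEvent]) -> Dict[str, LineageMetadata]:
    """Fold logged events into one metadata record per synthetic item."""
    metadata: Dict[str, LineageMetadata] = {}
    for event in events:
        entry = metadata.setdefault(
            event.synthetic_id, LineageMetadata(event.synthetic_id)
        )
        # Record each model and source dataset once, in first-seen order.
        if event.model_name not in entry.model_names:
            entry.model_names.append(event.model_name)
        if event.source_dataset not in entry.source_datasets:
            entry.source_datasets.append(event.source_dataset)
    return metadata


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    log = [
        GenerationEvent("syn-1", "model-a", "dataset-x", now),
        GenerationEvent("syn-1", "model-a", "dataset-y", now),
        GenerationEvent("syn-2", "model-b", "dataset-y", now),
    ]
    for item in build_metadata(log).values():
        print(item)
```

Deriving metadata from a log after the fact, as sketched here, corresponds to the "generated at a different time" option; a model could equally emit the same records while it generates data.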
In some examples, such a visualization might reveal that a given instance of synthetic data was generated using a specific model, from a specified set of data sources derived over an identified time frame. Visualizations may provide information about many other attributes of the synthetic data, the source data, and/or the models used to generate the synthetic data. Metadata, visualizations, and other information about the synthetic data may be used in various types of analyses, which may involve determining whether the synthetic data was generated appropriately, whether the generated synthetic data is suitable for being used for a particular purpose, or whether the synthetic data complies with third-party or regulatory requirements.
In some examples, this disclosure describes operations performed by a computing system in accordance with one or more aspects of this disclosure. In one specific example, this disclosure describes a method comprising generating, by a computing system and based on a source dataset, a plurality of synthetic data items; storing, by the computing system, metadata about how the plurality of synthetic data items were generated; outputting, by the computing system, a user interface presenting information about the plurality of synthetic data items; detecting, by the computing system and based on interactions with the user interface, a request to present information about one or more specific synthetic data items included in the plurality of synthetic data items; and outputting, by the computing system based on the metadata and responsive to the request, an updated user interface presenting information about how the one or more specific synthetic data items were generated.
In another example, this disclosure describes a system comprising a storage system and processing circuitry having access to the storage system, wherein the processing circuitry is configured to carry out operations described herein.
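The request-handling steps of the method described above can be sketched as two lookups over stored metadata: one for a single selected synthetic data item, and one for a group of items sharing a common attribute. This is only an illustration under stated assumptions; the dictionary layout, the keys `model`, `source_dataset`, and `time_frame`, and both function names are hypothetical and do not come from the disclosure.

```python
from typing import Dict, List, Optional

# Hypothetical metadata store keyed by synthetic item ID (assumed layout).
METADATA: Dict[str, Dict[str, str]] = {
    "syn-1": {"model": "model-a", "source_dataset": "dataset-x", "time_frame": "2023"},
    "syn-2": {"model": "model-b", "source_dataset": "dataset-y", "time_frame": "2024"},
    "syn-3": {"model": "model-a", "source_dataset": "dataset-y", "time_frame": "2024"},
}


def describe_item(
    metadata: Dict[str, Dict[str, str]], synthetic_id: str
) -> Optional[Dict[str, str]]:
    """Return how one synthetic item was generated, or None if it is unknown."""
    record = metadata.get(synthetic_id)
    if record is None:
        return None
    return {
        "item": synthetic_id,
        "model": record["model"],
        "source_dataset": record["source_dataset"],
    }


def items_with_attribute(
    metadata: Dict[str, Dict[str, str]], attribute: str, value: str
) -> List[str]:
    """Return sorted IDs of items whose metadata has attribute equal to value."""
    return sorted(
        item_id
        for item_id, record in metadata.items()
        if record.get(attribute) == value
    )


if __name__ == "__main__":
    # A request for one specific item, e.g. selected from a listing.
    print(describe_item(METADATA, "syn-1"))
    # A request for items sharing a common attribute, e.g. selected from a graph.
    print(items_with_attribute(METADATA, "model", "model-a"))
```

In a full system the returned values would feed the updated user interface; the sketch stops at the data lookup because the disclosure leaves the rendering unspecified at this point.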
In yet another example, this disclosure describes a computer-readable storage medium comprising instructions that, when executed, configure processing circuitry of a computing system to carry out operations described herein.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a conceptual diagram of a system in which synthetic data is generated, evaluated, and used to train a model, in accordance with one or more aspects of the present disclosure.
FIG. 2 is a block diagram of a system in which synthetic data is generated, evaluated, and used to train a model, in accordance with one or more aspects of the present disclosure.
FIG. 3A through FIG. 3F are conceptual diagrams illustrating example user interfaces presented by a user interface device, in accordance with one or more aspects of the present disclosure.
FIG. 4 is a flow diagram illustrating operations performed by an example computing system, in accordance with one or more aspects of the present disclosure.
Although each of these Figures is referenced herein in connection with the description of one or more specific examples, such examples are merely illustrativ