US-12619596-B2 - Automatic generation and maintenance of data cards for datasets

Abstract

In various examples, techniques for automatically generating and maintaining data cards for datasets are described herein. Systems and methods are disclosed that process a dataset in order to identify relevant information associated with the dataset. For example, the dataset may include and/or be associated with sources of information (such as files, documents, links, memos, research papers, annotations, labels, and/or the like) that describe data instances (e.g., images, audio clips, point clouds, etc.) included in the dataset. These sources of information may then be analyzed to retrieve the relevant information associated with the dataset. Systems and methods are then further disclosed that may use one or more language models to process input data associated with the relevant information in order to generate a data card associated with the dataset.
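
As a rough illustration of the flow summarized above, the following Python sketch assembles extracted dataset facts and a format template into input for a language model. It is a minimal sketch under stated assumptions, not the disclosed implementation: the names `CARD_TEMPLATE`, `build_prompt`, and `generate_data_card` are hypothetical, the template fields are invented for illustration, and the language model is represented by any text-in, text-out callable.

```python
from typing import Callable, Dict

# Hypothetical template; the source only states that a template may define the card format.
CARD_TEMPLATE = "Name:\nCollection method:\nNumber of instances:\nSensitive features:"


def build_prompt(dataset_info: Dict[str, str], template: str = CARD_TEMPLATE) -> str:
    """Combine extracted dataset facts and a format template into model input."""
    # Sort keys so the prompt is deterministic for a given set of facts.
    facts = "\n".join(f"- {key}: {value}" for key, value in sorted(dataset_info.items()))
    return (
        "Fill in the data card template using only the facts listed.\n\n"
        f"Facts:\n{facts}\n\nTemplate:\n{template}\n"
    )


def generate_data_card(dataset_info: Dict[str, str],
                       language_model: Callable[[str], str]) -> str:
    """Run any text-in, text-out model (an assumption) on the assembled prompt."""
    return language_model(build_prompt(dataset_info))
```

Passing the model as a callable keeps the sketch independent of any particular model API; a stub function can stand in for it during testing.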

Inventors

Kimberly Le Truong
Arun George Zachariah
Rajat Keshri
Michael Boone

Assignees

NVIDIA CORPORATION

Dates

Publication Date: 20260505
Application Date: 20240905

Claims (20)

1 . A method comprising: determining that a first version of a dataset has been updated to a second version of the dataset, the first version of the dataset being associated with a first data card that includes first information describing the first version of the dataset; based at least on the first version of the dataset being updated, determining whether one or more data instances that were accessed using one or more first links associated with the first version of the dataset are still accessible using one or more second links associated with the second version of the dataset; removing the one or more data instances from the second version of the dataset based at least on the one or more data instances not being accessible using the one or more second links associated with the second version of the dataset; after removing the one or more data instances, analyzing at least the second version of the dataset to identify second information describing at least one or more updates that occurred to the first version of the dataset to result in the second version of the dataset; generating, based at least on one or more language models processing input data that represents at least the second information, a second data card that includes at least the second information describing the at least the one or more updates to result in the second version of the dataset; and integrating the second data card with at least one machine learning pipeline for model development and deployment.
2 . The method of claim 1 , further comprising: determining, based at least on the second information, that at least one of one or more classes or one or more features has been added to the second version of the dataset, wherein the second data card includes a new data card from the first data card based at least on the at least one of the one or more classes or the one or more features being added to the second version of the dataset.
3 . The method of claim 1 , further comprising: determining, based at least on the second information, that at least one of one or more first classes or one or more first features from the first version of the dataset is similar to at least one of one or more second classes or one or more second features from the second version of the dataset, wherein the second data card includes an updated portion of the first data card.
4 . The method of claim 1 , further comprising: obtaining a template representing a format associated with the second data card, wherein: the input data is further representative of the template; and the second data card includes the second information arranged according to the format represented by the template.
5 . The method of claim 1 , further comprising: sending, to one or more user devices, a document that includes one or more queries associated with the second version of the dataset; and receiving, from the one or more user devices, third information that is related to the one or more queries, wherein the input data is further representative of the third information.
6 . The method of claim 1 , further comprising: determining, based at least on annotations associated with the second version of the dataset, whether one or more features associated with the second version of the dataset correspond to one or more protected classes, wherein the second data card further indicates whether the one or more features correspond to the one or more protected classes.
7 . The method of claim 6 , wherein the determining whether the one or more features correspond to the one or more protected classes comprises: comparing one or more first names represented by the annotations to one or more second names associated with the one or more protected classes; determining, based at least on the comparing, that the one or more first names are similar to the one or more second names; and determining that the one or more first names correspond to the one or more features.
8 . The method of claim 1 , wherein the first information from the first data card includes: how data instances associated with the first version of the dataset were collected; a size associated with the first version of the dataset; a number of the data instances associated with the first version of the dataset; a number of features associated with the first version of the dataset; a distribution associated with the features; whether one or more of the features are sensitive; or whether there is a bias associated with the first version of the dataset.
9 . A system comprising: one or more processors to: determine that a first version of a dataset has been updated to a second version of the dataset, the first version of the dataset being associated with a first data card that includes first information describing the first version of the dataset; based at least on the first version of the dataset being updated, analyze the second version of the dataset to identify second information that describes one or more updates to the first version of the dataset to result in the second version of the dataset; generate, based at least on one or more language models processing input data representative of the second information, output data associated with a second data card that includes at least a portion of the second information; and integrate the second data card with at least one machine learning pipeline for model development and deployment.
10 . The system of claim 9 , wherein the one or more processors are further to: obtain a template representing a format associated with the second data card, wherein: the input data is further representative of the template; and the second data card includes the at least the portion of the second information arranged according to the format represented by the template.
11 . The system of claim 9 , wherein the one or more processors are further to: send, to one or more user devices, a document that includes one or more queries associated with the second version of the dataset; and receive, from the one or more user devices, third information that is related to the one or more queries, wherein the input data is further representative of the third information.
12 . The system of claim 9 , wherein the one or more processors are further to: determine, based at least on one or more annotations associated with the second version of the dataset, whether one or more features associated with the second version of the dataset correspond to one or more protected classes, wherein the input data is further representative of the one or more features, and the second data card further indicates whether the one or more features correspond to the one or more protected classes.
13 . The system of claim 12 , wherein the determination of whether the one or more features correspond to the one or more protected classes comprises: comparing one or more first names represented by the one or more annotations to one or more second names associated with the one or more protected classes; determining, based at least on the comparing, that the one or more first names are similar to the one or more second names; and determining that the one or more first names correspond to the one or more features.
14 . The system of claim 12 , wherein the one or more processors are further to: determine, for at least a feature of the one or more features, a number of instances associated with one or more categories corresponding to the feature; determine a variance associated with the feature based at least on the number of instances; and determine whether there is bias associated with the feature based at least on the variance, wherein the second data card further indicates whether there is bias associated with the feature.
15 . The system of claim 9 , wherein the one or more processors are further to: determine, based at least on the second information, that at least one of one or more classes or one or more features has been added to the second version of the dataset, wherein the second data card includes a new data card based at least on the at least one of the one or more classes or the one or more features being added to the second version of the dataset.
16 . The system of claim 9 , wherein the one or more processors are further to: determine, based at least on the second information, that at least one of one or more first classes or one or more first features from the first version of the dataset is similar to at least one of one or more second classes or one or more second features from the second version of the dataset, wherein the second data card includes an updated portion of the first data card.
17 . The system of claim 9 , wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing one or more simulation operations; a system for performing one or more digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing one or more deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for performing one or more generative AI operations; a system for performing operations using one or more large language models (LLMs); a system for performing operations using one or more small language models (SLMs); a system for performing operations using one or more vision language models (VLMs); a system for performing operations using one or more multi-modal language models; a system for performing one or more conversational AI operations; a system for generating synthetic data; a system for presenting at least one of virtual reality content, augmented reality content, or mixed reality content; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
18 . The system of claim 9 , wherein to analyze the second version of the dataset to identify the second information that describes the dataset comprises: generate a first embedding associated with a feature represented by the second version of the dataset; generate a second embedding for a protected class; determine that a similarity score between the first embedding and the second embedding satisfies a threshold score; and generate the second information to describe that the feature is associated with the protected class.
19 . The system of claim 9 , wherein the one or more processors are further to: based at least on the first version of the dataset being updated, determine whether one or more data instances that were accessed using one or more first links associated with the first version of the dataset are still accessible using one or more second links associated with the second version of the dataset; and remove the one or more data instances from the second version of the dataset based at least on the one or more data instances not being accessible using the one or more second links associated with the second version of the dataset, wherein the second version of the dataset is analyzed after removing the one or more data instances.
20 . One or more processors comprising: processing circuitry to: determine that a first version of a dataset has been updated to a second version of the dataset, the first version of the dataset being associated with a data card that describes the first version of the dataset; based at least on the first version of the dataset being updated, analyze at least the second version of the dataset to identify first information associated with at least one or more updates that occurred to the first version of the dataset to result in the second version of the dataset; determine, based at least on the first information and the data card, that at least one of one or more first classes or one or more first features from the first version of the dataset includes at least one of one or more second classes or one or more second features from the second version of the dataset; determine, based at least on the first version of the dataset being updated and the at least one of the one or more first classes or the one or more first features including the at least one of the one or more second classes or the one or more second features, to update the data card; generate, based at least on one or more language models processing the first information, an updated data card that includes at least second information describing the one or more updates added to the data card; and integrate the updated data card with at least one machine learning pipeline for model development and deployment.
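
Three of the operations recited in the claims above lend themselves to a short illustration: pruning data instances whose links no longer resolve (claims 1 and 19), comparing annotation names against protected-class names (claims 7 and 13), and flagging bias from the variance of per-category instance counts (claim 14). The Python sketch below is one possible reading, not the claimed implementation. The protected-class list, both thresholds, the use of string similarity from `difflib`, and the choice to take the variance over category shares are all assumptions made for illustration; the function names are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import pvariance
from typing import Callable, Dict, Iterable

# Hypothetical list of protected-class names; the claims do not enumerate one.
PROTECTED_CLASS_NAMES = ["age", "gender", "race", "religion", "disability"]


def prune_inaccessible(links: Dict[str, str],
                       is_accessible: Callable[[str], bool]) -> Dict[str, str]:
    """Keep only data instances whose link still resolves.

    The accessibility check is injected so the sketch makes no network calls.
    """
    return {instance_id: url for instance_id, url in links.items() if is_accessible(url)}


def match_protected_features(feature_names: Iterable[str],
                             threshold: float = 0.8) -> Dict[str, str]:
    """Map each annotation name to the first sufficiently similar protected-class name."""
    matches = {}
    for name in feature_names:
        for protected in PROTECTED_CLASS_NAMES:
            # String similarity in [0, 1]; the 0.8 threshold is an assumed value.
            score = SequenceMatcher(None, name.lower(), protected).ratio()
            if score >= threshold:
                matches[name] = protected
                break
    return matches


def is_biased(category_counts: Dict[str, int],
              max_share_variance: float = 0.05) -> bool:
    """Flag a feature when the variance of its category shares exceeds an assumed limit."""
    total = sum(category_counts.values())
    if total == 0 or len(category_counts) < 2:
        return False
    shares = [count / total for count in category_counts.values()]
    return pvariance(shares) > max_share_variance
```

Claim 18 describes an embedding-based alternative to the name comparison shown here; the string-similarity approach was chosen only because it needs no model and keeps the sketch self-contained.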

Description

BACKGROUND

Datasets may be used for a wide variety of applications including, but not limited to, training machine learning models to perform one or more processing tasks. As such, various datasets may include different types of data that are specific to the applications of the datasets. For example, a dataset that is being used to train a machine learning model to perform object detection may include images of objects while another dataset that is being used to train a machine learning model to perform speech recognition may include audio clips representing speech. Because of this, data cards may be used to ensure data clarity, transparency, and integrity across datasets and their applications. For example, a data card associated with a dataset may provide information related to the dataset, such as how the data was collected, a size of the dataset, a number of data instances included in the dataset, a number of features included in the dataset, distributions and/or statistics for features, possible sensitive features included in the dataset, and/or so forth.

Conventional systems that generate data cards for datasets have users manually input the information into the data cards, such as by inputting descriptions for each field of the data cards. However, requiring users to input the information requires a large amount of time and computing resources (e.g., user devices), while also causing the data cards to be prone to user error. Additionally, since datasets may be created by different developers, formats of the data cards may be inconsistent across the datasets. For example, some developers generate data cards that include only the highest level of information, such as names of the datasets and links to resources associated with the datasets, while other developers generate data cards that include more exhaustive information, such as dataset features, distributions and/or statistics associated with the features, and possible sensitive features associated with the datasets.

Furthermore, datasets may be updated to improve the datasets for their respective applications. For example, a first version of a dataset may include initial data instances while a second, updated version of the dataset may include new data instances that were added to the dataset for various reasons, such as to reduce a possible bias of the dataset. However, in some circumstances, the data cards associated with the datasets may not be updated with the new versions of the datasets. When the data cards are not updated, it may be difficult to maintain the data clarity, transparency, and/or integrity associated with the current versions of the datasets. For example, if a data card does not reflect the current version of the dataset, then developers that use the data card may be unable to determine whether the dataset is adequate to perform specific applications, such as training machine learning models.

SUMMARY

Embodiments of the present disclosure relate to automatic generation and maintenance of data cards for datasets. Systems and methods are disclosed that process a dataset in order to identify relevant information associated with the dataset. For example, the dataset may include and/or be associated with sources of information (such as files, documents, links, memos, research papers, annotations, labels, and/or the like) that describe data instances (e.g., images, audio clips, point clouds, etc.) included in the dataset. These sources of information may then be analyzed to retrieve the relevant information associated with the dataset.
Systems and methods are then further disclosed that may use one or more language models to process input data associated with the relevant information in order to generate a data card associated with the dataset. In some examples, the input data processed by the language model(s) may further be associated with additional information, such as a template indicating a format for the data card and/or inputted information from one or more users.

In contrast to the conventional systems described above, the systems of the present disclosure, in some embodiments, may use the language model(s) to automatically generate data cards for datasets. As such, users may not need to manually identify information that is relevant to the data cards and/or input the relevant information when generating data cards, which may save time and/or computing resources. Additionally, in contrast to the conventional systems, the systems of the present disclosure, in some embodiments, may automatically update data cards to represent accurate information related to datasets. For instance, and as described in more detail herein, when a dataset is updated to a new version, such as by removing data instances from and/or adding new data instances to the dataset, the systems of the present disclosure may automatically update the current data card for the dataset and/or gen