US-12626056-B2 - Generating natural language model insights for data charts using light language models distilled from large language models
Abstract
The present disclosure relates to systems, methods, and non-transitory computer readable media for generating naturally phrased insights about data charts using light language models distilled from large language models. To synthesize training data for the light language model, in some embodiments, the disclosed systems leverage insight templates for prompting a large language model for generating naturally phrased insights. In some embodiments, the disclosed systems anonymize and augment the synthesized training data to improve the accuracy and robustness of model predictions. For example, the disclosed systems anonymize training data by injecting noise into data charts before prompting the large language model for generating naturally phrased insights from insight templates. In some embodiments, the disclosed systems further augment the (anonymized) training data by splitting or partitioning data charts into folds that act as individual data charts.
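The anonymization (noise injection) and augmentation (fold partitioning) steps described in the abstract can be sketched in miniature. The sketch below is purely illustrative and not the patented implementation: the function names, the multiplicative-noise scheme, and the contiguous-fold split are all assumptions.

```python
import random

def anonymize_chart(points, noise_scale=0.05, seed=0):
    """Inject small multiplicative noise into each data point's value,
    obfuscating the raw numbers while retaining the chart's data
    structure (same keys, same count, same ordering)."""
    rng = random.Random(seed)
    return [
        {"x": p["x"], "y": p["y"] * (1 + rng.uniform(-noise_scale, noise_scale))}
        for p in points
    ]

def partition_into_folds(points, num_folds):
    """Split one chart into contiguous folds that each act as a
    standalone data chart, augmenting the training data.  Assumes
    a simple contiguous split; other partitioning schemes are possible."""
    fold_size = max(1, len(points) // num_folds)
    return [points[i:i + fold_size] for i in range(0, len(points), fold_size)]
```

For example, a six-point chart anonymized this way keeps its x-axis and point count intact while each y-value is perturbed by at most `noise_scale`, and the same chart partitioned into three folds yields three two-point "charts" for training.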
Inventors
- Victor Soares Bursztyn
- Wei Zhang
- Prithvi Bhutani
- Eunyee Koh
- Abhisek Trivedi
Assignees
- ADOBE INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-06-20
Claims (20)
- 1 . A computer-implemented method comprising: generating, utilizing a data narrator model to process a data chart, an insight template defining a template structure for generating natural model insights that include natural language descriptions of the data chart; generating, from the insight template, a natural model insight comprising a natural language summarization of the data chart within a threshold degree of divergence from the insight template by using a large language model to paraphrase the data chart using natural language phrasing while adhering to the template structure of the insight template, wherein the threshold degree of divergence is defined according to a temperature parameter for the large language model where higher temperatures result in more creative predictions and lower temperatures result in less creative predictions; distilling the large language model into a light language model by tuning parameters of the light language model such that the parameters of the light language model produce the natural model insight from the insight template when tuned, the light language model having fewer parameters than a threshold number of parameters, wherein the threshold number of parameters is less than a number of parameters of the large language model; and generating, in response to prompting the light language model with an additional data chart, an additional natural model insight comprising a natural language summarization of the additional data chart.
- 2 . The computer-implemented method of claim 1 , further comprising generating the data chart by anonymizing training data to inject noise into data points included in the data chart to obfuscate the data points while retaining a data structure for the data chart.
- 3 . The computer-implemented method of claim 1 , further comprising generating the data chart by augmenting training data by: partitioning the training data into folds within the data chart, wherein the folds correspond to respective natural model insights; and modifying a granularity of the training data within the data chart to generate data points for the data chart at modified intervals.
- 4 . The computer-implemented method of claim 1 , wherein generating the natural model insight comprises paraphrasing anonymized, augmented data within the data chart utilizing the large language model while adhering to the template structure of the insight template according to a temperature value indicating the threshold degree of divergence.
- 5 . The computer-implemented method of claim 1 , wherein generating the natural model insight comprises: determining an insight type for the insight template generated by the data narrator model; selecting one or more natural insight examples from a set of natural insight examples generated for the insight type; and utilizing the large language model to generate the natural model insight by paraphrasing the data chart according to the one or more natural insight examples for the insight type.
- 6 . The computer-implemented method of claim 1 , further comprising: determining an edit distance between the natural model insight and the insight template; comparing the edit distance with an edit distance threshold; based on comparing the edit distance with the edit distance threshold, updating a temperature value associated with the light language model; and generating, utilizing the light language model, a new natural model insight based on updating the temperature value.
- 7 . The computer-implemented method of claim 5 , further comprising: determining a first insight type corresponding to a first fold within the data chart; determining a second insight type corresponding to a second fold within the data chart; and selecting one or more natural insight examples from a set of natural insight examples generated for the first insight type and one or more natural insight examples from a set of natural insight examples generated for the second insight type.
- 8 . A system comprising: one or more memory devices comprising a light language model distilled from a large language model, the large language model prompted to generate natural model insights comprising natural language summarizations of data charts by summarizing data charts using natural language phrases according to insight templates and within a threshold degree of divergence, wherein the threshold degree of divergence is defined according to a temperature parameter for the large language model where higher temperatures result in more creative predictions and lower temperatures result in less creative predictions, the light language model having fewer parameters than a threshold number of parameters, wherein the threshold number of parameters is less than a number of parameters of the large language model; and one or more processors configured to cause the system to: receive, from a client device, a user interaction requesting a natural model insight to describe a data chart in natural language phrases; in response to the user interaction, generate the natural model insight comprising a natural language summarization of the data chart by utilizing the light language model distilled from the large language model; and provide the natural model insight for display on the client device.
- 9 . The system of claim 8 , wherein the large language model is prompted to generate the natural model insights using training data anonymized by injecting noise into data charts to obfuscate the data charts while retaining data structures for the data charts.
- 10 . The system of claim 8 , wherein the large language model is prompted to generate the natural model insights using training data augmented by: partitioning data charts into folds corresponding to respective natural model insights; and modifying a granularity of the training data to generate data points at modified intervals.
- 11 . The system of claim 8 , wherein the light language model is distilled from the large language model by tuning parameters of the light language model such that, when tuned, the parameters of the light language model produce natural model insights from corresponding insight templates generated by a data narrator model to define template structures for the natural model insights.
- 12 . The system of claim 8 , wherein output from the light language model is validated by: determining an edit distance between a sample natural model insight and a corresponding insight template; comparing the edit distance with an edit distance threshold; and modifying a temperature value of the light language model based on comparing the edit distance with the edit distance threshold.
- 13 . The system of claim 8 , wherein the one or more processors are further configured to cause the system to generate the natural model insight by using parameters of the light language model distilled from the large language model using anonymized, augmented data charts.
- 14 . The system of claim 8 , wherein the one or more processors are further configured to cause the system to provide the natural model insight for display on the client device together with a graph depicting a visual representation of the data chart, wherein the natural model insight comprises a natural language paraphrasing of at least a portion of the graph.
- 15 . A non-transitory computer-readable medium storing executable instructions that, when executed by a processing device, cause the processing device to perform operations comprising: synthesizing anonymized training data by injecting noise into data points within a data chart to obfuscate the data points while retaining a data structure for the data chart; augmenting the anonymized training data by partitioning the data chart into multiple folds corresponding to respective natural model insights; generating, utilizing a data narrator model to process one or more data points within a fold of the data chart that has been augmented, an insight template from the augmented and anonymized training data defining a template structure for generating natural model insights that include natural language descriptions of the data chart; generating, from the insight template, a natural model insight comprising a natural language summarization of the data chart within a threshold degree of divergence from the insight template by using a large language model to paraphrase the one or more data points within the fold of the data chart using natural language phrasing while adhering to the template structure of the insight template, wherein the threshold degree of divergence is defined according to a temperature parameter for the large language model where higher temperatures result in more creative predictions and lower temperatures result in less creative predictions; and generating, by utilizing a light language model distilled from the large language model and in response to prompting the light language model with an additional data chart, an additional natural model insight comprising a natural language summarization of the additional data chart.
- 16 . The non-transitory computer-readable medium of claim 15 , wherein augmenting the anonymized training data further comprises modifying a granularity of the data chart to generate new data points for the data chart divided at different intervals.
- 17 . The non-transitory computer-readable medium of claim 15 , wherein generating the additional natural model insight further comprises distilling the large language model into the light language model by tuning parameters of the light language model such that the parameters of the light language model produce the additional natural model insight from the insight template when tuned.
- 18 . The non-transitory computer-readable medium of claim 15 , wherein generating the natural model insight comprises: determining insight types corresponding to a set of insight templates generated by the data narrator model; and generating, for each of the insight types, a set of natural insight examples comprising natural language paraphrases of anonymized-augmented data points within the data chart.
- 19 . The non-transitory computer-readable medium of claim 18 , wherein generating the natural model insight comprises: determining an insight type for the insight template generated by the data narrator model; selecting one or more natural insight examples from a set of natural insight examples generated for the insight type; and utilizing the large language model to generate the natural model insight by paraphrasing the one or more data points within the fold of the data chart while adhering to the template structure of the insight template and following the one or more natural insight examples for the insight type.
- 20 . The non-transitory computer-readable medium of claim 18 , wherein the operations further comprise validating the natural model insight by determining that the natural model insight satisfies a threshold edit distance in relation to the insight template.
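Claims 6, 12, and 20 describe validating a generated insight by its edit distance from the insight template and adjusting the model's temperature in response. A minimal sketch of that validation loop follows; the function names, the step size, and the 0–1 temperature clamp are assumptions for illustration, not details from the claims.

```python
def edit_distance(a, b):
    """Levenshtein edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def adjust_temperature(insight, template, temperature, threshold, step=0.1):
    """If the generated insight drifts too far from its template
    (distance above threshold), lower the temperature for less creative
    predictions; otherwise raise it for more creative ones.  The
    [0.0, 1.0] clamp is an illustrative choice."""
    if edit_distance(insight, template) > threshold:
        return max(0.0, temperature - step)
    return min(1.0, temperature + step)
```

In use, a sample insight that diverges from its template by more than the threshold pushes the temperature down, and one that hugs the template pushes it up, matching the higher-temperature/more-creative relationship the claims describe.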
Description
CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/491,400, filed Mar. 21, 2023, entitled DISTILLING LANGUAGE MODELS UTILIZING INSIGHT PARAPHRASING, which is incorporated herein by reference in its entirety.

BACKGROUND

In the field of data captioning, large language models have become increasingly effective in various applications, such as generating captions to explain or summarize data represented in charts and graphs. These models, such as Generative Pretrained Transformer 3 ("GPT-3"), have revolutionized data captioning, enabling generation of data captions that paraphrase or summarize large datasets in word form. Despite the advances of existing data captioning systems, however, these prior systems continue to suffer from a number of disadvantages, such as inaccuracy in generating naturally phrased data captions and computational inefficiency from relying on such large, expensive models.

SUMMARY

This disclosure describes one or more embodiments of systems, methods, and non-transitory computer readable media that solve one or more of the foregoing or other problems in the art by generating naturally phrased insights about data charts using light language models distilled from large language models. For example, the disclosed systems distill parameters learned in a large language model for generating natural model insights into a light language model that uses far fewer computational resources. In some embodiments, the disclosed systems train the light language model using specially synthesized training data (e.g., synthesized by the large language model) that is anonymized and augmented to improve the accuracy and robustness of model predictions. For example, the disclosed systems anonymize training data by injecting noise into data charts used to train a light language model.
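The summary describes distilling the large language model by tuning a light model to reproduce the large model's naturally phrased insights. One way to picture the data-synthesis side of that process is building (prompt, target) fine-tuning pairs from charts and their insight templates. The sketch below is hypothetical: `paraphrase_fn` merely stands in for the prompted large language model, and the prompt layout is an assumption.

```python
def build_distillation_pairs(examples, paraphrase_fn):
    """Turn (chart_description, insight_template) examples into
    (prompt, target) pairs for fine-tuning a light language model.

    `paraphrase_fn` stands in for a prompted large language model that
    rewrites the template into a natural model insight; the pairs could
    then feed any standard sequence-to-sequence fine-tuning loop."""
    pairs = []
    for chart_description, template in examples:
        prompt = f"Chart: {chart_description}\nTemplate: {template}\nInsight:"
        target = paraphrase_fn(chart_description, template)
        pairs.append({"prompt": prompt, "target": target})
    return pairs
```

Tuning the light model's parameters so it maps each prompt to its target is what lets it later produce natural model insights for new charts without invoking the large model.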
In some embodiments, the disclosed systems further augment the (anonymized) training data by splitting or partitioning training data charts into folds that act as individual data charts. From the synthesized training data, in some cases, the disclosed systems generate insight templates to guide a large language model for generating natural model insights.

BRIEF DESCRIPTION OF THE DRAWINGS

This disclosure describes one or more embodiments of the invention with additional specificity and detail by referencing the accompanying figures. The following paragraphs briefly describe those figures, in which:

- FIG. 1 illustrates an example system environment in which an insight generation system operates in accordance with one or more embodiments;
- FIG. 2 illustrates an overview of prompting and utilizing a large language model to distill into a light language model for generating natural model insights in accordance with one or more embodiments;
- FIG. 3 illustrates an example diagram of generating insight templates from synthesized training data in accordance with one or more embodiments;
- FIG. 4 illustrates an example diagram for generating a natural model insight from an insight template in accordance with one or more embodiments;
- FIG. 5 illustrates an example diagram for distilling a large language model into a light language model in accordance with one or more embodiments;
- FIG. 6 illustrates an example insight interface for generating and presenting a natural model insight in accordance with one or more embodiments;
- FIG. 7 illustrates an example table of experimental results for the insight generation system in accordance with one or more embodiments;
- FIG. 8 illustrates an example schematic diagram of an insight generation system in accordance with one or more embodiments;
- FIG. 9 illustrates an example flowchart of a series of acts for prompting and utilizing a large language model to distill into a light language model for generating natural model insights in accordance with one or more embodiments;
- FIG. 10 illustrates an example flowchart of a series of acts for generating and providing a natural model insight using a light language model distilled from a large language model in accordance with one or more embodiments; and
- FIG. 11 illustrates a block diagram of an example computing device in accordance with one or more embodiments.

DETAILED DESCRIPTION

This disclosure describes one or more embodiments of an insight generation system that generates natural model insights to paraphrase or caption data charts (or graphs) by training and implementing a light language model distilled from a pretrained large language model using anonymized, augmented training data. In some embodiments, the insight generation system synthesizes the training data to train or tune the light language model. Using the synthesized training data, in one or more embodiments, the insight generation system utilizes a data narrator model to generate insight templates that define template structures to guide or inform the generation of natural model insights. Indeed, in certain cases, the ins
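The data narrator model described above produces insight templates that define a template structure for natural language descriptions of a chart. A rule-based "trend" template generator of roughly that flavor might look like the following; this is purely illustrative, and the function name, wording, and single-insight-type scope are assumptions rather than the patent's data narrator model.

```python
def trend_insight_template(series_name, points):
    """Emit a rule-based 'trend' insight template of the kind a data
    narrator model might produce: a templated natural language sentence
    slot-filled from the chart's data points."""
    first, last = points[0], points[-1]
    if last > first:
        direction = "increased"
    elif last < first:
        direction = "decreased"
    else:
        direction = "stayed flat"
    change = abs(last - first)
    return f"{series_name} {direction} by {change:g} over the period."
```

A large language model could then paraphrase such a template into a more naturally phrased insight while adhering to its structure, within the temperature-controlled degree of divergence the claims describe.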