US-12619872-B2 - System and method for filtering datasets using conditional-likelihood filtration
Abstract
A system and method are provided for generating a trained model to filter datasets for hate speech. The method includes obtaining an unfiltered corpus of data, obtaining a set of trigger phrases, and using the set of trigger phrases to generate a trained model which comprises at least one conditional likelihood of the trigger phrases conditioned on documents in the corpus of data. A system and method are also provided for filtering datasets for hate speech using pretrained models. The method includes obtaining a pretrained model generated using a set of trigger phrases and which comprises at least one conditional likelihood of the trigger phrases conditioned on documents in a corpus of data used to generate the pretrained model; using the pretrained model to filter an unfiltered dataset and generate a filtered dataset; and outputting the filtered dataset.
Inventors
- Helen Ngo
- Nicholas Frosst
Assignees
- Cohere Inc.
Dates
- Publication Date: 2026-05-05
- Application Date: 2022-06-22
Claims (20)
- 1 . A method of generating a trained natural language processing model having a lower propensity of generating hate speech, the method comprising: obtaining an unfiltered dataset; obtaining a set of trigger phrases; determining, for each trigger phrase, a conditional likelihood of portions of the unfiltered dataset satisfying a threshold under a probability distribution for at least one trigger axis, wherein the conditional likelihood for each trigger phrase is determined based on a context provided by the respective portions of the unfiltered dataset; filtering the unfiltered dataset to remove the one or more portions having at least one determined conditional likelihood that satisfies the threshold to generate a filtered dataset; generating the trained natural language processing model using the filtered dataset; providing the trained natural language processing model to an application; and utilizing the trained natural language processing model in the application thereby reducing the conditional likelihood that the trained natural language processing model generates subject matter defined by the set of trigger phrases when generating a result from the utilizing.
- 2 . The method of claim 1 , further comprising obtaining a baseline model trained on the unfiltered dataset, wherein the probability distribution is represented by the baseline model and wherein the natural language processing model is generated by training the baseline model by finetuning the baseline model using the filtered dataset.
- 3 . The method of claim 1 , further comprising: generating a text phrase with the trained natural language processing model, the trained natural language processing model generating the text phrase at least in part based on received input.
- 4 . The method of claim 1 , further comprising: updating the set of trigger phrases; filtering the filtered dataset based on the updated trigger phrases to generate a further filtered dataset; and further training the trained natural language processing model with the further filtered dataset.
- 5 . The method of claim 1 , wherein one or more of the set of trigger phrases is appended to each portion of the unfiltered data.
- 6 . The method of claim 1 , wherein the set of trigger phrases are responsive to a first type of hate speech, and the threshold is responsive to a second type of hate speech.
- 7 . The method of claim 1 , wherein the threshold is based on a comparison of determined conditional likelihoods of entries of the dataset to properties of the unfiltered dataset.
- 8 . The method of claim 1 , wherein the at least one conditional likelihood is generated based on the relationship p(t|d) where t is a trigger phrase and d is an extract from the beginning of a document.
- 9 . The method of claim 1 , further comprising removing portions of the unfiltered dataset which include one or more blocked phrases, wherein the set of trigger phrases does not include the one or more blocked phrases.
- 10 . The method of claim 1 , wherein the threshold includes a plurality of thresholds associated with different types of hate speech.
- 11 . A system for generating a trained natural language processing model having a lower propensity of generating hate speech, the system comprising: a processor; a memory in communication with the processor, the memory comprising computer executable instructions that when executed by the processor cause the processor to: obtain an unfiltered dataset; obtain a set of trigger phrases; determine, for each trigger phrase, a conditional likelihood of portions of the unfiltered dataset satisfying a threshold under a probability distribution for at least one trigger axis, wherein the conditional likelihood for each trigger phrase is determined based on a context provided by the respective portions of the unfiltered dataset; filter the unfiltered dataset to remove the one or more portions having at least one determined conditional likelihood that satisfies the threshold to generate a filtered dataset; generate the trained natural language processing model using the filtered dataset; provide the trained natural language processing model to an application; and utilize the trained natural language processing model in the application thereby reducing the conditional likelihood that the trained natural language processing model generates subject matter defined by the set of trigger phrases when generating a result from the utilizing.
- 12 . The system of claim 11 , further comprising instructions to obtain a baseline model trained on the unfiltered dataset, wherein the probability distribution is represented by the baseline model and wherein the natural language processing model is generated by training the baseline model by finetuning the baseline model using the filtered dataset.
- 13 . The system of claim 11 , wherein the instructions cause the processor to: generate a text phrase with the trained natural language processing model, the trained natural language processing model generating the text phrase at least in part based on received input.
- 14 . The system of claim 11 , wherein the instructions cause the processor to: update the set of trigger phrases; filter the filtered dataset based on the updated trigger phrases to generate a further filtered dataset; and further train the trained natural language processing model with the further filtered dataset.
- 15 . The system of claim 11 , wherein one or more of the set of trigger phrases is appended to each portion of the unfiltered data.
- 16 . The system of claim 11 , wherein the set of trigger phrases are responsive to a first type of hate speech, and the threshold is responsive to a second type of hate speech.
- 17 . The system of claim 11 , wherein the threshold is based on a comparison of determined conditional likelihoods of entries of the dataset to properties of the unfiltered dataset.
- 18 . The system of claim 11 , wherein the instructions cause the processor to: remove portions of the unfiltered dataset which include one or more blocked phrases, and wherein the set of trigger phrases does not include the one or more blocked phrases.
- 19 . The system of claim 11 , wherein the threshold includes a plurality of thresholds associated with different types of hate speech.
- 20 . A non-transitory computer readable medium for training a neural network model including a first plurality of nodes, the computer readable medium comprising computer executable instructions to: obtain an unfiltered dataset; obtain a set of trigger phrases; determine, for each trigger phrase, a conditional likelihood of portions of the unfiltered dataset satisfying a threshold under a probability distribution for at least one trigger axis, wherein the conditional likelihood for each trigger phrase is determined based on a context provided by the respective portions of the unfiltered dataset; filter the unfiltered dataset to remove the one or more portions having at least one determined conditional likelihood that satisfies the threshold to generate a filtered dataset; generate the trained natural language processing model using the filtered dataset; provide the trained natural language processing model to an application; and utilize the trained natural language processing model in the application thereby reducing the conditional likelihood that the trained natural language processing model generates subject matter defined by the set of trigger phrases when generating a result from the utilizing.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to United States Provisional Patent Application No. 63/202,785 filed on Jun. 24, 2021, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

The following relates generally to filtering datasets, and specifically to filtering such datasets using conditional-likelihood filtration, for example to filter hate speech.

BACKGROUND

Large neural language models pretrained on datasets scraped from the open web are responsible for many recent advancements in natural language processing (NLP) and have seen rapid adoption in real-world applications. While this has enabled the development of new and impactful technology (e.g., translation), concerns have been raised that these models reflect and amplify the systemic bias and prejudice present in their training corpuses.

Such language models are normally trained on data scraped from the open web, including text that covers a wide variety of modern discourse. Some of the perspectives represented in this discourse propagate harmful and oppressive views, such as racism, sexism, ableism, nationalism, etc. Moreover, some webpages include text which is otherwise harmful, toxic, abusive, obscene, or profane. Language models trained on a variety of internet corpuses have been compared, observing that models trained on Wikipedia, for example, exhibit lower expected maximum toxicity, suggesting that models acquire toxicity from their pretraining data. The resulting models may generate harmful text when prompted with innocuous prompts and have the potential to cause real-world harm to end users. The size of these datasets makes human evaluation and filtration impractical if not impossible, as it would take many human lifetimes just to read the datasets in their entirety.

One approach to this issue has been to employ word-level blocklists. When creating a dataset of text, one can remove documents in the dataset if they contain a word on the blocklist. This is a simple way to remove documents with obvious hateful text, such as racial slurs, but it may miss hateful documents that do not use those words, as well as erroneously flag non-hateful documents which use the words in an academic, rhetorical, or expository context. For example, a word-level blocklist could flag academic discussions of racism for removal.

Another approach is vocabulary shifting, a technique which learns a two-dimensional representation of toxicity and non-toxicity for every token in a vocabulary, which is then used to boost the likelihood of non-toxic tokens. This suffers from the same problem as word-level blocklists, in that tokens are assigned negative connotations regardless of their context.

Self-debiasing mitigates corpus-based bias in language models at generation time by using the learned knowledge of a pretrained language model to identify biases in text with a zero-shot prompt-based approach.

It is an object of the following to address at least one of the above-noted disadvantages.

SUMMARY

To address the shortcomings of word-level filtration and the need for large-scale language models that are not likely to generate harmful text, a new method for removing documents from the training data is proposed. A language model which has been trained on an unfiltered corpus is used to compute the conditional likelihood of a trigger phrase conditioned on each document in the corpus, where a trigger phrase is a succinct statement of biased or hateful rhetoric.
The computed likelihoods can then be used to filter the dataset, removing documents which were shown to increase the likelihood of these trigger phrases. It has been demonstrated, by measuring the relative likelihoods on examples from the RealToxicityPrompts dataset, that models trained on this filtered dataset are less likely to generate hateful text, while preserving performance on standard language modeling benchmarks. The methods described herein can be adapted iteratively over time to capture new forms of harm, as the generalizability of the proposed methodology allows it to be run iteratively with new triggers.

The following describes a method to enable researchers to programmatically and efficiently remove large volumes of harmful text by using the learned knowledge of a pretrained language model. The following also describes a verification of such a method through a human-labelled dataset of harmful text, and experiments which demonstrate that models finetuned on the resulting filtered dataset are less likely to generate harmful text. The following also provides an analysis highlighting problematic examples in existing standard language modeling benchmarks and the need for researchers creating evaluation benchmarks to identify harmful data in their benchmarks.

In one aspect, there is provided a method of generating a trained model to filter datasets for filtering hate speech, the method comprising: obtaining an unfiltered corpus of data; obtaining a set of trigger phrases; and using the set of trigger phrases to generate a trained model which comprises at least one conditional likelihood of the trigger phrases conditioned on documents in the corpus of data.
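A hedged sketch of the filtration step described above follows, building on the trigger_log_likelihood helper from the previous sketch. Setting the removal threshold from corpus-level statistics (here, mean plus a multiple of the standard deviation) is one plausible reading of the comparison to "properties of the unfiltered dataset" in claims 7 and 17; the function names, the max-over-triggers aggregation, and the choice of statistic are assumptions, not the patented implementation:

```python
# Illustrative filtration sketch; threshold strategy and names are assumptions.
import statistics

def filter_corpus(documents, triggers, k=2.0):
    """Drop documents whose maximum trigger likelihood is unusually high.

    Uses trigger_log_likelihood() from the previous sketch. The threshold is
    derived from corpus statistics (mean + k * stdev), one possible way of
    comparing likelihoods to properties of the unfiltered dataset.
    """
    scores = [
        max(trigger_log_likelihood(doc, t) for t in triggers)
        for doc in documents
    ]
    threshold = statistics.mean(scores) + k * statistics.pstdev(scores)
    return [doc for doc, s in zip(documents, scores) if s <= threshold]
```

The filtered corpus returned by such a routine could then be used to finetune the baseline model, and the procedure re-run iteratively as new trigger phrases are added.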