US-12619648-B2 - Practical supervised classification of data sets

Abstract

The present invention relates to information retrieval. In order to facilitate a search and identification of documents, there is provided a computer-implemented method for training a classifier model for data classification in response to a search query. The computer-implemented method comprises: a) obtaining a dataset that comprises a seed set of labeled data representing a training dataset; b) training the classifier model by using the training dataset to fit parameters of the classifier model; c) evaluating a quality of the classifier model using a test dataset that comprises unlabeled data from the obtained dataset to generate a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset; d) determining a global risk value of misclassification and a reward value based on the classifier confidence score on the test dataset; e) iteratively updating the parameters of the classifier model and performing steps b) to d) until the global risk value falls within a predetermined risk limit value or an expected reward value is reached.
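
The loop in steps a) to e) can be illustrated with a minimal sketch. It is not the patented implementation: the one-feature logistic regression, the toy data, and the particular definitions of risk (mean of one minus the confidence score) and reward (mean decrease in binary entropy relative to one bit) are all assumptions chosen only to make the stopping rule concrete.

```python
import math

# Hypothetical toy data: (feature, label) pairs for the seed set,
# bare features for the unlabeled test set.
SEED = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1)]
UNLABELED = [-2.5, -1.8, -1.2, 1.1, 1.9, 2.4]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def fit_step(w, b, seed, lr=0.5):
    """Step b): one gradient-descent update on the labeled seed set."""
    gw = gb = 0.0
    for x, y in seed:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    n = len(seed)
    return w - lr * gw / n, b - lr * gb / n


def risk_and_reward(w, b, unlabeled):
    """Steps c) and d): score unlabeled data, then derive risk and reward.

    Assumed definitions (not taken from the patent text):
    risk   = mean of (1 - confidence), confidence = max(p, 1 - p)
    reward = mean decrease in uncertainty, 1 bit minus entropy of p
    """
    probs = [sigmoid(w * x + b) for x in unlabeled]
    risk = sum(1 - max(p, 1 - p) for p in probs) / len(probs)
    reward = sum(1 - binary_entropy(p) for p in probs) / len(probs)
    return risk, reward


def train(seed, unlabeled, risk_limit=0.1, expected_reward=0.9, max_iter=5000):
    """Step e): repeat b) to d) until the risk limit or reward target is met."""
    w, b = 0.0, 0.0
    risk, reward = risk_and_reward(w, b, unlabeled)
    iterations = 0
    while iterations < max_iter:
        iterations += 1
        w, b = fit_step(w, b, seed)
        risk, reward = risk_and_reward(w, b, unlabeled)
        if risk <= risk_limit or reward >= expected_reward:
            break
    return w, b, risk, reward, iterations
```

Calling `train(SEED, UNLABELED)` returns the fitted parameters together with the final risk, reward, and iteration count. Because the test data carry no labels, both quantities here are derived from the confidence scores alone, which only makes sense to the extent that those scores are well calibrated.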

Inventors

  • Arunav Mishra
  • Henning Schwabe
  • Lalita Shaki Uribe Ordonez

Assignees

  • BASF SE

Dates

Publication Date
2026-05-05
Application Date
2024-01-15
Priority Date
2020-08-07

Claims (18)

  1. A method for data classification based on one or more classifier(s), the method comprising the steps of: providing a classifier model associated with at least a global risk value and/or a reward value, wherein the classifier model is trained based on a dataset that comprises a seed set of labeled data representing a training dataset, wherein on training a quality of the classifier model using a test dataset that comprises unlabeled data from the obtained dataset to generate a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset was evaluated, wherein on training a global risk value of misclassification and a reward value based on the classifier confidence score on the test dataset was determined; providing data to be classified to the classifier model and determining one or more classifier(s) based on the trained classifier model and the provided data, wherein a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset is determined on training, wherein a classifier metric at different thresholds on classifier confidence score is determined, wherein the classifier metric represents a measure of a test's accuracy, wherein a reference threshold relating to a distribution of the classifier metric over the threshold on classifier confidence score is determined, wherein a threshold range that defines a recommended window according to a predefined criteria is determined, wherein the reference threshold is located within the threshold range, wherein the reward value at different thresholds on classifier confidence score is determined, and wherein the reward value includes at least one of a measure of information gain and a measure of decrease in uncertainty; and presenting, via an interface, a list of documents corresponding to the data classified by the classifier model based on the one or more classifier(s), wherein documents corresponding to data classified as relevant are ranked higher than and presented before documents classified as irrelevant in the list.
  2. A computer-implemented method for data classification, in particular in response to a search query, comprising: receiving a search query or sensor data; training a classifier model according to claim 1; and applying the trained classifier model on new search results or uploaded data.
  3. A computer-implemented method of predicting quality of a product, comprising the following steps: providing sensor data associated with the product; providing a classifier model trained according to the method of claim 1, wherein the classifier model relates historical quality classifiers of the product to historic sensor data; determining a quality classifier based on the trained classifier model and the sensor data; and providing control data associated with the quality classifier.
  4. A computer program element for an apparatus, which when being executed by a processor is configured to carry out the method according to claim 1.
  5. The method of claim 1, wherein the classifier model is associated with a deployment risk based on global risk value, wherein the deployment risk for one or more classifier(s) is associated with the risk of false negatives, wherein the deployment risk is estimated a-priori before classifier model deployment.
  6. The method of claim 1, wherein the classifier model is deployed for data classification based on the global risk value and/or the reward value.
  7. The method of claim 1, wherein a user-defined threshold value is received on training that yields a desirable risk-reward pair for a use case based on the global risk value and the reward value at the user-defined threshold value.
  8. The method of claim 1, wherein at least the global risk value and/or the reward value are provided on providing data to be classified to the classifier model and determining one or more classifier(s) based on the trained classifier model and the provided data.
  9. The method of claim 1, wherein the global risk value related to misclassification and the reward value are determined on training based on the classifier confidence score on the test dataset.
  10. The method of claim 1, wherein the global risk value and the reward value are expressed as an algebraic expression in an objective function.
  11. The method of claim 1, wherein the data classification relates to one or more industrial application(s), and wherein the one or more classifier(s) relate to monitoring and/or controlling of a plant, a quality of a product, one or more properties of a chemical mixture, one or more plant properties or a fault detection of a battery.
  12. The method of claim 1, wherein the data to be classified includes sensor data, wherein the classifier model is applied to sensor data or, wherein the classifier model relates one or more historical quality classifiers of a product to historic sensor data, wherein a quality classifier based on the trained classifier model and the sensor data is determined, wherein monitoring and/or control data associated with the quality classifier is provided; or wherein the data to be classified relates to a search query or sensor data, wherein the trained classifier model is applied to search results of the search query and/or uploaded search data; or wherein the data to be classified comprises at least one of: text data; image data; experimental data from chemical, biological, and/or physical experiments; plant operations data, business operations data; and machine-generated data in log files.
  13. The method of claim 1, wherein the global risk value and the reward value relate to user-defined input based on the use case on training of the classifier model and prior to deployment of the classifier model.
  14. A computer-implemented method for data classification, in particular in response to a search query, comprising: receiving a search query or sensor data; providing a classifier model trained according to claim 1; and applying the trained classifier model on new search results or uploaded data.
  15. An apparatus for data classification, in particular in response to a search query, comprising: an input unit configured to receive a search query; a processing unit configured to carry out the method of claim 1; and an output unit configured to provide a processing result.
  16. An information retrieval system, comprising: an apparatus according to claim 12; and a data repository for storing data.
  17. An apparatus for data classification based on one or more classifier(s), the apparatus comprising: a processing unit configured to provide a classifier model associated with at least a global risk value and/or a reward value, wherein the classifier model is trained based on a dataset that comprises a seed set of labeled data representing a training dataset, wherein on training a quality of the classifier model using a test dataset that comprises unlabeled data from the obtained dataset to generate a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset was evaluated, wherein on training a global risk value of misclassification and a reward value based on the classifier confidence score on the test dataset was determined; an input unit configured to provide data to be classified to the classifier model and the processing unit is configured to determine one or more classifier(s) based on the trained classifier model and the provided data, wherein a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset is determined on training, wherein a classifier metric at different thresholds on classifier confidence score is determined, wherein the classifier metric represents a measure of a test's accuracy, wherein a reference threshold relating to a distribution of the classifier metric over the threshold on classifier confidence score is determined, wherein a threshold range that defines a recommended window according to a predefined criteria is determined, wherein the reference threshold is located within the threshold range, wherein the reward value at different thresholds on classifier confidence score is determined, and wherein the reward value includes at least one of a measure of information gain and a measure of decrease in uncertainty; and an output interface configured to present a list of documents corresponding to the data classified by the classifier model based on the one or more classifier(s), wherein documents corresponding to data classified as relevant are ranked higher than and presented before documents classified as irrelevant in the list.
  18. A non-transitory computer-readable storage medium having instructions encoded thereon that, when executed by a processor, cause the processor to: implement a classifier model associated with at least a global risk value and/or a reward value, wherein the classifier model is trained based on a dataset that comprises a seed set of labeled data representing a training dataset, wherein on training a quality of the classifier model using a test dataset that comprises unlabeled data from the obtained dataset to generate a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset was evaluated, wherein on training a global risk value of misclassification and a reward value based on the classifier confidence score on the test dataset was determined; providing data to be classified to the classifier model and determining one or more classifier(s) based on the trained classifier model and the provided data, wherein a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset is determined on training, wherein a classifier metric at different thresholds on classifier confidence score is determined, wherein the classifier metric represents a measure of a test's accuracy, wherein a reference threshold relating to a distribution of the classifier metric over the threshold on classifier confidence score is determined, wherein a threshold range that defines a recommended window according to a predefined criteria is determined, wherein the reference threshold is located within the threshold range, wherein the reward value at different thresholds on classifier confidence score is determined and wherein the reward value includes at least one of a measure of information gain and a measure of decrease in uncertainty; and presenting, via an interface, a list of documents corresponding to the data classified by the classifier model based on the one or more classifier(s), wherein documents corresponding to data classified as relevant are ranked higher than and presented before documents classified as irrelevant in the list.
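
The threshold-related elements recited in the independent claims (a classifier metric evaluated at several thresholds on the confidence score, a reference threshold, a recommended window containing it, and a reward at each threshold) can be sketched as follows. This is an illustration under stated assumptions, not the claimed method: F1 stands in for the classifier metric, the reference threshold is taken as the F1 maximum, the window is the contiguous run of thresholds whose F1 stays within 90% of that maximum, risk is the false-negative rate, and reward is the information gain of the relevant/irrelevant split. The scores and labels are hypothetical, and a small labeled sample is assumed to be available for computing the metric.

```python
import math

# Hypothetical confidence scores and ground-truth relevance labels.
SCORES = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
LABELS = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
THRESHOLDS = [i / 10 for i in range(1, 10)]


def entropy(labels):
    """Entropy in bits of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def sweep(scores, labels, thresholds):
    """Compute metric, risk, and reward at each confidence threshold."""
    base = entropy(labels)
    n = len(labels)
    rows = []
    for t in thresholds:
        kept = [y for s, y in zip(scores, labels) if s >= t]     # predicted relevant
        dropped = [y for s, y in zip(scores, labels) if s < t]   # predicted irrelevant
        tp, fp, fn = sum(kept), len(kept) - sum(kept), sum(dropped)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Assumed risk: share of relevant items filtered out (false negatives).
        risk = fn / (tp + fn) if (tp + fn) else 0.0
        # Assumed reward: information gain of splitting at this threshold.
        split = len(kept) / n * entropy(kept) + len(dropped) / n * entropy(dropped)
        rows.append({"threshold": t, "f1": f1, "risk": risk, "reward": base - split})
    return rows


def recommend(rows, tolerance=0.9):
    """Return the reference threshold and a recommended window around it."""
    ref = max(range(len(rows)), key=lambda i: rows[i]["f1"])
    floor = tolerance * rows[ref]["f1"]
    lo = hi = ref
    while lo > 0 and rows[lo - 1]["f1"] >= floor:
        lo -= 1
    while hi < len(rows) - 1 and rows[hi + 1]["f1"] >= floor:
        hi += 1
    return rows[ref]["threshold"], (rows[lo]["threshold"], rows[hi]["threshold"])
```

A user could inspect `sweep(SCORES, LABELS, THRESHOLDS)` to pick a threshold with an acceptable risk-reward pair, or start from the value returned by `recommend`. By construction the reference threshold always lies inside the returned window, because the window is grown outward from it.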

Description

CROSS-REFERENCE TO RELATED APPLICATION(S)

The present application is a divisional application of U.S. application Ser. No. 17/394,994 filed on Aug. 5, 2021, which claims the benefit of priority of European Patent Application No. 20190061.0, filed on Aug. 7, 2020, the disclosures of which are hereby incorporated by reference herein in their entirety.

FIELD OF THE INVENTION

The present invention generally relates to information retrieval and in particular to a computer-implemented method for training a classifier model for data classification in particular in response to a search query, a computer-implemented method for data classification in particular in response to a search query, an apparatus for data classification in response to a search query, an information retrieval system, a computer-implemented method of predicting quality of a product, a computer program element, and a computer readable medium.

BACKGROUND OF THE INVENTION

In, for example, technology monitoring, documents have to be filtered quickly and with minimum reading effort to find the most important ones for the use case. Search tools return a list of documents in response to a search query. The documents from the list may be ranked according to their relevance to the search query. For example, highly relevant documents may be ranked higher than, and may be displayed in a list above, documents of a lesser relevance. This allows a user to quickly and conveniently identify the most relevant documents retrieved in response to the query.

Existing supervised binary machine learning classifiers used to filter out irrelevant documents from search results may suffer from two deficits. The first deficit may be that poor separation of classes leaves many documents to read, namely low precision and recall. The second deficit may be the lack of a measure to estimate the deployment risk of false negatives in the irrelevant class, namely the risk of overlooking relevant items during deployment of the trained classifier.
In practice, the number of false negatives is determined after the fact by manually checking all the items, which defeats the purpose of saving effort.

Classification problems also exist in industry, in particular in the chemical industry, e.g. in quality control during production. Misclassifications in such an environment may be highly problematic, and may lead to increased waste and increased costs. Currently, the classification quality can only be enhanced by increasing the data pool for training. This is not always possible for industrial applications. Thus, there is a need to make classification applicable in chemical industries.

SUMMARY OF THE INVENTION

There may be a need to facilitate a search and identification of documents. There may also be a need to facilitate classification in industrial processes.

The object of the present invention is solved by the subject-matter of the independent claims, wherein further embodiments are incorporated in the dependent claims. It should be noted that the following described aspects of the invention apply also for the computer-implemented method for training a classifier model for data classification in particular in response to a search query, the computer-implemented method for data classification in particular in response to a search query, the apparatus, the information retrieval system, the computer-implemented method of predicting quality of a product, the computer program element, and the computer readable medium.

It should be noted that, although the description often uses document classification in response to a query as an example, the method is applicable to the classification problems mentioned below.

According to a first aspect of the present invention, there is provided a computer-implemented method for training a classifier model for data classification, in particular in response to a search query.
The computer-implemented method comprises:

  a) obtaining a dataset that comprises a seed set of labeled data representing a training dataset;
  b) training the classifier model by using the training dataset to fit parameters of the classifier model;
  c) evaluating a quality of the classifier model using a test dataset that comprises unlabeled data from the obtained dataset to generate a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset;
  d) determining a global risk value of misclassification and a reward value based on the classifier confidence score on the test dataset;
  e) iteratively updating the parameters of the classifier model and performing steps b) to d) until the global risk value falls within a predetermined risk limit value or an expected reward value is reached.

In other words, a method is proposed for training a classifier model for data classification in a dataset. The dataset may comprise one or more of text data, image data, experimental data from chemical, biological, and/or physical experiments, plant operations data, business o