US-12626055-B2 - Automated data classification error correction through spatial analysis using machine learning

Abstract

Aspects of the present disclosure provide techniques for automated data classification error correction through machine learning. Embodiments include receiving a set of predicted labels corresponding to a set of consecutive text strings that appear in a particular order in a document, including: a first text string corresponding to a first predicted label; a second text string that follows the first text string in the particular order and corresponds to a second predicted label; and a third text string that follows the second text string in the particular order and corresponds to a third predicted label. Embodiments include providing inputs to a machine learning model based on: the third text string; the second text string; the second predicted label; and the first predicted label. Embodiments include determining a corrected third label for the third text string based on an output provided by the machine learning model in response to the inputs.
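
The input assembly described above can be illustrated with a minimal sketch. It assumes a hypothetical `Model` callable that takes the current text string, the preceding text string, and the labels of the two preceding strings, and returns a label. The names `build_inputs`, `correct_labels`, and `toy_model`, the label names, and the sample strings are illustrative assumptions, not taken from the disclosure. The rule-based `toy_model` only stands in for a trained machine learning model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical model signature:
# (current text, previous text, previous label, label before that) -> label
Model = Callable[[str, str, str, str], str]


def build_inputs(strings: Sequence[str], labels: Sequence[str], i: int) -> Tuple[str, str, str, str]:
    """Collect the inputs for position i (i >= 2): the third text string, the
    second text string, the second predicted label, the first predicted label."""
    return strings[i], strings[i - 1], labels[i - 1], labels[i - 2]


def correct_labels(strings: Sequence[str], predicted: Sequence[str], model: Model) -> List[str]:
    """Return a copy of the predicted labels in which each label from the
    third position onward is replaced by the model's output."""
    corrected = list(predicted)
    for i in range(2, len(strings)):
        # Context is taken from the original predicted labels (an assumption).
        corrected[i] = model(*build_inputs(strings, predicted, i))
    return corrected


def toy_model(text: str, prev_text: str, prev_label: str, prev_prev_label: str) -> str:
    """Rule-based stand-in for a trained model (illustration only)."""
    if text.isdigit() and prev_label == "amount":
        return "amount"  # a number that follows an amount is treated as an amount
    if text.isdigit() and len(text) == 5:
        return "zip_code"
    return "unknown"


strings = ["Invoice total", "1200", "35759"]
predicted = ["description", "amount", "zip_code"]
corrected = correct_labels(strings, predicted, toy_model)
# corrected == ["description", "amount", "amount"]
```

In this sketch the context labels come from the original predictions. Feeding previously corrected labels forward instead would be an equally plausible design under the same interface.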

Inventors

Mithun Ghosh
Vignesh Thirukazhukundram Subrahmaniam

Assignees

INTUIT INC.

Dates

Publication Date: 2026-05-12
Application Date: 2023-10-09

Claims (20)

1. A method for automated data classification error correction through machine learning, comprising: receiving a set of predicted labels corresponding to a set of consecutive text strings that appear in a particular order in a document, wherein the set of consecutive text strings comprises: a first text string corresponding to a first predicted label of the set of predicted labels; and a second text string that follows the first text string in the particular order and corresponds to a second predicted label of the set of predicted labels; providing one or more inputs to a machine learning model based on: the second text string; and the first predicted label; wherein the machine learning model has been trained through a supervised learning process based on training data; and wherein the machine learning model comprises: one or more layers that generate an embedding based on the second text string; encoding logic that generates an encoding of the first predicted label; combination logic that combines the embedding with the encoding to produce a combined result; and an output layer that generates one or more outputs based on the combined result; determining a corrected second label for the second text string based on an output of the one or more outputs generated by the machine learning model in response to the one or more inputs; replacing the second predicted label with the corrected second label for the second text string; receiving user input related to the corrected second label, wherein the machine learning model is re-trained through a process in which parameters of the machine learning model are iteratively adjusted based on the user input.
2. The method of claim 1, wherein the machine learning model determines character-level embeddings of a plurality of characters from the second text string, wherein: each respective character-level embedding of the character-level embeddings is a vector representation of a respective character of the plurality of characters.
3. The method of claim 2, wherein the encoding logic determines a one-hot encoded vector representing the first predicted label based on a set of possible labels.
4. The method of claim 3, wherein the combination logic combines the one-hot encoded vector with the character-level embeddings.
5. The method of claim 4, wherein the machine learning model processes the combined result through one or more fully-connected layers.
6. The method of claim 5, wherein the machine learning model processes one or more outputs from the one or more fully-connected layers through a softmax layer to determine the corrected second label.
7. The method of claim 1, further comprising: automatically populating a particular variable with the second text string based on the corrected second label; or providing output to a user via a user interface based on the second text string and the corrected second label.
8. The method of claim 1, wherein the machine learning model was trained to determine character-level embeddings based on training data comprising features of text strings associated with known labels indicating known classifications of the text strings.
9. The method of claim 1, further comprising using the machine learning model to determine a corrected third label for a third text string that follows the second text string in the particular order based on: the third text string; the second text string; the corrected second label; and the second predicted label.
10. A method for training a machine learning model, comprising: receiving training data comprising a set of known labels corresponding to a set of consecutive text strings that appear in a particular order in a document, wherein the set of consecutive text strings comprises: a first text string corresponding to a first known label of the set of known labels; and a second text string that follows the first text string in the particular order and corresponds to a second known label of the set of known labels; providing one or more inputs to a machine learning model based on: the second text string; and the first known label; wherein the machine learning model comprises: one or more layers that generate an embedding based on the second text string; encoding logic that generates an encoding of the first known label; combination logic that combines the embedding with the encoding to produce a combined result; and an output layer that generates one or more outputs based on the combined result; determining a predicted second label for the second text string based on an output of the one or more outputs generated by the machine learning model in response to the one or more inputs; and adjusting one or more parameters of the machine learning model based on a comparison of the predicted second label with the known second label.
11. The method of claim 10, wherein the machine learning model determines character-level embeddings of a plurality of characters from the second text string via an embedding layer, wherein: each respective character-level embedding of the character-level embeddings is a vector representation of a respective character of the plurality of characters; and adjusting the one or more parameters of the machine learning model based on the comparison of the predicted second label with the known second label comprises adjusting one or more parameters of the embedding layer.
12. The method of claim 11, wherein the encoding logic determines a one-hot encoded vector representing the first known label based on a set of possible labels.
13. The method of claim 12, wherein the combination logic combines the one-hot encoded vector with the character-level embeddings.
14. The method of claim 13, wherein the machine learning model processes the combined result through one or more fully-connected layers.
15. A system, comprising: one or more processors; and a memory comprising instructions that, when executed by the one or more processors, cause the system to: receive a set of predicted labels corresponding to a set of consecutive text strings that appear in a particular order in a document, wherein the set of consecutive text strings comprises: a first text string corresponding to a first predicted label of the set of predicted labels; and a second text string that follows the first text string in the particular order and corresponds to a second predicted label of the set of predicted labels; provide one or more inputs to a machine learning model based on: the second text string; and the first predicted label; wherein the machine learning model has been trained through a supervised learning process based on training data; and wherein the machine learning model comprises: one or more layers that generate an embedding based on the second text string; encoding logic that generates an encoding of the first predicted label; combination logic that combines the embedding with the encoding to produce a combined result; and an output layer that generates one or more outputs based on the combined result; determine a corrected second label for the second text string based on an output of the one or more outputs generated by the machine learning model in response to the one or more inputs; replace the second predicted label with the corrected second label for the second text string; receive user input related to the corrected second label, wherein the machine learning model is re-trained through a process in which parameters of the machine learning model are iteratively adjusted based on the user input.
16. The system of claim 15, wherein the machine learning model determines character-level embeddings of a plurality of characters from the second text string, wherein: each respective character-level embedding of the character-level embeddings is a vector representation of a respective character of the plurality of characters; and the machine learning model was trained to determine the character-level embeddings based on training data comprising features of text strings associated with known labels indicating known classifications of the text strings.
17. The system of claim 16, wherein the encoding logic determines a one-hot encoded vector representing the first predicted label based on a set of possible labels.
18. The system of claim 17, wherein the combination logic combines the one-hot encoded vector with the character-level embeddings.
19. The system of claim 18, wherein the machine learning model processes the combined result through one or more fully-connected layers.
20. The system of claim 19, wherein the machine learning model processes one or more outputs from the one or more fully-connected layers through a softmax layer to determine the corrected second label.
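
The model structure recited in claims 1 through 6 (character-level embeddings, a one-hot encoding of the preceding label, a combination step, fully-connected layers, and a softmax output) can be sketched as a forward pass. This is an illustrative sketch under stated assumptions, not the claimed implementation: the label set, character vocabulary, layer sizes, random untrained weights, concatenation as the combination step, the ReLU activation, and the omission of bias terms are all choices made here for brevity.

```python
import math
import random
from typing import List

LABELS = ["amount", "zip_code", "date", "name"]       # hypothetical label set
VOCAB = "0123456789abcdefghijklmnopqrstuvwxyz .,-/$"  # hypothetical character set
EMBED_DIM, MAX_CHARS, HIDDEN = 4, 8, 16               # hypothetical sizes

rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> List[List[float]]:
    """Small random weights; a trained model would learn these values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# One embedding vector per known character, plus a final row that is shared
# by unknown characters and padding.
char_embeddings = rand_matrix(len(VOCAB) + 1, EMBED_DIM)
INPUT_DIM = MAX_CHARS * EMBED_DIM + len(LABELS)
w_hidden = rand_matrix(HIDDEN, INPUT_DIM)
w_out = rand_matrix(len(LABELS), HIDDEN)


def embed_text(text: str) -> List[float]:
    """Look up a vector for each character and flatten to a fixed length."""
    chars = text.lower()[:MAX_CHARS].ljust(MAX_CHARS, "\0")
    vec: List[float] = []
    for ch in chars:
        idx = VOCAB.find(ch)
        vec.extend(char_embeddings[idx if idx >= 0 else len(VOCAB)])
    return vec


def one_hot(label: str) -> List[float]:
    """One-hot encoding of a label over the set of possible labels."""
    return [1.0 if label == candidate else 0.0 for candidate in LABELS]


def dense(weights: List[List[float]], inputs: List[float]) -> List[float]:
    """Fully-connected layer without bias: one dot product per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(text: str, previous_label: str) -> List[float]:
    """Return one probability per entry in LABELS for the given text string."""
    combined = embed_text(text) + one_hot(previous_label)      # combination by concatenation
    hidden = [max(0.0, h) for h in dense(w_hidden, combined)]  # fully-connected layer + ReLU
    return softmax(dense(w_out, hidden))                       # output layer


def corrected_label(text: str, previous_label: str) -> str:
    """Pick the label with the highest probability."""
    probs = predict(text, previous_label)
    return LABELS[probs.index(max(probs))]
```

Because the weights here are random, `corrected_label("35759", "amount")` returns an arbitrary member of `LABELS`. In the training method of claim 10, parameters such as `char_embeddings`, `w_hidden`, and `w_out` would be adjusted based on a comparison of the predicted label with the known label.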

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of co-pending U.S. patent application Ser. No. 18/050,092, filed Oct. 27, 2022, the contents of which are incorporated herein by reference in their entirety.

INTRODUCTION

Aspects of the present disclosure relate to techniques for correcting errors in data classification through spatial analysis using machine learning. In particular, techniques described herein involve utilizing labels of text strings that appear prior to a given text string in a document such as a spreadsheet as features for determining a corrected label for the given text string using a machine learning model.

BACKGROUND

Every year millions of people, businesses, and organizations around the world utilize software applications to assist with countless aspects of life. In some cases, a software application may automatically classify data, such as for importing data from a document into the application. However, automatic classifications may be inaccurate in some cases. For example, techniques for classifying text based only on the text itself may result in erroneous classifications. The string “35759” may, for example, be incorrectly classified as a zip code based on automated analysis when the number actually refers to a monetary amount. As such, there is a need in the art for improved techniques of reducing and/or correcting incorrect automated data classifications.

BRIEF SUMMARY

Certain embodiments provide a method for automated data classification error correction through machine learning.
The method generally includes: receiving a set of predicted labels corresponding to a set of consecutive text strings that appear in a particular order in a document, wherein the set of consecutive text strings comprises: a first text string corresponding to a first predicted label of the set of predicted labels; a second text string that follows the first text string in the particular order and corresponds to a second predicted label of the set of predicted labels; and a third text string that follows the second text string in the particular order and corresponds to a third predicted label of the set of predicted labels; providing one or more inputs to a machine learning model based on: the third text string; the second text string; the second predicted label; and the first predicted label; determining a corrected third label for the third text string based on an output provided by the machine learning model in response to the one or more inputs; replacing the third predicted label with the corrected third label for the third text string; and performing, by a computing application, one or more actions based on the third text string and the corrected third label. Other embodiments provide a method for training a machine learning model. 
The method generally includes: receiving training data comprising a set of known labels corresponding to a set of consecutive text strings that appear in a particular order in a document, wherein the set of consecutive text strings comprises: a first text string corresponding to a first known label of the set of known labels; a second text string that follows the first text string in the particular order and corresponds to a second known label of the set of known labels; and a third text string that follows the second text string in the particular order and corresponds to a third known label of the set of known labels; providing one or more inputs to a machine learning model based on: the third text string; the second text string; the second known label; and the first known label; determining a predicted third label for the third text string based on an output provided by the machine learning model in response to the one or more inputs; and adjusting one or more parameters of the machine learning model based on a comparison of the predicted third label with the known third label. Other embodiments provide a system comprising one or more processors and a non-transitory computer-readable medium comprising instructions that, when executed by the one or more processors, cause the system to perform a method. 
The method generally includes: receiving a set of predicted labels corresponding to a set of consecutive text strings that appear in a particular order in a document, wherein the set of consecutive text strings comprises: a first text string corresponding to a first predicted label of the set of predicted labels; a second text string that follows the first text string in the particular order and corresponds to a second predicted label of the set of predicted labels; and a third text string that follows the second text string in the particular order and corresponds to a third predicted label of the set of predicted labels; providing one or more inputs to a machine learning model based on: the third text string; the second text string; the second predicted label; and the first predicted label; determining a corrected third label for the third text s