US-20260127666-A1 - INTELLIGENT DATASTREAM MERGING LEVERAGING MACHINE LEARNING FOR EXTRACTION, TRANSFORMATION, AND LOADING
Abstract
Systems and methods are provided for information identification and categorization such that data from a first dataset may be matched to or identified with data from a second dataset through use of a first, unsupervised machine learning algorithm followed by use of a second, rules-based machine learning algorithm. The novel combination of the algorithms to identify, combine, and/or categorize data sets preferably relies upon at least two different data sources that are not fully communicative with one another, such that the information that each data source contains may be a subset of the information of the other data source(s). Rules may associate particular items within the data, including dates, amounts, and text. One notable application is the creation of a time-based payment expectation. Novel application of machine learning algorithms permits scalable and repeatable processes for identification and categorization that have not previously been possible within available resource budgets.
Inventors
- Mat Lavoie
- Pawel Kuras
Assignees
- WWW.TRUSTSCIENCE.COM INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20241101
Claims (20)
- 1 . A system for efficiently matching and merging data comprising: a first datastore on a first server that includes a compilation of data regarding an individual that includes debt amounts, credit identifiers, expected payment information, and actual payment information; a second datastore on one or more second servers that each include data that identifies debits to an account of the individual; a data compilation server containing instructions for executing a data compilation engine for obtaining debit and credit data regarding the individual from the first datastore and the second datastore; a parallel processing system containing instructions for executing and applying unsupervised machine learning algorithms to debit and credit data obtained by the data compilation engine to analyze and cluster the first data and the second data based upon structures or patterns, resulting in a clustered data set; a parallel processing system containing instructions for applying at least one rules-based machine learning model to the clustered data set to determine a time-based expected payment burden for the individual; and an output device for outputting the time-based expected payment burden for the individual.
- 2 . The system of claim 1 , wherein the first datastore houses credit bureau data and the first datastore includes data identifying at least one expected frequency of payment and at least one most recent payment by the individual.
- 3 . The system of claim 1 , wherein the second datastore comprises data regarding a plurality of financial accounts linked to the individual.
- 4 . The system of claim 1 , further comprising: a parallel processing system making an automated decision, using a second machine learning algorithm, whether to extend credit to the individual based on data including the expected payment burdens for the individual; and an output device for publishing to the individual an offer of credit.
- 5 . The system of claim 1 , wherein the instructions for applying at least one rules-based machine learning model include instructions for applying: at least one rule for associating similar dates in first data and second data; at least one rule for associating similar amounts in first data and second data; and at least one rule for associating similar text in first data and second data.
- 6 . The system of claim 1 , wherein the time-based expected payment burden for the individual includes: indications of monthly or quarterly actual payments made by the individual with respect to a plurality of debts; and indications of monthly or quarterly actual payments made by the individual based upon debits in the second data that do not correspond to actual payment information in the data obtained from the first datastore.
- 7 . The system of claim 6 , wherein the time-based expected payment burden for the individual further includes indications of expected monthly or quarterly debits that are not reflected in data obtained from the first datastore.
- 8 . A method to efficiently match and merge data comprising: obtaining debit and credit data regarding an individual from a plurality of data sources including first data from a first data source and second data from at least one second data source, wherein the first data obtained from the first data source includes a compilation of data regarding the individual that includes debt amounts, credit identifiers, expected payment information, and actual payment information, wherein the second data obtained from the at least one second data source identifies debits to one or more accounts of the individual; applying unsupervised machine learning algorithms to the first data and the second data to analyze and cluster the first data and the second data based upon structures or patterns, resulting in a clustered data set; applying at least one rules-based machine learning model to the clustered data set to determine a time-based expected payment burden for the individual; and outputting the time-based expected payment burden for the individual.
- 9 . The method of claim 8 , wherein the first data source is a credit bureau and the first data includes data identifying at least one expected frequency of payment and at least one most recent payment by the individual.
- 10 . The method of claim 8 , wherein the at least one second data source comprises a plurality of financial accounts linked to the individual.
- 11 . The method of claim 8 , further comprising: making an automated decision, using a second machine learning algorithm, whether to extend credit to the individual based on data including the expected payment burdens for the individual; and publishing to the individual an offer of credit.
- 12 . The method of claim 8 , wherein at least one of the at least one rules-based machine learning model includes: at least one rule for associating similar dates in first data and second data; at least one rule for associating similar amounts in first data and second data; and at least one rule for associating similar text in first data and second data.
- 13 . The method of claim 8 , wherein the time-based expected payment burden for the individual includes: indications of monthly or quarterly actual payments made by the individual with respect to a plurality of debts; and indications of monthly or quarterly actual payments made by the individual based upon debits in the second data that do not correspond to actual payment information in the first data.
- 14 . The method of claim 13 , wherein the time-based expected payment burden for the individual further includes indications of expected monthly or quarterly debits that are not reflected in the first data.
- 15 . A non-transitory computer-readable storage medium comprising: instructions that, when executed by a device comprising a processor, facilitate performance of operations comprising: obtaining debit and credit data regarding an individual from a plurality of data sources including first data from a first data source and second data from at least one second data source, wherein the first data obtained from the first data source includes a compilation of data regarding the individual that includes debt amounts, credit identifiers, expected payment information, and actual payment information, wherein the second data obtained from the at least one second data source identifies debits to one or more accounts of the individual; applying unsupervised machine learning algorithms to the first data and the second data to analyze and cluster the first data and the second data based upon structures or patterns, resulting in a clustered data set; applying at least one rules-based machine learning model to the clustered data set to determine a time-based expected payment burden for the individual; and outputting the expected payment burdens for the individual.
- 16 . The medium of claim 15 , wherein the first data source is a credit bureau and the first data includes data identifying at least one expected frequency of payment and at least one most recent payment by the individual.
- 17 . The medium of claim 15 , wherein the at least one second data source comprises a plurality of financial accounts linked to the individual.
- 18 . The medium of claim 15 , further comprising: making an automated decision, using a second machine learning algorithm, whether to extend credit to the individual based on data including the expected payment burdens for the individual; and publishing to the individual an offer of credit.
- 19 . The medium of claim 15 , wherein at least one of the at least one rules-based machine learning model includes: at least one rule for associating similar dates in first data and second data; at least one rule for associating similar amounts in first data and second data; and at least one rule for associating similar text in first data and second data.
- 20 . The medium of claim 15 , wherein the time-based expected payment burden for the individual includes: indications of monthly or quarterly actual payments made by the individual with respect to a plurality of debts; and indications of monthly or quarterly actual payments made by the individual based upon debits in the second data that do not correspond to actual payment information in the first data.
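The two-stage flow recited in the independent claims (cluster the combined data, then apply date, amount, and text rules to arrive at an expected payment burden) can be illustrated with a minimal Python sketch. It is not the claimed implementation: the greedy grouping below is a simple stand-in for an unsupervised clustering algorithm, and the names Debit, Tradeline, cluster_debits, matches, and expected_burden, all tolerance values, and the sample records are hypothetical. The sketch also assumes that any debit cluster with two or more occurrences is a recurring obligation.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from statistics import median


@dataclass
class Debit:
    """One debit from a bank account (the second data source). Hypothetical shape."""
    day: date
    amount: float
    text: str


@dataclass
class Tradeline:
    """One credit-bureau entry (the first data source). Hypothetical shape."""
    creditor: str
    expected_payment: float
    last_payment: date


def similar(a: str, b: str) -> float:
    """Case-insensitive text similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_debits(debits, amount_tol=0.05, text_sim=0.6):
    """Stage 1 stand-in: greedy grouping by similar amount and descriptor text.

    A debit joins the first cluster whose first member is close in amount
    and text; otherwise it starts a new cluster.
    """
    clusters = []
    for d in sorted(debits, key=lambda x: x.day):
        for c in clusters:
            rep = c[0]
            close_amount = abs(d.amount - rep.amount) <= amount_tol * rep.amount
            if close_amount and similar(d.text, rep.text) >= text_sim:
                c.append(d)
                break
        else:  # no existing cluster accepted this debit
            clusters.append([d])
    return clusters


def matches(cluster, tl, amount_tol=0.05, day_tol=5, text_sim=0.5):
    """Stage 2: rules associating similar amounts, dates, and text."""
    typical = median(d.amount for d in cluster)
    amount_ok = abs(typical - tl.expected_payment) <= amount_tol * tl.expected_payment
    date_ok = any(abs((d.day - tl.last_payment).days) <= day_tol for d in cluster)
    text_ok = any(similar(d.text, tl.creditor) >= text_sim for d in cluster)
    return amount_ok and date_ok and text_ok


def expected_burden(debits, tradelines):
    """Bureau obligations plus recurring debits the bureau data does not reflect."""
    burden = {tl.creditor: tl.expected_payment for tl in tradelines}
    for c in cluster_debits(debits):
        recurring = len(c) >= 2  # assumption: two or more occurrences means recurring
        if recurring and not any(matches(c, tl) for tl in tradelines):
            burden[c[0].text] = median(d.amount for d in c)
    return burden


# Illustrative, made-up records.
debits = [
    Debit(date(2024, 1, 5), 350.00, "ACME AUTO FINANCE PMT"),
    Debit(date(2024, 2, 5), 350.00, "ACME AUTO FINANCE PMT"),
    Debit(date(2024, 1, 15), 1200.00, "RENT MAPLE APTS"),
    Debit(date(2024, 2, 15), 1200.00, "RENT MAPLE APTS"),
    Debit(date(2024, 2, 9), 42.17, "GROCERY MART"),
]
tradelines = [Tradeline("Acme Auto Finance", 350.00, date(2024, 2, 5))]
burden = expected_burden(debits, tradelines)
```

In this sketch the auto-loan debits are associated with the bureau tradeline and counted once, the recurring rent debits are added as an obligation that the bureau data does not reflect, and the one-off grocery purchase is ignored. A production system would presumably replace the greedy grouping with a proper clustering model and derive the payment frequency from the dates rather than assume it.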
Description
TECHNICAL FIELD
This invention relates generally to the serial application of machine learning algorithms to identify and categorize potentially related data from a plurality of data sources, which permits, for example, determination of a time-based expected payment burden. More particularly, this invention relates to the field of intelligent datastream merging, leveraging machine learning for entity resolution, extraction, transformation, and loading (“ETL”).
BACKGROUND
In various contexts, persons or entities seeking to identify and consolidate information encounter problems that make the identification and categorization of such information difficult when performed with a conventional computer. Use of human labor or intellect to solve the problem is both impractical and untimely due, at least in part, to the large amount of data that must be processed and the compressed times in which such processing must occur.
One specific example of this identification and categorization relates to the combination of a person's, family's, or company's income data and that same entity's debt data. Specifically, a problem may occur when considering the information disparities between a credit bureau and one or more banks of the entity. The problem includes a common situation in which no single source of a debt profile of an entity currently exists. Certain information may be available in a credit bureau report, which compiles information from certain sources, but not all sources: information regarding various outstanding debts, information regarding payments on such debts, and information regarding whether payments were timely or missed, along with other information that may be relevant in determining whether an entity is expected to be able to pay debts or not.
Other information may be available in banking records, such as account balances, account balance history, debits, credits, handwritten check records, and electronic check records, as well as information regarding payees, dates, amounts, and other information associated with such checks. This information might be available in similar or different form for handwritten checks versus checks that are issued through an automated payment system or an electronic check payment system. For example, for a handwritten check written in cursive, the bank may not have an OCR capture or other data that accurately portrays the payee name or the memo line information that indicates the purpose of the check. By contrast, in electronic payments, such information (if it is included) is often in digital form that was typed at some point by the account holder.
To build the most complete data set, it may be desirable to combine information from both a credit bureau and a bank, and more preferably from multiple credit bureaus and multiple banks where such information is available. It is often the case that banking records hold an incomplete set of data. This might not be true for a person who pays solely through banks, who is up to date with payments on all accounts, and who does not prepay any payments or pay amounts other than the exact balance due in any given payment. However, the situation is very different where persons receive income or make payments outside of the banking system. Many people may find themselves the recipients of cash payments that are never recorded within a bank. Similarly, such people might make their own cash payments on certain obligations and collect a handwritten paper receipt for them. For example, a person might physically present themselves at a utility company, pay in cash, and obtain a receipt for the cash payment without ever interacting with a bank.
Such cash payments are also possible with respect to multiple types of accounts where outstanding balances or regularly occurring balances are incurred. Such payments may not be recorded in a bank account's records. In addition, people who use banks less frequently may prefer to purchase money orders from various vendors for making payments. Such money orders may be purchased with cash and sent directly to the entity to whom a person has an obligation, and might never appear on banking records. Similarly, various persons purchase cashier's checks or other secured methods of payment for paying various obligations without tying such payments to any bank account associated with the person. Thus, it becomes important to identify both credits and debits in banking and to classify them appropriately.
One situation where banking and credit bureau records might further appear inconsistent is where various payments are not shown in full detail. For example, a person may use a credit card, cash, or a check to pay for gasoline for an automobile. Such a payment might not be recorded as a gasoline purchase. In some establishments, it is possible to purchase groceries, gasoline, or automotive repair services in the same facility. A payment to that facility may not register as bei