EP-4740118-A1 - METHOD, SYSTEM, AND COMPUTER PROGRAM PRODUCT FOR CLASSIFYING NETWORK ACTIVITY BASED ON CLASSIFICATION-SPECIFIC DATA PATTERNS
Abstract
Described are a system, method, and computer program product for classifying network activity based on classification-specific data patterns. The method includes receiving training data associated with historic network activity. The method includes training a supervised learning model to output a classification of a plurality of classifications associated with network activity. The method includes training an unsupervised learning model to output an outlier score associated with each classification. The method includes receiving activity data in a subsequent time period and determining outlier scores for the activity data. Determining the outlier scores includes inputting the activity data to the trained unsupervised learning model and determining the outlier scores based on the trained unsupervised learning model. The method includes, in response to each outlier score satisfying a threshold, generating a classification of the activity data based on the trained supervised learning model.
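The gating step summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the per-class mean-and-spread profile stands in for the trained unsupervised learning model, the nearest-mean rule stands in for the trained supervised learning model, and a score is assumed to "satisfy" its threshold when it is at or above it (higher score means more typical). All function names, labels, and numeric values are hypothetical.

```python
from statistics import mean, pstdev

def fit_class_profiles(values, labels):
    """Stand-in for the unsupervised model: per-class mean and spread of one feature."""
    profiles = {}
    for label in sorted(set(labels)):
        xs = [v for v, l in zip(values, labels) if l == label]
        profiles[label] = (mean(xs), pstdev(xs) or 1.0)  # avoid a zero spread
    return profiles

def outlier_scores(profiles, x):
    """One score per classification; more negative means further from that class."""
    return {label: -abs(x - m) / s for label, (m, s) in profiles.items()}

def classify_if_reliable(profiles, thresholds, x):
    """Return a classification only when every per-class score satisfies its threshold."""
    scores = outlier_scores(profiles, x)
    # Assumption: a score "satisfies" its threshold when it is at or above it.
    if all(scores[label] >= thresholds[label] for label in profiles):
        # Stand-in for the supervised model: pick the class with the nearest mean.
        return min(profiles, key=lambda label: abs(x - profiles[label][0]))
    return None  # no classification is generated for this data point

# Hypothetical training data from the first time period.
values = [0.0, 1.0, 2.0, 9.0, 10.0, 11.0]
labels = ["normal", "normal", "normal", "anomalous", "anomalous", "anomalous"]
profiles = fit_class_profiles(values, labels)
thresholds = {"normal": -15.0, "anomalous": -15.0}

print(classify_if_reliable(profiles, thresholds, 2.5))    # within the learned patterns
print(classify_if_reliable(profiles, thresholds, 100.0))  # far outside every class
```

In this sketch the second data point fails the threshold check for at least one classification, so the supervised stand-in is never consulted for it.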
Inventors
- XU, HAO
- Tangri, Anurag
- CHETIA, Chiranjeet
Assignees
- Visa International Service Association
Dates
- Publication Date
- 20260513
- Application Date
- 20230707
Claims (20)
- 1. A computer-implemented method, comprising: receiving, with at least one processor, training data in a first time period associated with historic network activity; training, with at least one processor, a supervised learning model based on the training data, to produce a trained supervised learning model configured to output at least one classification of a plurality of classifications associated with network activity; training, with at least one processor, an unsupervised learning model based on the training data, to produce a trained unsupervised learning model configured to output an outlier score associated with each classification of the plurality of classifications given an input data point associated with network activity; receiving, with at least one processor, activity data in a second time period subsequent the first time period associated with network activity; determining, with at least one processor, a plurality of outlier scores for the activity data, each outlier score of the plurality of outlier scores being associated with a classification of the plurality of classifications, wherein determining the plurality of outlier scores comprises: inputting the activity data to the trained unsupervised learning model; and determining the plurality of outlier scores based on at least one output of the trained unsupervised learning model; comparing, with at least one processor, each outlier score of the plurality of outlier scores to at least one threshold; and in response to each outlier score of the plurality of outlier scores satisfying the at least one threshold, generating, with at least one processor, a classification of the activity data based on the trained supervised learning model.
- 2. The computer-implemented method of claim 1, wherein the at least one threshold comprises a plurality of thresholds, the method further comprising: determining, with at least one processor, a plurality of candidate thresholds; determining, with at least one processor, a plurality of combinations based on the plurality of candidate thresholds and the plurality of classifications, wherein each combination of the plurality of combinations comprises a plurality of pairings, and wherein each pairing of the plurality of pairings of each combination comprises a candidate threshold of the plurality of candidate thresholds associated with a classification of the plurality of classifications; generating, with at least one processor, a plurality of training scores based on the trained unsupervised learning model and the training data; determining, with at least one processor, a plurality of performance metrics associated with the plurality of combinations based on the plurality of training scores; determining, with at least one processor, a plurality of maximum thresholds based on the plurality of performance metrics; and determining, with at least one processor, the plurality of thresholds based on the plurality of maximum thresholds.
- 3. The method of claim 2, wherein generating the plurality of training scores based on the trained unsupervised learning model and the training data comprises: inputting a plurality of data points of the training data to the trained unsupervised learning model; receiving a plurality of outputs from the trained unsupervised learning model based on the plurality of data points; and generating each training score of the plurality of training scores based on an output of the plurality of outputs.
- 4. The method of claim 2, wherein each performance metric of the plurality of performance metrics is associated with a combination of the plurality of combinations, and wherein determining the plurality of performance metrics comprises, for each performance metric of the plurality of performance metrics: determining a number of the plurality of training scores that satisfy the plurality of candidate thresholds associated with the plurality of classifications for the combination associated with the performance metric, wherein the performance metric is based on the number.
- 5. The method of claim 4, wherein the plurality of performance metrics comprises a user-selected performance metric; and wherein determining the plurality of maximum thresholds based on the plurality of performance metrics comprises: comparing each performance metric of the plurality of performance metrics to a threshold score; determining a subset of performance metrics of the plurality of performance metrics comprising performance metrics that satisfy the threshold score; and determining the plurality of maximum thresholds based on the subset of performance metrics.
- 6. The method of claim 1, further comprising: repeating over a plurality of time periods: receiving, with at least one processor, new activity data in a new time period of the plurality of time periods associated with new network activity; determining, with at least one processor, a plurality of new outlier scores for the new activity data based on the trained unsupervised learning model, each new outlier score of the plurality of new outlier scores being associated with a classification of the plurality of classifications; comparing, with at least one processor, each new outlier score of the plurality of new outlier scores to the at least one threshold; and in response to at least one new outlier score of the plurality of new outlier scores not satisfying the at least one threshold, determining, with at least one processor, that a prediction of a classification for the new activity data based on the trained supervised learning model would be unreliable; and in response to a number of predictions over the plurality of time periods determined to be unreliable satisfying a threshold number, updating, with at least one processor, the trained supervised learning model.
- 7. The method of claim 6, wherein updating the trained supervised learning model comprises: receiving, with at least one processor, new training data based on network activity over the plurality of time periods; and training, with at least one processor, the supervised learning model based on the new training data.
- 8. The method of claim 1, wherein the classification of the activity data is associated with a type of anomalous network activity, the method further comprising: performing, with at least one processor, at least one remediative action based on the classification of the activity data, wherein the at least one remediative action comprises: transmitting at least one alert to a computing device of a user; disabling at least one user account associated with the activity data; changing permissions of at least one user account associated with the activity data; or any combination thereof.
- 9. A system comprising at least one processor programmed or configured to: receive training data in a first time period associated with historic network activity; train a supervised learning model based on the training data, to produce a trained supervised learning model configured to output at least one classification of a plurality of classifications associated with network activity; train an unsupervised learning model based on the training data, to produce a trained unsupervised learning model configured to output an outlier score associated with each classification of the plurality of classifications given an input data point associated with network activity; receive activity data in a second time period subsequent the first time period associated with network activity; determine a plurality of outlier scores for the activity data, each outlier score of the plurality of outlier scores being associated with a classification of the plurality of classifications, wherein, when determining the plurality of outlier scores, the at least one processor is programmed or configured to: input the activity data to the trained unsupervised learning model; and determine the plurality of outlier scores based on at least one output of the trained unsupervised learning model; compare each outlier score of the plurality of outlier scores to at least one threshold; and in response to each outlier score of the plurality of outlier scores satisfying the at least one threshold, generate a classification of the activity data based on the trained supervised learning model.
- 10. The system of claim 9, wherein the at least one threshold comprises a plurality of thresholds, and wherein the at least one processor is further programmed or configured to: determine a plurality of candidate thresholds; determine a plurality of combinations based on the plurality of candidate thresholds and the plurality of classifications, wherein each combination of the plurality of combinations comprises a plurality of pairings, and wherein each pairing of the plurality of pairings of each combination comprises a candidate threshold of the plurality of candidate thresholds associated with a classification of the plurality of classifications; generate a plurality of training scores based on the trained unsupervised learning model and the training data; determine a plurality of performance metrics associated with the plurality of combinations based on the plurality of training scores; determine a plurality of maximum thresholds based on the plurality of performance metrics; and determine the plurality of thresholds based on the plurality of maximum thresholds.
- 11. The system of claim 10, wherein each performance metric of the plurality of performance metrics is associated with a pairing of each combination of the plurality of combinations, and wherein, when determining the plurality of performance metrics, the at least one processor is programmed or configured to, for each performance metric of the plurality of performance metrics: determine a number of the plurality of training scores that satisfy the candidate threshold associated with the pairing associated with the performance metric, wherein the performance metric is based on the number.
- 12. The system of claim 11, wherein the plurality of performance metrics comprises a user-selected performance metric; and wherein, when determining the plurality of maximum thresholds based on the plurality of performance metrics, the at least one processor is programmed or configured to: compare each performance metric of the plurality of performance metrics to a threshold score; determine a subset of performance metrics of the plurality of performance metrics comprising performance metrics that satisfy the threshold score; and determine the plurality of maximum thresholds based on the subset of performance metrics.
- 13. The system of claim 9, wherein the at least one processor is further programmed or configured to: repeat over a plurality of time periods: receive new activity data in a new time period of the plurality of time periods associated with new network activity; determine a plurality of new outlier scores for the new activity data based on the trained unsupervised learning model, each new outlier score of the plurality of new outlier scores being associated with a classification of the plurality of classifications; compare each new outlier score of the plurality of new outlier scores to the at least one threshold; and in response to at least one new outlier score of the plurality of new outlier scores not satisfying the at least one threshold, determine that a prediction of a classification for the new activity data based on the trained supervised learning model would be unreliable; and in response to a number of predictions over the plurality of time periods determined to be unreliable satisfying a threshold number, update the trained supervised learning model.
- 14. The system of claim 13, wherein, when updating the trained supervised learning model, the at least one processor is programmed or configured to: receive new training data based on network activity over the plurality of time periods; and train the supervised learning model based on the new training data.
- 15. A computer program product comprising at least one non-transitory computer-readable medium comprising one or more instructions that, when executed by at least one processor, cause the at least one processor to: receive training data in a first time period associated with historic network activity; train a supervised learning model based on the training data, to produce a trained supervised learning model configured to output at least one classification of a plurality of classifications associated with network activity; train an unsupervised learning model based on the training data, to produce a trained unsupervised learning model configured to output an outlier score associated with each classification of the plurality of classifications given an input data point associated with network activity; receive activity data in a second time period subsequent the first time period associated with network activity; determine a plurality of outlier scores for the activity data, each outlier score of the plurality of outlier scores being associated with a classification of the plurality of classifications, wherein, when determining the plurality of outlier scores, the at least one processor is programmed or configured to: input the activity data to the trained unsupervised learning model; and determine the plurality of outlier scores based on at least one output of the trained unsupervised learning model; compare each outlier score of the plurality of outlier scores to at least one threshold; and in response to each outlier score of the plurality of outlier scores satisfying the at least one threshold, generate a classification of the activity data based on the trained supervised learning model.
- 16. The computer program product of claim 15, wherein the at least one threshold comprises a plurality of thresholds, and wherein the one or more instructions further cause the at least one processor to: determine a plurality of candidate thresholds; determine a plurality of combinations based on the plurality of candidate thresholds and the plurality of classifications, wherein each combination of the plurality of combinations comprises a plurality of pairings, and wherein each pairing of the plurality of pairings of each combination comprises a candidate threshold of the plurality of candidate thresholds associated with a classification of the plurality of classifications; generate a plurality of training scores based on the trained unsupervised learning model and the training data; determine a plurality of performance metrics associated with the plurality of combinations based on the plurality of training scores; determine a plurality of maximum thresholds based on the plurality of performance metrics; and determine the plurality of thresholds based on the plurality of maximum thresholds.
- 17. The computer program product of claim 16, wherein each performance metric of the plurality of performance metrics is associated with a combination of the plurality of combinations, and wherein the one or more instructions that cause the at least one processor to determine the plurality of performance metrics cause the at least one processor to, for each performance metric of the plurality of performance metrics: determine a number of the plurality of training scores that satisfy the plurality of candidate thresholds associated with the plurality of classifications for the combination associated with the performance metric, wherein the performance metric is based on the number.
- 18. The computer program product of claim 17, wherein the plurality of performance metrics comprises a user-selected performance metric; and wherein the one or more instructions that cause the at least one processor to determine the plurality of maximum thresholds based on the plurality of performance metrics cause the at least one processor to: compare each performance metric of the plurality of performance metrics to a threshold score; determine a subset of performance metrics of the plurality of performance metrics comprising performance metrics that satisfy the threshold score; and determine the plurality of maximum thresholds based on the subset of performance metrics.
- 19. The computer program product of claim 15, wherein the one or more instructions further cause the at least one processor to: repeat over a plurality of time periods: receive new activity data in a new time period of the plurality of time periods associated with new network activity; determine a plurality of new outlier scores for the new activity data based on the trained unsupervised learning model, each new outlier score of the plurality of new outlier scores being associated with a classification of the plurality of classifications; compare each new outlier score of the plurality of new outlier scores to the at least one threshold; and in response to at least one new outlier score of the plurality of new outlier scores not satisfying the at least one threshold, determine that a prediction of a classification for the new activity data based on the trained supervised learning model would be unreliable; and in response to a number of predictions over the plurality of time periods determined to be unreliable satisfying a threshold number, update the trained supervised learning model.
- 20. The computer program product of claim 19, wherein the one or more instructions that cause the at least one processor to update the trained supervised learning model cause the at least one processor to: receive new training data based on network activity over the plurality of time periods; and train the supervised learning model based on the new training data.
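The threshold search recited in claims 2 to 5 and the update trigger recited in claims 6 and 7 can be illustrated with a short Python sketch. It is a hedged illustration only: the claims do not fix the direction of "satisfying" or the meaning of "maximum thresholds", so the sketch assumes that a score satisfies a threshold when it is at or above it, uses the share of training points passing every threshold as the performance metric, and reads "maximum thresholds" as the qualifying combination with the largest thresholds. The names `select_thresholds`, `should_update_model`, `min_coverage`, and `threshold_number`, and all numeric values, are hypothetical.

```python
from itertools import product

def select_thresholds(training_scores, classes, candidates, min_coverage):
    """Grid-search one candidate threshold per classification.

    training_scores: one dict per training data point, mapping each
    classification to its score (assumed: higher score = more typical).
    """
    qualifying = []
    for combo in product(candidates, repeat=len(classes)):
        pairing = dict(zip(classes, combo))  # one candidate threshold per classification
        # Performance metric: share of training points whose scores satisfy
        # every threshold in this combination.
        kept = sum(
            all(scores[c] >= pairing[c] for c in classes)
            for scores in training_scores
        )
        if kept / len(training_scores) >= min_coverage:
            qualifying.append(pairing)
    if not qualifying:
        return None
    # Assumed reading of "maximum thresholds": the strictest qualifying combination.
    return max(qualifying, key=lambda p: sum(p.values()))

def should_update_model(period_scores, thresholds, threshold_number):
    """Flag an update once enough periods have a score that misses its threshold."""
    unreliable = sum(
        any(scores[c] < thresholds[c] for c in thresholds)
        for scores in period_scores
    )
    return unreliable >= threshold_number

classes = ["normal", "anomalous"]
training_scores = [
    {"normal": -0.5, "anomalous": -4.0},
    {"normal": -1.0, "anomalous": -3.0},
    {"normal": -2.0, "anomalous": -1.0},
    {"normal": -6.0, "anomalous": -0.5},
]
chosen = select_thresholds(training_scores, classes, [-8.0, -5.0, -2.0], min_coverage=0.75)
print(chosen)  # {'normal': -2.0, 'anomalous': -5.0}

later_periods = [
    {"normal": -1.0, "anomalous": -1.0},
    {"normal": -9.0, "anomalous": -1.0},
    {"normal": -3.0, "anomalous": -7.0},
]
print(should_update_model(later_periods, chosen, threshold_number=2))  # True
```

An exhaustive grid grows as the number of candidates raised to the number of classifications, so this form is only practical for small candidate sets; a real system would likely need a cheaper search.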
Description
METHOD, SYSTEM, AND COMPUTER PROGRAM PRODUCT FOR CLASSIFYING NETWORK ACTIVITY BASED ON CLASSIFICATION-SPECIFIC DATA PATTERNS
BACKGROUND
1. Technical Field
[0001] This disclosure relates generally to network monitoring, and, in some non-limiting embodiments or aspects, to systems, methods, and computer program products for classifying network activity based on classification-specific data patterns to determine classification reliability.
2. Technical Considerations
[0002] Classification models may be used to classify new data as belonging to one or more classifications (e.g., classes, groupings, etc.), based on patterns learned from historical data. However, classification model output may be inaccurate or difficult to interpret based on the patterns of the new data, particularly for patterns that might differ from the historical data. Furthermore, classification models may be biased based on data imbalances, where some classifications have disproportionately more data points for training, which may result in higher rates of false positives or false negatives for classifications with fewer training data points. Additionally, while a classification model may be accurate for one time period, data patterns for a given classification may change in a subsequent time period, making output classifications inaccurate over time. Numerous computational inefficiencies are associated with such inaccurate and biased classification models, such as wasted computing resources (e.g., memory, bandwidth, processing time, etc.) expended in responding to false positives, more computing resources required to train a sufficiently accurate model, and slower reaction time to detect that a classification model has become unreliable for continued use. Such inefficiencies are especially troublesome for classification models that are configured to classify network activity (e.g., transactions, user access to computing resources, communications, etc.), so that a modeling system can react accordingly (e.g., respond to anomalous activity, disable nefarious user accounts, etc.).
[0003] There is a need in the art for a technical solution to better predict classifications for new data, which is resilient to data imbalance, data bias, and shifting data patterns over time. There is a further need in the art for a technical solution to assess the reliability of the output of classification models, both for the use of generating classifications and determining when a model may need to be retrained.
SUMMARY
[0004] According to some non-limiting embodiments or aspects, provided are systems, methods, and computer program products for classifying network activity based on classification-specific data patterns that overcome some or all of the deficiencies identified above.
[0005] According to some non-limiting embodiments or aspects, provided is a computer-implemented method for classifying network activity based on classification-specific data patterns. The method includes receiving, with at least one processor, training data in a first time period associated with historic network activity. The method also includes training, with at least one processor, a supervised learning model based on the training data to produce a trained supervised learning model configured to output at least one classification of a plurality of classifications associated with network activity. The method further includes training, with at least one processor, an unsupervised learning model based on the training data to produce a trained unsupervised learning model configured to output an outlier score associated with each classification of the plurality of classifications given an input data point associated with network activity. The method further includes receiving, with at least one processor, activity data in a second time period subsequent the first time period associated with network activity. The method further includes determining, with at least one processor, a plurality of outlier scores for the activity data, each outlier score of the plurality of outlier scores being associated with a classification of the plurality of classifications. Determining the plurality of outlier scores includes inputting the activity data to the trained unsupervised learning model and determining the plurality of outlier scores based on at least one output of the trained unsupervised learning model. The method further includes comparing, with at least one processor, each outlier score of the plurality of outlier scores to at least one threshold. The method further includes, in response to each outlier score of the plurality of outlier scores satisfying the at least one threshold, generating, with at least one processor, a classification of the activity data based on the trained supervised learning model.
[0006] In some non-limiting embodiments or aspects, the at least one threshold may include a plurality of thresholds. The method may also include determining, with at least one proce