US-12619684-B1 - Anomaly detection using a semi-supervised locally adaptive similarity kernel

Abstract

A method of detecting anomalies in data includes receiving a dataset with a plurality of multidimensional data points (MDDPs), wherein a portion of the plurality of MDDPs are labeled and wherein other MDDPs of the plurality of MDDPs are unlabeled; based on a neighborhood size k of the plurality of MDDPs, computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs; and generating a locally adaptive similarity (LAS) kernel of a newly arrived MDDP (NAMDDP) based on the neighborhood radius σx. The method additionally includes applying a random walk model to the LAS kernel to determine a probability of the NAMDDP being an anomaly; and if the NAMDDP is an anomaly, outputting data associated with an alarm or notification responsive to the detection of the anomaly.

Inventors

Amit Bermanis
Amir Averbuch
David Segev

Assignees

ThetaRay Ltd.

Dates

Publication Date: 20260505
Application Date: 20240509

Claims (20)

1. A method of detecting anomalies in data, comprising: receiving a dataset comprising a plurality of multidimensional data points (MDDPs) wherein a portion of the plurality of MDDPs are labeled and wherein other MDDPs of the plurality of MDDPs are unlabeled; based on a neighborhood size k of the plurality of MDDPs, computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs; generating a locally adaptive similarity (LAS) kernel of a newly arrived MDDP (NAMDDP) based on the neighborhood radius σx; applying a single step random walk model to the LAS kernel to determine a probability of the NAMDDP being an anomaly; and if the NAMDDP is an anomaly, outputting data associated with an alarm or notification responsive to the detection of the anomaly.
2. The method of claim 1, wherein the computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs based on a neighborhood size k of the plurality of MDDPs, includes: receiving the neighborhood size k for the plurality of MDDPs; computing the reference dataset for the plurality of MDDPs; and computing the neighborhood radius σx for each MDDP in the reference dataset based on the neighborhood size k.
3. The method of claim 1, further comprising applying a k-nearest neighbors (KNN) algorithm to each MDDP in the reference dataset for use in the generating of the LAS kernel.
4. The method of claim 1, further comprising assigning to each MDDP and/or NAMDDP a score that reflects a normality in a probability assignment between 0 and 1 based on a scoring function S.
5. The method of claim 4, wherein a score equal to or approximating 0 is associated with a normal data point, and a score equal to or approximating 1 is associated with an abnormal data point.
6. The method of claim 1, further comprising assigning values of −1, 0, and 1 to the data points in the reference data set, wherein the value −1 is associated with a normal data point, the value 1 is associated with an abnormal data point, and the value 0 is associated with an unknown data point.
7. The method of claim 1, wherein the LAS kernel is described by a similarity measure S: n × n → [0,1].
8. A computer program product, comprising: a non-transitory tangible storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method of detecting anomalies in data that includes: receiving a dataset comprising a plurality of multidimensional data points (MDDPs) wherein a portion of the plurality of MDDPs are labeled and wherein other MDDPs of the plurality of MDDPs are unlabeled; based on a neighborhood size k of the plurality of MDDPs, computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs; generating a locally adaptive similarity (LAS) kernel of a newly arrived MDDP (NAMDDP) based on the neighborhood radius σx; applying a single step random walk model to the LAS kernel to determine a probability of the NAMDDP being an anomaly; and if the NAMDDP is an anomaly, outputting data associated with an alarm or notification responsive to the detection of the anomaly.
9. The computer program product of claim 8, wherein the computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs based on a neighborhood size k of the plurality of MDDPs, includes: receiving the neighborhood size k for the plurality of MDDPs; computing the reference dataset for the plurality of MDDPs; and computing the neighborhood radius σx for each MDDP in the reference dataset based on the neighborhood size k.
10. The computer program product of claim 8, wherein the method further includes applying a k-nearest neighbors (KNN) algorithm to each MDDP in the reference dataset for use in the generating of the LAS kernel.
11. The computer program product of claim 8, wherein the method further includes assigning to each MDDP and/or NAMDDP a score that reflects a normality in a probability assignment between 0 and 1 based on a scoring function S.
12. The computer program product of claim 11, wherein a score equal to or approximating 0 is associated with a normal data point, and a score equal to or approximating 1 is associated with an abnormal data point.
13. The computer program product of claim 8, wherein the method further includes assigning values of −1, 0, and 1 to the data points in the reference data set, wherein the value −1 is associated with a normal data point, the value 1 is associated with an abnormal data point, and the value 0 is associated with an unknown data point.
14. The computer program product of claim 8, wherein the LAS kernel is described by a similarity measure S: n × n → [0,1].
15. A computer system, comprising: a hardware processor configurable to perform a method for detecting anomalies in data that includes: receiving a dataset comprising a plurality of multidimensional data points (MDDPs) wherein a portion of the plurality of MDDPs are labeled and wherein other MDDPs of the plurality of MDDPs are unlabeled; based on a neighborhood size k of the plurality of MDDPs, computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs; generating a locally adaptive similarity (LAS) kernel of a newly arrived MDDP (NAMDDP) based on the neighborhood radius σx; applying a single step random walk model to the LAS kernel to determine a probability of the NAMDDP being an anomaly; and if the NAMDDP is an anomaly, outputting data associated with an alarm or notification responsive to the detection of the anomaly.
16. The computer system of claim 15, wherein the computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs based on a neighborhood size k of the plurality of MDDPs, includes: receiving the neighborhood size k for the plurality of MDDPs; computing the reference dataset for the plurality of MDDPs; and computing the neighborhood radius σx for each MDDP in the reference dataset based on the neighborhood size k.
17. The computer system of claim 15, wherein the method further includes applying a k-nearest neighbors (KNN) algorithm to each MDDP in the reference dataset for use in the generating of the LAS kernel.
18. The computer system of claim 15, wherein the method further includes assigning to each MDDP and/or NAMDDP a score that reflects a normality in a probability assignment between 0 and 1 based on a scoring function S.
19. The computer system of claim 15, wherein the method further includes assigning values of −1, 0, and 1 to the data points in the reference data set, wherein the value −1 is associated with a normal data point, the value 1 is associated with an abnormal data point, and the value 0 is associated with an unknown data point.
20. The computer system of claim 15, wherein the LAS kernel is described by a similarity measure S: n × n → [0,1].
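
The steps recited in the claims (neighborhood radius from a neighborhood size k, LAS kernel for a newly arrived point, single step random walk, score between 0 and 1) can be illustrated with a minimal Python sketch. It is not the patented implementation, and the concrete formulas are assumptions, since the text above does not state them: σx is taken to be the distance to the k-th nearest reference point, the kernel is taken to be exp(−‖x−y‖²/(σx·σy)), and the score is the expected label after one random-walk step, rescaled from [−1, 1] to [0, 1]. All function names are hypothetical.

```python
import math


def kth_neighbor_distance(point, others, k):
    """Distance from `point` to its k-th nearest point in `others`."""
    return sorted(math.dist(point, o) for o in others)[k - 1]


def neighborhood_radii(reference, k):
    """Assumed sigma_x: distance from each reference point to its k-th
    nearest *other* reference point."""
    radii = []
    for i, x in enumerate(reference):
        others = reference[:i] + reference[i + 1:]
        radii.append(kth_neighbor_distance(x, others, k))
    return radii


def anomaly_score(x_new, reference, labels, radii, k, eps=1e-12):
    """Score in [0, 1]; values near 1 suggest an anomaly.

    labels use the convention from the claims: -1 normal, 0 unknown, 1 abnormal.
    """
    sigma_new = max(kth_neighbor_distance(x_new, reference, k), eps)
    # Logarithm of the assumed kernel exp(-d^2 / (sigma_new * sigma_y)).
    logits = [
        -(math.dist(x_new, y) ** 2) / (sigma_new * max(s, eps))
        for y, s in zip(reference, radii)
    ]
    # Subtracting the maximum keeps the normalization stable for far-away points.
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    # One random-walk step: transition probabilities are weights / total.
    expected_label = sum(w * lab for w, lab in zip(weights, labels)) / total
    return 0.5 * (expected_label + 1.0)


# Toy reference set: a labeled/unlabeled normal cluster and a small abnormal one.
reference = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.1),
             (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [-1, -1, 0, -1, 0, 1, 1, 0]
k = 2
radii = neighborhood_radii(reference, k)

score_near_normal = anomaly_score((0.05, 0.05), reference, labels, radii, k)
score_near_abnormal = anomaly_score((5.05, 5.05), reference, labels, radii, k)
ALARM_THRESHOLD = 0.5  # hypothetical cut-off for raising an alarm
if score_near_abnormal > ALARM_THRESHOLD:
    print("alarm: newly arrived point looks anomalous")
```

In this sketch unlabeled reference points (label 0) pull the score toward 0.5, so a new point surrounded only by unknown points is treated as undecided rather than as normal or abnormal. Whether the patented scoring function behaves this way is not stated in the text above.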

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority from U.S. Provisional Patent Application No. 63/465,384, filed May 10, 2023, which is incorporated herein by reference in its entirety.

FIELD

Embodiments disclosed herein relate in general to methods and systems for detecting anomalies in data by derivation of scoring functions, and in particular to detection of anomalies among multidimensional data points (MDDPs) in a semi-supervised way.

BACKGROUND

Huge amounts of data are generated by many sources. “Data” refers to a collection of information, the result of experience, observation, measurement, streaming, computing, sensing or experiment, other information within a computer system, or a set of premises that may consist of numbers, characters, images, or measurements of observations.

Static and dynamic “high dimensional big data” (HDBD) are common in a variety of fields. Exemplarily, such fields include finance (e.g., fraud detection, crimes, money laundering, human trafficking, terrorist activities), intrusion detection in cybersecurity, insurance, healthcare, streaming, energy, medical diagnosis, transportation, communication networking (e.g., protocols such as TCP/IP, UDP, HTTP, HTTPS, ICMP, SMTP, DNS, FTPS, SCADA, wireless and Wi-Fi), process control and predictive analytics/maintenance, social networking, imaging, e-mails, governmental databases, industrial data, aviation, stock market, acoustics, and bioinformatics, among many more.

HDBD is a collection of MDDPs. An MDDP is one unit of data from the original (source, raw) HDBD. An MDDP may be expressed as a combination of numeric, Boolean, integer, floating, binary or real characters. HDBD datasets (or databases) include MDDPs that may be either fixed (static) or may accumulate constantly (dynamic). MDDPs may include (or may be described by) hundreds or thousands of features.

The term “feature” (or “parameter”) refers to an individual measurable property of phenomena being observed. A feature may also be “computed”, i.e. be an aggregation of different features to derive an average, a median, a standard deviation, etc. “Feature” is also normally used to denote a piece of information relevant for solving a computational task related to a certain application. More specifically, “features” may refer to specific structures ranging from simple structures to more complex structures such as objects. The choice of features in a particular application may be highly dependent on the specific problem at hand. Features can be described in a numerical (3.14), Boolean (yes, no), ordinal (never, sometimes, always), or categorical (A, B, O) manner.

SUMMARY

In various embodiments, there is provided a method of detecting anomalies in data, including receiving a dataset with a plurality of multidimensional data points (MDDPs) wherein a portion of the plurality of MDDPs are labeled and wherein other MDDPs of the plurality of MDDPs are unlabeled; based on a neighborhood size k of the plurality of MDDPs, computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs; and generating a locally adaptive similarity (LAS) kernel of a newly arrived MDDP (NAMDDP) based on the neighborhood radius σx. The method additionally includes applying a random walk model to the LAS kernel to determine a probability of the NAMDDP being an anomaly; and if the NAMDDP is an anomaly, outputting data associated with an alarm or notification responsive to the detection of the anomaly.

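
One way to make the method summarized above concrete is the following set of formulas. They are an illustrative assumption rather than the definitions used in the embodiments: the symbols R (reference dataset), x^{(k)} (the k-th nearest reference point to x), K (kernel), ℓ (a label of −1 for normal, 0 for unknown, 1 for abnormal) and \hat{p} (anomaly probability) are introduced here only for this sketch, and the self-tuning Gaussian form of the kernel is a common choice for locally scaled similarities, not one confirmed by the text.

```latex
\begin{align*}
\sigma_x &= \lVert x - x^{(k)} \rVert
  && \text{(radius: distance from $x$ to its $k$-th nearest point in $R$)} \\
K(x, y) &= \exp\!\left( -\frac{\lVert x - y \rVert^{2}}{\sigma_x \, \sigma_y} \right)
  && \text{(assumed locally adaptive similarity, values in $(0, 1]$)} \\
p(x \to y) &= \frac{K(x, y)}{\sum_{z \in R} K(x, z)}
  && \text{(one random-walk step from the NAMDDP $x$ into $R$)} \\
\hat{p}(x) &= \frac{1}{2} \Bigl( 1 + \sum_{y \in R} p(x \to y)\, \ell(y) \Bigr)
  && \ell(y) \in \{-1, 0, 1\}, \quad \hat{p}(x) \in [0, 1]
\end{align*}
```

Under these assumptions, a point whose random-walk step lands mostly on normal reference points gets a value of \hat{p}(x) near 0, one that lands mostly on abnormal points gets a value near 1, and unknown points contribute nothing to the sum.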
In some embodiments, the computing a neighborhood radius σx for each MDDP in a reference dataset computed for the plurality of MDDPs based on a neighborhood size k of the plurality of MDDPs, includes receiving the neighborhood size k for the plurality of MDDPs; computing the reference dataset for the plurality of MDDPs; and computing the neighborhood radius σx for each MDDP in the reference dataset based on the neighborhood size k. In some embodiments, the method further includes applying a k-nearest neighbors (KNN) algorithm to each MDDP in the reference dataset for use in the generating of the LAS kernel. In some embodiments, the method further includes assigning to each MDDP and/or NAMDDP a score that reflects a normality in a probability assignment between 0 and 1 based on a scoring function S. Optionally, a score equal to or approximating 0 is associated with a normal data point, and a score equal to or approximating 1 is associated with an abnormal data point. In some embodiments, the method further includes assigning values of −1, 0, and 1 to the data points in the reference data set, wherein the value −1 is associated with a normal data point, the value 1 is associated with an abnormal data point, and the value 0 is associated with an unknown data point. In some embodiments, the LAS kernel is described by a similarity measure S: n × n → [0,1].

In various embodiments, there is provided a computer program product, including a n