Abstract: A method (300) for dynamically detecting point anomalies in time-series data is disclosed. The method (300) includes receiving time series data (208) for a current time unit. The method (300) includes dynamically creating a set of time buckets in the time unit based on a data distribution of points in the current time bucket; calculating in real-time, a first distance metric of the point corresponding to current time bucket based on the data distribution of one or more points; comparing the first distance metric of the point with a bucketing threshold value of the current time bucket; assigning the point to one of the current time bucket or a next time bucket based on the comparison; and detecting point anomalies in the set of time buckets based on a data distribution of the set of points in each of concurrent time buckets of each of a plurality of time units. [To be published with FIG. 3]
DESCRIPTION
Technical Field
This disclosure relates generally to anomaly detection, and more particularly to a method and system for dynamically detecting point anomalies in time-series data.
Background
Anomaly detection plays an important role in identifying unusual events or observations in data that differ statistically significantly from the rest of the data. Such unusual events or observations are anomalies that may indicate potential problems, such as credit card fraud, server problems, or cyberattacks. In time-series data, the anomalies may be real anomalies or contextual anomalies (or contextual spikes). The contextual anomalies may correspond to data points that deviate significantly within a specific context but appear normal outside of that context. In other words, the contextual anomalies may be anomalous only in some specific situations. For example, if users of a particular machine typically start their day at 10 AM, and a CPU usage of 30% is observed for the machine around 10 AM, then the CPU usage may not indicate an anomaly. However, if the CPU usage of 30% is observed for the machine at 6 AM, the CPU usage may indicate an anomaly and may raise suspicion of a cyberattack on the machine.
In the present state of the art, various time-series techniques, such as ARIMA, Kalman filters, or FBProphet, are used to detect contextual anomalies. These techniques may identify trends and deviations in the time series data. However, the time-series techniques require up to 6 cycles of seasonality data for training in order to capture seasonality in the data. Additionally, the time-series techniques may require hyperparameter tuning and prior assumptions about data distribution, which may not be feasible in dynamic and evolving environments.
By way of an example, a goal may be to detect contextual anomalies within device usage data (e.g., CPU or memory) on devices with limited resources. The computational requirements of training the time series techniques on the device usage data may exceed the capabilities of resource-constrained environments, such as a consumer laptop. A static threshold is typically used to address the resource limitations. However, this approach fails to capture contextual anomalies.
The present invention is directed to overcome one or more limitations stated above or any other limitations associated with the known arts.
SUMMARY
In one embodiment, a method for dynamically detecting point anomalies in time-series data is disclosed. In one example, the method may include receiving time series data for a current time unit. It may be noted that the time series data may include a plurality of points of each of one or more variables. The method may further include dynamically creating a set of time buckets in the current time unit based on a data distribution of the plurality of points. It may be noted that each of the set of time buckets may include a set of points from the plurality of points. It may also be noted that each of the set of time buckets may correspond to a variable time interval in the current time unit. Further, dynamically creating the set of time buckets may include, for each point of the plurality of points in chronological order, calculating in real time a first distance metric of the point corresponding to a current time bucket based on the data distribution of one or more of the plurality of points in the current time bucket. It may be noted that the current time bucket may be one of the set of time buckets. It may also be noted that a time value associated with the point may be greater than an end time value of the variable time interval of the current time bucket. The method may further include comparing the first distance metric of the point with a bucketing threshold value of the current time bucket. The method may further include assigning the point to one of the current time bucket or a next time bucket based on the comparison. It may be noted that the next time bucket may be one of the set of time buckets. It may also be noted that the next time bucket may be arranged next in chronological order to the current time bucket. 
The method may further include detecting one or more point anomalies in each of the set of time buckets based on a data distribution of the set of points in each of concurrent set of time buckets of each of a plurality of time units. It may be noted that the plurality of time units may include the current time unit.
In another embodiment, a system for dynamically detecting point anomalies in time-series data is disclosed. In one example, the system may include a processor and a computer-readable medium communicatively coupled to the processor. The computer-readable medium may store processor-executable instructions, which, on execution, may cause the processor to receive time series data for a current time unit. It may be noted that the time series data may include a plurality of points of each of one or more variables. The processor-executable instructions, on execution, may further cause the processor to dynamically create a set of time buckets in the current time unit based on a data distribution of the plurality of points. It may be noted that each of the set of time buckets may include a set of points from the plurality of points. It may also be noted that each of the set of time buckets may correspond to a variable time interval in the current time unit. Further, to dynamically create the set of time buckets, for each point of the plurality of points in chronological order, the processor-executable instructions, on execution, may further cause the processor to calculate in real time a first distance metric of the point corresponding to a current time bucket based on the data distribution of one or more of the plurality of points in the current time bucket. It may be noted that the current time bucket may be one of the set of time buckets. It may also be noted that a time value associated with the point may be greater than an end time value of the variable time interval of the current time bucket. The processor-executable instructions, on execution, may further cause the processor to compare the first distance metric of the point with a bucketing threshold value of the current time bucket. The processor-executable instructions, on execution, may further cause the processor to assign the point to one of the current time bucket or a next time bucket based on the comparison. 
It may be noted that the next time bucket may be one of the set of time buckets. It may also be noted that the next time bucket may be arranged next in chronological order to the current time bucket. The processor-executable instructions, on execution, may further cause the processor to detect one or more point anomalies in each of the set of time buckets based on a data distribution of the set of points in each of concurrent set of time buckets of each of a plurality of time units. It may be noted that the plurality of time units may include the current time unit.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
FIG. 1 is a block diagram of an exemplary system for dynamically detecting point anomalies in time-series data, in accordance with some embodiments of the present disclosure.
FIG. 2 illustrates a functional block diagram of an exemplary system for dynamically detecting point anomalies in time-series data, in accordance with some embodiments of the present disclosure.
FIG. 3 illustrates a flow diagram of an exemplary process for dynamically detecting point anomalies in time-series data, in accordance with some embodiments of the present disclosure.
FIG. 4 illustrates a flow diagram of an exemplary process for dynamically creating time buckets in a time unit, in accordance with some embodiments of the present disclosure.
FIG. 5 illustrates a flow diagram of an exemplary process for detecting point anomalies in time buckets, in accordance with some embodiments of the present disclosure.
FIG. 6A illustrates an exemplary confusion matrix showing results of time bucket-based anomaly detection analysis for contextual spikes in a univariate test set, in accordance with some embodiments of the present disclosure.
FIG. 6B illustrates an exemplary confusion matrix showing results of time bucket-based anomaly detection analysis for real anomalies in a univariate test set, in accordance with some embodiments of the present disclosure.
FIG. 7A illustrates an exemplary confusion matrix showing results of time bucket-based anomaly detection analysis for contextual spikes in a multivariate test set, in accordance with some embodiments of the present disclosure.
FIG. 7B illustrates an exemplary confusion matrix showing results of time bucket-based anomaly detection analysis for real anomalies in a multivariate test set, in accordance with some embodiments of the present disclosure.
FIG. 8A illustrates an exemplary comparison table representing performance data of time bucket-based anomaly detection analysis and pre-existing anomaly detection analysis techniques in a univariate test set, in accordance with some embodiments of the present disclosure.
FIG. 8B illustrates an exemplary comparison table representing performance data of time bucket-based anomaly detection analysis and pre-existing anomaly detection analysis techniques in a multivariate test set, in accordance with some embodiments of the present disclosure.
FIG. 9A illustrates an exemplary comparison table representing training and inference times of time bucket-based anomaly detection analysis and pre-existing anomaly detection analysis techniques in a univariate test set, in accordance with some embodiments of the present disclosure.
FIG. 9B illustrates an exemplary comparison table representing training and inference times of time bucket-based anomaly detection analysis with binary search-based bucket allocation and pre-existing anomaly detection analysis techniques in a univariate test set, in accordance with some embodiments of the present disclosure.
FIG. 10A illustrates an exemplary comparison table representing training and inference times of time bucket-based anomaly detection analysis and pre-existing anomaly detection analysis techniques in a multivariate test set, in accordance with some embodiments of the present disclosure.
FIG. 10B illustrates an exemplary comparison table representing training and inference times of time bucket-based anomaly detection analysis with binary search-based bucket allocation and pre-existing anomaly detection analysis techniques in a multivariate test set, in accordance with some embodiments of the present disclosure.
FIG. 11 is a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
DETAILED DESCRIPTION
Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
Referring now to FIG. 1, an exemplary system 100 for dynamically detecting point anomalies in time-series data is illustrated, in accordance with some embodiments of the present disclosure. The system 100 may include a computing device 102. The computing device 102 may be, for example, but may not be limited to, server, desktop, laptop, notebook, netbook, tablet, smartphone, mobile phone, or any other computing device, in accordance with some embodiments of the present disclosure. The computing device 102 may dynamically detect point anomalies in time series data through a dynamically created set of time buckets. The time series data may be univariate or multivariate.
As will be described in greater detail in conjunction with FIGS. 2 – 11, the computing device 102 may receive time series data for the current time unit. It should be noted that the time series data may include a plurality of points of one or more variables. The computing device 102 may further dynamically create a set of time buckets in the current time unit based on a data distribution of the plurality of points. It should be noted that each of the set of time buckets may include a set of points from the plurality of points. It should also be noted that each of the set of time buckets may correspond to a variable time interval in the current time unit. Further, to dynamically create the set of time buckets, for each point of the plurality of points in chronological order, the computing device 102 may calculate in real time a first distance metric of the point corresponding to a current time bucket based on the data distribution of one or more of the plurality of points in the current time bucket. It should be noted that the current time bucket may be one of the set of time buckets. It should also be noted that a time value associated with the point is greater than an end time value of the variable time interval of the current time bucket. For each point of the plurality of points in chronological order, the computing device 102 may further compare the first distance metric of the point with a bucketing threshold value of the current time bucket. For each point of the plurality of points in chronological order, the computing device 102 may further assign the point to one of the current time bucket or a next time bucket based on the comparison. It should be noted that the next time bucket may be one of the set of time buckets. It should also be noted that the next time bucket may be arranged next in chronological order to the current time bucket. 
The computing device 102 may further detect one or more point anomalies in each of the set of time buckets based on a data distribution of the set of points in each of concurrent set of time buckets of each of a plurality of time units. It should be noted that the plurality of time units may include the current time unit.
In some embodiments, the computing device 102 may include one or more processors 104 and a memory 106. Further, the memory 106 may store instructions that, when executed by the one or more processors 104, may cause the one or more processors 104 to dynamically detect point anomalies in time-series data, in accordance with aspects of the present disclosure. The memory 106 may also store various data (for example, time series data, distance metrics, time buckets, anomalies, and the like) that may be captured, processed, and/or required by the system 100. The memory 106 may be a non-volatile memory (e.g., flash memory, Read Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically EPROM (EEPROM) memory, etc.) or a volatile memory (e.g., Dynamic Random Access Memory (DRAM), Static Random-Access memory (SRAM), etc.).
The system 100 may further include a display 108. The system 100 may interact with a user interface 110 accessible via the display 108. The system 100 may also include one or more external devices 112. In some embodiments, the computing device 102 may interact with the one or more external devices 112 over a communication network 114 for sending or receiving various data. The communication network 114 may include, for example, but may not be limited to, a wireless fidelity (Wi-Fi) network, a light fidelity (Li-Fi) network, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a satellite network, the internet, a fiber optic network, a coaxial cable network, an infrared (IR) network, a radio frequency (RF) network, and a combination thereof. The one or more external devices 112 may include, but may not be limited to, a remote server, a laptop, a netbook, a notebook, a smartphone, a mobile phone, a tablet, or any other computing device.
Referring now to FIG. 2, a functional block diagram of an exemplary system 200 for dynamically detecting point anomalies in time-series data is illustrated, in accordance with some embodiments of the present disclosure. FIG. 2 is explained in conjunction with FIG. 1. The system 200 may be analogous to the computing device 102. The system 200 may include, within the memory 106, a time series data receiving module 202, a time bucket creating module 204, and an anomaly detecting module 206.
The time series data receiving module 202 may receive time series data 208 for a current time unit (for example, an hour, a day, a week, or a month). It may be noted that the time series data 208 may include a plurality of points of each of one or more variables. Each of the plurality of points may include a variable value and a corresponding time value in the time series data 208. As will be appreciated, the time series data 208 may be univariate data when the time series data 208 includes the plurality of points of one variable. The time series data 208 may be multivariate data when the time series data 208 includes the plurality of points of each of two or more variables. The time series data 208 may be obtained from a plurality of data sources. By way of an example, the one or more variables in the time series data 208 may correspond to one or more computational metrics (such as, CPU load, CPU utilization, CPU time, memory usage, flops, clock, idle time, GPU utilization, GPU memory usage, power, memory bandwidth, etc.) obtained from a plurality of computing devices in an organization. Further, the time series data receiving module 202 may send the time series data 208 to the time bucket creating module 204.
The time bucket creating module 204 may dynamically create a set of time buckets in the current time unit based on a data distribution of the plurality of points, to obtain bucketed time series data 210A. It may be noted that each of the set of time buckets may include a set of points from the plurality of points. It may also be noted that each of the set of time buckets may correspond to a variable time interval in the current time unit. In other words, the set of time buckets may be dynamically created based on the variable value and the time value for each of the plurality of points. To dynamically create the set of time buckets, for each point of the plurality of points in chronological order, the time bucket creating module 204 may calculate in real time, a first distance metric of the point corresponding to a current time bucket based on the data distribution of one or more of the plurality of points in the current time bucket. It may be noted that the current time bucket may be one of the set of time buckets. It may be noted that the time value associated with the point is greater than an end time value of the variable time interval of the current time bucket.
To calculate the first distance metric of the point, the time bucket creating module 204 may calculate a first mean for the current time bucket based on the data distribution of the one or more of the plurality of points in the current time bucket. Further, the time bucket creating module 204 may calculate a first standard deviation for the current time bucket based on the data distribution of the one or more of the plurality of points in the current time bucket.
Further, the time bucket creating module 204 may calculate the first distance metric of the point corresponding to the current time bucket based on the first mean and the first standard deviation. It may be noted that the first distance metric may be a Z-score when the time series data is univariate data, or may be a Mahalanobis distance when the time series data is multivariate data. The Z-score may be calculated using equation (1).
$$Z = \frac{X - \mu}{\sigma} \qquad (1)$$
Where:
X = The point being evaluated
μ = The mean of the current time bucket
σ = The standard deviation of the current time bucket
The Mahalanobis distance may be calculated using equation (2):
$$D_M(X) = \sqrt{(X - \mu)^{T} S^{-1} (X - \mu)} \qquad (2)$$
Where:
X = The point being evaluated
μ = The mean vector of the current time bucket
S = The covariance matrix of the current time bucket
S^{-1} = The inverse of the covariance matrix
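By way of illustration only, the two distance metrics of equations (1) and (2) may be sketched in Python as shown below. The function names `z_score` and `mahalanobis_2d` are hypothetical, the use of the sample (n − 1) standard deviation and covariance is an assumption not stated in this disclosure, and the Mahalanobis sketch is restricted to two variables so that the covariance matrix can be inverted in closed form.

```python
import math
from statistics import mean, stdev


def z_score(x, bucket):
    """Equation (1): (x - mean) / standard deviation of the bucket values."""
    mu = mean(bucket)
    sigma = stdev(bucket)  # assumption: sample standard deviation
    if sigma == 0:
        # Degenerate bucket: every value is identical.
        return 0.0 if x == mu else math.inf
    return (x - mu) / sigma


def mahalanobis_2d(x, bucket):
    """Equation (2) for two variables; bucket is a list of (a, b) pairs."""
    n = len(bucket)
    mu_a = mean(p[0] for p in bucket)
    mu_b = mean(p[1] for p in bucket)
    # Sample covariance matrix S = [[s_aa, s_ab], [s_ab, s_bb]] (assumption).
    s_aa = sum((p[0] - mu_a) ** 2 for p in bucket) / (n - 1)
    s_bb = sum((p[1] - mu_b) ** 2 for p in bucket) / (n - 1)
    s_ab = sum((p[0] - mu_a) * (p[1] - mu_b) for p in bucket) / (n - 1)
    det = s_aa * s_bb - s_ab ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    da, db = x[0] - mu_a, x[1] - mu_b
    # (x - mu)^T S^-1 (x - mu), using the closed-form inverse of a 2x2 matrix.
    quad = (s_bb * da * da - 2 * s_ab * da * db + s_aa * db * db) / det
    return math.sqrt(quad)
```

For more than two variables, a linear algebra library would typically be used to invert the covariance matrix; that choice is outside the scope of this sketch.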
Further, the time bucket creating module 204 may determine the bucketing threshold value based on the first mean and the first standard deviation. Further, the time bucket creating module 204 may compare the first distance metric of the point with a bucketing threshold value of the current time bucket. Further, the time bucket creating module 204 may assign the point to one of the current time bucket or a next time bucket based on the comparison. It may be noted that the next time bucket may be one of the set of time buckets. It may also be noted that the next time bucket may be arranged next in chronological order to the current time bucket.
To assign the point to one of the current time bucket or the next time bucket, when the first distance metric is within the bucketing threshold value, the time bucket creating module 204 may assign the point to the current time bucket. Further, the time bucket creating module 204 may dynamically adjust an end time value of the variable time interval corresponding to the current time bucket to a time value associated with the point in the time series data.
When the first distance metric is beyond the bucketing threshold value, the time bucket creating module 204 may assign the point to the next time bucket. Upon assigning the point to the next time bucket, the time bucket creating module 204 may compare a count of the one or more of the plurality of points in the current time bucket with a predefined minimum threshold number of points. When the count is less than the predefined minimum threshold number of points, the time bucket creating module 204 may combine the next time bucket with the current time bucket. In this scenario, the set of time buckets may include the combined time bucket. Alternatively, when the count is more than the predefined minimum threshold number of points, the time bucket creating module 204 may uphold the creation of the next time bucket. In this scenario, the set of time buckets may include the current time bucket and the next time bucket. Once the set of time buckets is created, the bucketed time series data 210A is obtained. Further, the time bucket creating module 204 may send the bucketed time series data 210A to the anomaly detecting module 206.
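The bucket creation described above may be sketched, for univariate data, as follows. This is a minimal illustration under stated assumptions and not a definitive implementation: the names `Bucket`, `within_threshold`, and `create_buckets` are hypothetical, the bucketing threshold is assumed to be a fixed Z-score limit (`z_threshold`), the population standard deviation is used, and the merging of an undersized bucket with the next bucket is simplified by letting points join the current time bucket until it holds `min_points` points.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Bucket:
    start: float  # time value of the first point in the bucket
    end: float    # time value of the last point; adjusted as points arrive
    values: list = field(default_factory=list)


def within_threshold(value, values, z_threshold):
    """True when the Z-score of value against the bucket is within the limit."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return value == mu
    return abs(value - mu) / sigma <= z_threshold


def create_buckets(points, z_threshold=3.0, min_points=3):
    """points: iterable of (time, value) pairs in chronological order."""
    buckets = []
    for t, v in points:
        if not buckets:
            buckets.append(Bucket(t, t, [v]))
            continue
        cur = buckets[-1]
        # An undersized bucket absorbs the point (simplified merge); otherwise
        # the point stays only if its distance metric is within the threshold.
        if len(cur.values) < min_points or within_threshold(v, cur.values, z_threshold):
            cur.values.append(v)
            cur.end = t  # extend the variable time interval
        else:
            buckets.append(Bucket(t, t, [v]))  # start the next time bucket
    return buckets
```

With the hypothetical input `[(0, 10), (1, 11), (2, 9), (3, 10), (4, 50), (5, 52), (6, 49), (7, 51)]`, this sketch yields two time buckets covering the intervals 0 to 3 and 4 to 7.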
The anomaly detecting module 206 may detect one or more point anomalies 210B in each of the set of time buckets based on a data distribution of the set of points in each of concurrent set of time buckets of each of a plurality of time units. It may be noted that the plurality of time units may include the current time unit. Thus, while the set of time buckets is dynamically created based on the data distribution of the time series data 208 in a single time unit (i.e., the current time unit), a context for the anomaly detection is obtained from the time series data 208 of previous time units (i.e., the plurality of time units excluding the current time unit). As is explained hereinafter, the context for the anomaly detection is based on dividing the time series data 208 of the previous time units into the set of time buckets obtained from the time series data 208 of the current time unit.
To detect the one or more point anomalies in the time series data 208, for each time bucket of the set of time buckets and for each point of the set of points in the time bucket, the anomaly detecting module 206 may calculate a second distance metric of the point based on the data distribution of the set of points in the time bucket of the current time unit and a set of historical points in a concurrent time bucket in each of the remaining time units of the plurality of time units (i.e., the previous time units). The second distance metric may be a Z-score when the time series data is univariate data. Alternatively, the second distance metric may be a Mahalanobis distance when the time series data is multivariate data. In other words, the first distance metric and the second distance metric are of the same type. However, the value of the first distance metric may not be the same as the value of the second distance metric. This is because the first distance metric is determined based on the data distribution of points in a single time unit (i.e., the current time unit). On the other hand, the second distance metric is determined based on the data distribution of points in a single time bucket across the plurality of time units. Additionally, the first distance metric is used to dynamically create time buckets, whereas the second distance metric is used to detect point anomalies.
To calculate the second distance metric of the point, the anomaly detecting module 206 may calculate a second mean for the time bucket based on the data distribution of the set of points and the set of historical points. Further, the anomaly detecting module 206 may calculate a second standard deviation for the time bucket based on the data distribution of the set of points and the set of historical points. Further, the anomaly detecting module 206 may calculate the second distance metric of the point corresponding to the time bucket based on the second mean and the second standard deviation. The anomaly detecting module 206 may also determine the anomaly detection threshold value based on the second mean and the second standard deviation.
Upon determining the second distance metric and the anomaly detection threshold value, the anomaly detecting module 206 may compare the second distance metric of the point with an anomaly detection threshold value of the time bucket. Based on the comparison, the anomaly detecting module 206 may establish the point as a point anomaly in the time bucket. The point anomaly may correspond to an outlier point in the data distribution of the set of points and the set of historical points. Similarly, the anomaly detecting module 206 may obtain the one or more point anomalies 210B in the time series data 208.
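A minimal univariate sketch of the detection step is given below for illustration. It assumes that each time unit is represented as a list of (time, value) pairs with time values measured relative to the start of the time unit, that the bucket intervals have already been obtained from the current time unit, and that the anomaly detection threshold is a fixed Z-score limit. The names `points_in` and `detect_point_anomalies` and the parameter `z_threshold` are hypothetical.

```python
from statistics import mean, pstdev


def points_in(unit, start, end):
    """Points of one time unit whose time value falls in [start, end]."""
    return [(t, v) for t, v in unit if start <= t <= end]


def detect_point_anomalies(current, history, buckets, z_threshold=3.0):
    """
    current: list of (time, value) pairs for the current time unit
    history: list of previous time units, each a list of (time, value) pairs
    buckets: list of (start, end) intervals created from the current time unit
    """
    anomalies = []
    for start, end in buckets:
        cur_pts = points_in(current, start, end)
        # Pool the current points with the historical points of the
        # concurrent time bucket in every previous time unit.
        pooled = [v for _, v in cur_pts]
        for unit in history:
            pooled += [v for _, v in points_in(unit, start, end)]
        if not pooled:
            continue
        mu, sigma = mean(pooled), pstdev(pooled)  # second mean and deviation
        if sigma == 0:
            continue  # all pooled values identical: nothing deviates
        for t, v in cur_pts:
            if abs(v - mu) / sigma > z_threshold:
                anomalies.append((t, v))
    return anomalies
```

Because the mean and standard deviation are pooled per time bucket, a value that is ordinary in one bucket may be flagged in another, which is how the contextual aspect is captured in this sketch.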
In an alternative embodiment, the system 200 may include a server (not shown in the figure) communicatively coupled to the computing device 102. In such an embodiment, the computing device 102 may periodically (typically once a week) send the time series data 208 (e.g., system usage data) to the server. Additionally, the computing device 102 may periodically (typically once a week) request from the server two types of metrics required for anomaly detection, i.e., time bucket sizes and metrics for each of the set of time buckets (e.g., mean, standard deviation, mean vector, covariance matrix, etc.). The server may then compute each of the two types of metrics. Further, the server may return the two types of metrics to the computing device 102. The computing device 102 may then continuously monitor the time series data 208 for anomalies based on the two types of metrics received. For every device resource consumption measurement, the computing device 102 may identify the corresponding time bucket based on the start time and end time (i.e., time bucket sizes), and may then assign the measurement to that time bucket. Then, the time series data 208 is standardized using equation (1) or equation (2), as applicable. Once the data is standardized, the computing device 102 may use standard statistical thresholds to determine whether a data point is an anomaly or not. This embodiment may significantly reduce resource consumption on the computing device 102 by offloading the computationally intensive statistical calculations to the server, ensuring that only periodic data exchanges take place rather than continuous processing.
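The device-side check of this embodiment may be sketched as follows for univariate data. The payload layout (a list of dictionaries with `start`, `end`, `mean`, and `std` keys, sorted by start time), the function name `is_anomaly`, and the fixed Z-score limit are all assumptions made for illustration; the disclosure does not specify a data format. The bucket is located by binary search over the bucket start times using the standard `bisect` module.

```python
from bisect import bisect_right


def is_anomaly(time_value, value, buckets, z_threshold=3.0):
    """
    buckets: server-provided metrics, sorted by "start" (hypothetical layout).
    Returns True or False, or None when no bucket covers time_value.
    """
    starts = [b["start"] for b in buckets]
    i = bisect_right(starts, time_value) - 1  # last bucket starting at or before
    if i < 0 or time_value > buckets[i]["end"]:
        return None
    b = buckets[i]
    if b["std"] == 0:
        return value != b["mean"]
    z = (value - b["mean"]) / b["std"]  # standardize with equation (1)
    return abs(z) > z_threshold


# Hypothetical weekly payload: hour-of-day buckets with CPU usage statistics.
BUCKETS = [
    {"start": 0, "end": 9, "mean": 5.0, "std": 2.0},
    {"start": 10, "end": 18, "mean": 30.0, "std": 5.0},
    {"start": 19, "end": 23, "mean": 8.0, "std": 3.0},
]
```

With these illustrative values, a CPU usage of 30% at hour 10 is not flagged, whereas the same usage at hour 6 is flagged, since only a subtraction, a division, and a comparison are performed on the device.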
It should be noted that all such aforementioned modules 202 – 206 may be represented as a single module or a combination of different modules. Further, as will be appreciated by those skilled in the art, each of the modules 202 – 206 may reside, in whole or in parts, on one device or multiple devices in communication with each other. In some embodiments, each of the modules 202 – 206 may be implemented as dedicated hardware circuit comprising custom application-specific integrated circuit (ASIC) or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. Each of the modules 202 – 206 may also be implemented in a programmable hardware device such as a field programmable gate array (FPGA), programmable array logic, programmable logic device, and so forth. Alternatively, each of the modules 202 – 206 may be implemented in software for execution by various types of processors (e.g., processor 104). An identified module of executable code may, for instance, include one or more physical or logical blocks of computer instructions, which may, for instance, be organized as an object, procedure, function, or other construct. Nevertheless, the executables of an identified module or component need not be physically located together, but may include disparate instructions stored in different locations which, when joined logically together, include the module and achieve the stated purpose of the module. Indeed, a module of executable code could be a single instruction, or many instructions, and may even be distributed over several different code segments, among different applications, and across several memory devices.
As will be appreciated by one skilled in the art, a variety of processes may be employed for dynamically detecting point anomalies in time-series data. For example, the exemplary system 100 and the associated computing device 102 may dynamically detect point anomalies in time-series data by the processes discussed herein. In particular, as will be appreciated by those of ordinary skill in the art, control logic and/or automated routines for performing the techniques and steps described herein may be implemented by the system 100 and the associated computing device 102 either by hardware, software, or combinations of hardware and software. For example, suitable code may be accessed and executed by the one or more processors on the system 100 to perform some or all of the techniques described herein. Similarly, application specific integrated circuits (ASICs) configured to perform some or all of the processes described herein may be included in the one or more processors on the system 100.
Referring now to FIG. 3, an exemplary process 300 for dynamically detecting point anomalies in time-series data is depicted via a flow chart, in accordance with some embodiments of the present disclosure. The process 300 may be implemented by the computing device 102 of the system 100.
In some embodiments, the process 300 may include receiving, by a time series data receiving module (such as the time series data receiving module 202), time series data (such as the time series data 208) for a current time unit, at step 302. It may be noted that the time series data may include a plurality of points of each of one or more variables. The process 300 may further include dynamically creating, by a time bucket creating module (such as the time bucket creating module 204), a set of time buckets in the current time unit based on a data distribution of the plurality of points, at step 304. It may be noted that each of the set of time buckets may include a set of points from the plurality of points. It may also be noted that each of the set of time buckets may correspond to a variable time interval in the current time unit. The step 304 includes steps 306, 308, and 310.
For each point of the plurality of points in chronological order, the process 300 may include calculating in real-time, by the time bucket creating module, a first distance metric of the point corresponding to a current time bucket based on the data distribution of one or more of the plurality of points in the current time bucket, at step 306. It may be noted that the current time bucket may be one of the set of time buckets. It may also be noted that a time value associated with the point is greater than an end time value of the variable time interval of the current time bucket. In an embodiment, the first distance metric may be a Z-score when the time series data corresponds to one variable (i.e., univariate data). In another embodiment, the first distance metric may be a Mahalanobis distance when the time series data corresponds to two or more variables (i.e., multivariate data). This step is explained in greater detail in conjunction with FIG. 4.
Further, for each point of the plurality of points in chronological order, the process 300 may include comparing, by the time bucket creating module, the first distance metric of the point with a bucketing threshold value of the current time bucket, at step 308. Further, for each point of the plurality of points in chronological order, the process 300 may include assigning, by the time bucket creating module, the point to one of the current time bucket or a next time bucket based on the comparison, at step 310. It may be noted that the next time bucket may be one of the set of time buckets. It may also be noted that the next time bucket may be arranged next in chronological order to the current time bucket. This step is explained in greater detail in conjunction with FIG. 4.
The process 300 may further include detecting, by an anomaly detecting module (such as the anomaly detecting module 206), one or more point anomalies in each of the set of time buckets based on a data distribution of the set of points in each of concurrent set of time buckets of each of a plurality of time units, at step 312. It may be noted that the plurality of time units may include the current time unit. This step is explained in greater detail in conjunction with FIG. 5.
Referring now to FIG. 4, an exemplary process 400 for dynamically creating time buckets in a time unit is illustrated via a flow chart, in accordance with some embodiments of the present disclosure. FIG. 4 is explained in conjunction with FIGS. 1, 2, and 3. The process 400 may be implemented by the computing device 102 of the system 100.
The process 400 may include calculating, by a time bucket creating module (such as the time bucket creating module 204), a first mean for the current time bucket based on the data distribution of the one or more of the plurality of points in the current time bucket, at step 402. Further, the process 400 may include calculating, by the time bucket creating module, the first standard deviation for the current time bucket based on the data distribution of the one or more of the plurality of points in the current time bucket, at step 404. Further, the process 400 may include calculating, by the time bucket creating module, the first distance metric of the point corresponding to the current time bucket based on the first mean and the first standard deviation, at step 406. It may be noted that the first distance metric may be a Z-score when the time series data corresponds to one variable. The first distance metric may be a Mahalanobis distance when the time series data corresponds to two or more variables.
Additionally, the process 400 may include defining the bucketing threshold value, at step 408. When the time series data corresponds to one variable, the bucketing threshold value may be predefined by a user. When the time series data corresponds to two or more variables, the bucketing threshold value may be determined, by the time bucket creating module, based on a distance metric mean and a distance metric standard deviation corresponding to the data distribution of the plurality of points in the time series data.
Further, a check may be performed at step 410 by the time bucket creating module to determine whether the first distance metric is within the bucketing threshold value. When the first distance metric is within the bucketing threshold value (‘Yes’ path), the process 400 may proceed to step 412. When the first distance metric is beyond the bucketing threshold value (‘No’ path), the process 400 may proceed to step 416.
When the first distance metric is within the bucketing threshold value, the process 400 may include assigning, by the time bucket creating module, the point to the current time bucket, at step 412. Further, the process 400 may include dynamically adjusting, by the time bucket creating module, an end time value of the variable time interval corresponding to the current time bucket to a time value associated with the point in the time series data, at step 414.
When the first distance metric is beyond the bucketing threshold value, the process 400 may include assigning, by the time bucket creating module, the point to the next time bucket, at step 416. Further, the process 400 may include comparing, by the time bucket creating module, a count of the one or more of the plurality of points in the current time bucket with a predefined minimum threshold number of points, at step 418. Further, a check may be performed at step 420 by the time bucket creating module to determine whether the count is less than the predefined minimum threshold number of points. When the count is less than the predefined minimum threshold number of points (‘Yes’ path), the process 400 may include combining, by the time bucket creating module, the next time bucket with the current time bucket, at step 422.
By way of a first example, in the time series data 208, a single variable may be subjected to analysis for anomaly detection (i.e., univariate data). The time series data receiving module 202 may receive the time series data 208 (for example, a system usage dataset), a column corresponding to a variable in the time series data 208 (i.e., the column to analyse, for example, %cpu), a bucketing threshold (i.e., maximum allowed Z-score for assigning a point to a time bucket), and a minimum bucket size (i.e., minimum number of points per time bucket).
The time bucket creating module 204 may initialize variables such as a bucket ID (‘bucket_id’), a current time bucket (‘current_bucket’), a bucket size dictionary (‘bucket_sizes’), and the like. Each time bucket (including the current time bucket) may be a list mapped to a corresponding bucket ID. For a first iteration, the bucket ID may be defined as ‘0’ (i.e., bucket_id = 0). Further, a first data point (‘first_point’) from the plurality of points (in chronological order) in the time series data 208 may be assigned to the bucket ID ‘0’. For example, if the time series data 208 is in the form of a table, the first data point may correspond to the first row of the table. It should be noted that in the first iteration, the bucket ID ‘0’ may be the current time bucket. Thus, upon assigning the first data point to the bucket ID ‘0’, the first data point may be added to the current time bucket. The bucket size dictionary may be created to track sizes of time buckets. Thus, the bucket size dictionary may include the sizes of the time buckets. Additionally, a variable to track the size of the current time bucket (‘current_bucket_size’) may be defined. A minimum bucket size (‘min_bucket_size’) may also be defined. The minimum bucket size variable may prevent creation of extremely small time buckets (i.e., time buckets with a low number of points). This would also ensure that more relevant and meaningful time buckets are created for the time series data 208.
Further, for subsequent iterations, the time bucket creating module 204 may iterate through the remaining of the plurality of points in the time series data 208, to calculate the Z-score distance for each of the remaining of the plurality of points. For each new data point (‘new_point’) in the time series data 208 (starting from the second row), a mean (i.e., the first mean) and a standard deviation (std) (i.e., the first standard deviation) of the data points in the current time bucket may be computed. Further, when the standard deviation is equal to zero, a check may be performed to determine whether the new data point is equal to the mean. When the new data point is equal to the mean, the Z-score distance may be determined as 0 (since there is no deviation of the new data point from the mean). Otherwise, when the new data point is not equal to the mean, the Z-score distance may be determined as infinity or indeterminate (i.e., ∞). Further, when the standard deviation is not equal to zero, the Z-score distance may be calculated using equation (3).
Z-score distance = |new_point − mean| / std (3)
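The Z-score distance of equation (3), including the zero-standard-deviation cases described above, can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name is hypothetical, and the use of the population standard deviation (so that a bucket holding a single point yields a standard deviation of zero) is an assumption.

```python
import math
from statistics import fmean, pstdev


def z_score_distance(new_point, bucket_points):
    """Distance of new_point from the points already in the current time bucket."""
    mean = fmean(bucket_points)
    # Assumption: population standard deviation, so a one-point bucket gives std == 0.
    std = pstdev(bucket_points)
    if std == 0:
        # No spread in the bucket: an identical value gives 0, anything else infinity.
        return 0.0 if new_point == mean else math.inf
    return abs(new_point - mean) / std  # equation (3)


print(z_score_distance(12.0, [10.0, 12.0, 14.0]))  # point equal to the bucket mean
print(z_score_distance(30.0, [10.0, 10.0]))        # zero spread, different value
```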
Further, the time bucket creating module 204 may compare the Z-score distance with a bucketing threshold value. In an embodiment, the bucketing threshold value may be a predefined value (for example, ‘0.5’, ‘0.6’, etc.).
When the Z-score distance is less than the bucketing threshold value, or when the size of the current time bucket is less than the minimum bucket size, the new data point may be merged into the current time bucket. In other words, the new data point may be assigned to the bucket ID of the current time bucket. Alternatively, when neither condition holds (i.e., the Z-score distance is not less than the bucketing threshold value and the size of the current time bucket is not less than the minimum bucket size), a new time bucket is initialized. To initialize the new time bucket, the bucket ID variable may be incremented by 1. For example, if the bucket ID is 0, upon failing to meet both the threshold and the minimum bucket size criteria, the bucket ID may be incremented to 1. Upon incrementing the bucket ID, the new data point may be assigned to the new time bucket (i.e., the incremented bucket ID).
Further, to handle small time buckets (i.e., the time buckets including a number of points less than the minimum bucket size), the time bucket creating module 204 may first check whether a last created time bucket (i.e., the current time bucket once the new time bucket is initialized) has fewer points than the minimum bucket size and the bucket ID of the last created time bucket is greater than 0. When the check is successfully completed, the last created time bucket may be merged with a previous time bucket. For example, when the number of points in a time bucket with the bucket ID ‘1’ is less than the minimum bucket size, the time bucket with the bucket ID ‘1’ may be merged with the time bucket with the bucket ID ‘0’.
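The univariate bucketing loop of the first example can be sketched as below. This is an illustrative reading, not the disclosed implementation. The function names and the sample values are hypothetical, the population standard deviation is an assumption, and the small-bucket merge is applied once to the trailing bucket at the end of the data, which is one possible interpretation of the check described above.

```python
import math
from statistics import fmean, pstdev


def z_score_distance(new_point, bucket_points):
    mean = fmean(bucket_points)
    std = pstdev(bucket_points)  # assumption: population standard deviation
    if std == 0:
        return 0.0 if new_point == mean else math.inf
    return abs(new_point - mean) / std


def create_buckets(values, bucketing_threshold, min_bucket_size):
    """Return one bucket ID per value, assigned in chronological order."""
    if not values:
        return []
    bucket_id = 0
    current_bucket = [values[0]]  # the first point opens bucket 0
    bucket_ids = [bucket_id]
    for new_point in values[1:]:
        distance = z_score_distance(new_point, current_bucket)
        if distance < bucketing_threshold or len(current_bucket) < min_bucket_size:
            current_bucket.append(new_point)  # merge into the current bucket
        else:
            bucket_id += 1                    # initialize a new bucket
            current_bucket = [new_point]
        bucket_ids.append(bucket_id)
    # Small-bucket handling: fold an undersized trailing bucket into the previous one.
    if bucket_id > 0 and len(current_bucket) < min_bucket_size:
        bucket_ids = [min(b, bucket_id - 1) for b in bucket_ids]
    return bucket_ids


cpu = [10.0, 11.0, 10.5, 50.0, 52.0, 51.0, 90.0]  # hypothetical %cpu readings
print(create_buckets(cpu, bucketing_threshold=0.5, min_bucket_size=3))
# -> [0, 0, 0, 1, 1, 1, 1]
```

In this run the final reading (90.0) briefly opens a third bucket, which holds a single point and is therefore merged back into bucket 1.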
Further, the time bucket creating module 204 may return results in the form of the bucketed time series data 210A. In other words, the time bucket creating module 204 may output the bucketed time series data 210A, where the plurality of points is assigned to the dynamically created set of time buckets. Further, the anomaly detecting module 206 may perform anomaly detection using the bucketed time series data 210A and the time series data 208 of remaining of a plurality of time units (i.e., one or more time units previous to the current time unit). For anomaly detection, point anomalies may be identified by comparing the data points of a plurality of time units within a single time bucket. This is explained in conjunction with FIG. 5.
By way of a second example, in the time series data 208, two or more variables may be subjected to analysis for anomaly detection (i.e., multivariate data). The time series data receiving module 202 may receive the time series data 208 (for example, a system usage dataset), a bucketing threshold value (i.e., maximum allowed Mahalanobis distance for assigning a point to a time bucket), and a minimum bucket size (i.e., minimum number of points per time bucket). In multivariate data, each point may be a vector including values corresponding to two or more variables.
The time bucket creating module 204 may initialize variables such as a bucket ID (‘bucket_id’), a current time bucket (‘current_bucket’), a bucket size dictionary (‘bucket_sizes’), and the like. For a first iteration, the bucket ID may be defined as ‘0’ (i.e., bucket_id = 0). Further, a first data point (‘first_point’) from the plurality of points (in chronological order) in the time series data 208 may be assigned to the bucket ID ‘0’. It should be noted that in the first iteration, the bucket ID ‘0’ may be the current time bucket. Thus, upon assigning the first data point to the bucket ID ‘0’, the first data point may be added to the current time bucket. A variable to track the size of the current time bucket (‘current_bucket_size’) may be defined. A minimum bucket size (‘min_bucket_size’) may also be defined.
Further, for subsequent iterations, the time bucket creating module 204 may iterate through each of the remaining of the plurality of points in the time series data 208, to calculate the Mahalanobis distance thereof. For each new data point (‘new_point’) in the time series data 208 (starting from the second row in the time series data 208), the time bucket creating module 204 may compute the Mahalanobis distance of the new data point corresponding to the current time bucket. To compute the Mahalanobis distance, the time bucket creating module 204 may check whether the current time bucket includes more than one point. When the current time bucket includes one point, the Mahalanobis distance is defined as ‘0’ (since there is zero deviation from the mean of the current time bucket).
When the current time bucket includes more than one point, the time bucket creating module 204 may compute a mean vector (‘mean_vec’) for the current time bucket. The mean vector may be a mean of the points (i.e., vectors) in the current time bucket. Additionally, the time bucket creating module 204 may compute a covariance matrix (‘cov_matrix’) of the current time bucket. Further, an inverse of the covariance matrix (‘inv_cov_matrix’) may be computed. Further, the time bucket creating module 204 may compute the Mahalanobis distance for the new data point based on the mean vector and the inverse of the covariance matrix of the current time bucket using the equation (2).
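The mean vector, covariance matrix, inverse, and distance computation described above can be sketched in plain Python as follows. The helper names other than ‘mean_vec’ and ‘inv_cov_matrix’ are hypothetical, the sample covariance (division by n − 1) is an assumption, and the distance is written in its standard form, the square root of (x − mean)ᵀ · inv_cov · (x − mean), which is assumed to match equation (2). The sketch also assumes the covariance matrix is invertible and raises an error otherwise.

```python
import math


def mean_vector(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def covariance_matrix(points, mean_vec):
    """Sample covariance (divides by n - 1); this normalisation is an assumption."""
    n, d = len(points), len(mean_vec)
    return [[sum((p[i] - mean_vec[i]) * (p[j] - mean_vec[j]) for p in points) / (n - 1)
             for j in range(d)] for i in range(d)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting; raises if the matrix is singular."""
    d = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(d)] for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def mahalanobis_distance(point, mean_vec, inv_cov_matrix):
    diff = [x - m for x, m in zip(point, mean_vec)]
    quad = sum(diff[i] * inv_cov_matrix[i][j] * diff[j]
               for i in range(len(diff)) for j in range(len(diff)))
    return math.sqrt(max(quad, 0.0))  # guard against tiny negative rounding errors


# Hypothetical two-variable bucket (for example, %cpu and %memory).
bucket = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean_vec = mean_vector(bucket)
inv_cov_matrix = invert(covariance_matrix(bucket, mean_vec))
print(mahalanobis_distance((3.0, 1.0), mean_vec, inv_cov_matrix))
```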
Further, the time bucket creating module 204 may perform a check to determine whether the Mahalanobis distance is less than the bucketing threshold value, or whether the current bucket size is less than the minimum bucket size. When either of the bucketing threshold value criteria or the minimum bucket size criteria is true, the new data point is classified into the current time bucket. Thus, the new data point is assigned to the bucket ID of the current time bucket.
Alternatively, when both the bucketing threshold value criteria and the minimum bucket size criteria are false, the time bucket creating module 204 may increment the bucket ID by 1 (‘new_bucket_id’). Thus, a new bucket may be initialized with the new data point. The new data point may be assigned to the incremented bucket ID (i.e., ‘new_bucket_id’). Further, the time bucket creating module 204 may return results in the form of the bucketed time series data 210A. Further, the anomaly detecting module 206 may perform anomaly detection using the bucketed time series data 210A and the time series data 208 of remaining of a plurality of time units. This is explained in conjunction with FIG. 5.
By way of a third example, the time bucket creating module 204 may adaptively determine the bucketing threshold value for the time buckets when the time series data 208 is multivariate. To this end, a mean and a standard deviation of the Mahalanobis distance may be computed. For computation, the time series data receiving module 202 may receive the time series data 208.
Further, the time bucket creating module 204 may compute a mean vector (‘mean_vec’) as the mean of the plurality of points in the time series data 208. Further, the time bucket creating module 204 may compute a covariance matrix (‘cov_matrix’) based on the time series data 208. Further, the time bucket creating module 204 may compute an inverse of the covariance matrix (‘inv_cov_matrix’). Further, the time bucket creating module 204 may calculate the Mahalanobis distance of each of the plurality of points corresponding to the mean vector and the covariance matrix. A list of Mahalanobis distances for the plurality of points may then be obtained. First, a distance list may be initialized as an empty list. Further, for each row (i.e., point) in the time series data 208, the Mahalanobis distance may be computed using equation (4).
Mahalanobis distance = mahalanobis_distance(row, mean_vec, inv_cov_matrix) (4)
Further, the time bucket creating module 204 may store the Mahalanobis distance for each point in the distance list. Further, the time bucket creating module 204 may compute the bucketing threshold value based on a mean and a standard deviation of the Mahalanobis distances in the distance list. Thus, the time bucket creating module 204 may compute a mean of the Mahalanobis distances stored in the distance list. Further, the time bucket creating module 204 may compute a standard deviation of the Mahalanobis distances stored in the distance list. Further, the time bucket creating module 204 may compute the bucketing threshold value based on the mean and the standard deviation of the Mahalanobis distances using equation (5).
Bucketing threshold value = mean_dist + (k * std_dist) (5)
where k is a multiplier that may be chosen based on the data points.
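Equation (5) can be illustrated with a short sketch that takes an already computed distance list. The function name, the sample distances, and the value of k are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def bucketing_threshold(distance_list, k):
    """Equation (5): mean of the distances plus k standard deviations."""
    mean_dist = fmean(distance_list)
    std_dist = pstdev(distance_list)  # assumption: population standard deviation
    return mean_dist + k * std_dist


distance_list = [1.0, 2.0, 3.0, 2.0]  # hypothetical Mahalanobis distances
print(bucketing_threshold(distance_list, k=2.0))
```

A larger k raises the threshold, so more points are merged into the current time bucket and fewer, longer time buckets result.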
It may be noted that when a point falls under the bucketing threshold value, or when the bucket size of a time bucket is less than the minimum bucket size, the point is merged into the time bucket. Otherwise, a new bucket is started with the point.
Referring now to FIG. 5, an exemplary process 500 for detecting point anomalies in time buckets is depicted via a flow chart, in accordance with some embodiments of the present disclosure. FIG. 5 is explained in conjunction with FIGS. 1, 2, 3, and 4. The process 500 may be implemented by the computing device 102 of the system 100.
For each time bucket of the set of time buckets and for each point of the set of points in the time bucket, the process 500 may include calculating, by an anomaly detecting module (such as the anomaly detecting module 206), a second distance metric of the point based on the data distribution of the set of points in the time bucket of the current time unit and a set of historical points in a concurrent time bucket in each of remaining of the plurality of time units, at step 502. It may be noted that the second distance metric may be the Z-score when the time series data corresponds to one variable. The second distance metric may be the Mahalanobis distance when the time series data corresponds to two or more variables. In other words, for anomaly detection, all points in a given time bucket are taken from the time series data of the plurality of time units. By way of an example, the plurality of time units may include 30 days and the time buckets may be 10:00 AM to 6:00 PM (i.e., first time bucket) and 6:00 PM to 10:00 PM (i.e., second time bucket). The time series data for each of the 30 days is retrieved. Then, all the points in the first time bucket for each of the 30 days are taken as a dataset. Then, the second distance metric is calculated for each of the points corresponding to the data distribution of the dataset. Similarly, the second distance metric is calculated for all the points in the second time bucket for each of the 30 days. The step 502 may include steps 504, 506, and 508.
The process 500 may include calculating, by the anomaly detecting module, a second mean for the time bucket based on the data distribution of the set of points and the set of historical points, at step 504. Further, the process 500 may include calculating, by the anomaly detecting module, the second standard deviation for the time bucket based on the data distribution of the set of points and the set of historical points, at step 506. Thereafter, the process 500 may proceed to step 508 and step 510. The process 500 may include calculating, by the anomaly detecting module, the second distance metric of the point corresponding to the time bucket based on the second mean and the second standard deviation, at step 508. Thereafter, the process 500 may proceed to step 512.
Additionally, the process 500 may include determining, by the anomaly detecting module, the anomaly detection threshold value based on the second mean and the second standard deviation, at step 510. Thereafter, the process 500 may proceed to step 512.
The process 500 may include comparing, by the anomaly detecting module, the second distance metric of the point with the anomaly detection threshold value of the time bucket, at step 512. Further, the process 500 may include establishing, by the anomaly detecting module, the point as a point anomaly in the time bucket based on the comparison, at step 514. It may be noted that the point anomaly may correspond to an outlier point in the data distribution of the set of points and the set of historical points.
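For univariate data, steps 502 to 514 can be sketched for a single time bucket as follows. The function name and sample readings are hypothetical, the population standard deviation is an assumption, and the default threshold of 3.0 is an illustrative value rather than one fixed by the process described above.

```python
from statistics import fmean, pstdev


def detect_point_anomalies(current_points, historical_points, threshold=3.0):
    """Flag points of the current time unit that are outliers in the pooled bucket data."""
    # Pool the bucket's points from the current time unit with the historical points
    # of the concurrent time bucket in the remaining time units.
    pooled = list(current_points) + list(historical_points)
    second_mean = fmean(pooled)
    second_std = pstdev(pooled)  # assumption: population standard deviation
    flags = []
    for point in current_points:
        if second_std == 0:
            flags.append(point != second_mean)
        else:
            second_distance = abs(point - second_mean) / second_std
            flags.append(second_distance > threshold)
    return flags


historical = [29.0, 30.0, 31.0] * 10   # hypothetical readings from earlier days
current = [30.0, 95.0]                 # hypothetical readings from the current day
print(detect_point_anomalies(current, historical))  # -> [False, True]
```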
In an embodiment, when the time series data corresponds to one variable, and after bucketing the data, a point whose Z-score is greater than 3 may be considered as an anomaly. Further, when the time series data corresponds to two or more variables, and after bucketing the data, the points whose squared Mahalanobis distance falls within the threshold are considered normal, while the points whose squared Mahalanobis distance exceeds the threshold are flagged as anomalies. It may be noted that the threshold may be computed using a 99.7% confidence level (α = 0.003) for d = 4 features.
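The text above does not state how the multivariate threshold is derived from the confidence level. One common choice, assumed here, is the chi-square quantile with d degrees of freedom, since the squared Mahalanobis distance of Gaussian data follows that distribution. The sketch below uses the closed-form chi-square CDF, which exists for even d, and finds the quantile by bisection. All function names are hypothetical.

```python
import math


def chi2_cdf_even(x, d):
    """Chi-square CDF for an even number of degrees of freedom d (closed form)."""
    if d <= 0 or d % 2:
        raise ValueError("this sketch only handles even, positive d")
    half = x / 2.0
    partial = sum(half ** i / math.factorial(i) for i in range(d // 2))
    return 1.0 - math.exp(-half) * partial


def chi2_threshold(confidence, d, upper=1000.0):
    """Bisection for the x at which the CDF reaches the requested confidence."""
    lo, hi = 0.0, upper
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, d) < confidence:
            lo = mid
        else:
            hi = mid
    return hi


def is_anomaly(squared_distance, threshold):
    return squared_distance > threshold


threshold = chi2_threshold(0.997, d=4)  # alpha = 0.003, d = 4 features
print(round(threshold, 2))
print(is_anomaly(25.0, threshold), is_anomaly(4.0, threshold))
```

Under this assumption the threshold for d = 4 comes out slightly above 16, so a point is flagged when its squared Mahalanobis distance exceeds that value.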
One or more embodiments of the present disclosure are described in detail with reference to the following examples. However, these examples are only for illustrative purposes and are not intended to limit the scope of the one or more embodiments of the present disclosure.
EXAMPLES
By way of an example, to evaluate anomaly detection techniques, a system usage dataset (analogous to the time series data 208) was collected from a laptop over a 4-hour period. The system usage dataset includes system metrics, such as CPU usage, memory usage, swap usage, and network activity (bytes sent/received). For examples corresponding to univariate experiments, only the CPU usage data is considered, whereas for examples corresponding to multivariate experiments, all the parameters are considered.
To simulate a longer operational period, the 4-hour data was extrapolated to 30 days, creating a continuous system usage dataset. Anomalies were then artificially introduced to test detection accuracy.
Further, two types of anomalies were injected in the univariate and multivariate data. The two types of anomalies include contextual spikes (i.e., contextual anomalies), and real anomalies (i.e., point anomalies). The contextual spikes were time-dependent and occurred at specific daily time windows, such as between 09:55 hrs and 10:05 hrs, between 15:55 hrs and 16:05 hrs, etc. As will be appreciated, in an ideal anomaly detection scenario, the contextual spikes should not be identified as point anomalies since they are repeated every day. So, when considering the context of the spikes, if the spikes are being repeated around the same time bucket (or window) every day, such spikes are not point anomalies.
The real anomalies were randomly distributed across the 30-day dataset. Moreover, the real anomalies represented abrupt and unexpected changes, independent of time patterns. The anomalies in the univariate data were inserted as extreme values of the CPU usage data, whereas in multivariate data, anomalies of various combinations of variables were inserted. The various combinations included a high memory, low CPU usage combination, a high CPU usage, low memory, and disk spike combination, a low memory, high swap usage combination, an all resources maxed combination, and a CPU spike, memory unchanged combination.
The real anomalies of the high memory, low CPU usage combination were based on a rule that the memory utilization spikes to 95-100%, while CPU utilization drops to 0-2%. The real anomalies of the high memory, low CPU usage combination represent cases where memory-intensive background processes run without significant CPU involvement. The real anomalies of the high CPU usage, low memory, and disk spike combination were based on a rule that CPU usage increases to 90-100%, memory drops to 0-10%, and disk activity spikes to 90-100%. The real anomalies of the high CPU usage, low memory, and disk spike combination simulate intensive computation tasks with high disk access and minimal memory usage. The real anomalies of the low memory, high swap usage combination were based on a rule that memory utilization drops to 0-5%, while swap usage rises to 90-100%. The real anomalies of the low memory, high swap usage combination represent memory exhaustion scenarios where the system relies heavily on swap memory. The real anomalies of the all resources maxed combination were based on a rule that CPU, memory, disk, and swap usage all rise to 98-100%. The real anomalies of the all resources maxed combination simulate system overload conditions where all resources are critically utilized. The real anomalies of the CPU spike, memory unchanged combination were based on a rule that CPU utilization rises to 85-100%, while memory fluctuates slightly within a ±10% range of its previous value. The real anomalies of the CPU spike, memory unchanged combination model short bursts of CPU-intensive processes without significant impact on memory. The anomaly insertion in the system usage dataset allows for realistic testing of anomaly detection techniques by ensuring that injected anomalies resemble real-world irregularities rather than artificial patterns.
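The injection rules above can be expressed as a small table of value ranges. The sketch below is illustrative only: the dictionary keys, metric names, and function names are hypothetical, uniform sampling within each range is an assumption, and the ±10% memory fluctuation is interpreted as relative to the previous value.

```python
import random

# Value ranges (in percent) taken from the rules described above; key names are hypothetical.
ANOMALY_RULES = {
    "high_memory_low_cpu": {"memory": (95, 100), "cpu": (0, 2)},
    "high_cpu_low_memory_disk_spike": {"cpu": (90, 100), "memory": (0, 10), "disk": (90, 100)},
    "low_memory_high_swap": {"memory": (0, 5), "swap": (90, 100)},
    "all_resources_maxed": {"cpu": (98, 100), "memory": (98, 100),
                            "disk": (98, 100), "swap": (98, 100)},
}


def inject_anomaly(row, rule_name, rng):
    """Return a copy of row with the metrics named by the rule overwritten."""
    anomalous = dict(row)
    for metric, (low, high) in ANOMALY_RULES[rule_name].items():
        anomalous[metric] = rng.uniform(low, high)  # assumption: uniform sampling
    return anomalous


def inject_cpu_spike_memory_unchanged(row, rng):
    """CPU jumps to 85-100 % while memory stays within +/-10 % of its previous value."""
    anomalous = dict(row)
    anomalous["cpu"] = rng.uniform(85, 100)
    anomalous["memory"] = row["memory"] * rng.uniform(0.9, 1.1)  # assumption: relative range
    return anomalous


rng = random.Random(0)  # fixed seed so the sketch is repeatable
row = {"cpu": 20.0, "memory": 40.0, "swap": 5.0, "disk": 10.0}  # hypothetical reading
print(inject_anomaly(row, "high_memory_low_cpu", rng))
print(inject_cpu_spike_memory_unchanged(row, rng))
```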
The system usage dataset was then divided into train and test sets (80:20 ratio, i.e., 80% of the dataset formed the train set and 20% of the dataset formed the test set). The time buckets were dynamically created and the anomaly detection threshold value was determined from the train set. The test set was bucketed according to the time buckets extracted from the train set, and anomalies were detected in the test set based on the anomaly detection threshold computed using the train set.
Referring now to FIG. 6A, an exemplary confusion matrix 600A showing results of time bucket-based anomaly detection analysis for contextual spikes in univariate test set is illustrated, in accordance with some embodiments of the present disclosure. The confusion matrix 600A provides a comparison of true and false predictions of Z-score anomalies 602A with true and false inserted CPU spikes 604A among a plurality of data points of the CPU usage variable in the system usage dataset. The confusion matrix 600A includes a column for false predicted Z-score anomalies 606A and a column for true predicted Z-score anomalies 608A. Additionally, the confusion matrix 600A includes a row for false inserted CPU spikes 610A and a row for true inserted CPU spikes 612A. Thus, the rows define the actual data and the columns define the predicted data.
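Bucketing the test set according to the time buckets extracted from the train set can be sketched as a lookup against the bucket end times. The representation of a time bucket by its end time in minutes since midnight, the function name, and the boundary values are all assumptions made for illustration.

```python
from bisect import bisect_left


def assign_to_train_bucket(minute_of_day, bucket_end_minutes):
    """Return the index of the first train-set bucket whose end time is >= the given time."""
    index = bisect_left(bucket_end_minutes, minute_of_day)
    # Assumption: times after the last boundary fall into the last bucket.
    return min(index, len(bucket_end_minutes) - 1)


# Hypothetical bucket end times in minutes since midnight: 10:00, 18:00, 22:00, 24:00.
train_bucket_ends = [600, 1080, 1320, 1440]
print([assign_to_train_bucket(m, train_bucket_ends) for m in (30, 600, 601, 1439)])
# -> [0, 0, 1, 3]
```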
‘1681’ data points were the false predicted Z-score anomalies 606A and were also the false actual inserted CPU spikes 610A. In other words, ‘1681’ data points are ‘True Negatives’ (TN).
‘23’ data points were the true predicted Z-score anomalies 608A but were the false actual inserted CPU spikes 610A. In other words, ‘23’ data points are ‘False Positives’ (FP).
‘24’ data points were the false predicted Z-score anomalies 606A but were the true actual inserted CPU spikes 612A. In other words, ‘24’ data points are ‘False Negatives’ (FN).
‘0’ data points were the true predicted Z-score anomalies 608A and were also the true actual inserted CPU spikes 612A. In other words, ‘0’ data points are ‘True Positives’ (TP).
Thus, the test set included 24 contextual spikes (i.e., inserted CPU spikes) and none of the 24 contextual spikes was falsely detected as an anomaly (i.e., Z-score anomaly). Thus, ‘0’ True Positives in this case denotes effectiveness of the dynamically created bucketing in not classifying contextual spikes as anomalies.
Referring now to FIG. 6B, an exemplary confusion matrix 600B showing results of time bucket-based anomaly detection analysis for real anomalies in the univariate test set is illustrated, in accordance with some embodiments of the present disclosure. The confusion matrix 600B provides a comparison of true and false predictions of Z-score anomalies 602B with true and false CPU real anomalies 604B among a plurality of data points of the CPU usage variable in the system usage dataset. The confusion matrix 600B may include a column for false predicted Z-score anomalies 606B and a column for true predicted Z-score anomalies 608B. Additionally, the confusion matrix 600B may include a row for false CPU real anomalies 610B and a row for true CPU real anomalies 612B. Thus, the rows define the actual data and the columns define the predicted data.
‘1705’ data points were the false predicted Z-score anomalies 606B and were also the false CPU real anomalies 610B. In other words, ‘1705’ data points are TN.
‘7’ data points were the true predicted Z-score anomalies 608B but were the false CPU real anomalies 610B. In other words, ‘7’ data points are FP.
‘0’ data points were the false predicted Z-score anomalies 606B but were the true CPU real anomalies 612B. In other words, ‘0’ data points are FN.
‘16’ data points were the true predicted Z-score anomalies 608B and were also the true CPU real anomalies 612B. In other words, ‘16’ data points are TP.
Thus, the test set included 16 real anomalies (i.e., CPU real anomalies) and all of the 16 real anomalies were detected as an anomaly (i.e., Z-score anomaly). Thus, ‘16’ True Positives in this case denotes effectiveness of the dynamically created bucketing in identifying real anomalies (i.e., point anomalies).
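The counts of the confusion matrix 600B can be condensed into precision and recall. These derived metrics are not reported above and are computed here only from the stated counts. The function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts stated for the confusion matrix 600B: TP = 16, FP = 7, FN = 0.
precision, recall = precision_recall(tp=16, fp=7, fn=0)
print(f"precision={precision:.3f}, recall={recall:.3f}")
# -> precision=0.696, recall=1.000
```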
Referring now to FIG. 7A, an exemplary confusion matrix 700A showing results of time bucket-based anomaly detection analysis for contextual spikes in multivariate test set is illustrated, in accordance with some embodiments of the present disclosure. The confusion matrix 700A provides a comparison of true and false predictions of Mahalanobis anomalies 702A with true and false inserted CPU spikes 704A among a plurality of data points. The confusion matrix 700A includes a column for false predicted Mahalanobis anomalies 706A and a column for true predicted Mahalanobis anomalies 708A. Additionally, the confusion matrix 700A includes a row for false inserted CPU spikes 710A and a row for true actual inserted CPU spikes 712A.
‘1678’ data points were the false predicted Mahalanobis anomalies 706A and were also the false inserted CPU spikes 710A. In other words, ‘1678’ data points are ‘TN’.
‘26’ data points were the true predicted Mahalanobis anomalies 708A but were the false inserted CPU spikes 710A. In other words, ‘26’ data points are ‘FP’.
‘24’ data points were the false predicted Mahalanobis anomalies 706A but were the true inserted CPU spikes 712A. In other words, ‘24’ data points are ‘FN’.
‘0’ data points were the true predicted Mahalanobis anomalies 708A and were also the true inserted CPU spikes 712A. In other words, ‘0’ data points are ‘TP’.
Thus, the test set included 24 contextual spikes (i.e., inserted CPU spikes) and none of the 24 contextual spikes was falsely detected as an anomaly (i.e., Mahalanobis anomaly). Thus, ‘0’ True Positives in this case denotes effectiveness of the dynamically created bucketing in not classifying contextual spikes as anomalies.
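By way of a non-limiting illustration, the Mahalanobis distance used as the multivariate distance metric may be sketched for two variables as follows. The function name `mahalanobis_2d`, the pairing of CPU usage with memory usage, and the sample values are hypothetical and are used only to make the sketch self-contained.

```python
import math

def mahalanobis_2d(point, samples):
    """Mahalanobis distance of a 2-D point from the distribution of 2-D samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(d2, 0.0))

# Hypothetical (CPU %, memory %) points belonging to one time bucket.
bucket = [(30, 40), (32, 43), (29, 39), (31, 42), (33, 41), (28, 38)]

near = mahalanobis_2d((31, 41), bucket)   # close to the bulk of the bucket
far = mahalanobis_2d((90, 45), bucket)    # far from the bulk of the bucket
print(near, far)
```

A point is then compared against a threshold in the same way as a Z-score is in the univariate case; the threshold itself is not fixed by this sketch.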
Referring now to FIG. 7B, an exemplary confusion matrix 700B showing results of time bucket-based anomaly detection analysis for real anomalies in a multivariate test set is illustrated, in accordance with some embodiments of the present disclosure. The confusion matrix 700B provides a comparison of true and false predictions of Mahalanobis anomalies 702B with true and false CPU real anomalies 704B among a plurality of data points. The confusion matrix 700B includes a column for false predicted Mahalanobis anomalies 706B and a column for true predicted Mahalanobis anomalies 708B. Additionally, the confusion matrix 700B includes a row for false CPU real anomalies 710B and a row for true CPU real anomalies 712B.
‘1702’ data points were the false predicted Mahalanobis anomalies 706B and were also the false CPU real anomalies 710B. In other words, ‘1702’ data points are ‘TN’.
‘13’ data points were the true predicted Mahalanobis anomalies 708B but were the false CPU real anomalies 710B. In other words, ‘13’ data points are ‘FP’.
‘0’ data points were the false predicted Mahalanobis anomalies 706B but were the true CPU real anomalies 712B. In other words, ‘0’ data points are ‘FN’.
‘13’ data points were the true predicted Mahalanobis anomalies 708B and were also the true CPU real anomalies 712B. In other words, ‘13’ data points are ‘TP’.
Thus, the test set included 13 real anomalies (i.e., CPU real anomalies) and all of the 13 real anomalies were detected as anomalies (i.e., Mahalanobis anomalies). Thus, ‘13’ True Positives in this case denotes effectiveness of the dynamically created bucketing in identifying real anomalies (i.e., point anomalies).
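By way of a non-limiting illustration, the counts stated above for the multivariate real anomalies (TN 1702, FP 13, FN 0, TP 13) may be summarised with standard confusion-matrix metrics. The function name `summarize` and the choice of precision, recall, and F1 as summary metrics are assumptions made for illustration; the disclosure reports only the raw counts.

```python
def summarize(tn, fp, fn, tp):
    """Return the total count, precision, recall and F1 for one confusion matrix."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "total": tn + fp + fn + tp,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Counts as stated above for the multivariate real anomalies.
real_multivariate = summarize(tn=1702, fp=13, fn=0, tp=13)
print(real_multivariate)
```

A recall of 1.0 corresponds to the statement that all 13 real anomalies were detected, while the precision reflects the 13 additional points that were flagged without being real anomalies.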
Referring now to FIG. 8A, an exemplary comparison table 800A representing performance data of dynamic time bucket-based anomaly detection analysis and pre-existing anomaly detection analysis techniques for a univariate test set is illustrated, in accordance with some embodiments of the present disclosure. Anomaly detection was performed on the univariate test set by the time bucket-based anomaly detection analysis technique and each of the pre-existing anomaly detection analysis techniques. The comparison table 800A may include a column for real anomalies 802A, a column for contextual spikes 804A, and a column for anomaly detection techniques 806A. The real anomaly column 802A may include a column for True Positive (TP) anomalies predicted, a column for False Positive (FP) anomalies predicted, a column for True Negative (TN) anomalies predicted, and a column for False Negative (FN) anomalies predicted. The contextual spikes column 804A may also include a column for TP, a column for FP, a column for TN, and a column for FN.
In an ideal scenario, none of the contextual spikes 804A should be identified as point anomalies because they are repeating events in the time buckets. Hence, the TP count for the contextual spikes 804A should be as low as possible. Conversely, all the real anomalies 802A should be identified as point anomalies because they are exceptional events. Hence, the FN count for the real anomalies 802A should be as low as possible.
For the anomaly detection technique 806A ‘ARIMA’ 808A, among the real anomalies 802A, ‘16’ TP data points, ‘26’ FP data points, ‘1686’ TN data points, and ‘0’ FN data points were obtained. Among the contextual spikes 804A, ‘24’ TP data points, ‘18’ FP data points, ‘1686’ TN data points, and ‘0’ FN data points were obtained.
For the anomaly detection technique 806A ‘Kalman Filters’ 810A, among the real anomalies 802A, ‘16’ TP data points, ‘31’ FP data points, ‘1681’ TN data points, and ‘0’ FN data points were obtained. Among the contextual spikes 804A, ‘24’ TP data points, ‘23’ FP data points, ‘1681’ TN data points, and ‘0’ FN data points were obtained.
For the anomaly detection technique 806A ‘FBProphet’ 812A, among the real anomalies 802A, ‘4’ TP data points, ‘118’ FP data points, ‘1594’ TN data points, and ‘12’ FN data points were obtained. Among the contextual spikes 804A, ‘6’ TP data points, ‘116’ FP data points, ‘1588’ TN data points, and ‘18’ FN data points were obtained.
For the anomaly detection technique 806A ‘Dynamic Window Z-score’ 814A, among the real anomalies 802A, ‘16’ TP data points, ‘7’ FP data points, ‘1705’ TN data points, and ‘0’ FN data points were obtained. Among the contextual spikes 804A, ‘0’ TP data points, ‘23’ FP data points, ‘1681’ TN data points, and ‘24’ FN data points were obtained.
For the anomaly detection technique 806A ‘MAD’ 816A, among the real anomalies 802A, ‘16’ TP data points, ‘101’ FP data points, ‘1611’ TN data points, and ‘0’ FN data points were obtained. Among the contextual spikes 804A, ‘24’ TP data points, ‘34’ FP data points, ‘1670’ TN data points, and ‘0’ FN data points were obtained.
Thus, in the univariate test set, through the dynamic window Z-score 814A (i.e., the time bucket-based anomaly detection analysis), all the 16 real anomalies 802A were successfully identified and none of the 24 contextual spikes 804A were identified as anomalies. Other anomaly detection techniques 806A mostly identified the 24 contextual spikes 804A as anomalies. FBProphet 812A was the closest to the dynamic window Z-score 814A in identifying the contextual spikes 804A as non-anomalies (only 6 out of 24 contextual spikes identified as anomalies). However, FBProphet 812A failed to optimally detect the real anomalies 802A (only 4 out of 16 real anomalies identified as anomalies).
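By way of a non-limiting illustration, the comparison drawn above may be reproduced from the counts stated for the comparison table 800A. The dictionary layout, the helper name `rate`, and the two derived ratios are assumptions made for illustration; only the (TP, FP, TN, FN) counts are taken from the description.

```python
# (TP, FP, TN, FN) counts as stated above for the comparison table 800A.
table_800a = {
    "ARIMA": {"real": (16, 26, 1686, 0), "contextual": (24, 18, 1686, 0)},
    "Kalman Filters": {"real": (16, 31, 1681, 0), "contextual": (24, 23, 1681, 0)},
    "FBProphet": {"real": (4, 118, 1594, 12), "contextual": (6, 116, 1588, 18)},
    "Dynamic Window Z-score": {"real": (16, 7, 1705, 0), "contextual": (0, 23, 1681, 24)},
    "MAD": {"real": (16, 101, 1611, 0), "contextual": (24, 34, 1670, 0)},
}

def rate(tp, fn):
    """Fraction of labelled points that were flagged as anomalies."""
    return tp / (tp + fn) if (tp + fn) else 0.0

summary = {}
for name, rows in table_800a.items():
    r_tp, _, _, r_fn = rows["real"]
    c_tp, _, _, c_fn = rows["contextual"]
    summary[name] = {
        "real_detected": rate(r_tp, r_fn),       # ideally 1.0
        "contextual_flagged": rate(c_tp, c_fn),  # ideally 0.0
    }

for name, s in summary.items():
    print(f"{name:24s} real={s['real_detected']:.2f} "
          f"contextual={s['contextual_flagged']:.2f}")
```

Under these two ratios, the ideal scenario described above corresponds to a value of 1.0 for the real anomalies and 0.0 for the contextual spikes.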
Referring now to FIG. 8B, an exemplary comparison table 800B representing performance data of dynamic time bucket-based anomaly detection analysis and pre-existing anomaly detection analysis techniques for a multivariate test set is illustrated, in accordance with some embodiments of the present disclosure. Anomaly detection was performed on the multivariate test set by the time bucket-based anomaly detection analysis technique and each of the pre-existing anomaly detection analysis techniques. The comparison table 800B may include a column for real anomalies 802B, a column for contextual spikes 804B, and a column for anomaly detection techniques 806B. The real anomaly column 802B may include a column for TP anomalies predicted, a column for FP anomalies predicted, a column for TN anomalies predicted, and a column for FN anomalies predicted. The contextual spikes column 804B may also include a column for TP, a column for FP, a column for TN, and a column for FN.
For the anomaly detection technique 806B ‘PCA’ 808B, among the real anomalies 802B, ‘8’ TP data points, ‘15’ FP data points, ‘1700’ TN data points, and ‘5’ FN data points were obtained. Among the contextual spikes 804B, ‘15’ TP data points, ‘8’ FP data points, ‘1696’ TN data points, and ‘9’ FN data points were obtained.
For the anomaly detection technique 806B ‘DBSCAN’ 810B, among the real anomalies 802B, ‘1’ TP data point, ‘1’ FP data point, ‘1714’ TN data points, and ‘12’ FN data points were obtained. Among the contextual spikes 804B, ‘1’ TP data point, ‘1’ FP data point, ‘1703’ TN data points, and ‘23’ FN data points were obtained.
For the anomaly detection technique 806B ‘Mahalanobis’ 812B, among the real anomalies 802B, ‘13’ TP data points, ‘31’ FP data points, ‘1684’ TN data points, and ‘0’ FN data points were obtained. Among the contextual spikes 804B, ‘24’ TP data points, ‘20’ FP data points, ‘1684’ TN data points, and ‘0’ FN data points were obtained.
For the anomaly detection technique 806B ‘Dynamic Window Mahalanobis’ 814B (i.e., the time bucket-based anomaly detection analysis), among the real anomalies 802B, ‘12’ TP data points, ‘31’ FP data points, ‘1681’ TN data points, and ‘1’ FN data point were obtained. Among the contextual spikes 804B, ‘19’ TP data points, ‘24’ FP data points, ‘1680’ TN data points, and ‘5’ FN data points were obtained.
For the anomaly detection technique 806B ‘MCD’ 816B, among the real anomalies 802B, ‘13’ TP data points, ‘104’ FP data points, ‘1611’ TN data points, and ‘0’ FN data points were obtained. Among the contextual spikes 804B, ‘24’ TP data points, ‘93’ FP data points, ‘1611’ TN data points, and ‘0’ FN data points were obtained.
Thus, in the multivariate test set, through the dynamic window Mahalanobis 814B (i.e., the time bucket-based anomaly detection analysis), all the 13 real anomalies 802B were successfully identified and none of the 24 contextual spikes 804B were identified as anomalies. Other anomaly detection techniques 806B mostly identified the 24 contextual spikes 804B as anomalies. DBSCAN 810B was the closest to the dynamic window Mahalanobis 814B in identifying the contextual spikes 804B as non-anomalies (only 1 out of 24 contextual spikes identified as anomalies). However, DBSCAN 810B failed to optimally detect the real anomalies 802B (only 1 out of 13 real anomalies identified as anomalies).
Referring now to FIG. 9A, an exemplary comparison table 900A representing training and inference times of time bucket-based anomaly detection analysis and pre-existing anomaly detection analysis techniques in univariate test set is illustrated, in accordance with some embodiments of the present disclosure. The comparison table 900A includes a column for anomaly detection technique 902A, a column for training time 904A, and a column for inference time 906A.
For the anomaly detection technique 902A ‘ARIMA’ 908A, the training time 904A was ‘0.045’ seconds and the inference time 906A was ‘190.297’ seconds.
For the anomaly detection technique 902A ‘Kalman Filters’ 910A, the training time 904A was ‘1.556’ seconds and the inference time 906A was ‘1.416’ seconds.
For the anomaly detection technique 902A ‘FBProphet’ 912A, the training time 904A was ‘0.448’ seconds and the inference time 906A was ‘25.813’ seconds.
For the anomaly detection technique 902A ‘Z-test’ 914A, the training time 904A was ‘0.001’ seconds and the inference time 906A was ‘0.004’ seconds.
For the anomaly detection technique 902A ‘Dynamic Window Z-score’ 916A, the training time 904A was ‘3.060’ seconds and the inference time 906A was ‘0.609’ seconds.
Thus, in terms of the training and inference times, the time bucket-based anomaly detection analysis technique (i.e., ‘Dynamic Window Z-score’ 916A) takes a longer training time than the pre-existing anomaly detection techniques. This is due to the additional complexity of the bucketing and of the statistical calculations (mean, standard deviation, etc.) for the time buckets. The inference time of the time bucket-based anomaly detection analysis technique is second only to that of the Z-test 914A. However, the training and inference times of the time bucket-based anomaly detection analysis technique may be further improved through use of binary searching, as will be shown in conjunction with FIG. 9B.
Referring now to FIG. 9B, an exemplary comparison table 900B representing training and inference times of time bucket-based anomaly detection analysis with binary search-based bucket allocation and pre-existing anomaly detection analysis techniques in univariate test set is illustrated, in accordance with some embodiments of the present disclosure. The comparison table 900B includes a column for anomaly detection technique 902B, a column for training time 904B, and a column for inference time 906B. The training time 904B and the inference time 906B for the anomaly detection techniques 902B ARIMA 908A, Kalman Filters 910A, FBProphet 912A, and Z Test 914A have already been stated in the comparison table 900A.
To reduce the training time 904B and the inference time 906B of the time bucket-based anomaly detection analysis, binary search was used on the sorted start times of the time buckets to quickly locate the closest possible time bucket for a given time. Then, the index of the located time bucket is used to check whether the start and end times of the located time bucket actually contain the given time. This may avoid the delay of checking each time bucket in turn.
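By way of a non-limiting illustration, the binary search-based bucket allocation may be sketched as follows using the `bisect` module of the Python standard library. The function name `find_bucket`, the representation of times as minutes since midnight, and the sample bucket boundaries are assumptions made for illustration.

```python
from bisect import bisect_right

def find_bucket(start_times, end_times, t):
    """Return the index of the bucket whose [start, end] interval contains t, else None."""
    # Binary search for the rightmost bucket whose start time is <= t.
    idx = bisect_right(start_times, t) - 1
    if idx < 0:
        return None  # t precedes the first bucket
    # Confirm that t actually falls inside the located bucket.
    return idx if t <= end_times[idx] else None

# Hypothetical buckets in minutes since midnight, sorted by start time.
starts = [0, 360, 600, 1020]
ends = [359, 599, 1019, 1439]

print(find_bucket(starts, ends, 605))
```

Each lookup takes a number of comparisons that grows logarithmically with the number of time buckets, rather than linearly as in a sequential scan.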
For the anomaly detection technique 902B ‘Dynamic Window Z-score’ 916B (i.e., time bucket-based anomaly detection analysis with binary search-based bucket allocation), the training time 904B was ‘3.060’ seconds and the inference time 906B was ‘0.609’ seconds.
Thus, the training time 904B and the inference time 906B were significantly reduced for the ‘Dynamic Window Z-score’ 916B upon using binary search for bucket allocation. The reduced training time 904B and inference time 906B are comparable with those of the pre-existing techniques 908A-914A.
Referring now to FIG. 10A, an exemplary comparison table 1000A representing training and inference times of time bucket-based anomaly detection analysis and pre-existing anomaly detection analysis techniques in multivariate test set is illustrated, in accordance with some embodiments of the present disclosure. The comparison table 1000A includes a column for anomaly detection technique 1002A, a column for training time 1004A, and a column for inference time 1006A.
For the anomaly detection technique 1002A ‘PCA’ 1008A, the training time 1004A was ‘0.031’ seconds and the inference time 1006A was ‘0.001’ seconds.
For the anomaly detection technique 1002A ‘DBSCAN’ 1010A, the training time 1004A was ‘0.947’ seconds and the inference time 1006A was ‘0.085’ seconds.
For the anomaly detection technique 1002A ‘Mahalanobis’ 1012A, the training time 1004A was ‘0.011’ seconds and the inference time 1006A was ‘0.007’ seconds.
For the anomaly detection technique 1002A ‘Dynamic Window Mahalanobis’ 1014A, the training time 1004A was ‘2.931’ seconds and the inference time 1006A was ‘1.273’ seconds.
Thus, in terms of the training and inference times, the time bucket-based anomaly detection analysis technique (i.e., ‘Dynamic Window Mahalanobis’ 1014A) takes longer training and inference times than the pre-existing anomaly detection techniques. This is due to the additional complexity of the bucketing and of the statistical calculations (mean, standard deviation, etc.) for the time buckets. However, the training and inference times of the time bucket-based anomaly detection analysis technique may be further improved through use of binary searching, as will be shown in conjunction with FIG. 10B.
Referring now to FIG. 10B, an exemplary comparison table 1000B representing training and inference times of time bucket-based anomaly detection analysis with binary search-based bucket allocation and pre-existing anomaly detection analysis techniques in a multivariate test set is illustrated, in accordance with some embodiments of the present disclosure. The comparison table 1000B includes a column for anomaly detection technique 1002B, a column for training time 1004B, and a column for inference time 1006B. The training time 1004B and the inference time 1006B for the anomaly detection techniques 1002B PCA 1008A, DBSCAN 1010A, and Mahalanobis 1012A have already been stated in the comparison table 1000A.
To reduce the training time 1004B and the inference time 1006B of the time bucket-based anomaly detection analysis, binary search was used on the sorted start times of the time buckets to quickly locate the closest possible time bucket for a given time. Then, the index of the located time bucket is used to check whether the start and end times of the located time bucket actually contain the given time. This may avoid the delay of checking each time bucket in turn.
For the anomaly detection technique 1002B ‘Dynamic Window Mahalanobis’ 1014B (i.e., time bucket-based anomaly detection analysis with binary search-based bucket allocation), the training time 1004B was ‘0.962’ seconds and the inference time 1006B was ‘0.710’ seconds.
Thus, the training time 1004B and the inference time 1006B were significantly reduced for the ‘Dynamic Window Mahalanobis’ 1014B upon using binary search for bucket allocation. The reduced training time 1004B and inference time 1006B are comparable with those of the pre-existing techniques 1008A-1012A.
As will be also appreciated, the above-described techniques may take the form of computer or controller implemented processes and apparatuses for practicing those processes. The disclosure can also be embodied in the form of computer program code containing instructions embodied in tangible media, such as floppy diskettes, solid state drives, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer or controller, the computer becomes an apparatus for practicing the invention. The disclosure may also be embodied in the form of computer program code or signal, for example, whether stored in a storage medium, loaded into and/or executed by a computer or controller, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
The disclosed methods and systems may be implemented on a conventional or a general-purpose computer system, such as a personal computer (PC) or server computer. Referring now to FIG. 11, an exemplary computing system 1100 that may be employed to implement processing functionality for various embodiments (e.g., as a SIMD device, client device, server device, one or more processors, or the like) is illustrated. Those skilled in the relevant art will also recognize how to implement the invention using other computer systems or architectures. The computing system 1100 may represent, for example, a user device such as a desktop, a laptop, a mobile phone, personal entertainment device, DVR, and so on, or any other type of special or general-purpose computing device as may be desirable or appropriate for a given application or environment. The computing system 1100 may include one or more processors, such as a processor 1102 that may be implemented using a general or special purpose processing engine such as, for example, a microprocessor, microcontroller or other control logic. In this example, the processor 1102 is connected to a bus 1104 or other communication medium. In some embodiments, the processor 1102 may be an Artificial Intelligence (AI) processor, which may be implemented as a Tensor Processing Unit (TPU), or a graphical processor unit, or a custom programmable solution Field-Programmable Gate Array (FPGA).
The computing system 1100 may also include a memory 1106 (main memory), for example, Random Access Memory (RAM) or other dynamic memory, for storing information and instructions to be executed by the processor 1102. The memory 1106 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by the processor 1102. The computing system 1100 may likewise include a read only memory (“ROM”) or other static storage device coupled to bus 1104 for storing static information and instructions for the processor 1102.
The computing system 1100 may also include storage devices 1108, which may include, for example, a media drive 1110 and a removable storage interface. The media drive 1110 may include a drive or other mechanism to support fixed or removable storage media, such as a hard disk drive, a floppy disk drive, a magnetic tape drive, an SD card port, a USB port, a micro USB, an optical disk drive, a CD or DVD drive (R or RW), or other removable or fixed media drive. Storage media 1112 may include, for example, a hard disk, magnetic tape, flash drive, or other fixed or removable medium that is read by and written to by the media drive 1110. As these examples illustrate, the storage media 1112 may include a computer-readable storage medium having stored therein particular computer software or data.
In alternative embodiments, the storage devices 1108 may include other similar instrumentalities for allowing computer programs or other instructions or data to be loaded into the computing system 1100. Such instrumentalities may include, for example, a removable storage unit 1114 and a storage unit interface 1116, such as a program cartridge and cartridge interface, a removable memory (for example, a flash memory or other removable memory module) and memory slot, and other removable storage units and interfaces that allow software and data to be transferred from the removable storage unit 1114 to the computing system 1100.
The computing system 1100 may also include a communications interface 1118. The communications interface 1118 may be used to allow software and data to be transferred between the computing system 1100 and external devices. Examples of the communications interface 1118 may include a network interface (such as an Ethernet or other NIC card), a communications port (such as for example, a USB port, a micro USB port), Near field Communication (NFC), etc. Software and data transferred via the communications interface 1118 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by the communications interface 1118. These signals are provided to the communications interface 1118 via a channel 1120. The channel 1120 may carry signals and may be implemented using a wireless medium, wire or cable, fiber optics, or other communications medium. Some examples of the channel 1120 may include a phone line, a cellular phone link, an RF link, a Bluetooth link, a network interface, a local or wide area network, and other communications channels.
The computing system 1100 may further include Input/Output (I/O) devices 1122. Examples may include, but are not limited to, a display, keypad, microphone, audio speakers, vibrating motor, LED lights, etc. The I/O devices 1122 may receive input from a user and also display an output of the computation performed by the processor 1102. In this document, the terms “computer program product” and “computer-readable medium” may be used generally to refer to media such as, for example, the memory 1106, the storage devices 1108, the removable storage unit 1114, or signal(s) on the channel 1120. These and other forms of computer-readable media may be involved in providing one or more sequences of one or more instructions to the processor 1102 for execution. Such instructions, generally referred to as “computer program code” (which may be grouped in the form of computer programs or other groupings), when executed, enable the computing system 1100 to perform features or functions of embodiments of the present invention.
In an embodiment where the elements are implemented using software, the software may be stored in a computer-readable medium and loaded into the computing system 1100 using, for example, the removable storage unit 1114, the media drive 1110 or the communications interface 1118. The control logic (in this example, software instructions or computer program code), when executed by the processor 1102, causes the processor 1102 to perform the functions of the invention as described herein.
Various embodiments provide a method and system for dynamically detecting point anomalies in time-series data. The disclosed method and system may receive time series data for a current time unit. The time series data may include a plurality of points of each of one or more variables. The disclosed method and system may further dynamically create a set of time buckets in the current time unit based on a data distribution of the plurality of points. Each of the set of time buckets may include a set of points from the plurality of points. Each of the set of time buckets may correspond to a variable time interval in the current time unit. To dynamically create the set of time buckets, for each point of the plurality of points in chronological order, the disclosed method and system may calculate in real time a first distance metric of the point corresponding to a current time bucket based on the data distribution of one or more of the plurality of points in the current time bucket. The current time bucket may be one of the set of time buckets. A time value associated with the point may be greater than an end time value of the variable time interval of the current time bucket. The disclosed method and system may further compare the first distance metric of the point with a bucketing threshold value of the current time bucket. The disclosed method and system may further assign the point to one of the current time bucket or a next time bucket based on the comparison. The next time bucket may be one of the set of time buckets. The next time bucket may be arranged next in chronological order to the current time bucket. The disclosed method and system may further detect one or more point anomalies in each of the set of time buckets based on a data distribution of the set of points in each of a concurrent set of time buckets of each of a plurality of time units. The plurality of time units may include the current time unit.
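By way of a non-limiting illustration, the dynamic creation of time buckets for a single variable may be sketched as follows. The function name `create_buckets`, the fixed bucketing threshold of 3.0, the minimum of three points per bucket, the `sigma_floor` guard against near-constant data, and the sample values are all assumptions made for illustration and are not taken from the disclosure.

```python
from statistics import mean, pstdev

def create_buckets(points, threshold=3.0, min_points=3, sigma_floor=1.0):
    """Group chronologically ordered (time, value) points into variable-width buckets."""
    buckets = []
    for t, value in points:
        if buckets:
            current = buckets[-1]
            values = current["values"]
            if len(values) < min_points:
                # An under-filled bucket absorbs the point, which has the same
                # effect as combining the would-be next bucket with it.
                within = True
            else:
                # First distance metric: Z-score against the current bucket.
                sigma = max(pstdev(values), sigma_floor)
                within = abs(value - mean(values)) / sigma <= threshold
            if within:
                values.append(value)
                current["end"] = t  # extend the variable time interval
                continue
        # First point overall, or the point is beyond the threshold:
        # open the next bucket, starting at this point's time.
        buckets.append({"start": t, "end": t, "values": [value]})
    return buckets

# Hypothetical (minute, CPU %) points: a quiet period followed by a busy one.
series = [(0, 5), (1, 6), (2, 5), (3, 4), (4, 6), (5, 5),
          (6, 30), (7, 31), (8, 29), (9, 30), (10, 32)]

buckets = create_buckets(series)
for b in buckets:
    print(b["start"], b["end"], b["values"])
```

In this sketch the end time of a bucket moves forward each time a point is accepted, so the width of every bucket follows the data rather than a fixed window size.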
Thus, the disclosed method and system try to overcome the technical problem of dynamically detecting point anomalies in time-series data. The method and system may improve anomaly detection accuracy by using a context-aware dynamic windowing mechanism that may effectively differentiate between true anomalies and recurring contextual spikes, reducing false alarms. Further, the method and system may be designed to handle large-scale datasets efficiently. Further, the method and system may be applicable in multiple fields such as cybersecurity (intrusion detection), finance (fraud detection), healthcare (patient monitoring), and industrial systems (fault prediction). Further, whereas traditional anomaly detection often requires predefined thresholds, the method and system may reduce manual intervention by adjusting detection criteria based on data patterns. Further, whereas many existing approaches primarily work in univariate scenarios, the method and system may extend seamlessly to multivariate data. Many traditional algorithms, designed for single-dimensional anomaly detection, struggle to generalize across multiple dimensions, but the method and system may effectively capture correlations between multiple features.
In light of the above mentioned advantages and the technical advancements provided by the disclosed method and system, the claimed steps as discussed above are not routine, conventional, or well understood in the art, as the claimed steps enable the following solutions to the existing problems in conventional technologies. Further, the claimed steps clearly bring an improvement in the functioning of the device itself as the claimed steps provide a technical solution to a technical problem.
The specification has described method and system for dynamically detecting point anomalies in time-series data. The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.
Claims:CLAIMS
I/We Claim:
1. A method (300) for dynamically detecting point anomalies in time-series data, the method (300) comprising:
receiving (302), by a processor (104), time series data (208) for a current time unit, wherein the time series data (208) comprises a plurality of points of each of one or more variables;
dynamically creating (304), by the processor (104), a set of time buckets in the current time unit based on a data distribution of the plurality of points, wherein each of the set of time buckets comprises a set of points from the plurality of points, wherein each of the set of time buckets corresponds to a variable time interval in the current time unit, and wherein dynamically creating the set of time buckets comprises:
for each point of the plurality of points in chronological order,
calculating (306) in real-time, by the processor (104), a first distance metric of the point corresponding to a current time bucket based on the data distribution of one or more of the plurality of points in the current time bucket, wherein the current time bucket is one of the set of time buckets, and wherein a time value associated with the point is greater than an end time value of the variable time interval of the current time bucket;
comparing (308), by the processor (104), the first distance metric of the point with a bucketing threshold value of the current time bucket; and
assigning (310), by the processor (104), the point to one of the current time bucket or a next time bucket based on the comparison, wherein the next time bucket is one of the set of time buckets, and wherein the next time bucket is arranged next in chronological order to the current time bucket; and
detecting (312), by the processor (104), one or more point anomalies in each of the set of time buckets based on a data distribution of the set of points in each of concurrent set of time buckets of each of a plurality of time units, wherein the plurality of time units comprises the current time unit.
2. The method (300) as claimed in claim 1, wherein calculating (306) in real-time, the first distance metric of the point comprises:
calculating (402) a first mean for the current time bucket based on the data distribution of the one or more of the plurality of points in the current time bucket;
calculating (404) a first standard deviation for the current time bucket based on the data distribution of the one or more of the plurality of points in the current time bucket; and
calculating (406) the first distance metric of the point corresponding to the current time bucket based on the first mean and the first standard deviation, wherein the first distance metric is a Z-score when the time series data (208) corresponds to one variable, and wherein the first distance metric is a Mahalanobis distance when the time series data (208) corresponds to two or more variables.
3. The method (300) as claimed in claim 2, comprising determining the bucketing threshold value based on a distance metric mean and a distance metric standard deviation corresponding to the data distribution of the plurality of points in the time series data (208).
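One possible way of deriving a threshold from a distance metric mean and a distance metric standard deviation is sketched below. The form "mean plus k standard deviations", the multiplier `k`, and the function name `bucketing_threshold` are hypothetical choices for illustration; the claim does not specify how the two quantities are combined.

```python
from statistics import mean, pstdev

def bucketing_threshold(distances, k=3.0):
    """Illustrative threshold: mean of the distance metrics plus k standard deviations.

    `distances` holds distance metrics already computed for the points of the
    time series data; `k` is an assumed tuning parameter.
    """
    return mean(distances) + k * pstdev(distances)
```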
4. The method (300) as claimed in claim 3, wherein assigning (310) the point to one of the current time bucket or the next time bucket comprises:
when the first distance metric is within the bucketing threshold value,
assigning (412) the point to the current time bucket; and
dynamically adjusting (414) an end time value of the variable time interval corresponding to the current time bucket to a time value associated with the point in the time series data (208); and
when the first distance metric is beyond the bucketing threshold value, assigning (416) the point to the next time bucket.
5. The method (300) as claimed in claim 4, comprising:
upon assigning the point to the next time bucket, comparing (418) a count of the one or more of the plurality of points in the current time bucket with a predefined minimum threshold number of points; and
when the count is less than the predefined minimum threshold number of points, combining (422) the next time bucket with the current time bucket.
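The assignment, end-time adjustment, and combining steps recited above may be illustrated, for a single variable, by the following minimal sketch. The `Bucket` class, the helper `z_score`, the function `create_buckets`, and the treatment of a zero standard deviation are assumptions introduced here and do not represent the only way of carrying out the method.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class Bucket:
    start: float                      # start time of the variable time interval
    end: float                        # end time, adjusted as points are assigned
    values: list = field(default_factory=list)

def z_score(value, values):
    sd = pstdev(values)
    if sd == 0:
        # Assumption: zero spread gives 0 for an identical value, infinity otherwise.
        return 0.0 if value == values[0] else float("inf")
    return abs(value - mean(values)) / sd

def create_buckets(points, threshold, min_points):
    """points: (time, value) pairs in chronological order."""
    buckets = []
    for t, v in points:
        if not buckets:
            buckets.append(Bucket(t, t, [v]))
            continue
        current = buckets[-1]
        if z_score(v, current.values) <= threshold:
            current.values.append(v)  # within threshold: assign to current bucket
            current.end = t           # and move its end time to this point
        else:
            nxt = Bucket(t, t, [v])   # beyond threshold: assign to next bucket
            if len(current.values) < min_points:
                # Current bucket is too small: combine the next bucket with it.
                current.values.extend(nxt.values)
                current.end = nxt.end
            else:
                buckets.append(nxt)
    return buckets
```

In this sketch, a stream whose values shift from about 10 to about 50 would typically be split into two buckets at the time of the shift, provided that the first bucket already holds at least `min_points` points.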
6. The method (300) as claimed in claim 1, wherein detecting (312) the one or more point anomalies in the time series data (208) comprises:
for each time bucket of the set of time buckets and for each point of the set of points in the time bucket,
calculating (502) a second distance metric of the point based on the data distribution of the set of points in the time bucket of the current time unit and a set of historical points in a concurrent time bucket in each of the remaining time units of the plurality of time units;
comparing (512) the second distance metric of the point with an anomaly detection threshold value of the time bucket; and
establishing (514) the point as a point anomaly in the time bucket based on the comparison, wherein the point anomaly corresponds to an outlier point in the data distribution of the set of points and the set of historical points.
7. The method (300) as claimed in claim 6, wherein calculating (502) the second distance metric of the point comprises:
calculating (504) a second mean for the time bucket based on the data distribution of the set of points and the set of historical points;
calculating (506) a second standard deviation for the time bucket based on the data distribution of the set of points and the set of historical points; and
calculating (508) the second distance metric of the point corresponding to the time bucket based on the second mean and the second standard deviation, wherein the second distance metric is a Z-score when the time series data (208) corresponds to one variable, and wherein the second distance metric is a Mahalanobis distance when the time series data (208) corresponds to two or more variables.
8. The method (300) as claimed in claim 7, comprising determining (510) the anomaly detection threshold value based on the second mean and the second standard deviation.
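For a single variable, the detection steps recited above may be sketched as follows. The pooling of current and historical points into one array, the multiplier `k`, and the function name `detect_point_anomalies` are assumptions for illustration; here the anomaly detection threshold is taken, hypothetically, as k standard deviations from the second mean.

```python
import numpy as np

def detect_point_anomalies(current_points, historical_points, k=3.0):
    """Return indices of points in the current bucket flagged as anomalies.

    current_points: values in a time bucket of the current time unit.
    historical_points: values from the concurrent bucket of other time units.
    """
    current = np.asarray(current_points, dtype=float)
    pooled = np.concatenate([current, np.asarray(historical_points, dtype=float)])
    mu = pooled.mean()               # second mean
    sigma = pooled.std()             # second standard deviation (population, assumed)
    if sigma == 0:
        return []                    # no spread, so nothing can be an outlier
    z = np.abs(current - mu) / sigma # second distance metric (Z-score)
    return [i for i, score in enumerate(z) if score > k]
```

Because the current points are pooled with historical points from concurrent buckets, a value is judged against what is usual for that part of the time unit rather than against the whole series.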
9. A system (100) for dynamically detecting point anomalies in time-series data, the system (100) comprising:
a processor (104); and
a memory (106) communicatively coupled to the processor (104), wherein the memory (106) stores processor instructions, which when executed by the processor (104), cause the processor (104) to:
receive (302) time series data (208) for a current time unit, wherein the time series data (208) comprises a plurality of points of each of one or more variables;
dynamically create (304) a set of time buckets in the current time unit based on a data distribution of the plurality of points, wherein each of the set of time buckets comprises a set of points from the plurality of points, wherein each of the set of time buckets corresponds to a variable time interval in the current time unit, and wherein dynamically creating the set of time buckets comprises:
for each point of the plurality of points in chronological order,
calculate (306) in real-time a first distance metric of the point corresponding to a current time bucket based on the data distribution of one or more of the plurality of points in the current time bucket, wherein the current time bucket is one of the set of time buckets, and wherein a time value associated with the point is greater than an end time value of the variable time interval of the current time bucket;
compare (308) the first distance metric of the point with a bucketing threshold value of the current time bucket; and
assign (310) the point to one of the current time bucket or a next time bucket based on the comparison, wherein the next time bucket is one of the set of time buckets, and wherein the next time bucket is arranged next in chronological order to the current time bucket; and
detect (312) one or more point anomalies in each of the set of time buckets based on a data distribution of the set of points in each of a concurrent set of time buckets of each of a plurality of time units, wherein the plurality of time units comprises the current time unit.
10. The system (100) as claimed in claim 9, wherein to calculate (306) in real-time, the first distance metric of the point, the processor instructions, on execution, cause the processor (104) to:
calculate (402) a first mean for the current time bucket based on the data distribution of the one or more of the plurality of points in the current time bucket;
calculate (404) a first standard deviation for the current time bucket based on the data distribution of the one or more of the plurality of points in the current time bucket; and
calculate (406) the first distance metric of the point corresponding to the current time bucket based on the first mean and the first standard deviation, wherein the first distance metric is a Z-score when the time series data (208) corresponds to one variable, and wherein the first distance metric is a Mahalanobis distance when the time series data (208) corresponds to two or more variables.
11. The system (100) as claimed in claim 10, wherein the processor instructions, on execution, cause the processor (104) to determine the bucketing threshold value based on a distance metric mean and a distance metric standard deviation corresponding to the data distribution of the plurality of points in the time series data (208).
12. The system (100) as claimed in claim 11, wherein to assign (310) the point to one of the current time bucket or the next time bucket, the processor instructions, on execution, cause the processor (104) to:
when the first distance metric is within the bucketing threshold value,
assign (412) the point to the current time bucket; and
dynamically adjust (414) an end time value of the variable time interval corresponding to the current time bucket to a time value associated with the point in the time series data (208); and
when the first distance metric is beyond the bucketing threshold value, assign (416) the point to the next time bucket.
13. The system (100) as claimed in claim 12, wherein the processor instructions, on execution, cause the processor (104) to:
upon assigning the point to the next time bucket, compare (418) a count of the one or more of the plurality of points in the current time bucket with a predefined minimum threshold number of points; and
when the count is less than the predefined minimum threshold number of points, combine (422) the next time bucket with the current time bucket.
14. The system (100) as claimed in claim 9, wherein to detect (312) the one or more point anomalies in the time series data (208), the processor instructions, on execution, cause the processor (104) to:
for each time bucket of the set of time buckets and for each point of the set of points in the time bucket,
calculate (502) a second distance metric of the point based on the data distribution of the set of points in the time bucket of the current time unit and a set of historical points in a concurrent time bucket in each of the remaining time units of the plurality of time units;
compare (512) the second distance metric of the point with an anomaly detection threshold value of the time bucket; and
establish (514) the point as a point anomaly in the time bucket based on the comparison, wherein the point anomaly corresponds to an outlier point in the data distribution of the set of points and the set of historical points.
15. The system (100) as claimed in claim 14, wherein to calculate (502) the second distance metric of the point, the processor instructions, on execution, cause the processor (104) to:
calculate (504) a second mean for the time bucket based on the data distribution of the set of points and the set of historical points;
calculate (506) a second standard deviation for the time bucket based on the data distribution of the set of points and the set of historical points; and
calculate (508) the second distance metric of the point corresponding to the time bucket based on the second mean and the second standard deviation, wherein the second distance metric is a Z-score when the time series data (208) corresponds to one variable, and wherein the second distance metric is a Mahalanobis distance when the time series data (208) corresponds to two or more variables.
16. The system (100) as claimed in claim 15, wherein the processor instructions, on execution, cause the processor (104) to determine (510) the anomaly detection threshold value based on the second mean and the second standard deviation.
| # | Name | Date |
|---|---|---|
| 1 | 202511087271-STATEMENT OF UNDERTAKING (FORM 3) [12-09-2025(online)].pdf | 2025-09-12 |
| 2 | 202511087271-REQUEST FOR EXAMINATION (FORM-18) [12-09-2025(online)].pdf | 2025-09-12 |
| 3 | 202511087271-REQUEST FOR EARLY PUBLICATION(FORM-9) [12-09-2025(online)].pdf | 2025-09-12 |
| 4 | 202511087271-PROOF OF RIGHT [12-09-2025(online)].pdf | 2025-09-12 |
| 5 | 202511087271-POWER OF AUTHORITY [12-09-2025(online)].pdf | 2025-09-12 |
| 6 | 202511087271-FORM-9 [12-09-2025(online)].pdf | 2025-09-12 |
| 7 | 202511087271-FORM 18 [12-09-2025(online)].pdf | 2025-09-12 |
| 8 | 202511087271-FORM 1 [12-09-2025(online)].pdf | 2025-09-12 |
| 9 | 202511087271-FIGURE OF ABSTRACT [12-09-2025(online)].pdf | 2025-09-12 |
| 10 | 202511087271-DRAWINGS [12-09-2025(online)].pdf | 2025-09-12 |
| 11 | 202511087271-DECLARATION OF INVENTORSHIP (FORM 5) [12-09-2025(online)].pdf | 2025-09-12 |
| 12 | 202511087271-COMPLETE SPECIFICATION [12-09-2025(online)].pdf | 2025-09-12 |