common_function/scalars/anomaly/
mod.rs

1// Copyright 2023 Greptime Team
2//
3// Licensed under the Apache License, Version 2.0 (the "License");
4// you may not use this file except in compliance with the License.
5// You may obtain a copy of the License at
6//
7//     http://www.apache.org/licenses/LICENSE-2.0
8//
9// Unless required by applicable law or agreed to in writing, software
10// distributed under the License is distributed on an "AS IS" BASIS,
11// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12// See the License for the specific language governing permissions and
13// limitations under the License.
14
15//! Anomaly detection window functions.
16//!
17//! This module provides statistical anomaly scoring functions that operate
18//! as window UDFs (User Defined Window Functions):
19//!
20//! - `anomaly_score_mad(value)` — MAD-based scoring
21//! - `anomaly_score_iqr(value, k)` — IQR-based scoring with configurable fence multiplier
22//! - `anomaly_score_zscore(value)` — Z-Score-based scoring
23//!
24//! These functions return a floating-point anomaly score rather than a boolean,
25//! allowing users to set their own threshold via `WHERE score > N`.
26//!
27//! ## Minimum Samples
28//!
29//! Each function has its own minimum based on statistical validity:
30//!
31//! | Function | min_samples | Rationale |
32//! |---|---|---|
33//! | `anomaly_score_zscore` | 2 | stddev requires n >= 2 (aligned with `STDDEV_SAMP`) |
34//! | `anomaly_score_mad` | 3 | n <= 2 makes MAD almost always 0, yielding spurious +inf |
35//! | `anomaly_score_iqr` | 3 | linear-interpolated Q1 != Q3 is possible at n >= 3 |
36//!
37//! ## Return Values
38//!
39//! | Condition | zscore | mad | iqr | Result |
40//! |---|---|---|---|---|
41//! | insufficient valid points | n < 2 | n < 3 | n < 3 | `NULL` |
42//! | stddev / MAD / IQR = 0, value = center | distance = 0 | distance = 0 | on fence | `0.0` |
43//! | stddev / MAD / IQR = 0, value ≠ center | distance > 0 | distance > 0 | outside fence | `+inf` |
44//! | normal case | stddev > 0 | MAD > 0 | IQR > 0 | finite positive |
45//!
46//! ## Window Frame Semantics
47//!
48//! The functions score the **current row** in the partition, regardless of
49//! window frame type. This works correctly for all frame specifications:
50//!
51//! ```sql
52//! -- Trailing window
53//! anomaly_score_mad(cpu) OVER (ORDER BY ts ROWS 100 PRECEDING)
54//! -- Centered window
55//! anomaly_score_mad(cpu) OVER (ORDER BY ts ROWS BETWEEN 50 PRECEDING AND 50 FOLLOWING)
56//! ```
57//!
58//! Internally, a row counter tracks which partition row is being evaluated.
59//! The `range` parameter determines only which rows participate in computing
60//! the window statistics (median, MAD, IQR, mean, stddev).
61//!
62//! ## Performance Notes
63//!
64//! Current implementation uses per-row evaluation with O(N × W) complexity
65//! where N is the partition size and W is the window size. This is acceptable
66//! for typical window sizes (W ≤ a few thousand).
67//!
68//! Future optimizations could include:
69//! - Incremental computation using order-statistic trees or two-heap median
70//!   maintenance, reducing to O(N × log W)
71//! - Batch `evaluate_all` for fixed-size windows
72
73mod iqr;
74mod mad;
75pub(crate) mod utils;
76mod zscore;
77
78use datafusion_expr::WindowUDF;
79
80use crate::function_registry::FunctionRegistry;
81
82pub struct AnomalyFunction;
83
84impl AnomalyFunction {
85    pub fn register(registry: &FunctionRegistry) {
86        registry.register_window(WindowUDF::from(mad::AnomalyScoreMad::new()));
87        registry.register_window(WindowUDF::from(iqr::AnomalyScoreIqr::new()));
88        registry.register_window(WindowUDF::from(zscore::AnomalyScoreZscore::new()));
89    }
90}