🧠 Building a Hybrid Anomaly Detection Engine for Network Flows
1️⃣ Background
flowenricher already enriched NetFlow/IPFIX data with ASN, GeoIP, DNS, etc.,
and had an Isolation Forest (iForest)–based anomaly detector.
We wanted to make the anomaly detection more stable, explainable, and sensitive
to different attack patterns without constant retuning —
so we added two complementary models:
HBOS (Histogram-Based Outlier Score) and
eHBOS (Ensemble HBOS).
Together, they form a fused, explainable, adaptive detection system.
2️⃣ The Models
🟩 Isolation Forest (iForest)
- Tree-based unsupervised model that isolates outliers by recursively partitioning data.
- Great for nonlinear, high-dimensional patterns (flow features like PPS, BPS, unique ports, etc.).
- Fast to score but requires periodic retraining on baseline data.
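To make the isolation idea concrete, here is a minimal pure-Python sketch of an isolation forest. It is illustrative only and is not flowenricher's implementation: the class name, the parameters (`n_trees`, `sample_size`, `seed`), and the two-feature toy baseline are all assumptions. The score follows the usual iForest form, 2^(-mean path length / c(n)), so values closer to 1 mean "isolated quickly".

```python
import math
import random

EULER_GAMMA = 0.5772156649


def avg_path_length(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split on a random feature at a random cut point."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # nothing to split on for this feature
        return {"size": len(rows)}
    cut = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < cut]
    right = [r for r in rows if r[feature] >= cut]
    return {
        "feature": feature,
        "cut": cut,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node):
    """Depth at which x lands, plus a correction for points left unsplit."""
    depth = 0
    while "feature" in node:
        node = node["left"] if x[node["feature"]] < node["cut"] else node["right"]
        depth += 1
    return depth + avg_path_length(node["size"])


class IsolationForest:
    def __init__(self, n_trees=100, sample_size=256, seed=0):
        self.n_trees = n_trees
        self.sample_size = sample_size
        self.seed = seed

    def fit(self, rows):
        if len(rows) < 2:
            raise ValueError("need at least two baseline rows")
        rng = random.Random(self.seed)
        self.psi = min(self.sample_size, len(rows))
        max_depth = math.ceil(math.log2(self.psi))
        self.trees = [
            build_tree(rng.sample(rows, self.psi), 0, max_depth, rng)
            for _ in range(self.n_trees)
        ]
        return self

    def score(self, x):
        """Anomaly score in (0, 1]; higher means easier to isolate."""
        mean_depth = sum(path_length(x, t) for t in self.trees) / len(self.trees)
        return 2.0 ** (-mean_depth / avg_path_length(self.psi))


# Toy baseline: (packets per second, unique ports), values are made up.
_rng = random.Random(42)
baseline = [(_rng.gauss(100, 10), _rng.gauss(5, 1)) for _ in range(300)]
forest = IsolationForest().fit(baseline)
typical_score = forest.score((100.0, 5.0))
outlier_score = forest.score((10000.0, 500.0))
```

The forest only knows the baseline it was fitted on, which is why the bullet above mentions periodic retraining: as normal traffic drifts, the trees need to be rebuilt from fresh baseline rows.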
🟨 HBOS (Histogram-Based Outlier Score)
- Statistical density estimation per feature using histograms.
- Scores samples inversely to how common their values are in the baseline.
- Excellent for univariate, independent features — extremely fast and deterministic.
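The per-feature histogram idea can be sketched in a few lines of plain Python. This is a simplified illustration under assumptions, not the project's code: equal-width bins, add-one smoothing so empty bins never produce log(0), and out-of-range values treated as the rarest possible bin are all choices made here for the example.

```python
import math


class HBOS:
    def __init__(self, bins=10):
        self.bins = bins
        self.hists = []  # one (lo, hi, width, probs, floor) tuple per feature

    def fit(self, rows):
        n = len(rows)
        self.hists = []
        for f in range(len(rows[0])):
            col = [r[f] for r in rows]
            lo, hi = min(col), max(col)
            width = (hi - lo) / self.bins or 1.0  # guard constant features
            counts = [0] * self.bins
            for v in col:
                counts[min(int((v - lo) / width), self.bins - 1)] += 1
            # Add-one smoothing keeps every bin probability above zero.
            probs = [(c + 1) / (n + self.bins) for c in counts]
            floor = 1 / (n + self.bins)  # probability used for unseen ranges
            self.hists.append((lo, hi, width, probs, floor))
        return self

    def feature_scores(self, x):
        """Per-feature -log(probability); rare values get large scores."""
        scores = []
        for v, (lo, hi, width, probs, floor) in zip(x, self.hists):
            if v < lo or v > hi:
                p = floor
            else:
                p = probs[min(int((v - lo) / width), self.bins - 1)]
            scores.append(-math.log(p))
        return scores

    def score(self, x):
        """Raw HBOS score: sum of the independent per-feature scores."""
        return sum(self.feature_scores(x))


# Toy baseline of (pps, unique ports); the values are made up.
hbos_baseline = [(10 + i % 5, 1 + i % 3) for i in range(100)]
hbos = HBOS(bins=10).fit(hbos_baseline)
common_raw = hbos.score((12, 2))
rare_raw = hbos.score((500, 60))
```

Because the same baseline and the same bins always give the same histograms, the score is deterministic, and `feature_scores` shows which feature drove the result.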
🟦 eHBOS (Ensemble HBOS)
- An ensemble extension of HBOS that builds many random feature subspaces.
- Each subspace runs its own HBOS; results are aggregated (max or mean).
- Captures interactions between features while staying lightweight and interpretable.
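The subspace ensemble described above might look roughly like this. It is a sketch under assumptions: the parameter names mirror the `bins`, `subspaces`, `size`, and `agg` settings mentioned in this post, but the defaults, the `seed` argument, and the smoothing are invented for the example. Since plain HBOS scores each feature independently, the sketch fits one histogram per feature and reuses it in every subspace that contains that feature.

```python
import math
import random


def fit_histogram(values, bins):
    """Equal-width histogram with add-one smoothing for one feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = len(values)
    probs = [(c + 1) / (n + bins) for c in counts]
    return lo, hi, width, probs, 1 / (n + bins)


def feature_score(v, hist):
    """-log(probability) of v; out-of-range values get the floor probability."""
    lo, hi, width, probs, floor = hist
    if v < lo or v > hi:
        return -math.log(floor)
    return -math.log(probs[min(int((v - lo) / width), len(probs) - 1)])


class EHBOS:
    def __init__(self, bins=10, subspaces=8, size=2, agg="max", seed=0):
        if agg not in ("max", "mean"):
            raise ValueError("agg must be 'max' or 'mean'")
        self.bins, self.subspaces, self.size = bins, subspaces, size
        self.agg, self.seed = agg, seed

    def fit(self, rows):
        n_features = len(rows[0])
        rng = random.Random(self.seed)
        self.hists = [
            fit_histogram([r[f] for r in rows], self.bins)
            for f in range(n_features)
        ]
        k = min(self.size, n_features)
        # Each subspace is a random subset of feature indices.
        self.subsets = [
            rng.sample(range(n_features), k) for _ in range(self.subspaces)
        ]
        return self

    def score(self, x):
        per_subspace = [
            sum(feature_score(x[f], self.hists[f]) for f in subset)
            for subset in self.subsets
        ]
        if self.agg == "max":
            return max(per_subspace)
        return sum(per_subspace) / len(per_subspace)


# Toy four-feature baseline; the values are made up.
ehbos_baseline = [(10 + i % 5, 1 + i % 3, 40 + i % 7, i % 2) for i in range(210)]
ehbos_max = EHBOS(agg="max").fit(ehbos_baseline)
ehbos_mean = EHBOS(agg="mean").fit(ehbos_baseline)
normal_flow = (12, 2, 43, 1)
odd_flow = (900, 80, 1500, 9)
```

With `agg="max"` a single highly unusual subspace is enough to raise the score, while `agg="mean"` needs the oddness to show up across many subspaces; which one suits a deployment is a tuning decision.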
3️⃣ Fusion: Combining the Three
Each model outputs a normalized score (0–1). We combine them linearly using configurable weights:
fused = 0.55 * iForest + 0.30 * eHBOS + 0.15 * HBOS
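As a worked example of the formula above, the following sketch computes the fused score and each model's weighted contribution. The function and dictionary names are hypothetical; the weights are the ones quoted in this post, and the input values are copied from one of the sample ANOMALY log lines later in the post.

```python
# Weights quoted in the post; the names used here are illustrative.
WEIGHTS = {"iforest": 0.55, "ehbos": 0.30, "hbos": 0.15}


def fuse(scores, weights=WEIGHTS):
    """Return (fused score, per-model weighted contributions)."""
    contributions = {name: weights[name] * scores[name] for name in weights}
    return sum(contributions.values()), contributions


# Values taken from a sample log line: iforest=-0.1248,
# ehbos_norm=0.9878, hbos_norm=0.9959.
fused, parts = fuse({"iforest": -0.1248, "ehbos": 0.9878, "hbos": 0.9959})
print(round(fused, 4), {k: round(v, 4) for k, v in parts.items()})
```

Rounded to four decimals, the result agrees with the `fused`, `if_w`, `ehbos_w`, and `hbos_w` fields of that log line, which suggests the `*_w` fields are simply weight times score.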
Then apply optional gates:
require_hbos_percentile = 0.99
require_ehbos_percentile = 0.98
This means:
“Only log a sample if HBOS ranks it in its top 1% and eHBOS ranks it in its top 2% of most unusual samples.”
Finally, only fused scores above a dynamic threshold, the current tick’s mean fused score times a multiplier (e.g. mean × 1.25),
are logged to risk.log.
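The gating and thresholding steps can be sketched as follows. Several things here are assumptions rather than confirmed behavior: that each gate compares a raw score against a cut-off τ taken as a percentile of baseline raw scores, that the percentile uses the nearest-rank method, and all function names. The 1.25 multiplier is the example value from the text.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of values for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def passes_gates(hbos_raw, ehbos_raw, hbos_tau, ehbos_tau):
    """Both detectors must place the sample at or above their cut-off."""
    return hbos_raw >= hbos_tau and ehbos_raw >= ehbos_tau


def dynamic_threshold(fused_scores, factor=1.25):
    """Return (mean, threshold) for one tick's fused scores."""
    mu = sum(fused_scores) / len(fused_scores)
    return mu, mu * factor


def should_log(fused, tick_scores, hbos_raw, ehbos_raw, hbos_tau, ehbos_tau):
    """Apply the gates first, then the tick-relative threshold."""
    if not passes_gates(hbos_raw, ehbos_raw, hbos_tau, ehbos_tau):
        return False
    _, thr = dynamic_threshold(tick_scores)
    return fused > thr


# In practice the cut-offs would come from baseline scores, for example:
#   hbos_tau = percentile(baseline_hbos_raw, 0.99)
#   ehbos_tau = percentile(baseline_ehbos_raw, 0.98)
mu, thr = dynamic_threshold([0.2984])
```

A tick mean of 0.2984 gives a threshold of 0.3730, the same `mu` and `thr` pair that appears in the sample RISK line later in the post.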
4️⃣ Why This Matters
| Model | Strength | Weakness |
|---|---|---|
| iForest | Non-linear, robust to noise | Needs periodic retraining, stochastic |
| HBOS | Instant scoring, explainable | Treats features as independent |
| eHBOS | Captures feature interactions, lightweight | Slightly slower |
By fusing them:
- We smooth out iForest’s randomness with HBOS stability.
- We add subspace sensitivity via eHBOS.
- The final fused score reflects both global rarity (iForest)
and local statistical oddness (HBOS/eHBOS).
5️⃣ Real Effect
After the change:
- risk.log now shows non-zero ehbos_norm and balanced fused scores.
- The detector became more robust to benign bursts, but faster to catch structured scans and floods.
- Logs are fully explainable, showing each model’s contribution.
ANOMALY label=iforest_anomaly fused=0.82 iforest=0.77 hbos_norm=0.90 ehbos_norm=0.95
if_w=0.42 hbos_w=0.14 ehbos_w=0.28
2025/11/05 08:12:20 [2025-11-05T06:12:20Z] ANOMALY label=iforest_anomaly fused=0.3771 iforest=-0.1248 hbos_norm=0.9959 ehbos_norm=0.9878 if_w=-0.0686 hbos_w=0.1494 ehbos_w=0.2963 hbos_raw=12.15(τ=5.33) ehbos_raw=11.02(τ=10.65) src=212.11.64.179 PTR=Server3009 ASN=AS42624 (Global-Data System IT Corporation) CC=SC example_dst=194.153.116.31:7035/tcp x1 shape=[syn-heavy] feats={0.15 6.6 44 8 9 1 0} count=337
2025/11/05 08:12:22 [2025-11-05T06:12:22Z] ANOMALY label=iforest_anomaly fused=0.3819 iforest=-0.1220 hbos_norm=0.9959 ehbos_norm=0.9986 if_w=-0.0671 hbos_w=0.1494 ehbos_w=0.2996 hbos_raw=9.55(τ=5.33) ehbos_raw=17.02(τ=10.65) src=221.145.31.23 PTR=- ASN=AS4766 (Korea Telecom) CC=KR example_dst=194.153.116.180:4200/tcp x1 shape=[syn-heavy single-port] feats={0.6166666666666667 24.666666666666668 40 37 1 1 0} count=4
2025/11/05 08:12:56 [2025-11-05T06:12:56Z] RISK fused=0.4560 if=0.1047 hbos_norm=0.8601 ehbos_norm=0.8981 hbos_raw=-3.25 mu=0.2984 thr=0.3730 src=171.234.9.174 PTR=dynamic-ip-adsl.viettel.vn ASN=AS7552 (Viettel Group) CC=VN example_dst=84.54.49.204:443/tcp x1 shape=[syn-heavy single-dst-ip single-port] feats={0.6666666666666666 71.43333333333334 107.15 1 1 1 0}
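Because every ANOMALY line carries the per-model contributions, the explainability claim can be checked mechanically. The sketch below is a hypothetical helper, not part of flowenricher: it pulls the numeric score fields out of a line with regular expressions and confirms that `if_w + hbos_w + ehbos_w` adds up to `fused`. The sample line is an excerpt of one of the log lines above. In these samples the RISK line has no `*_w` fields, so the check applies to ANOMALY lines only.

```python
import re

SCORE_KEYS = ("fused", "iforest", "hbos_norm", "ehbos_norm",
              "if_w", "hbos_w", "ehbos_w")


def parse_scores(line):
    """Extract the known numeric score fields from one log line."""
    fields = {}
    for key in SCORE_KEYS:
        # \b stops "hbos_w" from matching inside "ehbos_w".
        match = re.search(rf"\b{key}=(-?\d+(?:\.\d+)?)", line)
        if match:
            fields[key] = float(match.group(1))
    return fields


def contributions_match(fields, tol=1e-3):
    """True when the three weighted contributions sum to the fused score."""
    total = fields["if_w"] + fields["hbos_w"] + fields["ehbos_w"]
    return abs(total - fields["fused"]) <= tol


sample_line = (
    "ANOMALY label=iforest_anomaly fused=0.3771 iforest=-0.1248 "
    "hbos_norm=0.9959 ehbos_norm=0.9878 if_w=-0.0686 hbos_w=0.1494 "
    "ehbos_w=0.2963 hbos_raw=12.15(τ=5.33) ehbos_raw=11.02(τ=10.65) "
    "src=212.11.64.179"
)
parsed = parse_scores(sample_line)
```

A check like this is a cheap sanity test after changing the fusion weights: if the contributions stop summing to the fused score, the log format and the scoring code have drifted apart.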
6️⃣ TL;DR — What We Achieved
- ✅ Moved from single iForest → hybrid iForest + HBOS + eHBOS
- ✅ Added percentile-based gating and dynamic mean adaptation
- ✅ Achieved higher precision, less noise, and better explainability
- ✅ Added YAML-configurable eHBOS parameters (bins, subspaces, size, agg)
- ✅ Improved startup observability