International article

Temporal Degradation in Machine Learning-Based Malware Detection: A Multi-Dataset, Multi-Year Empirical Study

Huỳnh Trọng Thưa

Machine-learning malware detectors achieve near-perfect accuracy at deployment yet silently degrade as threats evolve. We present a multi-dataset temporal study of this concept drift on 1.68 million Portable-Executable samples from EMBER 2017, EMBER 2018, and BODMAS (2019–2020), unified in the EMBER v2 feature space and analyzed with three classifier families (LightGBM, Random Forest, MLP) across nine experimental dimensions: in-era baselines, cross-era transfer, monthly drift tracking, incremental retraining, family-level false-negative decomposition, feature-group sensitivity, cumulative Area-Under-Time (AUT) analysis, drift-triggered retraining (ADWIN, DDM), and active-learning sample selection, with 10-seed statistical validation. Six findings emerge. (1) Forward degradation is asymmetric: under a strict appeared-year split, training on 2017 data loses 8.47 percentage points (pp) F1 on 2018 data (LightGBM, 10 seeds), whereas the reverse direction shows no degradation. (2) Unseen malware families dominate failures, with false-negative rates up to 23.92% and same-month ratios exceeding 30× relative to known families in the strongest case. (3) Cross-era robustness is feature-group dependent: SectionInfo and ImportsInfo dominate transfer (+0.85 and +0.37 pp respectively when retained), while HeaderFileInfo and StringExtractor act as temporal artifacts: zeroing them improves cross-era F1 by 0.67 and 0.47 pp respectively. (4) Incremental retraining with only 1% newly labeled data gains +0.56 pp cumulative AUT over a static baseline. (5) ADWIN/DDM-triggered retraining matches that AUT within 0.07–0.13 pp on LightGBM while issuing roughly 33–35% fewer retrains, exposing a label-budget vs. accuracy trade-off. (6) Uncertainty sampling delivers a +0.76 pp AUT improvement over random sampling at identical labeling cost (p = 0.0020, Wilcoxon). Together the results form a five-way mitigation ladder (static, fixed 1%/month, ADWIN-triggered, DDM-triggered, and uncertainty-sampled) that practitioners can position along their labeling-budget and AUT requirements.
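
As a rough illustration of two quantities the abstract relies on, the sketch below computes a trapezoidal Area-Under-Time score over equally spaced test periods and selects the samples a binary detector is least sure about. Both definitions are assumptions: the abstract does not state the exact AUT formula or uncertainty criterion used, the function names `area_under_time` and `uncertainty_sample` are hypothetical, and the monthly F1 values are invented for the example rather than taken from the study.

```python
from typing import List, Sequence


def area_under_time(scores: Sequence[float]) -> float:
    """Trapezoidal area under a per-period metric curve, normalised by
    the number of intervals so the result stays on the metric's scale.
    Assumes equally spaced test periods (e.g. one score per month)."""
    if len(scores) < 2:
        raise ValueError("need at least two test periods")
    pairs = zip(scores[:-1], scores[1:])
    return sum((a + b) / 2.0 for a, b in pairs) / (len(scores) - 1)


def uncertainty_sample(probs: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` samples whose predicted malware
    probability lies closest to 0.5 (assumed uncertainty criterion)."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:budget]


# Hypothetical monthly F1 scores of a static model (not from the paper).
monthly_f1 = [0.98, 0.96, 0.93, 0.91]
aut_static = area_under_time(monthly_f1)

# Hypothetical predicted probabilities for five unlabeled samples.
picked = uncertainty_sample([0.99, 0.52, 0.10, 0.45, 0.80], budget=2)
```

Under these assumptions, a gain such as the reported +0.56 pp would correspond to a difference between two `area_under_time` values computed on the same sequence of test months, one for the retrained model and one for the static baseline.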

Published in:

Temporal Degradation in Machine Learning-Based Malware Detection: A Multi-Dataset, Multi-Year Empirical Study


Publisher:

IEEE Access

Keywords:

Concept drift, malware detection, machine learning, temporal analysis, PE malware, intrusion detection systems