DistilBERT for Efficient and Accurate Email Phishing Detection: A Benchmark Against Machine and Deep Learning Models
Email phishing remains a persistent cybersecurity threat that exploits human vulnerabilities,
often evading technical safeguards. While machine learning (ML) and deep learning (DL)
have been widely applied for phishing detection, systematic benchmarks comparing
lightweight transformer models with traditional approaches remain limited. This study
addresses this gap by evaluating six models—Naïve Bayes, Random Forest, XGBoost,
LSTM, BiLSTM, and a fine-tuned DistilBERT—on a real-world dataset of 17,538 emails
using three train-test splits (60:40, 70:30, 80:20). DistilBERT consistently outperforms all
baselines across all splits. Under the 80:20 split, it achieves the highest accuracy (98.77%),
precision (99.10%), recall (98.97%), F1-score (99.02%), and AUC (99.91%). It also
keeps computational overhead low, with a training time of 342 seconds, indicating
a favorable trade-off between detection accuracy and efficiency. In contrast, BiLSTM, the
best-performing recurrent model, reaches 97.43% accuracy but produces more false
negatives—a more critical security risk than false positives in phishing detection.
Additional experiments reveal that DistilBERT maintains stable performance across
different data splits, with AUC values consistently above 99.8%. The confusion matrix
analysis shows that DistilBERT misclassifies only 25 legitimate emails as phishing (false
positives) and misses only 23 phishing emails (false negatives), outperforming
all baseline models. These findings indicate that lightweight transformer models like
DistilBERT offer a practical, scalable, and cost-effective solution for real-time phishing
email detection, effectively bridging the gap between high accuracy and production-ready
deployability.
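
The reported metrics all derive from the four confusion-matrix counts. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. It treats phishing as the positive class. The false-positive and false-negative counts (25 and 23) come from the abstract; the true-positive and true-negative counts are not reported here, so the values used below are hypothetical placeholders and the output will not reproduce the published percentages. The names `ConfusionCounts` and `classification_metrics` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # phishing emails correctly flagged
    fp: int  # legitimate emails wrongly flagged as phishing
    fn: int  # phishing emails that were missed
    tn: int  # legitimate emails correctly passed through


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from raw counts."""
    total = c.tp + c.fp + c.fn + c.tn
    precision = c.tp / (c.tp + c.fp)  # share of flagged emails that are phishing
    recall = c.tp / (c.tp + c.fn)     # share of phishing emails that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# fp and fn are taken from the abstract; tp and tn are ASSUMED placeholders.
example = ConfusionCounts(tp=1000, fp=25, fn=23, tn=1000)
metrics = classification_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall is the metric most directly tied to false negatives: each missed phishing email lowers it, which is why the abstract treats false negatives as the more critical error type.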
Published in:
DistilBERT for Efficient and Accurate Email Phishing Detection: A Benchmark Against Machine and Deep Learning Models
Publisher:
Ingénierie des Systèmes d’Information
Keywords:
phishing email detection, DistilBERT, lightweight transformer, comparative benchmarking, computational efficiency, real-time cybersecurity