PTIT Knowledge Portal

A feature-engineered dataset of benign and phishing URLs for machine learning and large language models evaluation

Tran Cong Hung

Phishing websites remain a major cybersecurity threat, yet balanced and feature-rich datasets for evaluating detection models are still scarce. While machine learning (ML) and large language models (LLMs) have shown strong potential in URL-based classification, most public datasets provide raw URLs without feature engineering, making reproducibility and fair comparison across models difficult. To address this gap, we present a curated dataset of 111,660 URLs, consisting of 100,000 benign samples (label 0) and 11,660 phishing samples (label 1). Each URL entry is enriched with 22 numerical lexical and structural features (e.g., URL length, domain length, digit ratio, entropy, HTTPS usage). Three string reference columns (URL, domain, TLD) are preserved for interpretability, and one label column (0 = benign, 1 = phishing) is included, for a total of 26 columns. To demonstrate its utility, we evaluate two baseline approaches: a Random Forest (RF) classifier using handcrafted features, and a MiniLM embedding model with Logistic Regression (LR). Both achieved accuracy above 96% and ROC AUC scores exceeding 0.99 across training, validation, and test splits. This dataset represents an important step toward building reproducible and comparable benchmarks for phishing detection, bridging traditional ML and LLM-based approaches, and supporting future research on adversarial robustness and scalable security models.
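
To make the kind of lexical features named in the abstract concrete, the following is a minimal Python sketch that computes five of them (URL length, domain length, digit ratio, entropy, HTTPS usage) for a single URL. The function names, the output keys, and the exact definitions (for example, character-level Shannon entropy over the whole URL string) are assumptions made for illustration; the dataset's own feature definitions and column names may differ.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Compute a few illustrative lexical features (hypothetical names)."""
    parts = urlsplit(url)
    # hostname is None when the URL has no "scheme://" prefix, so fall
    # back to an empty string in that case.
    domain = parts.hostname or ""
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "domain_length": len(domain),
        "digit_ratio": digits / len(url) if url else 0.0,
        "entropy": shannon_entropy(url),
        "uses_https": int(parts.scheme == "https"),
    }


if __name__ == "__main__":
    # Example URL chosen for illustration only; it is not from the dataset.
    print(extract_features("https://example.com/a1"))
```

Applying `extract_features` to every URL in a list would give one row of numeric values per URL, which is the general shape of input a classifier such as a Random Forest expects.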

Published in:

A feature-engineered dataset of benign and phishing URLs for machine learning and large language models evaluation

Publication date:

2025

DOI:

https://doi.org/10.1016/j.dib.2025.112162

Publisher:

Data in Brief

Keywords:

Artificial intelligence (AI), Cybersecurity, Data science, Feature-engineered dataset, Large language models (LLMs), Machine learning (ML), Natural language processing (NLP), URL classification
