WikiPhish: A Diverse Wikipedia-Based Dataset for Phishing Website Detection

Abstract

Phishing remains a pervasive security threat, necessitating effective and universally comparable detection systems. The use of supervised machine learning models for phishing detection has become widespread in the literature as a way to automate predictions and increase the detection capacities of security systems. These models rely on large amounts of annotated data for their training, evaluation, and maintenance. Thus, there is a need to efficiently collect a significant amount of annotated data to improve phishing detection. This paper introduces WikiPhish, a novel, renewable, and open-access dataset for phishing website classification. It consists of 110,606 webpages harvested from URLs drawn from Wikipedia’s references and the popular phishing databases OpenPhish and PhishTank. The dataset is designed to address the challenges of phishing detection by leveraging Wikipedia’s contribution verification and wide-ranging content. WikiPhish offers a more diverse and robust baseline for developing phishing detection models. We highlight the importance of gathering diverse URLs for building phishing website datasets, and demonstrate the practical utility of WikiPhish by employing it in the training and evaluation of phishing detection machine learning models.

Cite

@inproceedings{10.1145/3626232.3653283,
author = {Loiseau, Gabriel and Lefils, Valentin and Meyer, Maxime and Riquet, Damien},
title = {WikiPhish: A Diverse Wikipedia-Based Dataset for Phishing Website Detection: Data/Toolset Paper},
year = {2024},
isbn = {9798400704215},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626232.3653283},
doi = {10.1145/3626232.3653283},
booktitle = {Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy},
pages = {361--366},
numpages = {6},
keywords = {datasets, machine learning, phishing website detection, web security},
location = {Porto, Portugal},
series = {CODASPY '24}
}