Integrating Three Machine Learning Algorithms in Ensemble Learning Model for Improving Content-based Spam Email Recognition

Ali Q. Saeed; Mohammed Hasan Aldulaimi; Ismail Abdulwahhab Ismail; Ibrahim M. Ahmed; Yahya Ahmed Yahya; Qasem M. Kharma; Taher M. Ghazal

Authors

Ali Q. Saeed Department of Artificial Intelligence, Technical Engineering College for Computer and AI, Northern Technical University
Mohammed Hasan Aldulaimi Department of Computer Techniques Engineering, College of Engineering, Al-Mustaqbal University, 51001 Hillah, Babylon, Iraq
Ismail Abdulwahhab Ismail Department of Translation, College of Arts, Alnoor University, Mosul, 41012, Iraq
Ibrahim M. Ahmed College of Computer Sciences and Mathematics, University of Mosul, Mosul, Iraq https://orcid.org/0000-0002-2380-8550
Yahya Ahmed Yahya Technical Engineering College for Computer and AI, Northern Technical University, Mosul, 41000, Ninevah, Iraq https://orcid.org/0000-0002-8271-0991
Qasem M. Kharma Hourani Center for Applied Scientific Research, Al-Ahliyya Amman University https://orcid.org/0009-0001-1433-017X
Taher M. Ghazal University of Buraimi

Keywords:

Email spam, machine learning, classification, ensemble, random forest, naive Bayes, logistic regression

Abstract

Email spam refers to junk files, images, or data sent through email that may contain links to phishing websites. Such email is often sent repeatedly to random users and can be dangerous. The objective of this study is to predict whether emails sent to users are spam by using machine learning classification algorithms. The Email Spam Classification (ESC) dataset is used for the spam detection tests; it contains 5172 rows and 3002 columns of spam and non-spam features. The study follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology to guide the evaluation of three machine learning algorithms: Naive Bayes (NB), Logistic Regression (LR), and Random Forest (RF). Subsequently, an ensemble model that integrates the three algorithms is proposed to improve the performance of spam email recognition. The selected evaluation metrics are accuracy, precision, recall, and F1-score. Based on the results, RF has the highest accuracy of the three individual algorithms in classifying spam emails, 97.3%, with an F1-score of 96.8%, precision of 96.2%, and recall of 96.0%. NB achieves the second-best results, which differ only slightly from those of RF, and LR achieves considerably lower results than the other two algorithms. The ensemble model that integrates the three algorithms performs best in classifying spam emails, with 98.9% accuracy, 97.6% precision, 97.4% recall, and 96.7% F1-score.
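The idea of combining three classifiers and scoring the result with the four metrics named above can be illustrated with a minimal Python sketch. The abstract does not state how the ensemble combines its members, so hard majority voting is an assumption here, and the label lists are hypothetical toy values rather than outputs of the study's models or the ESC data. The helper names `majority_vote` and `binary_metrics` are likewise illustrative.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists by hard majority vote.

    Assumption: the ensemble uses simple majority voting; the abstract
    does not specify the combination rule. Labels: 1 = spam, 0 = not spam.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels and predictions (not taken from the study).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
nb_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # stands in for Naive Bayes output
lr_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # stands in for Logistic Regression output
rf_pred = [1, 0, 1, 1, 0, 0, 1, 1]  # stands in for Random Forest output

ensemble_pred = majority_vote(nb_pred, lr_pred, rf_pred)

for name, pred in [("NB", nb_pred), ("LR", lr_pred),
                   ("RF", rf_pred), ("Ensemble", ensemble_pred)]:
    print(name, binary_metrics(y_true, pred))
```

In this toy case each model makes mistakes on different emails, so the vote corrects them; with real models the gain depends on how correlated their errors are. With three voters and two classes a tie cannot occur, which is one practical reason to use an odd number of members.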