Integrating Three Machine Learning Algorithms in Ensemble Learning Model for Improving Content-based Spam Email Recognition
Keywords:
Email spam, machine learning, classification, ensemble, random forest, naive Bayes, linear regression
Abstract
Email spam refers to junk files, images, or data sent through email that may contain links leading to phishing websites. Such email is often sent repeatedly to random users and can be dangerous. The objective of this study is to predict whether emails sent to users are spam by using machine learning classification algorithms. The Email Spam Classification (ESC) dataset is used for the spam detection tests; it contains 5172 rows and 3002 columns of spam and non-spam features. The study follows the CRISP-DM methodology to guide the evaluation of three machine learning algorithms: Naive Bayes (NB), Logistic Regression (LR), and Random Forest (RF). Subsequently, an ensemble model that integrates the three algorithms is proposed to improve the performance of spam email recognition. The selected evaluation metrics are accuracy, precision, recall, and F1-score. Based on the results, RF has the highest accuracy of the three individual algorithms at 97.3%, with an F1-score of 96.8%, precision of 96.2%, and recall of 96.0%. NB achieves the second-best results, which differ only slightly from those of RF, while LR achieves considerably lower results than the other two algorithms. The ensemble model that integrates the three algorithms performs best in classifying spam emails, with 98.9% accuracy, 97.6% precision, 97.4% recall, and 96.7% F1-score.
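The workflow summarised in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation: the abstract does not say how the three models are combined, so soft voting via `VotingClassifier` is an assumption, as are the multinomial variant of NB, all hyperparameters, and the 75/25 split. The data are synthetic word counts generated by a hypothetical helper, `make_synthetic_counts`, standing in for the ESC dataset, so the printed scores say nothing about the results reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def make_synthetic_counts(n_samples=600, n_features=50, seed=0):
    """Hypothetical stand-in for the ESC data: Poisson word counts, label 1 = spam."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)
    rates = np.ones((2, n_features))
    rates[1, :10] = 4.0    # words that occur more often in spam
    rates[0, 10:20] = 4.0  # words that occur more often in non-spam
    X = rng.poisson(rates[y])  # one row of counts per email
    return X, y


def build_ensemble(seed=0):
    """Combine NB, LR, and RF by soft voting (an assumed combination rule)."""
    return VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
        ],
        voting="soft",  # average the predicted class probabilities
    )


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return the four metrics named in the abstract."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }


X, y = make_synthetic_counts()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scores = evaluate(build_ensemble(), X_train, X_test, y_train, y_test)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The same `evaluate` helper can be called on each base estimator separately to compare the individual models against the ensemble on identical splits.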
License
Copyright (c) 2024 Journal of Soft Computing and Data Mining

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.









