Detection of topic on Health News in Twitter Data

Authors

  • Shum Chen Yau, Data Science Research Lab, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah, MALAYSIA
  • Juhaida Abu Bakar, Medical Devices and Life Sciences Cluster, Sport Engineering Research Centre, Centre of Excellence (SERC), Universiti Malaysia Perlis (UniMAP), 02600 Arau, Perlis, MALAYSIA
  • Azian Azamimi Abdullah, Medical Devices and Life Sciences Cluster, Sport Engineering Research Centre, Centre of Excellence (SERC), Universiti Malaysia Perlis (UniMAP), 02600 Arau, Perlis, MALAYSIA
  • Hazlyna Harun, Data Science Research Lab, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah, MALAYSIA
  • Ruziana Mohamad Rasli, Department of Information Technology and Communication, Tuanku Syed Sirajuddin Polytechnic, Pauh Putra, 02600 Arau, Perlis, MALAYSIA
  • Lim Zheng Yang, Data Science Research Lab, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah, MALAYSIA
  • Evon Thum Yi Mun, Data Science Research Lab, School of Computing, UUM College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah, MALAYSIA

Keywords:

Prolonged sitting, muscle activity, exercises on prolonged sitting

Abstract

The development and rapid popularization of the internet have led to exponential growth of data on the network, making text mining increasingly important. Users must search for information within the immense volume available online. Obtaining valuable information and automatically classifying, organizing and managing vast amounts of text data make text processing even more difficult. To address these problems and requirements, intelligent information processing has been studied extensively, and topic modelling has been widely employed in the field of natural language processing. Current research focuses on improving the speed and accuracy of text classification and topic detection, as well as on selecting feature methods that achieve better dimensionality reduction. The Latent Dirichlet Allocation (LDA) topic model works well for reducing noise in data, and it is widely used as a feature model combined with a classifier to achieve good classification results. This study aims to perform data mining while reducing the load imposed by a huge database. Three supervised learning algorithms are run: Naïve Bayes, Decision Tree and Random Forest. The Random Forest classifier outperforms the other two classifiers, with 99.99% accuracy. Seven clusters for topic modelling are revealed using the Random Forest classifier. The output for each cluster is limited to its four highest-weighted words, showing each term and its weight. The most frequently used term in the dataset is ‘Ebola’. The findings of this study show that combining LDA with a supervised learning algorithm effectively addresses the problem of data sparseness in short text sets. Selecting the microblogs that are most likely to discuss news topics significantly reduces the amount of data that must be considered and, to a certain extent, eliminates the interference of non-news blogs.
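The step of reporting the four highest-weighted words for each topic can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `top_terms`, the `topics` structure and every weight value below are hypothetical, and the example assumes that each topic is available as a mapping from term to weight.

```python
from typing import Dict, List, Tuple


def top_terms(topic_weights: Dict[str, float], n: int = 4) -> List[Tuple[str, float]]:
    """Return the n highest-weighted terms of one topic, heaviest first."""
    # Sort by descending weight; break ties alphabetically so the output is stable.
    ranked = sorted(topic_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Hypothetical term weights for two topics (illustrative values, not from the study).
topics = [
    {"ebola": 0.09, "outbreak": 0.05, "virus": 0.04, "health": 0.03, "care": 0.01},
    {"cancer": 0.07, "study": 0.06, "risk": 0.04, "drug": 0.02, "heart": 0.02},
]

# One list of (term, weight) pairs per topic.
summaries = [top_terms(t) for t in topics]

for i, terms in enumerate(summaries):
    print(f"Topic {i}: " + ", ".join(f"{word} ({weight:.2f})" for word, weight in terms))
```

In a real pipeline the per-topic weights would come from a fitted topic model rather than from hand-written dictionaries.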

Published

01-11-2021

Section

Articles

How to Cite

Chen Yau, S., Abu Bakar, J., Abdullah, A. A., Harun, H., Mohamad Rasli, R., Yang, L. Z., & Yi Mun, E. T. (2021). Detection of topic on Health News in Twitter Data. Emerging Advances in Integrated Technology, 2(2), 23-29. https://penerbit.uthm.edu.my/ojs/index.php/emait/article/view/9961