Content Classification Based on Latent Semantic Analysis and Support Vector Machine (LSA-SVM)

Gita Indah Marthasari, Nur Hayatin, Maulidya Yuniarti

Abstract


The diversity of web content can be harmful when it reaches the wrong audience. Almost half of internet users are children. It is therefore important to classify web pages to determine which pages are suitable for children and which are not. One method that can be used is the Support Vector Machine (SVM) algorithm. SVM is a binary classifier that works by finding the best hyperplane to separate two classes. To obtain better classification accuracy, SVM is combined with the Latent Semantic Analysis (LSA) algorithm. The data used in this study were taken from DMOZ web data that had already been classified into two categories. The data are first pre-processed, and features are then extracted using LSA. The LSA algorithm is used to identify semantic similarities among the words and text contained in web pages. The extracted features are then classified using an SVM with an RBF kernel. In testing, we obtained a classification accuracy of 64%.
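As a rough illustration of the final classification step, the sketch below shows how an SVM with an RBF kernel scores a feature vector once training has produced support vectors and coefficients. It is not the authors' implementation: the function names, the `gamma` value, and the two support vectors and their coefficients are hypothetical values chosen only to make the example self-contained, and the two-dimensional vectors stand in for LSA-reduced document features.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Signed score f(x) = sum_i coef_i * K(sv_i, x) + bias."""
    score = sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )
    return score + bias


def predict(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Binary label: +1 on one side of the decision boundary, -1 on the other."""
    return 1 if svm_decision(x, support_vectors, dual_coefs, bias, gamma) >= 0 else -1


# Hypothetical "trained" model (assumed values, not taken from the study):
# one support vector per class in a 2-D feature space.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
dual_coefs = [1.0, -1.0]  # label times Lagrange multiplier for each support vector
bias = 0.0

label_near_first = predict((0.1, 0.1), support_vectors, dual_coefs, bias)
label_near_second = predict((1.9, 1.9), support_vectors, dual_coefs, bias)
```

In practice the support vectors, coefficients, and bias come from training on labelled data, and `gamma` is a tuning parameter; a point is assigned to the class whose support vectors it lies closest to in the kernel-induced sense.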

Keywords


web page classification; support vector machine; latent semantic analysis

DOI: http://dx.doi.org/10.26623/transformatika.v19i2.2745


Jurnal Transformatika: Journal Information Technology by Department of Information Technology, Faculty of Information Technology and Communication, Semarang University is licensed under a Creative Commons Attribution 4.0 International License.