Content Classification based-on Latent Semantic Analysis and Support Vector Machine (LSA-SVM)
DOI:
https://doi.org/10.26623/transformatika.v19i2.2745
Keywords:
web page classification, support vector machine, latent semantic analysis
Abstract
The diversity of web page content can have a negative impact when it reaches the wrong audience. Almost half of internet users are children. It is therefore important to classify web pages to determine which pages are suitable for children and which are not. One method that can be used is the Support Vector Machine (SVM) algorithm. SVM is a binary classifier whose working principle is to find the best hyperplane separating the two classes. To obtain better classification accuracy, the SVM is combined with the Latent Semantic Analysis (LSA) algorithm. The data used in this study were taken from the DMOZ web data, which has been classified into two categories. The data are first pre-processed, and features are then extracted using LSA. The LSA algorithm is used to capture the semantic similarity of the words and text contained in web pages. The extracted features are then classified using an SVM with an RBF kernel. In testing, the approach achieves a classification accuracy of 64%.
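The pipeline described in the abstract (pre-processed text, LSA feature extraction, then an SVM with an RBF kernel) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses TF-IDF weighting followed by truncated SVD as the LSA step, and runs on a small hypothetical corpus rather than the DMOZ data. The function name `build_lsa_svm`, the number of latent dimensions, and the toy labels are all assumptions made for the example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for pre-processed web page text.
# Label 1 = suitable for children, 0 = not suitable (illustrative only).
pages = [
    "fun math games and puzzles for kids",
    "coloring pages and bedtime stories for children",
    "learn the alphabet with songs and cartoons",
    "science experiments for young students at school",
    "online casino poker betting and jackpots",
    "sports betting odds and gambling tips",
    "violent crime reports and graphic footage",
    "adult nightlife bars and casino gambling",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_lsa_svm(n_topics=3, seed=0):
    """Build a TF-IDF -> truncated SVD (LSA) -> RBF-kernel SVM pipeline.

    n_topics is an assumed number of latent dimensions; the source does
    not state the value used in the study.
    """
    return make_pipeline(
        TfidfVectorizer(stop_words="english"),                   # term weighting
        TruncatedSVD(n_components=n_topics, random_state=seed),  # LSA projection
        SVC(kernel="rbf", gamma="scale"),                        # binary classifier
    )


model = build_lsa_svm()
model.fit(pages, labels)

# Classify two unseen (hypothetical) page texts.
predictions = model.predict([
    "puzzle games and stories for kids",
    "poker and betting tips",
])
print(predictions)
```

On a real data set, the number of latent dimensions and the SVM parameters `C` and `gamma` would normally be tuned on held-out data, since both affect the reported accuracy.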
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
Transformatika is licensed under a Creative Commons Attribution 4.0 International License.