Nowadays, with the Internet infrastructure and nearly global access, the amount and diversity of data are increasing rapidly. Many tasks require information retrieval and data collection for machine learn- ing, research, and survey reports in various fields such as meteorology, science, geography, literature, and more. However, manual data collection and classification can be time-consuming and prone to errors. Addition- ally, AI assistants used for drafting or writing can sometimes be corrected regarding writing style and inappropriate language for the given con- text. Faced with these needs, In this article, Vietnamese documents are classified using the TF-IDF method, TF-IDF combined with SVD, and FastText at three levels: word level, n-gram level, and character level. For this approach, 15 categories were gathered from various online news sources. The dataset was preprocessed and trained using machine learn- ing models such as SVM, Naive Bayes, Neural Network, and Random Forest to find the most effective method. The Random Forest combined with the FastText method was highly evaluated, achieving a success rate of 82% when measured against essential evaluation criteria of accuracy, precision, and F1 score.
Số tạp chí Ngoc Thanh Nguyen · Bogdan Franczyk · André Ludwig · Manuel Núñez · Jan Treur · Gottfried Vossen · Adrianna Kozierkiewicz(2024) Trang: 157-169
Tạp chí khoa học Trường Đại học Cần Thơ
Lầu 4, Nhà Điều Hành, Khu II, đường 3/2, P. Xuân Khánh, Q. Ninh Kiều, TP. Cần Thơ
Điện thoại: (0292) 3 872 157; Email: tapchidhct@ctu.edu.vn
Chương trình chạy tốt nhất trên trình duyệt IE 9+ & FF 16+, độ phân giải màn hình 1024x768 trở lên