Document classifiers are supervised learning models that assign labels to documents after being trained on labeled datasets. The accuracy of a classifier depends on the size and quality of its training dataset, which is costly and time-consuming to construct. In addition, a suitable word representation method may improve the quality of a text classifier. In this paper, we study the effect of different word representation methods on 16 classification models trained on a labeled dataset. We then evaluate how well 6 topic models discover latent topics. Based on experimental results from combining classification models and topic models, we propose a method that uses both to label datasets for training classification models. Although we perform experiments on a Vietnamese document dataset, our approach may apply to any dataset and does not require a labeled dataset for bootstrapping.
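The general idea of labeling a dataset with a topic model and then training a classifier on those labels can be illustrated with a minimal sketch. The abstract does not name the specific topic models, classifiers, or word representations used, so everything here is an assumption chosen for illustration: LDA with collapsed Gibbs sampling stands in for the topic model, multinomial naive Bayes stands in for the classifier, and the function names (`lda_gibbs`, `train_nb`, `predict_nb`), hyperparameters, and toy English corpus are hypothetical.

```python
import math
import random
from collections import Counter


def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    doc_topic = [[0] * k for _ in docs]         # n(d, t)
    topic_word = [Counter() for _ in range(k)]  # n(t, w)
    topic_total = [0] * k                       # n(t)
    assign = []
    for d, doc in enumerate(docs):              # random initial assignment
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]                # remove the current assignment
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + beta * vocab_size)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = t                # record the resampled topic
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic


def train_nb(docs, labels, k):
    """Collect multinomial naive Bayes counts from pseudo-labeled documents."""
    vocab = {w for d in docs for w in d}
    word_counts = [Counter() for _ in range(k)]
    class_counts = [0] * k
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
        class_counts[y] += 1
    return vocab, word_counts, class_counts


def predict_nb(model, doc):
    """Return the most probable class under add-one smoothing."""
    vocab, word_counts, class_counts = model
    n_docs, k = sum(class_counts), len(class_counts)
    best, best_score = None, -math.inf
    for y in range(k):
        total = sum(word_counts[y].values())
        score = math.log((class_counts[y] + 1) / (n_docs + k))
        for w in doc:
            if w in vocab:                      # ignore words unseen in training
                score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best


# Hypothetical unlabeled corpus (the paper itself uses Vietnamese documents).
corpus = [
    "football match goal team player",
    "team player coach goal season",
    "stock market bank interest rate",
    "bank rate inflation market investor",
]
docs = [text.split() for text in corpus]
K = 2

# Step 1: the topic model assigns each document its dominant topic as a pseudo-label.
doc_topic = lda_gibbs(docs, K)
pseudo_labels = [max(range(K), key=row.__getitem__) for row in doc_topic]

# Step 2: a classifier is trained on the pseudo-labels, with no hand-labeled data.
model = train_nb(docs, pseudo_labels, K)
prediction = predict_nb(model, "coach team goal".split())
print(pseudo_labels, prediction)
```

The topic indices are arbitrary, so in practice each discovered topic would still need to be mapped to a human-readable category before the pseudo-labels are used as class names.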
Journal: Association for Computational Linguistics (ACL 2023), in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, 2023