Text classification is a field of natural language processing concerned with automatically assigning new documents to predefined classes. It involves not only selecting suitable training models but also integrating several fine-tuned steps, such as pre-processing, transformation, and dimensionality reduction. Researchers either develop new classification models or improve existing approaches by investigating new techniques. An ideal text classifier would mimic how humans assign text to topics: people usually categorize documents by scanning their important words rather than reading the whole text. With this process in mind, the authors propose a keyword-based framework for categorizing documents. The authors collected real text data from various websites and use the TextRank algorithm together with the Jaccard similarity coefficient. A wide range of experiments was conducted, and the authors report that the proposed framework achieves good results.
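The abstract does not say how TextRank and the Jaccard coefficient are combined, so the following is only a minimal sketch of one plausible reading: TextRank picks a document's top keywords, and the document is assigned to the class whose keyword set has the highest Jaccard similarity with them. The function names, the small stopword list, the window size, and the per-class keyword sets are all illustrative assumptions, not the authors' implementation.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for",
             "on", "with", "by", "that", "this", "it", "as", "at", "from"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def textrank_keywords(text, top_k=5, window=2, damping=0.85, iterations=50):
    """Return the top_k words ranked by PageRank on a word co-occurrence graph."""
    words = tokenize(text)
    # Undirected graph: link each word to the next `window` tokens that follow it.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)
    # Unweighted PageRank, run for a fixed number of iterations.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    # Sort by score, breaking ties alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:top_k])


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(text, class_keywords, top_k=5):
    """Pick the class whose keyword set is most similar to the document's keywords."""
    doc_keywords = textrank_keywords(text, top_k=top_k)
    return max(class_keywords, key=lambda label: jaccard(doc_keywords, class_keywords[label]))


# Hypothetical class keyword sets, for illustration only.
CLASS_KEYWORDS = {
    "sports": {"football", "match", "team", "goal", "player"},
    "technology": {"software", "computer", "data", "algorithm", "network"},
}
print(classify("The football team won the match after a late goal by the young player.",
               CLASS_KEYWORDS))
```

In this sketch the class keyword sets are written by hand; they could equally be built by running the same keyword extraction over labeled training documents, which is a design choice the abstract leaves open.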
Can Tho University Journal of Science (Tạp chí Khoa học Trường Đại học Cần Thơ)