Document clustering plays a crucial role in many information retrieval tasks. Existing approaches often struggle to capture the semantic relationships between documents, especially for long and complex texts. To address this issue, we propose SBoC, a novel Segment-based Bag-of-Clusters approach. SBoC first divides each document into segments to capture local semantic information. It then applies clustering algorithms to these segments, forming clusters that represent distinct semantic concepts. Finally, it constructs a Bag-of-Clusters representation for each document, encoding the document's semantic content by the clusters to which its segments are assigned. Although SBoC does not surpass all existing methods, it achieves competitive performance on benchmark datasets, particularly on long and complex texts, and shows promise in capturing semantic relationships between documents. The approach offers a potential way to improve document clustering for information retrieval tasks.
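The three steps described above (segmentation, segment clustering, Bag-of-Clusters encoding) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how segments are formed, how they are represented, or which clustering algorithm is used. The sketch assumes fixed-size word windows as segments, term-count vectors in place of a real semantic embedding, a plain k-means with deterministic initialization, and a normalized cluster histogram as the document representation. All function names and parameters (`split_into_segments`, `vectorize`, `kmeans`, `bag_of_clusters`, `k`, `size`) are hypothetical.

```python
from collections import Counter
import math


def split_into_segments(text, size=4):
    """Split a document into consecutive windows of `size` words (assumed scheme)."""
    words = text.lower().split()
    return [words[i:i + size] for i in range(0, len(words), size)]


def vectorize(segment, vocab):
    """Term-count vector for one segment (a stand-in for a semantic embedding)."""
    counts = Counter(segment)
    return [float(counts[w]) for w in vocab]


def kmeans(vectors, k, iters=20):
    """Plain Lloyd's k-means; initial centroids are evenly spaced input vectors."""
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def bag_of_clusters(documents, k=2, size=4):
    """Return one normalized cluster histogram per (non-empty) document."""
    segmented = [split_into_segments(d, size) for d in documents]
    all_segments = [s for doc in segmented for s in doc]
    vocab = sorted({w for s in all_segments for w in s})
    # Cluster the segments of all documents together.
    labels = kmeans([vectorize(s, vocab) for s in all_segments], k)
    result, start = [], 0
    for doc in segmented:
        doc_labels = labels[start:start + len(doc)]
        start += len(doc)
        hist = Counter(doc_labels)
        # Fraction of the document's segments assigned to each cluster.
        result.append([hist[c] / len(doc) for c in range(k)])
    return result


docs = [
    "the cat sat on the mat the cat chased the mouse stock prices fell sharply today",
    "stock markets fell as investors sold shares the cat slept on the mat",
]
boc = bag_of_clusters(docs, k=2, size=4)
for row in boc:
    print(row)
```

Each row of `boc` is a length-`k` vector whose entries sum to 1, so documents of different lengths become directly comparable and can be passed to any standard document-level clustering method. In practice the term-count vectors would likely be replaced by a representation that captures semantics better, which is the property the abstract emphasizes.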
Can Tho University Journal of Science