Our investigation aims to pre-train clustering models for summarizing Vietnamese texts. For this purpose, we build a large-scale dataset of 1,101,101 documents by collecting Vietnamese articles from newspaper websites and extracting their plain text. We propose a new single-document extractive text summarization model based on clustering. Our proposal clusters the documents with the hard clustering k-means algorithm and the soft clustering LDA (Latent Dirichlet Allocation) algorithm. Then, based on the pre-trained clustering models, a summary model selects the salient sentences in the input text to construct the summary. The empirical results show that our summary model achieves 51.22% ROUGE-1, 17.62% ROUGE-2 and 29.16% ROUGE-L on the test set. Besides traditional word representations such as BoW (Bag-of-Words), we also use meaning-based word representations such as FastText and BERT (Bidirectional Encoder Representations from Transformers) in our model. An additional benefit of the proposed extractive summarization model is that the output summary is a long, readable document. Furthermore, the model's architecture is straightforward, easy to understand, and runs on cost-efficient hardware such as ARM CPUs as well as GPUs.
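The core idea of selecting salient sentences via clustering can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: it uses scikit-learn, substitutes a TF-IDF vectorizer for the paper's BoW/FastText/BERT representations, and covers only the hard-clustering (k-means) branch, picking the sentence nearest each cluster centroid as the salient one.

```python
# Illustrative sketch only: extractive summarization by clustering sentence vectors.
# Assumes scikit-learn; TF-IDF stands in for the paper's BoW/FastText/BERT representations,
# and `sentences` is a hypothetical pre-split input document.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def summarize(sentences, n_clusters=3):
    # Represent each sentence as a vector (BoW-style here for simplicity).
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    # Hard clustering with k-means, as in the hard-clustering branch of the model.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    summary_idx = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Pick the sentence closest to the cluster centroid as the salient one.
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        summary_idx.append(members[np.argmin(dists)])
    # Preserve original sentence order so the summary stays readable.
    return [sentences[i] for i in sorted(summary_idx)]
```

A soft-clustering variant would replace k-means with an LDA topic model and rank sentences by their topic-distribution weights; the overall pipeline (represent, cluster, select, reorder) stays the same.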