Our investigation aims to pre-train clustering models for summarizing Vietnamese texts. For this purpose, we build a large-scale dataset of 1,101,101 documents by collecting Vietnamese articles from newspaper websites and extracting their plain text. We propose a new single-document extractive text summarization model based on clustering. Our proposal clusters the documents with the hard clustering k-means algorithm and the soft clustering LDA (Latent Dirichlet Allocation) algorithm. Then, based on the pre-trained clustering models, a summary model selects the salient sentences in the input text to construct the summary. The empirical results show that our summary model achieves 51.22% ROUGE-1, 17.62% ROUGE-2 and 29.16% ROUGE-L on the test set. Besides traditional word representations such as BoW (Bag-of-Words), we also use meaning-based word representations such as FastText and BERT (Bidirectional Encoder Representations from Transformers) in our model. An additional benefit of the proposed extractive summarization model is that the output summary is a long, readable document. Furthermore, the model's architecture is straightforward, easy to understand, and runs on cost-efficient hardware such as ARM CPUs as well as GPUs.
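The core idea of selecting salient sentences via clustering can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: it uses scikit-learn, substitutes a TF-IDF vectorizer for the paper's BoW/FastText/BERT representations, and covers only the hard-clustering (k-means) branch, picking the sentence nearest each cluster centroid as the salient one.

```python
# Illustrative sketch only: extractive summarization by clustering sentence vectors.
# Assumes scikit-learn; TF-IDF stands in for the paper's BoW/FastText/BERT representations,
# and `sentences` is a hypothetical pre-split input document.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def summarize(sentences, n_clusters=3):
    # Represent each sentence as a vector (BoW-style here for simplicity).
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    # Hard clustering with k-means, as in the hard-clustering branch of the model.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    summary_idx = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Pick the sentence closest to the cluster centroid as the salient one.
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c], axis=1)
        summary_idx.append(members[np.argmin(dists)])
    # Preserve original sentence order so the summary stays readable.
    return [sentences[i] for i in sorted(summary_idx)]
```

A soft-clustering variant would replace k-means with an LDA topic model and rank sentences by their topic-distribution weights; the overall pipeline (represent, cluster, select, reorder) stays the same.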