Nowadays, the continuous growth of text corpora and the many research directions in general classification have created both a great opportunity and an unprecedented demand for a comprehensive evaluation of current achievements in natural language processing research. Unfortunately, few studies have combined convolutional neural networks (CNN) with Apache Spark to automate opinion discretization. In this paper, the authors propose a new distributed structure for solving an opinion classification problem in text mining by using CNN models and big data technologies on Vietnamese text sources. The proposed framework consists of the implementation concepts a researcher needs to perform experiments on text discretization problems. It covers the steps and components that are usually part of a complete, practical text mining pipeline: acquiring input data, processing it, tokenizing it into a vectorial representation, applying machine learning algorithms, applying the trained models to unseen data, and evaluating their accuracy. Development of the framework started with a specific focus on binary text discretization but soon expanded toward many other text-categorization-based problems, distributed language modeling, and quantification. Several intensive experiments were conducted to demonstrate the robustness and efficiency of the proposed framework. Given the high accuracy (72.99% ± 3.64) obtained in these experiments, one can conclude that it is feasible to apply the proposed distributed framework to the task of opinion discretization on Facebook.
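The pipeline stages listed in the abstract (acquire, tokenize, vectorize, train, apply to unseen data, evaluate accuracy) can be illustrated with a minimal single-machine sketch. This is not the authors' framework: a simple perceptron stands in for the CNN, nothing is distributed with Apache Spark, and the toy English sentences, function names, and parameters are all hypothetical. Real Vietnamese text would also need a proper word segmenter instead of whitespace splitting.

```python
from collections import Counter

# Hypothetical toy data (not from the paper): (text, label), 1 = positive, 0 = negative.
TRAIN = [
    ("good great product", 1),
    ("great service love it", 1),
    ("bad terrible product", 0),
    ("terrible service hate it", 0),
]
TEST = [("love great product", 1), ("hate bad service", 0)]


def tokenize(text):
    """Lowercase and split on whitespace (an assumption; not a Vietnamese segmenter)."""
    return text.lower().split()


def build_vocab(texts):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = count
    return vec


def predict(x, w, b):
    """Binary decision: 1 if the linear score is positive, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


def train_perceptron(X, y, epochs=10):
    """Classic perceptron updates; a stand-in for the CNN used in the paper."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - predict(xi, w, b)
            if err:
                w = [wj + err * xj for wj, xj in zip(w, xi)]
                b += err
    return w, b


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def run_pipeline(train, test):
    """Vectorize, train, apply the model to unseen data, and evaluate accuracy."""
    vocab = build_vocab(text for text, _ in train)
    X_train = [vectorize(text, vocab) for text, _ in train]
    y_train = [label for _, label in train]
    w, b = train_perceptron(X_train, y_train)
    y_pred = [predict(vectorize(text, vocab), w, b) for text, _ in test]
    return accuracy([label for _, label in test], y_pred)


if __name__ == "__main__":
    print(run_pipeline(TRAIN, TEST))
```

In a distributed setting, the tokenizing and vectorizing steps would typically be expressed as transformations over partitioned data, but that mapping is an assumption about the design and is not described in the abstract.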
Can Tho University Journal of Science