The human gut can contain hundreds to thousands of bacterial species, many of which have been shown to be associated with various diseases. Although machine learning has supported metagenomic research and contributed to notable achievements in personalized medicine, bioinformatics tasks involving metagenomic data classification still suffer from overfitting: performance is high in the training phase but low in testing. In this study, we present discretization methods for metagenomic data, including microbial compositions, to obtain better results in disease prediction tasks. The data used in the experiments consist of species abundance and read counts at various taxonomic ranks such as genus, family, and order. The proposed discretization approaches are unsupervised binning methods: binning with equal-width bins, binning based on the frequency of values, and binning based on the data distribution. Prediction results on eight datasets comprising more than 2,000 samples and covering liver cirrhosis, colorectal cancer, inflammatory bowel disease, obesity, type 2 diabetes, and HIV reveal potential improvements in the classification performance of both classic machine learning and deep learning algorithms. These binning approaches are expected to be promising pre-processing techniques for various types of data to improve the performance of prediction tasks in metagenomics.
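To make the first two binning strategies concrete, the following is a minimal sketch, not the authors' implementation. It assumes four bins, uses made-up abundance values, and reads "binning based on the frequency of values" as equal-frequency (quantile-style) binning; the function names are hypothetical. The distribution-based variant is not shown because the abstract does not specify it.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]


def discretize(values, edges):
    """Map each value to the index of the bin it falls into (0 .. len(edges))."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical relative abundances of one species across eight samples
# (illustrative values only, not taken from the study's datasets).
abundances = [0.0, 0.0, 0.001, 0.002, 0.01, 0.05, 0.3, 0.8]

width_bins = discretize(abundances, equal_width_edges(abundances, 4))
freq_bins = discretize(abundances, equal_frequency_edges(abundances, 4))

print(width_bins)  # [0, 0, 0, 0, 0, 0, 1, 3]
print(freq_bins)   # [0, 0, 1, 1, 2, 2, 3, 3]
```

On this skewed toy input, equal-width binning places six of the eight samples in the lowest bin, while equal-frequency binning spreads them evenly. This illustrates why the choice of binning scheme can matter for abundance data, where most values are close to zero.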
Can Tho University Journal of Science (Tạp chí khoa học Trường Đại học Cần Thơ)