Metagenomics is now a novel source of data supporting the diagnosis and prognosis of human diseases. Numerous studies have pointed to the crucial role of metagenomics in personalized medicine. In recent years, machine learning has been widely deployed in metagenomic research. Gene family data are typically characterized by very high dimensionality, with up to millions of features, whereas the number of available samples is small compared to the number of attributes. As a result, models often achieve high accuracy during the training phase but perform poorly on validation sets. Moreover, the very large number of features in each gene family dataset makes processing and learning time-consuming. In this study, we propose feature selection methods based on Ridge Regression for gene family datasets; the selected features are then binned with an equal-width binning approach and fed into either a Linear Regression model or a One-Dimensional Convolutional Neural Network (CNN1D) for prediction. Experiments are conducted on more than 1000 samples from gene family abundance datasets related to Liver Cirrhosis, Colorectal Cancer, Inflammatory Bowel Disease, Obesity and Type 2 Diabetes. The proposed method, which combines feature selection with binning, shows significant improvements in both prediction performance and execution time compared to state-of-the-art methods.
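The selection and binning steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not state how features are ranked, how many are kept, how many bins are used, or the regularization strength, so ranking by absolute Ridge coefficient and the parameters `k`, `n_bins` and `alpha` are assumptions made for illustration, and the function names are hypothetical. The Ridge solution is computed in its dual form, which solves a system of size samples by samples rather than features by features and therefore suits data with far more features than samples.

```python
import numpy as np


def ridge_coefficients(X, y, alpha=1.0):
    """Ridge coefficients on centered data, computed in the dual form.

    w = Xc^T (Xc Xc^T + alpha * I)^{-1} yc, which equals the usual
    primal solution (Xc^T Xc + alpha * I)^{-1} Xc^T yc for alpha > 0.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_samples = X.shape[0]
    gram = Xc @ Xc.T + alpha * np.eye(n_samples)
    return Xc.T @ np.linalg.solve(gram, yc)


def select_top_features(X, y, k, alpha=1.0):
    """Return the sorted column indices of the k largest |coefficients|.

    Ranking by absolute coefficient is an assumption of this sketch.
    """
    coef = ridge_coefficients(X, y, alpha)
    top = np.argsort(-np.abs(coef))[:k]
    return np.sort(top)


def equal_width_bin(X, n_bins=10, lo=None, hi=None):
    """Map each value to an integer bin index in 0 .. n_bins - 1.

    Bins have equal width over [lo, hi]; values outside the range are
    placed in the first or last bin.
    """
    lo = X.min() if lo is None else lo
    hi = X.max() if hi is None else hi
    edges = np.linspace(lo, hi, n_bins + 1)
    # Only interior edges are passed, so the result stays in 0 .. n_bins - 1.
    return np.digitize(X, edges[1:-1])


if __name__ == "__main__":
    # Synthetic stand-in for an abundance matrix: 40 samples, 500 features.
    rng = np.random.default_rng(0)
    X = rng.random((40, 500))
    y = (X[:, 3] + X[:, 7] > 1.0).astype(float)  # toy binary label

    idx = select_top_features(X, y, k=20, alpha=1.0)
    X_binned = equal_width_bin(X[:, idx], n_bins=10)
    print(X_binned.shape)
```

In practice the feature indices and the bin range would be derived from the training split only and then reused on the validation split, so that the reported validation performance is not inflated by information leaking from held-out samples.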
Can Tho University Journal of Science