Our study aims to classify images of Intangible Cultural Heritage (ICH) in the Mekong Delta, Vietnam. To this end, we built a dataset of images spanning 17 ICH categories and manually annotated them. We first fine-tuned recent pre-trained network models, including VGG16, DenseNet, and Vision Transformer (ViT), to classify our dataset. We then trained Logistic Regression (LR) models, called fusing models, which fuse both the visual features extracted from the deep networks and the outputs of those networks to improve classification accuracy. Our comparative study on the 17-category ICH image dataset shows that the fusing models improve classification accuracy over every single fine-tuned model. The first fusing model (LR on visual features extracted from VGG16, DenseNet, and ViT) achieves an accuracy of 66.76%; the second fusing model (LR on top of the VGG16, DenseNet, and ViT outputs) achieves 66.49%.
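The feature-level fusion described above can be sketched as follows: concatenate the visual features from several backbones into one vector per image and fit a Logistic Regression classifier on the result. This is a minimal illustration only; the feature dimensions and the randomly generated arrays are stand-ins for the actual features extracted from the fine-tuned networks, not the paper's pipeline.

```python
# Sketch of feature fusion: concatenate per-backbone features, then train LR.
# Random arrays simulate features; dimensions are typical but assumed values.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_classes = 340, 17  # 17 ICH categories; sample count is illustrative

# Stand-ins for penultimate-layer features of each fine-tuned backbone.
feats_vgg16 = rng.normal(size=(n_samples, 512))
feats_densenet = rng.normal(size=(n_samples, 1024))
feats_vit = rng.normal(size=(n_samples, 768))
labels = rng.integers(0, n_classes, size=n_samples)

# Fuse by concatenating the feature vectors along the feature axis.
fused = np.concatenate([feats_vgg16, feats_densenet, feats_vit], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    fused, labels, test_size=0.2, random_state=0)

# The fusing model: a multinomial Logistic Regression on the fused features.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("fused feature dim:", fused.shape[1])
print("test accuracy:", clf.score(X_test, y_test))
```

The second fusing model works analogously, except the inputs to the LR are the class-probability outputs of the three fine-tuned networks rather than their intermediate visual features.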