This study builds a chatbot application to promote the specialty dishes of the Mekong Delta. Given an image of a dish, the chatbot answers questions such as: What is the name of this dish? Where is it famous? What are the main ingredients? How is it made? The report outlines a method for training a Visual Question Answering (VQA) model as a classification task using Transformer-based models, such as ViT for image data, BERT/PhoBERT for text data, or ViLT for joint processing of image and text data. A Visual Encoder-Decoder model for answer sentence generation is then built, using the VQA model as the visual encoder and GPT-2 as the decoder. The experimental dataset comprises 7,694 photos of Mekong Delta dishes and is a subset of the 30VNFoods and VinaFood21 datasets. The VQA models were evaluated using accuracy. Model 1 (ViT and BERT) achieved 94% on English and 95% on Vietnamese, while Model 2 (ViLT) achieved over 92% on English only. Model 3, the answer sentence generation model that combines ViLT with GPT-2, was evaluated on English only and achieved ROUGE-1, ROUGE-2, and ROUGE-L scores of 49.92, 39.26, and 47.53, respectively. Finally, the trained models were used to build the chatbot.
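To make the ROUGE figures above easier to interpret, the following is a minimal sketch of how ROUGE-1, ROUGE-2, and ROUGE-L F1 scores can be computed for one generated answer against one reference answer. It is an illustration under stated assumptions, not the evaluation code used in the study: the abstract does not say which ROUGE implementation, tokenizer, or variant (precision, recall, or F1) was used, so lowercased whitespace tokenization and F1 are assumed here, and the function names and sample sentences are hypothetical. The sketch returns values in the range 0 to 1; the reported scores appear to be on a 0 to 100 scale.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 from clipped n-gram overlap (assumed whitespace tokens)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Sentence-level ROUGE-L F1 based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Hypothetical reference and generated answers, for illustration only.
    reference = "this dish is banh xeo"
    generated = "this is banh xeo"
    print("ROUGE-1:", round(rouge_n(generated, reference, 1), 4))
    print("ROUGE-2:", round(rouge_n(generated, reference, 2), 4))
    print("ROUGE-L:", round(rouge_l(generated, reference), 4))
```

A corpus-level score would typically average these per-sentence values over all test answers; established ROUGE packages also apply stemming and other normalization, so their numbers can differ from this simplified version.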
Can Tho University Journal of Science
4th Floor, Administration Building, Campus II, 3/2 Street, Xuân Khánh Ward, Ninh Kiều District, Cần Thơ City
Phone: (0292) 3 872 157; Email: tapchidhct@ctu.edu.vn