Image captioning neural networks jointly train an image recognition sub-model and a natural language processing sub-model to generate descriptive sentences for images. This paper presents several image captioning models based on the encoder-decoder framework; we vary the neural sub-models used for the encoder and the decoder and compare the results. First, we experiment with several ResNet architectures (viz., ResNet-50, ResNet-101, and ResNet-152) as encoders, paired with Transformer or bidirectional Transformer decoders. Second, we combine the Vision Transformer as the visual encoder with the standard Transformer or RoBERTa as the language decoder. Finally, we propose an image captioning model that uses the Vision Transformer to encode images and a bidirectional Transformer to predict captions. The models are trained on the Flickr8k dataset in English and Vietnamese and evaluated with the BLEU metric. The model combining the Vision Transformer with the bidirectional RoBERTa decoder outperforms existing image captioning models, including VirTex and CPTR. The BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of our best model are 0.870, 0.661, 0.443, and 0.331 on the English dataset, and 0.829, 0.647, 0.483, and 0.387 on the Vietnamese dataset.
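To make the encoder-decoder pairing concrete, the following is a minimal sketch of a ViT-encoder/RoBERTa-decoder captioning model, assuming the Hugging Face Transformers library and illustrative pretrained checkpoints; it is not the paper's implementation, and it treats RoBERTa as an autoregressive decoder with cross-attention rather than the bidirectional variant described above.

```python
# Minimal sketch (not the paper's code): pair a pretrained ViT encoder with a
# pretrained RoBERTa decoder for image captioning. Checkpoint names are
# illustrative assumptions, not the ones used in the paper.
from PIL import Image
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    RobertaTokenizer,
)

# Combine the two pretrained models; cross-attention layers in the decoder
# are added and randomly initialized, so fine-tuning is required.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # assumed vision encoder checkpoint
    "roberta-base",                       # assumed language decoder checkpoint
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Generation needs these special-token ids set on the combined config.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Caption a single image (after fine-tuning, e.g. on Flickr8k).
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```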