Automatically describing the content of an image plays an essential role in numerous applications, such as providing more accurate and detailed descriptions in image-based information search or video surveillance systems. The technique learns from image-caption pairs to generate picture captions that are generally semantically informative and grammatically correct. Humans use natural language to describe scenes because it is short and compact; machine vision systems, on the other hand, characterize a scene by capturing an image as a two-dimensional array. The idea is to embed images and captions in a common space and then map from the image to sentences. This study proposes a merge model that combines the image vector and the partial caption. It is implemented in three steps: processing the word sequence from the text, extracting the feature vector from the image, and decoding the output by concatenating the two preceding layers. To evaluate model performance, we generate multiple candidate sentences with Beam Search and score them with BLEU. The experiments show that the method can generate captions with relatively accurate content while requiring less training memory. We use the Flickr8k dataset, which consists of 8,000 images, each paired with five different captions that provide precise descriptions of the salient entities and events. For training we use 6,000 images, with 1,000 for testing and 1,000 for development; the accompanying Flickr8k text corpus includes text files listing the training and test splits. The best results were obtained when evaluating with BLEU-1: Greedy search and Beam Search with k = 5 or k = 7 all scored above 60.
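To make the three-step merge architecture concrete, below is a minimal sketch in Keras/TensorFlow. The 4096-dimensional image feature, the 256-unit layer sizes, and the vocab_size and max_length values are illustrative assumptions, not values taken from this study.

```python
# Minimal sketch of a merge model: an image branch and a partial-caption
# branch are concatenated and decoded into a next-word distribution.
# All dimensions below are illustrative placeholders.
from tensorflow.keras.layers import (Input, Dense, LSTM, Embedding,
                                     Dropout, Concatenate)
from tensorflow.keras.models import Model

vocab_size = 7579   # hypothetical vocabulary size
max_length = 34     # hypothetical maximum caption length in words

# Step 1: image branch, fed a precomputed CNN feature vector (assumed 4096-d)
image_input = Input(shape=(4096,))
fe1 = Dropout(0.5)(image_input)
fe2 = Dense(256, activation='relu')(fe1)

# Step 2: text branch, fed the partial caption as a padded word-id sequence
caption_input = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Step 3: decoder, concatenating the two branches and predicting the next word
merged = Concatenate()([fe2, se3])
decoder = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

The model is trained on (image feature, partial caption, next word) triples, so a single caption of n words yields n training examples for its image.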
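Caption generation then expands the most promising partial captions step by step. The following is a hedged sketch of Beam Search over a trained next-word model; `predict_step`, `start_id`, and `end_id` are assumed interfaces standing in for the model's prediction call and the special start/end tokens, not names from the paper.

```python
# Beam Search sketch: keep the k most probable partial captions at each step.
import numpy as np

def beam_search_caption(predict_step, start_id, end_id, max_length, k=5):
    """predict_step(seq) is assumed to return a probability distribution
    over the vocabulary for the next word given the partial caption seq."""
    beams = [([start_id], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:          # finished captions carry over
                candidates.append((seq, score))
                continue
            probs = predict_step(seq)
            for wid in np.argsort(probs)[-k:]:   # k most probable next words
                candidates.append((seq + [int(wid)],
                                   score + np.log(probs[wid] + 1e-12)))
        # retain only the k best hypotheses for the next step
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams[0][0]
```

Greedy search is the special case k = 1; larger beams such as k = 5 or k = 7 explore more candidate captions at a higher computational cost.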
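For evaluation, BLEU-1 compares the generated captions against the five reference captions per image using unigram precision. A small example with NLTK's corpus-level BLEU follows; the tokenized sentences are invented placeholders, not data from Flickr8k.

```python
# Scoring generated captions with corpus-level BLEU-1 via NLTK.
from nltk.translate.bleu_score import corpus_bleu

# Each image has a list of tokenized reference captions (placeholders here).
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass'],
               ['a', 'brown', 'dog', 'is', 'running', 'outside']]]
hypotheses = [['the', 'dog', 'is', 'running', 'in', 'grass']]

# BLEU-1 places all weight on unigram precision.
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0.0, 0.0, 0.0))
print(f"BLEU-1: {bleu1 * 100:.1f}")
```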