Information extraction automatically obtains structured information from unstructured or semi-structured machine-readable documents. The extraction consists mainly of classifying words (tagging). The output can be saved as key-value pairs in a machine-readable file format and then stored in a database for later reference. Information extraction from receipts or invoices is difficult because the tagging step cannot rely solely on the recognized words: layout information, that is, the position of each word relative to the other words on the invoice or receipt, must also be taken into account. This study combined VietOCR, an optical character recognition solution for the Vietnamese language, with a graph convolutional network (GCN) to extract information from 731 Vietnamese invoices issued by several stores. First, we collected invoice images captured with smartphones at supermarkets in Vietnam. We then performed text detection and recognition on those images, followed by feature processing. The dataset was split into two parts for training and testing, and we executed the classification tasks with two GCNs. Experimental results showed that the proposed method reached 99.50%, 98.52%, 98.52%, and 98.52% for accuracy, recall, precision, and F1-score, respectively. This work is expected to prove useful for information extraction from image-based documents.
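The abstract does not say how the word graph is built or how the GCN layers are defined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that OCR word boxes are linked when their centres lie within a distance threshold, and it uses the commonly cited normalized propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W). The function names, the `max_dist` threshold, and the toy features are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def build_adjacency(boxes, max_dist=200.0):
    """Link word boxes (x0, y0, x1, y1) whose centres are within max_dist pixels.

    Assumption: a simple distance threshold stands in for whatever
    layout-based neighbour rule the paper actually uses.
    """
    n = len(boxes)
    centres = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centres[i], centres[j]) <= max_dist:
                adj[i][j] = adj[j][i] = 1.0  # undirected edge, no self loop yet
    return adj


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self loops so each word keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by node degree.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Toy example: two neighbouring words on one line and one distant word.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (500, 500, 510, 510)]
feats = [[1.0], [3.0], [5.0]]   # hypothetical one-dimensional word features
weight = [[1.0]]                # hypothetical weights; learned in practice
adj = build_adjacency(boxes, max_dist=50.0)
hidden = gcn_layer(adj, feats, weight)
print(hidden)
```

In this sketch each word's new representation mixes its own features with those of nearby words, which is how layout context can inform the tagging of a word. A real tagger would stack several such layers, learn `weight` from the training split, and end with a per-word classification layer.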
Can Tho University Journal of Science (Tạp chí khoa học Trường Đại học Cần Thơ)