Fuzzy join, also known as similarity join, has been widely studied and applied in many practical fields, and many approaches to it have been built on MapReduce. The common challenge across these studies is to find all pairs of records whose similarity is greater than or equal to a given threshold, within reasonable time and with efficient use of resources. This work proposes a new approach that applies a neural network, the Siamese Recurrent Network, to process fuzzy join queries on large-scale datasets. In addition, the study uses a Bloom filter to eliminate redundant intermediate data, with the aim of improving MapReduce-based fuzzy join algorithms under the Hamming, Levenshtein, and Cosine distance measures. The results are analyzed and evaluated through experiments on large datasets on a Spark cluster.
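To make the problem statement concrete, the following is a minimal single-machine sketch of the three measures named above and of a brute-force fuzzy join. It is not the authors' MapReduce, Bloom filter, or Siamese Recurrent Network method; the function names, the whitespace tokenization used for cosine similarity, and the choice of a Levenshtein distance bound (the distance-based counterpart of a similarity threshold) are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between whitespace-token count vectors (assumed tokenization)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def naive_fuzzy_join(left, right, max_distance):
    """Hypothetical baseline: all (l, r) pairs with Levenshtein distance <= max_distance."""
    return [(l, r) for l in left for r in right if levenshtein(l, r) <= max_distance]
```

This baseline compares every pair, so its cost grows with the product of the two input sizes. That quadratic cost is what makes partitioning the work and discarding non-matching intermediate data early worthwhile on large datasets.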
Can Tho University Journal of Science (Tạp chí Khoa học Trường Đại học Cần Thơ)