In the Big Data era, processing large datasets efficiently remains a central concern for researchers. The join is one such operation and appears in many data queries. It generates large amounts of intermediate data and network traffic, and the recursive join does so to an even greater degree. Although extremely expensive, the recursive join is used in a wide variety of domains, including database, social network and computer network analysis, compilers, data integration, and graph mining. This study therefore optimizes recursive joins in a Spark environment. The proposed solutions combine three-way joins, Bloom filters, Spark RDDs, and caching for iterative join computation. Together they significantly reduce the number of iterations and jobs executed, the amount of redundant data, and remote access to persistent data. Our experimental results show that the optimized recursive join is more efficient than a typical one: it halves the number of iterations, minimizes data transfer, and thus shortens execution time.
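To make the idea of a recursive join concrete, the following is a minimal sketch in plain Python rather than Spark, and it is not the implementation evaluated in this study. It computes the transitive closure of an edge relation by repeatedly joining the newly found pairs with the edges. The function name `transitive_closure` and the `three_way` flag are hypothetical names introduced for illustration only. With `three_way=True`, each iteration joins with the edge relation twice, which is one simple way to see why a three-way join can need fewer iterations. Bloom filters and RDD caching are not modeled here.

```python
def transitive_closure(edges, three_way=False):
    """Semi-naive recursive join over (source, target) pairs.

    Returns (closure, iterations). Illustrative sketch only.
    """
    edges = set(edges)

    # Index the edge relation by source node so each join is a hash lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    closure = set(edges)  # every pair found so far
    delta = set(edges)    # pairs found in the previous iteration
    iterations = 0

    while delta:
        iterations += 1
        # Two-way join: delta(x, y) JOIN edges(y, z) -> (x, z)
        one_hop = {(x, z) for x, y in delta for z in successors.get(y, ())}
        found = one_hop
        if three_way:
            # Extend the one-hop results by one more edge in the same
            # iteration: delta JOIN edges JOIN edges -> (x, w)
            two_hop = {(x, w) for x, z in one_hop for w in successors.get(z, ())}
            found = one_hop | two_hop
        # Keep only pairs that were not already known.
        delta = found - closure
        closure |= delta

    return closure, iterations


# A simple chain 0 -> 1 -> ... -> 8 as sample input.
chain = [(i, i + 1) for i in range(8)]
closure_two_way, iterations_two_way = transitive_closure(chain)
closure_three_way, iterations_three_way = transitive_closure(chain, three_way=True)
```

Both calls return the same closure; only the number of loop iterations differs. In a distributed setting each iteration corresponds to at least one round of data shuffling, which is why reducing the iteration count matters.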
Can Tho University Journal of Science