In the era of information explosion, the amount of data generated grows day by day, reaching petabytes or even zettabytes. To extract useful information from such large and varied data sources, computations must be performed efficiently in a parallel and distributed manner on a cluster of computers. These computations involve many complex and expensive processing operations. One of the most typical and frequently used operations in queries is the join, which combines two or more datasets into one. Although there are some studies on join operations in Spark, none provides an adequate and systematic comparison of join algorithms in the Spark environment. This study therefore focuses on the join operation in Spark. It describes the important strategies for implementing the join in detail and sets out the advantages and disadvantages of each. In addition, the work provides a more thorough comparison of the joins using a mathematical cost model and experimental verification.
Can Tho University Journal of Science (Tạp chí Khoa học Trường Đại học Cần Thơ)