An SQL join clause combines records from two or more tables. This operation is very common in data processing, and understanding what happens under the hood is important. There are several common join types: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS (also called CARTESIAN).
A practical approach covering all join types
Sample data
All subsequent explanations on join types in this article make use of the following two tables. The rows in these tables serve to illustrate the effect of different types of joins and join-predicates.
The employees table has a nullable column. To express this in statically typed Scala, one needs to use the Option type.
// toDF requires the SQL implicits; the Spark shell imports them automatically,
// otherwise use import sqlContext.implicits._ (import spark.implicits._ in Spark 2.x).
val employees = sc.parallelize(Array[(String, Option[Int])](
  ("Rafferty", Some(31)), ("Jones", Some(33)), ("Heisenberg", Some(33)),
  ("Robinson", Some(34)), ("Smith", Some(34)), ("Williams", None)
)).toDF("LastName", "DepartmentID")
employees.show()
+----------+------------+
| LastName|DepartmentID|
+----------+------------+
| Rafferty| 31|
| Jones| 33|
|Heisenberg| 33|
| Robinson| 34|
| Smith| 34|
| Williams| null|
+----------+------------+
The departments table does not have nullable columns, so the type specification can be omitted.
val departments = sc.parallelize(Array(
(31, "Sales"), (33, "Engineering"), (34, "Clerical"),
(35, "Marketing")
)).toDF("DepartmentID", "DepartmentName")
departments.show()
+------------+--------------+
|DepartmentID|DepartmentName|
+------------+--------------+
| 31| Sales|
| 33| Engineering|
| 34| Clerical|
| 35| Marketing|
+------------+--------------+
Inner join
The following SQL code
SELECT *
FROM employee
INNER JOIN department
ON employee.DepartmentID = department.DepartmentID;
could be written in Spark as
employees
.join(departments, "DepartmentID")
.show()
+------------+----------+--------------+
|DepartmentID| LastName|DepartmentName|
+------------+----------+--------------+
| 31| Rafferty| Sales|
| 33| Jones| Engineering|
| 33|Heisenberg| Engineering|
| 34| Robinson| Clerical|
| 34| Smith| Clerical|
+------------+----------+--------------+
Beautiful, isn't it? Spark automatically removes the duplicated “DepartmentID” column, so the column names are unique and one does not need to use a table prefix to address them.
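For comparison, here is a sketch of the same join written with an explicit join expression instead of a column name. In this form Spark keeps both DepartmentID columns, so they have to be disambiguated through their parent DataFrames (the snippet reuses the employees and departments DataFrames defined above):
// Joining on an expression keeps both "DepartmentID" columns,
// so they must be addressed through their parent DataFrames.
employees
  .join(departments, employees("DepartmentID") === departments("DepartmentID"))
  .select(employees("DepartmentID"), $"LastName", $"DepartmentName")
  .show()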
Left outer join
A left outer join is a very common operation, especially if there are nulls or gaps in the data. Note that the column name should be wrapped into a Scala Seq if the join type is specified.
employees
.join(departments, Seq("DepartmentID"), "left_outer")
.show()
+------------+----------+--------------+
|DepartmentID| LastName|DepartmentName|
+------------+----------+--------------+
| 31| Rafferty| Sales|
| 33| Jones| Engineering|
| 33|Heisenberg| Engineering|
| 34| Robinson| Clerical|
| 34| Smith| Clerical|
| null| Williams| null|
+------------+----------+--------------+
Other join types
Spark allows using the following join types: inner, outer, left_outer, right_outer, and leftsemi. The interface is the same as for the left outer join in the example above.
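For example, a left semi join keeps only the rows of the left table that have a match on the right and returns only the left table's columns (a minimal sketch using the DataFrames above):
// leftsemi acts as a filter: employees with a matching department
// are kept, and only the employees columns are returned.
employees
  .join(departments, Seq("DepartmentID"), "leftsemi")
  .show()
Here "Williams" would be filtered out, since its DepartmentID is null and matches no department.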
For a cartesian join, the column specification should be omitted:
employees
.join(departments)
.show(10)
+----------+------------+------------+--------------+
| LastName|DepartmentID|DepartmentID|DepartmentName|
+----------+------------+------------+--------------+
| Rafferty| 31| 31| Sales|
| Rafferty| 31| 33| Engineering|
| Rafferty| 31| 34| Clerical|
| Rafferty| 31| 35| Marketing|
| Jones| 33| 31| Sales|
| Jones| 33| 33| Engineering|
| Jones| 33| 34| Clerical|
| Jones| 33| 35| Marketing|
|Heisenberg| 33| 31| Sales|
|Heisenberg| 33| 33| Engineering|
+----------+------------+------------+--------------+
only showing top 10 rows
Warning: do not use cartesian joins with big tables in production.
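Starting with Spark 2.1 there is also an explicit crossJoin method, which makes the intent visible in the code (a sketch, assuming the same DataFrames as above):
// An explicit cartesian product; equivalent to join() without a condition,
// but the intent is obvious to the reader.
employees.crossJoin(departments).show(10)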
Join expression, slowly changing dimensions and non-equi join
Spark allows us to specify a join expression instead of a sequence of columns. In general, an expression specification is less readable, so why do we need such flexibility? The reason is the non-equi join.
One application of it is slowly changing dimensions. Assume there is a table with product prices over time:
val products = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2020-01-01", 180),
("fish", "1990-01-01", "2020-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
products.show()
+-----+----------+----------+-----+
| name| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2020-01-01| 180|
| fish|1990-01-01|2020-01-01| 100|
+-----+----------+----------+-----+
There are only two products, steak and fish; the price of steak has changed once. Another table consists of product orders by day:
val orders = sc.parallelize(Array(
("1995-01-01", "steak"),
("2000-01-01", "fish"),
("2005-01-01", "steak")
)).toDF("date", "product")
orders.show()
+----------+-------+
| date|product|
+----------+-------+
|1995-01-01| steak|
|2000-01-01| fish|
|2005-01-01| steak|
+----------+-------+
Our goal is to assign an actual price to every record in the orders table. This is not obvious to achieve using only equality operators; however, a Spark join expression allows us to get the result in an elegant way:
orders
.join(products, $"product" === $"name" && $"date" >= $"startDate" && $"date" <= $"endDate")
.show()
+----------+-------+-----+----------+----------+-----+
| date|product| name| startDate| endDate|price|
+----------+-------+-----+----------+----------+-----+
|2000-01-01| fish| fish|1990-01-01|2020-01-01| 100|
|1995-01-01| steak|steak|1990-01-01|2000-01-01| 150|
|2005-01-01| steak|steak|2000-01-02|2020-01-01| 180|
+----------+-------+-----+----------+----------+-----+
This technique is very useful, yet not that common. It can save a lot of time both for those who write the code and for those who read it.
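The same join expression can also be combined with an explicit join type, for example to keep orders that fall outside any price period (a sketch under the same schema; such orders would survive with null product columns):
// A non-equi join expression combined with an explicit join type:
// orders without a valid price period are kept, with null price columns.
orders
  .join(products,
    $"product" === $"name" && $"date" >= $"startDate" && $"date" <= $"endDate",
    "left_outer")
  .show()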
Inner join using non-primary keys
The last part of this article is about joins on non-unique columns and the common mistakes related to them. The join (intersection) diagrams from the beginning of this article stick in our heads. Because of the visual comparison of set intersections, we assume that the result table after an inner join should be smaller than either of the source tables. This is correct only for joins on unique columns and wrong if the columns in both tables are not unique. Consider the following DataFrame with duplicated records and its self-join:
val df = sc.parallelize(Array(0, 1, 1)).toDF("c1")
df.show()
df.join(df, "c1").show()
// Original DataFrame
+---+
| c1|
+---+
| 0|
| 1|
| 1|
+---+
// Self-joined DataFrame
+---+
| c1|
+---+
| 0|
| 1|
| 1|
| 1|
| 1|
+---+
Note that the size of the result DataFrame is bigger than the size of the source. It can be as big as n², where n is the size of the source: a value that occurs k times in both tables produces k² rows in the joined result.
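If this blow-up is not intended, one common mitigation is to deduplicate the join key on one side before joining (a minimal sketch on the DataFrame above; dfUnique is a name introduced here for illustration):
// Deduplicating one side first caps the multiplicity:
// each value of c1 in df now matches at most one row in dfUnique.
val dfUnique = df.dropDuplicates(Seq("c1"))
df.join(dfUnique, "c1").show()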
Joining three or more tables in SQL
Sometimes situations occur where data needs to be fetched from three or more tables. This section deals with two approaches to achieve it.
Example:
Creating three tables:
- student
- marks
- details
Table 1: student
create table student(s_id int primary key, s_name varchar(20));
insert into student values(1, 'Jack');
insert into student values(2, 'Rithvik');
insert into student values(3, 'Jaspreet');
insert into student values(4, 'Praveen');
insert into student values(5, 'Bisa');
insert into student values(6, 'Suraj');
Table 2: marks
create table marks(school_id int primary key, s_id int, score int, status varchar(20));
insert into marks values(1004, 1, 23, 'fail');
insert into marks values(1008, 6, 95, 'pass');
insert into marks values(1012, 2, 97, 'pass');
insert into marks values(1016, 7, 67, 'pass');
insert into marks values(1020, 3, 100, 'pass');
insert into marks values(1025, 8, 73, 'pass');
insert into marks values(1030, 4, 88, 'pass');
insert into marks values(1035, 9, 13, 'fail');
insert into marks values(1040, 5, 16, 'fail');
insert into marks values(1050, 10, 53, 'pass');
Table 3: details
create table details(address_city varchar(20), email_ID varchar(20), school_id int, accomplishments varchar(50));
insert into details values('Banglore', 'jsingh@geeks.com', 1020, 'ACM ICPC selected');
insert into details values('Hyderabad', 'praveen@geeks.com', 1030, 'Geek of the month');
insert into details values('Delhi', 'rithvik@geeks.com', 1012, 'IOI finalist');
insert into details values('Chennai', 'om@geeks.com', 1111, 'Geek of the year');
insert into details values('Banglore', 'suraj@geeks.com', 1008, 'IMO finalist');
insert into details values('Mumbai', 'sasukeh@geeks.com', 2211, 'Made a robot');
insert into details values('Ahmedabad', 'itachi@geeks.com', 1172, 'Code Jam finalist');
insert into details values('Jaipur', 'kumar@geeks.com', 1972, 'KVPY finalist');
Two approaches to join three or more tables:
1. Using joins in SQL to join the tables:
The same logic applies as when joining two tables: the minimum number of join statements needed to join n tables is (n-1).
Query:
select s_name, score, status, address_city, email_id, accomplishments
from student s
inner join marks m on s.s_id = m.s_id
inner join details d on d.school_id = m.school_id;
2. Using a parent-child relationship:
This is rather an interesting approach. Create a column X as a primary key in one table and as a foreign key in another table (i.e. create a parent-child relationship).
Let's look at the tables created:
s_id is the primary key in the student table and a foreign key in the marks table (student (parent) – marks (child)).
school_id is the primary key in the marks table and a foreign key in the details table (marks (parent) – details (child)).
Query:
select s_name, score, status, address_city, email_id, accomplishments
from student s, marks m, details d
where s.s_id = m.s_id
  and m.school_id = d.school_id;
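For completeness, the same three-table join can be sketched with the Spark DataFrame API, tying this back to the first part of the article (assuming the three tables above have been loaded into DataFrames named student, marks, and details, which are hypothetical names here):
// Chaining joins mirrors the SQL above: (n-1) joins for n tables.
student
  .join(marks, "s_id")
  .join(details, "school_id")
  .select("s_name", "score", "status", "address_city", "email_ID", "accomplishments")
  .show()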
Conclusion
This article covered different join type implementations with Apache Spark, including join expressions and joins on non-unique keys.
Apache Spark allows developers to write code in a way that is easier to understand. This improves code quality and maintainability.