Pages

OracleEBSpro is purely for knowledge sharing and learning purpose, with the main focus on Oracle E-Business Suite Product and other related Oracle Technologies.

I'm NOT responsible for any damages in whatever form caused by the usage of the content of this blog.

I share my Oracle knowledge through this blog. All my posts in this blog are based on my experience, reading oracle websites, books, forums and other blogs. I invite people to read and suggest ways to improve this blog.


Tuesday, February 26, 2013

Nested Loop, Hash Join, Sort Merge Join, Cartesian join difference


Nested loop (loop over loop)
In this algorithm, an outer loop is formed which consists of few entries and then for each entry, and inner loop is processed.
Ex:
Select tab1.*, tab2.* from tabl, tab2 where tabl.col1=tab2.col2;
It is processed like:
For i in (select * from tab1) loop
For j in (select * from tab2 where col2=i.col1) loop
Display results;
End loop;
End loop;
The Steps involved in doing nested loop are:
a) Identify outer (driving) table
b) Assign inner (driven) table to outer table.
c) For every row of outer table, access the rows of inner table.
In execution plan it is seen like this:
NESTED LOOPS
outer_loop
inner_loop
When optimizer uses nested loops?
Optimizer uses nested loop when we are joining tables containing small number of rows with an efficient driving condition. It is important to have an index on column of inner join table as this table is probed every time for a new value from outer table.
Optimizer may not use nested loop in case:
  1. No of rows of both the table is quite high
  2. Inner query always results in same set of records
  3. The access path of inner table is independent of data coming from outer table.
Note: You will see more use of nested loop when using FIRST_ROWS optimizer mode as it works on model of showing instantaneous results to user as they are fetched. There is no need for selecting caching any data before it is returned to user. In case of hash join it is needed and is explained below.
Hash join
Hash joins are used when the joining large tables. The optimizer uses smaller of the 2 tables to build a hash table in memory and the scans the large tables and compares the hash value (of rows from large table) with this hash table to find the joined rows.
The algorithm of hash join is divided in two parts
  1. Build a in-memory hash table on smaller of the two tables.
  2. Probe this hash table with hash value for each row second table
In simpler terms it works like
Build phase
For each row RW1 in small (left/build) table loop
Calculate hash value on RW1 join key
Insert RW1 in appropriate hash bucket.
End loop;
Probe Phase
For each row RW2 in big (right/probe) table loop
Calculate the hash value on RW2 join key
For each row RW1 in hash table loop
If RW1 joins with RW2
Return RW1, RW2
End loop;
End loop;
When optimizer uses hash join?
Optimizer uses has join while joining big tables or big fraction of small tables.
Unlike nested loop, the output of hash join result is not instantaneous as hash joining is blocked on building up hash table.
Note: You may see more hash joins used with ALL_ROWS optimizer mode, because it works on model of showing results after all the rows of at least one of the tables are hashed in hash table.
Sort merge join
Sort merge join is used to join two independent data sources. They perform better than nested loop when the volume of data is big in tables but not as good as hash joins in general.
They perform better than hash join when the join condition columns are already sorted or there is no sorting required.
The full operation is done in two parts:
  • Sort join operation
get first row RW1 from input1
get first row RW2 from input2.

  • Merge join operation
while not at the end of either input loop
if RW1 joins with RW2
get next row R2 from input 2
return (RW1, RW2)
else if RW1 < style=""> get next row RW1 from input 1
else
get next row RW2 from input 2
end loop
Note: If the data is already sorted, first step is avoided.
Important point to understand is, unlike nested loop where driven (inner) table is read as many number of times as the input from outer table, in sort merge join each of the tables involved are accessed at most once. So they prove to be better than nested loop when the data set is large.
When optimizer uses Sort merge join?
a) When the join condition is an inequality condition (like <, <=, >=). This is because hash join cannot be used for inequality conditions and if the data set is large, nested loop is definitely not an option.
b) If sorting is anyways required due to some other attribute (other than join) like “order by”, optimizer prefers sort merge join over hash join as it is cheaper.
Note: Sort merge join can be seen with both ALL_ROWS and FIRST_ROWS optimizer hint because it works on a model of first sorting both the data sources and then start returning the results. So if the data set is large and you have FIRST_ROWS as optimizer goal, optimizer may prefer sort merge join over nested loop because of large data. And if you have ALL_ROWS as optimizer goal and if any inequality condition is used the SQL, optimizer may use sort-merge join over hash join
-------------------------------------------------------------------------------------------------------------------------------------------------------
This article is written in oracle 9.2.0.8. 

Nested loop Joins The optimizer uses nested loop joins when joining small number of rows, with a good driving condition between the two tables. It drives from the outer loop to the inner loop. The inner loop is iterated for every row returned from the outer loop, ideally by an index scan. It is inefficient when join returns large number of rows (typically, more than 10,000 rows is considered large), and the optimizer might choose not to use it.

The cost is calculated as below.
cost = access cost of A + (access cost of B * no_of_rows from A) 

A nested loop join involves the following steps:

1. The optimizer determines the driving table and designates it as the outer table.
2. The other table is designated as the inner table. 
3. For every row in the outer table, Oracle accesses all the rows in the inner table.

For instance,
scott@DB1.US.ORACLE.COM> SELECT emp.empno, dept.dname
2 FROM EMP , DEPT
3 WHERE dept.deptno = 10
4 AND emp.deptno = dept.deptno
5 /

EMPNO DNAME
---------- --------------
7782 ACCOUNTING
7839 ACCOUNTING
7934 ACCOUNTING

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=5 Bytes=85)
1 0 NESTED LOOPS (Cost=3 Card=5 Bytes=85)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'DEPT' (Cost=1 Card=1 Bytes=10)
3 2 INDEX (UNIQUE SCAN) OF 'PK_DEPT' (UNIQUE)
4 1 TABLE ACCESS (FULL) OF 'EMP' (Cost=2 Card=5 Bytes=35)

We can also force the Nested loop join hint as below.

SELECT /*+ USE_NL(emp dept) */ emp.empno, dept.dname
FROM EMP , DEPT WHERE dept.deptno = 10
AND emp.deptno = dept.deptno

Hash Join Hash joins are used for joining large data sets. The optimizer uses the smaller of two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows. This method is best used when the smaller table fits in available memory. The optimizer uses a hash join to join two tables if they are joined using an equijoin(joins with equals predicates) and large amount of data needs to be joined.

The cost of a Hash loop join is calculated by the following formula:
cost=(access cost of A*no_of_hash partitions of B) + access cost of B 

For instance,
scott@DB1.US.ORACLE.COM> ;
1 SELECT emp.empno, dept.dname
2 FROM EMP , DEPT
3* WHERE emp.deptno = dept.deptno
scott@DB1.US.ORACLE.COM>
/
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=5 Card=14 Bytes=238)
1 0 HASH JOIN (Cost=5 Card=14 Bytes=238)
2 1 TABLE ACCESS (FULL) OF 'DEPT' (Cost=2 Card=7 Bytes=70)
3 1 TABLE ACCESS (FULL) OF 'EMP' (Cost=2 Card=14 Bytes=98)

We can also force the Hash join hint as below.
SELECT /*+USE_HASH(emp dept) */ emp.empno, dept.dname
FROM EMP , DEPT
WHERE emp.deptno = dept.deptno

Sort-Merge Join Sort merge joins can be used to join rows from two independent sources. Hash joins generally perform better than sort merge joins. On the other hand, sort merge joins can perform better than hash joins if both of the following conditions exist:

1. The row sources are sorted already.
2. A sort operation does not have to be done.

Sort merge joins are almost exclusively used for non-equi joins (>, <, BETWEEN). Sort merge joins perform better than nested loop joins for large data sets. (You cannot use hash joins unless there is an equality condition). In a merge join, there is no concept of a driving table.
The join consists of two steps:
Sort join operation: Both the inputs are sorted on the join key. 
Merge join operation: The sorted lists are merged together.

The optimizer can choose a sort merge join over a hash join for joining large amounts of data if any of the following conditions are true:

1. The join condition between two tables is not an equi-join.
2. OPTIMIZER_MODE is set to RULE.
3. HASH_JOIN_ENABLED is false. 
4. Because of sorts already required by other operations, the optimizer finds it is cheaper to use a sort merge than a hash join. 
5. The optimizer thinks that the cost of a hash join is higher, based on the settings of HASH_AREA_SIZE and SORT_AREA_SIZE.

The cost of a sort merge join is calculated by the following formula: 
cost = access cost of A + access cost of B + (sort cost of A + sort cost of B)

For instance, scott@DB1.US.ORACLE.COM> SELECT emp.empno, dept.dname
2 FROM EMP , DEPT
3 WHERE emp.deptno < dept.deptno

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=6 Card=5 Bytes=85)
1 0 MERGE JOIN (Cost=6 Card=5 Bytes=85)
2 1 SORT (JOIN) (Cost=2 Card=7 Bytes=70)
3 2 TABLE ACCESS (BY INDEX ROWID) OF 'DEPT' (Cost=2 Card=7 Bytes=70)
4 3 INDEX (FULL SCAN) OF 'PK_DEPT' (UNIQUE) (Cost=1 Card =7)
5 1 SORT (JOIN) (Cost=4 Card=14 Bytes=98)
6 5 TABLE ACCESS (FULL) OF 'EMP' (Cost=2 Card=14 Bytes=98)

We can also force the Hash join hint as below.

scott@DB1.US.ORACLE.COM> SELECT /*+USE_MERGE(emp dept) */
emp.empno, dept.dname
2 FROM EMP , DEPT
3 WHERE emp.deptno < dept.deptno

Cartesian join A Cartesian join is used when one or more of the tables does not have any join conditions to any other tables in the statement. The optimizer joins every row from one data source with all the row from the other data source, creating the Cartesian product of the two sets.

For instance,
scott@DB1.US.ORACLE.COM> select * from emp,dept;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=16 Card=98 Bytes=5194)
1 0 MERGE JOIN (CARTESIAN) (Cost=16 Card=98 Bytes=5194)
2 1 TABLE ACCESS (FULL) OF 'DEPT' (Cost=2 Card=7 Bytes=112)
3 1 BUFFER (SORT) (Cost=14 Card=14 Bytes=518)
4 3 TABLE ACCESS (FULL) OF 'EMP' (Cost=2 Card=14 Bytes=518)

No comments:

Post a Comment