Analytic functions were introduced in Release 2 of 8i and simplify greatly the means by which pivot reports and OLAP queries can be computed in straight, non-procedural SQL. Prior to the introduction of analytic functions, complex reports could be produced in SQL by complex self-joins, sub-queries and inline-views but these were resource-intensive and very inefficient. Furthermore, if a question to be answered was too complex, it could be written in PL/SQL, which by its very nature is usually less efficient than a single SQL statement.
There are three types of SQL extensions that fall under the banner of "analytic functions" though the first could be said to provide "analytic functionality" rather than actually be analytic functions:
- new grouping of resultsets through extensions to the GROUP BY clause (ROLLUP and CUBE);
- the new analytic functions themselves; and
- TOP-N analysis (largely enabled by the analytic functions).
Each of these will be dealt with in turn.
grouping extensions (rollup and cube)
Extensions to the GROUP BY clause enable pre-computed resultsets, summaries and aggregations to be supplied from within the Oracle server itself, rather than by a tool such as SQL*Plus. An example follows.
Example One: Sums the salaries by job and then sub-totals each department (similar to a break sum report in SQL*Plus) and then the entire salary column.
SELECT deptno , job , SUM(sal) FROM emp GROUP BY ROLLUP(deptno,job);
DEPTNO JOB SUM(SAL) ---------- --------- ---------- 10 CLERK 1300 10 MANAGER 2450 10 PRESIDENT 5000 10 8750 20 ANALYST 6000 20 CLERK 1900 20 MANAGER 2975 20 10875 30 CLERK 950 30 MANAGER 2850 30 SALESMAN 5600 30 9400 29025
Example Two: Sums the salaries by job and then sub-totals each department and each job type (similar to a break sum report in SQL*Plus) and then the entire salary column. This will provide sub-totals for all columns within the GROUP BY clause.
SELECT deptno , job , SUM(sal) FROM emp GROUP BY CUBE(deptno,job);
DEPTNO JOB SUM(SAL) ---------- --------- ---------- 10 CLERK 1300 10 MANAGER 2450 10 PRESIDENT 5000 10 8750 20 ANALYST 6000 20 CLERK 1900 20 MANAGER 2975 20 10875 30 CLERK 950 30 MANAGER 2850 30 SALESMAN 5600 30 9400 ANALYST 6000 CLERK 4150 MANAGER 8275 PRESIDENT 5000 SALESMAN 5600 29025
Example Three: Using the GROUPING function to "label" the sub-total rows (i.e. determine which rows represented rollups). The GROUPING function returns a value of 1 if the current row is a row representing an aggregated rollup group (such as the sub-total rows) or zero if the row is one of the "source" records itself.
SELECT DECODE(GROUPING(deptno),1,'All Departments',deptno) AS deptno , DECODE(GROUPING(job),1,'Job Total',job) AS job , SUM(sal) FROM emp GROUP BY CUBE(deptno,job);
DEPTNO JOB SUM(SAL) ---------------------------------------- --------- ---------- 10 CLERK 1300 10 MANAGER 2450 10 PRESIDENT 5000 10 Job Total 8750 20 ANALYST 6000 20 CLERK 1900 20 MANAGER 2975 20 Job Total 10875 30 CLERK 950 30 MANAGER 2850 30 SALESMAN 5600 30 Job Total 9400 All Departments ANALYST 6000 All Departments CLERK 4150 All Departments MANAGER 8275 All Departments PRESIDENT 5000 All Departments SALESMAN 5600 All Departments Job Total 29025
analytic functions
There are 33 new analytic functions, though it is likely that most users will concentrate on a small number of these and very few will use the statistical capabilities. They are:
- AVG, MIN, MAX, COUNT, SUM
- LAG, LEAD
- FIRST_VALUE, LAST_VALUE
- ROW_NUMBER
- RANK, DENSE_RANK, PERCENT_RANK
- NTILE, CUME_DIST, RATIO_TO_REPORT
- CORR
- COVAR_POP
- VARIANCE, VAR_POP, VAR_SAMP
- STTDEV, STTDEV_POP, STDDEV_SAMP
- REGR_COUNT, REGR_P2, REGR_AVGX, REGR_AVGY, REGR_SXX, REGR_SXY, REGR_SYY, REGR_SLOPE, REGR_INTERCEPT
The way analytic functions work is to manipulate data contained within returning resultsets. This means they can process, merge and compute against data that has already been fetched from a query and partition and order the resultset into groups while at the same time returning the entire resultset without GROUP BY clauses. There follows a range of examples that demonstrate the analytic functions that, in my opinion, will be used most commonly. The first is a simple running total of salaries within each of an organization's departments.
Example Four: Calculate a running salary total for each department as new employees were hired.
SELECT empno , deptno , sal , SUM(sal) OVER (PARTITION BY deptno ORDER BY ename ASC NULLS LAST) AS department_running_total , ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY ename ASC NULLS LAST) AS employees_in_running_total FROM emp ORDER BY deptno , ename;
ENAME DEPTNO SAL DEPARTMENT_RUNNING_TOTAL EMPLOYEES_IN_RUNNING_TOTAL ---------- ---------- ---------- ------------------------ -------------------------- CLARK 10 2450 2450 1 KING 10 5000 7450 2 MILLER 10 1300 8750 3 ADAMS 20 1100 1100 1 FORD 20 3000 4100 2 JONES 20 2975 7075 3 SCOTT 20 3000 10075 4 SMITH 20 800 10875 5 ALLEN 30 1600 1600 1 BLAKE 30 2850 4450 2 JAMES 30 950 5400 3 MARTIN 30 1250 6650 4 TURNER 30 1500 8150 5 WARD 30 1250 9400 6
This simple example demonstrates the power of analytic functions - they can split resultsets into working groups to compute, order and aggregate data. The above example would be considerably more complex with standard SQL, and would require something like three scans of the EMP table instead of the one scan with the above example.
Analytic functions are invoked using the OVER() clause. This also enables Oracle to distinguish between PL/SQL functions and analytic functions that share the same name such as AVG, MIN and MAX.
There are three components to the OVER clause:
- PARTITION clause, by which the resultset can be broken into groups, such as departments in the example above. Without this the entire resultset is treated as a single partition;
- ORDER BY clause, by which the resultset or partition group can be ordered. This is optional for some analytic functions but mandatory for those which need to access rows either side of the current row, such as LAG and LEAD; and
- RANGE or ROWS clause (AKA windowing), by which the function can be made to include rows or values around the current row in its calculations. RANGE windows work on values and ROWS windows work on records, such as either X rows on each side of the current row or all rows preceding the current row, within the current partition.
The PARTITION and ORDER BY clauses are demonstrated in the first example above. The resultset was partitioned into the individual departments in the organization. Within each department, the data was ordered by ename (using default criteria (ASC and NULLS LAST). No RANGE clause was added which means that we used the default of RANGE UNBOUNDED PRECEDING, which means include all the preceding records in the current partition in the calculation for the current row. The easiest way to understand analytic functions and windowing is by examples which demonstrate the each of the three components to the OVER() clause.
Example Five: Find the average salary by department and compare each employees' salaries to the department average.
SELECT deptno , ename , sal , ROUND(average_sal_dept,0) AS average_sal_dept , ROUND(sal - average_sal_dept,0) AS sal_variance FROM (SELECT deptno , ename , sal , AVG(sal) OVER (PARTITION BY deptno) AS average_sal_dept FROM emp);
DEPTNO ENAME SAL AVERAGE_SAL_DEPT SAL_VARIANCE ---------- ---------- ---------- ---------------- ------------ 10 CLARK 2450 2917 -467 10 KING 5000 2917 2083 10 MILLER 1300 2917 -1617 20 SMITH 800 2175 -1375 20 ADAMS 1100 2175 -1075 20 FORD 3000 2175 825 20 SCOTT 3000 2175 825 20 JONES 2975 2175 800 30 ALLEN 1600 1567 33 30 BLAKE 2850 1567 1283 30 MARTIN 1250 1567 -317 30 JAMES 950 1567 -617 30 TURNER 1500 1567 -67 30 WARD 1250 1567 -317
Example Six: Determine the order by which employees joined their respective departments. Also include the employees who preceded and succeeded them.
SELECT deptno , ename , hiredate , LAG(ename,1,NULL) OVER (PARTITION BY deptno ORDER BY hiredate ASC NULLS LAST) AS previous_employee_ename , LEAD(ename,1,NULL) OVER (PARTITION BY deptno ORDER BY hiredate ASC NULLS LAST) AS next_employee_ename FROM emp ORDER BY deptno;
DEPTNO ENAME HIREDATE PREVIOUS_EMPLOYEE_ENAME NEXT_EMPLOYEE_ENAME ---------- ---------- ----------- ------------------------------ --------------------- 10 CLARK 09-JUN-1981 KING 10 KING 17-NOV-1981 CLARK MILLER 10 MILLER 23-JAN-1982 KING 20 SMITH 17-DEC-1980 JONES 20 JONES 02-APR-1981 SMITH FORD 20 FORD 03-DEC-1981 JONES SCOTT 20 SCOTT 09-DEC-1982 FORD ADAMS 20 ADAMS 12-JAN-1983 SCOTT 30 ALLEN 20-FEB-1981 WARD 30 WARD 22-FEB-1981 ALLEN BLAKE 30 BLAKE 01-MAY-1981 WARD TURNER 30 TURNER 08-SEP-1981 BLAKE MARTIN 30 MARTIN 28-SEP-1981 TURNER JAMES 30 JAMES 03-DEC-1981 MARTIN
Note: LAG() and LEAD() provide access to rows around the current row, something that was very difficult to achieve prior to 8.1.6. The functions take 3 parameters - expression to be returned from the LAG/LEAD row, the LAG/LEAD offset from the current row, and the value to be returned if the offset is beyond the partition window.
Example Seven: Determine the proportion of each department's salary taken up by its individual employees.
SELECT deptno , ename , sal , dept_sal , ROUND(employees_dept_ratio*100,2) AS emps_proportion FROM (SELECT deptno , ename , sal , SUM(sal) OVER (PARTITION BY deptno) AS dept_sal , RATIO_TO_REPORT(sal) OVER (PARTITION BY deptno) AS employees_dept_ratio FROM emp) ORDER BY deptno;
DEPTNO ENAME SAL DEPT_SAL EMPS_PROPORTION ---------- ---------- ---------- ---------- --------------- 10 CLARK 2450 8750 28 10 KING 5000 8750 57.14 10 MILLER 1300 8750 14.86 20 SMITH 800 10875 7.36 20 ADAMS 1100 10875 10.11 20 FORD 3000 10875 27.59 20 SCOTT 3000 10875 27.59 20 JONES 2975 10875 27.36 30 ALLEN 1600 9400 17.02 30 BLAKE 2850 9400 30.32 30 MARTIN 1250 9400 13.3 30 JAMES 950 9400 10.11 30 TURNER 1500 9400 15.96 30 WARD 1250 9400 13.3
Example Eight: RANGE windowing. Determine the first and last employee to be employed within 50 days of the current employees' hiredate.
SELECT ename , hiredate , FIRST_VALUE(ename) OVER (ORDER BY hiredate ASC NULLS LAST RANGE BETWEEN 50 PRECEDING AND 50 FOLLOWING) AS first_employee , LAST_VALUE(ename) OVER (ORDER BY hiredate ASC NULLS LAST RANGE BETWEEN 50 PRECEDING AND 50 FOLLOWING) AS last_employee FROM emp ORDER BY hiredate;
ENAME HIREDATE FIRST_EMPL LAST_EMPLO ---------- ----------- ---------- ---------- SMITH 17-DEC-1980 SMITH SMITH ALLEN 20-FEB-1981 ALLEN JONES WARD 22-FEB-1981 ALLEN JONES JONES 02-APR-1981 ALLEN BLAKE BLAKE 01-MAY-1981 JONES CLARK CLARK 09-JUN-1981 BLAKE CLARK TURNER 08-SEP-1981 TURNER MARTIN MARTIN 28-SEP-1981 TURNER KING KING 17-NOV-1981 MARTIN JAMES FORD 03-DEC-1981 KING JAMES JAMES 03-DEC-1981 KING JAMES MILLER 23-JAN-1982 MILLER MILLER SCOTT 09-DEC-1982 SCOTT ADAMS ADAMS 12-JAN-1983 SCOTT ADAMS
Example Nine: ROWS windowing. Determine who was recruited two employees before and three after the current employee (note what happens when employees share hiredates).
SELECT ename , hiredate , FIRST_VALUE(ename) OVER (ORDER BY hiredate ASC NULLS LAST ROWS 2 PRECEDING) AS two_employees_back , FIRST_VALUE(ename) OVER (ORDER BY hiredate DESC NULLS LAST ROWS 3 PRECEDING) AS three_employees_forward FROM emp ORDER BY hiredate;
ENAME HIREDATE TWO_EMPLOYEES_BACK THREE_EMPLOYEES_FORWARD ---------- ----------- ------------------------- ------------------------- SMITH 17-DEC-1980 SMITH JONES ALLEN 20-FEB-1981 SMITH BLAKE WARD 22-FEB-1981 SMITH CLARK JONES 02-APR-1981 ALLEN TURNER BLAKE 01-MAY-1981 WARD MARTIN CLARK 09-JUN-1981 JONES KING TURNER 08-SEP-1981 BLAKE FORD MARTIN 28-SEP-1981 CLARK JAMES KING 17-NOV-1981 TURNER MILLER JAMES 03-DEC-1981 MARTIN ADAMS FORD 03-DEC-1981 KING SCOTT MILLER 23-JAN-1982 JAMES ADAMS SCOTT 09-DEC-1982 FORD ADAMS ADAMS 12-JAN-1983 MILLER ADAMS
top-n queries
These refer to ranked sets of data and are quite commonly requested, such as "Who are our top-n spending customers", "who are our top-n earners" and so on. Oracle 8i provides two ways of providing answers to TOP-N queries; either by the introduction of ORDER BY in in-line views or by analytic function. Prior to these methods being available, TOP-N queries could only be achieved using complex SQL, utilizing self-joins and subqueries.
At its simplest, TOP-N queries can be answered using an ORDER BY in an in-line view and then limiting the number of rows selected from the view. An example follows.
Example Ten: Who were the first three recruits to our organization?
SELECT ROWNUM AS rank , ename , hiredate FROM (SELECT ename , hiredate FROM emp ORDER BY hiredate ASC NULLS LAST) WHERE ROWNUM <= 3;
RANK ENAME HIREDATE ---------- ---------- ----------- 1 SMITH 17-DEC-1980 2 ALLEN 20-FEB-1981 3 WARD 22-FEB-1981
There is ambiguity to the above question, especially in using the above methodology. For example, if five people were recruited on the same day, then how would this question be answered? The ORDER BY in-line view method would generate the employees in no particular order and the stopkey would stop returning rows at record three.
To remove this ambiguity, analytic functions can help. For the following example, I've updated five employees to have the earliest date.
Example Eleven: Who were the first three recruits to our organization?
SELECT hire_rank , ename , hiredate FROM (SELECT ename , hiredate , RANK() OVER (ORDER BY HIREDATE ASC NULLS LAST) AS hire_rank FROM emp) WHERE hire_rank <= 3;
HIRE_RANK ENAME HIREDATE ---------- ---------- ----------- 1 SMITH 01-JAN-1951 1 ALLEN 01-JAN-1951 1 WARD 01-JAN-1951 1 JONES 01-JAN-1951 1 MARTIN 01-JAN-1951
Technically, this has not answered the actual question but has instead expanded it. In using the RANK() analytic function, the query has returned all the people who joined the organization on the same day, even though this is more than the three people asked for. Note the use of RANK() rather than the DENSE_RANK() function. The RANK() function skips ranking numbers, such that the sixth employee to be returned would be given a rank of 6. DENSE_RANK() would assign the sixth person a rank of 2 as this works on distinct values, rather than rows.
Example Twelve: Determine the ranking of each employees' salary within their departments and within the company as a whole.
SELECT deptno , ename , sal , DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC NULLS LAST) AS dept_ranking , DENSE_RANK() OVER (ORDER BY sal DESC NULLS LAST) AS company_ranking FROM emp ORDER BY deptno;
DEPTNO ENAME SAL DEPT_RANKING COMPANY_RANKING ---------- ---------- ---------- ------------ --------------- 10 KING 5000 1 1 10 CLARK 2450 2 5 10 MILLER 1300 3 8 20 SCOTT 3000 1 2 20 FORD 3000 1 2 20 JONES 2975 2 3 20 ADAMS 1100 3 10 20 SMITH 800 4 12 30 BLAKE 2850 1 4 30 ALLEN 1600 2 6 30 TURNER 1500 3 7 30 WARD 1250 4 9 30 MARTIN 1250 4 9 30 JAMES 950 5 11
Note: The use of DENSE_RANK() and opposed to RANK() means that no rank numbers are skipped. For example, in department 20, SCOTT and FORD have the same salary so they share a dense_rank of 1, while JONES (next highest) has the dense_rank of 2. With RANK(), JONES would be ranked 3, as rank is relative to the number of rows, so the RANK() for SCOTT, FORD and JONES would be 1,1,3 respectively.
analytic functions and pl/sql
In 8i, the analytic functions cannot be used in PL/SQL (in 9i, this is overcome as the SQL and PL/SQL parsers are merged). To work around this, compile your cursor SELECT as a view and SELECT from the view instead of the underlying table in the PL/SQL block or alternatively, use Native Dynamic SQL to send your query to the SQL parser instead of compiling it with the PL/SQL parser (though be careful of this approach if the resulting SQL statement is to be executed many times - NDS necessitates one soft-parse of a statement per execution and as a result can be more "expensive" than the static SQL that could be achieved via the view method.
summary
This introduction demonstrates the power and relative simplicity of the analytic functions. They provide an easy mechanism to compute resultsets that, before 8i, were inefficient, impractical and, in some cases, impossible in "straight SQL".
To the uninitiated, the syntax can at first appear cumbersome, but once you have one or two examples under your belt, you will actively seek opportunities to use them. In addition to their flexibility and power, they are also extremely efficient - this can easily be demonstrated using SQL_TRACE and comparing the analytic functions' performance to the SQL statements that would have been required in the days before 8.1.6. Try it for yourself...
further reading
For an in-depth discussion of analytic functions, see Tom Kyte's "Expert One-on-One Oracle" where he gives Analytic Functions an entire chapter. Also see the February 2002 Oracle magazine for details on 9i extensions to the capabilities. And, of course, the various online documentation stored at OTN, tahiti.oracle.com etc.
Credit goes to the below website(s):
No comments:
Post a Comment