500x improvement using Group By Cube and Having (Oracle)

2022.08.05 — 3 min read

Understanding how the group by and having clauses work can help us write more efficient SQL. I want to demonstrate this with an example based on a real situation where rewriting a query resulted in a 500x performance improvement.

A scheduled job ran a complex query daily to truncate and reload a table used for reporting. The job used to run for about an hour and was identified for investigation. The query aggregated data from several tables, of which we will use simplified versions below.

We have two lookup tables for countries and categories along with an orders table:

1select * from sts_countries
2;

COUNTRY_CODE COUNTRY_NAME
US USA
CA CANADA
MX MEXICO
CH CHINA
AR ARGENTINA
BR BRAZIL

COUNTRY_CODE	COUNTRY_NAME
US	USA
CA	CANADA
MX	MEXICO
CH	CHINA
AR	ARGENTINA
BR	BRAZIL

1select * from sts_categories 
2;

CATEGORY_CODE CATEGORY_DESCR
C1 Category 1
C2 Category 2
C3 Category 3
C4 Category 4

CATEGORY_CODE	CATEGORY_DESCR
C1	Category 1
C2	Category 2
C3	Category 3
C4	Category 4

1select * from sts_orders fetch first 8 rows only
2;

ORDER_ID COUNTRY_CODE CATEGORY_CODE ORDER_AMT
1 AR C1 58788
2 BR C1 726424
3 CA C1 725327
4 CH C1 508595
5 MX C1 133405
6 US C1 992442
7 AR C2 348913
8 BR C2 257835

ORDER_ID	COUNTRY_CODE	CATEGORY_CODE	ORDER_AMT
1	AR	C1	58788
2	BR	C1	726424
3	CA	C1	725327
4	CH	C1	508595
5	MX	C1	133405
6	US	C1	992442
7	AR	C2	348913
8	BR	C2	257835

The query in the job was aggregating order data (sum of order_amt) for these combinations:

one row for each combination of country, year and category, excluding the data for USA
one row for each combination of country and category, excluding the data for USA
one row for each combination of country and year, excluding the data for USA
one row for each combination of year and category, including the data for USA
one row for each country for all years and all categories, excluding the data for USA
one row for each category for all countries and all years, including the data for USA
one row for each year for all countries and all categories, including the data for USA
one row for grand totals (all countries, all years and all categories, including USA)

The task could have been accomplished using fewer queries but for unknown reasons, there were 8 select statements that were combined using UNION operations as shown below.

1-- country, year, category totals excluding USA
2select c.country_name, o.order_year, cat.category_descr,
3       sum(o.order_amt) as tot_amt
4  from sts_countries c
5 inner join sts_orders o on (o.country_code = c.country_code)
6 inner join sts_categories cat on (cat.category_code = o.category_code) 
7 where c.country_name != 'USA'
8 group by c.country_name, o.order_year, cat.category_descr
9UNION
10-- country, category totals excluding USA (all years)
11select c.country_name, null as order_year, cat.category_descr,
12       sum(o.order_amt) as tot_amt
13  from sts_countries c
14 inner join sts_orders o on (o.country_code = c.country_code)
15 inner join sts_categories cat on (cat.category_code = o.category_code) 
16 where c.country_name != 'USA'
17 group by c.country_name, cat.category_descr
18UNION
19-- country, year totals excluding USA (all categories)
20select c.country_name, o.order_year, 'All' as category_descr,
21       sum(o.order_amt) as tot_amt
22  from sts_countries c
23 inner join sts_orders o on (o.country_code = c.country_code)
24 where c.country_name != 'USA'
25 group by c.country_name, o.order_year
26UNION
27-- year, category totals for all countries (including USA)
28select 'All' as country_name, o.order_year, cat.category_descr,
29       sum(o.order_amt) as tot_amt
30  from sts_orders o
31 inner join sts_categories cat on (cat.category_code = o.category_code) 
32 group by o.order_year, cat.category_descr
33UNION
34-- country totals excluding USA (all years and all categories)
35select c.country_name, null as order_year, 'All' as category_descr,
36       sum(o.order_amt) as tot_amt
37  from sts_countries c
38 inner join sts_orders o on (o.country_code = c.country_code)
39 where c.country_name != 'USA'
40 group by c.country_name
41UNION
42-- category totals for all countries and all years (including USA)
43select 'All' as country_name, null as order_year, cat.category_descr,
44       sum(o.order_amt) as tot_amt
45  from sts_orders o
46 inner join sts_categories cat on (cat.category_code = o.category_code) 
47 group by cat.category_descr
48UNION
49-- year totals for all countries and all categories (including USA)
50select 'All' as country_name, o.order_year, 'All' as category_descr,
51       sum(o.order_amt) as tot_amt
52  from sts_orders o
53 group by o.order_year
54UNION
55-- grand totals for all countries, all years and all categories (including USA)
56select 'All' as country_name, null as order_year, 'All' as category_descr,
57       sum(o.order_amt) as tot_amt
58  from sts_orders o
59;

COUNTRY_NAME ORDER_YEAR CATEGORY_DESCR TOT_AMT
All 2018 All 22859852
All 2018 Category 1 5864425
All 2018 Category 2 4116212
All 2018 Category 3 5611821
All 2018 Category 4 7267394
All 2019 All 21410101
All 2019 Category 1 5749186
All 2019 Category 2 4929132
All 2019 Category 3 5706655
All 2019 Category 4 5025128
All 2020 All 23525939
.. .. .. ..
120 rows selected

COUNTRY_NAME	ORDER_YEAR	CATEGORY_DESCR	TOT_AMT
All	2018	All	22859852
All	2018	Category 1	5864425
All	2018	Category 2	4116212
All	2018	Category 3	5611821
All	2018	Category 4	7267394
All	2019	All	21410101
All	2019	Category 1	5749186
All	2019	Category 2	4929132
All	2019	Category 3	5706655
All	2019	Category 4	5025128
All	2020	All	23525939
..	..	..	..

If you look at what the individual queries are doing, we can see that USA country data is excluded only when the group by clause includes the country_name column. If you ignore that filter, we are generating all possible combinations of country name, order year and category description. Note that we are fetching the sts_orders data 8 times in this query.

We can write a simpler and more efficient query with a better understanding of the capabilities of group by and having. Just as a where clause allows us to filter raw data, having lets us filter the resultset from a group by operation. Typical examples you would see for having clause use aggregate functions for filtering grouped results. But you can also use the columns on which the group by is operating for filtering. Our main enhancement is to use the cube extension of group by to generate aggregates for all combinations of country name, order year and category description. Once we have aggregates for all combinations of these three columns, we can filter out the summary rows for USA with having. The following query is equivalent to the above but reads the orders data only once.

1select coalesce(country_name,'All') as country_name,
2       order_year,
3       coalesce(category_descr,'All') as category_descr,
4       sum(order_amt) as tot_amt
5from sts_orders o
6inner join sts_countries c on (c.country_code = o.country_code)
7inner join sts_categories cat on (cat.category_code = o.category_code)
8group by cube(country_name, category_descr, order_year)
9having (grouping(country_name) = 1 -- all countries
10        or
11        (grouping(country_name) = 0 and country_name != 'USA') -- country level summary
12       )
13;

What is this query doing? It uses cube with group by to generate all 8 combinations of the country, year and category columns (single column at a time, two columns at a time, three columns at a time and grand totals). It then excludes the rows where the country_name is USA using the having clause.

The grouping() function returns 0 if the summary row is using the column passed to it for grouping and 1 if not. Rows for all countries' summary data will have a value of 1 for country name(since those rows are not actually using a specific country name value). To achieve our goal, we can exclude rows with the value of USA for the country name when the value is 0.

By eliminating the unncecessary UNIONs and reducing the number of full scans on the orders table from 8 to 1, this query is much simpler and is far more efficient.

Group by supports extensions such as rollup and grouping sets in addition to cube. These extensions can be extremely useful and are worth spending time to understand better. A future post will go into this in more detail. Stay tuned!

The actual query this example is based on had 24 small select statements that were combined with union and had several large tables joined in each select statement. The rewritten query was about 10 lines long compared to over 200 lines for the original code and ran in a few seconds compared to about an hour for the original query.

To run the code yourself, use the script below to create the data. It inserts 6 rows into sts_country table, 4 rows into sts_category table and two rows for each combination of country, category and year (using 2018, 2019 and 2020 for order years) into sts_orders. As a result, we have 6x4x3x2, or 144 rows in sts_orders table.

1--- Script to create tables and add sample data
2
3drop table sts_orders purge;
4drop table sts_countries purge;
5drop table sts_categories purge;
6
7create table sts_countries (
8    country_code    varchar2(2) not null primary key,
9    country_name    varchar2(30) not null
10);
11create table sts_categories (
12    category_code   varchar2(2) not null primary key,
13    category_descr  varchar2(30) not null
14);
15create table sts_orders (
16    order_id        number(10,0) not null,
17    country_code    varchar2(2) not null,
18    category_code   varchar2(2) not null,
19    order_year      number(4) not null,
20    order_amt       number(18,2) not null,
21    constraints orders_pk primary key (order_id),
22    constraint ctry_fk foreign key (country_code)
23        references sts_countries (country_code),
24    constraint cat_fk foreign key (category_code)
25        references sts_categories (category_code)
26);
27
28select * from all_types
29where typecode = 'COLLECTION'
30and type_name like '%CHAR%ARRAY%'
31;
32
33insert into sts_countries
34with t1 as (
35  select mdsys.t_varchar2_array(
36    'US,USA',
37    'CA,CANADA',
38    'MX,MEXICO',
39    'CH,CHINA',
40    'AR,ARGENTINA',
41    'BR,BRAZIL'
42    ) as ctry
43  from dual
44)
45select substr(column_value,1,2) as ctry_cd, substr(column_value,4) as ctry_nm
46from t1, table(ctry)
47;
48commit
49;
50
51insert into sts_categories
52with t2 as (
53  select mdsys.t_varchar2_array(
54    'C1,Category 1',
55    'C2,Category 2',
56    'C3,Category 3',
57    'C4,Category 4'
58    ) as cat
59  from dual
60)
61select substr(column_value,1,2) as cat_cd, substr(column_value,4) as cat_descr
62from t2, table(cat)
63;
64commit
65;
66
67exec dbms_random.seed('some random seed text');
68
69insert into sts_orders 
70    (order_id, country_code, category_code, order_year, order_amt)
71--
72with yrs as (
73    select 2018 + mod(lvl1,3) as yr
74    from (select level as lvl1 from dual connect by level < 7)
75), order_dta as (
76    select country_code, category_code, yr
77    from yrs
78    cross join sts_countries cn
79    cross join sts_categories ct
80)
81select rownum as order_id,
82       order_dta.country_code, order_dta.category_code, order_dta.yr,
83       round(dbms_random.value(10000, 1000000)) as order_amt
84from order_dta
85;
86commit
87;

Thanks for reading! Feel free to reach out if you have any interesting SQL or data projects that I can help with.