HAVING, a less understood SQL clause
— 2 min read
Understanding how the
group by and
having clauses work can help us write more efficient sql. I want to demonstrate this with an example based on a real situation.
Let's assume we have two tables - a country table with a country code and name, and an invoice table with an invoice number, country code, year, month, item count, and amount.
1select * from country;
ctry_code ctry_name US USA CA Canada CN China .. ..
1select * from invoice;
invoice_nbr ctry_cd year_nbr month_nbr item_cnt invoice_amt 2014001 BR 2014 3 13 162875 2021172 CA 2021 10 200 1299355 2020435 CN 2020 1 12 145654 2016201 US 2016 8 900 7125125 2021662 US 2021 4 100 800135 2018743 MX 2018 11 5 76124 .. .. .. .. .. ..
4 million rows
We want to write a sql that aggregates the invoice data
- for each country and year, for all countries other than USA
- for each country, for all countries other than USA
- for each year (including data for USA)
- a grand total (including data for USA)
This is the output we want from our query:
ctry_name year_nbr tot_cnt tot_amt Canada 2014 200 1300000 Canada 2015 220 1500000 Brazil 2014 200 1150000 Brazil 2015 180 1000000 Brazil 2018 150 750000 .. .. .. .. Canada null 420 2800000 Brazil null 530 2900000 .. .. .. .. All countries 2014 900 6406325 All countries 2015 1000 7306368 All countries 2018 1200 8206334 .. .. .. .. All countries null 6124 41261346
Here's one that gives us the expected results. It uses 6 sqls for year-wise totals for each of the 6 countries (other than USA), 6 sqls for country-wise totals for all years, one sql for year-wise totals for all countries, and finally one sql for the grand total. These 14 sqls are then combined with 13
UNION operations as shown below.
1-- country, year totals2select c.ctry_name, i.year_nbr,3 sum(i.item_cnt) as tot_cnt,4 sum(i.invoice_amt) as tot_amt5 from country c6 inner join invoice i on (i.ctry_code = c.ctry_code)7 where c.ctry_name = 'Mexico'8 group by c.ctry_name, i.year_nbr -- grouping by country and year9UNION10....11.... -- unions for each of the 6 countries excluding USA12UNION13-- country totals14select c.ctry_name, null as year_nbr,15 sum(i.item_cnt) as tot_cnt,16 sum(i.invoice_amt) as tot_amt17 from country c18 inner join invoice i on (i.ctry_code = c.ctry_code)19 where c.ctry_name = 'Mexico'20 group by c.ctry_name -- grouping by just country21UNION22....23.... -- unions for each of the 6 countries excluding USA24UNION25-- totals for all countries by year26select 'All countries' as ctry_name, i.year_nbr,27 sum(i.item_cnt) as tot_cnt,28 sum(i.invoice_amt) as tot_amt29 from country c30 inner join invoice i on (i.ctry_code = c.ctry_code)31 group by i.year_nbr -- grouping by just year for all countries32UNION33-- totals for all countries and all years34select 'All countries' as ctry_name, null as year_nbr,35 sum(i.item_cnt) as tot_cnt,36 sum(i.invoice_amt) as tot_amt37 from country c38 inner join invoice i on (i.ctry_code = c.ctry_code)
This query returns the expected results, but apart from being difficult to maintain, it is also slow and inefficient. It fetches data from the invoice table 14 times to perform 14 aggregations followed by an expensive set of
We can write a simpler and more efficient query with a better understanding of the capabilities of
group by and
having. Just as a
where clause allows us to filter raw data,
having lets us filter the resultset from a
group by operation. The following sql uses
group by cube() to generate aggregates for all combinations of country and year while filtering out the summary rows for USA with
1select coalesce(c.ctry_name,'All countries') as ctry_name,2 i.year_nbr,3 sum(i.item_cnt) as tot_cnt,4 sum(i.invoice_amt) as tot_amt5 from country c6 inner join invoice i on (i.ctry_code = c.ctry_code)7 group by cube(c.ctry_name, i.year_nbr)8having coalesce(c.ctry_name,'x') != 'USA' -- exclude USA summary rows
What is this query doing? It uses
group by to get-
- one row for each country, year combination
- one row for each country (including data for all years)
- one row for each year (including data for all countries)
- one row for all years, all countries
It then excludes the rows where the ctry_name is USA using the
having clause. This query fetches data from the invoice table only once and is significantly faster then the initial sql.
Note that country-wise summary rows will have a
null year. Similarly, year-wise summary rows will have a
null country name, which we convert to
All countries using the
coalesce function. We also use
coalesce in the
having clause so we don't drop summary rows where the country name is
In this specific example, we needed to exclude data for USA in only two of the four aggregate groups. If the requirement was to exclude data for USA in all the summary rows, we would use the
where clause to filter USA data before any aggregation occurs, rather than doing it in the
having clause. The following query meets these requirements.
1select coalesce(c.ctry_name,'All countries') as ctry_name,2 i.year_nbr,3 sum(i.item_cnt) as tot_cnt,4 sum(i.invoice_amt) as tot_amt5 from country c6 inner join invoice i on (i.ctry_code = c.ctry_code)7 where c.ctry_name != 'USA' -- exclude USA in all summaries8 group by cube(c.ctry_name, i.year_nbr)
group by supports extensions like
rollup and others that are extremely useful and are worth spending the time to understand better. We will take a look at
group by in more detail soon.
The actual sql this example is based on had 24 small queries that were combined with
union and had more than two tables joined in each query. The rewritten sql was just 10 lines long compared to over 200 lines for the original code and ran in a few seconds compared to over half an hour for the original query.