HAVING, a less understood SQL clause
— 2 min read
Understanding how the group by
and having
clauses work can help us write more efficient sql. I want to demonstrate this with an example based on a real situation.
Let's assume we have two tables - a country table with a country code and name, and an invoice table with an invoice number, country code, year, month, item count, and amount.
1select * from country;
ctry_code ctry_name US USA CA Canada CN China .. .. 7 rows
1select * from invoice;
invoice_nbr ctry_cd year_nbr month_nbr item_cnt invoice_amt 2014001 BR 2014 3 13 162875 2021172 CA 2021 10 200 1299355 2020435 CN 2020 1 12 145654 2016201 US 2016 8 900 7125125 2021662 US 2021 4 100 800135 2018743 MX 2018 11 5 76124 .. .. .. .. .. .. 4 million rows
We want to write a sql that aggregates the invoice data
- for each country and year, for all countries other than USA
- for each country, for all countries other than USA
- for each year (including data for USA)
- a grand total (including data for USA)
This is the output we want from our query:
ctry_name year_nbr tot_cnt tot_amt Canada 2014 200 1300000 Canada 2015 220 1500000 Brazil 2014 200 1150000 Brazil 2015 180 1000000 Brazil 2018 150 750000 .. .. .. .. Canada null 420 2800000 Brazil null 530 2900000 .. .. .. .. All countries 2014 900 6406325 All countries 2015 1000 7306368 All countries 2018 1200 8206334 .. .. .. .. All countries null 6124 41261346
Here's one that gives us the expected results. It uses 6 sqls for year-wise totals for each of the 6 countries (other than USA), 6 sqls for country-wise totals for all years, one sql for year-wise totals for all countries, and finally one sql for the grand total. These 14 sqls are then combined with 13 UNION
operations as shown below.
1-- country, year totals2select c.ctry_name, i.year_nbr,3 sum(i.item_cnt) as tot_cnt,4 sum(i.invoice_amt) as tot_amt5 from country c6 inner join invoice i on (i.ctry_code = c.ctry_code)7 where c.ctry_name = 'Mexico'8 group by c.ctry_name, i.year_nbr -- grouping by country and year9UNION10....11.... -- unions for each of the 6 countries excluding USA12UNION13-- country totals14select c.ctry_name, null as year_nbr,15 sum(i.item_cnt) as tot_cnt,16 sum(i.invoice_amt) as tot_amt17 from country c18 inner join invoice i on (i.ctry_code = c.ctry_code)19 where c.ctry_name = 'Mexico'20 group by c.ctry_name -- grouping by just country21UNION22....23.... -- unions for each of the 6 countries excluding USA24UNION25-- totals for all countries by year26select 'All countries' as ctry_name, i.year_nbr,27 sum(i.item_cnt) as tot_cnt,28 sum(i.invoice_amt) as tot_amt29 from country c30 inner join invoice i on (i.ctry_code = c.ctry_code)31 group by i.year_nbr -- grouping by just year for all countries32UNION33-- totals for all countries and all years34select 'All countries' as ctry_name, null as year_nbr,35 sum(i.item_cnt) as tot_cnt,36 sum(i.invoice_amt) as tot_amt37 from country c38 inner join invoice i on (i.ctry_code = c.ctry_code)
This query returns the expected results, but apart from being difficult to maintain, it is also slow and inefficient. It fetches data from the invoice table 14 times to perform 14 aggregations followed by an expensive set of UNION
operations.
We can write a simpler and more efficient query with a better understanding of the capabilities of group by
and having
. Just as a where
clause allows us to filter raw data, having
lets us filter the resultset from a group by
operation. The following sql uses group by cube()
to generate aggregates for all combinations of country and year while filtering out the summary rows for USA with having
.
1select coalesce(c.ctry_name,'All countries') as ctry_name,2 i.year_nbr,3 sum(i.item_cnt) as tot_cnt,4 sum(i.invoice_amt) as tot_amt5 from country c6 inner join invoice i on (i.ctry_code = c.ctry_code)7 group by cube(c.ctry_name, i.year_nbr)8having coalesce(c.ctry_name,'x') != 'USA' -- exclude USA summary rows
What is this query doing? It uses cube
with group by
to get-
- one row for each country, year combination
- one row for each country (including data for all years)
- one row for each year (including data for all countries)
- one row for all years, all countries
It then excludes the rows where the ctry_name is USA using the having
clause. This query fetches data from the invoice table only once and is significantly faster then the initial sql.
Note that country-wise summary rows will have a null
year. Similarly, year-wise summary rows will have a null
country name, which we convert to All countries
using the coalesce
function. We also use coalesce
in the having
clause so we don't drop summary rows where the country name is null
.
In this specific example, we needed to exclude data for USA in only two of the four aggregate groups. If the requirement was to exclude data for USA in all the summary rows, we would use the where
clause to filter USA data before any aggregation occurs, rather than doing it in the having
clause. The following query meets these requirements.
1select coalesce(c.ctry_name,'All countries') as ctry_name,2 i.year_nbr,3 sum(i.item_cnt) as tot_cnt,4 sum(i.invoice_amt) as tot_amt5 from country c6 inner join invoice i on (i.ctry_code = c.ctry_code)7 where c.ctry_name != 'USA' -- exclude USA in all summaries8 group by cube(c.ctry_name, i.year_nbr)
group by
supports extensions like cube
, rollup
and others that are extremely useful and are worth spending the time to understand better. We will take a look at group by
in more detail soon.
The actual sql this example is based on had 24 small queries that were combined with union
and had more than two tables joined in each query. The rewritten sql was just 10 lines long compared to over 200 lines for the original code and ran in a few seconds compared to over half an hour for the original query.