Wildcard tables in BigQuery

2024.12.01 — 3 min read

Imagine you have multiple tables that have similar structure and contain data for different time periods (or different locations). You want to combine the data from all the tables for some historical analysis. Google BigQuery has a powerful feature that allows you to query all these tables in one single SQL statement. Rather than using union all to concatenate data from each of the tables, you can use a wildcard in the table name to automatically query all tables that match the wildcard expression. In this post, we look at some basic examples of using BigQuery wildcard tables.

The NYC TLC Trips public dataset hosted by BigQuery contains several years of New York City taxi trip records collected by the NYC Taxi and Limousine Commission (TLC). As shown below, Yellow taxi trip data is stored in one table for each year, from 2011 to 2023.

1select table_name
2from `bigquery-public-data.new_york_taxi_trips.INFORMATION_SCHEMA.TABLES`
3where table_name like 'tlc_yellow%'
4order by table_name
5;

table_name
tlc_yellow_trips_2011
tlc_yellow_trips_2012
tlc_yellow_trips_2013
tlc_yellow_trips_2014
tlc_yellow_trips_2015
tlc_yellow_trips_2016
tlc_yellow_trips_2017
tlc_yellow_trips_2018
tlc_yellow_trips_2019
tlc_yellow_trips_2020
tlc_yellow_trips_2021
tlc_yellow_trips_2022
tlc_yellow_trips_2023

table_name
tlc_yellow_trips_2011
tlc_yellow_trips_2012
tlc_yellow_trips_2013
tlc_yellow_trips_2014
tlc_yellow_trips_2015
tlc_yellow_trips_2016
tlc_yellow_trips_2017
tlc_yellow_trips_2018
tlc_yellow_trips_2019
tlc_yellow_trips_2020
tlc_yellow_trips_2021
tlc_yellow_trips_2022
tlc_yellow_trips_2023

To use the trip data for 2021, 2022 and 2023, we can use the wildcard table expression `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_202*` (note that the expression is enclosed in backticks).

1select concat('202',_table_suffix) as year, count(*) as total_trips
2from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_202*`
3group by 1
4;

One advantage of using the wildcard table expression is the ability to use a query without any changes when new tables are created with data for future periods. When 2024 data is added as a new table, as long as the table name matches the pattern and the table structure matches existing tables, the above query will also process the new table.

The _table_suffix pseudocolumn in the query takes on values that the wildcard (*) in the table name corresponds to. If we use tlc_yellow_trips_202* as the wildcard expression, _table_suffix will be 1 for the 2021 table, 2 for the 2022 table and 3 for the 2023 table.

In our second example, suppose we want to combine 2011 and 2012 data. We need to fetch data from two tables corresponding to the two years. We could independently select the data and use union all to combine them as follows.

1select *, '2011' as year
2from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2011`
3union all
4select *, '2012' as year
5from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2012`
6;

Or, we could use the wild card expression and use _table_suffix to generate the year for data from each table as follows

1select *, concat('201',_table_suffix) as year
2from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_201*`
3where _table_suffix in ('1','2')
4;

In the BiqQuery console, you can verify that both queries will process the same amount of data (around 76GB). If we were combining the data for all years from 2011 thru 2019, the query using wildcard tables feature is much simpler, compared to one where we would write out 9 different queries and use union all to combine the data.

When we use the _table_suffix pseudocolumn in the where clause to filter tables to be selected for processing the query, only the tables matching the filter condition are scanned as long as a constant expression is used in the where clause condition. For example, in the previous query, only the 2011 and 2012 tables are scanned because the where clause uses a constant list of values to filter tables.

For our final example, the goal is to compare the 2011 and 2012 average of total_amount column for all combinations of the metrics vendor_id, passenger_count, payment_type, pickup_location_id and the month parth of pickup_datetime. For this we need a query that outputs all combinations of the given metrics that exist in either 2011 or 2012 data and show the average amount for 2011 and 2012 for each. We can contrast the usual method of separately querying the data from the two tables and then combining the results, to a simpler query that uses wildcard tables.

In the union all approach, we have to first fetch data from the 2011 table and combine it with data from the 2012 table in a CTE and then use it to create the grouped rows. The following query shows such an approach. Another approach could be to do the group by on the two tables separately, then combine them and do a pivot operation to display the 2011 and 2012 summary data on the same line for each combination of the metrics.

1with combined_data as (
2  select vendor_id, passenger_count, payment_type, pickup_location_id,
3         extract(month from pickup_datetime) AS month, total_amount, 2011 as year
4    from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2011`
5   where total_amount > 0
6  union all
7  select vendor_id, passenger_count, payment_type, pickup_location_id,
8         extract(month from pickup_datetime) AS month, total_amount, 2012 as year
9    from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2012`
10   where total_amount > 0
11)
12select vendor_id, passenger_count, payment_type, pickup_location_id,
13       month,
14       avg(case when year = 2011 then total_amount end) as avg_2011,
15       avg(case when year = 2012 then total_amount end) as avg_2012
16  from combined_data
17 group by vendor_id, passenger_count, payment_type, pickup_location_id, month
18;

Compare that to the simplicity of the following query where we use wildcard table names and a filter using the pseudocolumn _table_suffix to fetch the data for both years in one select statement.

1select vendor_id, passenger_count, payment_type, pickup_location_id,
2       extract(month from pickup_datetime) AS month,
3       avg(case when _table_suffix = '1' then total_amount end) as avg_2011,
4       avg(case when _table_suffix = '2' then total_amount end) as avg_2012
5  from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_201*`
6 where _table_suffix in ('1','2')
7   and total_amount > 0
8 group by vendor_id, passenger_count, payment_type, pickup_location_id, month
9;

Although there are advantages to using wildcard tables, we should also be aware of limitations when using them, some of which are

Wildcard tables are available only in GoogleSQL and not when using legacy SQL
Wildcard tables only support BigQuery storage (tables) and not external tables or views
Queries against multiple tables using wildcard tables do not support cached results

Oracle and PostgreSQL do not directly support wildcard tables and we will need to use dynamic sql to achieve similar functionality.

Thanks for reading! Feel free to reach out if you have any interesting SQL or data projects that I can help with.