Wildcard tables in BigQuery
— 3 min read
Imagine you have multiple tables that have similar structure and contain data for different time periods (or different locations). You want to combine the data from all the tables for some historical analysis. Google BigQuery has a powerful feature that allows you to query all these tables in one single SQL statement. Rather than using union all to concatenate data from each of the tables, you can use a wildcard in the table name to automatically query all tables that match the wildcard expression. In this post, we look at some basic examples of using BigQuery wildcard tables.
The NYC TLC Trips public dataset hosted by BigQuery contains several years of New York City taxi trip records collected by the NYC Taxi and Limousine Commission (TLC). As shown below, Yellow taxi trip data is stored in one table for each year, from 2011 to 2023.
1select table_name2from `bigquery-public-data.new_york_taxi_trips.INFORMATION_SCHEMA.TABLES`3where table_name like 'tlc_yellow%'4order by table_name5;
table_name tlc_yellow_trips_2011 tlc_yellow_trips_2012 tlc_yellow_trips_2013 tlc_yellow_trips_2014 tlc_yellow_trips_2015 tlc_yellow_trips_2016 tlc_yellow_trips_2017 tlc_yellow_trips_2018 tlc_yellow_trips_2019 tlc_yellow_trips_2020 tlc_yellow_trips_2021 tlc_yellow_trips_2022 tlc_yellow_trips_2023
To use the trip data for 2021, 2022 and 2023, we can use the wildcard table expression `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_202*` (note that the expression is enclosed in backticks).
1select concat('202',_table_suffix) as year, count(*) as total_trips2from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_202*`3group by 14;
One advantage of using the wildcard table expression is the ability to use a query without any changes when new tables are created with data for future periods. When 2024 data is added as a new table, as long as the table name matches the pattern and the table structure matches existing tables, the above query will also process the new table.
The _table_suffix
pseudocolumn in the query takes on values that the wildcard (*) in the table name corresponds to. If we use tlc_yellow_trips_202* as the wildcard expression, _table_suffix
will be 1 for the 2021 table, 2 for the 2022 table and 3 for the 2023 table.
In our second example, suppose we want to combine 2011 and 2012 data. We need to fetch data from two tables corresponding to the two years. We could independently select the data and use union all to combine them as follows.
1select *, '2011' as year2from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2011`3union all4select *, '2012' as year5from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2012`6;
Or, we could use the wild card expression and use _table_suffix
to generate the year for data from each table as follows
1select *, concat('201',_table_suffix) as year2from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_201*`3where _table_suffix in ('1','2')4;
In the BiqQuery console, you can verify that both queries will process the same amount of data (around 76GB). If we were combining the data for all years from 2011 thru 2019, the query using wildcard tables feature is much simpler, compared to one where we would write out 9 different queries and use union all to combine the data.
When we use the _table_suffix
pseudocolumn in the where clause to filter tables to be selected for processing the query, only the tables matching the filter condition are scanned as long as a constant expression is used in the where clause condition. For example, in the previous query, only the 2011 and 2012 tables are scanned because the where clause uses a constant list of values to filter tables.
For our final example, the goal is to compare the 2011 and 2012 average of total_amount column for all combinations of the metrics vendor_id, passenger_count, payment_type, pickup_location_id and the month parth of pickup_datetime. For this we need a query that outputs all combinations of the given metrics that exist in either 2011 or 2012 data and show the average amount for 2011 and 2012 for each. We can contrast the usual method of separately querying the data from the two tables and then combining the results, to a simpler query that uses wildcard tables.
In the union all approach, we have to first fetch data from the 2011 table and combine it with data from the 2012 table in a CTE and then use it to create the grouped rows. The following query shows such an approach. Another approach could be to do the group by on the two tables separately, then combine them and do a pivot operation to display the 2011 and 2012 summary data on the same line for each combination of the metrics.
1with combined_data as (2 select vendor_id, passenger_count, payment_type, pickup_location_id,3 extract(month from pickup_datetime) AS month, total_amount, 2011 as year4 from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2011`5 where total_amount > 06 union all7 select vendor_id, passenger_count, payment_type, pickup_location_id,8 extract(month from pickup_datetime) AS month, total_amount, 2012 as year9 from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2012`10 where total_amount > 011)12select vendor_id, passenger_count, payment_type, pickup_location_id,13 month,14 avg(case when year = 2011 then total_amount end) as avg_2011,15 avg(case when year = 2012 then total_amount end) as avg_201216 from combined_data17 group by vendor_id, passenger_count, payment_type, pickup_location_id, month18;
Compare that to the simplicity of the following query where we use wildcard table names and a filter using the pseudocolumn _table_suffix
to fetch the data for both years in one select statement.
1select vendor_id, passenger_count, payment_type, pickup_location_id,2 extract(month from pickup_datetime) AS month,3 avg(case when _table_suffix = '1' then total_amount end) as avg_2011,4 avg(case when _table_suffix = '2' then total_amount end) as avg_20125 from `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_201*`6 where _table_suffix in ('1','2')7 and total_amount > 08 group by vendor_id, passenger_count, payment_type, pickup_location_id, month9;
Although there are advantages to using wildcard tables, we should also be aware of limitations when using them, some of which are
- Wildcare tables are available only in GoogleSQL and not when using legacy SQL
- Wildcard tables only support BigQuery storage (tables) and not external tables or views
- Queries against multiple tables using wildcard tables do not support cached results
Oracle and PostgreSQL do not directly support wildcard tables and we will need to use dynamic sql to achieve similar functionality.