core

```python
from nbdev.showdoc import *
```
This module provides an ergonomic wrapper around pandas_gbq to simplify working with BigQuery in pandas. The main functions are:
- `read()` - Load data from BigQuery queries or tables into DataFrames with automatic type conversion
- `to()` - Write DataFrames to BigQuery tables
- `ex()` - Execute queries without returning results (useful for DDL/DML)
- `exists()` - Check if a table exists
- `build_read()` - Read a table if it exists, otherwise create it from a query first
- `drop()` - Drop a table
All functions include verbose timing and size reporting by default.
Imports
Credentials helper - Get credentials from environment variable for CI/CD, or fall back to ADC
get_creds
get_creds ()
Get credentials and project_id from environment variable or ADC
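A minimal sketch of the env-var branch, assuming the variable holds a service-account JSON blob. The variable name `GOOGLE_CREDENTIALS_JSON` and the helper name are hypothetical, and the real `get_creds` would build `google.oauth2` credential objects rather than return the raw dict:

```python
import os, json

def get_creds_sketch(env_var='GOOGLE_CREDENTIALS_JSON'):
    """Return (credentials_info, project_id) from an env var holding a
    service-account JSON blob, or (None, None) to signal the ADC fallback."""
    raw = os.environ.get(env_var)
    if raw is None:
        return None, None  # caller falls back to Application Default Credentials
    info = json.loads(raw)
    return info, info.get('project_id')
```

In CI/CD the secret is injected as an environment variable; locally, the `(None, None)` path lets the Google client libraries resolve ADC themselves.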
Core read function - Wraps pandas_gbq.read_gbq with automatic type conversion and timing
read
read (query_or_table:str, verbose:bool=True, convert_dtypes:bool=True, date_cols:list=None, str_cols:list=['scv_id'], use_bqstorage_api:bool=True, credentials=None, project_id=None, **kwargs)
Load data from BigQuery query or table into DataFrame
|  | Type | Default | Details |
|---|---|---|---|
| query_or_table | str |  | BigQuery SQL query or table reference |
| verbose | bool | True | Print timing and size info |
| convert_dtypes | bool | True | Convert to pandas nullable types |
| date_cols | list | None | Additional columns to convert to datetime |
| str_cols | list | ['scv_id'] | Columns to keep as string/object type |
| use_bqstorage_api | bool | True | Use Storage API for faster downloads |
| credentials | NoneType | None | Google credentials (defaults to ADC) |
| project_id | NoneType | None | GCP project ID (defaults to credential's project) |
| kwargs | VAR_KEYWORD |  |  |
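`read()` accepts either SQL text or a `project.dataset.table` reference in the same argument. One way such a wrapper could tell the two apart is a heuristic like this; the helper below is hypothetical, not part of the module:

```python
import re

# A bare table reference is three dot-separated identifiers
# (project ids may contain hyphens); anything else is treated as SQL.
_TABLE_RE = re.compile(r'^[\w-]+\.[\w$]+\.[\w$]+$')

def is_table_ref(query_or_table: str) -> bool:
    """True for a 'project.dataset.table' reference, False for SQL text."""
    return bool(_TABLE_RE.match(query_or_table.strip()))
```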
Type conversion helper - Converts BigQuery types to appropriate pandas nullable types
convert_bq_dtypes
convert_bq_dtypes (df:pandas.core.frame.DataFrame, date_cols:list=None, str_cols:list=None)
Convert BigQuery data types to pandas-compatible types
|  | Type | Default | Details |
|---|---|---|---|
| df | DataFrame |  | DataFrame to convert |
| date_cols | list | None | List of columns to convert to datetime |
| str_cols | list | None | List of columns to keep as string/object type |
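For illustration, the kind of conversion this performs can be reproduced with plain pandas (the columns below are made up; `scv_id` is kept as a string so leading zeros survive):

```python
import pandas as pd

df = pd.DataFrame({
    'scv_id': ['001', '002'],           # keep as string: leading zeros matter
    'n': [1, None],                     # should become nullable Int64, not float
    'd': ['2024-01-01', '2024-01-02'],  # a column to convert to datetime
})
out = df.convert_dtypes()               # object -> string, float with NaN -> Int64
out['d'] = pd.to_datetime(out['d'])     # explicit datetime conversion
```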
Write function - Wraps pandas_gbq.to_gbq with timing info
to
to (df:pandas.core.frame.DataFrame, destination_table:str, verbose:bool=True, credentials=None, project_id=None, **kwargs)
Write DataFrame to BigQuery table
Memory helper - Calculate DataFrame size in GB
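A sketch of such a helper, assuming it is a thin wrapper over `DataFrame.memory_usage`; the name `df_size_gb` is hypothetical:

```python
import pandas as pd

def df_size_gb(df: pd.DataFrame) -> float:
    """DataFrame memory footprint in GB (deep=True counts object payloads)."""
    return df.memory_usage(deep=True).sum() / 1e9
```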
Execute function - Run queries without returning results (DDL/DML operations)
ex
ex (query:str, project_id:str=None, verbose:bool=True, credentials=None, **kwargs)
Execute query in BigQuery without returning results
parse_query
parse_query (q:str)
Parse query to count different operation types
|  | Type | Details |
|---|---|---|
| q | str | Query string |
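A naive keyword counter gives the flavor of `parse_query`; the real implementation may count top-level statements differently (e.g. the `SELECT` inside a `CREATE ... AS SELECT` may or may not count as a separate query):

```python
import re

def parse_query_sketch(q: str) -> dict:
    """Count operation types in a multi-statement SQL string.
    Hypothetical reimplementation for illustration only."""
    patterns = {
        'queries': r'\bSELECT\b',
        'creates': r'\bCREATE\b',
        'inserts': r'\bINSERT\b',
        'deletes': r'\bDELETE\b',
        'drops':   r'\bDROP\b',
    }
    return {k: len(re.findall(p, q, re.IGNORECASE)) for k, p in patterns.items()}
```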
```python
def create_top_terms(period, days):
    return f"""
    CREATE OR REPLACE TABLE `bq-sandbox-motdam.temporary.top_terms_{period}` AS
    WITH ranked AS (
        SELECT region_name, term, COUNT(*) as appearances, AVG(rank) as avg_rank,
            ROW_NUMBER() OVER (PARTITION BY region_name ORDER BY COUNT(*) DESC, AVG(rank)) as rn
        FROM `bigquery-public-data.google_trends.international_top_rising_terms`
        WHERE country_name = 'United Kingdom'
            AND region_name IN ('England', 'Scotland', 'Wales', 'Northern Ireland')
            AND refresh_date BETWEEN CURRENT_DATE() - {days} AND CURRENT_DATE()
            AND rank <= 100
        GROUP BY region_name, term
    )
    SELECT region_name, term as top_term_{period}
    FROM ranked WHERE rn = 1;
    """

periods = [('today', 1), ('week', 8), ('month', 31), ('year', 366)]
query_pipeline = ''.join([create_top_terms(period, days) for period, days in periods])

final_table = 'bq-sandbox-motdam.temporary.top_terms_summary'
query_pipeline += f"""
CREATE OR REPLACE TABLE `{final_table}` AS
SELECT t.region_name, t.top_term_today, w.top_term_week, m.top_term_month, y.top_term_year
FROM `bq-sandbox-motdam.temporary.top_terms_today` as t
JOIN `bq-sandbox-motdam.temporary.top_terms_week` as w ON t.region_name = w.region_name
JOIN `bq-sandbox-motdam.temporary.top_terms_month` as m ON t.region_name = m.region_name
JOIN `bq-sandbox-motdam.temporary.top_terms_year` as y ON t.region_name = y.region_name
ORDER BY t.region_name;
"""

parse_query(query_pipeline)
```

    {'queries': 5, 'creates': 5, 'inserts': 0, 'deletes': 0, 'drops': 0}
Table helper class - Parse and represent BigQuery table references
Table
Table (project:str, dataset:str, name:str)
```python
t = Table.from_id('bq-sandbox-motdam.temporary.top_terms_summary')
t
```

    Table(project='bq-sandbox-motdam', dataset='temporary', name='top_terms_summary')

```python
t.with_(dataset='dev', project='dev')
```

    Table(project='dev', dataset='dev', name='top_terms_summary')

`with_` returns a new `Table`; the original is unchanged:

```python
t
```

    Table(project='bq-sandbox-motdam', dataset='temporary', name='top_terms_summary')
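The behavior above can be sketched with a frozen dataclass; `TableSketch` is illustrative, and the `id` property is an assumption, not a documented attribute of the real class:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TableSketch:
    project: str
    dataset: str
    name: str

    @classmethod
    def from_id(cls, table_id: str):
        """Split 'project.dataset.table' into its three parts."""
        project, dataset, name = table_id.split('.')
        return cls(project, dataset, name)

    def with_(self, **changes):
        """Return a copy with some fields replaced; self is unchanged."""
        return replace(self, **changes)

    @property
    def id(self) -> str:
        return f'{self.project}.{self.dataset}.{self.name}'
```

Freezing the dataclass is what makes `with_` safe: it can only build a new instance, never mutate the original.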
Existence checker - Check if a BigQuery table exists
exists
exists (table_id:str, project_id:str=None, credentials=None, verbose:bool=True, **kwargs)
Check if a BigQuery table exists
Build read function - Read the table if it exists; otherwise execute the query to create it, then read it. Useful for running a feature-building pipeline before querying its output
build_read
build_read (table_id:str, query:str, project_id:str=None, force:bool=False, verbose:bool=True, credentials=None, **kwargs)
Read table if exists, otherwise execute query then read
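The control flow reduces to a few lines. In this sketch the BigQuery-touching pieces (`exists`, `ex`, `read`) are passed in as callables so the logic can be shown without a connection; the real function calls the module's own helpers directly:

```python
def build_read_sketch(table_id, query, *, exists, ex, read, force=False):
    """Read table_id, building it from query first when needed."""
    if force or not exists(table_id):
        ex(query)            # create (or rebuild) the table first
    return read(table_id)    # then read it back
```

`force=True` rebuilds the table even if it already exists, which mirrors the `force` flag in the signature above.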
Drop function - makes it easier to drop tables if needed
drop
drop (table_id:str, project_id:str=None, credentials=None, verbose:bool=True, **kwargs)
Drop a BigQuery table