core

Ergonomic wrapper for pandas_gbq that simplifies loading BigQuery data into DataFrames

This module provides an ergonomic wrapper around pandas_gbq to simplify working with BigQuery in pandas. The main functions are: read (load a query or table into a DataFrame), to (write a DataFrame to a table), ex (execute DDL/DML without returning results), build_read (read a table, building it first if it doesn't exist), exists (check whether a table exists), drop (drop a table), and the Table helper class for parsing table references.

All functions include verbose timing and size reporting by default.
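The timing and size reporting is layered on top of pandas_gbq, not part of it. A minimal sketch of that kind of wrapper, with hypothetical names (this is not the module's actual implementation):

```python
import time
from functools import wraps

def timed(fn):
    """Hypothetical decorator sketching the verbose timing the wrappers add."""
    @wraps(fn)
    def inner(*args, verbose=True, **kwargs):
        t0 = time.time()
        out = fn(*args, **kwargs)          # delegate to the wrapped call
        if verbose:
            print(f"{fn.__name__} took {time.time() - t0:.2f}s")
        return out
    return inner

@timed
def double(x):
    return 2 * x
```

The real functions also report DataFrame size alongside the elapsed time.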

Imports

from nbdev.showdoc import *

Credentials helper - Get credentials from an environment variable for CI/CD, or fall back to ADC (Application Default Credentials)


source

get_creds

 get_creds ()

Get credentials and project_id from environment variable or ADC
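The exact mechanism isn't shown here; a plausible sketch of the env-var-then-ADC pattern follows. The variable name GBQ_CREDS_JSON is hypothetical, and real code would build google.oauth2 credentials from the parsed JSON:

```python
import os
import json

def get_creds_sketch():
    # Hypothetical: read service-account JSON from an env var (CI/CD),
    # else return (None, None) so pandas_gbq falls back to
    # Application Default Credentials.
    raw = os.environ.get("GBQ_CREDS_JSON")
    if raw:
        info = json.loads(raw)
        return info, info.get("project_id")
    return None, None
```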

Core read function - Wraps pandas_gbq.read_gbq with automatic type conversion and timing


source

read

 read (query_or_table:str, verbose:bool=True, convert_dtypes:bool=True,
       date_cols:list=None, str_cols:list=['scv_id'],
       use_bqstorage_api:bool=True, credentials=None, project_id=None,
       **kwargs)

Load data from BigQuery query or table into DataFrame

|  | Type | Default | Details |
|---|---|---|---|
| query_or_table | str |  | BigQuery SQL query or table reference |
| verbose | bool | True | Print timing and size info |
| convert_dtypes | bool | True | Convert to pandas nullable types |
| date_cols | list | None | Additional columns to convert to datetime |
| str_cols | list | ['scv_id'] | Columns to keep as string/object type |
| use_bqstorage_api | bool | True | Use Storage API for faster downloads |
| credentials | NoneType | None | Google credentials (defaults to ADC) |
| project_id | NoneType | None | GCP project ID (defaults to credential's project) |
| kwargs | VAR_KEYWORD |  |  |

Type conversion helper - Converts BigQuery types to appropriate pandas nullable types


source

convert_bq_dtypes

 convert_bq_dtypes (df:pandas.core.frame.DataFrame, date_cols:list=None,
                    str_cols:list=None)

Convert BigQuery data types to pandas-compatible types

|  | Type | Default | Details |
|---|---|---|---|
| df | DataFrame |  | DataFrame to convert |
| date_cols | list | None | List of columns to convert to datetime |
| str_cols | list | None | List of columns to keep as string/object type |
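A sketch of what such a conversion might look like, assuming it leans on pandas' built-in nullable-dtype conversion with targeted overrides (not the module's actual implementation):

```python
import pandas as pd

def convert_bq_dtypes_sketch(df, date_cols=None, str_cols=None):
    # Start from pandas nullable types (Int64, boolean, string, ...),
    # then apply the caller's per-column overrides.
    out = df.convert_dtypes()
    for c in (date_cols or []):
        out[c] = pd.to_datetime(out[c])      # force datetime64
    for c in (str_cols or []):
        out[c] = out[c].astype("string")     # keep IDs as strings, not numbers
    return out
```

Keeping ID-like columns (e.g. scv_id) as strings avoids lossy integer/float coercion.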

Write function - Wraps pandas_gbq.to_gbq with timing info


source

to

 to (df:pandas.core.frame.DataFrame, destination_table:str,
     verbose:bool=True, credentials=None, project_id=None, **kwargs)

Write DataFrame to BigQuery table

Memory helper - Calculate DataFrame size in GB
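This helper isn't shown with a signature above; the calculation is presumably something along these lines (a sketch, not the exported function):

```python
import pandas as pd

def df_gb_sketch(df: pd.DataFrame) -> float:
    # deep=True counts the actual bytes held by object/string columns,
    # not just the pointers, so the figure is closer to true RAM usage.
    return df.memory_usage(deep=True).sum() / 1e9
```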

Execute function - Run queries without returning results (DDL/DML operations)


source

ex

 ex (query:str, project_id:str=None, verbose:bool=True, credentials=None,
     **kwargs)

Execute query in BigQuery without returning results


source

parse_query

 parse_query (q:str)

Parse query to count different operation types

|  | Type | Details |
|---|---|---|
| q | str | Query string |

def create_top_terms(period, days):
    return f"""
    CREATE OR REPLACE TABLE `bq-sandbox-motdam.temporary.top_terms_{period}` AS
    WITH ranked AS (
      SELECT region_name, term, COUNT(*) as appearances, AVG(rank) as avg_rank,
        ROW_NUMBER() OVER (PARTITION BY region_name ORDER BY COUNT(*) DESC, AVG(rank)) as rn
      FROM `bigquery-public-data.google_trends.international_top_rising_terms`
      WHERE country_name = 'United Kingdom'
        AND region_name IN ('England', 'Scotland', 'Wales', 'Northern Ireland')
        AND refresh_date BETWEEN CURRENT_DATE() - {days} AND CURRENT_DATE()
        AND rank <= 100
      GROUP BY region_name, term
    )
    SELECT region_name, term as top_term_{period}
    FROM ranked WHERE rn = 1;
    """

periods = [('today', 1), ('week', 8), ('month', 31), ('year', 366)]
query_pipeline = ''.join([create_top_terms(period, days) for period, days in periods])
final_table = 'bq-sandbox-motdam.temporary.top_terms_summary'
query_pipeline += f"""
CREATE OR REPLACE TABLE `{final_table}` AS
SELECT t.region_name, t.top_term_today, w.top_term_week, m.top_term_month, y.top_term_year
FROM `bq-sandbox-motdam.temporary.top_terms_today` as t
JOIN `bq-sandbox-motdam.temporary.top_terms_week` as w ON t.region_name = w.region_name
JOIN `bq-sandbox-motdam.temporary.top_terms_month` as m ON t.region_name = m.region_name
JOIN `bq-sandbox-motdam.temporary.top_terms_year` as y ON t.region_name = y.region_name
ORDER BY t.region_name;
"""
parse_query(query_pipeline)
{'queries': 5, 'creates': 5, 'inserts': 0, 'deletes': 0, 'drops': 0}
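Given the counts above, parse_query appears to tally top-level statements (each of the five CREATEs is also counted as a query). One plausible sketch, not the actual implementation:

```python
def parse_query_sketch(q: str) -> dict:
    # Split on ';' into statements; every statement counts as a query,
    # and statements are also bucketed by their leading keyword.
    stmts = [s.strip() for s in q.split(';') if s.strip()]
    counts = {'queries': len(stmts), 'creates': 0, 'inserts': 0,
              'deletes': 0, 'drops': 0}
    buckets = {'CREATE': 'creates', 'INSERT': 'inserts',
               'DELETE': 'deletes', 'DROP': 'drops'}
    for s in stmts:
        kw = s.split(None, 1)[0].upper()
        if kw in buckets:
            counts[buckets[kw]] += 1
    return counts
```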

Table helper class - Parse and represent BigQuery table references


source

Table

 Table (project:str, dataset:str, name:str)
t = Table.from_id('bq-sandbox-motdam.temporary.top_terms_summary')
t
Table(project='bq-sandbox-motdam', dataset='temporary', name='top_terms_summary')
t.with_(dataset='dev', project='dev')
Table(project='dev', dataset='dev', name='top_terms_summary')
t
Table(project='bq-sandbox-motdam', dataset='temporary', name='top_terms_summary')
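The behavior above — with_ returns a new instance while the original t is untouched — can be sketched with a frozen dataclass (a hypothetical re-implementation, not the module's source):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TableSketch:
    project: str
    dataset: str
    name: str

    @classmethod
    def from_id(cls, table_id: str):
        # 'project.dataset.name' -> three fields
        return cls(*table_id.split('.'))

    def with_(self, **kwargs):
        # frozen dataclass: replace() returns a copy with fields swapped
        return replace(self, **kwargs)

    def __str__(self):
        return f"{self.project}.{self.dataset}.{self.name}"
```

Immutability makes it safe to derive dev/staging variants of a table reference without mutating shared state.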

Existence checker - Check if a BigQuery table exists


source

exists

 exists (table_id:str, project_id:str=None, credentials=None,
         verbose:bool=True, **kwargs)

Check if a BigQuery table exists

Build read function - Read the table if it exists; otherwise execute the query to build it, then read. Useful for running a feature pipeline once and reading the cached table on subsequent calls


source

build_read

 build_read (table_id:str, query:str, project_id:str=None,
             force:bool=False, verbose:bool=True, credentials=None,
             **kwargs)

Read table if exists, otherwise execute query then read
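The control flow amounts to roughly the following. Dependencies are passed in explicitly here so the sketch is self-contained; the real function uses this module's exists, ex, and read directly:

```python
def build_read_sketch(table_id, query, exists, ex, read, force=False):
    # Build the table only when it is missing (or a rebuild is forced),
    # then read it back.
    if force or not exists(table_id):
        ex(query)
    return read(table_id)
```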

Drop function - Drop a BigQuery table when it is no longer needed


source

drop

 drop (table_id:str, project_id:str=None, credentials=None,
       verbose:bool=True, **kwargs)

Drop a BigQuery table