2025-08-19 13:51:34 +00:00

PPHE Dataform

Dataform is a tool for managing the SQL queries used in the (T)ransformation stage of an ETL/ELT pipeline. Dataform compiles basic SQL queries into the full statements that create and update tables, so authors can focus on business logic instead of DML and DDL syntax.

At PPHE, we use Dataform for all basic SQL transformations of raw data sources. Our Dataform repository is structured as follows:

  • definitions - top-level directory for all .sqlx (Dataform SQL) files
    • gold - for transformations that write to gld_ datasets in BigQuery
      • board - for custom Board reports
      • looker - for custom Looker (Studio) reports
      • warehouse - for business-critical tables that combine data from many sources
    • raw - for declarations of raw data as they exist in their source systems
    • staging - for essential transformations of source tables, including renaming and recasting of fields, unioning of raw tables that come from different versions of the source system, and capturing deleted records

The gold layer is the cleanest, final layer, where any employee of the business can find useful data. The staging layer is a sandbox where analysts and engineers can begin building new analyses.

Any time a new data source is (E)xtracted and (L)oaded to our GCP environment, it should flow through Dataform in the following order:

  1. As new table declarations for all tables at definitions/raw/{source_name}/
  2. As staging transformations, with assertions to validate data quality, for all tables at definitions/staging/{source_name}/
  3. As new tables and/or fields incorporated into the different gold layer destinations at definitions/gold/
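
As an illustration of steps 1 and 2, here is a minimal sketch of a raw declaration and a staging transformation with assertions. The raw dataset name (raw_opera), the table name (reservations), the column names, and the file paths in the comments are all hypothetical placeholders, not the actual PPHE definitions. The staging schema uses the Production dataset name as a stand-in; how the dev dataset names get applied is not shown here.

```sqlx
config {
  // Hypothetical file: definitions/raw/opera/reservations.sqlx
  // A declaration only registers an existing table so it can be ref()'d.
  type: "declaration",
  schema: "raw_opera",   // assumed raw dataset name
  name: "reservations"   // assumed raw table name
}
```

```sqlx
config {
  // Hypothetical file: definitions/staging/opera/stg_reservations.sqlx
  type: "table",
  schema: "src_opera_stg",
  description: "Renamed and recast Opera reservations (illustrative only)",
  // Built-in Dataform assertions that validate data quality on each run
  assertions: {
    uniqueKey: ["reservation_id"],
    nonNull: ["reservation_id", "arrival_date"]
  }
}

select
  -- Rename and recast raw fields (hypothetical column names)
  cast(resv_id as int64) as reservation_id,
  cast(arrival as date) as arrival_date
from ${ref("raw_opera", "reservations")}
```

Referencing the raw table through ref() rather than a hard-coded table name lets Dataform track the dependency between the declaration and the staging table.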

Dev Environments

PPHE's data models are separated into two environments: Production and Development. Production includes finalized transformations that are safe to rely on for analysis and reporting. Development includes transformations that have not been completely validated yet and may still be in testing.

To keep these environments isolated, all new Dataform transformations automatically write to Development. Only after undergoing code review and the CI/CD process do transformations get promoted to Production.

Each Dataform developer has their own development environment to prevent collisions while working on new queries. Dev environments write to BigQuery in the following manner:

  • GCP Project: pphe-data-dev
    • BQ Datasets:
      • Staging Tables: dev_{username}_stg_{source_name}
        • e.g., dev_bwyllie_stg_opera
      • Gold Tables: dev_{username}_gld_{destination_name}
        • e.g., dev_bwyllie_gld_board

Dev transformations also include a built-in WHERE clause that selects only a small amount of data from the raw source tables.
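
One way such a dev-only filter can be expressed in SQLX is with Dataform's when() helper and a project-level compilation variable. This is a sketch under stated assumptions, not necessarily how this repository implements it: the variable name env, the raw table raw_opera.reservations, the column arrival, and the seven-day window are all hypothetical.

```sqlx
config {
  type: "table",
  schema: "src_opera_stg"
}

select *
from ${ref("raw_opera", "reservations")}
-- Assumes a compilation variable named "env" is defined in the project settings.
-- when() emits the WHERE clause only if the condition is true, else nothing.
${when(
  dataform.projectConfig.vars.env === "dev",
  "where cast(arrival as date) >= date_sub(current_date(), interval 7 day)"
)}
```

With a filter of this kind, dev runs scan only a recent slice of the raw data, while the Production query compiles without the filter.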

Once the transformations have been successfully reviewed and promoted to Prod, they write to BigQuery as follows:

  • GCP Project: pphe-data-pro
    • BQ Datasets:
      • Sources: src_{source_name}_stg
        • e.g., src_opera_stg
      • Gold Tables: gld_{destination_name}
        • e.g., gld_board

New releases are deployed to Production on a weekly cadence to prevent excessive refreshing of large tables.