update docs

Brady Wyllie 2025-08-19 13:51:34 +00:00 committed by GCP Dataform
parent 45fe5256b3
commit aa1314eecc
4 changed files with 26 additions and 11 deletions

@@ -9,20 +9,16 @@ At PPHE, we use Dataform for all basic SQL transformations of raw data sources.
 * board - for custom Board reports
 * looker - for custom Looker (Studio) reports
 * warehouse - for business-critical tables that combine data from many sources
-* sources - for declaration and transformation of raw data to cleaned staging tables
-* e.g, opera - for data that originates from Opera
-* Within each source folder, the following structure applies:
-* raw - for raw source table declaration statements only
-* staging - for essential transformations of source tables, including renaming and recasting of fields, unioning of raw tables that come from different versions of the source system, and capturing deleted or records
+* raw - for declarations of raw data as they exist in their source systems
+* staging - for essential transformations of source tables, including renaming and recasting of fields, unioning of raw tables that come from different versions of the source system, and capturing deleted records
 The gold layer should be thought of as the cleanest final layer where any employee of the business may find useful data, and the staging layer should be thought of as a sandbox where analysts and engineers can begin building new analyses.
 Any time a new data source is (E)xtracted and (L)oaded to our GCP environment, it should flow through Dataform in the following order:
-1. As a new source directory at definitions/sources
-2. As new table declarations for all tables at definitions/sources/{source_name}/raw
-3. As staging transformations, with assertions to validate data quality, for all tables at definitions/sources/{source_name}/staging
-4. As new tables and/or fields incorporated into the different gold layer destinations at definitions/gold
+1. As new table declarations for all tables at definitions/raw/{source_name}/
+2. As staging transformations, with assertions to validate data quality, for all tables at definitions/staging/{source_name}/
+3. As new tables and/or fields incorporated into the different gold layer destinations at definitions/gold
 ## Dev Environments
@@ -34,8 +30,8 @@ Each Dataform has their own development environment to prevent collisions while
 * GCP Project: pphe-data-dev
 * BQ Datasets:
-* Sources: dev_{username}_src_{source_name}_{stg}
-* e.g, dev_bwyllie_src_opera_stg
+* Staging Tables: dev_{username}_stg_{source_name}
+* e.g, dev_bwyllie_stg_opera
 * Gold Tables: dev_{username}_gld_{destination_name}
 * e.g, dev_bwyllie_gld_board

@@ -0,0 +1,9 @@
+# Gold Layer
+The gold layer consists of finalized datasets that are fit for consumption. Our gold layer is three-pronged:
+1. Board - for custom tables designed to be ingested by Board reports
+2. Looker - for custom tables designed to be ingested by Looker reports
+3. Warehouse - for a well-structured data warehouse, with tables that may be easily used for further analysis (and in Board and Looker)
+The gold layer has the greatest entry requirements of all the layers. Dataform models within "gold" should combine data from multiple source systems and output the data in a clean and intuitive structure. Simplicity is emphasized more than completeness.
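
To make the "combine data from multiple source systems" expectation concrete, a gold warehouse table might be sketched roughly as follows. The staging datasets `stg_opera` and `stg_finance`, their tables and columns, the output dataset `gld_warehouse`, and the file path are all hypothetical placeholders rather than names from this repository.

```sqlx
config {
  // Hypothetical file: definitions/gold/warehouse/reservation_revenue.sqlx
  type: "table",
  schema: "gld_warehouse",   // assumed gold dataset name
  description: "One row per reservation with its total invoiced amount."
}

-- Join two (hypothetical) staging models from different source systems
-- and expose a small, intuitive set of columns.
select
  r.reservation_id,
  r.arrival_date,
  sum(i.amount) as invoiced_amount
from ${ref({schema: "stg_opera", name: "reservations"})} as r
left join ${ref({schema: "stg_finance", name: "invoices"})} as i
  on i.reservation_id = r.reservation_id
group by r.reservation_id, r.arrival_date
```

The sketch deliberately exposes only a few columns, in line with the preference for simplicity over completeness stated above.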

@@ -0,0 +1,5 @@
+# Raw Layer
+The raw layer consists of raw data as it exists in its underlying source system. Tables should be organized in datasets based on the source system name and version whenever multiple versions of the same source exist (e.g, Opera Cloud vs Opera 5.6).
+Raw data should be as true to the source as possible before landing in BigQuery. In Dataform, the raw layer should consist only of source declarations.
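
As a rough illustration of what "only source declarations" could look like, a Dataform SQLX file for a single Opera Cloud table might resemble the sketch below. The file path, the dataset name `opera_cloud`, and the table name `reservations` are hypothetical; `database` is omitted on the assumption that the table lives in the project's default database.

```sqlx
config {
  // Hypothetical file: definitions/raw/opera/opera_cloud_reservations.sqlx
  // A declaration only registers an existing BigQuery table so that other
  // actions can reference it with ref(); it contains no SQL of its own.
  type: "declaration",
  schema: "opera_cloud",   // assumed dataset: source system name + version
  name: "reservations"     // assumed raw table name
}
```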

@@ -0,0 +1,5 @@
+# Staging Layer
+The staging layer consists of source-level data that has been transformed just enough to make it easy to work with in the gold layer. Transformations should be limited to renaming and recasting of fields, simple joins to identify deleted records, and unioning of data from different source system versions.
+Staging data should represent cleaned objects from raw without introducing complex business logic or dimensions from other sources.
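
A staging model along these lines could be sketched as one SQLX view that unions two source versions, renames and recasts fields, and attaches assertions. Every dataset, table, and column name here is hypothetical, and the sketch assumes both raw tables are already declared in the raw layer; deleted-record handling is left out to keep it short.

```sqlx
config {
  // Hypothetical file: definitions/staging/opera/reservations.sqlx
  type: "view",
  schema: "stg_opera",       // assumed staging dataset name
  name: "reservations",
  assertions: {
    // Each reservation should appear once per source version.
    uniqueKey: ["reservation_id", "source_version"],
    nonNull: ["reservation_id", "arrival_date"]
  }
}

-- Opera Cloud rows: rename and recast only, no business logic.
select
  cast(resv_id as string) as reservation_id,
  cast(arrival as date) as arrival_date,
  'opera_cloud' as source_version
from ${ref({schema: "opera_cloud", name: "reservations"})}

union all

-- Opera 5.6 rows, mapped onto the same column names and types.
select
  cast(reservation_no as string) as reservation_id,
  cast(arrival_dt as date) as arrival_date,
  'opera_5_6' as source_version
from ${ref({schema: "opera_5_6", name: "reservations"})}
```

Passing an object with `schema` and `name` to `ref()` avoids ambiguity when the same table name exists in more than one dataset, as it does in this sketch.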