update docs
parent 45fe5256b3
commit aa1314eecc
18
README.md
@@ -9,20 +9,16 @@ At PPHE, we use Dataform for all basic SQL transformations of raw data sources.
 * board - for custom Board reports
 * looker - for custom Looker (Studio) reports
 * warehouse - for business-critical tables that combine data from many sources
-* sources - for declaration and transformation of raw data to cleaned staging tables
-* e.g, opera - for data that originates from Opera
-* Within each source folder, the following structure applies:
-* raw - for raw source table declaration statements only
-* staging - for essential transformations of source tables, including renaming and recasting of fields, unioning of raw tables that come from different versions of the source system, and capturing deleted or records
+* raw - for declarations of raw data as they exist in their source systems
+* staging - for essential transformations of source tables, including renaming and recasting of fields, unioning of raw tables that come from different versions of the source system, and capturing deleted records
 
 The gold layer should be thought of as the cleanest final layer where any employee of the business may find useful data, and the staging layer should be thought of as a sandbox where analysts and engineers can begin building new analyses.
 
 Any time a new data source is (E)xtracted and (L)oaded to our GCP environment, it should flow through Dataform in the following order:
 
-1. As a new source directory at definitions/sources
-2. As new table declarations for all tables at definitions/sources/{source_name}/raw
-3. As staging transformations, with assertions to validate data quality, for all tables at definitions/sources/{source_name}/staging
-4. As new tables and/or fields incorporated into the different gold layer destinations at definitions/gold
+1. As new table declarations for all tables at definitions/raw/{source_name}/
+2. As staging transformations, with assertions to validate data quality, for all tables at definitions/staging/{source_name}/
+3. As new tables and/or fields incorporated into the different gold layer destinations at definitions/gold
 
 ## Dev Environments
 
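The new three-step flow order in this hunk can be sketched as a small path helper. This is only an illustration in Python (the repository itself holds Dataform definitions, not Python); the function name `layer_paths` is hypothetical, and only the directory templates are taken from the numbered steps on the `+` side.

```python
from pathlib import PurePosixPath


def layer_paths(source_name: str) -> list[PurePosixPath]:
    """Return the directories a new source passes through, in flow order.

    Hypothetical helper: only the path templates come from the README.
    """
    root = PurePosixPath("definitions")
    return [
        root / "raw" / source_name,      # step 1: table declarations
        root / "staging" / source_name,  # step 2: staging transformations + assertions
        root / "gold",                   # step 3: gold layer destinations
    ]


for path in layer_paths("opera"):
    print(path)
```

Compared with the removed layout, there is no per-source parent directory any more: the layer comes first and the source name second.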
@@ -34,8 +30,8 @@ Each Dataform has their own development environment to prevent collisions while
 
 * GCP Project: pphe-data-dev
 * BQ Datasets:
-* Sources: dev_{username}_src_{source_name}_{stg}
-* e.g, dev_bwyllie_src_opera_stg
+* Staging Tables: dev_{username}_stg_{source_name}
+* e.g, dev_bwyllie_stg_opera
 * Gold Tables: dev_{username}_gld_{destination_name}
 * e.g, dev_bwyllie_gld_board
 
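The renamed dev dataset convention can be expressed as a one-line template. The sketch below is a Python illustration only; the helper name `dev_dataset` and its validation are assumptions, while the `dev_{username}_stg_{source_name}` and `dev_{username}_gld_{destination_name}` patterns and both example names come from the hunk above.

```python
def dev_dataset(username: str, layer: str, name: str) -> str:
    """Build a dev BigQuery dataset name (hypothetical helper).

    layer is "stg" (name = source system) or "gld" (name = gold destination).
    """
    if layer not in ("stg", "gld"):
        raise ValueError(f"unknown layer: {layer!r}")
    return f"dev_{username}_{layer}_{name}"


print(dev_dataset("bwyllie", "stg", "opera"))  # dev_bwyllie_stg_opera
print(dev_dataset("bwyllie", "gld", "board"))  # dev_bwyllie_gld_board
```

Rejecting the old `src` infix here is a design choice of the sketch, meant to surface names that still follow the removed `dev_{username}_src_{source_name}_{stg}` pattern.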
9
definitions/gold/README.md
Normal file
@@ -0,0 +1,9 @@
+# Gold Layer
+
+The gold layer consists of finalized datasets that are fit for consumption. Our gold layer is three-pronged:
+
+1. Board - for custom tables designed to be ingested by Board reports
+2. Looker - for custom tables designed to be ingested by Looker reports
+3. Warehouse - for a well-structured data warehouse, with tables that may be easily used for further analysis (and in Board and Looker)
+
+The gold layer has the greatest entry requirements of all the layers. Dataform models within "gold" should combine data from multiple source systems and output the data in a clean and intuitive structure. Simplicity is emphasized more than completeness.
5
definitions/raw/README.md
Normal file
@@ -0,0 +1,5 @@
+# Raw Layer
+
+The raw layer consists of raw data as it exists in its underlying source system. Tables should be organized in datasets based on the source system name and version whenever multiple versions of the same source exist (e.g, Opera Cloud vs Opera 5.6).
+
+Raw data should be as true to the source as possible before landing in BigQuery. In Dataform, the raw layer should consist only of source declarations.
5
definitions/staging/README.md
Normal file
@@ -0,0 +1,5 @@
+# Staging Layer
+
+The staging layer consists of source-level data that has been transformed just enough to make it easy to work with in the gold layer. Transformations should be limited to renaming and recasting of fields, simple joins to identify deleted records, and unioning of data from different source system versions.
+
+Staging data should represent cleaned objects from raw without introducing complex business logic or dimensions from other sources.
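To make the permitted staging transformations concrete, here is a rough Python sketch of renaming, recasting, unioning two source versions, and flagging deleted records. In the repository this logic would live in Dataform SQL models; Python is used only so the example is self-contained. Every table, field name (`RESV_ID`, `reservationId`, and so on), and helper below is hypothetical, and a set membership test stands in for the "simple join" against deleted records.

```python
from datetime import date

# Hypothetical raw rows from two versions of the same source system.
raw_opera_5 = [{"RESV_ID": "101", "ARRIVAL": "2024-03-01"}]
raw_opera_cloud = [{"reservationId": "202", "arrivalDate": "2024-03-02"}]
# Hypothetical set of ids that the source reports as deleted.
deleted_ids = {101}

# Per-version mapping: raw field name -> (staging field name, cast function).
FIELD_MAPS = {
    "opera_5": {"RESV_ID": ("reservation_id", int),
                "ARRIVAL": ("arrival_date", date.fromisoformat)},
    "opera_cloud": {"reservationId": ("reservation_id", int),
                    "arrivalDate": ("arrival_date", date.fromisoformat)},
}


def stage(rows, version):
    """Rename and recast fields for one source version."""
    mapping = FIELD_MAPS[version]
    staged = []
    for row in rows:
        out = {new: cast(row[old]) for old, (new, cast) in mapping.items()}
        out["source_version"] = version
        staged.append(out)
    return staged


def stage_reservations():
    """Union both versions, then flag deleted records."""
    unioned = stage(raw_opera_5, "opera_5") + stage(raw_opera_cloud, "opera_cloud")
    for row in unioned:
        row["is_deleted"] = row["reservation_id"] in deleted_ids
    return unioned


for row in stage_reservations():
    print(row)
```

Nothing here aggregates, filters on business rules, or pulls in fields from another source; under the README's definition, that kind of logic belongs in the gold layer.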