Compare commits
No commits in common. "main" and "3f211d8cb90268dfcc5e35f816df1ca4e6be3fde" have entirely different histories.
main...3f211d8cb9
.gitignore (vendored, Normal file, 1 line)

@@ -0,0 +1 @@
+node_modules/
README.md (98 lines changed)

@@ -1,45 +1,53 @@
-# test-pphe-data-pro
-
-Testing pphe-data-pro monorepo
-
-## Proposed monorepo structure
-
-The proposed structure is built with individual data pipelines in mind.
-
-* `pphe-data-pro` (repo root)
-  * `data-extract-load/`
-    * `pipeline-{source-name}/`
-    * `common/`
-  * `data-transform/`
-    * `dataform-project-{project-name}/`
-  * `infra/`
-    * `terraform/`
-    * `ci-cd/`
-  * `README.md`
-  * `.gitignore`
-
-### EL Pipelines
-
-The main idea is that each data pipeline should be fully independent from the others, except when shared (common) resources are used. Any GCP admin should be able to reconstruct one of our data pipelines end-to-end using Terraform alone. Consider the following EL pipeline directory structure for a single source system:
-
-* `pipeline-{source-name}/` (pipeline dir root)
-  * `terraform/` (all infra for the pipeline gets built here)
-    * `main.tf`
-    * `variables.tf`
-    * `outputs.tf`
-  * `cloud-run-service-{service-name}/` (if needed, a Cloud Run service goes here)
-    * `src/`
-    * `Dockerfile`
-    * `requirements.txt`
-  * `cloud-run-function-{function-name}/` (if needed, a Cloud Run function goes here)
-    * `main.py`
-    * `requirements.txt`
-  * `README.md` (pipeline documentation goes here)
-
-### Transform Pipelines
-
-Transformation pipelines combine data from multiple data sources and therefore should be pooled within a single directory. This directory houses a Dataform project.
-
-### Shared Infrastructure
-
-The `infra` directory houses all shared infrastructure, including Terraform scripts as well as CI/CD pipelines.
+# PPHE Dataform
+
+Dataform is a tool for managing SQL queries used in the (T)ransformation stage of an ETL/ELT pipeline. Dataform compiles basic SQL queries, allowing authors to focus on business logic instead of DML and DDL syntax.
+
+At PPHE, we use Dataform for all basic SQL transformations of raw data sources. Our Dataform repository is structured as follows:
+
+* definitions - the top-level directory for all .SQLX (Dataform SQL) files
+  * gold - for transformations that write to gld_ datasets in BigQuery
+    * board - for custom Board reports
+    * looker - for custom Looker (Studio) reports
+    * warehouse - for business-critical tables that combine data from many sources
+  * raw - for declarations of raw data as they exist in their source systems
+  * staging - for essential transformations of source tables, including renaming and recasting of fields, unioning of raw tables that come from different versions of the source system, and capturing deleted records
+
+The gold layer should be thought of as the cleanest, final layer, where any employee of the business may find useful data; the staging layer should be thought of as a sandbox where analysts and engineers can begin building new analyses.
+
+Any time a new data source is (E)xtracted and (L)oaded into our GCP environment, it should flow through Dataform in the following order:
+
+1. As new table declarations for all tables, at definitions/raw/{source_name}/
+2. As staging transformations, with assertions to validate data quality, for all tables, at definitions/staging/{source_name}/
+3. As new tables and/or fields incorporated into the different gold-layer destinations, at definitions/gold
+
+## Dev Environments
+
+PPHE's data models are separated into two environments, production and development. *Production* includes finalized transformations that are safe to rely on for analysis and reporting. *Development* includes transformations that have not yet been fully validated and may still be in testing.
+
+To keep these environments isolated, all new Dataform transformations automatically write to development. Only after undergoing code review and the CI/CD process are transformations promoted to production.
+
+Each Dataform developer has their own development environment to prevent collisions while working on new queries. Dev environments write to BigQuery in the following manner:
+
+* GCP Project: pphe-data-dev
+* BQ Datasets:
+  * Staging tables: stg_{source_name}_{username}
+    * e.g., stg_opera_bwyllie
+  * Gold tables: dev_{username}_gld_{destination_name}
+    * e.g., dev_bwyllie_gld_board
+
+These transformations also have a built-in `WHERE` clause to select just a small amount of data from the raw source tables.
+
+Once the transformations have been successfully reviewed and promoted to Prod, they write to BigQuery as follows:
+
+* GCP Project: pphe-data-pro
+* BQ Datasets:
+  * Sources: stg_{source_name}
+    * e.g., stg_opera
+  * Gold tables: gld_{destination_name}
+    * e.g., gld_board
+
+New releases are deployed to production on a weekly cadence to prevent excessive refreshing of large tables.
+
+# todo:
+
+* get the git repo figured out
+* create a production release with compilation override variables: executionSetting = "pro"
@@ -1 +0,0 @@
-sdADS
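To make step 1 of the flow above concrete, here is a minimal sketch of what a source declaration might look like at definitions/raw/opera/reservations.sqlx. The opera source, the dataset name, and the table name are illustrative assumptions, not actual PPHE models; steps 2 and 3 are sketched under the staging and gold layer READMEs below.

```sqlx
config {
  // Declares an existing BigQuery table so other models can reference it with ${ref()}.
  type: "declaration",
  database: "pphe-data-pro", // assumption: the GCP project holding the raw export
  schema: "raw_opera",       // assumption: one raw dataset per source system
  name: "reservations"
}
```

Declarations contain no SQL; they only register the table so staging models can reference it with `${ref("reservations")}`.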
definitions/gold/README.md (Normal file, 9 lines)

@@ -0,0 +1,9 @@
+# Gold Layer
+
+The gold layer consists of finalized datasets that are fit for consumption. Our gold layer is three-pronged:
+
+1. Board - for custom tables designed to be ingested by Board reports
+2. Looker - for custom tables designed to be ingested by Looker reports
+3. Warehouse - for a well-structured data warehouse, with tables that may be easily used for further analysis (and in Board and Looker)
+
+The gold layer has the strictest entry requirements of all the layers. Dataform models within "gold" should combine data from multiple source systems and output the data in a clean and intuitive structure. Simplicity is emphasized over completeness.
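As an illustration of those entry requirements, a warehouse-prong gold model might combine two staging sources into one intuitive table. A minimal sketch, in which every model and column name is hypothetical:

```sqlx
config {
  type: "table",
  schema: "gld_warehouse", // assumption: dataset for the warehouse prong
  description: "Reservations enriched with guest country names"
}

SELECT
  r.reservation_id,
  r.guest_profile_id,
  c.guest_country_name
FROM ${ref("stg_reservations")} AS r        -- hypothetical staging model
LEFT JOIN ${ref("stg_country_codes")} AS c  -- hypothetical second source
  ON r.guest_country_code = c.guest_country_code
```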
definitions/raw/README.md (Normal file, 5 lines)

@@ -0,0 +1,5 @@
+# Raw Layer
+
+The raw layer consists of raw data as it exists in its underlying source system. Tables should be organized into datasets based on the source system name, and on the version whenever multiple versions of the same source exist (e.g., Opera Cloud vs. Opera 5.6).
+
+Raw data should be as true to the source as possible before landing in BigQuery. In Dataform, the raw layer should consist only of source declarations.
definitions/staging/README.md (Normal file, 5 lines)

@@ -0,0 +1,5 @@
+# Staging Layer
+
+The staging layer consists of source-level data that has been transformed just enough to make it easy to work with in the gold layer. Transformations should be limited to renaming and recasting of fields, simple joins to identify deleted records, and unioning of data from different source system versions.
+
+Staging data should represent cleaned objects from raw without introducing complex business logic or dimensions from other sources.
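A minimal sketch of step 2 of the flow, covering all three permitted transformations: renaming and recasting, unioning source-system versions, and flagging deleted records. Every table and column name here is an illustrative assumption:

```sqlx
config {
  type: "view",
  schema: "stg_opera", // assumption: per-source staging dataset
  assertions: { nonNull: ["reservation_id"] } // data-quality assertion, per the README flow
}

SELECT
  CAST(r.resv_id AS STRING) AS reservation_id, -- rename and recast
  CAST(r.arrival AS DATE)   AS arrival_date,
  (d.resv_id IS NOT NULL)   AS is_deleted_flag -- simple join to capture deletions
FROM (
  SELECT resv_id, arrival FROM ${ref("reservations_cloud")} -- Opera Cloud version
  UNION ALL
  SELECT resv_id, arrival FROM ${ref("reservations_v56")}   -- Opera 5.6 version
) AS r
LEFT JOIN ${ref("deleted_reservations")} AS d
  ON r.resv_id = d.resv_id
```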
definitions/staging/board/test_model.sqlx (Normal file, 9 lines)

@@ -0,0 +1,9 @@
+config {
+  type: "view",
+  description: "Test model",
+  schema: "stg_board"
+}
+
+SELECT
+  2 AS testfield
+--${when(dataform.projectConfig.vars.executionSetting === "dev", "LIMIT 0")}
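The commented-out last line is the environment switch that the README's todo refers to: when the compilation variable executionSetting is "dev", Dataform's built-in `when()` appends the given SQL fragment, keeping dev runs cheap. Enabled, and assuming executionSetting is declared under vars in workflow_settings.yaml (sketched below after that file), the body would read:

```sqlx
SELECT
  2 AS testfield
-- In dev compilations this appends "LIMIT 0"; in pro compilations it appends nothing.
${when(dataform.projectConfig.vars.executionSetting === "dev", "LIMIT 0")}
```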
includes/docs.js (Normal file, 93 lines)

@@ -0,0 +1,93 @@
+// Universal fields
+const tenant_code = `The tenant/chain that the record belongs to`;
+const property_code = `The property that the record belongs to`;
+const export_insert_time = `Date and time the raw data record was inserted to BigQuery from Opera R&A`;
+const staging_insert_time = `Date and time the staging data record was inserted from the raw data table`;
+// Identifiers
+const reservation_id = `Within a given property and tenant, identifier for the individual reservation`;
+const reservation_product_id = `Within a given property and tenant, identifier for the individual reservation product`;
+const financial_transaction_id = `Within a given property and tenant, identifier for the individual transaction`;
+const group_id = `Within a given property and tenant, identifier for the individual business group`;
+const event_id = `Within a given property and tenant, identifier for the individual event`;
+const profile_id = `Within a given property and tenant, identifier for the individual profile`;
+const market_segment_code = `Market code`;
+const group_profile_id = `Profile ID of the group`;
+const travel_agent_profile_id = `Profile ID of the travel agent`;
+const company_profile_id = `Profile ID of the company`;
+const guest_profile_id = `Profile ID of the guest`;
+const guest_country_code = `Country code of the guest`;
+const booking_status_code = `Booking status`;
+const booking_source_code = `Booking source`;
+const block_code = `Block code`;
+const rate_code = `Rate code`;
+const transaction_code = `Transaction code`;
+const reservation_status_code = `Reservation status`;
+const room_category_code = `Room category`;
+const booked_room_category_code = `Booked room category`;
+const room_class_code = `Room class`;
+const room_type_code = `Room type`;
+const confirmation_number = `Confirmation number of the reservation`;
+// Dimensions
+const guest_country_name = `Country name of the guest`;
+const product_name = `Product/package code`;
+const product_description = `Full description of the product/package`;
+const group_description = `Full description/name of the group/block`;
+// Dates and times
+const considered_date = `Business Date that the data corresponds to`;
+// Booleans
+const is_meeting_room_flag = `Indicates whether the room is a meeting room`;
+const is_pseudo_room_flag = `Indicates whether the room is a pseudo room`;
+// Stats and metrics
+const number_of_rooms = `Number of rooms`;
+const room_nights = `Total number of nights (across all rooms) for the reservation`;
+const adults = `Number of adults`;
+const children = `Number of children`;
+const room_revenue = `Total net room revenue amount`;
+const food_revenue = `Total net food and beverage revenue amount`;
+const total_revenue = `Total net revenue amount`;
+const other_revenue = `Total of net revenue amount that does not fall under the room, food, or beverage categories`;
+
+module.exports = {
+  tenant_code,
+  property_code,
+  export_insert_time,
+  staging_insert_time,
+  reservation_id,
+  reservation_product_id,
+  financial_transaction_id,
+  group_id,
+  event_id,
+  profile_id,
+  market_segment_code,
+  group_profile_id,
+  travel_agent_profile_id,
+  company_profile_id,
+  guest_profile_id,
+  guest_country_code,
+  booking_status_code,
+  booking_source_code,
+  block_code,
+  rate_code,
+  transaction_code,
+  reservation_status_code,
+  room_category_code,
+  booked_room_category_code,
+  room_class_code,
+  room_type_code,
+  confirmation_number,
+  guest_country_name,
+  product_name,
+  product_description,
+  group_description,
+  considered_date,
+  is_meeting_room_flag,
+  is_pseudo_room_flag,
+  number_of_rooms,
+  room_nights,
+  adults,
+  children,
+  room_revenue,
+  food_revenue,
+  total_revenue,
+  other_revenue
+};
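These constants are shared field descriptions meant to be reused across models. In Dataform, anything exported from includes/docs.js is available in SQLX as docs.&lt;name&gt;; a minimal usage sketch, where the model and its column list are hypothetical:

```sqlx
config {
  type: "view",
  schema: "stg_opera",
  columns: {
    // Reuse the shared descriptions instead of retyping them per model.
    tenant_code: docs.tenant_code,
    property_code: docs.property_code,
    reservation_id: docs.reservation_id
  }
}

SELECT tenant_code, property_code, reservation_id
FROM ${ref("reservations")}
```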
includes/vars.js (Normal file, 6 lines)

@@ -0,0 +1,6 @@
+// Used to grab additional data in case of ingestion failure
+const ingestion_buffer_days = 3;
+
+module.exports = {
+  ingestion_buffer_days
+};
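The buffer is aimed at incremental models: rather than reading only rows newer than the last load, a model re-reads a trailing window so that rows delayed by an ingestion failure are still picked up. A sketch of one way to apply it, with the table and timestamp names as illustrative assumptions:

```sqlx
config {
  type: "incremental",
  schema: "stg_opera" // hypothetical staging dataset
}

SELECT *
FROM ${ref("reservations")}
${when(
  incremental(),
  // Re-read the last ingestion_buffer_days of data on every incremental run.
  `WHERE export_insert_time >= TIMESTAMP_SUB(
     (SELECT MAX(export_insert_time) FROM ${self()}),
     INTERVAL ${vars.ingestion_buffer_days} DAY)`
)}
```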
workflow_settings.yaml (Normal file, 5 lines)

@@ -0,0 +1,5 @@
+defaultProject: pphe-data-dev
+defaultLocation: EU
+defaultDataset: raw_dataform
+defaultAssertionDataset: raw_dataform
+dataformCoreVersion: 3.0.26
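test_model.sqlx reads dataform.projectConfig.vars.executionSetting, and the README's todo plans a production release that overrides it to "pro". For the dev default to compile, the variable would need to be declared here; a sketch of the addition, assuming "dev" as the default value:

```yaml
vars:
  executionSetting: dev  # assumed default; overridden to "pro" by the production release
```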