From 0cd10c2c575f56f6e8e6c89267a9df64478cff81 Mon Sep 17 00:00:00 2001
From: Brady Wyllie
Date: Tue, 18 Nov 2025 16:45:28 +0100
Subject: [PATCH] update readme

---
 README.md | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4caa112..4b40258 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,45 @@
 # test-pphe-data-pro
 
-Testing pphe-data-pro monorepo
\ No newline at end of file
+Testing pphe-data-pro monorepo
+
+## Proposed monorepo structure
+
+The proposed structure is built with individual data pipelines in mind.
+
+* `pphe-data-pro` (repo root)
+    * `data-extract-load/`
+        * `pipeline-{source-name}/`
+        * `common/`
+    * `data-transform/`
+        * `dataform-project-{project-name}/`
+    * `infra/`
+        * `terraform/`
+        * `ci-cd/`
+    * `README.md`
+    * `.gitignore`
+
+### EL Pipelines
+
+The main idea is that each data pipeline should be fully independent from the others, except when shared (common) resources are used. Any GCP admin should be able to reconstruct one of our data pipelines end-to-end using Terraform alone. Consider the following EL pipeline directory structure for a single source system:
+
+* `pipeline-{source-name}/` (pipeline dir root)
+    * `terraform/` (all infra for the pipeline gets built here)
+        * `main.tf`
+        * `variables.tf`
+        * `outputs.tf`
+    * `cloud-run-service-{service-name}/` (if needed, a cloud run service goes here)
+        * `src/`
+        * `Dockerfile`
+        * `requirements.txt`
+    * `cloud-run-function-{function-name}/` (if needed, a cloud run function goes here)
+        * `main.py`
+        * `requirements.txt`
+    * `README.md` (pipeline documentation goes here)
+
+### Transform Pipelines
+
+Transformation pipelines combine data from multiple data sources and therefore should be pooled within a single directory. This directory houses a dataform project.
+
+### Shared Infrastructure
+
+The `infra` directory houses all shared infrastructure, including Terraform scripts as well as CI/CD pipelines.
\ No newline at end of file
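
The EL pipeline layout proposed in the README change above could be scaffolded with a short script. The following is a minimal sketch, assuming Python 3 with only the standard library. The function name `scaffold_pipeline`, the `PIPELINE_FILES` list, and the `.gitkeep` placeholder (added because git does not track empty directories) are hypothetical and not part of the patch.

```python
from pathlib import Path

# Files proposed for each pipeline, relative to the pipeline dir root.
# "{service}" and "{function}" are filled in by scaffold_pipeline().
# ".gitkeep" is an assumption so that the otherwise empty src/ dir is tracked.
PIPELINE_FILES = [
    "terraform/main.tf",
    "terraform/variables.tf",
    "terraform/outputs.tf",
    "cloud-run-service-{service}/src/.gitkeep",
    "cloud-run-service-{service}/Dockerfile",
    "cloud-run-service-{service}/requirements.txt",
    "cloud-run-function-{function}/main.py",
    "cloud-run-function-{function}/requirements.txt",
    "README.md",
]


def scaffold_pipeline(repo_root, source, service=None, function=None):
    """Create empty placeholder files for one EL pipeline and return its root."""
    pipeline_root = Path(repo_root) / "data-extract-load" / f"pipeline-{source}"
    for template in PIPELINE_FILES:
        # The Cloud Run service and function are optional ("if needed"),
        # so skip their files when no name was supplied.
        if "{service}" in template and service is None:
            continue
        if "{function}" in template and function is None:
            continue
        path = pipeline_root / template.format(service=service, function=function)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)
    return pipeline_root
```

For example, `scaffold_pipeline(".", "example-source", function="ingest")` would create `data-extract-load/pipeline-example-source/` with the `terraform/` files, a `cloud-run-function-ingest/` directory, and a `README.md`, and would skip the Cloud Run service directory. The created files are empty placeholders; the actual Terraform and application code still has to be written per pipeline.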