update readme

Brady Wyllie 2025-11-18 16:45:28 +01:00
parent cd5d8ed624
commit 0cd10c2c57

@@ -1,3 +1,45 @@
# test-pphe-data-pro
Testing pphe-data-pro monorepo
## Proposed monorepo structure
The proposed structure is built with individual data pipelines in mind.
* `pphe-data-pro` (repo root)
  * `data-extract-load/`
    * `pipeline-{source-name}/`
    * `common/`
  * `data-transform/`
    * `dataform-project-{project-name}/`
  * `infra/`
    * `terraform/`
    * `ci-cd/`
  * `README.md`
  * `.gitignore`
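
As a rough illustration of the layout listed above, the following sketch creates the directory skeleton with Python's standard library. It assumes that `pipeline-{source-name}/` and `common/` sit under `data-extract-load/`, that `dataform-project-{project-name}/` sits under `data-transform/`, and that `terraform/` and `ci-cd/` sit under `infra/`. The function name `scaffold_monorepo` and its parameters are hypothetical and not part of the repo.

```python
from pathlib import Path


def scaffold_monorepo(
    root: Path, source_names: list[str], project_names: list[str]
) -> list[Path]:
    """Create the proposed directory skeleton under `root`.

    Hypothetical helper: the nesting below is an assumption based on the
    structure described in this README. Returns the directories it ensured.
    """
    # Fixed directories that exist regardless of the pipelines defined.
    dirs = [
        root / "data-extract-load" / "common",
        root / "infra" / "terraform",
        root / "infra" / "ci-cd",
    ]
    # One EL pipeline directory per source system.
    dirs += [root / "data-extract-load" / f"pipeline-{name}" for name in source_names]
    # One Dataform project directory per project.
    dirs += [
        root / "data-transform" / f"dataform-project-{name}" for name in project_names
    ]

    for directory in dirs:
        directory.mkdir(parents=True, exist_ok=True)

    # Top-level files from the list; created empty if they do not exist yet.
    for filename in ("README.md", ".gitignore"):
        (root / filename).touch(exist_ok=True)

    return dirs
```

Calling the function twice on the same root is harmless, since existing directories and files are left in place.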
### EL Pipelines
Each data pipeline should be fully independent of the others, except where it uses shared (common) resources. Any GCP admin should be able to reconstruct one of our data pipelines end-to-end using Terraform alone. Consider the following EL pipeline directory structure for a single source system:
* `pipeline-{source-name}/` (pipeline dir root)
  * `terraform/` (all infra for the pipeline gets built here)
    * `main.tf`
    * `variables.tf`
    * `outputs.tf`
  * `cloud-run-service-{service-name}/` (if needed, a cloud run service goes here)
    * `src/`
    * `Dockerfile`
    * `requirements.txt`
  * `cloud-run-function-{function-name}/` (if needed, a cloud run function goes here)
    * `main.py`
    * `requirements.txt`
  * `README.md` (pipeline documentation goes here)
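
A layout like this could be checked automatically, for example in CI. The sketch below reports which expected files are missing from a pipeline directory. It assumes that `main.tf`, `variables.tf`, and `outputs.tf` live inside `terraform/`, that `src/`, `Dockerfile`, and `requirements.txt` live inside each `cloud-run-service-*` directory, and that `main.py` and `requirements.txt` live inside each `cloud-run-function-*` directory. The name `missing_pipeline_files` is hypothetical.

```python
from pathlib import Path

# Files every pipeline directory is expected to contain (assumed nesting).
REQUIRED_FILES = (
    "terraform/main.tf",
    "terraform/variables.tf",
    "terraform/outputs.tf",
    "README.md",
)


def missing_pipeline_files(pipeline_dir: Path) -> list[str]:
    """Return relative paths that are expected but absent in `pipeline_dir`."""
    missing = [rel for rel in REQUIRED_FILES if not (pipeline_dir / rel).is_file()]

    # Cloud Run services are optional, but each one present must be complete.
    for service in sorted(pipeline_dir.glob("cloud-run-service-*")):
        if not service.is_dir():
            continue
        if not (service / "src").is_dir():
            missing.append(f"{service.name}/src/")
        for filename in ("Dockerfile", "requirements.txt"):
            if not (service / filename).is_file():
                missing.append(f"{service.name}/{filename}")

    # Cloud Run functions are optional as well.
    for function in sorted(pipeline_dir.glob("cloud-run-function-*")):
        if not function.is_dir():
            continue
        for filename in ("main.py", "requirements.txt"):
            if not (function / filename).is_file():
                missing.append(f"{function.name}/{filename}")

    return missing
```

An empty result means the directory matches the layout. The check covers file presence only and says nothing about whether the Terraform configuration is valid.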
### Transform Pipelines
Transformation pipelines combine data from multiple data sources and should therefore be pooled within a single directory. This directory houses a Dataform project.
### Shared Infrastructure
The `infra/` directory houses all shared infrastructure, including Terraform configuration and CI/CD pipelines.