In Redshift/Postgres, you can think of database schemas as folders that you use to organize database objects, such as tables and views. Spending some time up front determining how to organize your database objects into schemas will pay dividends later as the number of database objects and users in your database grows.
There are a few different approaches that are popular for organizing database schemas. Sometimes people use a single approach or they may combine several approaches together into a hybrid strategy.
Before we review these, it is important to note that database permissions can be assigned at a schema level as well as an object level. Organizing database objects into schemas such that you can assign permissions to your users at the schema level will make setting up and managing your database permissions much easier. Therefore, regardless of the strategy you choose, you should keep this in mind.
By Environment: Production vs Development
It is important to keep your production data separate from ongoing development or analysis of new data sets. This minimizes the risk of someone accidentally making changes to a production table that you may depend on for daily reporting or some other production process in your organization.
In order to do this, you should create one or more schemas to be used solely for development and one-time analysis. Any one-off tables should be created in these schemas.
Provisioning Options:
- 1 development schema for everyone
- 1 development schema for each team
- 1 development schema for each developer
Database Permissions: This approach allows you to limit the permissions that users have to production database schemas so that they don’t even have the access required to accidentally make changes to production tables.
By ETL Phase
Data is often loaded, transformed and processed in various stages until it is finally aggregated and consumed by reports and/or dashboards. Since each data set goes through similar processing, oftentimes with various teams or team members owning a stage, organizing your tables by ETL stage can be a useful approach.
Common Phases:
- Landing/Raw - used for loading raw data, typically as a generic string data type, prior to detecting data types or splitting raw data into multiple columns
- Staging - used for staging data, typically with the appropriate data types, prior to applying any transformation rules
- Final Raw - used for finalized raw data after applying any transformation rules
- Reporting - used for aggregate rollups of the finalized raw data, typically for reports and/or dashboards
Naming Conventions: It is common to apply specific naming conventions to each stage so that you can easily distinguish related tables across different stages/schemas. For example, imagine you are loading some donor data.
- Schema name: "landing" / "raw"
- Table suffix: "_raw"
- Table name: "donor_raw"
- Schema name: “staging”
- Table suffix: "_stg"
- Table name: "donor_stg"
- Schema name: "final"
- Table suffix: none
- Table name: "donor"
- Schema name: “reporting”
- Table suffix: "_rpt" / "_agg" / "_monthly"
- Table name: "donor_rpt"
Database Permissions: Since specific teams or team members often focus on specific stages, this makes it easier to grant permissions while limiting the risk of each team accidentally impacting data that they do not maintain. For example,
- data engineering team - responsible for loading and preparing data
- landing, staging, final schemas - read and write permission
- data analyst team - responsible for reporting
- final schema - read permission
- reporting schema - read and write permission
- reporting users - consume reports to make decisions
- reporting schema - read permission
By Functional Area
The data loaded into a database often falls into different functional areas or categories (e.g., HR, sales, CRM, etc.). The data in these categories tends to come from similar systems and is maintained and understood by similar teams or team members. This makes it another great way to organize your data into schemas. Since these schema names are more descriptive, it also helps improve data discoverability.
Hybrid Example
This uses a hybrid approach of all three of the above.
- staging
- If there are enough tables then you could break this out into hr_staging, sales_staging, etc.
- hr
- sales
- donors
- accounting
- reporting
- Similar to staging above you could break this out further depending on the amount of reporting
- <username> - development schemas
- Creating one per user, if possible, makes them easier to clean up as users transition between different teams since you don’t need to try and figure out who created which table.
Comments
0 comments
Please sign in to leave a comment.