- Part 1: What is Data Modeling?!
- Part 2: Dimensional Modeling Fundamentals
- Part 3: All we need is a Data Lakehouse?!
In the previous part of this series, you learned about the lakehouse concept, one of the emerging trends in the data world. But a lakehouse is just a concept for storing data. A data warehouse is likewise a storage concept that can be implemented with different data modeling approaches, such as relational or dimensional, and the same applies to the data lakehouse. Therefore, we need to examine how to design the data model for the lakehouse architecture.
The most common pattern for modeling the data in the lakehouse is called a medallion. I love this name – it’s really easy to remember. But, why medallion? Tag along and you’ll soon find out why.
As with the lakehouse concept, credit for pioneering the medallion approach goes to Databricks.
Simply put, the medallion architecture assumes that your data within the lakehouse is organized in three different layers: bronze, silver, and gold. That’s why it’s called a medallion! You may also hear other names for these layers, such as Raw, Validated, Enriched, which I personally prefer, or Raw, Validated, Curated. Essentially, the idea is the same: to have different layers of data in the lakehouse that are of different quality and serve different purposes.
Let’s quickly examine all three layers…
The bronze layer is where we land the data from external sources in its original, raw state. Data is ingested “as-is”, with only metadata added on top. The purpose of the bronze layer is to serve as a historical archive of source data and to enable quick data reprocessing when necessary, without the need to connect to a myriad of external source systems again. It’s important to keep in mind that the bronze layer contains unvalidated data. With regard to storage format, the bronze layer usually stores the data in one of the efficient columnar formats we examined in the previous articles: Parquet or Delta.
Bronze Layer Checklist
- Land data from external sources in its original state
- Serve as a repository of the historical archive of source data
- Contains unvalidated data
- Stores the data in Parquet/Delta format
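To make the checklist above more tangible, here is a minimal sketch of a bronze landing step in plain Python. It keeps everything in memory instead of writing Parquet or Delta files, and the names `land_in_bronze`, `crm_extract`, `_source`, and `_ingested_at` are my own assumptions for illustration, not part of any library or of the Databricks tooling.

```python
from datetime import datetime, timezone


def land_in_bronze(raw_records, source_name):
    """Wrap each source record unchanged and attach ingestion metadata."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "payload": dict(record),      # untouched copy of the source record
            "_source": source_name,       # hypothetical metadata column
            "_ingested_at": ingested_at,  # hypothetical metadata column
        }
        for record in raw_records
    ]


# Hypothetical extract from a source system. The messy values are kept
# on purpose: bronze does not validate or clean anything.
crm_extract = [
    {"customer_id": "42", "name": "  Ada Lovelace ", "country": "uk"},
    {"customer_id": "", "name": "Unknown", "country": None},
]

bronze_customers = land_in_bronze(crm_extract, source_name="crm")
```

Notice that the second record, with its empty `customer_id`, still lands in bronze. Because the raw payload is preserved, you could reprocess it later without going back to the source system.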
In the silver layer, the data from the bronze layer is conformed and cleaned, so that all the key business entities, concepts, and transactions are available in the form of an “enterprise view” for ad-hoc analysis, machine learning workloads, etc. The data is enriched and validated, and it can be trusted downstream for further analytic workloads. From a data modeling perspective, the silver layer contains tables that are closer to third normal form (3NF). In the silver layer, data is again stored in Delta or Parquet format.
Silver Layer Checklist
- Conformed and cleaned data from the bronze layer
- Supports ad-hoc analysis and machine learning workloads
- Contains enriched and validated data
- The data model is normalized toward third normal form (3NF)
- Stores the data in Delta/Parquet format
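The silver checklist above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function `to_silver`, the sample rows, and the validation rules (a numeric `customer_id`, trimmed names, upper-cased country codes) are hypothetical, and a real lakehouse would typically run this kind of logic in an engine such as Spark against Delta tables.

```python
def to_silver(bronze_rows):
    """Validate, clean, and deduplicate bronze rows.

    Returns a tuple (clean_rows, rejected_rows).
    """
    clean, rejected, seen_ids = [], [], set()
    for row in bronze_rows:
        raw_id = (row.get("customer_id") or "").strip()
        if not raw_id.isdigit():
            rejected.append(row)  # fails validation, kept aside for review
            continue
        customer_id = int(raw_id)
        if customer_id in seen_ids:
            continue              # duplicate of an already accepted customer
        seen_ids.add(customer_id)
        clean.append({
            "customer_id": customer_id,
            "name": (row.get("name") or "").strip(),
            "country": (row.get("country") or "unknown").strip().upper(),
        })
    return clean, rejected


# Hypothetical bronze rows: one duplicate and one invalid record.
bronze_rows = [
    {"customer_id": "42", "name": "  Ada Lovelace ", "country": "uk"},
    {"customer_id": "42", "name": "Ada Lovelace", "country": "UK"},
    {"customer_id": "", "name": "Unknown", "country": None},
]

silver_customers, rejected_rows = to_silver(bronze_rows)
```

Keeping the rejected rows instead of silently dropping them is a design choice I find handy, since it lets you inspect what failed validation and why.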
Finally, the gold layer represents the “icing on the cake” – data is structured and organized to support specific project requirements. As this is the final stage in the process, data is additionally refined and cleaned. In the gold layer, you’d also apply various complex business rules and logic, use-case-specific calculations, and so on. From the data modeling perspective, the gold layer is usually implemented through a Kimball-style star schema, where the data is denormalized to support business reporting requirements. In terms of storage, similar to the previous two layers, data is stored in an efficient format, preferably Delta, or alternatively Parquet.
Gold Layer Checklist
- Structured and organized data for specific project requirements
- Data additionally cleaned and refined
- Complex business logic and specific calculations
- The data model is a Kimball-style star schema
- Stores the data preferably in Delta, alternatively in Parquet format
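To illustrate the gold checklist above, here is a small sketch that turns two normalized silver tables into a star schema with one dimension and one fact table, and then answers a reporting question. All names (`build_star_schema`, `revenue_by_country`, `dim_customer`, `fact_sales`, `customer_key`) and the sample data are assumptions I made up for this example, and the in-memory lists stand in for Delta or Parquet tables.

```python
def build_star_schema(silver_customers, silver_orders):
    """Build a customer dimension and a sales fact table from silver data."""
    dim_customer = []
    key_by_customer_id = {}
    for surrogate_key, customer in enumerate(silver_customers, start=1):
        key_by_customer_id[customer["customer_id"]] = surrogate_key
        dim_customer.append({
            "customer_key": surrogate_key,          # surrogate key
            "customer_id": customer["customer_id"],  # natural key from source
            "customer_name": customer["name"],
            "country": customer["country"],
        })
    fact_sales = [
        {
            "customer_key": key_by_customer_id[order["customer_id"]],
            "order_date": order["order_date"],
            "amount": order["amount"],
        }
        for order in silver_orders
    ]
    return dim_customer, fact_sales


def revenue_by_country(dim_customer, fact_sales):
    """Use-case-specific calculation: total sales amount per country."""
    country_of = {d["customer_key"]: d["country"] for d in dim_customer}
    totals = {}
    for fact in fact_sales:
        country = country_of[fact["customer_key"]]
        totals[country] = totals.get(country, 0) + fact["amount"]
    return totals


# Hypothetical silver tables.
silver_customers = [
    {"customer_id": 42, "name": "Ada Lovelace", "country": "UK"},
    {"customer_id": 7, "name": "Grace Hopper", "country": "US"},
]
silver_orders = [
    {"order_id": 1, "customer_id": 42, "order_date": "2023-10-01", "amount": 120},
    {"order_id": 2, "customer_id": 7, "order_date": "2023-10-02", "amount": 80},
    {"order_id": 3, "customer_id": 42, "order_date": "2023-10-03", "amount": 30},
]

dim_customer, fact_sales = build_star_schema(silver_customers, silver_orders)
report = revenue_by_country(dim_customer, fact_sales)
```

With this sample data, `report` comes out as `{"UK": 150, "US": 80}`. The fact table carries only keys and measures, while the descriptive attributes live in the dimension, which is the core idea of a star schema.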
To conclude, if you are planning to implement a data lakehouse architecture, you should leverage the medallion design pattern to logically organize the data and enable incremental and continuous improvement of data quality.
Thanks for reading!
Last Updated on October 27, 2023 by Nikola