This repo provides a step-by-step approach to building a modern data warehouse using PostgreSQL. It covers the ETL (Extract, Transform, Load) process, data modeling, exploratory data analysis (EDA), and advanced data analysis techniques.

🌟 Modern Data Warehouse & Analytics End-to-End Project

👋 Hello, Data Points!

  • My name is Rudra Prasad Bhuyan! 😄
  • I am a Kaggle Expert and a Google Certified Data Analyst.

Welcome to this Modern Data Warehouse & Analytics end-to-end project using PostgreSQL! 🎯

🪄 Modern Data Warehouse & Analytics End-to-End Project Overview

This repository provides a step-by-step approach to building a scalable, efficient, and analytics-ready data warehouse. It covers:
✅ ETL Pipelines (Extract, Transform, Load)
✅ Data Modeling (Star Schema)
✅ Exploratory Data Analysis (EDA)
✅ SQL-Based Reporting & Analytics
✅ Advanced Data Analysis & Reporting
📝 Project Notion Page


πŸ—οΈ Data Architecture Overview

The project follows the Medallion Architecture with three layers:

πŸ“Œ Bronze Layer (Raw Data) – Stores data directly from the source (CSV files).
πŸ“Œ Silver Layer (Cleansed & Transformed Data) – Data is cleaned, structured, and normalized.
πŸ“Œ Gold Layer (Business-Ready Data) – Optimized for analytics and reporting using a star schema.
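In PostgreSQL, the three layers map naturally onto schemas inside one warehouse database. A minimal sketch (schema names follow the layers above; the bronze table and its columns are illustrative, not copied from this repo's scripts):

```sql
-- One schema per medallion layer, all in a single warehouse database.
CREATE SCHEMA IF NOT EXISTS bronze;  -- raw data, loaded as-is from CSV files
CREATE SCHEMA IF NOT EXISTS silver;  -- cleaned, typed, normalized data
CREATE SCHEMA IF NOT EXISTS gold;    -- business-ready star schema

-- Bronze tables mirror the source files; loose types defer validation
-- to the silver layer (illustrative table and column names).
CREATE TABLE IF NOT EXISTS bronze.crm_sales_details (
    order_number TEXT,
    product_key  TEXT,
    customer_id  TEXT,
    order_date   TEXT,  -- kept as text; parsed into a DATE in the silver layer
    sales_amount TEXT
);
```

Keeping bronze columns as TEXT is a common design choice: the load never fails on malformed values, and all cleansing logic lives in one place (the silver transformation).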

🌐 Architecture Diagram: `docs/warehouse/data_architecture.drawio`


📖 Project Overview

🔍 Key Features & Learnings:

🔹 SQL Development – Writing optimized SQL queries for analytics.
🔹 Data Engineering – Designing ETL pipelines for seamless data movement.
🔹 Data Architecture – Structuring a robust and scalable data warehouse.
🔹 ETL Pipeline Development – Extracting, transforming, and loading data efficiently.
🔹 Data Modeling – Implementing fact and dimension tables.
🔹 Data Analytics – Running advanced analytical queries for insights.
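As a taste of what "advanced analytical queries" means in practice, the sketch below uses a window function for a month-over-month revenue comparison. The table and column names (`gold.fact_sales`, `order_date`, `sales_amount`) are assumptions for illustration, not guaranteed to match this repo's gold layer:

```sql
-- Month-over-month revenue and its change, via the LAG() window function.
SELECT
    DATE_TRUNC('month', order_date) AS order_month,
    SUM(sales_amount)               AS revenue,
    SUM(sales_amount)
        - LAG(SUM(sales_amount)) OVER (ORDER BY DATE_TRUNC('month', order_date))
                                    AS revenue_change
FROM gold.fact_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY order_month;
```

PostgreSQL evaluates the window function after grouping, which is why `LAG()` can be applied directly over the monthly `SUM(sales_amount)`.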

πŸ› οΈ Tech Stack:

  • Database: PostgreSQL
  • ETL Processing: SQL, Python (optional)
  • Data Visualization: Power BI / Tableau (optional)
  • Documentation & Diagramming: Draw.io, Notion

📂 Repository Structure


data-warehouse-project/
├── datasets/             # Raw data from ERP and CRM systems.
│
├── docs/                 # Project documentation, architecture diagrams, and outputs.
│   ├── bronze/
│   │   ├── data_flow_bronze.drawio   # Data flow diagram: Source -> Bronze (Draw.io).
│   │   ├── bronze_data_schema.md     # Schema of the bronze layer tables.
│   │   └── bronze_output_examples/   # Examples of the data after bronze layer processing.
│   ├── silver/
│   │   ├── data_cleaning_output/     # Examples of data after cleaning.
│   │   ├── data_flow_silver.drawio   # Data flow diagram: Bronze -> Silver (Draw.io).
│   │   ├── Data_Integration.drawio   # Data integration diagram (Draw.io).
│   │   └── silver_data_schema.md     # Schema of the silver layer tables.
│   ├── gold/
│   │   ├── output/               # Examples of the data after gold layer processing.
│   │   ├── data_catalog.md       # Data dictionary for the gold layer, including field descriptions.
│   │   ├── data_flow_gold.drawio # Data flow diagram: Silver -> Gold (Draw.io).
│   │   ├── data_models.drawio    # Star schema diagram (Draw.io).
│   │   └── gold_data_schema.md   # Schema of the gold layer tables.
│   └── warehouse/
│       ├── naming_conventions.md    # Naming conventions for tables, columns, etc.
│       ├── data_architecture.drawio # Overall data warehouse architecture diagram (Draw.io).
│       └── etl.drawio               # ETL process diagram, showcasing techniques and methods (Draw.io).
│
├── scripts/              # SQL scripts for ETL and transformations.
│   ├── init_database.sql # Script to create the database and schemas.
│   ├── bronze/
│   │   └── load_raw_data.sql        # Loads data from the 'datasets' directory into the bronze layer.
│   ├── silver/
│   │   └── transform_clean_data.sql # Cleans and transforms the bronze-layer data.
│   └── gold/
│       ├── create_analytical_views.sql # Creates views for analysis in the gold layer.
│       └── populate_dimensions.sql     # Populates dimension tables.
│
├── tests/                # Test scripts and quality-control files (e.g., data quality checks).
│   └── data_quality_checks.sql # SQL scripts for data quality checks.
│
├── report/               # Analysis scripts and reports.
│   ├── 1_gold_layer_datasets/  # Datasets used for reporting and analysis.
│   ├── 2_eda_scripts/          # Exploratory Data Analysis (EDA) scripts.
│   │   └── basic_eda.ipynb     # Jupyter notebook containing basic EDA.
│   ├── 3_advanced_eda/         # Advanced EDA scripts and analyses.
│   │   └── advanced_eda.ipynb  # Jupyter notebook containing advanced EDA.
│   ├── output/                 # Output from the analysis (e.g., charts, tables).
│   ├── 12_report_customers.sql # SQL script for the customer report.
│   └── 13_report_products.sql  # SQL script for the product report.
│
├── README.md             # Project overview, instructions, and report summaries.
├── LICENSE               # License information.
└── requirements.txt      # Project dependencies (e.g., pgsql libraries).

🌊 Data Flow

Layer-by-layer data flow diagrams (`data_flow_bronze.drawio`, `data_flow_silver.drawio`, `data_flow_gold.drawio`) live in the `docs/` folder.

🚀 Project Requirements

👨‍💻 Data Engineering: Building the Data Warehouse

Goal: Develop a PostgreSQL-based data warehouse consolidating sales data for analytical reporting.

✔️ Data Sources: Import from ERP & CRM (CSV files)
✔️ Data Quality: Cleaning & handling missing values
✔️ Integration: Merging datasets into a single analytical model
✔️ Data Modeling: Implementing a star schema (fact & dimension tables)
✔️ Documentation: Clear metadata & model descriptions

📊 BI: Analytics & Reporting

📌 Key Business Insights:
🔸 Customer Behavior Analysis – Understanding buying patterns
🔸 Product Performance Metrics – Evaluating top-performing items
🔸 Sales Trend Analysis – Identifying revenue patterns

Outcome: 📈 Actionable reports for data-driven business decisions!


📰 Report - Data Analysis and Business Insights

This section summarizes the data analysis process and the resulting reports, providing valuable business insights.

🎏 Data Exploration and Analysis

The analysis followed a structured approach, covering various aspects of the data:

  1. Database Exploration: Understanding the structure and relationships within the database.
  2. Dimensions Exploration: Analyzing the characteristics of the dimension tables (customers, products).
  3. Date Range Exploration: Identifying the time period covered by the data.
  4. Measures Exploration: Examining key metrics and their distributions.
  5. Magnitude Exploration: Understanding the scale of different measures.
  6. Ranking Analysis: Identifying top performers (e.g., customers, products).
  7. Change Over Time Analysis: Tracking trends and patterns over time.
  8. Cumulative Analysis: Examining the accumulated values of metrics.
  9. Performance Analysis: Evaluating the performance of different aspects of the business.
  10. Data Segmentation: Grouping data into meaningful segments for targeted analysis.
  11. Part-to-Whole Analysis: Understanding the contribution of different parts to the overall picture.

The EDA was conducted using SQL queries; the results are stored in the `report/output/` directory.
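For example, the ranking step (6) can be answered with an aggregate plus a window function. The names below are illustrative rather than taken from the repo's scripts:

```sql
-- Top 5 products by total revenue (ranking analysis).
SELECT
    p.product_name,
    SUM(f.sales_amount) AS total_revenue,
    RANK() OVER (ORDER BY SUM(f.sales_amount) DESC) AS revenue_rank
FROM gold.fact_sales AS f
JOIN gold.dim_products AS p ON p.product_key = f.product_key
GROUP BY p.product_name
ORDER BY revenue_rank
LIMIT 5;
```

Using `RANK()` instead of a plain `ORDER BY ... LIMIT` keeps tied products on the same rank, which matters when the top-N cutoff falls on a tie.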


πŸ› οΈ Setup & Installation Guide

πŸ”Ή Prerequisites:

  • Install PostgreSQL β†’ Download PostgreSQL
  • Clone this repository:
    git clone https://github.com/Rudra-G-23/SQL-Data-Warehouse-Project.git
  • Load sample datasets from the /datasets/ folder.

🔹 Running SQL Scripts:

1️⃣ Initialize the database:

\i scripts/init_database.sql

2️⃣ Run the ETL scripts in order (psql's \i takes a single file, not a directory):

\i scripts/bronze/load_raw_data.sql         -- load raw data
\i scripts/silver/transform_clean_data.sql  -- clean & transform
\i scripts/gold/populate_dimensions.sql     -- build dimensions
\i scripts/gold/create_analytical_views.sql -- final analytical model
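Inside a bronze load script, raw CSVs are typically bulk-loaded with psql's client-side `\copy` (or server-side `COPY`). A sketch assuming a hypothetical file path and table name, not taken from this repo:

```sql
-- Full reload: clear the bronze table, then bulk-load the raw CSV.
-- Path and table name are illustrative.
TRUNCATE TABLE bronze.crm_sales_details;

\copy bronze.crm_sales_details FROM 'datasets/sales_details.csv' WITH (FORMAT csv, HEADER true)
```

`\copy` reads the file from the client machine, so it works even when the PostgreSQL server has no access to the repo's `datasets/` folder; server-side `COPY` would require the file to live on the database host.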

3️⃣ Start Analysis: Query the tables to generate insights!


🔗 Useful Links & Resources

📌 Project Assets:


📢 Connect & Collaborate!

💡 Want to contribute? Fork this repo and submit a pull request!
📩 Got questions? Open an issue or reach out to me!

🙏 Thank You

A special thank-you to my instructor, Baraa Khatib Salkini, IT Project Manager | Lead Big Data, Data Lakehouse and BI at Mercedes-Benz AG. I learned a great deal from him.

📌 Follow me on:

LinkedIn
GitHub
Kaggle

📧 Email me at: rudraprasadbhuyan000@gmail.com
