Hello, Data Points!
- My name is Rudra Prasad Bhuyan!
- I am a Kaggle expert and a Google Certified Data Analyst.
Welcome to this end-to-end Modern Data Warehouse & Analytics project using PostgreSQL!
This repository provides a step-by-step approach to building a scalable, efficient, and analytics-ready data warehouse. It covers:
- ETL Pipelines (Extract, Transform, Load)
- Data Modeling (Star Schema)
- Exploratory Data Analysis (EDA)
- SQL-based Reporting & Analytics
- Advanced Data Analysis & Reporting
Project Notion Page
The project follows the Medallion Architecture with three layers:
- Bronze Layer (Raw Data): stores data directly from the source (CSV files).
- Silver Layer (Cleansed & Transformed Data): data is cleaned, structured, and normalized.
- Gold Layer (Business-Ready Data): optimized for analytics and reporting using a star schema.
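As a rough sketch of how these layers map onto PostgreSQL, each layer can be given its own schema. The database name below is an assumption for illustration; the project's actual setup is defined in scripts/init_database.sql.

```sql
-- Illustrative only: one PostgreSQL schema per medallion layer.
-- The real definitions live in scripts/init_database.sql.
CREATE DATABASE datawarehouse;       -- hypothetical name; run once, then connect to it

CREATE SCHEMA IF NOT EXISTS bronze;  -- raw data loaded as-is from the CSV sources
CREATE SCHEMA IF NOT EXISTS silver;  -- cleaned, standardized, deduplicated data
CREATE SCHEMA IF NOT EXISTS gold;    -- business-ready star schema (facts and dimensions)
```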
- SQL Development: writing optimized SQL queries for analytics.
- Data Engineering: designing ETL pipelines for seamless data movement.
- Data Architecture: structuring a robust and scalable data warehouse.
- ETL Pipeline Development: extracting, transforming, and loading data efficiently.
- Data Modeling: implementing fact and dimension tables (see the sketch after this list).
- Data Analytics: running advanced analytical queries for insights.
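To make the fact and dimension idea concrete, here is a minimal, hypothetical star-schema pair; all table and column names below are invented for illustration, while the project's real gold-layer objects are documented in docs/gold/.

```sql
-- Hypothetical star-schema sketch: one dimension table and one fact table.
CREATE TABLE gold.dim_customers (
    customer_key  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    customer_id   TEXT NOT NULL,   -- business key from the CRM source
    customer_name TEXT,
    country       TEXT
);

CREATE TABLE gold.fact_sales (
    order_id      TEXT NOT NULL,
    customer_key  BIGINT REFERENCES gold.dim_customers (customer_key),
    product_key   BIGINT,          -- would reference a product dimension
    order_date    DATE,
    quantity      INTEGER,
    sales_amount  NUMERIC(12, 2)
);
```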
- Database: PostgreSQL
- ETL Processing: SQL, Python (optional)
- Data Visualization: Power BI / Tableau (optional)
- Documentation & Diagramming: Draw.io, Notion
```
data-warehouse-project/
├── datasets/                           # Raw data from ERP and CRM systems.
│
├── docs/                               # Project documentation, architecture diagrams, and outputs.
│   ├── bronze/
│   │   ├── data_flow_bronze.drawio     # Data flow diagram: Source -> Bronze (Draw.io).
│   │   ├── bronze_data_schema.md       # Schema of the bronze layer tables.
│   │   └── bronze_output_examples/     # Examples of the data after bronze layer processing.
│   ├── silver/
│   │   ├── data_cleaning_output/       # Examples of data after cleaning.
│   │   ├── data_flow_silver.drawio     # Data flow diagram: Bronze -> Silver (Draw.io).
│   │   ├── Data_Integration.drawio     # Data integration diagram (Draw.io).
│   │   └── silver_data_schema.md       # Schema of the silver layer tables.
│   ├── gold/
│   │   ├── output/                     # Examples of the data after gold layer processing.
│   │   ├── data_catalog.md             # Data dictionary for the Gold layer, including field descriptions.
│   │   ├── data_flow_gold.drawio       # Data flow diagram: Silver -> Gold (Draw.io).
│   │   ├── data_models.drawio          # Star schema diagram (Draw.io).
│   │   └── gold_data_schema.md         # Schema of the gold layer tables.
│   └── warehouse/
│       ├── naming_conventions.md       # Naming conventions for tables, columns, etc.
│       ├── data_architecture.drawio    # Overall data warehouse architecture diagram (Draw.io).
│       └── etl.drawio                  # ETL process diagram showcasing techniques and methods (Draw.io).
│
├── scripts/                            # SQL scripts for ETL and transformations.
│   ├── bronze/
│   │   └── load_raw_data.sql           # Loads data from the 'datasets' directory into the bronze layer.
│   ├── silver/
│   │   └── transform_clean_data.sql    # Cleans and transforms the data from the bronze layer.
│   ├── gold/
│   │   ├── create_analytical_views.sql # Creates views for analysis in the gold layer.
│   │   └── populate_dimensions.sql     # Populates dimension tables.
│   └── init_database.sql               # Creates the database and schemas.
│
├── tests/                              # Test scripts and quality control files.
│   └── data_quality_checks.sql         # SQL scripts for data quality checks.
│
├── report/                             # Analysis scripts and reports.
│   ├── 1_gold_layer_datasets/          # Datasets used for reporting and analysis.
│   ├── 2_eda_scripts/                  # Exploratory Data Analysis (EDA) scripts.
│   │   └── basic_eda.ipynb             # Jupyter notebook containing basic EDA.
│   ├── 3_advanced_eda/                 # Advanced EDA scripts and analyses.
│   │   └── advanced_eda.ipynb          # Jupyter notebook containing advanced EDA.
│   ├── output/                         # Output from the analysis (e.g., charts, tables).
│   ├── 12_report_customers.sql         # SQL script for the customer report.
│   └── 13_report_products.sql          # SQL script for the product report.
│
├── README.md                           # Project overview, instructions, and report summaries.
├── LICENSE                             # License information.
└── requirements.txt                    # Project dependencies (e.g., pgsql libraries).
```
Goal: Develop a PostgreSQL-based data warehouse consolidating sales data for analytical reporting.
- Data Sources: import from ERP & CRM (CSV files)
- Data Quality: cleaning and handling missing values (see the sketch after this list)
- Integration: merging datasets into a single analytical model
- Data Modeling: implementing a star schema (fact and dimension tables)
- Documentation: clear metadata and model descriptions
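The actual cleaning rules live in scripts/silver/transform_clean_data.sql; the snippet below is only a generic sketch of that kind of transformation (trimming text, normalizing codes, handling missing values, keeping the latest record per key), and every table and column name in it is invented.

```sql
-- Generic silver-layer cleaning sketch (illustrative names only).
INSERT INTO silver.crm_customers (customer_id, customer_name, gender, created_at)
SELECT
    customer_id,
    INITCAP(TRIM(customer_name))        AS customer_name,
    CASE UPPER(TRIM(gender))
        WHEN 'M' THEN 'Male'
        WHEN 'F' THEN 'Female'
        ELSE 'Unknown'                  -- handle missing or unexpected values
    END                                 AS gender,
    created_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY created_at DESC) AS rn  -- keep the latest record per key
    FROM bronze.crm_customers
) AS deduplicated
WHERE rn = 1;
```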
Key Business Insights:
- Customer Behavior Analysis: understanding buying patterns
- Product Performance Metrics: evaluating top-performing items
- Sales Trend Analysis: identifying revenue patterns

Outcome: actionable reports for data-driven business decisions!
This section summarizes the data analysis process and the resulting reports, providing valuable business insights.
The analysis followed a structured approach, covering various aspects of the data:
- Database Exploration: Understanding the structure and relationships within the database.
- Dimensions Exploration: Analyzing the characteristics of the dimension tables (customers, products).
- Date Range Exploration: Identifying the time period covered by the data.
- Measures Exploration: Examining key metrics and their distributions.
- Magnitude Exploration: Understanding the scale of different measures.
- Ranking Analysis: Identifying top performers (e.g., customers, products).
- Change Over Time Analysis: Tracking trends and patterns over time.
- Cumulative Analysis: Examining the accumulated values of metrics.
- Performance Analysis: Evaluating the performance of different aspects of the business.
- Data Segmentation: Grouping data into meaningful segments for targeted analysis.
- Part-to-Whole Analysis: Understanding the contribution of different parts to the overall picture.
The EDA process was conducted using SQL queries. The results of the EDA are stored in the output directory within the report folder.
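For illustration, a ranking query and a change-over-time query of the kind used in this EDA could look like the following; the table and column names reuse the hypothetical star-schema sketch above rather than the project's actual objects.

```sql
-- Ranking analysis: top 10 customers by total revenue (illustrative names).
SELECT c.customer_name,
       SUM(f.sales_amount) AS total_revenue
FROM gold.fact_sales    AS f
JOIN gold.dim_customers AS c USING (customer_key)
GROUP BY c.customer_name
ORDER BY total_revenue DESC
LIMIT 10;

-- Change-over-time analysis: monthly revenue and month-over-month change.
SELECT DATE_TRUNC('month', order_date)::date AS order_month,
       SUM(sales_amount)                     AS revenue,
       SUM(sales_amount)
         - LAG(SUM(sales_amount)) OVER (ORDER BY DATE_TRUNC('month', order_date))
                                             AS revenue_change
FROM gold.fact_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY order_month;
```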
- Install PostgreSQL (Download PostgreSQL).
- Clone this repository:
  git clone https://github.com/Rudra-G-23/SQL-Data-Warehouse-Project.git
- Load the sample datasets from the /datasets/ folder into the bronze layer (see the sketch below).
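Loading is handled by scripts/bronze/load_raw_data.sql; conceptually it comes down to COPY statements like the sketch below, where the file path and table name are placeholders you would adjust to your environment.

```sql
-- Illustrative bulk load of one source CSV into a bronze table.
TRUNCATE TABLE bronze.crm_customers;   -- full reload: start from an empty table

COPY bronze.crm_customers
FROM '/path/to/datasets/source_crm/customers.csv'  -- placeholder path
WITH (FORMAT csv, HEADER true);
```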
1. Initialize the database:
   \i scripts/init_database.sql
2. Run the ETL scripts in order (bronze, then silver, then gold):
   \i scripts/bronze/load_raw_data.sql
   \i scripts/silver/transform_clean_data.sql
   \i scripts/gold/populate_dimensions.sql
   \i scripts/gold/create_analytical_views.sql
3. Start the analysis: query the gold-layer tables and views to generate insights (see the example below).
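As a starting point, an analysis query against the gold layer might look like the example below; the object names (including a product dimension) are illustrative, and the project's real gold-layer objects are listed in docs/gold/data_catalog.md.

```sql
-- Part-to-whole analysis: each product category's share of total revenue
-- (illustrative object names; adapt them to the gold-layer views in your build).
SELECT p.category,
       SUM(f.sales_amount) AS category_revenue,
       ROUND(100.0 * SUM(f.sales_amount) / SUM(SUM(f.sales_amount)) OVER (), 2) AS pct_of_total
FROM gold.fact_sales   AS f
JOIN gold.dim_products AS p USING (product_key)
GROUP BY p.category
ORDER BY category_revenue DESC;
```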
Project Assets:
- Dataset Folder
- Project Notion Page
- Diagramming Tool (Draw.io)

Want to contribute? Fork this repo and submit a pull request!
Got questions? Open an issue or reach out to me!
A special thank you to my instructor, Baraa Khatib Salkini, IT Project Manager | Lead Big Data, Data Lakehouse and BI at Mercedes-Benz AG. I learned a great deal from him.
Email me at: rudraprasadbhuyan000@gmail.com