Retail-dataset-analysis

Data analysis and model evaluation on a retail dataset, carried out while participating in the ML Bootcamp by Global AI Hub. The notebook is available at https://www.kaggle.com/code/cansusary/retail-analysis-bc

Problem Identification

The project began with identifying a key business problem: analyzing customer purchasing behavior to uncover insights that can guide marketing strategies and improve sales. The primary goal was to explore relationships between customer segments, product categories, and purchasing patterns to predict trends and better understand customer preferences.

Library Installation and Import

Before the analysis, the required Python libraries were installed and imported: Pandas and NumPy for data handling, Matplotlib and Seaborn for visualization, and Scikit-learn for the machine learning models.

Data Loading and Reading

Exploratory Data Analysis (EDA)

An initial Exploratory Data Analysis (EDA) was conducted to summarize the dataset. Descriptive statistics like mean, median, and distribution of variables were computed. Additionally, visualizations such as bar plots and count plots were created to provide insights into customer distribution, purchasing trends, and product preferences.
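As an illustration of this step, the following is a minimal sketch rather than the notebook's actual code. It runs on a small toy table; the column names `Age` and `Amount` and all values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the retail data; column names and values are assumptions.
df = pd.DataFrame({
    "Age": [23, 35, 41, 29, 52, 35],
    "Amount": [120.0, 80.5, 300.0, 45.0, 210.0, 99.5],
    "Customer_Segment": ["New", "Regular", "Premium", "New", "Premium", "Regular"],
})

# Mean, median, and standard deviation of the numeric columns.
summary = df[["Age", "Amount"]].agg(["mean", "median", "std"])

# Frequency of each category, the tabular counterpart of a count plot.
segment_counts = df["Customer_Segment"].value_counts()

print(summary)
print(segment_counts)
```

The same counts can be drawn with Seaborn's `countplot` once the table is loaded.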

Data Cleaning and Handling Missing Values

The next step involved cleaning the dataset. Missing values were detected using methods like isnull() and filled or dropped where appropriate. For numerical columns, imputation techniques like filling with mean or median values were used.
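A minimal sketch of this cleaning pattern is shown below. The toy data, the column names, and the choice of which column is dropped versus imputed are assumptions for illustration, not the notebook's exact decisions.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; column names and values are assumptions.
df = pd.DataFrame({
    "Age": [23.0, np.nan, 41.0, 29.0],
    "Amount": [120.0, 80.0, np.nan, 40.0],
    "Product_Category": ["Grocery", "Electronics", "Clothing", None],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Drop rows whose category is missing, then impute the numeric columns.
cleaned = df.dropna(subset=["Product_Category"]).copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Amount"] = cleaned["Amount"].fillna(cleaned["Amount"].mean())

print(missing_per_column)
print(cleaned)
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed columns.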

Data Visualization

To gain deeper insights, various data visualizations were generated using Seaborn and Matplotlib. Bar plots, correlation heatmaps, and scatter plots allowed for visual interpretation of relationships between customer segments, product categories, and purchasing behavior. These visualizations helped to identify patterns and correlations that could influence modeling decisions.

Machine Learning Model Selection and Building

Several machine learning algorithms were considered to determine the most suitable model. Linear regression and K-means clustering were chosen for initial experimentation. Linear regression was used to predict customer purchases based on features like amount and age, while K-means clustering helped identify customer segments based on purchasing behavior.
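The two models can be sketched as follows. This is an illustration on synthetic data, not the notebook's code: the feature values, the target formula, the scaling step, and the choice of three clusters are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features standing in for age and amount (assumed values).
age = rng.integers(18, 70, size=200)
amount = rng.uniform(10, 500, size=200)
X = np.column_stack([age, amount])

# Synthetic target standing in for total purchases (assumed relationship).
y = 0.05 * age + 0.02 * amount + rng.normal(0, 0.5, size=200)

# Supervised model: predict the target from the features.
reg = LinearRegression().fit(X, y)
r2 = reg.score(X, y)

# Unsupervised model: group customers after putting features on one scale,
# since K-means relies on Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_

print(f"R^2 on training data: {r2:.3f}")
print("Cluster sizes:", np.bincount(labels))
```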

One-Hot Encoding

One-hot encoding was employed to convert categorical variables into a format suitable for machine learning algorithms. In the retail dataset, features such as 'Customer_Segment' and 'Product_Category' were categorical, meaning they contained non-numeric labels. One-hot encoding replaces each such column with one binary indicator column per category.
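A small sketch of one-hot encoding with `pandas.get_dummies` is shown below. The category values and the `Amount` column are assumptions; the notebook may have used a different encoder, such as Scikit-learn's `OneHotEncoder`.

```python
import pandas as pd

# Toy rows; category values and the Amount column are assumptions.
df = pd.DataFrame({
    "Customer_Segment": ["New", "Regular", "Premium", "New"],
    "Product_Category": ["Grocery", "Clothing", "Grocery", "Electronics"],
    "Amount": [120.0, 80.0, 300.0, 45.0],
})

# Replace each categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(
    df, columns=["Customer_Segment", "Product_Category"], dtype=int
)

print(encoded.columns.tolist())
```

Each original row keeps its numeric columns and gains exactly one `1` within each group of indicator columns.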

Target Variable and Feature Selection

Careful consideration was given to selecting the target variable (Total Purchases) and the relevant features for the models.

Cross-Validation

Cross-validation was applied to both models to evaluate their performance across different data splits. For linear regression, R-squared scores were calculated, while silhouette scores were used for the K-means model. These metrics helped assess the stability and accuracy of the models across multiple folds.
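One way to compute both metrics across folds is sketched below on synthetic data. The source does not describe how the silhouette score was obtained per fold, so fitting K-means on the training fold and scoring the held-out fold is an assumption, as are the data, the five folds, and the three clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic data: three well-separated groups of 50 points (assumed).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, -5.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.3, size=150)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# R-squared of linear regression on each held-out fold.
r2_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

# Silhouette score of K-means on each held-out fold.
sil_scores = []
for train_idx, test_idx in cv.split(X):
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X[train_idx])
    test_labels = km.predict(X[test_idx])
    sil_scores.append(silhouette_score(X[test_idx], test_labels))

print("R^2 per fold:", np.round(r2_scores, 3))
print("Silhouette per fold:", np.round(sil_scores, 3))
```

R-squared and silhouette measure different things (fit to a target versus cluster separation), so the two sets of scores are not directly comparable in magnitude.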

Hyperparameter Optimization

Since K-means produced more favorable results, hyperparameter optimization was applied to it using Random Search. This step helped identify the best number of clusters and other model parameters, improving the model’s performance.
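A random search over K-means settings could look like the sketch below. It is not the notebook's implementation: the search space, the number of sampled candidates, the synthetic data, and the use of the silhouette score as the selection criterion are assumptions. Candidates are drawn with Scikit-learn's `ParameterSampler` and scored manually.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterSampler

rng = np.random.default_rng(0)

# Synthetic data: three groups of 50 points (assumed).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, -5.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

# Hypothetical search space for illustration.
param_space = {
    "n_clusters": list(range(2, 9)),
    "init": ["k-means++", "random"],
}

best_score, best_params = -1.0, None
# Sample 8 of the 14 possible combinations at random.
for params in ParameterSampler(param_space, n_iter=8, random_state=0):
    labels = KMeans(n_init=10, random_state=0, **params).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_score, best_params = score, params

print("Best parameters:", best_params)
print(f"Best silhouette: {best_score:.3f}")
```

Scoring by silhouette rather than inertia avoids a known bias: inertia always decreases as the number of clusters grows, so it would favor the largest `n_clusters` in the space.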

Model Evaluation

The K-means algorithm was determined to be more suitable for the retail dataset. K-means is an unsupervised learning algorithm, which means it doesn’t require labeled data. Since the goal was to identify patterns and group customers based on their purchasing behavior, K-means was ideal for discovering hidden customer segments without predefined categories.
