Black Friday Sales Prediction Machine Learning Model
Introduction
Black Friday is a colloquial term for the Friday following Thanksgiving in the United States. It traditionally marks the start of the Christmas shopping season there. In this blog, I walk through a sales prediction machine learning model, using exploratory data analysis (EDA), data wrangling and data visualisation before applying the ML model.
Problem Statement
A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.
Now, they want to build a model that predicts the purchase amount of a customer against various products, which will help them create personalized offers for customers against different products.
Data Set Definition
Now, let’s look at the data.
The train and test datasets are provided in CSV format. I used a Google Colab notebook (an online Jupyter Notebook) to write the code in Python.
Uploading the train set CSV file to the Google Colab Notebook
Creating the dataframe from the train.csv file
Checking the top 5 rows of the dataframe
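The loading steps above can be sketched roughly as follows. To keep the snippet self-contained it reads a few made-up rows from an in-memory string instead of the uploaded file; the column names are assumed from the data set description, and in Colab you would pass the path of the uploaded train.csv instead.

```python
import io
import pandas as pd

# A tiny stand-in for train.csv: made-up rows, column names assumed
# from the data set description.
sample_csv = io.StringIO(
    "User_ID,Product_ID,Gender,Age,Occupation,City_Category,"
    "Stay_In_Current_City_Years,Marital_Status,Product_Category_1,"
    "Product_Category_2,Product_Category_3,Purchase\n"
    "1000001,P00069042,F,0-17,10,A,2,0,3,,,8370\n"
    "1000001,P00248942,F,0-17,10,A,2,0,1,6,14,15200\n"
    "1000002,P00285442,M,55+,16,C,4+,0,8,,,7969\n"
    "1000003,P00193542,M,26-35,15,A,3,0,1,2,,15227\n"
    "1000004,P00184942,M,46-50,7,B,2,1,1,8,17,19215\n"
    "1000005,P00274942,M,26-35,20,A,1,1,8,,,7871\n"
)

# Create the dataframe; with the real file this would be pd.read_csv("train.csv").
df = pd.read_csv(sample_csv)

# Look at the first five rows.
top5 = df.head()
print(top5)
```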
Data Cleaning
In this section, we will handle the following items:
Checking Null Values
We can see that Product_Category_2 and Product_Category_3 contain null values in the dataframe.
Dropping the User_ID and Product_ID columns from the dataset
Finding the average of the Product Category 2
Finding the average of the Product Category 3
Filling NaN with the most occurring class
Checking again if the dataframe is free of NaN values now
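A minimal sketch of the cleaning steps, run on a few made-up rows. The post mentions both the average and the most occurring class for the two product category columns; this sketch assumes the gaps are filled with the mode (the most frequent value), which is one reasonable reading rather than a confirmed detail of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; column names follow the data set description.
df = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1000003, 1000004, 1000005],
    "Product_ID": ["P001", "P002", "P003", "P004", "P005"],
    "Product_Category_1": [3, 1, 8, 1, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 8.0, 8.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, 17.0, 14.0],
    "Purchase": [8370, 15200, 7969, 19215, 7871],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop the identifier columns.
df = df.drop(columns=["User_ID", "Product_ID"])

# Fill the gaps in each column with its most frequent value (assumption: mode).
for col in ["Product_Category_2", "Product_Category_3"]:
    most_common = df[col].mode()[0]
    df[col] = df[col].fillna(most_common)

# Check again that no NaN values remain.
remaining_nulls = df.isnull().sum()
print(remaining_nulls)
```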
Finding the descriptive statistics of the dataframe
Dataframe has 550068 rows and 10 columns.
Displaying the count of unique values in each column
Displaying the descriptive features of the dataset
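These summary steps map onto three standard pandas calls. The snippet below is only a sketch on a few made-up rows, so its numbers are not the real data set's figures.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "26-35", "26-35", "55+"],
    "Purchase": [8370, 15200, 7969, 19215],
})

# Number of rows and columns.
n_rows, n_cols = df.shape
print(f"Dataframe has {n_rows} rows and {n_cols} columns.")

# Count of unique values in each column.
unique_counts = df.nunique()
print(unique_counts)

# Descriptive statistics (numeric columns only, by default).
summary = df.describe()
print(summary)
```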
Exploratory Data Analysis
Different Types of Gender in the dataframe
Different Types of Age Bracket
Different Types of Occupation
Different Types of City_Category
Different Types of Marital_Status
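One way to list the distinct values of each of these columns is sketched below. The rows are made up, and the column names Gender, Age and Occupation are assumed from the data set description.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "M"],
    "Age": ["0-17", "26-35", "55+", "26-35"],
    "Occupation": [10, 16, 15, 16],
    "City_Category": ["A", "C", "A", "B"],
    "Marital_Status": [0, 0, 1, 1],
})

# Print the distinct values found in each categorical column.
for col in ["Gender", "Age", "Occupation", "City_Category", "Marital_Status"]:
    print(col, df[col].unique())

# value_counts() additionally shows how often each value occurs.
gender_counts = df["Gender"].value_counts()
print(gender_counts)
```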
Removing the ‘+’ sign from the ‘Stay_In_Current_City_Years’ column
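A small sketch of this step on made-up values. The conversion to integers at the end is an assumption added here so the column can be used numerically; the post itself only mentions removing the sign.

```python
import pandas as pd

# Made-up values; "4+" mimics the entries that carry a plus sign.
df = pd.DataFrame({"Stay_In_Current_City_Years": ["2", "4+", "3", "1", "4+", "0"]})

# Strip the literal '+' (regex=False treats it as plain text),
# then convert the cleaned strings to integers (assumption).
df["Stay_In_Current_City_Years"] = (
    df["Stay_In_Current_City_Years"]
    .str.replace("+", "", regex=False)
    .astype(int)
)
print(df["Stay_In_Current_City_Years"].tolist())
```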
Data Visualisation
Heatmap
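A heatmap of this kind is usually drawn from a correlation matrix. The sketch below only computes that matrix on made-up rows; drawing it is left out, and the choice of columns is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Occupation": [10, 16, 15, 7, 20, 9],
    "Marital_Status": [0, 0, 0, 1, 1, 0],
    "Product_Category_1": [3, 1, 8, 1, 8, 5],
    "Purchase": [8370, 15200, 7969, 19215, 7871, 5378],
    "City_Category": list("ACABAB"),
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting matrix can then be passed to a plotting function such as seaborn's `heatmap` to produce the figure.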
Barplot of ‘Marital_Status’ against ‘Purchase’, grouped by the number of years stayed in the current city.
Gender Distribution
Purchase Pattern in Scatter Plot
BoxPlot Grouped by Gender
Product Category grouped by the Age Bracket
The 36–45 age bracket has the highest average purchase value in Product Category 1.
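A comparison like this can be checked numerically with a group-by. The sketch below assumes the statement refers to the mean of Purchase per age bracket among rows where Product_Category_1 equals 1; the rows are made up, so the output illustrates the technique and does not reproduce the real result.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": ["26-35", "26-35", "36-45", "36-45", "46-50", "36-45"],
    "Product_Category_1": [1, 1, 1, 1, 1, 8],
    "Purchase": [15000, 11000, 19000, 15000, 12000, 7000],
})

# Keep rows from Product Category 1, then average Purchase per age bracket.
category_1 = df[df["Product_Category_1"] == 1]
mean_by_age = category_1.groupby("Age")["Purchase"].mean()
print(mean_by_age)

# Age bracket with the highest average purchase in this toy sample.
top_bracket = mean_by_age.idxmax()
print(top_bracket)
```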
Once our dataframe is cleaned and ready, we can prepare it for feeding into the ML model for prediction.
Converting the ‘Gender’ column in the dataframe to the binary values 0 and 1
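A minimal sketch of this encoding. The direction of the mapping (F to 0, M to 1) is an assumption; the opposite choice works just as well for the model.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"Gender": ["F", "M", "M", "F", "M"]})

# Map each category to a binary number (assumed mapping: F -> 0, M -> 1).
df["Gender"] = df["Gender"].map({"F": 0, "M": 1})
print(df["Gender"].tolist())
```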
Predicting using the Linear Regression ML model
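The modelling step might look roughly like the sketch below, which uses scikit-learn's `LinearRegression`. The synthetic data, the three chosen features, the 80/20 split and the RMSE metric are all assumptions made for illustration; the real notebook trains on the cleaned Black Friday dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataframe (not the real data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Gender": rng.integers(0, 2, n),
    "Occupation": rng.integers(0, 21, n),
    "Product_Category_1": rng.integers(1, 19, n),
})
# Target built from a known linear rule plus noise, so the fit can be checked.
y = 12000 + 800 * X["Gender"] - 400 * X["Product_Category_1"] + rng.normal(0, 500, n)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the linear regression model and predict on the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Root mean squared error on the held-out rows.
rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
print(f"RMSE on the held-out split: {rmse:.1f}")
```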
GitHub Link — https://github.com/aseemnarula1/Black_Friday_Sales_Prediction_Model