Black Friday Sales Prediction Machine Learning Model

Aseem Narula
5 min read · Dec 17, 2021

Introduction

Black Friday is a colloquial term for the Friday following Thanksgiving in the United States. It traditionally marks the start of the Christmas shopping season there. In this blog, I will walk through a sales prediction machine learning model, using exploratory data analysis (EDA), data wrangling and data visualisation before applying the ML model.

Problem Statement

A retail company “ABC Private Limited” wants to understand customer purchase behaviour (specifically, the purchase amount) across products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.

Now, they want to build a model that predicts the purchase amount of a customer for a given product, which will help them create personalized offers for customers on different products.

Data Set Definition

Now, let’s look at the data.

The train and test datasets are provided in CSV format. I have used a Google Colab notebook (an online Jupyter notebook) to write the code in Python.

Uploading the train set CSV file to the Google Colab Notebook

Creating the dataframe from the train.csv file

Checking the top 5 rows of the dataframe
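The three steps above can be sketched roughly as follows. This is a minimal sketch, not the exact notebook code: the column names are assumed from the public Black Friday dataset, and the rows are made up so that the snippet runs without the real train.csv. In Colab, the file would be uploaded first and then read by name.

```python
import io

import pandas as pd

# In Colab, the CSV is typically uploaded first, for example:
#   from google.colab import files
#   uploaded = files.upload()
# and then read with pd.read_csv("train.csv").

# Tiny in-memory stand-in for train.csv (made-up rows, assumed column names).
SAMPLE_CSV = """\
User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1,P001,F,0-17,10,A,2,0,3,,,8000
2,P002,M,26-35,7,B,4+,1,1,6.0,14.0,15000
3,P003,M,36-45,1,C,1,0,8,,,7000
4,P004,F,55+,16,C,3,1,5,8.0,,5000
5,P005,M,18-25,4,A,0,0,1,2.0,15.0,12000
6,P006,F,46-50,20,B,4+,1,2,4.0,,9000
"""

# read_csv accepts a file path or any file-like object.
train_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# head() returns the first five rows by default.
print(train_df.head())
```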

Data Cleaning

In this section, we will handle the following items:

Checking Null Values

We can see that Product_Category_2 and Product_Category_3 contain null values.
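A null check of this kind might look like the sketch below. The dataframe here is a made-up stand-in for the real training data, with only a few of the columns.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real training data.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Product_Category_1": [3, 1, 8, 5],
    "Product_Category_2": [np.nan, 6.0, np.nan, 8.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
    "Purchase": [8000, 15000, 7000, 5000],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Keep only the names of columns that actually contain missing values.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)
```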

Dropping User_ID and Product_ID from the dataset
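Dropping the two identifier columns could be done as in this small sketch (made-up rows, with the Product_ID column name assumed from the public dataset):

```python
import pandas as pd

# Made-up rows; only the columns needed for the illustration.
df = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Product_ID": ["P001", "P002", "P003"],
    "Gender": ["F", "M", "M"],
    "Purchase": [8000, 15000, 7000],
})

# drop() returns a new dataframe without the identifier columns.
df = df.drop(columns=["User_ID", "Product_ID"])
print(df.columns.tolist())
```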

Finding the average of Product_Category_2

Finding the average of Product_Category_3

Filling NaN with the most occurring class
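The averaging and filling steps can be sketched together as below, again on made-up values. Since the product categories are class codes rather than quantities, filling with the most frequent class (the mode) keeps every value a valid category, which the mean would not guarantee.

```python
import numpy as np
import pandas as pd

# Made-up category codes with some missing entries.
df = pd.DataFrame({
    "Product_Category_2": [np.nan, 6.0, 8.0, 8.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, np.nan, 14.0, 16.0],
})

for col in ["Product_Category_2", "Product_Category_3"]:
    # mean() skips NaN values by default.
    print(col, "mean:", df[col].mean())
    # mode() returns a Series because ties are possible; take the first value.
    most_common = df[col].mode()[0]
    df[col] = df[col].fillna(most_common)

# Confirm that no missing values remain.
no_nulls_left = not df.isnull().values.any()
print(no_nulls_left)
```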

Checking again if the dataframe is free of NaN values now

Finding the descriptive statistics of the dataframe

Dataframe has 550068 rows and 10 columns.

Displaying the count of unique values in each column

Displaying the descriptive features of the dataset
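The shape, unique counts and descriptive statistics can each be obtained with a single pandas call. A minimal sketch on made-up rows (the real dataframe has 550068 rows and 10 columns):

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataframe.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F", "M"],
    "Age": ["0-17", "26-35", "26-35", "55+", "36-45"],
    "Purchase": [8000, 15000, 7000, 5000, 12000],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"Dataframe has {n_rows} rows and {n_cols} columns.")

# nunique() counts the distinct values in each column.
unique_counts = df.nunique()
print(unique_counts)

# describe() summarises the numeric columns (count, mean, std, quartiles).
summary = df.describe()
print(summary)
```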

Exploratory Data Analysis

Different Types of Gender in the dataframe

Different Types of Age Bracket

Different Types of Occupation

Different Types of City_Category

Different Types of Marital_Status
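One way to list the different values of each of these columns, together with how often they occur, is value_counts(). The sketch below uses made-up rows and assumes the column names of the public dataset.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataframe.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F", "M"],
    "Age": ["0-17", "26-35", "26-35", "55+", "36-45"],
    "Occupation": [10, 7, 7, 16, 4],
    "City_Category": ["A", "B", "B", "C", "A"],
    "Marital_Status": [0, 1, 1, 1, 0],
})

categorical_cols = ["Gender", "Age", "Occupation", "City_Category", "Marital_Status"]

# value_counts() lists each distinct value with its frequency.
counts = {col: df[col].value_counts() for col in categorical_cols}
for col, col_counts in counts.items():
    print(col)
    print(col_counts)
```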

Removing the ‘+’ sign from the ‘Stay_In_Current_City_Years’ column
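A possible way to do this is a literal string replacement followed by a cast to integer, sketched here on made-up values:

```python
import pandas as pd

# Made-up values; "4+" means four or more years.
df = pd.DataFrame({"Stay_In_Current_City_Years": ["2", "4+", "1", "0", "4+"]})

# regex=False treats "+" as a literal character, not a regex quantifier.
df["Stay_In_Current_City_Years"] = (
    df["Stay_In_Current_City_Years"]
    .str.replace("+", "", regex=False)
    .astype(int)
)
print(df["Stay_In_Current_City_Years"].tolist())
```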

Data Visualisation

Heatmap
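A correlation heatmap of the numeric columns could be drawn as in the sketch below. It assumes seaborn and matplotlib are available, and the rows and the colour map are illustrative choices rather than the ones from the original notebook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the cleaned dataframe.
df = pd.DataFrame({
    "Occupation": [10, 7, 1, 16, 4, 20],
    "Marital_Status": [0, 1, 0, 1, 0, 1],
    "Product_Category_1": [3, 1, 8, 5, 1, 2],
    "Purchase": [8000, 15000, 7000, 5000, 12000, 9000],
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.select_dtypes("number").corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
ax.set_title("Correlation between numeric columns")
fig.tight_layout()
```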

Barplot of ‘Purchase’ against ‘Marital Status’, split by the number of years stayed in the current city.
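A grouped barplot like this can be sketched with seaborn, assuming the number of years is passed as the hue. The rows are made up, and the groupby shows the mean values that the bars represent.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the cleaned dataframe.
df = pd.DataFrame({
    "Marital_Status": [0, 0, 1, 1, 0, 1, 0, 1],
    "Stay_In_Current_City_Years": [0, 1, 0, 1, 0, 1, 1, 0],
    "Purchase": [8000, 9000, 7000, 12000, 10000, 11000, 9500, 6500],
})

# The bar heights are the mean purchase of each group.
mean_purchase = df.groupby(
    ["Marital_Status", "Stay_In_Current_City_Years"]
)["Purchase"].mean()
print(mean_purchase)

fig, ax = plt.subplots()
sns.barplot(
    data=df,
    x="Marital_Status",
    y="Purchase",
    hue="Stay_In_Current_City_Years",
    ax=ax,
)
fig.tight_layout()
```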

Gender Distribution

Purchase Pattern in Scatter Plot

BoxPlot Grouped by Gender
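The three plots above could be sketched side by side as follows. The choice of axes is an assumption: a count of rows per gender for the distribution, purchase against row index for the scatter plot, and purchase per gender for the boxplot. The rows are made up.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the cleaned dataframe.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F", "M", "M"],
    "Purchase": [8000, 15000, 7000, 5000, 12000, 9000],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Gender distribution: number of rows per gender.
sns.countplot(data=df, x="Gender", ax=axes[0])
axes[0].set_title("Gender Distribution")

# Purchase pattern: purchase amount against row index (assumed axes).
axes[1].scatter(df.index, df["Purchase"])
axes[1].set_xlabel("Row index")
axes[1].set_ylabel("Purchase")
axes[1].set_title("Purchase Pattern")

# Spread of purchase amounts for each gender.
sns.boxplot(data=df, x="Gender", y="Purchase", ax=axes[2])
axes[2].set_title("Purchase by Gender")

fig.tight_layout()
```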

Product Category grouped by the Age Bracket

The 36–45 age bracket has the highest average purchase value in Product Category 1.
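One reading of this grouping is the average purchase per product category and age bracket. The sketch below follows that reading on made-up rows, so the numbers only illustrate the technique and do not reproduce the result on the real data.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataframe.
df = pd.DataFrame({
    "Age": ["26-35", "36-45", "26-35", "36-45", "18-25", "18-25"],
    "Product_Category_1": [1, 1, 1, 1, 1, 5],
    "Purchase": [11000, 16000, 13000, 14000, 9000, 6000],
})

# Average purchase per category (rows) and age bracket (columns).
avg_purchase = (
    df.groupby(["Product_Category_1", "Age"])["Purchase"]
    .mean()
    .unstack("Age")
)
print(avg_purchase)

# Age bracket with the highest average purchase in category 1.
top_bracket = avg_purchase.loc[1].idxmax()
print(top_bracket)
```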

Once our dataframe is cleaned and ready, we can prepare it for feeding into the ML model for prediction.

Converting the ‘Gender’ column in the dataframe to binary values, 0 and 1
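The conversion can be done with a simple mapping. Which gender gets 0 and which gets 1 is an assumption in this sketch; either assignment works for the model as long as it is used consistently.

```python
import pandas as pd

# Made-up rows; the F -> 0, M -> 1 assignment is an assumption.
df = pd.DataFrame({"Gender": ["F", "M", "M", "F"]})

# map() replaces each label with its numeric code.
df["Gender"] = df["Gender"].map({"F": 0, "M": 1})
print(df["Gender"].tolist())
```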

Predicting using the Linear Regression ML model
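A minimal sketch of this step with scikit-learn is shown below. It runs on synthetic data, and the feature columns, the 80/20 split and the metrics are assumptions rather than the exact setup of the original notebook. On the real data, string columns such as Age and City_Category would also need to be encoded as numbers first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, fully numeric dataframe.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Gender": rng.integers(0, 2, n),
    "Occupation": rng.integers(0, 21, n),
    "Marital_Status": rng.integers(0, 2, n),
    "Product_Category_1": rng.integers(1, 21, n),
})
# Synthetic target with a known linear relationship plus noise.
df["Purchase"] = (
    12000
    - 400 * df["Product_Category_1"]
    + 500 * df["Gender"]
    + rng.normal(0, 1000, n)
)

# Features are every column except the target.
X = df.drop(columns=["Purchase"])
y = df["Purchase"]

# Hold out 20% of the rows to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Root mean squared error and R^2 on the held-out rows.
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}, R^2: {r2:.3f}")
```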

GitHub Link: https://github.com/aseemnarula1/Black_Friday_Sales_Prediction_Model
