Starbucks Capstone Challenge

Aseem Narula
14 min read · Aug 29, 2021

Introduction

My name is Aseem Narula. I am an RPA consultant (in simple words, we build software bots using UiPath) and an aspiring Data Scientist, currently pursuing the Udacity Data Science Nanodegree. This Capstone is the final project of the programme, where I use my Data Science and Machine Learning skills to analyse the given dataset and present an ML solution to tackle a business problem for Starbucks.


Project Overview

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Problem Statement

The three main datasets are as follows —

  • portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

*Note: JSON is the file format of the input source data.*
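Since the files are line-delimited JSON, they can be loaded with pandas. The sketch below uses a tiny inline sample with the same shape (the offer fields shown are illustrative, and the commented-out path is an assumption about the local layout):

```python
import io
import pandas as pd

# The real files would be read like this:
# portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)

# A tiny inline sample in the same line-delimited JSON format:
sample = io.StringIO(
    '{"id": "o1", "offer_type": "bogo", "difficulty": 5, "reward": 5, "duration": 7}\n'
    '{"id": "o2", "offer_type": "discount", "difficulty": 10, "reward": 2, "duration": 10}\n'
)
portfolio = pd.read_json(sample, orient='records', lines=True)
print(portfolio.shape)  # (2, 5)
```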

Our main goal is to combine the above three datasets — Portfolio, Profile and Transcript, i.e. to combine transaction, demographic and offer data — to determine which demographic groups respond best to which offer type.

I will also build an ML model from the given dataset to predict whether a customer’s response to an offer is ‘completed’ or ‘viewed’.

Metrics

I will adopt the following strategy —

a) Performing initial EDA to analyse the data.

b) Applying descriptive statistics to dive deeper into the datasets.

c) Applying different machine learning models and then choosing the best one based on precision, recall and F1-score.

d) Performing a final EDA to view the merged datasets in a combined way (as per the problem statement overview above).


Justification for choosing the above metrics for our problem statement: these metrics are all fairly insensitive to class imbalance, and using them is a good fit for datasets like these, where a great deal of data cleaning and pre-processing is required.

I have used the accuracy score and/or F-score to measure ML model performance in this classification problem. Classification accuracy involves first using a classification model to make a prediction for each example in a test dataset. The predictions are then compared to the known labels for those examples in the test set. Accuracy is the proportion of examples in the test set that were predicted correctly.

  • Accuracy = Correct Predictions / Total Predictions

Conversely, the error rate can be calculated as the total number of incorrect predictions made on the test set divided by all predictions made on the test set.

  • Error Rate = Incorrect Predictions / Total Predictions

The accuracy and error rate are complements of each other, meaning that we can always calculate one from the other. For example:

  • Accuracy = 1 − Error Rate
  • Error Rate = 1 − Accuracy

Another valuable way to think about accuracy is in terms of the confusion matrix.

A confusion matrix is a summary of the predictions made by a classification model, organized into a table by class. Each row of the table indicates the actual class and each column represents the predicted class. Each cell counts the predictions of a given predicted class that belong to a given actual class. The cells on the diagonal represent correct predictions, where the predicted and actual class agree.
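As a quick illustration of these definitions, here is a minimal sketch using scikit-learn’s metrics on a tiny hand-made example (the labels are invented for demonstration, not taken from the Starbucks data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ['viewed', 'viewed', 'completed', 'completed', 'viewed']
y_pred = ['viewed', 'completed', 'completed', 'completed', 'viewed']

acc = accuracy_score(y_true, y_pred)  # correct / total = 4/5
err = 1 - acc                         # error rate is the complement
print(acc, err)                       # 0.8 0.2

# Rows = actual class, columns = predicted class; diagonal = correct.
cm = confusion_matrix(y_true, y_pred, labels=['completed', 'viewed'])
print(cm)  # [[2 0]
           #  [1 2]]
```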

Data Exploration

In this section, features and calculated statistics relevant to the problem have been reported and discussed related to the dataset. Abnormalities or characteristics about the input data that need to be addressed have been identified.

Portfolio Dataset

The statistics for the Portfolio data frame look good for the columns ‘difficulty’, ‘duration’ and ‘reward’.

Complete view of the Portfolio dataset

Different Type of the Offers

Observations about the Portfolio Dataset

  1. There are no null values in the Portfolio dataset, which reduces our data pre-processing effort.
  2. There are four different channels via which offers are made to customers — web, social, mobile, email.
  3. There are three different types of offers made to customers — ‘bogo’ (buy one get one free), ‘informational’, ‘discount’.

Profile Dataset

The statistics of the Profile data frame

The above statistics show a maximum age of 118, which seems somewhat odd; I need to dig deeper into this.

  1. Number of members where age is missing
  2. Number of members where gender is missing
  3. Number of members where income is missing
  4. Max, min and mean values of income
  5. Sample records of the Profile dataset

Observations about the Profile Dataset

  1. The Profile dataset contains null values in the income column.
  2. The gender column contains ‘None’ values for exactly the rows whose income shows as NaN.
  3. The ‘age’ column contains the value 118, which appears to be a default fed in by the system for customers who did not want to share their age and income with the Starbucks app.
  4. The ‘became_member_on’ column needs date formatting, as it is stored in a different date format.
  5. There are 2175 members whose age is missing.
  6. Out of the 17000 records in the dataset, gender and income are missing for the same members.
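The checks behind these observations can be sketched on a toy stand-in for the Profile frame (the values below are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real profile frame (values are illustrative only).
profile = pd.DataFrame({
    'gender': ['F', None, 'M', None],
    'age': [55, 118, 75, 118],
    'income': [112000.0, np.nan, 100000.0, np.nan],
})

# Members with the placeholder age 118 (treated as "age not provided").
print((profile['age'] == 118).sum())  # 2

# Null counts per column -- gender and income go missing together.
print(profile[['gender', 'income']].isnull().sum())
```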

Transcript Dataset

This dataset contains the transaction level details for each offer id.

  1. Statistics of the Transcript data frame
  2. Checking the distinct events of the transcript
  3. Total value counts of the events in the transcript
  4. Number of null values in the transcript
  5. Checking the records for the transcript event = ‘transaction’

Observations about the Transcript Dataset

  1. There is only one column, ‘time’, on which statistics can be applied; the other columns are categorical in nature.
  2. There are four different event types at the transaction level — ‘offer received’, ‘offer viewed’, ‘transaction’, ‘offer completed’.
  3. Number of events in the transcript:
     • transaction: 138953
     • offer received: 76277
     • offer viewed: 57725
     • offer completed: 33579
  4. The transcript contains 306534 records in total (the sum of the event counts above), over which the null-value check was run.
  5. The amount for the ‘transaction’ event is stored as a key-value pair in dictionary format, which needs cleaning and formatting.
  6. Personal details such as the name appear to be encrypted, and the same is the case for the offer id under the ‘value’ column; this will be handled in the Data Preprocessing section.
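One way to unpack that dictionary-valued column is `pd.json_normalize`. A minimal sketch on toy rows shaped like the transcript (the ids and amounts are invented for illustration):

```python
import pandas as pd

# Toy rows shaped like the transcript 'value' column (a dict per event).
transcript = pd.DataFrame({
    'event': ['offer received', 'transaction', 'offer completed'],
    'value': [{'offer id': 'abc123'},
              {'amount': 9.64},
              {'offer_id': 'abc123', 'reward': 5}],
})

# Flatten each dict key into its own column; NaN where a key is absent.
flattened = pd.json_normalize(transcript['value'].tolist())
transcript = pd.concat([transcript.drop(columns='value'), flattened], axis=1)
print(transcript.columns.tolist())
```

Note that the raw data mixes ‘offer id’ and ‘offer_id’ as key names, so the two resulting columns still need to be merged during cleaning.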

Data Visualization

In this section, I present some data visualization plots of the input datasets.

(a) Plotting a bar plot of the member count by gender

The total count of ‘Male’ members appears larger than the count of ‘Female’ members, whereas the ‘Other’ gender category is comparatively very small, at fewer than 1000.

(b) Plotting income grouped by age

Income looks stable across the age range of 50 to 85, but looks somewhat odd near the 100–118 age group (as per the previous data exploration section); there are a few outliers in this bracket.

(c) Plotting the Age Group Distribution in the Histogram

This plot confirms the outliers in the 100–120 age bracket; this will be fixed in the Data Pre-Processing section. The count peaks in the 60–65 age group, where most people have income just before retirement.

(d) Plotting the different types of offer available

This plot shows the different types of offers available to customers in the Starbucks app.

(e) Plotting when customers became members of the Starbucks app

The plot shows the year-wise distribution of when members joined the Starbucks app; most joined from 2016 onwards, and the spike in membership appears to be in the first half of 2018.

(f) Plotting a pie chart of the gender split in percentage terms

There is about 58% male population compared to around 42% female, while the ‘O’ gender category has been filtered out.

(g) Pair Plot of the Portfolio Data frame

The following pair plot shows the columns ‘difficulty’ and ‘duration’ from the Portfolio data frame.

(h) Heatmap for the Portfolio data frame

The following heatmap shows the correlation across all columns of the Portfolio data frame (‘difficulty’, ‘duration’, ‘reward’); reward and difficulty appear highly correlated, at around 0.90.

Data Preprocessing

In this section, I will handle the categorical data in the Portfolio data frame, along with other data fixes, in order to prepare the datasets for the downstream data modelling layer.

Portfolio Data frame

(a) One-hot encoding the channels column using the MultiLabelBinarizer so that it is represented in binary 0/1 form.

(b) Assigning the different types of portfolio offers in the encoded frame to the numerical values 1, 2, 3.
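A minimal sketch of the channels encoding with `MultiLabelBinarizer` (the two offer rows are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy portfolio rows; channels is a list-valued column as in the real data.
portfolio = pd.DataFrame({
    'id': ['o1', 'o2'],
    'channels': [['email', 'mobile', 'web'], ['email', 'social']],
})

# One column per channel, 1 if the offer uses it and 0 otherwise.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(portfolio['channels']),
                       columns=mlb.classes_, index=portfolio.index)
portfolio = pd.concat([portfolio.drop(columns='channels'), encoded], axis=1)
print(portfolio)
```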

Profile Data frame

(a) Formatting the ‘became_member_on’ column in the Profile dataset

(b) Filling the NaN age values with the mean age of the column

(c) Filling the NaN values in the income column with the mean income

(d) Bucketing the ages into an age-group column so that profile ages fall into brackets for our data modelling layer.
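Age bucketing of this kind can be done with `pd.cut`; the bin edges and labels below are assumptions for illustration, not the exact groups used in my notebook:

```python
import pandas as pd

profile = pd.DataFrame({'age': [22, 37, 55, 64, 80]})

# Illustrative brackets: (17, 30], (30, 45], (45, 60], (60, 75], (75, 120]
bins = [17, 30, 45, 60, 75, 120]
labels = [1, 2, 3, 4, 5]
profile['age_group_number'] = pd.cut(profile['age'], bins=bins, labels=labels)
print(profile['age_group_number'].tolist())  # [1, 2, 3, 4, 5]
```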

Transcript Data frame

(a) Renaming the column ‘person’ to ‘customer_id’ for better readability

(b) Creating a cross-tab view to check the total counts of the different offer transaction types

(c) Concatenating the transcript and the encoded transcript data frames

(d) Merging the data frames

Converting ‘offer completed’, ‘offer viewed’ and ‘offer received’ to numerical values so that they can be used for ML prediction

(e) Filling the NaN values in the income and gender columns with mean values

(f) Converting a few float columns to the integer datatype before making ML predictions

After the steps above, the dataset is ready for the data modelling layer: the issues and inconsistencies identified in the data exploration section have been fixed, and the data is ready for ML prediction.
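The event-to-number conversion can be sketched with a simple mapping (the integer codes here are hypothetical choices for illustration):

```python
import pandas as pd

df = pd.DataFrame({'event': ['offer received', 'offer viewed',
                             'offer completed', 'offer viewed']})

# Hypothetical mapping of event labels to integer codes for modelling.
event_map = {'offer received': 0, 'offer viewed': 1, 'offer completed': 2}
df['event'] = df['event'].map(event_map)
print(df['event'].tolist())  # [0, 1, 2, 1]

# Float columns can then be cast to int once their NaNs are filled, e.g.:
# df['income_group_number'] = df['income_group_number'].astype(int)
```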

Implementation & Refinement

In this section, I start by defining the variables X and y from the cleaned, merged dataset. I used the following columns for X: ‘time’, ‘difficulty’, ‘duration’, ‘reward’, ‘offer_type_number’, ‘age_group_number’, ‘income_group_number’ and ‘gender_group_number’.

The y variable is the event.

This step is important with respect to our initial problem statement: we need to make sure the steps we undertake are correct and do not deviate from the goal, which is predicting whether a customer’s response to an offer on the Starbucks app is ‘viewed’ or ‘completed’.

I split the data into train and test sets with a 33% test size, then used the MinMaxScaler to fit a transformation that normalizes the features, which helps the accuracy of the classifier.
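A sketch of the split-and-scale step on synthetic stand-in features (the real X uses the columns listed above). Note that the scaler is fitted on the training split only, to avoid leaking test-set statistics into training:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real feature matrix and event labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Fit on the training split only, then apply the same scaling to both.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(), X_train_scaled.max())  # 0.0 1.0
```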

Here we can see that the labels are imbalanced, approximately 2:1 for positive vs. negative labels. Hence we use the F1-score instead of accuracy, since the F1-score balances recall and precision and copes better with imbalanced labels.

Model Evaluation and Validation

After the dataset is split into train and test data, I used various ML models to compare and test the accuracy and efficiency of the trained models.

(a) Fitting a simple linear regression model as a classifier gives an accuracy of 11%, which is poor.

(b) Using the Random Forest Classifier for prediction, the confusion matrix shows a good accuracy of 67%.

(c) Logistic Regression also shows around 65% accuracy.

(d) The Decision Tree Classifier shows an accuracy of around 77% on the training dataset.

(e) The KNN Classifier shows an accuracy of around 73% on the training dataset and 67% on the test dataset.

(f) The Support Vector Machine produces 65% accuracy on both the training and test data, which is average but not good enough in our case.

Tabular view of the ML models comparison

The Decision Tree Classifier and K Neighbors Classifier give us the better scores and can be used to predict a customer’s response based on the offers viewed in the Starbucks app.
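The comparison can be organised as a small loop over classifiers. This sketch uses a synthetic dataset, so the scores will not match the figures above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real notebook uses the merged Starbucks frame.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'KNN': KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))
    print(f'{name}: F1 = {score:.3f}')
```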

Model Refinement using Cross Validation

Cross Validation

Since the Decision Tree Classifier has the highest training accuracy, I am moving ahead with this ML model and performing cross-validation.

The result is 0.00221377245132. This is extremely low, which means that our model has very low variance; that is actually very good, since it means the score we obtained on one test set is not down to chance. Rather, the model will perform more or less the same on all test sets.
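A figure that small is typically the standard deviation of the fold scores. A sketch of the computation with `cross_val_score` on synthetic data (so the numbers will differ from the result above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# A small standard deviation across folds means the score is stable
# and not an artifact of one particular train/test split.
print(scores.mean(), scores.std())
```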

Grid Search for Parameter Selection

Using GridSearchCV for parameter selection is the best way to find the best parameters for our ML model; in our case, the Decision Tree Classifier has given the best accuracy so far.

I used a max_depth list ranging from 4 to 150, and chose the criteria ‘gini’ and ‘entropy’ for our purpose.

Once I applied GridSearchCV to the Decision Tree Classifier, I got the best parameters as {‘criterion’: ‘entropy’, ‘max_depth’: 15}.
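A sketch of the search with `GridSearchCV` (synthetic data and a shortened depth grid; the notebook’s grid ran from 4 to 150, so the best parameters below will not match):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Search over criterion and a sample of depths, as described above.
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [4, 8, 15, 30, 60, 150]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```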

Justification

In the end, in my view the Decision Tree Classifier and K Neighbors Classifier are the two best classifiers for predicting the response of a customer to an offer made in the Starbucks app, whether viewed or completed. The input datasets are somewhat unclean, and the data cleaning and wrangling took most of the time to get them into a format that can be fed into the ML models.

Conclusion

In this section, I summarize my results in relation to the initial problem statement, along with my thoughts on what could be improved in the current approach.

Reflection

I started with an initial data analysis of the three datasets provided by Starbucks — portfolio, profile and transcript. In our initial Exploratory Data Analysis, I found that the Portfolio dataset is clean and easily understandable, with no null values, while the Profile dataset is very unformatted and needs data cleaning before any data modelling with ML models; I found this the most difficult part.

The Transcript dataset is also very tricky, as offer ids are encoded and encrypted for security purposes.

After the data wrangling and cleaning, the most interesting part was testing and comparing the different ML models to predict which offer a customer will respond to most appropriately.

Improvements

In my view, user-user recommendations could be built for the Starbucks app so that customers can see their best offers according to the filters they choose in the app. Before that, though, our initial input file data needs to be handled in a more efficient way, because choosing and then comparing ML models is time-consuming; hyper-parameter tuning could further be used to identify the features to choose when building this in future.

Acknowledgement

All the datasets used in this Capstone Data Science project were provided by Starbucks and are used for my project within the Udacity Data Scientist Nanodegree.

References

Classification Report

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

Grid Search CV

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

Metrics Used

https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/

https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/

K-Fold Cross Validation & Grid Search Hyper Parameter Tuning

https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/
