The Story of My First Data Science Project
My name is Aseem Narula. I am an RPA consultant (in simple words, we make software bots) and an aspiring Data Scientist, and I have started my journey by enrolling in the Udacity Data Scientist Nanodegree. My current role requires me to interact with numerous applications, and I often wonder how much data is stored across the large ecosystem of connected IoT devices, sensors, image processing devices and so on.
Not all of that data is needed to build insights, though. We always need one clear, consolidated dataset to start with, which helps us make better business decisions and present the data with a clear rationale and meaningful visualizations.
For my first Data Science project, I used the Kaggle datasets of Airbnb homes for the cities of Seattle and Boston. Each dataset contains three files: Reviews, Calendars and Listings. I analyzed them in a Jupyter Notebook on the Kaggle platform, following the CRISP-DM methodology in my process for finding solutions.
The Cross Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle. It’s like a set of guardrails to help you plan, organize, and implement your data science (or machine learning) project.
- Business understanding — What does the business need?
- Data understanding — What data do we have / need? Is it clean?
- Data preparation — How do we organize the data for modeling?
- Modeling — What modeling techniques should we apply?
- Evaluation — Which model best meets the business objectives?
- Deployment — How do stakeholders access the results?
I used the Pandas library for analysing and cleaning the datasets, and the Matplotlib and Seaborn libraries for data visualisation.
Business Understanding
This is the first step of any Data Science project: the business objective should be clearly defined so that we know what business problem we are trying to solve.
Airbnb is a business through which users can rent out their properties for a certain period of time at competitive prices in their neighbourhood. The Airbnb website tries to ensure that the number of properties booked is maximized and that house owners charge within permissible ranges, while the company makes a profit based on the number of bookings in a certain area of a particular city.
In our case, I will try to answer the following questions:
#Question-1 How much are Airbnb homes earning in certain areas?
#Question-2 What is the name of the area with the highest average price of a rented home?
#Question-3 How do the rates compare between the two cities?
#Question-4 How many different room types are there, and which type of room is occupied the most?
Data Understanding
Analysis of the Seattle and Boston datasets. In this section, I did the initial analysis of the raw Boston and Seattle Airbnb datasets, carrying out the following checks:
a) Importing and loading the Seattle Listings, Reviews and Calendars CSV files using pandas.
b) Checking the top 5 records of each data frame.
c) Checking the total number of records in each data frame.
d) Checking the datatypes of the columns.
e) Checking the bottom 5 records using the tail() function.
f) Checking the number of unique values in each dataset.
g) Checking the time range covered by each dataset.
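The checks listed above can be sketched in pandas roughly as follows. This is an illustration rather than the notebook's actual code: the inline CSV text stands in for the calendar file, and its rows and column names (`listing_id`, `date`, `available`, `price`) are assumptions made for the example.

```python
import io

import pandas as pd

# Stand-in for the calendar CSV file; rows and column names are illustrative assumptions.
CSV_TEXT = """listing_id,date,available,price
1,2016-01-04,t,$85.00
1,2016-01-05,f,
2,2016-01-04,t,$120.00
2,2017-01-02,t,$125.00
"""

calendar = pd.read_csv(io.StringIO(CSV_TEXT))  # a) load the file

top_rows = calendar.head()            # b) top 5 records
n_records = len(calendar)             # c) total number of records
column_types = calendar.dtypes        # d) datatype of each column
bottom_rows = calendar.tail()         # e) bottom 5 records
unique_counts = calendar.nunique()    # f) unique values per column (NaN excluded)

# g) time range covered by the dataset
dates = pd.to_datetime(calendar["date"])
date_range = (dates.min(), dates.max())

print(n_records)
print(column_types)
print(unique_counts)
print(date_range)
```

With the real files, the `io.StringIO(...)` argument would be replaced by the path to the CSV file.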
Data Preparation
In this phase of CRISP-DM, we prepare the raw datasets for evaluation. Because the Seattle and Boston datasets are large, the first thing to do is to handle the missing values. The preparation consisted of the following steps:
a) Handling Categorical Data.
b) Handling Missing Values.
c) Data cleaning: removing the unnecessary columns from both datasets.
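As a minimal sketch of what these three steps can look like in pandas, consider the example below. The tiny data frame, its column names and the specific choices (dropping rows without a price, one-hot encoding `room_type`, dropping an entirely empty column) are assumptions for illustration; the right strategy depends on the columns actually used in the analysis.

```python
import pandas as pd

# Made-up listings; column names and values are illustrative assumptions.
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "price": ["$85.00", "$1,200.00", None],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "license": [None, None, None],  # a column with no values at all
})

# c) Data cleaning: drop columns that contain no values.
cleaned = listings.dropna(axis=1, how="all")

# b) Missing values: drop rows without a price, then turn "$1,200.00" into 1200.0.
cleaned = cleaned.dropna(subset=["price"]).copy()
cleaned["price"] = (
    cleaned["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# a) Categorical data: one-hot encode the room type.
cleaned = pd.get_dummies(cleaned, columns=["room_type"], prefix="room")

print(cleaned)
```

Dropping rows is only one option; filling missing values with a mean or a mode is another, and each choice changes what the later averages mean.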
I took a quick look at the count of missing values (NaN) in each dataset.
The Seattle Reviews dataset has no major missing values; only the ‘comments’ column has 18 null values in the data frame.
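A per-column count of missing values like the one described here can be obtained with `isnull().sum()`. The sketch below uses a small made-up reviews frame (the rows and the `reviewer_name` column are assumptions), so its counts are not those of the real dataset.

```python
import pandas as pd

# Made-up reviews; only the idea of a 'comments' column comes from the text above.
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2, 3],
    "reviewer_name": ["Ann", "Bob", "Cy", "Di"],
    "comments": ["Great stay", None, "Clean and quiet", None],
})

# Number of missing (NaN/None) values in each column.
missing_counts = reviews.isnull().sum()
print(missing_counts)
```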
Data Evaluation and Results
This is the most important part of a Data Science project: we have to summarize, visualize and present concise business insights, with a meaningful rationale and bullet points, so that stakeholders and audiences from various fields can understand them.
Coming back to the business questions drafted in the Business Understanding section, I will try to answer them with data visualizations built using Python's Seaborn and Matplotlib libraries.
Question-1 How much are Airbnb homes earning in certain areas?
Conclusion - I computed the average rent price of Airbnb homes for each neighbourhood area of Seattle. 10 areas in Seattle have an average rent price greater than $100.
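An average price per neighbourhood of this kind is a group-by followed by a mean. Here is a hedged sketch on made-up listings; the `neighbourhood` and `price` column names and all the prices are assumptions, not values from the Seattle dataset.

```python
import pandas as pd

# Made-up listings with already-cleaned numeric prices (illustrative only).
listings = pd.DataFrame({
    "neighbourhood": ["Magnolia", "Magnolia", "Ballard", "Ballard", "Delridge"],
    "price": [150.0, 190.0, 110.0, 130.0, 80.0],
})

# Average price per neighbourhood, most expensive first.
avg_price = (
    listings.groupby("neighbourhood")["price"]
    .mean()
    .sort_values(ascending=False)
)

# Neighbourhoods whose average price is above $100.
above_100 = avg_price[avg_price > 100]

print(avg_price)
print(len(above_100))
```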
Question-2 What is the name of the area with the highest average price of a rented home?
Conclusion - Magnolia is the costliest neighbourhood area in Seattle, with the highest average price of $166.
For Boston, the bar graph shows that ‘Bay Village’, ‘South Boston Waterfront’ and ‘Leather District’ have similar average rent prices, with ‘Bay Village’ topping the chart at $266.83. The five highest averages are:
Bay Village 266.833333
South Boston Waterfront 254.855422
Leather District 253.600000
Back Bay 236.811258
West End 209.591837
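Given a series of average prices like the Boston figures above, the most expensive area can be read off with `idxmax()`, and the leaders with `nlargest()`. The sketch below simply re-enters the five values listed above; the variable names are my own.

```python
import pandas as pd

# The five Boston averages listed above, keyed by neighbourhood.
boston_avg_price = pd.Series({
    "Bay Village": 266.833333,
    "South Boston Waterfront": 254.855422,
    "Leather District": 253.600000,
    "Back Bay": 236.811258,
    "West End": 209.591837,
})

top_area = boston_avg_price.idxmax()               # label of the highest average
top_three = boston_avg_price.nlargest(3).round(2)  # three highest, rounded to cents

print(top_area)
print(top_three)
```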
Question-3 How do the rates compare between the two cities?
Conclusion - I used heatmaps to compare the rates of Seattle and Boston. The average price for Seattle varies between $100 and $250, whereas the average price for Boston varies between $80 and $160.
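A heatmap is usually drawn from a grid of averages, so the main step is building that grid. The sketch below shows one possible layout, room type by city, on made-up listings; the choice of dimensions, the column names and the prices are all assumptions, and the heatmaps in this project may have been built differently.

```python
import pandas as pd

# Made-up listings from both cities (illustrative values only).
listings = pd.DataFrame({
    "city": ["Seattle", "Seattle", "Seattle", "Boston", "Boston", "Boston"],
    "room_type": [
        "Entire home/apt", "Private room", "Entire home/apt",
        "Entire home/apt", "Private room", "Private room",
    ],
    "price": [150.0, 70.0, 130.0, 220.0, 90.0, 110.0],
})

# Average price for every (room type, city) pair: one cell per pair.
price_grid = listings.pivot_table(
    index="room_type", columns="city", values="price", aggfunc="mean"
)
print(price_grid)

# With Seaborn installed, the grid could then be drawn with:
#   import seaborn as sns
#   sns.heatmap(price_grid, annot=True)
```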
Question-4 How many different room types are there, and which type of room is occupied the most?
Conclusion - In both cities (Seattle and Boston), three room types are available throughout the year. ‘Shared room’ is the most available type, whereas ‘Entire home/apt’ is the most occupied.
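One way to estimate occupancy per room type is to join the calendar to the listings and treat every unavailable day as occupied. The sketch below does this on made-up data; the column names (`id`, `listing_id`, `available` with `t`/`f` flags) are assumptions, and an unavailable day may also mean the host blocked the date, so this is a proxy rather than a true booking rate.

```python
import pandas as pd

# Made-up listings and calendar rows (illustrative only).
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
})
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "available": ["f", "f", "f", "t", "f", "t", "f", "t", "t", "t", "t", "f"],
})

# Number of distinct room types.
room_type_count = listings["room_type"].nunique()

# Attach the room type to every calendar day, then flag unavailable days.
merged = calendar.merge(listings, left_on="listing_id", right_on="id")
merged["occupied"] = merged["available"].eq("f")

# Share of days flagged as occupied, per room type, highest first.
occupancy = (
    merged.groupby("room_type")["occupied"].mean().sort_values(ascending=False)
)

print(room_type_count)
print(occupancy)
```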
Acknowledgement
All the Boston and Seattle datasets used in this Data Science project were provided through Kaggle and were used for my project in the Udacity Data Scientist Nanodegree.