Disaster Response Pipeline Machine Learning Project

Aseem Narula
4 min read · Jun 1, 2021

My name is Aseem Narula, an RPA consultant (….in simple words, we make software bots….) and an aspiring Data Scientist, currently pursuing the Udacity Data Science Nanodegree.

This blog is all about explaining an exciting ML project on Disaster Response, built using an ETL pipeline, Natural Language Processing (NLP), and the Flask web framework.

This project is built using a Machine Learning Pipeline to categorize emergency messages, based on a dataset of real messages that were sent during disaster events.

This project consists of three components:

a) ETL Pipeline
b) ML Pipeline
c) Flask Web App

ETL Pipeline

In this project component, data is extracted from its source, transformed, and the cleaned data is then loaded into a database, from where it is used downstream.

I loaded the messages and categories datasets, merged the two, cleaned the data, and then loaded it into a SQLite database using the SQLAlchemy Python library.
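The load-merge-clean part looks roughly like this (a minimal sketch; the CSV file names and the semicolon-delimited category format are assumptions based on the Figure Eight dataset):

```python
import pandas as pd

# Assumed file names for the Figure Eight CSVs
messages = pd.read_csv('disaster_messages.csv')
categories = pd.read_csv('disaster_categories.csv')

# Merge the two datasets on their shared id column
df = messages.merge(categories, on='id')

# The raw 'categories' column holds strings like "related-1;request-0;...";
# split it into one column per response category
cat = df['categories'].str.split(';', expand=True)
cat.columns = [value.split('-')[0] for value in cat.iloc[0]]
for column in cat.columns:
    # keep only the trailing digit and cast to int (0 or 1)
    cat[column] = cat[column].str[-1].astype(int)

# Swap the raw string column for the expanded category columns
df = pd.concat([df.drop(columns='categories'), cat], axis=1)

# Drop duplicate rows left over from the merge
df = df.drop_duplicates()
```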

During my EDA, I found that the ‘news’ genre has the most messages in our dataset, followed by the ‘direct’ and ‘social’ genres.

Our messages dataset contains 36 different response categories.
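Both facts fall out of a quick look at the merged dataframe from the sketch above (assuming its column layout, with the category columns coming after id/message/original/genre):

```python
# Messages per genre -- 'news' comes out on top, then 'direct' and 'social'
print(df['genre'].value_counts())

# Everything after the id/message/original/genre columns is a
# response category; there are 36 of them
print(len(df.columns[4:]))  # 36
```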

Once the data is cleaned, I load it into the SQLite database.
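With SQLAlchemy, writing the cleaned dataframe out is a one-liner (the database file and table names here are illustrative):

```python
from sqlalchemy import create_engine

# Persist the cleaned data to SQLite for the ML pipeline to consume
engine = create_engine('sqlite:///DisasterResponse.db')
df.to_sql('Messages', engine, index=False, if_exists='replace')
```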

ML Pipeline

In this step, we consume the cleaned and transformed data so that it can be fed into the ML model.

In a Python script, train_classifier.py, I wrote a machine learning pipeline that:

  • Loads data from the SQLite database
  • Splits the dataset into training and test sets
  • Builds a text processing and machine learning pipeline

Note — This is the most important step in building the ML pipeline: the data is fed into the system and the model is trained using scikit-learn functions.
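A minimal sketch of the load/split/build steps is below. The database and table names follow the ETL sketch above, the tokenizer uses NLTK, and a RandomForestClassifier inside a MultiOutputClassifier is one common choice for the 36-way multi-label output (not necessarily the exact estimator used in the project):

```python
import re

import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

# Load the cleaned data written by the ETL pipeline
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('Messages', engine)
X = df['message']
Y = df.iloc[:, 4:]  # the 36 response category columns

lemmatizer = WordNetLemmatizer()

def tokenize(text):
    """Normalize, tokenize, and lemmatize a raw message.
    Requires nltk.download('punkt') and nltk.download('wordnet')."""
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text)]

# Text processing + multi-output classifier in one scikit-learn pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
```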

  • Trains and tunes a model using GridSearchCV

This is a cross-validation step where the hyperparameters are tuned to find the best model. I have set n_jobs = -1 so that all available cores are used in parallel to maximize speed.
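A sketch of that tuning step (the parameter grid here is illustrative; the project’s actual grid may differ):

```python
from sklearn.model_selection import GridSearchCV

# Illustrative hyperparameter grid for the RandomForest inside the pipeline
parameters = {
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__min_samples_split': [2, 4],
}

# n_jobs=-1 fits the candidate models in parallel on all available cores
cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1)
cv.fit(X_train, Y_train)
print(cv.best_params_)
```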

  • Outputs results on the test set.
  • Exports the final model as a pickle file.
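Evaluation and export can look like this (continuing from the tuned model above; the pickle file name is an assumption):

```python
import pickle

from sklearn.metrics import classification_report

# Report precision/recall/f1 for each of the 36 response categories
Y_pred = cv.predict(X_test)
for i, column in enumerate(Y_test.columns):
    print(column)
    print(classification_report(Y_test[column], Y_pred[:, i]))

# Export the tuned model for the Flask app to load
with open('classifier.pkl', 'wb') as f:
    pickle.dump(cv.best_estimator_, f)
```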

Flask Web App

A front-end web app that takes user input in the form of text and classifies it into the 36 response categories.
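At its core, the app just loads the pickled model and wires up one classification route. A minimal sketch follows; the route, template name, and port are assumptions, and the tokenize function from training must be importable here so the pipeline can be unpickled:

```python
import pickle

import pandas as pd
from flask import Flask, render_template, request
from sqlalchemy import create_engine

app = Flask(__name__)

# Load the cleaned data (for the category names) and the trained model;
# file and table names follow the earlier sketches
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('Messages', engine)
model = pickle.load(open('classifier.pkl', 'rb'))

@app.route('/go')
def go():
    # Classify the submitted message into the 36 response categories
    query = request.args.get('query', '')
    labels = model.predict([query])[0]
    results = dict(zip(df.columns[4:], labels))
    return render_template('go.html', query=query,
                           classification_result=results)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=3001)
```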

A few screenshots from the web app:

Acknowledgement

All the Messages and Categories datasets used in this Data Science project were provided by Figure Eight in collaboration with Udacity, and are used here for my Udacity Data Scientist Nanodegree project.
