Aseem Narula
8 min read · Dec 29, 2023


Udacity Data Engineer Nanodegree — Capstone Project — US Airports and Immigration- Data Integration and ETL Data Pipeline

My name is Aseem Narula, and I am currently working as a Data Engineer at NatWest Group. I have undertaken the Data Engineer Nanodegree, and in this article I will be talking about my capstone project — US Airports and Immigration - Data Integration and ETL Data Pipeline.

Introduction

Project: US Airports and Immigration- Data Integration and ETL Data Pipeline

Project Summary

The main aim of this project is to integrate the I94 immigration data, world temperature data, and US demographic data through ETL data pipelines, building a data warehouse with fact and dimension tables from which to derive data insights for business reporting.

In the data model, we use a STAR schema, combining the US immigration data with city demographics and with dimensions built from the airports, countries, and temperature data. This layout is optimized for OLAP-style queries, i.e. slicing and dicing, should we wish to expand the project further and scale up with additional resources.

We will first start with a brief summary of the ETL steps I will perform to complete this project, keeping the project rubric in mind.

The steps are as follows —

Step 1: Scope the Project and Gather Data

Since the scope of this US immigration data project depends heavily on the data itself, scoping and data gathering happen simultaneously. In this step, I will:

  • Identify and gather the data for my capstone project.
  • Explain what end use cases I would like to prepare the data for (e.g., analytics table, app back-end, source-of-truth database, etc.)

The following datasets are included in the project workspace.

I94 Immigration Data: This data comes from the US National Tourism and Trade Office. Each report contains international visitor arrival statistics by world regions and select countries (including top 20), type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries). https://travel.trade.gov/research/reports/i94/historical/2016.html

World Temperature Data: This dataset came from Kaggle. https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data

U.S. City Demographic Data: This data comes from OpenSoft. https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/

Airport Code Table: This is a simple table of airport codes and corresponding cities. https://datahub.io/core/airport-codes#data

The source files are as follows —

a) airport-codes_csv.csv
b) immigration_data_sample.csv
c) us-cities-demographics.csv
d) I94_SAS_Labels_Descriptions.SAS
e) Temperature Data

We can access the temperature data in a folder with the following path: ../../data2/. There's just one file in that folder, called GlobalLandTemperaturesByCity.csv. I have read the file into a pandas data frame using this sample code.

import pandas as pd

fname = '../../data2/GlobalLandTemperaturesByCity.csv'
df = pd.read_csv(fname)

Step 2: Explore and Assess the Data

In this step, we will assess the datasets for data quality issues such as missing values and duplicate data. Here, we read each data source into its own dataframe and then preview the data using the head() command, as sketched after the dataset list below.

Immigration Dataset

Global Temperature by City Dataset

US City Demographics Dataset

Airports Codes Dataset
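To make this step concrete, here is a minimal sketch of reading and previewing the four datasets, assuming the file names listed earlier sit in the project workspace; the paths and the semicolon separator for the OpenDataSoft demographics export are assumptions to adjust for your environment.

import pandas as pd

# Read each data source into its own dataframe; the demographics export from
# OpenDataSoft is typically semicolon-separated, hence sep=';' (an assumption)
df_immigration = pd.read_csv('immigration_data_sample.csv')
df_temperature = pd.read_csv('../../data2/GlobalLandTemperaturesByCity.csv')
df_demographics = pd.read_csv('us-cities-demographics.csv', sep=';')
df_airport_codes = pd.read_csv('airport-codes_csv.csv')

# Preview each dataframe and surface obvious quality issues
for name, df in [('immigration', df_immigration),
                 ('temperature', df_temperature),
                 ('demographics', df_demographics),
                 ('airport codes', df_airport_codes)]:
    print(name, df.shape)
    print(df.head())
    print(df.isnull().sum().sort_values(ascending=False).head())  # most-missing columns
    print('duplicate rows:', df.duplicated().sum())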

Step 3: Clean the Datasets and Define the Data Model for the STAR Schema, with Data Dictionary Details for Each Table

In this third step of the ETL journey, we clean the datasets, define the data model for the STAR schema with data dictionary details for each table, and pre-process the immigration data. We then create the I94 fact table surrounded by the various dimension tables to fulfil the data warehousing requirement, supporting meaningful insights, data visualization, and business reporting for stakeholders and for strategic data-driven decisions.
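As a rough illustration of the cleaning applied to the immigration data (the exact rules live in the project notebook), here is a minimal pandas sketch. The columns cicid, i94port and arrdate come from the I94 dataset, and arrdate is assumed to be a SAS numeric date, i.e. days since 1960-01-01, which is how the I94 files encode arrival dates.

import pandas as pd

df_immigration = pd.read_csv('immigration_data_sample.csv')

# Drop exact duplicates and rows missing the key identifiers
df_immigration = df_immigration.drop_duplicates()
df_immigration = df_immigration.dropna(subset=['cicid', 'i94port', 'arrdate'])

# Convert the SAS numeric arrival date (days since 1960-01-01) into a datetime
# so it can feed the arrival date dimension table
df_immigration['arrival_date'] = (
    pd.Timestamp('1960-01-01') + pd.to_timedelta(df_immigration['arrdate'], unit='D')
)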


Conceptual Data Model

Our Udacity capstone project, US Airports and Immigration Data Integration and ETL Data Pipeline, uses a STAR schema data model with a central immigration FACT table surrounded by four DIMENSION tables: the arrival date table, the country temperature table, the demographics table, and the airports table.

Why choose this data model?

The main reason for choosing this data model is its simplicity and the ease of building an ETL pipeline around it. It helps the business and organization build an efficient data warehouse and a sustainable business intelligence system that supports their day-to-day, data-driven decision making around the US airports and immigration (I94) data. It also allows global temperature variation to be gathered and aggregated for each state, city, and country, so the data can be analysed for insights into how airports and immigration relate to climate change.
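To make the model concrete, here is a simplified PySpark sketch of how the central fact table and the arrival date dimension could be carved out of the immigration data; the column selection is illustrative rather than the project's full data dictionary.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('us_immigration_capstone').getOrCreate()

df_immigration = spark.read.csv('immigration_data_sample.csv', header=True, inferSchema=True)

# Central fact table: one row per I94 arrival record
fact_immigration_table = df_immigration.select(
    'cicid', 'i94yr', 'i94mon', 'i94port', 'i94addr', 'arrdate', 'i94visa'
)

# Arrival date dimension, derived from the SAS numeric arrdate (days since 1960-01-01)
dimension_arrival_date_table = (
    fact_immigration_table
    .select('arrdate')
    .dropDuplicates()
    .withColumn('arrival_date',
                F.expr("date_add(to_date('1960-01-01'), cast(arrdate as int))"))
    .withColumn('year', F.year('arrival_date'))
    .withColumn('month', F.month('arrival_date'))
)

The demographics, airports, and country temperature dimensions are built in the same way from their respective dataframes.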

Step 4: Run ETL to Model the Data

The steps necessary to pipeline the data into the chosen data model are as follows —

1) All of the data sources listed above are loaded into their respective dataframes using Pandas.

2) Data assessment and exploration were done using Pandas functions such as info(), shape, and describe() to get descriptive statistics for the columns in each dataframe.

3) Pre-processing data quality checks were performed on all the dataframes, e.g. finding duplicates, NULLs, and NaNs, and checking row counts and data types.

4) A data dictionary was prepared for all the data sources.

5) The conceptual data model (STAR schema) was visualized, clearly depicting the central FACT table and the surrounding DIMENSION tables for better understanding.

6) Cleaning steps were performed on the dataframes to load clean, tidy data into the respective tables, i.e.

fact_immigration_table    
dimension_arrival_date_table
dimension_airports_table
dimension_demographics_table
dimension_countries_temperature_table

7) Using Spark SQL, the fact and dimension tables are joined to produce the datasets needed for business insights and data visualization, as sketched below.
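Continuing from the sketch in the data model section, the fact and arrival date dimension tables can be registered as temporary views and joined with Spark SQL; the aggregation below is just one illustrative business question, not the project's exact query.

fact_immigration_table.createOrReplaceTempView('fact_immigration')
dimension_arrival_date_table.createOrReplaceTempView('dim_arrival_date')

# Example insight: arrivals per port of entry, broken down by year and month
arrivals_by_month = spark.sql("""
    SELECT d.year,
           d.month,
           f.i94port AS port_of_entry,
           COUNT(*)  AS total_arrivals
    FROM fact_immigration f
    JOIN dim_arrival_date d
      ON f.arrdate = d.arrdate
    GROUP BY d.year, d.month, f.i94port
    ORDER BY total_arrivals DESC
""")
arrivals_by_month.show(10)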

Step 5: Final Capstone Project Writeup and Business Justification

In our US Airports and Immigration Data Integration and ETL Data Pipeline Udacity capstone project, we used the tools and technologies listed below.

Tools and Technologies-

a) Python Pandas for Data Exploration, Data Wrangling and Data Cleaning.

b) PySpark for large-scale data processing and manipulation, transforming the raw data into a meaningful, normalized dataset.

c) ETL — the Extract, Transform & Load approach from data warehousing (DWH) techniques.

d) GitHub — for storing and maintaining the code repository for CI/CD and DevOps purposes.

Scalability with PySpark:

  • PySpark is the Python API for Apache Spark, a distributed data processing engine. It enables parallel processing and can handle large-scale data efficiently.
  • The US Immigration dataset is likely to be vast, and PySpark’s distributed computing capabilities allow for parallelization of data processing tasks, leading to faster ETL operations.

Ease of Data Handling with Pandas:

  • While PySpark is excellent for distributed processing, Python Pandas is a versatile and user-friendly library for data manipulation and analysis.
  • Use PySpark for the initial data extraction and transformation tasks at scale, and then switch to Pandas for more complex data manipulations, exploratory data analysis (EDA), and feature engineering in a more interactive and expressive manner.
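As a small sketch of that handoff, assuming the temperature file path from earlier, the heavy aggregation is done in PySpark and only the reduced result is pulled into Pandas for interactive EDA.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('eda_handoff').getOrCreate()

df_temperature = spark.read.csv('../../data2/GlobalLandTemperaturesByCity.csv',
                                header=True, inferSchema=True)

# Reduce the data in Spark first: average temperature per country
avg_temp_by_country = (
    df_temperature
    .groupBy('Country')
    .agg(F.avg('AverageTemperature').alias('avg_temperature'))
)

# Hand the small aggregated result to Pandas for plotting and exploratory analysis
avg_temp_pdf = avg_temp_by_country.toPandas()
print(avg_temp_pdf.sort_values('avg_temperature', ascending=False).head())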

Parallel Processing and Performance:

  • PySpark’s ability to distribute computation across a cluster of machines enables parallel processing, which is crucial for handling the large volume of data in US Immigration datasets.
  • Leveraging the performance optimizations of PySpark ensures that ETL jobs are executed efficiently, minimizing processing time and resource utilization.

Data Update Frequency —

Data is refreshed as and when it is sourced from the various upstream systems; for example, the global temperature data is updated every month and is then pulled via API. When scheduled in Airflow as a DAG, the workflow should run on a daily basis so the latest version of the data is loaded into the database.
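As a minimal Airflow sketch of that daily schedule, with run_etl_pipeline standing in as a placeholder for the project's actual ETL entry point rather than code from the repository:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl_pipeline():
    # Placeholder: extract, transform, and load the immigration, temperature,
    # demographics, and airport datasets
    pass

with DAG(
    dag_id='us_immigration_etl',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',   # could be pinned to '0 7 * * *' for a 7 am run
    catchup=False,
) as dag:
    etl_task = PythonOperator(
        task_id='run_etl_pipeline',
        python_callable=run_etl_pipeline,
    )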

Future Scenarios-

a) If the data was increased by 100x — In this scenario, we would deploy the project to an AWS EMR cluster to scale up the compute needed to run the project, and S3 would be used for data and parquet file storage. AWS can easily handle scalability when the data increases by 100x.

b) If the pipelines were run on a daily basis by 7 am — In this scenario, we would use Apache Airflow to build and schedule a DAG workflow that runs the ETL pipeline every day before 7 am.

c) If the database needed to be accessed by 100+ people — In this business scenario, we would use the AWS ecosystem, e.g. IAM roles and security groups, to grant access to the 100+ people, and the target database would be built in the cloud on an AWS Redshift cluster, whose massively parallel processing can serve many users running SQL queries concurrently.

Reference Links -

1. Python Pandas Functions — https://pandas.pydata.org/docs/reference/general_functions.html

2. Pandas CSV Read — https://www.w3schools.com/python/pandas/pandas_csv.asp

3. Pandas Head Function — https://www.w3schools.com/python/pandas/ref_df_head.asp

4. Splitting the Code Variable in SAS format — https://communities.sas.com/t5/SAS-Programming/Splitting-a-city-and-zip-code-into-2-variables/td-p/435156

5. Descriptive Statistics Analysis — https://datatofish.com/descriptive-statistics-pandas/

6. STAR Schema for Data Modeling — https://www.geeksforgeeks.org/star-schema-in-data-warehouse-modeling/

7. PySpark SQL Function — https://sparkbyexamples.com/pyspark/pyspark-sql-with-examples/

8. Data Quality Checks in Pandas — https://www.telm.ai/blog/9-data-quality-checks-you-can-do-with-pandas/

GitHub Link — https://github.com/aseemnarula1/Udacity_Data_Engineer_Nanodegree_Capstone_Project

Acknowledgement

All the US immigration and airports datasets used in this capstone project were provided through Udacity and are used for my Udacity Data Engineer Nanodegree project. Reference links are provided where documentation was consulted.
