

My undergraduate training in Economics and my ongoing graduate studies in Data Analytics at Carnegie Mellon University have helped me develop an interdisciplinary approach to data-driven decision-making. I have a deep interest in statistical methods, data analytics, machine learning, data visualization, and human-computer interaction. At school, I am learning tools such as Python, SQL, R, Excel, and Stata that are helping me pave a path into the field of Data Science (beyond the buzzword!).

At CMU, I am enrolled in a highly selective, STEM-certified program with robust coursework in data mining & machine learning, statistics & modeling, and computer programming. I strongly believe in leveraging data science techniques to solve some of the most challenging problems of our time.

So far, I have become proficient in three programming languages within a year (while continuously battling impostor syndrome), worked with relational and non-relational databases to understand different software architectures, applied decision-science techniques such as simulation and optimization to complex, large-scale decision-making problems in policy and business, and built a computational foundation in machine learning!

Data Mining & Machine Learning


Predictive and Descriptive Analysis using Data Mining Techniques

As I continue to work with large datasets, the tools covered in Data Mining and Machine Learning have helped me apply predictive and descriptive analytical methods in Python and R to solve business and policy challenges. Some projects that I worked on include:

  1. Exploratory Analysis of Bikeshare Data from the Capital Bikeshare System in Washington, DC: Conducted an exploratory analysis to identify qualitative predictors, fit regression models, address collinearity issues, and explore non-linearities using several data visualization libraries, including ggplot2
  2. Splines and Degree-of-Freedom Selection: Worked with manual knot placement for regression splines and degree-of-freedom selection for smoothing splines
  3. Regression Analysis on a Life Expectancy Dataset: Built regression models with multiple functions, calculated cross-validation error, and split data into training and test sets
  4. Classification Techniques for Targeted Marketing Campaigns: Used logistic regression for a two-class classification problem, classifying people into adult and non-adult categories; also worked with other classification techniques such as decision trees, linear discriminant analysis, quadratic discriminant analysis, and Naive Bayes
  5. Advanced Classification Techniques for Targeted Marketing Campaigns: Used the random forest algorithm (an ensemble learning method), which aggregates many decision trees, to predict and classify observations in a dataset
  6. Predicting 2018 Flight Delays from a 2006 US Flight Dataset: Developed a model to help people picking up passengers at airports anticipate flight delays and plan pickups accordingly.
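To illustrate the two-class setup from the targeted-marketing project above, here is a minimal from-scratch logistic regression that classifies adult vs. non-adult from age. The data, threshold, and training settings are an invented sketch, not the course dataset:

```python
import numpy as np

# Synthetic illustration: ages with a known adult/non-adult cutoff at 18.
rng = np.random.default_rng(0)
ages = rng.uniform(5, 60, size=200)
labels = (ages >= 18).astype(float)   # 1 = adult, 0 = non-adult

# Standardize the single feature so gradient descent converges quickly.
x = (ages - ages.mean()) / ages.std()
w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # sigmoid
    w -= lr * np.mean((p - labels) * x)      # gradient of the log-loss
    b -= lr * np.mean(p - labels)

def predict_adult(age):
    z = w * (age - ages.mean()) / ages.std() + b
    return 1.0 / (1.0 + np.exp(-z)) >= 0.5

print(predict_adult(30), predict_adult(10))
```

In the course projects this kind of model would typically come from a library such as scikit-learn or R's `glm`; the hand-rolled loop just makes the mechanics visible.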

Mathematical and Computational Foundations of Machine Learning

Going deeper into machine learning, I have been developing a robust mathematical foundation in ML through proof techniques from discrete mathematics (contrapositive, truth tables, counterexamples, and contradiction). On the computational end, I have worked with the perceptron algorithm and its mistake bound, computational complexity (Big O notation), and dynamic programming with Markov chains and recursion. Simultaneously, I have also worked with data structures, including tree traversals (depth-first and breadth-first search), stacks and queues (LIFO vs. FIFO), graphs, and Bayes nets.
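The perceptron's mistake-driven update rule mentioned above is compact enough to sketch directly. The four points and labels below are an invented, linearly separable toy set:

```python
import numpy as np

# Toy linearly separable data; labels are in {+1, -1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
b = 0.0
mistakes = 0
for _ in range(10):                      # passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:       # misclassified: update toward it
            w += yi * xi
            b += yi
            mistakes += 1

print(w, b, mistakes)
```

On separable data the algorithm converges after a bounded number of mistakes (the mistake bound depends on the margin and the data radius), which is exactly the quantity tracked by `mistakes` here.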

Statistics & Modeling


Optimization and Modeling Framework

In Fall 2021, I took an advanced Decision Analytics course that introduced modeling frameworks and computational tools for complex, ill-defined, and large-scale decision-making problems that arise in policy and business. It covered advanced methods of decision-making: (large-scale) deterministic optimization, stochastic/robust optimization, and sequential decision-making, using the Gurobi solver in Python. Case studies were drawn from a variety of real-world settings in transportation, energy, information systems, health care, supply chain management, etc. Here are the case studies and the final project that I completed during this class.

Portfolio

  1. Integrated Manufacturing & Inventory Planning using Linear Optimization
  2. Emergency Response using Integer Optimization
  3. Traveling Salesman Problem using Integer Optimization
  4. World Health Organization’s Nutrition Policy using Multi-Objective Optimization
  5. Aircraft Configuration using Stochastic Optimization
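As a flavor of the emergency-response case study above, here is a toy integer-optimization problem: open the fewest station sites so that every district is covered. The coverage table is invented, and the brute-force enumeration stands in for the Gurobi formulations used in class, which scale far beyond this:

```python
from itertools import combinations

# Hypothetical data: station -> districts it can reach in time.
covers = {
    "A": {1, 2},
    "B": {2, 3, 4},
    "C": {4, 5},
    "D": {1, 5},
}
districts = {1, 2, 3, 4, 5}

def min_cover(covers, districts):
    sites = list(covers)
    for k in range(1, len(sites) + 1):            # smallest subsets first
        for combo in combinations(sites, k):
            covered = set().union(*(covers[s] for s in combo))
            if covered >= districts:              # all districts reached
                return set(combo)
    return None

print(min_cover(covers, districts))
```

A real solver expresses the same thing with binary open/close variables and coverage constraints, then minimizes the number of opened sites.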

Final Project

Brief Description: “To prepare for large public health emergencies, the Allegheny County is prioritizing the list of Points of Dispense (PODs) to open, which are used to distribute essential supplies and medicines to the public during such emergencies. Your goal for this project is to formulate a facility location model, and optimally select a list of PODs from the candidate sites. You need to compare two different formulations: minimizing the total/weighted travel distance of the population, and minimizing the maximum travel distance for anyone.”
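The two POD formulations in the brief can disagree, which is the point of comparing them. Here is a tiny invented 1-D example (population blocks, candidate sites, and weights are all hypothetical; the real project used a proper facility-location model rather than enumeration):

```python
from itertools import combinations

# Hypothetical 1-D data: position -> population weight, candidate sites.
blocks = {0: 100, 4: 1, 10: 1}
sites = [0, 4, 10]
p = 1                                     # number of PODs to open

def eval_choice(open_sites):
    d = {b: min(abs(b - s) for s in open_sites) for b in blocks}
    total = sum(blocks[b] * d[b] for b in blocks)   # weighted total distance
    worst = max(d.values())                         # max distance for anyone
    return total, worst

choices = list(combinations(sites, p))
best_total = min(choices, key=lambda c: eval_choice(c)[0])
best_worst = min(choices, key=lambda c: eval_choice(c)[1])
print(best_total, best_worst)
```

Here the min-total objective opens the site next to the large population block, while the min-max objective opens the central site so no one travels far, illustrating the equity/efficiency trade-off between the two formulations.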

AB Testing

To extend my learning of Randomized Controlled Trials (RCTs) from economic policy issues to tech products, I took an A/B testing class in Fall 2021 that focused on frameworks for measuring causal effects across different industries. The course used real-world datasets drawn from research performed at the Heinz College in entertainment and education. Significant effort was placed on understanding how to design randomized experiments (a.k.a. A/B tests) to measure causal effects. The assignments and project below also leverage tools for analyzing data from observational studies where randomization cannot be implemented.

The concepts and tools discussed in this course are general in nature and can be applied in different settings. The following concepts have been covered in the assignments below:

  1. Randomized Controlled Experiments
  2. Time and Individual Fixed Effects
  3. Instrumental Variables
  4. Natural Experiments
  5. Difference-in-Differences
  6. Propensity Score Matching
  7. Compliance in Experiments
  8. Heterogeneous Effects
  9. Interference in Networked Experiments

Portfolio

  1. Time Dummies, Fixed Effects & First-Differences
  2. Propensity Score Matching

Final Project

Brief Description: Using data from a Qualtrics survey our team ran in Pittsburgh, the paper explored the causal effect of including scooter rides in the university transportation fee on weekly scooter usage. Causal inferences were drawn by eliminating selection bias through random assignment of experimental subjects to treatment and control groups. The analysis also explored heterogeneity in the average causal effect across groups within the sample, using the key barriers of distance, safety, and income as moderators. The project consisted of the following parts: experiment design & causal question of interest, survey design & execution, data & analysis, and recommendations.
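The core analysis step in such a randomized design is a simple difference in group means: under random assignment it is an unbiased estimate of the average treatment effect. The usage numbers below are invented, not the project's survey data:

```python
import statistics

# Hypothetical weekly scooter rides per respondent.
treatment = [4, 5, 6, 5, 7, 6]   # fee includes scooter rides
control = [3, 4, 3, 5, 4, 3]     # fee unchanged

# With random assignment, ATE = mean(treated) - mean(control).
ate = statistics.mean(treatment) - statistics.mean(control)
print(round(ate, 2))
```

Heterogeneous effects are then probed by computing the same difference within subgroups (e.g., by distance, safety concern, or income bracket) or by interacting the treatment indicator with those moderators in a regression.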

Computer Programming


Spatial Data Analytics

I took a Geographic Information Systems (GIS) class covering the storage, retrieval, and visualization of geographically referenced data, as well as the design and analysis of spatial information. By the end of the course, I had developed an understanding of the world's quickly growing spatial data infrastructure and how to put it to work producing location-based information, and I had identified relevant spatial characteristics of diverse application areas, enabling me to integrate spatial thinking and GIS analysis into my career.

Some of the concepts that I learned include:

  1. Geographic concepts (world coordinate systems, map scale/projections, sea level/elevation)
  2. Government-provided map infrastructure, geodatabases, geodesign
  3. Spatial data processing (clipping, merging, appending, joining, dissolving)
  4. Spatial analysis (proximity analysis, risk surface, site suitability, spatial data mining)
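A basic building block of the proximity analysis listed above is great-circle distance between two latitude/longitude points. Here is a standard haversine implementation; the Pittsburgh coordinates in the example are approximate and for illustration only:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km via the haversine formula."""
    r = 6371.0                                    # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Roughly downtown Pittsburgh to the CMU campus, a few km apart.
print(round(haversine_km(40.4406, -79.9959, 40.4433, -79.9436), 1))
```

GIS packages wrap this (and proper projected-coordinate distance) for you, but knowing the formula helps when sanity-checking buffer and nearest-facility results.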

Portfolio

  1. Mapping Health Data
  2. Compare the Walkability of Pittsburgh Neighborhoods
  3. Compare Serious Violent Crime with Poverty in Pittsburgh
  4. Build a Study Area: Supporting Geodatabase and Map for a Rapidly Growing Texas Metropolitan Area
  5. Perform a Cluster Analysis of Tornadoes

Advanced Programming in Python

My journey in the world of data science started with foundational and advanced courses in Python in my first semester. In this course, I gathered data from various sources, including web scraping, web APIs, CSV and other structured data files, and databases; cleaned the data; used the Pandas library for data analysis; applied regular expressions and other string-processing methods; practiced classes and object-oriented programming; and built a real-world software application.
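Two of those pieces, reading structured CSV data and filtering it with regular expressions, combine naturally. This is a standard-library sketch with invented records (the course work used Pandas and real data sources):

```python
import csv
import io
import re

# Hypothetical CSV data with a mix of clean and malformed contacts.
raw = """name,contact
Ada,ada@example.com
Alan,alan at example dot com
Grace,grace@example.org
"""

email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
valid = []
for row in csv.DictReader(io.StringIO(raw)):
    if email_re.fullmatch(row["contact"]):   # keep well-formed emails only
        valid.append(row["name"])

print(valid)
```

The same cleansing step in Pandas would be a `str.fullmatch` filter on a column, but the logic is identical.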

Portfolio

  1. Data Analysis Using Pandas
  2. Data Processing Using SQLite
  3. Regular Expressions and Web APIs
  4. Data Visualization and Object Oriented Programming

Final Project

Brief Description: The Symptom Checker ChatBot - The objective of our project was to create an application in Python that functions like a chatbot, taking user input (based on symptoms) via a series of questions to:

  1. Obtain a list of diseases most commonly associated with the user’s symptoms from reliable, vetted online sources (at most three web sources, chosen for reliability)
  2. Allow the user to retrieve more information, such as causes, treatment, a list of over-the-counter medications to alleviate the symptoms, and nearby health facilities based on the user’s location
  3. Provide a reference list of doctors or hospitals based on the user’s location for further review.
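The chatbot's core lookup step can be sketched as a ranking of conditions by how many reported symptoms they match. The symptom-condition table below is invented; the real project pulled this information from vetted online sources:

```python
# Hypothetical symptom -> associated conditions table (illustration only,
# not medical advice and not the project's vetted sources).
knowledge = {
    "fever": ["flu", "common cold", "covid-19"],
    "cough": ["common cold", "covid-19", "bronchitis"],
    "headache": ["migraine", "flu", "tension headache"],
}

def rank_conditions(symptoms):
    """Return conditions ordered by symptom matches (ties alphabetical)."""
    counts = {}
    for s in symptoms:
        for cond in knowledge.get(s.lower(), []):
            counts[cond] = counts.get(cond, 0) + 1
    return sorted(counts, key=lambda c: (-counts[c], c))

print(rank_conditions(["fever", "cough"]))
```

The question-and-answer flow then wraps this lookup, asking follow-up questions to narrow the symptom list before fetching treatment and facility details.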

Programming in R for Analytics

Through this course, I learned to use RStudio, read R documentation, and write R scripts; import, export, and manipulate data; produce statistical summaries of continuous and categorical data; produce basic graphics using standard functions and more advanced graphics with the ggplot2 library; perform common hypothesis tests and run simple regression models; and produce reports of statistical analyses in R Markdown/R Notebooks.

Portfolio

  1. Tabular Summaries and Data Cleaning
  2. Data Visualization
  3. Statistical Tests

Final Project

Brief Description: “Sex-related differences: Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?”

The main question of interest was answered through the following steps:

  1. Data processing and summarization: Insightful graphical and tabular summaries of the data
  2. Methodology: Dealing with missing values and topcoded variables; exploring trends and correlations; variable selection
  3. Findings: Tabular summaries; graphical summaries; regression and interpretation of coefficients; assessment of statistical significance
  4. Discussion: Potential confounders; model fit limitations; confidence in results for policy makers
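The regression-and-interpretation step above can be sketched with ordinary least squares on synthetic data. The coefficients below are invented to make the mechanics visible, not the project's findings (and the course itself did this in R rather than Python):

```python
import numpy as np

# Synthetic data: income driven by education and a sex indicator.
rng = np.random.default_rng(1)
n = 500
female = rng.integers(0, 2, n)                  # hypothetical indicator
educ = rng.uniform(10, 20, n)                   # years of education
income = 20 + 2.0 * educ - 5.0 * female + rng.normal(0, 1, n)

# OLS: regress income on an intercept, the indicator, and education.
X = np.column_stack([np.ones(n), female, educ])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
print(beta)   # [intercept, income gap, return to a year of education]
```

Interpreting the coefficient on the indicator as the income difference holding education fixed is exactly the kind of reading the project's findings section required, alongside checks for confounders and statistical significance.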

Hope you enjoyed reading about my journey! If you want to learn more, feel free to reach out on LinkedIn.