My undergraduate training in Economics, coupled with my ongoing graduate studies in Data Analytics at Carnegie Mellon University, has helped me develop an interdisciplinary approach to making data-driven decisions. I have a strong interest in statistical methods, data analytics, machine learning, data visualization, and human-computer interaction. At school, I am learning tools such as Python, SQL, R, Excel, and Stata that are helping me pave a path into the field of Data Science (beyond the buzzword!).
At CMU, I am enrolled in a highly selective STEM-certified program that includes robust coursework in data mining & machine learning, statistics & modeling, and computer programming. I strongly believe in leveraging data science techniques to solve some of the most challenging problems of our time.
So far, I have become proficient in three programming languages within a year (while continuously battling impostor syndrome), worked with relational and non-relational databases to understand different software architectures, applied decision science techniques such as simulation and optimization to complex, large-scale decision-making problems in policy and business, and built a computational foundation in Machine Learning!
Data Mining & Machine Learning
Predictive and Descriptive Analysis using Data Mining Techniques
As I continue to work with large datasets, the tools covered in Data Mining and Machine Learning have helped me carry out predictive and descriptive analytical tasks in Python and R to solve business and policy challenges. Some projects that I worked on include:
- Exploratory analysis using bikeshare data from the Capital Bikeshare system in Washington DC: Conducted an exploratory analysis to identify qualitative predictors, fit a regression model, deal with collinearity issues, and explore non-linearities, using several data visualization libraries including ggplot2
- Splines and Degree of Freedom Selection: Worked with manual placement of knots for smoothing splines
- Regression Analysis on Life Expectancy Dataset: Wrote functions to build regression models, split the data into training and test sets, and calculate cross-validation error
- Classification Techniques for Targeted Marketing Campaigns: Used logistic regression for a two-class classification problem that sorts people into adult and non-adult categories; also worked with other classification techniques such as Decision Trees, Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Naive Bayes
- Advanced Classification Techniques for Targeted Marketing Campaigns: Used the random forest algorithm (an ensemble learning method) to combine decision trees that predict and classify observations in a dataset
- Predict Flight Delays in 2018 from 2006 US Flight Dataset: Developed a model that helps people picking up passengers at airports predict flight delays and plan their pickup accordingly
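To make the cross-validation idea from the regression project concrete, here is a minimal sketch in plain Python. It is not the project code: the data is synthetic, and the helper names `fit_line` and `k_fold_cv_mse` are my own illustrative choices. It fits a one-predictor least-squares line on the training folds and averages the held-out mean squared error.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def k_fold_cv_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error across k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Synthetic data (an assumption for illustration): y = 3 + 2x plus unit noise.
rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1.0) for x in xs]
cv_error = k_fold_cv_mse(xs, ys, k=5)
print(f"5-fold CV mean squared error: {cv_error:.3f}")
```

Because each fold is scored only on points the model never saw during fitting, the averaged error is a fairer estimate of out-of-sample performance than the training error.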
Mathematical and Computational Foundations of Machine Learning
Going deeper into Machine Learning, I have been developing a robust mathematical foundation in ML through several proof techniques from discrete mathematics (contrapositive, truth tables, counterexamples, and contradiction). On the computational end, I have worked with the perceptron algorithm (training on data to minimize classification errors, and the perceptron mistake bound), computational complexity (Big O notation), and dynamic programming with Markov chains and recursion. I have also worked with data structures and algorithms including tree traversals (depth-first search and breadth-first search), stacks and queues (LIFO vs. FIFO), graphs, and Bayes nets.
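As an illustration of the perceptron algorithm mentioned above, here is a small sketch in plain Python. The four data points are made up for the example, and the function names are my own. It counts every update, which is the quantity the mistake bound limits when the data is linearly separable.

```python
def train_perceptron(points, labels, max_epochs=100):
    """Classic perceptron updates; labels must be +1 or -1.

    Returns the weights, the bias, and the total number of mistakes (updates).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    mistakes = 0
    for _ in range(max_epochs):
        errors_this_epoch = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
                errors_this_epoch += 1
        if errors_this_epoch == 0:  # a full clean pass means convergence
            break
    return w, b, mistakes


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy, linearly separable data (hypothetical values for illustration).
points = [(2.0, 1.0), (3.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
w, b, mistakes = train_perceptron(points, labels)
print(w, b, mistakes)
```

On separable data the loop stops after the first pass with no updates; on non-separable data it simply runs until `max_epochs`, so the cap is a safety net rather than part of the algorithm.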
Statistics & Modeling
Optimization and Modeling Framework
In Fall 2021, I took an advanced-level Decision Analytics course that introduced modeling frameworks and computational tools to address complex, ill-defined, and large-scale decision-making problems that arise in policy and business. It covered advanced methods of decision-making: (large-scale) deterministic optimization, stochastic/robust optimization, and sequential decision-making, using the Gurobi solver in Python. Case studies were drawn from a variety of real-world settings in transportation, energy, information systems, health care, supply chain management, etc. Here are the case studies and final project that I completed during this class.
Portfolio
- Integrated Manufacturing & Inventory Planning using Linear Optimization
- Emergency Response using Integer Optimization
- Traveling Salesman Problem using Integer Optimization
- World Health Organization’s Nutrition Policy using Multi-Objective Optimization
- Aircraft Configuration using Stochastic Optimization
Final Project
Brief Description: “To prepare for large public health emergencies, the Allegheny County is prioritizing the list of Points of Dispense (PODs) to open, which are used to distribute essential supplies and medicines to the public during such emergencies. Your goal for this project is to formulate a facility location model, and optimally select a list of PODs from the candidate sites. You need to compare two different formulations: minimizing the total/weighted travel distance of the population, and minimizing the maximum travel distance for anyone.”
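The difference between the two formulations in the brief can be seen on a toy example. The sketch below is not the Gurobi model from the project: it uses brute-force enumeration in plain Python, and the site coordinates, population blocks, and function names are all hypothetical. It picks `p` sites under each objective, with every block served by its nearest open site.

```python
from itertools import combinations
from math import dist

# Hypothetical candidate POD sites and population blocks (x, y, population).
candidate_sites = {"A": (0.5, 0.0), "B": (5.5, 0.0), "C": (10.0, 0.0)}
population_blocks = [(0.0, 0.0, 100), (1.0, 0.0, 100), (10.0, 0.0, 10)]


def assigned_distances(open_sites, sites, blocks):
    """Distance from each block to its nearest open site."""
    return [min(dist((x, y), sites[s]) for s in open_sites) for x, y, _ in blocks]


def weighted_total_distance(open_sites, sites, blocks):
    """Population-weighted total travel distance (first formulation)."""
    distances = assigned_distances(open_sites, sites, blocks)
    return sum(d * pop for d, (_, _, pop) in zip(distances, blocks))


def max_distance(open_sites, sites, blocks):
    """Worst-case travel distance for anyone (second formulation)."""
    return max(assigned_distances(open_sites, sites, blocks))


def best_subset(sites, blocks, p, objective):
    """Enumerate every choice of p sites and keep the best one."""
    return min(combinations(sorted(sites), p),
               key=lambda chosen: objective(chosen, sites, blocks))


by_total = best_subset(candidate_sites, population_blocks, 1, weighted_total_distance)
by_max = best_subset(candidate_sites, population_blocks, 1, max_distance)
print("Minimize weighted total distance:", by_total)
print("Minimize maximum distance:", by_max)
```

With one site to open, the weighted objective favors the site next to the two large blocks, while the minimax objective moves toward the middle to protect the small, remote block. Enumeration only works for tiny instances; a realistic number of candidate sites needs an integer optimization solver.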
AB Testing
To extend my learning of Randomized Controlled Trials (RCTs) from economic policy issues to tech products, I took an A/B testing class in Fall 2021 that focused on frameworks to measure causal effects across different industries. The course used examples with real-world datasets drawn from research performed at the Heinz College in entertainment and education. Significant effort was placed on understanding how to design randomized experiments (aka A/B tests) to measure causal effects. The assignments and project completed below also leverage tools for analyzing data from observational studies where randomization cannot be implemented.
The concepts and tools discussed in this course are general in nature and can be applied in different settings. The following concepts have been covered in the assignments below:
- Randomized Controlled Experiments
- Time and Individual Fixed Effects
- Instrumental Variables
- Natural Experiments
- Differences in Differences
- Propensity Score Matching
- Compliance in Experiments
- Heterogeneous Effects
- Interference in Networked Experiments
Portfolio
Final Project
Brief Description: Using data from a Qualtrics survey executed by our team in Pittsburgh, the paper explored the causal effect of including scooter rides as part of the university transportation fee on weekly scooter usage. We drew causal inferences by eliminating selection bias through random assignment of experimental subjects to the treatment and control groups. The analysis also explores heterogeneity in the average causal effect across groups within the sample, using the key barriers of distance, safety, and income as moderators. The project consisted of the following parts: Experiment Design & Causal Question of Interest, Survey Design & Execution, Data & Analysis, and Recommendations.
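Under random assignment, the average causal effect can be estimated as a simple difference in means, and heterogeneity can be explored by repeating the estimate within subgroups. The sketch below shows that idea in plain Python. The records are made-up numbers, not the survey data, and the "near"/"far" grouping is a hypothetical stand-in for a moderator such as distance.

```python
from math import sqrt
from statistics import mean, variance


def difference_in_means(treated, control):
    """Estimated average treatment effect and its standard error."""
    effect = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return effect, se


def subgroup_effects(records):
    """Difference in means computed separately within each subgroup."""
    effects = {}
    for group in sorted({g for g, _, _ in records}):
        treated = [y for g, t, y in records if g == group and t == 1]
        control = [y for g, t, y in records if g == group and t == 0]
        effects[group] = difference_in_means(treated, control)[0]
    return effects


# Hypothetical records: (subgroup, treated flag, weekly scooter rides).
records = [
    ("near", 1, 5), ("near", 1, 4), ("near", 0, 3), ("near", 0, 2),
    ("far", 1, 2), ("far", 1, 3), ("far", 0, 2), ("far", 0, 1),
]

treated_all = [y for _, t, y in records if t == 1]
control_all = [y for _, t, y in records if t == 0]
effect, se = difference_in_means(treated_all, control_all)
by_group = subgroup_effects(records)
print(f"Overall effect: {effect:.2f} rides/week (SE {se:.2f})")
print("Effect by subgroup:", by_group)
```

The subgroup estimates are only descriptive at this sample size; a real analysis would test whether the difference between groups is itself statistically significant.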
Computer Programming
Spatial Data Analytics
I took a Geographic Information Systems (GIS) class that covered the storage, retrieval, and visualization of geographically referenced data, as well as the design and analysis of spatial information. By the end of the course, I had developed an understanding of the world’s quickly growing spatial data infrastructure and how to put it to work for producing location-based information, and I had learned to identify the relevant spatial characteristics of diverse application areas, which lets me integrate spatial thinking and GIS analysis into my career.
Some of the concepts that I learned include:
- Geographic concepts (world coordinate systems, map scale/projections, sea level/elevation)
- Government-provided map infrastructure, geodatabases, geodesign
- Spatial data processing (clipping, merging, appending, joining, dissolving)
- Spatial analysis (proximity analysis, risk surface, site suitability, spatial data mining)
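Proximity analysis, one of the spatial analysis concepts above, can be sketched without GIS software. The example below is an assumption-laden illustration in plain Python: the facility names and coordinates are hypothetical, and it uses the haversine great-circle formula on a spherical Earth rather than a projected coordinate system as a GIS package would.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_buffer(center, facilities, radius_km):
    """Names of facilities inside a circular buffer around the center."""
    lat0, lon0 = center
    return sorted(
        name for name, (lat, lon) in facilities.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


# Hypothetical facilities (latitude, longitude) and a hypothetical center point.
facilities = {
    "clinic_a": (40.44, -80.00),
    "clinic_b": (40.50, -79.90),
    "clinic_c": (40.45, -79.99),
}
nearby = within_buffer((40.44, -79.99), facilities, radius_km=2.0)
print(nearby)
```

A buffer query like this answers "what is within 2 km of this point?", which is the building block behind proximity and site suitability analyses.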
Portfolio
- Mapping Health Data
- Compare the Walkability of Pittsburgh Neighborhoods
- Compare Serious Violent Crime with Poverty in Pittsburgh
- Build a Study Area: Supporting Geodatabase and Map for a Rapidly Growing Texas Metropolitan Area
- Perform a Cluster Analysis of Tornadoes
Advanced Programming in Python
My journey in the world of data science started with foundational and advanced courses in Python in the first semester. In this course, I gathered data from various sources, including web scraping, web APIs, CSV and other structured data files, and databases; cleaned the data; analyzed it with the Pandas library; worked with regular expressions and other string processing methods; practiced classes and object-oriented programming; and built a real-world software application.
Portfolio
- Data Analysis Using Pandas
- Data Processing Using SQLite
- Regular Expressions and Web APIs
- Data Visualization and Object-Oriented Programming
Final Project
Brief Description: The Symptom Checker ChatBot - The objective of our project was to create an application in Python that functions like a chatbot and takes user input (based on symptoms) via a series of questions to:
- Obtain a list of diseases from reliable/vetted online sources (use of maximum three web sources based on reliability) most commonly associated with the symptoms shared by the user
- Allow the user to retrieve more information, such as causes, treatment, a list of over-the-counter medications to alleviate the symptoms, and nearby health facilities based on their location
- Provide a reference list of a panel of doctors or hospitals based on the user’s location for further review by the user
Programming in R for Analytics
Through this course, I learned to use RStudio, read R documentation, and write R scripts; import, export, and manipulate data; produce statistical summaries of continuous and categorical data; produce basic graphics with standard functions and more advanced graphics with the ggplot2 library; perform common hypothesis tests and run simple regression models in R; and produce reports of statistical analyses in R Markdown/R Notebooks.
Portfolio
Final Project
Brief Description: “Sex-related differences: Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?”
The main question of interest was answered in the project through the following steps:
- Data processing and summarization: Insightful graphical and tabular summaries of the data
- Methodology: Dealing with missing values and topcoded variables; exploring trends and correlations; variable selection
- Findings: Tabular summaries; graphical summaries; regression and interpretation of coefficients; assessment of statistical significance
- Discussion: Potential confounders; model fit limitations; confidence in results for policy makers
Hope you enjoyed reading about my journey! If you want to learn more, feel free to reach out on LinkedIn.