Where should MQMers live and work after graduation?

Nghi Truong
Jan 31, 2020

This project was a collaboration with my two classmates, Weiqiao Bi and Youai Qin, in the Master of Quantitative Management (MQM): Business Analytics program at Duke Fuqua. We built it for our first summer data competition, which asked teams to create a meaningful dataset, and we won first prize among more than 200 peers.

MQM has entered its third year. Although the program has earned an excellent reputation for job placement and current students are confident about landing their dream jobs, it is always hard to decide which city to live and work in after graduation, since cities differ in many ways. Gathering comprehensive, objective, and high-quality information is never easy. To solve this problem, we decided to create a dataset that provides information on major cities in the United States by applying multiple web scraping techniques.

We hope that students can access our original, organized, and up-to-date data from authorized sources, run their own queries and analyses, and see which city or cities they prefer. Hopefully, our dataset will make this critical “where to go after graduation” decision easier.

Dataset Relational Schema

Data Collection

Our dataset is original and comprehensive: we used several web scraping techniques to collect data from different sources, including Google Maps, Glassdoor, HIFLD, and other open, authorized websites.

Major_Cities

We limited this table to the top 100 cities by population in the United States, for two reasons. Firstly, it would have taken too long to scrape information for every single city in the US, and we did not have enough time or storage to do that. Secondly, we assumed that larger cities have more job openings suitable for MQM students. Therefore, we chose the 100 largest cities as our target.

We scraped the table from Wikipedia, which ranks cities by population, and took the first 100 cities from it. We then cleaned the table: removing unnecessary units, splitting the two parts of the coordinate into two columns, renaming each column, and so on. This became our base table, and we used either the ‘City’ or the ‘Coordinate’ column to look up other information and to set a standard for scraping from other sources. As a result, our tables can be joined later using the city as the primary key. All of us used web scraping techniques to collect information from different sources (stated in the data resources).

To obtain recreation and infrastructure information for each target city, we scraped search results from Google using the Google Places API. We used Google's predefined place categories, including night clubs, hospitals, movie theaters, shopping malls, art galleries, book stores, parks, and restaurants, and chose a reasonable radius around each city's center point to count how many locations there were in each category. Because Google limits results to 60 places per search, the counts for some big cities in our dataset may understate the actual number of places: a value of 60 means 60 or more.
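As a rough illustration of the cleaning and joining steps described above, here is a minimal sketch in plain Python. The coordinate format, the column names other than ‘City’ and ‘Coordinate’, the helper names, and the sample values are all assumptions made for the example, not the code or data we actually used.

```python
import re

# Assumed raw format of a coordinate cell, e.g. "35.98°N 78.90°W".
_COORD = re.compile(r"([\d.]+)°([NS])\s+([\d.]+)°([EW])")

# The per-search result limit mentioned in the article.
GOOGLE_RESULT_CAP = 60


def split_coordinate(raw):
    """Split one coordinate string into signed (latitude, longitude) floats."""
    match = _COORD.search(raw)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {raw!r}")
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (1 if ns == "N" else -1)
    longitude = float(lon) * (1 if ew == "E" else -1)
    return latitude, longitude


def join_on_city(base_rows, place_counts):
    """Attach per-category place counts to base rows, keyed by 'City'.

    A count at the cap is flagged, because it means "60 or more".
    """
    joined = []
    for row in base_rows:
        merged = dict(row)
        for category, n in place_counts.get(row["City"], {}).items():
            merged[category] = n
            merged[f"{category}_capped"] = n >= GOOGLE_RESULT_CAP
        joined.append(merged)
    return joined


# Placeholder rows (hypothetical city and counts, for illustration only).
base = [{"City": "Example City", "Coordinate": "35.98°N 78.90°W"}]
for row in base:
    row["Latitude"], row["Longitude"] = split_coordinate(row.pop("Coordinate"))

counts = {"Example City": {"parks": 60, "book_stores": 12}}
table = join_on_city(base, counts)
```

Keeping an explicit flag for capped counts makes it harder to mistake a truncated 60 for an exact figure when comparing large cities.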

Fortune500, Housing, Tax_rate

For the Fortune500, Housing, and Tax_rate tables, we used the BeautifulSoup package in Python to retrieve the full tables from websites, cleaned them up after scraping, and filled in the missing values manually.

Job_By_City and Job_Overview

For the Job_By_City and Job_Overview tables, we used the Selenium WebDriver package to search for jobs on Glassdoor and retrieve the results. Because the open positions will be very different by the time we graduate, we recorded only the total number of positions found in each city, plus sample job information from the first two pages of results, to give an overview of the labor market in each city. One important note about the search: Glassdoor's default radius is 25 miles, so the results also reflect positions available in nearby areas. Based on the MQM jobs report, we searched with two main keywords: data analyst and data science. For the web scraping code, we mostly learned from the reference source and then modified the code to match the websites' current HTML and the data we needed.
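To make the table-scraping step more concrete, here is a minimal sketch of how an HTML table, such as the ones behind Fortune500, Housing, or Tax_rate, might be read with BeautifulSoup. The HTML snippet, the column names, and the `table_to_rows` helper are placeholders invented for the example; the real pages have their own structure, which is why our code had to be adapted to each site.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a downloaded page (hypothetical content and columns).
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Rate</th></tr>
  <tr><td>Example City A</td><td>7.5%</td></tr>
  <tr><td>Example City B</td><td></td></tr>
</table>
"""


def table_to_rows(html):
    """Turn the first <table> in the page into a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            # Empty cells become None so missing values are easy to find.
            rows.append({h: (c or None) for h, c in zip(headers, cells)})
    return rows


rows = table_to_rows(SAMPLE_HTML)

# Cities whose value still needs to be filled in by hand.
missing = [r["City"] for r in rows if r["Rate"] is None]
```

Listing the rows with empty cells right after scraping gives a short checklist for the manual update of missing values.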

Possible Deployment

The dataset provides a comprehensive picture of each city along several dimensions, from expected expenses and job opportunities to entertainment and food options. As a result, it supports both personal decisions and general research. On the personal side, students looking for the best place to work in the data field after graduation can define their own criteria and weights, then score each city against this data to choose the best place to live.
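One possible way to do this kind of weighted scoring is sketched below. The metric names, weights, and numbers are placeholders rather than values from our dataset, and min-max scaling is just one reasonable choice for putting different metrics on the same 0 to 1 scale.

```python
def min_max(values, higher_is_better=True):
    """Scale values to the 0-1 range; flip them when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All cities tie on this metric, so it should not favor anyone.
        return [0.5 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1 - s for s in scaled]


def score_cities(cities, criteria):
    """Rank cities by a weighted sum of scaled metrics.

    cities:   {city name: {metric: value}}
    criteria: {metric: (weight, higher_is_better)}
    """
    names = list(cities)
    total_weight = sum(weight for weight, _ in criteria.values())
    scores = dict.fromkeys(names, 0.0)
    for metric, (weight, higher_is_better) in criteria.items():
        scaled = min_max([cities[n][metric] for n in names], higher_is_better)
        for name, s in zip(names, scaled):
            scores[name] += weight / total_weight * s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder numbers for two hypothetical cities.
cities = {
    "City A": {"salary": 90000, "rent": 1200, "jobs": 300},
    "City B": {"salary": 95000, "rent": 2600, "jobs": 900},
}
# A student who cares about salary and rent equally, and a bit less about
# the number of open positions.
criteria = {"salary": (0.4, True), "rent": (0.4, False), "jobs": (0.2, True)}

ranking = score_cities(cities, criteria)
```

Changing the weights in `criteria` is enough to see how the ranking shifts for a student with different priorities.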

Taking an example from our dataset, students might find Durham a better place to live than Boston, as the two offer similar annual salaries but differ significantly in rent. On the other hand, students looking for well-paid jobs might find San Jose much more attractive than other cities.

As another example, students can draw visualizations to compare different states by housing prices, tax rates, expected salaries, number of available positions, and so on, to easily find the pros and cons of each city. More generally, analysts can use this dataset to recommend cities to people who want to work in the data field after graduation. Based on the Glassdoor data, we created a map showing the number of available data-related jobs across the United States to provide an overview of the labor market. We can see that California, Texas, and North Carolina have the greatest number of available jobs, which reflects the technology-oriented nature of these states.

Next Steps

Firstly, because of time and storage constraints, we did not collect information for every city in the US, nor for major cities in other countries, such as London, Shanghai, and Paris. We believe there are valuable jobs in other cities and countries, so if we take this project further, we will try to collect information for more cities.

Secondly, we scraped job listings on Glassdoor using the keywords data scientist and data analyst. However, some students in the MQM program may be interested in other positions. With more time, we would scrape additional positions, such as business analyst, digital marketing analyst, product manager, and project manager, so that students can learn more about the availability of different careers. Moreover, for a deeper analysis of entertainment options in each city, we could scrape detailed information about each place, including its address, budget, and review score, so that students can easily search for their favorite places after selecting their target city.

Lastly, we can survey MQM students about which cities they would prefer to live in after graduation and use the results as a benchmark for later analysis.
