Wednesday, November 18, 2009

Using the NYC Data Mine for an Intro Database Assignment

On October 6th, I attended the New York Tech Meetup, where I learned about the NYC Data Mine repository, which contains "many sets of public data produced by City agencies [...] available in a variety of machine-readable formats".

I went over the available data sets, and indeed they were big, comprehensive, and (mostly) well-structured. So, I decided to use them for the introductory database assignment in my "Information Technology in Business and Society" class. It is a core, required class at Stern, and the students are mainly non-majors. Still, I wanted to see what they would do with the data.

So, I created an assignment asking them to pick two or more data sets, import them into a database, and run some basic join queries to connect them. Then, they had to bring the data into Excel and perform some PivotChart-based analysis. I left the topic intentionally open, just to see what types of questions they would ask.
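To make the workflow concrete, here is a minimal sketch of what such an assignment involves, using SQLite from Python. The table names, columns, and rows below are invented stand-ins for the real NYC Data Mine files, not the actual data:

```python
import sqlite3

# In-memory database; students would instead load the downloaded CSV files.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical "graffiti incidents" data set (invented rows).
cur.execute("CREATE TABLE graffiti (incident_id INTEGER, borough TEXT)")
cur.executemany("INSERT INTO graffiti VALUES (?, ?)",
                [(1, "Manhattan"), (2, "Brooklyn"), (3, "Brooklyn")])

# Hypothetical socio-economic data set, keyed by borough (invented rows).
cur.execute("CREATE TABLE demographics (borough TEXT, median_income INTEGER)")
cur.executemany("INSERT INTO demographics VALUES (?, ?)",
                [("Manhattan", 68000), ("Brooklyn", 42000)])

# A basic join query of the kind the assignment asks for:
# connect the two data sets on their shared column.
cur.execute("""
    SELECT g.borough, COUNT(*) AS incidents, d.median_income
    FROM graffiti AS g
    JOIN demographics AS d ON g.borough = d.borough
    GROUP BY g.borough
    ORDER BY g.borough
""")
rows = cur.fetchall()
for row in rows:
    print(row)
```

The joined result could then be exported to Excel for the PivotChart step.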

Here are the results, together with my one-sentence summary of the analysis/results.
Given that this was the first time I gave this assignment, and the first time the students were learning about databases, I was pretty happy with the results. Most of them understood the data sets well and wrote meaningful queries against the data.

However, I would like to encourage the analysis of a more diverse set of data: students seemed particularly attracted to the graffiti data set, and (as expected) most used the data set with the socio-economic numbers for each borough.

The rather disappointing fact was that many teams took the "easy way out" and joined the data sets on the borough (Manhattan, Queens, Brooklyn, Bronx, Staten Island), while it would have been much more interesting to see joins based on zip codes, community boards, districts, etc. I guess this becomes a requirement for next year.
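A small sketch of why the finer-grained key matters, again with invented rows rather than real NYC data: with only five boroughs, a borough-level join collapses everything into five rows, while a zip-code join preserves neighborhood-level variation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical complaints data set with zip codes (invented rows).
cur.execute("CREATE TABLE complaints (zip TEXT, borough TEXT)")
cur.executemany("INSERT INTO complaints VALUES (?, ?)",
                [("10012", "Manhattan"), ("10027", "Manhattan"),
                 ("11211", "Brooklyn")])

# Hypothetical income data set, keyed by zip code (invented rows).
cur.execute("CREATE TABLE income (zip TEXT, median_income INTEGER)")
cur.executemany("INSERT INTO income VALUES (?, ?)",
                [("10012", 90000), ("10027", 35000), ("11211", 48000)])

# Joining on zip code keeps the two Manhattan zips apart; a borough-level
# join would average away the large income gap between them.
cur.execute("""
    SELECT c.zip, c.borough, i.median_income
    FROM complaints AS c
    JOIN income AS i ON c.zip = i.zip
    ORDER BY c.zip
""")
rows = cur.fetchall()
for row in rows:
    print(row)
```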

Finally, I should encourage people to work with the really big data sets (e.g., the property valuation statistics) instead of the relatively small ones. But perhaps that is something reserved for the data mining class...