Wednesday, November 18, 2009

Using the NYC Data Mine for an Intro Database Assignment

On October 6th, I attended the New York Tech Meetup, where I learned about the NYC Data Mine repository, which contains "many sets of public data produced by City agencies [...] available in a variety of machine-readable formats".

I went over the available data sets, and indeed they were big, comprehensive, and (mostly) well-structured. So, I decided to use them for the introductory database assignment in my "Information Technology in Business and Society" class. It is a core, required class at Stern, and the students are mainly non-majors. Still, I wanted to see what they would do with the data.

So, I created an assignment asking them to pick two or more data sets, import them into a database, and run some basic join queries to connect them. Then, they had to bring the data into Excel and perform some PivotChart-based analysis. I left the topic intentionally open, just to see what types of questions they would ask.
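To make the workflow concrete, here is a minimal sketch of what such an assignment involves, using SQLite from Python. The table names, columns, and rows below are invented stand-ins for the real NYC Data Mine files, not the actual data:

```python
import sqlite3

# In-memory database; students would instead load the downloaded CSV files.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical "graffiti incidents" data set (invented rows).
cur.execute("CREATE TABLE graffiti (incident_id INTEGER, borough TEXT)")
cur.executemany("INSERT INTO graffiti VALUES (?, ?)",
                [(1, "Manhattan"), (2, "Brooklyn"), (3, "Brooklyn")])

# Hypothetical socio-economic data set, keyed by borough (invented rows).
cur.execute("CREATE TABLE demographics (borough TEXT, median_income INTEGER)")
cur.executemany("INSERT INTO demographics VALUES (?, ?)",
                [("Manhattan", 68000), ("Brooklyn", 42000)])

# A basic join query of the kind the assignment asks for:
# connect the two data sets on their shared column.
cur.execute("""
    SELECT g.borough, COUNT(*) AS incidents, d.median_income
    FROM graffiti AS g
    JOIN demographics AS d ON g.borough = d.borough
    GROUP BY g.borough
    ORDER BY g.borough
""")
rows = cur.fetchall()
for row in rows:
    print(row)
```

The joined result could then be exported to Excel for the PivotChart step.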

Here are the results, together with my one-sentence summary of the analysis/results.
Given that this was the first time I gave this assignment, and the first time the students were learning about databases, I was pretty happy with the results. Most of them understood the data sets well and wrote meaningful queries against the data.

However, I would like to encourage the analysis of a more diverse set of data: students seemed particularly attracted to the graffiti data set, and (as expected) most used the data set with the socio-economic numbers for each borough.

The rather disappointing fact was that many teams took the "easy way out" and joined the data sets on the borough (Manhattan, Queens, Brooklyn, Bronx, Staten Island), while it would have been much more interesting to see joins based on zip codes, community boards, districts, etc. I guess this becomes a requirement for next year.
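A small sketch of why the finer-grained key matters, again with invented rows rather than real NYC data: with only five boroughs, a borough-level join collapses everything into five rows, while a zip-code join preserves neighborhood-level variation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical complaints data set with zip codes (invented rows).
cur.execute("CREATE TABLE complaints (zip TEXT, borough TEXT)")
cur.executemany("INSERT INTO complaints VALUES (?, ?)",
                [("10012", "Manhattan"), ("10027", "Manhattan"),
                 ("11211", "Brooklyn")])

# Hypothetical income data set, keyed by zip code (invented rows).
cur.execute("CREATE TABLE income (zip TEXT, median_income INTEGER)")
cur.executemany("INSERT INTO income VALUES (?, ?)",
                [("10012", 90000), ("10027", 35000), ("11211", 48000)])

# Joining on zip code keeps the two Manhattan zips apart; a borough-level
# join would average away the large income gap between them.
cur.execute("""
    SELECT c.zip, c.borough, i.median_income
    FROM complaints AS c
    JOIN income AS i ON c.zip = i.zip
    ORDER BY c.zip
""")
rows = cur.fetchall()
for row in rows:
    print(row)
```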

Finally, I should encourage people to work with the really big data sets (e.g., the property valuation statistics) instead of the relatively small ones. But perhaps that is something reserved for the data mining class...