Tuesday, November 23, 2010

NYC, I Love You(r Data)

Last year, I experimented with the NYC Data Mine repository as a source of data for our introductory course on information systems (for business students, mainly non-majors). The results of the assignment were great, so I repeated it this year.

The goal of the assignment was to teach them how to grab large datasets and run database queries against them. The students had to go to the NYC Data Mine repository, pick two datasets that interested them, join them in Access, and perform some analysis of their choosing. The ultimate goal was to get them to work with real data.

Last year, some students took the easy way out and joined the datasets manually(!) on the borough values (Manhattan, Bronx, Brooklyn, Queens, Staten Island). This year, I explicitly forbade them from doing so. Instead, I asked them to join only on attributes with a large number of values.
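To make the "join on an attribute with many values" idea concrete, here is a minimal sketch in Python, using the built-in sqlite3 module rather than Access. The table names, column names, and numbers are all made up for illustration; they are not taken from the NYC Data Mine or from any student project.

```python
import sqlite3

# Hypothetical toy data: table names, columns, and values are invented
# for illustration and do not come from the NYC Data Mine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (name TEXT, zip TEXT, score INTEGER);
    CREATE TABLE incomes (zip TEXT, per_capita_income INTEGER);
    INSERT INTO restaurants VALUES
        ('Cafe A', '10001', 12),
        ('Diner B', '10001', 20),
        ('Grill C', '10002', 7);
    INSERT INTO incomes VALUES
        ('10001', 60000),
        ('10002', 45000);
""")

# Join on zip code (many distinct values) rather than on borough
# (only five values), then aggregate per zip code.
rows = conn.execute("""
    SELECT r.zip, AVG(r.score) AS avg_score, i.per_capita_income
    FROM restaurants AS r
    JOIN incomes AS i ON r.zip = i.zip
    GROUP BY r.zip, i.per_capita_income
    ORDER BY r.zip
""").fetchall()

for zip_code, avg_score, income in rows:
    print(zip_code, avg_score, income)
```

A join on borough would match every row against one of only five values, which is easy to do by hand. A join on zip code has far too many values for that, so the database has to do the work.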

The results are here and most of them are well worth reading! The analyses below are almost like a tour guide to New York's data sights :-) The new generation of Nate Silvers is coming.



Enjoy the projects:
  • Academia and Concern for the Environment! Is there a correlation between how much you recycle and how well students perform in school? Are kids who are more involved in school activities more likely to recycle? Does school really teach us to be environmentally conscious? To find out the answers check out our site!
  • An Analysis of NYC Events: One of the greatest aspects of New York is the fun festivals, street fairs and block parties where you can really take in the culture. Our charts demonstrate which times to visit New York and which boroughs to visit for events. We suggest that tourists and residents check out our research. Organizers of events or people who make their money from events should also consult our analysis.
  • How are income and after school programs related?: This study is an analysis of how income levels are related to the number of after school programs in an area. The correlation between income and number of school programs was interesting to analyze across the boroughs because while they did follow a trend, the different environments of the boroughs also had an exogenous effect. This is most evident in Manhattan, which can be seen in the study.
  • Restaurant Cleanliness in Manhattan: What are the cleanest and dirtiest restaurants in Manhattan? What are the most common restaurant code violations? We analyzed data on restaurant inspection results and found answers to these questions and more.
  • Ethnic Dissimilarity's Effect on New Business: This analysis focuses on the relationship between new businesses and specific ethnic regions. Do ethnically dominated zip codes deter or promote business owners of differing ethnicities to open up shop?
  • Does The Perception Of Safety In Manhattan Match With Reality? People’s perceptions of events and their surroundings influence their behavior and outlook, even though facts may present a different story. In this regard, we took a look at the reported perception of people’s safety within Manhattan and compared it to the actual crime rates reported by the NYPD. The purpose of our study was to evaluate the difference between the actual crime rate and perceived safety of citizens and measure any discrepancy.
  • Women's Organizations love food stores!: We have concluded that a large percentage of women's organizations are located near casual dining and takeout restaurants as well as personal and professional service establishments compared to what we originally believed would be shopping establishments.
  • Hispanics love electronics!: Our goal for this project is to analyze the relationship between electronics stores and demographics in a particular zip code. We conducted a ratio analysis instead of a count analysis to lessen the effects of population variability, so as to create an "apples to apples" comparison. From our analysis, it can be seen that there is a greater presence of electronics stores in zip codes with a higher proportion of Hispanics.
  • Political Contributions and Expenditures: A comprehensive analysis of the political contributions and expenditures during the 2009 elections. The breakdown of who contributes, in what areas of Manhattan, and how candidates spend their money is particularly interesting!
  • How Dirty is Your Food? Our goal for this project is to analyze the various hygiene conditions of restaurants in New York City. We cross referenced the inspection scores of the restaurants with the cuisine they serve to find out if there was any correlation between these two sets of data. By ranking the average health score of the various cuisines, we can determine which kinds of cuisines were more likely to conform to health standards.
  • Want to Start a Laundromat? An Electronic Store? The best possible places to start a Laundromat and an electronic store. For Laundromats we gave the area that had the lowest per capita income, as we noticed a trend that Laundromats do better in poorer neighborhoods. For electronic stores we found the lowest saturated areas that have the highest per capita income.
  • Where to Let Your Children Loose During the Day in NYC: For this analysis, we wondered whether there was a correlation between how safe people felt in certain areas in New York and the availability of after-school programs in the different community boards.
  • Best Place to Live in Manhattan After Graduation: We analyzed which locations in Manhattan, classified by zip code, would be the best to live in for a new graduate. We used factors like shopping, nightlife, gyms, coffeehouses, and more! Visit the website to get the full analysis.
  • Political Contributions and Structures: Our report analyzes the correlation between political contributions and structures in New York in varying zip codes.
  • Best Places to Eat and Find Parking in New York City: Considering the dread of finding parking in New York City, our analysis is aimed at finding the restaurants with the largest number of parking spaces in their vicinities.
  • Are the Cleanest Restaurants Located in the Wealthiest Neighborhoods? Our analysis of the relationship between property value and restaurant rating for the top and bottom ten rated restaurants by zip code in New York City.
  • Analysis of Popular Baby Names
  • Restaurant Sanitary Conditions: Our team was particularly interested in the various cuisines offered in various demographic neighborhoods, grouped by zip codes. We were especially curious about the sanitary level of various cuisines offered by restaurants. The questions we wanted to answer were:
    • What zip codes had the highest rated restaurants? What types of cuisines are found in these zip codes?
    • What zip codes had the lowest rated restaurants? What types of cuisines are found in these zip codes?
  • Does having more community facilities improve residents' satisfaction with city agencies? Does having more public and private community facilities in NYC such as schools, parks, libraries, public safety, special needs housing, health facilities, etc lead to greater satisfaction with city services? On intuition, the answer is a resounding YES! With more facilities, we would enjoy our neighborhood better and develop a better opinion of New York City services. But how accurate is this intuition? In this analysis, we put that to the test.
  • Housing Patterns in Manhattan: The objective of our analysis was to identify factors which play a role in determining vacancy rates in Manhattan’s community districts. We inferred that vacancy rates are representative of the population’s desire to live in a particular district. We examined determining factors of why people want to live in a particular district including: quality of education, health care services, crime control in each district, etc.
  • Analysis of Cultural Presence and Building Age by Zip Code: Manhattan is a hub for cultural organizations and opportunities for community involvement. But does the amount of "community presence" differ based on the area where you live? Is there any relationship between the year that buildings in various areas were built, and the available programs for students and cultural organizations for the general public in that area? We analyzed whether a relationship existed between the number of cultural organizations and after school programs available in a zip code, and the average year that the buildings in the zip code were built. To further our analysis we looked at whether the age of buildings in areas with the greatest "cultural presence" affected the sales price of the buildings.
  • Analysis of Baby Names across the Boroughs: We decided to analyze the Baby Names given in 2008 across the boroughs of Manhattan, the Bronx and Brooklyn. We found the most popular names in each Borough, along with top names specific to each borough that were unpopular in other Boroughs. We also found certain factors that could play a determining role in the naming of these babies.
  • Analysis of New York City Street Complaints: We analyzed the different kinds of street complaints made in New York City, how the city tends to respond to them, and which streets have the most overall complaints when you also bring building complaints into the picture. This analysis taught us that Broadway has the most street complaints but it also piqued our interest in conducting even further analyses.
  • Campaign Contributions and Community Service Programs: The goal of our analysis was to determine if there is a correlation between contributions by NYC residents to election candidates and community service programs. We wanted to see if people who are more financially invested in elections are also more inclined to be involved in their neighborhoods through community programs.
  • Public Libraries in Queens: We looked at how many public libraries there were in each zip code in Queens. We also looked at the number of people and racial composition in each zip code, to see if these factors are related.
  • Sidewalk Cafe Clustering: Our study’s goal is to understand where sidewalk cafes cluster and some potential reasons why they cluster. We start by looking at what areas of the city are most populated with sidewalk cafes. Then we look to see if there are any trends related to gender or race demographics. We finally look to see if there is any influence of property value on the abundance of sidewalk cafes.


The surprise this year: most students did not understand what a "CSV" data file is. Many of them thought it was just some plain text, and did not try to use it. (Hence the prevalence of electronics store and laundromat analyses, which were based on datasets available in Excel format.) I guess next year I will need to explain that as well.
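For what it is worth, a CSV file really is plain text, but it is plain text that holds a table: one row per line, with commas separating the columns. A minimal sketch in Python, using the standard csv module, shows the idea. The column names and values below are invented for illustration and are not from any actual NYC Data Mine file.

```python
import csv
import io

# Hypothetical CSV content, invented for illustration. In practice this
# would be the text of a downloaded .csv file.
csv_text = """zip,borough,laundromats
10001,Manhattan,4
11201,Brooklyn,7
"""

# The first line holds the column names; every later line is one row.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

for record in records:
    # Every value is read as a string, so numbers need explicit conversion.
    print(record["zip"], record["borough"], int(record["laundromats"]))
```

Excel and Access can both open or import files like this directly, so a dataset offered only as CSV is just as usable as one offered in Excel format.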