A Computer Scientist in a Business School

Showing posts with label large datasets. Show all posts

Tuesday, April 2, 2013

Intrade Archive: Data for Posterity

A few years back, I have done some work on prediction markets. For this line of research, we have been collecting data from Intrade, to perform our experimental analysis. Some of the data is available through the Intrade Archive, a web app that I wrote in order to familiarize myself with the Google App Engine.

In the last few weeks, through, after the effective shutdown of Intrade, I started receiving requests on getting access to the data stored in the Intrade Archive. So, after popular demand, I gathered all the data from the Intrade Archive, and also all the past data that I had about all the Intrade contracts going back to 2003, and I put them all on GitHub for everyone to access and download. The Excel file contains a description of the contracts, while the zip file contains information about all the individual trades and the daily opening and closing prices.

On purpose, I exclude all the Financial contracts, as the trading of these events have limited research interest. (Plus, they were too many of them.) The information from "official" stock and options exchanges has much higher volume and is a better source of information than the comparatively illiquid contracts on Intrade.

The link to the GitHub repository is also now available from the home page of the Intrade Archive. I hope that the resource hungry crawlers can now be put to sleep, not to ever come back again :-)

Enjoy!

Thursday, June 21, 2012

The oDesk Flower: Playing with Visualizations

In the few couple of weeks, while at oDesk, I am trying to learn the data stored in the database, and I create random plots to understand what is happening in the market. My absolutely favorite source of data is the data about the micro-level activity of the workers (when they work, how much they type, how much they move the mouse, etc.).

A few weeks back, I posted a blog about the activity levels of different countries, with the basic observation that the activity in Philippines fluctuates much less within the 24-hr day compared to all other countries.

You are doing it wrong: The use of radar plots

After I posted that plot, I received the following email:

This is periodic data, which means modular thinking. When you visualize periodic data using a linear plot, you necessarily have a cutting point for the x-axis, which can affect the perception of various trends in the data. You should use something similar to the Flickr Flow, e.g a radar plot in Excel.

So, following the advice of people that really understand visualization, I transformed the activity plot into a radar plot, (in Excel):

The oDesk Flower

As you can see, indeed the comment was correct. Given the periodicity of the data, having a cyclical display is better than having a single horizontal line display. Beautiful to look at? Check. I called this visualization "The oDesk Flower" :-)

Unfortunately, it is not truly informative due to the huge number of countries in the plot. But I think it works well to give the global pace of activity over the week and across countries.

One thing that I did not like in this plot was the fact that I could not really compare the level of activity from one country to other. So, I normalized the values to be the percentage of contractors from that country that are active. A new flower emerged:

For comparison, here is the corresponding linear plot, illustrating the percentage of contractors from various countries that are active at any given time:

Fighting overplotting using kernel smoothing and heatmaps

The plot above is kind of interesting and indeed it shows the pattern of activity. However, we have a lot of "overplotting", which makes the plot busy. It is hard to understand where the majority of the lines are falling.

To understand better the flow of the lines, I decided to play a little bit with R. I loaded the data set with the activity line from each country, and then used kernel based smoothing (bkde2D) to find the regions of the space that had the highest density. To plot the result, I used a contour plot (filled.contour), which allows for the easy generation of heatmaps. Here is the R code:

and here is the resulting plot:

I like how this plot shows the typical activity across countries, which ranges from 2% to 6% of the total registered users. At the same time, we can see (the yellow-green "peaks) that there are also countries that have 8% to 10% of their users being active every week.

Need for interactivity

So, what did I learn from all these exercises? While I could create nice plots, I felt that static visualization are at the end of limited value. Other people cannot do any dynamic exploration of the data. Nobody can customize the plot to show a slightly different view and in general we lack the flexibility given by, say, the visualization gadgets of Google or by the data driven documents created using d3.js.

I would love to be able to create some more interactive plots and let other people play with and explore the data that oDesk has. Perhaps I should hire a contractor on oDesk to do that :-)

Tuesday, November 23, 2010

NYC, I Love You(r Data)

Last year, I experimented with the NYC Data Mine repository as a source of data for our introductory course on information systems (for business students, mainly non-majors). The results of the assignment were great, so I repeated it this year.

The goal of the assignment was to teach them how to grab and run database queries against large datasets. As part of an assignment, the students had to to go the NYC Data Mine repository, pick two datasets of their interest, join them in Access, and perform some analysis of interest. The ultimate goal was to get them to use some real data, and use them to perform an analysis of their interest.

Last year, some students took the easy way out and joined the datasets manually(!) on the borough values (Manhattan, Bronx, Brooklyn, Queens, Staten Island). This year, I explicitly forbid them from doing so. Instead, I explicitly asked them to join only using attributed with a large number of values.

The results are here and most of them are well-worth reading! The analyses below is almost like a tour guide on the New York's data sightseeings :-) The new generation of Nate Silver's is coming.

Enjoy the projects:

Academia and Concern for the Environment! Is there a correlation between how much you recycle and how well students perform in school? Are kids who are more involved in school activities more likely to recycle? Does school really teach us to be environmentally conscious? To find out the answers check out our site!
An Analysis of NYC Events: One of the greatest aspects about New York are the fun festivals, street fairs and block parties where you can really take in the culture. Our charts demonstrate which time to visit New York or what boroughs to attend events. We suggest that tourists and residents check out our research. Also organizers of events or people who make there money from events should also consult our analysis.
How are income and after school programs related?: This study is an analysis of how income levels are related to the number of after school programs in an area. The correlation between income and number of school programs was interesting to analyze across the boroughs because while they did follow a trend, the different environments of the boroughs also had an exogenous effect. This is most evident in Manhattan, which can be seen in the study.
Restaurant Cleanliness in Manhattan What are the cleanest and dirtiest restaurants in Manhattan? What are the most common restaurant code violations? We analyzed data on restaurant inspection results and found answers to these questions and more.
Ethnic Dissimilarity's Effect on New Business: This analysis focuses on the relationship between new businesses and specific ethnic regions. Do ethnically dominated zip codes deter or promote business owners of differing ethnicities to open up shop?
Does The Perception Of Safety In Manhattan Match With Reality? People’s perception of events and their surroundings influence their behavior and outlook, even though facts may present a different story. In this regard, we took a look at the reported perception of people’s safety within Manhattan and compared it to the actual crime rates reported by the NYPD. The purpose of our study was to evaluate the difference between the actual crime rate and perceived safety of citizens and measure any discrepancy.
Women's Organizations love food stores!: We have concluded that a large percentage of women's organizations are located near casual dining and takeout restaurants as well as personal and professional service establishments compared to what we originally believed would be shopping establishments.
Hispanics love electronics!: Our goal for this project is to analyze the relationship between electronic stores and demographics in a particular zip code. We conducted a ratio analysis instead of a count analysis to lessen the effects of population variability as to create an "apples to apples" comparison. From our analysis, it can be seen that there is a greater presence of electronic stores in zip codes with a higher proportion of Hispanics.
Political Contributions and Expenditures: A comprehensive analysis of the political contributions and expenditures during the 2009 elections. The breakdown of who, in what areas of Manhattan contribute as well as how candidates spend their money are particularly interesting!
How Dirty is Your Food? Our goal for this project is to analyze the various hygiene conditions of restaurants in New York City. We cross referenced the inspection scores of the restaurants with the cuisine they serve to find out if there was any correlation between these two sets of data. By ranking the average health score of the various cuisines, we can determine which kinds of cuisines were more likely to conform to health standards.
Want to Start a Laundromat? An Electronic Store? The best possible places to start a Laundromat and an electronic store. For Laundromats we gave the area that had the lowest per capita income, as we noticed a trend that Laundromats do better in poorer neighborhoods. For electronic stores we found the lowest saturated areas that have the highest per capita income.
Where to Let Your Children Loose During the Day in NYC: For this analysis, we wondered whether there was a correlation between how safe people felt in certain areas in New York and the availability of after-school programs in the different community boards.
Best Place to Live in Manhattan After Graduation: We analyzed what locations in Manhattan, classified by zip code, would be the best to live for a newly graduate. We used factors like shopping, nightlife, gyms, coffeehouses, and more! Visit the website to get the full analysis.
Political Contributions and Structures: Our report analyzes the correlation between political contributions and structures in New York in varying zip codes.
Best Places to Eat and Find Parking in New York City: Considering the dread of finding parking in New York City, our analysis is aimed at finding the restaurants with the largest number of parking spaces in their vicinities.
Are the Cleanest Restaurants Located in the Wealthiest Neighborhoods? Our analysis between property value and restaurant rating for the top and bottom ten rated restaurants by zip codes in New York City
Analysis of Popular Baby Names
Restaurant Sanitary Conditions: Our team was particularly interested in the various cuisines offered in various demographic neighborhoods, grouped by zip codes. We were especially curious about the sanitary level of various cuisines offered by restaurants. The questions we wanted to answer were:
- What zip codes had the highest rated restaurants? What type of cuisines are found in these zip codes?
- What zip codes had the lowest rated restaurants? What type of cuisines are found in these zip codes?
Does having more community facilities improve residents' satisfaction with city agencies? Does having more public and private community facilities in NYC such as schools, parks, libraries, public safety, special needs housing, health facilities, etc lead to greater satisfaction with city services? On intuition, the answer is a resounding YES! With more facilities, we would enjoy our neighborhood better and develop a better opinion of New York City services. But how accurate is this intuition? In this analysis, we put that to the test.
Housing Patterns in Manhattan: The objective of our analysis was to identify factors which play a role in determining vacancy rates in Manhattan’s community districts. We inferred that vacancy rates are representative of the population’s desire to live in a particular district. We examined determining factors of why people want to live in a particular district including: quality of education, health care services, crime control in each district, etc.
Analysis of Cultural Presence and Building Age by Zip Code: Manhattan is a hub for cultural organizations and opportunities for community involvement. But does the amount of "community presence" differ based on area that you live? Is there any relationship between the year that buildings in various areas were built, and the available programs for students and cultural organizations for the general public in that area? We analyzed whether a relationship existed between the number of cultural organizations and after school programs available in a zip code, and the average year that the buildings in the zip code were built. To further our analysis we looked at whether the age of buildings in areas with greatest "cultural presence" affected the sales price of the buildings.
Analysis of Baby Names across the Boroughs: We decided to analyze the Baby Names given in 2008 across the boroughs of Manhattan, the Bronx and Brooklyn. We found the most popular names in each Borough, along with top names specific to each borough that were unpopular in other Boroughs. We also found certain factors that could be a determining factor in the naming of these babies.
Analysis of New York City Street Complaints: We analyzed the different kinds of street complaints made in New York City, how the city tends to respond to them, and which streets have the most overall complaints when you also bring building complaints into the picture. This analysis taught us that Broadway has the most street complaints but it also piqued our interest in conducting even further analyses.
Campaign Contributions and Community Service Programs The goal of our analysis was to determine if there is a correlation between contributions by NYC residents to election candidates and community service programs. We wanted to see if people who are more financially invested in elections are also more inclined to be involved in their neighborhoods through community programs.
Public Libraries in Queens: We looked at how many public libraries there were in each zip code in Queens. We also looked at the number of people and racial composition in each zip code, to see if these factors are related.
Sidewalk Cafe Clustering: Our study’s goal is to understand where sidewalk cafes cluster and some potential reasons why they cluster. We start by looking at what areas of the city are most populated with sidewalk cafes. Then we look to see if there are any trends related to gender or race demographics. We finally look to see if there is any influence on property value on the abundance of sidewalk cafes.

The surprise in this year: Most students could not understand what is the "CSV" data file. Many of them thought it was some plain text, and did not try to use it. (Hence the prevalence of electronic and laundromat analyses, which were based on datasets available in Excel format.) I guess next year I will need to explain that as well.

Wednesday, November 18, 2009

Using the NYC Data Mine for an Intro Database Assignment

On October 6th, I was attending the New York Tech Meetup, and there I learned about the NYC Data Mine repository, which contains "many sets of public data produced by City agencies [...] available in a variety of machine-readable formats".

I went over the data sets available there and indeed the data sets were big, comprehensive, and (mostly) well-structured. So, I decided to use these data sets for the introductory database assignment for my "Information Technology in Business and Society" class. It is a core, required class at Stern and the students are mainly non-majors. Still, I wanted to see what they will do with the data.

So, I created an assignment, asking them to get two or more data sets, import them in a database and run some basic join queries to connect the data sets. Then, they had to bring the data into Excel and perform some PivotChart-based analysis. I left the topic intentionally open, just to see what type of questions they will ask.

Here are the results, together with my one-sentence summary of the analysis/results.

Socioeconomic Analysis of New York: An analysis of the ethnic composition and ancestry for each NYC borough.
Recycling rates across geographic and economic zones in New York City: Richer areas, recycle more.
Sidewalk Cafes and Electronics Stores: Disposable income and prevalence of sidewalk cafes and electronics stores.
Analysis of Crime, Graffiti, and "Broken Window Theory": Prevalence of graffiti is positively correlated with many other types of violent crime.
Recycling Your Expectations: How public satisfaction rate with recycling and public cleanliness correlate with that particular borough's actual recycling behaviors/habits/initiatives.
Borough graffiti vs. emergency response: Does a quicker response time from authorities mean less small crime?
Analyzing the correlation between marriage rates and population growth in NYC: Is there a correlation between marriage/divorce rate and the future expected growth rate of the population?
Analyzing Government's Use of Funds on Sports Parks in Relation to Public Demand: Does demand for a sport influence the city's decisions on fund allocation for various sports?
Analysis of the Number of High Schools per Borough in relation to Race and Income: Do wealthier boroughs have more and better high schools?
NYC Parks & Recreation Capital Budget Analysis: Which parks get most of the funding in New York?
Vandalism and demographics: Analysis of the relationship between unemployment rate, income level and population per police precinct with vandalism rate
NYC Events by Borough: The Relation between Event Types and Social, Demographic, and Economic Factor: What types of events take place in the different NYC boroughs?
Graffiti in New York City: Graffiti and socioeconomic factors.
War of the Boroughs: Projected population growth and correlation with socioeconomic factors.
Examining the socioeconomic composition of NYC by borough: Ethnic diversity, housing types, and educational attainment.
Recycling level compared with education level, income and poverty level by borough
Restaurant Code Violations Around New York City: Restaurants in St. Mark's Place are filthy. Sidewalk cafes are a safer bet.
The Great Discovery: Overcapacity trends in Manhattan Schools: NYC needs to build more schools in the northern part of the city.
Why Graffiti in NYC?: Graffiti and socio-economic factors.
Demographic Analysis of NYC: Relationship between fertility rate, education level and unemployment in the boroughs of New York City.
Where to Raise Your Children in NYC? What NYC borough would be the most ideal environment to raise children in.
Understanding the NYC Household: Analyzing Laundromats, Electronic Stores, and Schools: Analyzing relationships between consumption (electronic stores), necessities (laundromats), and education (schools)
Graffiti Incidents and Income Levels Among Residents Living in Brooklyn: Graffiti incidents in Brooklyn, broken down by community district board, and correlated with poverty and income levels.

Given that this was the first time that I was giving this assignment, and that this was the first time that students were learning about databases, I was pretty happy with the results. Most of them understood well the datasets and wrote meaningful queries against the data.

However, I would like to encourage the analysis of a more diverse set of data: Students seemed particularly attracted to the graffiti dataset and (expectedly) most used the data set with the socio-economic numbers of each borough.

The rather disappointing fact was that many teams took the "easy way out" and joined data based on the borough (Manhattan, Queens, Brooklyn, Bronx, Staten Island), while it would have been much more interesting to see joins based on zip codes, community boards, districts etc. I guess this becomes a requirement for next year.

Finally, I should encourage people to work with really big datasets (e.g., property valuation statistics), instead of the relatively small ones. But perhaps this is something reserved for the data mining class...

Tuesday, June 9, 2009

Google Fusion Tables: Databases on the Cloud

From the Google Research Blog: Google Fusion Tables.

Now it is possible to upload tabular data sets on Google, let other people use the data, and provide easy-to-use visualizations. No complicated joins or other heavy-duty relational stuff but there is functionality to connect (fuse) tables. There is also functionality embedded to discuss the contents of the data set.

Here is an early example. I took the data from a survey of Mechanical Turkers and imported it in Google Tables. Here is the resulting intensity map that shows the distribution of workers per country:

and the "lift" of the distribution of workers per state (we are comparing actual population percentage with percentage of Turkers):

I am truly excited about this feature. Just the idea that it will be possible to release "live" data sets, without having to set up complicated web interfaces, worrying about security, SQL injections, and so on, makes this absolutely wonderful for me.

For comparison, see the corresponding visualizations from Many Eyes:

But the flexibility of Google Tables for data management counters the relative lack of visualization options.

My only real complaint: The 100Mb limit. I was ready to upload my Mechanical Turk archive (see the related blog post) there, and let other people use it. Unfortunately, it is larger than the 100Mb limit. If only I could use the extra storage that I bought from Google for my Gmail account...

Thursday, November 13, 2008

Social Annotation of the NYT Corpus?

While I am waiting for the arrival of the New York Times Annotated Corpus, I have been thinking about the different tasks that we could use the corpus for. For some tasks, we might have to run additional extraction systems, to identify entities that are not currently marked. So, for example, we could use the OpenCalais system to extract patent issuances, company legal issues, and so on.

And then, I realized that most probably, tens of other groups will end up doing the same, over and over again. So, why not run such tasks once, and store them for others to use? In other words, we could have a "wiki-style" contribution site, where different people could submit their annotations, letting other people use them. This would save a significant amount of computational and human resources. (Freebase is a good example of such an effort.)

Extending the idea even more, we could have reputational metrics around these annotations, where other people provide feedback on the accuracy, comprehensiveness, and general quality of the submitted annotations.

Is there any practical problem with the implementation of this idea? I understand that someone needs access to the corpus to start with, but I am trying to think of more high-level obstacles (e.g., copyright, or conflict with the interests of publishers)?

Friday, October 31, 2008

The New York Times Annotated Corpus

Last week, I was invited to give a talk at a conference at the New York Public Library, about the preservation of news. I talked about our research in the Economining project, where we are trying to find the "economic value" of textual content on the Internet.

As part of the presentation, I discussed some problems that I had in the past with obtaining well-organized news corpora that are both comprehensive and also easily accessible using standard tools. Factiva has an excellent database of articles, exported in a richly annotated XML format but unfortunately Factiva prohibits data mining of the content of its archives.

The librarians in the conference were very helpful in offerring suggestions and acknowledging that providing content for data mining purposes should be one of the goals of any preservation effort.

So, yesterday I received an email from Dorothy Carner informing me about the availability of The New York Times Corpus, a corpus of 1.8 million articles from The New York Times, dating from 1987 until 2007. The details are available from http://corpus.nytimes.com but let me repeat some of the interesting facts here (the emphasis below is mine):

The New York Times Annotated Corpus is a collection of over 1.8 million articles annotated with rich metadata published by The New York Times between January 1, 1987 and July 19, 2007.
With over 650,000 individually written summaries and 1.5 million manually tagged articles, The New York Times Annotated Corpus has the potential to be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization and automatic content extraction.

The corpus is provided as a collection of XML documents in the News Industry Text Format (NITF). Developed by a consortium of the world’s major news agencies, NITF is an internationally recognized standard for representing the content and structure of news documents. To learn more about NITF please visit the NITF website.
Highlights of The New York Times Annotated Corpus include:
Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
Over 650,000 article summaries written by the staff of The New York Times Index Department.
Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at NYTimes.com.
Java tools for parsing corpus documents from xml into a memory resident object.
To learn more about The New York Times Annotated Corpus please read the PDF Overview.

Yes, 1.8 million articles, in richly annotated XML, with summaries, with hierarchically categorized articles, and with verified annotations of people, locations, and organizations! Expect the corpus to be a de facto standard for many text-centric research efforts! Hopefully more organizations are going to follow the example of New York Times and we are going to see such publicly available corpora from other high-quality sources. (I know that Associated Press has an archive of almost 1Tb of text, in computerized form, and hopefully we will see something similar from them as well.)

How can you get the corpus? It is available from LDC, for 300 USD for non-members; members should get this for free.

I am looking forward to receiving the corpus and start playing!

Sunday, July 27, 2008

Using The New York Times Reader

A few weeks back, I installed on my computer the "New York Times Reader". It is an application from The New York Times that runs quietly in the background, downloading locally all the articles of NY Times published over the last week. It also provides its own non-browser interface for browsing through the articles. The layout emulates more the layout of a paper newspaper than the layout of a web edition. I have used the reader a little bit when I downloaded it, but then I forgot about it and kept reading the news over the web.

Today, though, I found myself stuck on a 10-hr flight to Greece, with no Internet connectivity. Well, no problem. Actually I enjoy such long flights *because* there is no Internet connectivity and I can really focus on whatever I am doing, without (voluntarily or not) interruptions.

After going through all my email, I answered all the emails that were staying in my inbox for a while, and then I started reading blogs using the offline option of Google Reader. Unfortunately, reading blogs offline is not a very enjoyable experience. Some blogs are simply pointing to external articles, some others have only partial feeds, and some others are not meaningful to read without going over the comments and the discussion. So, quickly, I ran out of stuff to do.

Then, I noticed that I had the paper version of New York Times in front of me. I tried to read a little bit, just to realize that it is a royal pain to read a newspaper with the layout of New York Times on a plane. New York Times deserves and needs a coffee table, not a tray that can barely fit the laptop.

At that time, I realized that I had the Reader available on my laptop. Not sure if it syncs, I opened it. Fortunately, it has been quietly syncing all the material, and now I had one week of New York Times articles at my disposal. They layout was nice, the font type excellent, and the interface very intuitive. Plus the ability to go through all the sections (some of them published only once a week) is a big advantage. Therefore, I happily read one week worth of NY Times (ok, impossible, but it felt like that) on my laptop, ignoring completely the paper version sitting next to me.

Then, I noticed the "search" link. I went to the search screen and I started typing various queries. Well, was I surprised! Search was immediate, "results-as-you-type"-style. Plus the results were nicely organized as a newspaper, and ordered by relevance going from left to right. Here is a screenshot of the results for the query "economy":

Next step: See what is this "Topic Explorer". This generates a result screen like this:

Not very impressive initially, but the more interesting thing happens when you click an article, and you see a list of all the associated topics:

Very easy to go through related articles, easy to see the level of interest for each topic, and so on. I guess a little bit further visualization could also help. Also, some extra work to allow for faceted navigation would make the interface even more interesting. But it is definitely an enjoyable experience as-is, demonstrating the power of truly online interfaces over interfaces that simply try to emulate paper.

Monday, May 12, 2008

Experimental Repeatability or simply Open Source?

This year SIGMOD and KDD started playing with the idea of experimental repeatability. The basic idea is to generate guidelines and processes that will encourage repeatability of the experiments presented in many papers.

The reasons are rather obvious: We need to be able to reproduce the experiment, to avoid any hidden bias, catch errors, and even avoid outright fraud. Furthermore, this encourages publications of techniques that are easy to implement and test. Why do we care? If the method is impossible to implement then it is an obstacle to research progress. A published paper that claims to be the state of the art, but is not reproducible may prevent other reproducible methods from being published, just for lack of comparison with the current state of the art.

Now, to achieve experimental repeatability we need two things:

Access to the data sets
Access to the code

Both parts tend to have issues: When someone uses multi-terrabyte data sets, it is highly unclear how to give access to such data to outsiders. (Our work on the evolution of web databases used a 3.3Tb dataset -- I have no idea how to even make the data available.) Other issues include copyrighted datasets, e.g., archives of newspaper articles. Despite these issues, I believe that at the end it is relatively easy to give access to the used datasets. See, for example, the UCI Machine Learning Repository, the UCR Time Series, the Linguistic Data Consortium, the Wharton Research Data Services (WRDS), and Daniel Lemire's set of pointers. (Feel free to post more pointers in the comments.)

The second aspect is access to the underlying code. One may argue that instead of giving access to the code we should describe clearly how to implement the algorithms, give the settings, and so on. This avoids any intellectual property issues, and everyone is happy. Personally, I do not buy this. No matter how nicely someone implements someone else's algorithms, nobody is going to spend much of time optimizing the code for a competing technique. This may lead to flawed experimental comparisons. Another alternative is to use common datasets and simply pick the performance numbers from the published paper, without reimplementing the competing technique. (This works only when the underlying hardware is irrelevant -- e.g., for precision/recall experiments in information retrieval.)

My own take? Encourage publication of open source software. If the code is open and available, comparisons are easy, and the whole issue of experimental repeatability becomes moot. No need for committees to verify that the reported results are indeed correct, no need to upload code into machines with different architecture, making sure that the code runs without any segmentation faults, and so on. If the code is available, even if the results are incorrect, someone will catch that in the future. (If the results are incorrect, the code and data is available, and nobody cares to replicate the results, then experimental repeatability is a moot point.)

Now, it is easy to talk about open source, but anyone who tries knows what a pain it is to take the scripts used to run experiments and make them ready to use by anyone else. (Or even to be reused later, from the author :-) Therefore, we need to give further incentives. The idea of the JMLR journal to have a track for submissions of open source software; this track serves as "a venue for collection and dissemination of open source software"

Perhaps this is the way to proceed, an alternative to the "experimental repeatability requirements" that may be too difficult to follow.

Wednesday, June 6, 2007

Playing with Wikipedia

I was working with Wisam Dakka on a Wikipedia project, and I was puzzled by some Wikipedia entries that had really long titles.

The first one that I noticed was a term with 163 characters: "Krungthepmahanakornamornratanakosinmahintarayu
tthayamahadilokphopnopparatrajathaniburiromudomra
janiwesmahasatharnamornphimarnavatarnsathitsakkattiyavi"
which redirects to Bangkok. I do not know if this is a prank, or a valid entry. (Update: It is a correct entry, according to the talk page of the entry.)

Then, I noticed another term with 255 characters:
"Wolfeschlegelsteinhausenbergerdorffvoralternwarenge wissenhaftschaferswessenschafewarenwohlgepflegeund sorgfaltigkeitbeschutzenvonangreifendurchihrraubgierig
feindewelchevoralternzwolftausendjahresvorandieer
scheinenwanderersteerdemenschderraumschiffgebrauchl,"
which in fact is a valid term and the 255 characters is simply a shortcut for the 580 character entry :-)

Finally, there is a term with 182 characters:
"Lopadotemachoselachogaleokranioleipsanodrimhypotrim
matosilphioparaomelitokatakechymenokichlepikossypho
phattoperisteralektryonoptekephalliokigklopeleio
lagoiosiraiobaphetraganopterygon,"
that has Greek roots, and I will let you click to find out its exact meaning.

Also, these entries seem to trigger some buggy behavior on Google. If you do a web search for the above terms, you will find no web page with these words. However, Google returns a set of product matches on Google, none of which is really correct.

The joy of large-scale data processing!