Friday, July 6, 2007

Data Cleaning Service

Colleagues and students often ask me how to match "dirty" data, i.e., string database entries that are similar but not identical. Although the research literature on the topic is extensive, few software packages are freely available. For this reason, I decided to implement the technique that we described in our WWW2003 paper and make it available on the web for anyone to use:

http://hyperion.stern.nyu.edu/SMWebSite/Default.aspx

So, if you have two lists of names that need to be matched, you can try this service. It works well for small datasets with a few thousand entries each. I am fairly sure that it can scale to much bigger datasets, but the service will take longer to finish the job. One neat feature is that you can submit a job, keep its id, and come back later to retrieve the results.
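To give a feel for what matching two lists of names involves, here is a minimal Python sketch. It is not the code behind the service and should not be read as the exact method from the paper: it simply compares character 3-grams with an unweighted cosine similarity. The function names (`qgrams`, `cosine`, `match_lists`), the padding characters, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def qgrams(s, q=3):
    """Return the multiset of character q-grams of a normalized, padded string."""
    # Lowercase and collapse runs of whitespace so trivial differences vanish.
    s = " ".join(s.lower().split())
    # Pad so that the first and last characters take part in q grams too.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def cosine(a, b):
    """Cosine similarity between two q-gram count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_lists(left, right, threshold=0.5):
    """For each entry in `left`, return its best match in `right` if the
    similarity reaches `threshold` (an illustrative cut-off, not a tuned one)."""
    right_profiles = [(r, qgrams(r)) for r in right]
    matches = []
    for entry in left:
        profile = qgrams(entry)
        best, best_score = None, 0.0
        for candidate, candidate_profile in right_profiles:
            score = cosine(profile, candidate_profile)
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score >= threshold:
            matches.append((entry, best, best_score))
    return matches


if __name__ == "__main__":
    list_a = ["Microsoft Corp.", "Intl. Business Machines"]
    list_b = ["Microsoft Corporation", "International Business Machines", "Apple Inc."]
    for left, right, score in match_lists(list_a, list_b):
        print(f"{left!r} -> {right!r} ({score:.2f})")
```

This sketch compares every pair of entries, so its cost grows with the product of the two list sizes. That is fine for a few thousand entries, but larger lists would need some form of indexing or blocking.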

If you have any ideas or bug reports, feel free to contact me.