Monday, February 25, 2013

WikiSynonyms: Find synonyms using Wikipedia redirects

Many many years back, I worked with Wisam Dakka on a paper to create faceted interfaced for text collections. One of the requirements for that project was to discover synonyms for named entities. While we explored a variety of directions, the one that I liked most was Wisam's idea to use the Wikipedia redirects to discover terms that are mostly synonymous.

Did you know, for example, that ISO/IEC 14882:2003 and X3J16 are synonyms of C++? Yes, me neither. However, Wikipedia reveals that through its redirect structure.

The Wikisynonyms web service

What we mean by redirects? Well, if you try to visit the Wikipedia page for President Obama, you will be redirected to the canonical page Barack Obama. Effectively "President Obama" is deemed by Wikipedians to be a close synonym of "Barack Obama", and therefore the redirect. Similarly, the term "Obama" is also a redirect, etc. (You can check the full list of redirects here.)

While I was visiting oDesk, I felt that this service can be useful for a variety of purposes so, following the oDesk model, we hired a contractor to implement this synonym extraction as a web API and service. If you want to try it out please go to:

The API is very simple. Just issue a GET request like this:

curl '{TERM}

For example, to find synonyms for Hillary Clinton:

and for Obama

Mashape integration

Since we may change the URL of the service, I would recommend registering and using Mashape to access the WikiSynonyms service through Mashape instead:

curl '{TERM}' --header 'X-Mashape-Authorization: your_mashape_key'

You can easily download Wikipedia

Interestingly enough, this synonym extraction technique remains little-known, despite the easiness of extracting these synonyms. And whenever I mention Wikipedia, most people are worried that they will need to scrape the HTML from Wikipedia, and nobody likes this monkey business. 

Strangely, most people are unaware that you can download Wikipedia in a relational form and put it directly in a database. In fact, you can download only the parts that you need. Here are the basic links:
This redirect structure (as opposed, say to the normal link structure and the related anchor text) is highly precise. By eyeballing the results, I would guess that precision is around 97% to 99%.

Application: Extracting synonyms of oDesk skills

One application that we used the service was to extract synonyms for the set of skills that are used to annotate the jobs posted on oDesk. For example, you can find the synonyms for C++:

Or you can find the synonyms for Python:

Oops, as you see the term Python is actually ambiguous, and Wikipedia has a disambiguation page with the different 'senses' of the term. Since we are not doing any automatic disambiguation, we return a 300 HTTP response and ask the user to select one of the applicable terms. So, if we query now with the term 'Python (programming language)' we get:

Open source and waiting for feedback

The source code together with the installation instructions for the service is available on GitHub. Feel free to point any problems or suggestions for improvement. And thank oDesk Research for all the support in creating the service and making it open source for everyone to use.