
Monday, February 25, 2013

WikiSynonyms: Find synonyms using Wikipedia redirects

Many years back, I worked with Wisam Dakka on a paper about creating faceted interfaces for text collections. One of the requirements for that project was to discover synonyms for named entities. While we explored a variety of directions, the one that I liked most was Wisam's idea to use Wikipedia redirects to discover terms that are mostly synonymous.

Did you know, for example, that ISO/IEC 14882:2003 and X3J16 are synonyms of C++? Yes, me neither. However, Wikipedia reveals that through its redirect structure.

The Wikisynonyms web service

What do we mean by redirects? Well, if you try to visit the Wikipedia page for President Obama, you will be redirected to the canonical page Barack Obama. Effectively, "President Obama" is deemed by Wikipedians to be a close synonym of "Barack Obama", hence the redirect. Similarly, the term "Obama" is also a redirect, etc. (You can check the full list of redirects here.)

While I was visiting oDesk, I felt that this service could be useful for a variety of purposes, so, following the oDesk model, we hired a contractor to implement this synonym extraction as a web API and service. If you want to try it out, please go to:

http://wikisynonyms.ipeirotis.com
The API is very simple. Just issue a GET request like this:

curl 'http://wikisynonyms.ipeirotis.com/api/{TERM}'

For example, to find synonyms for Hillary Clinton:

curl 'http://wikisynonyms.ipeirotis.com/api/Hillary_Clinton'

and for Obama:

curl 'http://wikisynonyms.ipeirotis.com/api/Obama'
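If you prefer to call the API from code, here is a minimal Python sketch. It assumes the requests library is available and that the service returns JSON; neither detail is documented above, so treat this as an illustration of the GET pattern rather than an official client.

import requests

# Minimal sketch of a client for the GET endpoint shown above.
# Assumptions: the 'requests' library is installed and the response is JSON.
def wiki_synonyms(term):
    url = "http://wikisynonyms.ipeirotis.com/api/" + requests.utils.quote(term)
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

print(wiki_synonyms("Hillary_Clinton"))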

Mashape integration

Since we may change the URL of the service, I would recommend registering with Mashape and accessing the WikiSynonyms service through Mashape instead:

curl 'https://wikisynonyms.p.mashape.com/{TERM}' --header 'X-Mashape-Authorization: your_mashape_key'

You can easily download Wikipedia

Interestingly enough, this synonym-extraction technique remains little known, despite how easy it is to extract these synonyms. Whenever I mention Wikipedia, most people worry that they will need to scrape the HTML from Wikipedia, and nobody likes that kind of monkey business.

Strangely, most people are unaware that you can download Wikipedia in relational form and load it directly into a database. In fact, you can download only the parts that you need: for this technique, the page and redirect tables from the SQL dumps at https://dumps.wikimedia.org/ are enough.

This redirect structure (as opposed, say, to the normal link structure and the related anchor text) is highly precise. By eyeballing the results, I would guess that precision is around 97% to 99%.
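If you would rather bypass the web service entirely, the underlying idea is simple: once the page and redirect tables are loaded into a local database, the synonyms of a canonical title are just the titles of the pages that redirect to it. Here is a rough sketch; the pymysql driver, the database name, and the credentials are placeholder assumptions, not part of the WikiSynonyms code.

import pymysql

# Sketch: find all article titles that redirect to a given canonical title.
# Assumptions: the Wikipedia 'page' and 'redirect' SQL dumps have been loaded
# into a local MySQL database named 'wikipedia'; credentials are placeholders.
SQL = """
    SELECT p.page_title
    FROM redirect r
    JOIN page p ON p.page_id = r.rd_from
    WHERE r.rd_namespace = 0        -- target is in the main (article) namespace
      AND p.page_namespace = 0      -- so is the redirecting page itself
      AND r.rd_title = %s           -- canonical title, e.g. 'Barack_Obama'
"""

connection = pymysql.connect(host="localhost", user="wiki", password="wiki",
                             database="wikipedia", charset="utf8mb4")
with connection.cursor() as cursor:
    cursor.execute(SQL, ("Barack_Obama",))
    for (title,) in cursor.fetchall():
        # page_title is stored as a binary column, so decode before printing
        print(title.decode("utf-8") if isinstance(title, bytes) else title)
connection.close()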

Application: Extracting synonyms of oDesk skills

One application where we used the service was to extract synonyms for the set of skills that are used to annotate the jobs posted on oDesk. For example, you can find the synonyms for C++ (URL-encoding the term):

curl 'http://wikisynonyms.ipeirotis.com/api/C%2B%2B'

Or you can find the synonyms for Python:

curl 'http://wikisynonyms.ipeirotis.com/api/Python'
Oops. As you can see, the term Python is actually ambiguous, and Wikipedia has a disambiguation page with the different 'senses' of the term. Since we do not do any automatic disambiguation, we return a 300 HTTP response and ask the user to select one of the applicable terms. Querying again with the more specific term 'Python (programming language)' then returns the synonyms for the programming language.
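On the client side, this case can be detected by checking for the 300 status code before parsing the response. The sketch below extends the earlier one; since the exact payload of the disambiguation response is not documented here, it is printed as raw text rather than parsed.

import requests

# Sketch of handling the 300 (ambiguous term) response described above.
# Assumptions: 'requests' is installed; the disambiguation payload format is
# unknown, so it is printed rather than parsed.
def wiki_synonyms(term):
    url = "http://wikisynonyms.ipeirotis.com/api/" + requests.utils.quote(term)
    response = requests.get(url)
    if response.status_code == 300:
        print("Ambiguous term; candidate senses:", response.text)
        return None
    response.raise_for_status()
    return response.json()

wiki_synonyms("Python")                                 # triggers the 300 branch
print(wiki_synonyms("Python (programming language)"))   # a specific sense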


Open source and waiting for feedback

The source code, together with the installation instructions for the service, is available on GitHub. Feel free to point out any problems or suggestions for improvement. And thanks to oDesk Research for all the support in creating the service and making it open source for everyone to use.

Thursday, February 23, 2012

Crowdsourcing and the end of job interviews

When you discuss crowdsourcing solutions with people who have not heard of the concept before, they tend to ask: "Why is crowdsourcing so much cheaper than existing solutions that depend on 'classic' outsourcing?"

Interestingly enough, this is not a phenomenon that appears only in crowdsourcing. The Sunday edition of the New York Times has an article titled Why Are Harvard Graduates in the Mailroom?. The article discusses the job-search strategy in some fields (e.g., Hollywood, academia), where talented young applicants are willing to start with jobs that pay well below what their skills deserve, in exchange for the chance to make it big later:

[This is] the model lottery industry. For most companies in the business, it doesn’t make economic sense to, as Google does, put promising young applicants through a series of tests and then hire only the small number who pass. Instead, it’s cheaper for talent agencies and studios to hire a lot of young workers and run them through a few years of low-paying drudgery.... This occupational centrifuge allows workers to effectively sort themselves out based on skill and drive. Over time, some will lose their commitment; others will realize that they don’t have the right talent set; others will find that they’re better at something else.

Interestingly enough, this occupational centrifuge is very close to the model of employment in crowdsourcing.

In crowdsourcing, there is very little friction in entering and leaving a job. In fact, this is the crucial difference from traditional modes of employment: there is no interview, and the employment is truly at will. You want to work on a task? Start working. You are bored? Stop working. No friction from an interviewing and hiring process, and no friction if the worker decides to stop working.

As in the case of Hollywood and academia, the evaluation is done on the job. While the model is currently applied mainly to small tasks, there is nothing that fundamentally prevents it from being applied to any other form of employment. With the Udacity and Coursera model, we are starting to see the concept being applied to education. Later on, we may see other jobs adapting this model for their purposes (stock trading, anyone?).

What you observe in such settings is that the distribution of participation and engagement is heavy-tailed, tending to follow a power law: a few participants provide a significant amount of the input, while there is a long tail of participants who come, do a few things (complete HITs on MTurk, write Wikipedia articles, watch lectures and do homework on Coursera, trade stocks, pick your task...), and then leave.
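To make the heavy-tailed claim a bit more concrete, here is a tiny illustrative simulation; the Zipf exponent and the number of workers are arbitrary assumptions, not estimates from any real marketplace.

import numpy as np

# Illustrative only: simulate heavy-tailed (Zipf-distributed) participation.
# The exponent and population size below are arbitrary assumptions.
rng = np.random.default_rng(0)
tasks_per_worker = rng.zipf(a=2.0, size=100_000)
tasks_per_worker = np.sort(tasks_per_worker)[::-1]               # most active first

top_workers = tasks_per_worker[: len(tasks_per_worker) // 100]   # top 1%
share = top_workers.sum() / tasks_per_worker.sum()
print(f"Top 1% of workers complete {share:.0%} of all tasks")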

What does it mean to have a power law distribution of participation in crowdsourced projects?

It means that the long tail of occasional participants is just not naturally attracted to the task. The persistent few are the good matches for the task. This is self-selection at its best.

No interview needed, and only the people that are truly interested stick around.

Crowdsourcing is the new interview. 

The selection of the best participants happens naturally, without the artificial introduction of a selection process mediated through an interview. The interview is an artificial process. It tries to keep unqualified participants away from the task and to identify the ones that are best. This is an imperfect filter: it has false positives and false negatives. Many people are hired with great hopes, only to be proven later to be ill-suited for the task (false positives). And many good people never get the chance to work on a task just because they do not look good on paper (false negatives; I am dying to make a Jeremy Lin joke here...).

Think now of an environment where everyone gets a shot at working on something they are interested in. No friction in getting hired or getting fired. You get the benefit of the best people working on the tasks that they are best at. [You ask: what if there are fewer dream jobs than available labor? What to do when training on the job is not possible (cough, doctors, cough)? Let me dream for now, and let's bury under the carpet the millions of details that need to be addressed before this mode of operation has a shot at becoming reality.]

To answer the question posed at the beginning of the post, "Why is crowdsourcing so much cheaper than existing solutions that depend on 'classic' outsourcing?": the process of self-selection in matching workers and tasks is the key reason why crowdsourcing is typically cheaper than the traditional process of directly assigning tasks to people. The easier it is for the crowd to find jobs they like, the more efficient the matching and execution.

When you effectively have the most interested, self-selected people working on a given task, the productivity of the team is much higher than that of a team consisting of people who may simply be bored or not very interested in the task. Just consider the productivity of five programmers who are dedicated and enthusiastic about what they are building, compared to a similar team of five programmers who were assigned the task by someone else and have to implement it regardless.

At oDesk, there is a significant effort to improve the matching of projects and contractors, by showing contractors the best projects for them and showing employers the best contractors for a task. My own dream is to eliminate the friction of interviewing and make the process of finding a job and working as seamless as possible.

Monday, July 9, 2007

Blending Mind Maps with Wikipedia

Following up on the topic of how to use wikis to organize the research literature about a given topic.

I have been thinking about the idea of using wikis to keep track of the state of the art in any field. One way to organize the literature is to transform an existing survey article into a wiki and let people edit it at will (e.g., see this article on duplicate record detection). Another alternative is to keep a set of tables in a wiki that summarize the results of each paper in a single line (e.g., see here and here for NLP-related tasks).

A problem with both approaches is the lack of a network structure that would allow people to mark and annotate the relations across different papers. The idea of using CMapTools (http://cmap.ihmc.us/) was interesting, but I could not manage to install the software to see how it really works.

Today, I ran into WikiMindMap, a web-based service that uses the wiki structure to convert Wikipedia articles into mind maps. The heuristics that it uses are rather basic, but one could use this tool to organize the literature in a wiki (keeping the advantage of collaboration) while still having all the advantages of a visual tool that can show connections across entities. See, for example, the wikified article mentioned above transformed into a mind map.

Friday, July 6, 2007

Wikipedia-based Web Services

We have launched a beta version of some web services that use Wikipedia data for named entity extraction, for term identification, and for document expansion. You can take a look and try the services at:

http://wikinet.stern.nyu.edu/Wikinet/Wikinet.aspx

We are actively working on this, so if you see a bug or some unexpected behavior, feel free to drop me an email.

Wednesday, June 6, 2007

Playing with Wikipedia

I was working with Wisam Dakka on a Wikipedia project, and I was puzzled by some Wikipedia entries that had really long titles.

The first one that I noticed was a term with 163 characters: "Krungthepmahanakornamornratanakosinmahintarayutthayamahadilokphopnopparatrajathaniburiromudomrajaniwesmahasatharnamornphimarnavatarnsathitsakkattiyavi", which redirects to Bangkok. I do not know if this is a prank, or a valid entry. (Update: It is a correct entry, according to the talk page of the entry.)

Then, I noticed another term with 255 characters: "Wolfeschlegelsteinhausenbergerdorffvoralternwarengewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbeschutzenvonangreifendurchihrraubgierigfeindewelchevoralternzwolftausendjahresvorandieerscheinenwanderersteerdemenschderraumschiffgebrauchl", which in fact is a valid term; the 255 characters are simply a shortcut for the 580-character entry :-)

Finally, there is a term with 182 characters: "Lopadotemachoselachogaleokranioleipsanodrimhypotrimmatosilphioparaomelitokatakechymenokichlepikossyphophattoperisteralektryonoptekephalliokigklopeleiolagoiosiraiobaphetraganopterygon",
that has Greek roots, and I will let you click to find out its exact meaning.
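If you are curious how one might hunt for such entries, here is a sketch of the query I would run over a local copy of the page table (same placeholder MySQL setup as in the WikiSynonyms sketch above). Note that page_title is stored in a 255-byte column, which is consistent with the 255-character "shortcut" entry.

import pymysql

# Sketch: list the longest article titles in a locally loaded 'page' table.
# Assumptions: same placeholder database name and credentials as before.
SQL = """
    SELECT page_title, LENGTH(page_title) AS title_length
    FROM page
    WHERE page_namespace = 0
    ORDER BY title_length DESC
    LIMIT 10
"""

connection = pymysql.connect(host="localhost", user="wiki", password="wiki",
                             database="wikipedia", charset="utf8mb4")
with connection.cursor() as cursor:
    cursor.execute(SQL)
    for title, title_length in cursor.fetchall():
        title = title.decode("utf-8") if isinstance(title, bytes) else title
        print(title_length, title)
connection.close()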

Also, these entries seem to trigger some buggy behavior on Google. If you do a web search for the above terms, you will find no web page with these words; however, Google returns a set of product matches, none of which is really correct.

The joy of large-scale data processing!