Saturday, September 22, 2007

Ambiguous First Names and Disambiguation

I was preparing an assignment for my class to introduce students to issues of data quality, and I was using Facebook data for it.

As a simple example, I wanted students to automatically infer the gender of a person given only the first name, since one third of Facebook users do not list their gender. (The "homework motivation" was the need to send letters to customers and to decide whether to use "Dear Mr." or "Dear Ms." as a greeting.) In general, the task is relatively easy, and the majority of names are not ambiguous. However, there is a set of highly ambiguous names for which inference based on first name alone is problematic. For your viewing pleasure, here are the most ambiguous first names, ordered from most to least ambiguous, together with the confidence that the name belongs to a male:

Ariel 50.00%
Yang 50.00%
Kiran 50.00%
Nikita 50.00%
Casey 49.30%
Min 46.67%
Paris 53.85%
Dorian 53.85%
Adi 45.45%
Kendall 45.45%
Quinn 54.55%
Aubrey 54.55%
Sunny 44.83%
Angel 55.32%
Yan 41.67%
Yi 41.67%
Yu 58.33%
Devon 59.46%
Nana 40.00%
Jin 38.89%
Ji 38.46%
Ming 61.54%
Taylor 37.80%
Rory 62.50%
Carey 36.36%
Sami 63.64%
Robin 34.55%
Ali 34.45%
Jean 34.09%
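
For readers who want to reproduce this kind of list, here is a minimal sketch in Python. It assumes the confidence is simply the fraction of profiles with a given first name that list "male" as their gender, and that ambiguity is measured by distance from 50%. The list above appears to be ordered that way, but this is my reading, not a statement of the exact method used. The function names and the toy records are made up for illustration and are not the Facebook data.

```python
from collections import Counter, defaultdict


def male_confidence(records):
    """Map each first name to the fraction of its profiles listed as male.

    `records` is an iterable of (first_name, gender) pairs, where gender
    is "male" or "female". Profiles with no listed gender should be
    filtered out before calling this function.
    """
    counts = defaultdict(Counter)
    for first_name, gender in records:
        counts[first_name][gender] += 1
    return {
        name: c["male"] / (c["male"] + c["female"])
        for name, c in counts.items()
        if c["male"] + c["female"] > 0
    }


def most_ambiguous(confidence, top=5):
    """Return the names whose male fraction is closest to 50%."""
    return sorted(confidence, key=lambda name: abs(confidence[name] - 0.5))[:top]


# Made-up toy records, only to show the shape of the input.
records = [
    ("Ariel", "male"), ("Ariel", "female"),
    ("Robin", "male"), ("Robin", "female"), ("Robin", "female"),
    ("John", "male"), ("John", "male"),
]
confidence = male_confidence(records)
ranking = most_ambiguous(confidence)
```

With real data it is worth dropping names that appear only a handful of times, since a name seen twice (once male, once female) would score a perfect 50% without telling us much.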

The next part of the homework, motivated by the ambiguity of some first names, asks students to guess the gender of a person based on the other preferences stated on Facebook profiles: movies, books, TV shows, and so on.
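
One possible way to approach this part is a naive-Bayes-style score: for every listed item, estimate how much more often it appears on male profiles than on female ones, then add up those log-odds for a new profile. The sketch below is only an illustration under that assumption, not the solution expected in the homework. The function names, the add-one smoothing, and the toy profiles are all hypothetical.

```python
import math
from collections import Counter


def feature_log_odds(profiles, smoothing=1.0):
    """Smoothed log of P(item | male) / P(item | female) for each item.

    `profiles` is a list of (gender, items) pairs, where `items` holds
    the movies, books, shows, and activities listed on one profile.
    """
    listed = {"male": Counter(), "female": Counter()}
    totals = Counter()
    for gender, items in profiles:
        totals[gender] += 1
        listed[gender].update(set(items))  # count each item once per profile
    vocabulary = set(listed["male"]) | set(listed["female"])
    return {
        item: math.log((listed["male"][item] + smoothing) / (totals["male"] + 2 * smoothing))
        - math.log((listed["female"][item] + smoothing) / (totals["female"] + 2 * smoothing))
        for item in vocabulary
    }


def guess_gender(items, log_odds):
    """Sum the log-odds of the known items; the sign of the sum is the guess."""
    score = sum(log_odds.get(item, 0.0) for item in set(items))
    if score > 0:
        return "male"
    if score < 0:
        return "female"
    return "unknown"  # no informative items, or the evidence cancels out


# Made-up toy profiles, only to show the shape of the input.
profiles = [
    ("female", ["Dirty Dancing", "dancing"]),
    ("female", ["Anne Of Green Gables", "dancing"]),
    ("male", ["Moneyball", "baseball"]),
    ("male", ["Terminator 2", "baseball"]),
]
log_odds = feature_log_odds(profiles)
```

Sorting `log_odds` by value gives the most male-leaning and most female-leaning items at the two ends. Note that the score ignores the overall male/female ratio of the training profiles, which a full naive Bayes classifier would add as a prior.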

Based on the analysis of these features, women overwhelmingly favor the books "Something Borrowed," "Flyy Girl," "Good In Bed," "The Other Boleyn Girl," and "Anne Of Green Gables," as well as the movie "Dirty Dancing," and they like dancing as an activity.

On the other hand, characteristics that are unique to men are movies like "Terminator 2," "Wall Street," "Unforgiven," "The Good, the Bad and the Ugly," and "Seven Samurai"; the book "Moneyball"; sports-related activities (baseball, lifting); and sports-related TV shows (e.g., PTI, SportsCenter, Around the Horn). Another distinguishing feature of men is that they list "women" and "girls" as their interests (and in this case they should perhaps also think about taking some dancing lessons :-)