Lately, we have been working on a project to improve local search for hotels. The project, funded by a Microsoft Virtual Earth award, incorporates into the ranking function features such as the price of the hotel, its amenities, geographic characteristics such as "proximity to a beach" (inferred by classifying satellite images), and customer reviews. The final outcome is a ranking function that we call "value for the money".
Of course, the most important thing is to actually test how good the final ranking is. For this reason, we generated our ranking for a few US cities and compared it against several baselines, such as "rank by distance", "rank by price", "rank by Tripadvisor ranking", and so on. To give some guidance, we gave each ranking its corresponding title, presented them to users, and asked them to choose the ranking that they preferred most.
Overwhelmingly, the users picked our ranking, which was titled "Value for the Money". The difference was so striking that we got suspicious. Therefore, we decided to run the following experiment: we swapped the titles of the rankings and labeled a baseline technique as "Value for the Money". Interestingly enough, the baseline technique, which was the worst-performing one, ended up being the most preferred under the new title! In other words, we simply gave a fancy title to a bad ranking technique and the users were convinced that it was the best possible ranking!
Finally, we decided to run a truly blind test. We presented pairs of ranked hotel lists to the users, without any titles, and asked the users to pick the one that they liked best. Our ranking strongly outperformed the other baselines, which I guess is a good thing :-). But I think that the important lesson from this experiment was to really compare apples to apples. Add even subtly "non-equal" external elements, and the experiment can give deceptive results.
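For readers who want to run a similar blind pairwise test, one standard way to check whether a preference is real rather than noise is a two-sided binomial sign test on the pairwise judgments. The sketch below is illustrative only: the function name and the sample numbers (70 preferences out of 100 judgments) are hypothetical, not figures from our study.

```python
from math import comb

def sign_test_p_value(wins, trials):
    """Two-sided binomial sign test: the probability of seeing a split
    at least this lopsided if both rankings were equally preferred
    (i.e., each judgment is a fair coin flip, p = 0.5)."""
    k = max(wins, trials - wins)
    # One-sided tail probability P(X >= k) under Binomial(trials, 0.5),
    # doubled for a two-sided test and capped at 1.0.
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

# Hypothetical outcome: in 100 blind pairwise judgments, 70 users
# preferred one ranking over the other.
p = sign_test_p_value(70, 100)
print(f"p-value: {p:.6f}")
```

With a perfectly even 50/50 split the p-value is 1.0, while a 70/30 split on 100 judgments yields a p-value well below 0.001, so a preference that strong is very unlikely to be chance.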