Thursday, June 21, 2012

The oDesk Flower: Playing with Visualizations

Over the past couple of weeks at oDesk, I have been trying to learn the data stored in the database, and I have been creating random plots to understand what is happening in the market. My absolute favorite source of data is the micro-level activity of the workers (when they work, how much they type, how much they move the mouse, etc.).

A few weeks back, I posted a blog entry about the activity levels of different countries, with the basic observation that the activity in the Philippines fluctuates much less within the 24-hour day than in all other countries.


You are doing it wrong: The use of radar plots

After I posted that plot, I received the following email:
This is periodic data, which means modular thinking. When you visualize periodic data using a linear plot, you necessarily have a cutting point for the x-axis, which can affect the perception of various trends in the data. You should use something similar to the Flickr Flow, e.g., a radar plot in Excel.
So, following the advice of people who really understand visualization, I transformed the activity plot into a radar plot (in Excel):

The oDesk Flower

As you can see, the comment was indeed correct. Given the periodicity of the data, a cyclical display works better than a single horizontal line, because there is no arbitrary cutting point. Beautiful to look at? Check. I called this visualization "The oDesk Flower" :-)
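For readers who want to try the same idea outside Excel, here is a minimal Python sketch of what a radar plot does with periodic data: each position in the cycle becomes an angle and each value becomes a radius, so the end of the week joins up with its start. The function name `radar_points` and the activity numbers are made up for illustration; they are not oDesk data.

```python
import math

HOURS_PER_WEEK = 7 * 24  # one full cycle of the periodic data


def radar_points(activity):
    """Map a periodic series onto a circle: index -> angle, value -> radius."""
    n = len(activity)
    if n == 0:
        return []
    points = []
    for i, value in enumerate(activity):
        angle = 2 * math.pi * i / n  # position within the cycle
        points.append((value * math.cos(angle), value * math.sin(angle)))
    points.append(points[0])  # close the loop: no cutting point on the x-axis
    return points


# Hypothetical activity for one country: a daily wave around 100 active workers.
activity = [100 + 50 * math.sin(2 * math.pi * h / 24) for h in range(HOURS_PER_WEEK)]
points = radar_points(activity)
```

The resulting (x, y) pairs can be handed to any line-drawing routine. Drawing one closed curve per country gives the petals of the flower.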

Unfortunately, it is not truly informative due to the huge number of countries in the plot. But I think it works well to give the global pace of activity over the week and across countries.

One thing that I did not like in this plot was that I could not really compare the level of activity from one country to another, since larger countries dominate the raw counts. So, I normalized the values to be the percentage of contractors from each country that are active. A new flower emerged:
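The normalization itself is simple arithmetic. As a small sketch (with invented country names and counts, and a hypothetical helper called `percent_active`), dividing each count of active contractors by the number of registered contractors from that country puts large and small countries on the same scale:

```python
def percent_active(active_counts, registered):
    """Convert counts of active contractors into percentages of registered contractors."""
    if registered <= 0:
        raise ValueError("registered must be positive")
    return [100.0 * count / registered for count in active_counts]


# Hypothetical data: (active contractors per time slot, registered contractors).
countries = {
    "Country A": ([20, 50, 30], 1000),
    "Country B": ([2, 5, 3], 100),
}

normalized = {
    name: percent_active(counts, registered)
    for name, (counts, registered) in countries.items()
}
```

In this toy example the two countries differ tenfold in raw counts but have identical normalized curves, which is exactly the kind of comparison the raw plot hides.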


For comparison, here is the corresponding linear plot, illustrating the percentage of contractors from various countries that are active at any given time:



Fighting overplotting using kernel smoothing and heatmaps

The plot above is kind of interesting and indeed it shows the pattern of activity. However, we have a lot of "overplotting", which makes the plot busy. It is hard to understand where the majority of the lines are falling.

To better understand the flow of the lines, I decided to play a little bit with R. I loaded the data set with the activity line from each country, and then used kernel-based smoothing (bkde2D) to find the regions of the space that had the highest density. To plot the result, I used a contour plot (filled.contour), which allows for the easy generation of heatmaps. Here is the R code:



and here is the resulting plot:


I like how this plot shows the typical activity across countries, which ranges from 2% to 6% of the total registered users. At the same time, we can see (the yellow-green "peaks") that there are also countries that have 8% to 10% of their users being active every week.
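To make the smoothing step more concrete, here is a rough Python stand-in. It is not the R code used for the plot: bkde2D computes a binned approximation, while this sketch evaluates a plain Gaussian kernel density estimate directly on a grid. The function name `kde2d`, the bandwidths, and the data points are all assumptions chosen for illustration.

```python
import math


def kde2d(points, bandwidth, x_grid, y_grid):
    """Direct 2D Gaussian kernel density estimate, evaluated on a grid.

    Returns density[i][j] for x_grid[i], y_grid[j].
    """
    hx, hy = bandwidth
    norm = 1.0 / (len(points) * 2 * math.pi * hx * hy)
    density = []
    for gx in x_grid:
        row = []
        for gy in y_grid:
            total = 0.0
            for px, py in points:
                u = (gx - px) / hx
                v = (gy - py) / hy
                total += math.exp(-0.5 * (u * u + v * v))
            row.append(norm * total)
        density.append(row)
    return density


# Hypothetical (hour, percent active) points: three countries near 4%, one near 9%.
points = [(h, level) for h in range(24) for level in (3.5, 4.0, 4.5, 9.0)]

x_grid = list(range(24))               # hour of day
y_grid = [0.5 * k for k in range(25)]  # 0% to 12% active, in 0.5% steps
density = kde2d(points, (1.5, 0.5), x_grid, y_grid)
```

The grid of densities is what a filled contour or heatmap routine would then color. In this toy data the cells around 4% come out denser than the cells around 9%, because three of the four lines pass through that region.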

Need for interactivity

So, what did I learn from all these exercises? While I could create nice plots, I felt that static visualizations are, in the end, of limited value. Other people cannot do any dynamic exploration of the data. Nobody can customize the plot to show a slightly different view, and in general we lack the flexibility given by, say, the visualization gadgets of Google or by the data-driven documents created using d3.js.

I would love to be able to create some more interactive plots and let other people play with and explore the data that oDesk has. Perhaps I should hire a contractor on oDesk to do that :-)