Tuesday, November 23, 2021

"Geographic Footprint of an Agent" or one of my favorite data science interview questions

 Last week we wrote in the Compass blog how we estimate the geographic footprint of an agent.

At the very core, the technique is simple: Use the addresses of the houses that an agent has bought or sold in the past; get their longitude and latitude; and then apply a 2-dimensional kernel density estimation to find what are the areas where the agent is likely to be active. Doing the kernel density estimation is easy; the fundamentals of our approach are material that you can find in tutorials for applying a KDE. There are two interesting twists that make the approach more interesting:

  1. How can we standardize the "geographic footprint" score to be interpretable? The density scores that come back from a kernel density application are very hard to interpret. Ideally, we want a score from 0 to 1, with 0 being "completely outside of the area of activity" and 1 being "as important as it gets". We show how to use a percentile transformation of the likelihood values to create a score that is normalized, interpretable, and very well calibrated.
  2. What are the metrics for evaluating such a technique? We show how we can use the concept of "recall-efficiency" curves to provide a common way to evaluate the models.

You can read more in the blog post.