Thursday, May 20, 2010

Google Prediction API: Commoditization of Large-Scale Machine Learning?

Today Google announced the availability of the Google Prediction API. In brief, it allows users to upload massive data sets into Google Storage and then lets Google build a supervised machine learning model (aka classifier) from the data. This is big news!

Google seems to promise great simplicity: Upload data in CSV format and Google takes care of the rest. They select the appropriate model for the data, train the model, report accuracy statistics, and let you classify new instances. Building classifiers from large-scale datasets becomes trivial.
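To make the "upload CSV, get back a classifier" workflow concrete, here is a minimal local sketch in Python. It is not the Prediction API and calls nothing from Google: the function `auto_train`, the two toy models, the holdout split, and the sample data are all my own assumptions, chosen only to illustrate the idea of trying several candidate models on the data, reporting accuracy, and classifying new instances. I also assume a CSV layout with the label in the first column.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: label first, then numeric features.
DATA = """spam,9.0,1.0
ham,1.0,8.0
spam,8.0,2.0
ham,2.0,9.0
spam,9.5,0.5
ham,0.5,9.5
spam,7.0,1.5
ham,1.5,7.0
"""

def load_csv(text):
    """Parse rows of the form: label, feature1, feature2, ..."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if rec:  # skip blank lines
            rows.append((rec[0].strip(), [float(x) for x in rec[1:]]))
    return rows

class MajorityClass:
    """Baseline: always predict the most frequent training label."""
    def fit(self, rows):
        self.label = Counter(lbl for lbl, _ in rows).most_common(1)[0][0]
        return self

    def predict(self, features):
        return self.label

class NearestCentroid:
    """Predict the label whose mean feature vector is closest."""
    def fit(self, rows):
        sums, counts = {}, Counter()
        for label, feats in rows:
            acc = sums.setdefault(label, [0.0] * len(feats))
            for i, value in enumerate(feats):
                acc[i] += value
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in acc]
            for label, acc in sums.items()
        }
        return self

    def predict(self, features):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))
        return min(self.centroids, key=lambda lbl: sq_dist(self.centroids[lbl]))

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model.predict(f) == lbl for lbl, f in rows) / len(rows)

def auto_train(rows, holdout_every=3):
    """Fit each candidate model and keep the one with the best holdout accuracy."""
    holdout = rows[::holdout_every]
    train = [r for i, r in enumerate(rows) if i % holdout_every != 0]
    candidates = [cls().fit(train) for cls in (MajorityClass, NearestCentroid)]
    best = max(candidates, key=lambda m: accuracy(m, holdout))
    return best, accuracy(best, holdout)

model, holdout_accuracy = auto_train(load_csv(DATA))
print(type(model).__name__, holdout_accuracy)
print(model.predict([8.5, 1.0]))
```

The caller never names a model: the data goes in and a trained classifier plus an accuracy figure comes out. A hosted service would presumably search over far stronger models at far larger scale, but the contract sketched here is the same one the announcement describes.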

While I have not had the chance to access the API, this seems to be a game changer. The ability to scale models to work with massive datasets was beyond the reach of many, and now it suddenly becomes a commodity. Research labs that want to build classifiers as tools (and not as the focus of their research) will be able to do so without requiring much expertise. Similarly, startups will be able to use a scalable machine learning infrastructure without having access to an in-house expert.

In a sense, it seems to bring machine learning to the masses, raising the performance baseline to a very high level. If Google Predict is "good enough", will people seek more advanced solutions? The optimizer of MySQL pretty much sucks, but it is "good enough" for many.

Will Google Predict make large-scale machine learning a commodity? Does it mean that the value is now in having the data and in feature engineering? Unclear, but definitely a plausible scenario.

I will withhold further commentary until I manage to get access to the API. But I am excited!

3 comments:

  1. This looks pretty exciting. It seems a great path to incorporate the leading experience in a web service and scale that up.

    Would it be possible to use the same model for computing research? E.g., a researcher would provide a web service to access their latest algorithm (and someone like Google would host it). It would then be pretty easy to "use a method developed in [14]".

    Btw, it has always been that most of the value is in the feature engineering.

  2. I think that lately you were getting a lot of value by being able to tune the algorithm properly, and by being able to use massive datasets. The latter task was not trivial, as it required setting up Hadoop, Mahout, etc.

    In my mind, it moves machine learning towards a "query optimizer" mode: In databases, you give the data and the query, you get an execution plan from the optimizer. In machine learning, you give the data and get back a prediction model.

  3. Hello, I apologize for contacting you in this fashion, but time is at a premium ( work, kids, etc ) but I think, for promotional purposes, you might be interested in submitting your site to my new tech directory…The Tech Directory at thetazzone.net

    I’m assuming comments are moderated so when I click submit this post won’t automatically appear on site, if it does, I again apologize.
