
Google Flu Trends: Ohio has the flu

Google Flu Trends is a fascinating example of data aggregation & correlation with real-world impact.

According to Google, they’re only counting “every person who searches for ‘flu’”, but what’s interesting is that during the 2007-2008 flu season Google was able to accurately estimate current flu levels 1-2 weeks faster than the CDC. Today, the data is as follows:

image

But in counting searches, there’s a problem… not everyone wants to be counted:

A Google employee, Niniane Wang, demonstrated conclusively that a lot of the talk about "anonymized" logs is just pure B.S., and people like Greg Linden have shown that it’s possible to identify people even from aggregate data. So I think some of the Google apologists here are being a bit too partisan. But in this specific case, I think it’s perfectly legitimate (and even honorable) to mine the data in aggregate. – Joshua Allen

I wondered how data pulled out of Google’s front end (crawled web content, available to anyone) compares to what only Google can pull out of the back end (what users are searching for, theoretically not available to everyone).

I wrote an Excel UDF to pull “estimated result count” from http://www.google.com/search?q= (Google’s search SOAP API is no longer supported for new users) to see how many blogs/Twitterers and other sites are saying “’I have the flu’ <State Name>” today, per capita.
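For anyone who wants to try the same lookup outside Excel, here is a rough Python sketch of the two pieces involved: building the query URL and pulling a number out of the returned page. It is not the UDF I used. The function names are made up for illustration, and the "About N results" wording the parser looks for is an assumption about the page text, not a documented format.

```python
import re
from urllib.parse import quote_plus

SEARCH_URL = "http://www.google.com/search?q="  # base URL quoted in the post


def build_search_url(state):
    """Build the search URL for the phrase "I have the flu" plus a state name."""
    query = '"I have the flu" ' + state
    return SEARCH_URL + quote_plus(query)


def parse_result_count(page_text):
    """Pull an estimated result count out of page text.

    Assumption: the page contains wording like "About 156,000 results".
    Returns None when no such text is found.
    """
    match = re.search(r"([\d,]+)\s+results", page_text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits else None


if __name__ == "__main__":
    print(build_search_url("Ohio"))
    print(parse_result_count("About 156,000 results"))
```

Fetching the page itself is left out on purpose; this only shows how the query is encoded and how a count could be read once you have the text.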

Several of the states where Google sees more flu searches came out near the top:

  • Ohio (1st, estimated 156000 webpages/11.5M population)
  • Vermont (3rd, 21400/2.9M)
  • Alabama (5th, 18900/4.6M)
  • Maryland (8th, 16900/5.6M)

and others landed near the bottom:

  • Kentucky (36th)
  • West Virginia (38th)
  • South Carolina (47th)
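The per capita step is just a division and a sort. Here is a small Python sketch using only the four states whose counts and populations are listed above; scaling to pages per 100,000 residents is my own choice to make the numbers readable, and the names are illustrative.

```python
# (estimated result count, population) exactly as quoted in the list above
states = {
    "Ohio": (156_000, 11_500_000),
    "Vermont": (21_400, 2_900_000),
    "Alabama": (18_900, 4_600_000),
    "Maryland": (16_900, 5_600_000),
}


def pages_per_100k(count, population):
    """Estimated matching pages per 100,000 residents."""
    return count / population * 100_000


# Highest per capita rate first
ranked = sorted(states, key=lambda name: pages_per_100k(*states[name]), reverse=True)

if __name__ == "__main__":
    for name in ranked:
        print(f"{name}: {pages_per_100k(*states[name]):.1f} pages per 100k")
```

With only these four states the order comes out Ohio, Vermont, Alabama, Maryland, which matches their relative order in the full ranking.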

I’ve no conclusion here… other than Ohio really needs to drink plenty of fluids and get some rest, and something about “both ends.”

You can see which states Google users are searching about the flu from at http://www.google.org/flutrends/
