News Grader: our KNC 2011 entry

Tomorrow, December 1, is the deadline for entries in the Knight News Challenge, an annual contest for innovative ideas at the intersection of technology and news, funded by the John S. and James L. Knight Foundation.

I (along with my buddy Eric Marden) have an entry in this year’s competition.  It’s a project we’re calling News Grader.  Here’s the meat and potatoes of the proposal:

We propose building a web service which can intelligently parse and analyze text it has never seen before, and offer insight into the quality of content, along with the degree to which the filter is confident in its analysis: the “News Grade”. We envision making this available in much the same way that Akismet and OpenCalais provide their services – as an open API, giving third-party developers an easy way to make use of News Grader’s analysis and quality ratings. This API will enable the easy integration of our system into a variety of formats which could include browser extensions, CMS plugins, and desktop/web applications such as RSS readers, news aggregators or social networking software.

We believe we can do most of the heavy lifting using well-known algorithms, including a modified version of Bayesian induction, the Porter stemmer algorithm, entity extraction algorithms, and manifold learning algorithms. We intend to identify and weight word clusters in an article, and compare a new article to the ratings of other articles with similar word clusters, amongst other techniques.
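To make that a little more concrete, here's a minimal sketch of the Bayesian piece: a naive Bayes classifier over stemmed words that returns both a grade and a confidence. It is an illustration only, not the system we proposed. The `crude_stem` function is a deliberately simplistic stand-in for a real Porter stemmer, and the class name, labels, and training sentences are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def crude_stem(word):
    """Simplistic stand-in for a Porter stemmer (assumption: NOT the real
    algorithm). Strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


class NaiveBayesGrader:
    """Hypothetical grader: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> stem -> count
        self.doc_counts = Counter()              # label -> number of articles
        self.vocab = set()

    def train(self, text, label):
        """Record one article that a human has already rated as `label`."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def grade(self, text):
        """Return (most likely label, confidence between 0 and 1)."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            # Start from the prior: how common is this label overall?
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                count = self.word_counts[label][tok]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            log_scores[label] = score
        # Turn log scores into normalized probabilities.
        peak = max(log_scores.values())
        weights = {lbl: math.exp(s - peak) for lbl, s in log_scores.items()}
        norm = sum(weights.values())
        best = max(weights, key=weights.get)
        return best, weights[best] / norm
```

You would call `train()` once per rated article, then `grade()` on new text. A real version would swap in a proper stemmer and work on word clusters rather than single words.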

This automated process will be supplemented by a mechanism for users to provide structured feedback, which will allow the web service to “learn”, and which will increase the quality of the analysis provided over time. The more that people use the service, the smarter it will get. It also makes it more difficult to “game” the analysis of a given piece of content, since the machine intelligence will compare new content to other content it’s previously encountered, and will weight new feedback as only a portion of its analysis.
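One way to picture "feedback as only a portion of the analysis" is a capped blend of the machine's score and the average user rating. The sketch below is an assumption about how that might look, not a formula from our proposal; the function name, the 0.3 cap, and the saturation constant are all invented for illustration.

```python
def blended_grade(model_score, feedback_scores,
                  max_feedback_weight=0.3, saturation=20):
    """Blend a model score with user feedback (all values in 0..1).

    Hypothetical scheme: feedback influence grows with the number of
    ratings but never exceeds `max_feedback_weight`, so a flood of
    ratings cannot fully override the machine analysis.
    """
    if not feedback_scores:
        return model_score
    n = len(feedback_scores)
    feedback_mean = sum(feedback_scores) / n
    # Weight approaches max_feedback_weight as n grows, but never reaches it.
    weight = max_feedback_weight * n / (n + saturation)
    return (1 - weight) * model_score + weight * feedback_mean
```

With these made-up defaults, even a thousand zero ratings can pull an article's score down by less than 30%, which is the anti-gaming property described above.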

You can read our full proposal here.  If (and that’s a big if, since there are hundreds of entries) the KNC people end up being interested in the idea, you can be sure I’ll be writing more about it when we put together a business plan and timeline for the second round of the competition.

I’ve been pretty interested in mechanisms for machine learning lately, and since the KNC folks were specifically looking for some proposals about authenticity, trust, and content discrimination, I thought this idea might be up their alley.

Of course, some folks think that open-ended machine learning systems are an all-too-common startup idea which never seems to quite work out.  To those folks, I’d like to point out that useful expert systems have been relatively rare until pretty recently, simply because doing it right is computationally intensive.

Furthermore, I think it’s worth noting that where this sort of system has worked in the past, it’s worked really well.  Netflix, for instance, paid out a million-dollar prize in 2009 for improving their recommendation algorithms by a mere 10%.  Amazon relies on its recommendation system as a driver of sales.  There’s just no question that systems like these can work; the only questions are what you want to measure, and how you use the information.

Netflix and Amazon want to predict what an individual will think about a particular recommendation.  Will you buy it?  Will you like it?  And that’s a great idea — it drives commerce on these sites, and makes them more useful for users.

But the questions we’re interested in answering don’t rely on personalization; we’re not so much interested in what a particular user cares about.  We’re interested in predictive modeling of things like bias, completeness, and novelty, independent of the tastes of a particular user.  The question we’re asking is, “Can we discover good journalism, regardless of subject matter?”

We think the answer is going to be that we can.  We won’t know until we actually run the experiment, however.  As far as I can determine, nobody’s ever tried precisely what we’re proposing vis-a-vis journalism on the web.  Only time will tell if we get the opportunity to try.

**Update, 1/12/2010: The Knight Foundation declined our proposal. Anyone want to fund the idea? Otherwise, I’d say it’s dead in the water.**