05 Jul 2011
Metadata and Collaborative Filtering

In last week’s post, I discussed how Apple doesn’t sweat perfection. At a very high level, the company gets the big stuff right. In this post, I’d like to extend that concept to unstructured data and Collaborative Filtering (CF), defined by Wikipedia as:
the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data including sensing and monitoring data – such as in mineral exploration, environmental sensing over large areas or multiple sensors; financial data – such as financial service institutions that integrate many financial sources; or in electronic commerce and web 2.0 applications where the focus is on user data, etc.
For instance, let’s say that I am a fan of Rush, the criminally underrated Canadian power trio. In iTunes, I rate their songs, along with those from other bands. Because I enjoy Rush’s music, I am likely to like Genesis, Pink Floyd, Yes, and other 70s prog rock bands. (I do.) I rate many songs by those bands high as well, helping Apple learn about my listening habits.
Now, multiply that by millions of people. While no two people may have precisely the same taste in music (or books, apps, movies, TV shows, or art, for that matter), the technology and data collectively allow Apple to learn a great deal about group listening habits. As a result, its technology can make relevant recommendations to me.
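To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in Python. It is not Apple's method, which isn't public; the listeners, the star ratings, and the `cosine` and `recommend` helpers are all made up for illustration. The sketch scores bands I haven't rated by how highly similar listeners rated them.

```python
from math import sqrt

# Hypothetical 1-5 star ratings; every listener and score here is invented.
ratings = {
    "me":   {"Rush": 5, "Yes": 4},
    "ana":  {"Rush": 5, "Genesis": 4, "Yes": 5},
    "ben":  {"Rush": 4, "Pink Floyd": 5, "Yes": 4},
    "cara": {"Rush": 5, "Genesis": 5, "Pink Floyd": 4},
    "dev":  {"Beyoncé": 5, "Rihanna": 4},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts (unrated counts as 0)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[band] * b[band] for band in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=3):
    """Rank bands the user hasn't rated by similarity-weighted ratings from others."""
    mine = ratings[user]
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(mine, theirs)
        if sim <= 0:
            continue  # listeners with nothing in common contribute nothing
        for band, stars in theirs.items():
            if band not in mine:
                scores[band] = scores.get(band, 0.0) + sim * stars
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("me", ratings))
```

In this toy data set, the listener who only rates Beyoncé and Rihanna shares no bands with me, so those ratings never influence my recommendations. Real systems work at a vastly larger scale and with far more signals, but the basic intuition is the same.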
The Law of Large Numbers
But let’s say that some people out there have odd tastes. They like Rush but they’re also huge Beyoncé fans. (There’s nothing wrong with her or her music; it’s just that most people don’t like both her and Rush.) They give high marks to the latest Beyoncé album, as well as Rush’s most recent release. Won’t this throw off Apple’s rating systems?
Or what about those who hate Rush? Sadly, many people not only don’t share my passion for the band, but actively despise their cerebral approach to lyrics and odd time signatures. And, yes, many people hate the band because of Geddy Lee’s voice. What if they intentionally rate Pink Floyd songs high but Rush songs low? What if they rate Rush songs high and intentionally suggest songs in a completely unrelated genre such as Gangsta Rap? Won’t this make Apple’s recommendations less relevant?
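In a word, no. Apple’s recommendation technology takes advantage of the law of large numbers, which states, in a nutshell, that the average of a sample converges on the true underlying value as the sample grows. In practice, that means large sample sizes can withstand a few inaccurate entries, selections, and flat-out errors.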
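A quick simulation shows why. The sketch below is only an illustration under made-up assumptions: honest listeners mostly give 4s and 5s (a true average of 4.5 stars), and a fixed group of 20 saboteurs each give one star. The `average_rating` function and all of the numbers are hypothetical.

```python
import random

def average_rating(n_honest, n_bogus, seed=0):
    """Average star rating when a fixed number of bogus 1-star votes is mixed in."""
    rng = random.Random(seed)
    # Assumed honest behavior: 10% give 3 stars, 30% give 4, 60% give 5 (mean 4.5).
    honest = rng.choices([3, 4, 5], weights=[1, 3, 6], k=n_honest)
    bogus = [1] * n_bogus  # saboteurs who intentionally rate low
    all_ratings = honest + bogus
    return sum(all_ratings) / len(all_ratings)

# The same 20 bogus votes matter less and less as the honest sample grows.
for n in (100, 10_000, 1_000_000):
    print(n, round(average_rating(n, 20), 3))
```

With only 100 honest ratings, 20 bad ones drag the average down noticeably. With a million, the average sits almost exactly at the true value.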
And Apple isn’t alone. Google, Facebook, Netflix, Pandora, Amazon.com, and a host of other companies utilize this law. Coupled with accurate metadata, it lets these companies make remarkably accurate suggestions and learn a great deal about their users and customers. For more on the concept of metadata, click here.
Simon Says
Accuracy and data quality are still very important, particularly in structured data sets. One errant entry in a table can cause many problems, something that I have seen countless times. With unstructured data, however, the bar is much lower. Embrace large datasets, for they are much better able to withstand corrupt or inaccurate information.
Feedback
What say you?

