Misconceptions about machine learning can run fairly deep. People often have expectations of what a machine can automatically figure out that are not just unrealistic but truly impossible. I sometimes think of this as the omniscience vs. intelligence fallacy. Intelligence is the ability to generalize and learn useful patterns from data and/or experiences. This is what we humans do naturally every day. But intelligence requires data.

Very often the data that the machine learning system has access to is insufficient for anyone, even a super sophisticated intelligence, to generalize from. This is known as the problem of data sparsity, which was mentioned in Robin’s recent blog post, “Big Data, Little Data, and the 18th Century Mathematics behind Product Recommendations.”

Ideally, you would like to have a nice, dense data set with the same data points for everyone. When you do, it is relatively easy to see if certain data elements are predictive of others. For example, if I have a group of people and I know each person's age, gender, income, location, and car make, I can see how well the first four factors predict car make, and then use them to predict what kind of car a new person is likely to own. Unfortunately, in the real world the data is much more complex and much more diverse across users.
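To make the dense case concrete, here is a minimal Python sketch. The six records, the field names, and the helper `most_common_make_by` are all made up for illustration; they are not real data or anyone's production method. Because every row has every field filled in, a simple count is enough to see which car make goes with each income band.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-made records. Every row has every field (a dense table).
people = [
    {"age_band": "18-34", "income_band": "low", "car_make": "A"},
    {"age_band": "35-54", "income_band": "low", "car_make": "A"},
    {"age_band": "55+", "income_band": "low", "car_make": "B"},
    {"age_band": "18-34", "income_band": "high", "car_make": "C"},
    {"age_band": "35-54", "income_band": "high", "car_make": "C"},
    {"age_band": "55+", "income_band": "high", "car_make": "A"},
]


def most_common_make_by(records, feature):
    """For each value of `feature`, return the most frequent car make."""
    counts = defaultdict(Counter)
    for row in records:
        counts[row[feature]][row["car_make"]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


by_income = most_common_make_by(people, "income_band")
print(by_income)
```

With only six rows this proves nothing statistically, but it shows why dense data is easy to work with: every person contributes to every comparison.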

Each person who comes to your website browses a set of pages and has a certain set of profile attributes. What each user chooses to look at and buy differs from one user to the next. Think of it as a really big data table where each person has thousands of possible cells and only 20 filled in, and the 20 that are filled in are pretty different for every person. Yes, you have tons of data, but very little of it overlaps from one user to the next. If you plotted this data, it would have a lot of white space.
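The big-table picture above can be simulated in a few lines of Python. This is a rough sketch under loud assumptions: the numbers (200 users, 5,000 possible cells, 20 filled per user) are illustrative, the function names are hypothetical, and cells are filled uniformly at random, which real browsing behavior is not (popular pages get far more traffic). It measures how much of the table is filled and how many filled cells two users typically share.

```python
import random
from itertools import combinations


def simulate_users(n_users, n_cells, n_filled, seed=0):
    """Return one set of filled cell indices per user (uniform random, an assumption)."""
    rng = random.Random(seed)
    return [set(rng.sample(range(n_cells), n_filled)) for _ in range(n_users)]


def density(users, n_cells):
    """Fraction of all cells in the user-by-cell table that are filled."""
    filled = sum(len(u) for u in users)
    return filled / (len(users) * n_cells)


def mean_pairwise_overlap(users):
    """Average number of filled cells that two users have in common."""
    pairs = list(combinations(users, 2))
    return sum(len(a & b) for a, b in pairs) / len(pairs)


users = simulate_users(n_users=200, n_cells=5000, n_filled=20)
print(f"density: {density(users, 5000):.2%}")
print(f"mean shared cells per pair: {mean_pairwise_overlap(users):.3f}")
```

Under these assumptions the table is 0.4% full, and a typical pair of users shares far less than one cell on average. That is the white space: lots of rows, but very few places where two people can be compared directly.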

The right machine learning techniques can get pretty good at spotting patterns in large volumes of sparse data, but that doesn’t mean the machine can see something that is not there.

Stay tuned for part 2, where I will talk about over-fitting and the impacts it can have on personalization.