One of the beauties of big data and machine learning is that it is supposed to be completely agenda-less.  The system collects data of all types, and the machine learning algorithms watch and establish meaningful patterns and outputs from that abundant data.  Seems simple enough. Unfortunately, as with most things, it’s not that simple. First, most systems, at least in ecommerce, don’t collect enough data fast enough to establish good patterns.  This can be because the traffic on the site is too low relative to the size of the catalog, or because the data points themselves are too similar.  This is called data sparsity.  Data scientists who work within a given business domain might be inclined to fill this data gap with business logic or additional algorithms that compensate for the lack of data.  This might be considered an introduction of bias into the machine learning equation.  Not a good thing.
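
To make the sparsity point concrete, here is a rough sketch of how one might measure it. It assumes sparsity means the share of visitor-product pairs with no recorded interaction, which is one common reading and not the only one. The function name and the store numbers are made up for illustration.

```python
def interaction_sparsity(num_visitors, num_products, num_interactions):
    """Fraction of visitor-product pairs with no observed interaction.

    Assumes each interaction is a distinct (visitor, product) pair.
    """
    possible_pairs = num_visitors * num_products
    if possible_pairs == 0:
        raise ValueError("need at least one visitor and one product")
    return 1.0 - num_interactions / possible_pairs


# Hypothetical store: 10,000 visitors, 50,000 products, 200,000 recorded views.
sparsity = interaction_sparsity(10_000, 50_000, 200_000)
print(f"{sparsity:.2%} of visitor-product pairs have no data")
```

With a catalog that large and traffic that modest, nearly every pair is empty, which is the gap that tempts people to reach for business logic.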

Machine learning is supposed to be about exploration – the ability to let the machine run and freely explore the various connections between data points.  When additional logic is introduced that limits this exploration, it can bias the output.  This might mean the machine is told to look at a specific input or a specific feature in the data.  Usually this is unintentional.  Here is how it might happen: Let’s say a data scientist works in-house at one of the new Bay Area labs recently established by big retailers.  If they have an inkling that a certain input or feature drives a given outcome and build that hunch into the system, then what they are really doing is degrading the organic capacity of the machine to explore.  Their instructions will send the algorithms down a more narrowly defined path – one where the machine is looking for something specific and, once it finds it, exploits it.  For some applications, this is a decent outcome – for others, not so good.
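
One way to picture this is with a toy explore/exploit loop. The sketch below uses epsilon-greedy, a standard textbook strategy that I am swapping in for illustration; it is not a claim about how any particular retailer's system works. The `allowed_arms` parameter is a hypothetical stand-in for the data scientist's hunch: the machine may only try the options on that list.

```python
import random


def epsilon_greedy(true_rates, allowed_arms, epsilon=0.1, rounds=5000, seed=0):
    """Run epsilon-greedy over `allowed_arms` only; return pulls per arm."""
    rng = random.Random(seed)
    pulls = {arm: 0 for arm in allowed_arms}
    wins = {arm: 0 for arm in allowed_arms}

    def observed_rate(arm):
        # Untried arms look infinitely promising, so each gets tried once.
        return wins[arm] / pulls[arm] if pulls[arm] else float("inf")

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.choice(allowed_arms)          # explore at random
        else:
            arm = max(allowed_arms, key=observed_rate)  # exploit best so far
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:          # simulated outcome
            wins[arm] += 1
    return pulls


# Made-up conversion rates for three options; option 2 is truly the best.
rates = [0.02, 0.03, 0.08]
free = epsilon_greedy(rates, allowed_arms=[0, 1, 2])
hunch = epsilon_greedy(rates, allowed_arms=[0, 1])  # hunch excludes option 2
print("free to roam:", free)
print("following a hunch:", hunch)
```

The restricted run can never discover option 2, no matter how long it runs, because it was never allowed to look there. It will happily exploit the best of what it was pointed at.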

So if you plan to hire data scientists any time soon, make sure you philosophically agree on the level of exploration you are willing to accommodate.  While it is still early days for many big data thinkers, my suspicion is that we will actually become much more biased by domain over time if we are not careful to allow our data scientists the room to roam freely.