The 451 Take
There is little doubt that machine learning has become a defining force in the analytics sector, set to exhibit significant growth, foster further startup activity and drive market consolidation in 2017. As this type of artificial intelligence enters the mainstream, it will shape data-driven organizations, which are fast becoming the majority of all companies. Biased analysis and poor or insufficient data are crucial issues to tackle in order to ensure that machine-learning-driven analysis lives up to its full potential. Data scientists and other analytically savvy personnel are already familiar with these issues – they understand, for example, the bias-variance trade-off, which arises from trying to simultaneously minimize two sources of error that prevent supervised machine-learning models from generalizing beyond their training data – but business users are not. Yet data scientists remain in short supply. We therefore need to ensure that similar rigor and intelligence are employed by those outside the data science realm.
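The bias-variance trade-off mentioned above can be made concrete with a toy simulation. Everything below – the true function, the noise level, and the two deliberately extreme estimators – is invented purely for illustration: a rigid estimator shows high bias and low variance, while an overly flexible one shows the reverse.

```python
import random

random.seed(0)

def true_f(x):
    return x * x

def sample_dataset(n=20, noise=0.5):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, noise) for x in xs]
    return xs, ys

# Two deliberately extreme estimators evaluated at a fixed point x0:
# 'underfit' predicts the global mean of y (rigid: high bias, low variance);
# 'overfit' echoes the y-value of the single nearest training point
# (flexible: low bias, high variance).
def underfit(xs, ys, x0):
    return sum(ys) / len(ys)

def overfit(xs, ys, x0):
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def bias_variance(estimator, x0=0.8, trials=2000):
    # Refit the estimator on many fresh samples and decompose its
    # error at x0 into squared bias and variance.
    preds = []
    for _ in range(trials):
        xs, ys = sample_dataset()
        preds.append(estimator(xs, ys, x0))
    mean_pred = sum(preds) / len(preds)
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    return bias_sq, variance

underfit_bias, underfit_var = bias_variance(underfit)
overfit_bias, overfit_var = bias_variance(overfit)
# The rigid estimator shows the larger bias; the flexible one the larger variance.
```

Minimizing one source of error inflates the other; a model that generalizes well sits somewhere between the two extremes.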
Lies, damned lies and statistics: Be aware of bias
A key question to ask when considering machine learning for analytics is whether the results are accurate and representative of the population. This may seem an obvious and easy check, but it is harder than it first appears. Bias can creep in at many points, and it can be difficult to eradicate once it takes hold, because the software 'learns' these biases.
Bias can occur at a number of different points in the machine-learning process – for example, it can be present in the information used to train the machine-learning models. Consider medical diagnoses, object descriptions and hiring data. All of these information sources house ingrained assumptions that can create biased analysis, which in turn produces flawed decision-making.
Furthermore, the assumptions used to create datasets for machine-learning models might contain unconscious bias. Humans aren't totally impartial, even if they try their hardest to be. Machine learning takes its cue from people – and we are all subject to influences: cultural, educational, gender and race.
Bias can also surface in the machine-learning algorithms themselves. Data scientists and developers go to great lengths to create good algorithms, but unconscious assumptions can impact the way their code is written despite the coder's best intentions, and therefore skew the analysis. Machine-learning code – like all forms of software – requires the creator to make certain assumptions.
Additionally, individuals are complex and difficult to fit into the neat, delineated groups that machine-learning programs require. Any decision-making based on machine-learning analysis of data about people is therefore vulnerable to prejudice. Race, for example, is difficult to determine because it is a social construct, based on a mixture of fixed and changing traits rather than just skin color. Nor is race immutable: a person could change their religion, and with it how their race is described, and racial descriptions also vary from country to country and over time.
Gender bias is another form of error to be aware of in machine-learning-driven analysis. Some argue that it can be coded out. At a high level, this involves identifying gender-stereotypical analogies used by the algorithms during the learning process, such as 'a man is to computer programmer as a woman is to homemaker,' and shifting the relationship between those words. However, it needs careful handling to avoid creating fresh misunderstandings and errors. The classic example is the system that learns gender-neutral language and can no longer distinguish that a king is a man and a queen is a woman.
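Both halves of that argument – the stereotyped analogy and the over-correction – can be sketched with word-embedding arithmetic. The 3-dimensional vectors below are invented for illustration; real embeddings have hundreds of dimensions and are learned from text corpora.

```python
# Toy 'embeddings', hand-crafted so that the first dimension encodes
# gender. All numbers are invented for illustration.
emb = {
    "man":        [ 1.0, 0.2, 0.1],
    "woman":      [-1.0, 0.2, 0.1],
    "king":       [ 0.9, 0.8, 0.1],
    "queen":      [-0.9, 0.8, 0.1],
    "programmer": [ 0.6, 0.1, 0.9],  # stereotyped toward 'man'
    "homemaker":  [-0.6, 0.1, 0.9],  # stereotyped toward 'woman'
}

def sub(a, b): return [x - y for x, y in zip(a, b)]
def add(a, b): return [x + y for x, y in zip(a, b)]
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cos(a, b): return dot(a, b) / (dot(a, a) ** 0.5 * dot(b, b) ** 0.5)

def nearest(v, exclude=()):
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(v, emb[w]))

# The stereotyped analogy: man - woman + homemaker lands on 'programmer'.
analogy = add(sub(emb["man"], emb["woman"]), emb["homemaker"])
answer = nearest(analogy, exclude={"man", "woman", "homemaker"})
print(answer)  # prints 'programmer'

# 'Shifting the relationship' here means projecting out the gender axis.
gender = sub(emb["man"], emb["woman"])

def neutralize(v):
    scale = dot(v, gender) / dot(gender, gender)
    return sub(v, [scale * g for g in gender])

# Applied to profession words, this removes the stereotype...
prog_n = neutralize(emb["programmer"])
home_n = neutralize(emb["homemaker"])

# ...but applied indiscriminately, it makes 'king' and 'queen'
# identical - erasing a legitimate gender distinction.
king_n = neutralize(emb["king"])
queen_n = neutralize(emb["queen"])
```

The design tension is exactly the one the text describes: the projection that removes a harmful association also destroys meaningful gendered relationships unless it is applied selectively.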
Machine-learning-based analysis lives or dies depending on the volume of data used for training purposes and ongoing feeding of the system. Learning algorithms, after all, learn patterns automatically on the basis of the large datasets with which they are presented. If the dataset isn't sufficiently large, there is a risk that the conclusions drawn from the analysis could be erroneous. Data sampling can compound the problem, since a data sample might not be representative of the dataset as a whole.
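Both risks – too few records and an unrepresentative sample – can be shown with a toy simulation. The population, proportions and sample sizes below are invented for illustration: small samples give erratic estimates, and a biased sampling frame stays wrong no matter how many records are drawn from it.

```python
import random

random.seed(1)

# Hypothetical population: 70% of records are class A, 30% class B.
population = ["A"] * 7000 + ["B"] * 3000

def prop_a(sample):
    return sample.count("A") / len(sample)

# Small samples estimate the true 0.7 proportion erratically;
# larger samples cluster tightly around it.
small = [prop_a(random.sample(population, 10)) for _ in range(500)]
large = [prop_a(random.sample(population, 1000)) for _ in range(500)]

spread_small = max(small) - min(small)
spread_large = max(large) - min(large)

# A biased sampling frame - here, only the first 4000 records, which
# all happen to be class A - is wrong regardless of sample size.
biased_estimate = prop_a(random.sample(population[:4000], 1000))
```

More data narrows the spread of the estimate, but only a representative sampling frame removes the systematic error.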
The state of the data in use also needs to be ascertained, ideally at the outset. The maxim 'garbage in, garbage out' is just as critical in machine-learning-driven analytics as it is in any other type of analysis. Indeed, it might be even more so: if a machine-learning system is trained on information riddled with data-quality issues, it will make poor inferences, and those inferences can become magnified as they are 'learned' and ingrained. Therefore, errors such as null values, duplicates and other data inconsistencies need to be weeded out before they enter the machine-learning system and produce inaccurate or incomplete analysis.
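As a sketch of that pre-ingestion weeding, consider a minimal cleaning pass. The records and rules below are hypothetical; production pipelines would rely on dedicated data-quality tooling or a library such as pandas.

```python
# Hypothetical raw records, each exhibiting one of the error types
# named in the text: nulls, duplicates, and inconsistent values.
raw = [
    {"id": 1, "age": 34,   "city": "Boston"},
    {"id": 2, "age": None, "city": "Denver"},   # null value
    {"id": 1, "age": 34,   "city": "Boston"},   # exact duplicate
    {"id": 3, "age": -5,   "city": "Austin"},   # inconsistent: negative age
]

def clean(records):
    seen, out = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:                            # drop exact duplicates
            continue
        seen.add(key)
        if any(v is None for v in rec.values()):   # drop records with nulls
            continue
        if rec["age"] < 0:                         # drop inconsistent ages
            continue
        out.append(rec)
    return out

cleaned = clean(raw)   # only the first record survives
```

Run before training, a pass like this keeps the errors from ever being 'learned'; run after, the damage has already been baked into the model.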
As a Senior Analyst for the Data Platform and Analytics team, Krishna Roy is responsible for the coverage of self-service analytics, predictive analytics and performance management.