This year, as nations prepare for an uncertain future of lockdowns, business closures and social distancing due to COVID-19, we have also been forced to reckon yet again with another ongoing epidemic: the systematic oppression of Black people. This past June, the murders of George Floyd and Breonna Taylor inspired not only renewed protests against unjustified police violence, but also deeper reflection on the effects of racism and racial bias in our society more broadly. As Black Lives Matter demonstrations took place across the globe, corporations publicly pledged support for the movement and promised to implement more DE&I measures, state and local governments across the country enacted police reforms, and many individuals who had not previously engaged with the issue started questioning their own biases, prejudices and roles in perpetuating systems of oppression and discrimination.
As a data scientist working at a human-centered design firm, I tend to gravitate towards finding data- and technology-driven solutions to human problems. But as a Black man who has lived in the UK and the United States, I know that the societies we live in and the personal experiences we have in those societies train us to think and act in different ways with respect to issues of race. Some people in my field have pointed to AI as a powerful tool for fighting bias in a variety of areas, from criminal justice to corporate hiring to loan applications. But while I believe AI has the potential to reduce bias in these and other areas, my experience has also taught me that many AI systems simply replicate the biases of the societies in which their human designers live. Before we hand decision-making power in critical social and economic spheres over to AI systems, we must first acknowledge this possibility, correct the biases we’ve built into existing systems, and develop tools and procedures for safeguarding against bias in the AI systems of the future.
The existence of biases within machine learning systems is well documented, and they are already taking a devastating toll on vulnerable and marginalized communities. As data scientists and AI system designers, we must be critical, vigilant and proactive in identifying and removing potential biases from AI models and systems if we want our work to help create a more diverse, equitable and inclusive world for all.
AI models: a brief overview
AI models are typically viewed as closed systems, beginning with a dataset and ending with predictions. To build an AI model, a dataset is collected and split into three subsets: training, validation and testing. A number of candidate algorithms are fit to the training set and tuned until their predictive performance on both the training and validation sets is optimal. This process may be repeated several times with different algorithms and model architectures to identify the most powerful model. Generally, that’s the model with the best ability both to predict accurately for data it has already seen and to generalize (i.e., to make accurate predictions for data it has not seen). Finally, the best algorithm is run on the test dataset to evaluate its true generalization capability. While the definition of AI has changed over time, the most popular current application of the term refers to multi-layer (deep) neural network models that learn from large datasets using the process described above. This is commonly known as deep learning.
There are several variations and nuances to this process, but the above description is sufficient for non-data scientists to understand the issues and implications raised below.
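The workflow described above can be sketched in a few lines of code. This is a deliberately toy illustration with a made-up one-feature dataset and simple threshold "models" standing in for real algorithms; the structure (split, tune on validation, evaluate once on test) is the point, not the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: one feature, binary label (synthetic, for illustration only).
X = rng.normal(size=1000)
y = (X + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 1. Split into training, validation and test subsets (here 60/20/20).
idx = rng.permutation(len(X))
train, val, test = idx[:600], idx[600:800], idx[800:]

def accuracy(threshold, subset):
    """Accuracy of a simple threshold 'model' on a given subset."""
    preds = (X[subset] > threshold).astype(int)
    return (preds == y[subset]).mean()

# 2. Compare several candidate 'models', choosing by validation performance.
candidates = [-0.5, 0.0, 0.5]
best = max(candidates, key=lambda t: accuracy(t, val))

# 3. Only now touch the test set, to estimate true generalization.
print(f"chosen threshold: {best}")
print(f"test accuracy: {accuracy(best, test):.2f}")
```

Note that the test set is consulted exactly once, after model selection is complete; reusing it during tuning would give an optimistic estimate of generalization.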
AI model biases
AI models can contain embedded biases, but those biases can be measured and corrected at three stages of the model development pipeline. At the dataset input stage, datasets can be pre-processed to mitigate bias; at the model training stage, algorithms can be employed to correct known biases; and once a model’s predictions have been made, they can be post-processed to remove any effects of bias. There are many tools and algorithms that can be used to mitigate in-model biases. At frog, we think IBM’s AI Fairness 360 (AIF360) toolkit is a great place to start, as it contains a summary of other open-source tools for AI bias and fairness, as well as an extensive list of fairness and bias metrics.
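To make the measurement step concrete, here is a small hand-rolled sketch of one common fairness metric, disparate impact (the ratio of favorable-outcome rates between groups), together with a crude pre-processing reweighting. The group sizes and prediction rates are entirely made up; toolkits like AIF360 compute this metric, and a more principled Reweighing algorithm, for you.

```python
import numpy as np

# Hypothetical binary predictions for two groups (1 = favorable outcome).
# 'group' marks a protected attribute; all values are invented for illustration.
group = np.array([0] * 50 + [1] * 50)        # 0 = privileged, 1 = unprivileged
preds = np.array([1] * 40 + [0] * 10 +       # privileged: 80% favorable
                 [1] * 25 + [0] * 25)        # unprivileged: 50% favorable

# Disparate impact: ratio of favorable-outcome rates between the groups.
# Values far below 1.0 signal bias against the unprivileged group.
rate_priv = preds[group == 0].mean()
rate_unpriv = preds[group == 1].mean()
disparate_impact = rate_unpriv / rate_priv
print(f"disparate impact: {disparate_impact:.3f}")   # 0.50 / 0.80 = 0.625

# A simplistic pre-processing correction: upweight unprivileged examples so
# both groups contribute equally to training (a rough analogue of reweighing).
weights = np.where(group == 1, rate_priv / rate_unpriv, 1.0)
```

A common rule of thumb (the "80 percent rule" from US employment law) flags disparate impact below 0.8 as potentially discriminatory; the value here would fail that check.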
If AI models are not screened and corrected for embedded biases, those biases will shape any AI system in which they are deployed. This is inevitable, since AI systems necessarily incorporate all of the decisions made in the selection and collection of model data, the design choices regarding how the model should be used (e.g., who is the intended audience), and the effects of the system’s own predictions and inferences on its users. For example, an AI system that makes a purchasing recommendation to a customer doesn’t only give the customer a recommendation–it also takes their purchasing decision and uses it to refine future recommendations for that and other customers. In collecting new user data to retrain itself and influencing the behavior of the users to whom it makes recommendations, the AI system breaks the closed loop of the model on which it is built, amplifying the presence and effect of any embedded biases.
Below we will consider a common AI system task, image classification, to illustrate how the failure to screen our datasets for bias or to consider the potential uses of our models can lead to adverse effects on end users after our systems are deployed. We choose this task because its use cases, and therefore its potential associated impacts, are so wide-ranging, e.g. photo-tagging, medical diagnosis, self-driving car navigation, behavioral anomaly detection, drone targeting and security systems.
A biased dataset is one that does not contain data diverse enough to sufficiently represent the universe of predictions it is intended to make. Often, data scientists might be unaware that a dataset is biased because they did not assemble it themselves–it is easy to download a dataset and model online, as many data scientists do. The advantages of this approach are clear: it helps us avoid days or weeks of collecting and labeling our own data, and it allows us to compare our results to benchmark performance metrics from other model architectures that use the same dataset. If we make a significant improvement upon the benchmark performances using a novel architecture or approach, we may end up being published or invited to speak at a conference.
But what are the disadvantages of this approach? The primary disadvantage is that without stopping to consider the composition of the dataset and its possible biases, we are blind to how those biases affect the intended predictions of the model, and in turn how those predictions could produce erroneous or unintended effects in the world. If our work is purely academic, perhaps this risk isn’t so serious. But if we adopt this approach in a commercial context, we are making a huge mistake.
For example, a DuckDuckGo search for the term “CIFAR 100 diversity” (the CIFAR-100 is a dataset widely used for training image-recognition algorithms) is unlikely to produce any results that break down the “people” superclass in greater detail than “baby,” “boy,” “girl,” “man,” or “woman.” Obviously, these subclasses are entirely insufficient to describe the diversity we see in the world. With such a limited range of descriptors, a model trained on this dataset might identify someone in a wheelchair as a motorcycle as easily as it identifies them as a person. How can we ensure that our datasets contain sufficient diversity and nuance to avoid such misclassifications by the models we use them to train?
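Auditing a dataset's composition before training is cheap to automate. The sketch below reconstructs the CIFAR-100 "people" superclass from its documented structure (five fine labels, 600 images each) and checks whether attributes we might care about appear at all; the attribute list is hypothetical, chosen only to show the shape of such an audit.

```python
from collections import Counter

# CIFAR-100's "people" superclass contains only these five fine labels;
# the dataset provides a fixed 600 images per class.
person_classes = ["baby", "boy", "girl", "man", "woman"]
labels = [cls for cls in person_classes for _ in range(600)]

composition = Counter(labels)
total = sum(composition.values())
print(f"{total} person images across {len(composition)} subclasses")

# An audit question worth automating: which attributes are entirely absent
# from the label space? (hypothetical attribute list, for illustration)
desired_attributes = ["wheelchair user", "skin tone", "age range"]
missing = [a for a in desired_attributes if a not in composition]
print(f"unrepresented attributes: {missing}")
```

The audit's answer here is stark: every attribute beyond the five age/gender labels is simply invisible to any model trained on these labels.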
One potential mitigation: the relatively laborious process of individually screening each of the 3,000 images of people within the dataset and adding additional labels so that we can evaluate the model’s ability to correctly classify such a person. But this approach creates deeper problems when we consider classes of diversity such as race, ethnicity, gender expression and gender identity, which can be ambiguous or fluid. Do we really want to impose our subjective evaluations of such attributes, based on single images of people whose identities we do not know, as the objective truth in regard to our dataset?
Compounding effects: intersectionality
Another important consideration for evaluating the diversity of a dataset is intersectionality. The term has become politically and culturally charged, but for our purposes we can define it as the compounding effects an individual experiences as a member of more than one marginalized group. To illustrate the phenomenon, consider the example of income disparity. According to a 2019 survey in the United States, uncontrolled for job type and qualifications, white women earned 21 percent less than white men. Black men earned 22 percent less than white men. Black women, however, earned 25 percent less than white men, demonstrating the compounded economic disadvantages experienced by Black women relative to white men. While this example relates to income, there are similar effects in the areas of education, housing, criminal justice and others.
This concept has important implications for data scientists. It suggests that even if our datasets contain a proportionally representative sample of the population, any algorithm trained with them may still be prone to misclassifications. Consider the case of a Black woman living in the United States who uses a wheelchair. Even if the CIFAR-100 were perfectly proportional in regard to demographics, the dataset would contain fewer than five examples of such women, one of which would need to be reserved for testing to evaluate the model’s performance on this group at all. Even a highly optimized algorithm is likely to discard the “outlier” images in favor of a higher accuracy on the majority classes. While this may still yield 99.95 percent accuracy within the dataset, the deployed model could then misclassify up to 170,000 similar people when applied on a national scale. Without doing such basic analysis of our datasets, we virtually guarantee that our algorithms will misclassify people who already experience disproportionate discrimination in our society.
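The back-of-envelope arithmetic behind those figures is worth making explicit. The subgroup fraction below is an assumption chosen to reproduce the numbers in the text, not a census statistic; the point is how a vanishing in-dataset count coexists with a large absolute number of affected people at national scale.

```python
# All figures are rough assumptions for illustration, not census data.
us_population = 330_000_000
subgroup_fraction = 0.0005   # assumed share of population in this subgroup

# With ~3,000 "people" images, a perfectly proportional sample would hold:
people_images = 3000
expected_examples = people_images * subgroup_fraction
print(f"expected examples in dataset: {expected_examples:.1f}")   # 1.5 (fewer than five)

# If the model simply fails on this subgroup, aggregate accuracy barely moves...
accuracy = 1 - subgroup_fraction
print(f"in-dataset accuracy: {accuracy:.2%}")   # 99.95%

# ...but deployed nationally, the same failure touches a large absolute number.
misclassified = us_population * subgroup_fraction
print(f"people potentially misclassified: {misclassified:,.0f}")   # 165,000
```

This is the core asymmetry: a 0.05 percent error rate is invisible in a benchmark table and enormous in a population.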
While there are a number of computational tricks that can artificially increase the representation of marginalized groups within a dataset, they still cannot fully reproduce the diversity within these groups that an algorithm must learn in order to perform accurate classifications. And, if the dataset contains zero examples of a given group, then there are no such tricks to use–even the most sophisticated algorithms cannot accurately classify something they have never seen before. In that case, we must take the time to collect and label more data. Often, a quick internet search will produce hundreds of additional examples we can download and process with a bit of Python scripting.
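The simplest of those computational tricks is naive oversampling: duplicating minority-class examples until class counts balance. The sketch below uses an invented label list to show both the mechanic and its limitation, that duplication balances the counts without adding any genuinely new diversity.

```python
import random

random.seed(0)

# Hypothetical labeled dataset with a badly under-represented class.
data = ([(f"img_{i}", "majority") for i in range(95)] +
        [(f"img_{i}", "minority") for i in range(95, 100)])

minority = [d for d in data if d[1] == "minority"]
majority = [d for d in data if d[1] == "majority"]

# Naive oversampling: resample minority examples until the classes balance.
# (Real pipelines would also apply image augmentations -- flips, crops,
# color jitter -- but these only perturb the same five underlying images.)
balanced = majority + [random.choice(minority) for _ in range(len(majority))]

counts = {"majority": len(majority),
          "minority": len(balanced) - len(majority)}
print(counts)   # {'majority': 95, 'minority': 95}
```

The balanced dataset still contains only five distinct minority images, which is exactly why oversampling cannot substitute for collecting more representative data.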
But there are cases in which additional data collection is not possible. In those cases we must ask ourselves: Why does this data not exist? Is there some bias in the collection process causing disproportionate collection or representation of data regarding this group? Will the deployment of our AI system at scale lead to further disproportionate representation? We must consider and discuss these questions before we train any AI model. Remember that the example above considers only one of the infinitely complex possibilities of intersectional identity–how many other people might our data exclude?
It is well documented that humans are deeply susceptible to various cognitive biases.
Our example of the data scientist who downloads and uses a dataset without any critical analysis of its composition illustrates at least two of them: status-quo bias, which causes us to assume that the status quo is inherently superior to alternate approaches, and in-group bias, which causes us to treat members of our own group (in this case, AI designers) more favorably or forgivingly than we otherwise might. Together, these cognitive biases allow us to avoid accountability for building biased systems by saying to ourselves: “This is what other AI designers are doing, and they’re fine with it, so it’s what I’ll do, too.”
But that’s only the beginning. When we fail to fully consider the potential consequences of our designs–for example, the possible effects of a misclassification error in an AI system used for autonomous vehicle navigation–we are engaging in optimism bias, which causes us to overestimate the probability of positive outcomes and underestimate the probability of negative ones. We must recognize that we choose how to allocate our time and resources when designing AI systems, and that we bear responsibility for the potential negative impacts of those choices. If our systems do cause negative outcomes for users, how will we respond? Will we succumb to self-serving bias–the tendency to claim responsibility for success and avoid responsibility for failure–by telling ourselves there was no way we could have seen it coming? That we didn’t have time to fix it? That it’s not our fault the dataset everyone uses is flawed?
Some people are uncomfortable with or even hostile to the implications of how deeply bias is rooted in not just our AI systems, but our own personal and collective psyches. AI and deep learning have come so far so quickly, and many fear that such a radical change in approach will lead to a decline in our models’ performance and an increase in overall time and money spent throughout the research, design and production process. It is not clear, however, that these concerns are founded in reality. And even if they are, privileging them over the experiences and material conditions of real people—many of whom are already marginalized by other societal systems—only vindicates the view that recent proclamations of corporate solidarity are nothing more than opportunistic virtue signaling. Dismantling any oppressive and discriminatory system–whether technological or social, whether constructed intentionally or unintentionally–will require hard work, commitment and resources. At frog we have committed to raising these issues with our clients early and often to make sure we are building AI solutions that both meet their goals and reduce the effects of discriminatory bias.
To design against bias, we must look to both mitigate unintentional bias in new AI systems, as well as correct our reliance on entrenched tools and processes that might propagate bias, such as the CIFAR-100 dataset. Because the dataset is likely representative of the images available online at the time it was generated, it carries the bias for majority-group representations that characterizes media generally. When that biased data serves as the basis for new AI models and systems, it leads to even further discrepancies between representation and reality. Additionally, we need not be constrained to the narrow classifications of past datasets simply because they were someone’s best guess about the most useful classes for future algorithms. We can and must confront and correct such biases in our tools and in our thinking.
Humans rarely act on their negative biases intentionally, which is exactly why we must create opportunities to recognize, discuss and reflect on them. One of the best ways to do this is to ensure that there is a diverse set of designers, data scientists and other stakeholders discussing potential bias and its effects early in the design process. Now that AI is being used on an industrial scale by companies all over the world, its potential effects, positive and negative, are magnified. Therefore, as data scientists and AI designers, we have a moral responsibility to intercede at all points before, within and after the AI system pipeline to mitigate and, where possible, to remove these negative effects.