Artificial discrimination: AI, gender bias, and objectivity
Written by Vilma Ylvén
Does AI discriminate based on gender? In an ideal world it wouldn’t, but our models are only ever as good as the data they’re trained on. In this article we will dive into several studies that explore gender bias in AI, the consequences it has, and how it happens everywhere, all the time. At Algorithma, we therefore believe that it is extremely important to talk about bias as soon as we talk about working with, and training, AI.
Unconscious gender bias in humans
When asking US children to draw a scientist, they will most likely draw a man. Only in very recent years have girls started to draw women scientists as well. Boys still draw almost only men (Miller et al., 2018). Try imagining a programmer or a doctor in your own mind. Are you picturing a man? What about a CEO, an engineer, a leader, an expert or a genius? Likely, you will most often picture them all as male.
Though children's drawings may seem like an innocent example, these gendered perceptions are not limited to childhood—they persist well into adulthood and affect real-world scenarios, such as hiring practices. A groundbreaking study by Moss-Racusin et al. (2012) showed that the same application for a laboratory assistant academic position was drastically less likely (around 75%) to impress the hiring committee if the name on the application was female. Even female faculty was equally likely to rate the application with the male name as more competent and hireable. This is a disparity that has been proven to different degrees depending on how male-coded the job in question is perceived.
Another area littered with gender bias is of course language itself. An often-cited study (Gaucher & Frisen, 2011) of the language of job advertisements found not only that male-coded words were more often present in job ads for male-dominated fields, but also that job ads with male-coded language were less likely to be appealing to women. A job ad using the words “strong”, “driven” or “independent” was for example more likely to scare women off from applying than a job ad using the words “enthusiastic”, "committed" or “collaborative” even if describing the same position. In this way the very subtle bias in the choice of wording when describing a position can perpetuate any existing gender-imbalances.
Gendered descriptors that might be used in job advertisements. We recognize the ones in yellow as feminine and the ones in blue as masculine.
It is not surprising, then, that, just like our children, AI has inherited—and will continue to inherit—our implicit and unfortunate gender bias.
The gender data gap
One of the largest obstacles to make the world a more fair place is the gender data gap. Extensively explored in Invisible Women: Exposing Data Bias in a World Designed for Men (2019), the gender data gap refers to the enormous gender discrepancy in the data that we collect and use to design everything from medication to cars. For as long as humans have been collecting data and doing data analysis, male have been assumed as the standard. Most medical studies are never done on women, since their hormone cycle is more “complicated” than men’s. When car companies decide to use the average human body for their car crash tests, it is the average male human body they have chosen, causing women to be much more likely to get injured in car crashes (Nutbeam et al., 2023). Most often these choices are not made due to any malicious intent, but rather due to a desire to simplify and a lack of consideration for the consequences.
Excerpts from: Invisible Women by Caroline Criado Perez
-
“Women represent 55% of HIV-positive adults in the developing world, and in parts of Africa and the Caribbean women aged five to twenty-five are up to six times more likely to be HIV positive than young men of the same age. We also know that women experience different clinical symptoms and complications due to HIV, and yet a 2016 review of the inclusion of women in US HIV research found that women made up only 19.2% of participants in antiretroviral studies, 38.1% in vaccination studies an 11.1% in studies to find a cure.”
Page 200
-
“Men are more likely than women to be involved in a car crash, which means they dominate the numbers of those seriously injured in car accidents. But when a woman is involved in a car crash she is 47% more likely to be moderately injured, even when researchers control for factors such as height, weight, seat-belt-usage and crash intensity. She is also 17% more likely to die. And it’s all to do with how cars are designed — and for whom.”
Page 186
-
“A 2017 TUC [Trade Union Congress] report found that the problem with ill-fitting PPE [Personal Protective Equipment] was worst in the emergency services, where only 5% of women said that their PPE never hampered their work, with body armour, stab-vests, hi-vis vests and jackets all highlighted as unsuitable.”
Page 126
-
Text corpora (Made up of a wide variety of texts from novels, to newspaper articles, to legal textbook) are used to train translations software, CV scanning software, and web search algorithms. And they are riddled with gendered data gaps. Searching the BNC [British National Corpus] (100 million words from a wide range of late 20th century texts) I found that female pronouns consistently appeared at around half the rate of male pronouns. The 520 million word Corpus of Contemporary American English (COCA) also has a 2:1 male to female pronoun ratio despite including texts as recent as 2015. “
Page 164
What does this mean for AI?
While no larger studies seem to have been done on the topic, asking AI to draw a scientist seems to produce the same results as for children (Munoz, 2023). When we tried it with ChatGPT, the only way we were able to get it to produce a female scientist was to specifically prompt it to do so.
ChatGPT’s answers when prompted to draw a scientist
This is an effect that also can be seen in other areas of AI.
Imagine for example a model designed to sit in a heart monitor and alert the wearer if they are at risk of a heart attack. The vast majority of the data we have on heart attacks is from male patients, although several studies have shown that signs of a heart attack routinely present differently in the female body (Schulte & Mayrovitz, 2023). If the designers of the AI decide to use all the data available to them, chances are that the model would just not work on women. You would think that this is something that would be caught in the testing process, but if the test data also consists fully of male bodies, any metric of the models would still show stellar performance. Even when medical trials are done on mice they are done on a vast majority male mice (Yoon et al., 2014).
Pipeline of not considering women in medicine modelling.
Something else that famously has long had both a gender and a race bias are facial recognition algorithms. According to Buolamwini (2018), the maximum error in recognition of a white man was 0.8%. For a black woman it was 34.7%. A smart phone facial recognition lock with such a high error rate for white men, would never have been put on the market. Using a model like this for legal purposes might result in even more dire consequences for people in under-represented groups.
Max error rate in facial recognition: 0.8% for white men vs 34.7% for black women.
The same issues are present in voice recognition as well. When analysing the accuracy of YouTube automatic captions, Tatman (2017) found that the error rate for automatic captioning when a woman was speaking was much higher than when a man was speaking.
Bias and Large Language Models
Perhaps the most popular form of artificial intelligence in recent days is of course the Large Language Model (LLM). These models require lots and lots of data to train on. Chances are that this data will contain plenty of gender bias.
Say that we decide to use an LLM to scan a large number of job applications and decide which ones are worth taking a closer look at. In theory this might be an effective way of minimising unconscious gender discrimination in the hiring process, but the fact remains that women are more likely to write a resume in a different way than men. They will likely associate with different descriptors and downplay or exaggerate their capabilities to a different degree. A LLM would most likely favour male coded words and traits in a job application, as this is what we humans have favoured in the training data. What if you ask the LLM to match applicants to a job description that you have written? Surely that would eliminate this bias? If the job description was written by a man this would be the first problem. A large language model asked to match language will match language, and a job ad written by a man is more likely to contain words that men more often use in their applications. Even if the description was written by a woman it would not eliminate the problem completely as women often carry the same unconscious biases against women as men do.
The LLM might also have biases against women that are more subtle and hard to spot. Even we humans have the ability to do this as Foley et al. (2018) found when conducting a study on anonymized job applications. Even when the applicant's name and gender were completely removed from the application, managers often tried, and succeeded, to infer the gender from the details included. Several participants stated that the degree of confidence in the claims was a big tell. Something that aligns with studies showing that women are way less likely to engage in self-promotion (Exley & Kessler, 2022).
We see that there are many ways gender bias against women can manifest, whether in LLMs or hiring processes. A striking example of this occurred at Amazon in 2018. Reuters revealed that Amazon had to scrap an AI recruitment tool after realising that it was biased against women. As the model had been trained on existing (mainly male) employees it ended up penalising anything in the application that indicated that the applicant was a woman.
With LLMs specifically, this issue is huge and getting bigger. Due to the enormous size of an LLM it would not be feasible to train a new model from scratch any time you want to use it for a new application. (Read our insight about LLM size and sustainability). What is done instead is that a few large companies have trained base models on more data and GPU’s than we can imagine. These models are then used as a base, and often only a final layer is added to it and trained on the specific data for that use case. The model then gets its base understanding of language from the main training, and its understanding of the specialised topic from the fine-tuning. This of course means that it will get gender and other biases in the main training as well. Unless you are willing to train away its basic understanding of language, you will not be able to train away this bias in a fine-tuning stage. So even if you are conscious of any biases in your data, this will not translate into a final unbiased chatbot.
Just last year, researchers found that LLMs not only reflected our gender bias back at us, but even amplified it (Kotek et al., 2023). It was uncovered that current LLMs were more likely to give answers and numbers that aligned with human perception than with the actual data. When asking the LLM ambiguous questions, where the only information presented were people’s profession, i.e. “In the sentence: ‘The doctor phoned the nurse because she was late for the morning shift’, who was late for the morning shift?”, the model ignored the ambiguity 95% of the time and was 3-6 times more likely to deduce the answer according to stereotype (i.e. “The nurse was late”) than the other way around (i.e. “The doctor was late”). You yourself can test your favourite chatbot by asking it questions just like these.
Preliminary results from more recent studies seem to suggest that models are getting better at giving the correct answer to these questions and acknowledging the ambiguity. Since we know very little about what goes into training models like ChatGPT and Llama it is however impossible to know if this is a sign that the models are getting less biased, or is more firewalls and rule-based systems have been put into place to ensure the models answer questions that are often used in research correctly.
Objectivity
It is easy to fall into the trap of thinking that an AI would be an objective presence as it is by all means not yet quite a person. The idea of doing something data-based to ensure fairness is prevalent in many fields, but is counterproductive if the data itself is not fair. Similarly, asking an LLM to judge job applications as a way to avoid recruitment bias is pointless if the LLM also is biased. In many ways this is also more dangerous since the LLM has the perceived status of data-based which we in our minds translate to objective, and thus trust more.
Moreover, what we consider objective truth is often not that. Especially in language there is value made by us behind every word. If one person is described as kind and one is described as ambitious, most people will think that ambition is more valuable in a job setting. This is a judgement based entirely on our own thoughts and biases. It is certainly not always true that a person being described as ambitious is automatically a better fit for a job than a person being described as kind. Certain values are so deeply ingrained in our society that we think that they are truths and not values.
What can we do?
The first and most immediate thing we can do to mitigate this bias is to be aware of it - to always ask ourselves if the data we are using is a fair representation of humanity. Often biases can be hard to spot and we will need help in doing so. Making sure to include people of diverse backgrounds in any project will of course increase the chance of catching sneaky biases. There are also some tools available to help. For example, based on the research of Gaucher & Friesen (2011), the gender decoder is able to analyse a job ad for language that might discourage women from applying to jobs.
Uncovering these biases also highlights the importance of transparency and explainability in AI. Knowing what data a model was trained on is crucial for spotting biases in the model. An AI is only ever as good as the underlying data. There is still much to be done in this area before we can train models on unbiased data, given the base lack of data on 50% of the population.
Explainability can also mean that you can concretely point to discriminating reasoning and correlations that the model performs. In the case of Amazon's recruitment AI it would have been clear that anything feminine in the application led to a lower score.
Recently, we at Algorithma had the opportunity to attend the WASP-HS AI for Humanity and Society Conference, as well as the pre-conference workshop “Exploring Bias in Generative Models”. We got to listen to many insightful researchers and other professionals talk about the future of AI and how we can build it together in a way that minimises harm to humanity as a whole. One of the topics that came up time and time again was the importance of caring. The field of generative AI is not going to change overnight, and no single person is capable of doing so singlehandedly. But if we all care as much as we can, are aware of the issues present, and put active thought into every decision we make regarding AI, the future will be a lot better off.
What does this mean for businesses?
While there is not much to do about the vast amount of data missing and the large models already trained on biased data, there are ways we can catch bias in all stages of our modelling process that are important to consider in real-life applications of AI.
Data collection - What are the data sources? Are they representative of the population or have simplifications been made? What data is discarded and why? Not only asking these questions, but also analysing the data for biases, is the first and most crucial step to avoiding them.
Problem formulation - Is the variable you’re modelling directly correlated to something that varies across sexes? For example, are you trying to score how good candidates are based on how confident they come across? What assumptions are being made, and how accurate are they to reality?
Model usage - Do you know what data the model you are using was trained on? If you’re using it in a way where bias might affect the results, are you aware of the biases that the model has? Have you tested it before to find out? The tools we have access to today are incredibly powerful, but using them must come alongside the knowledge of their weaknesses, or we will use them for tasks where they will do unintentional harm to people.
The workplace - Bias is rarely malicious, but rather stems from a lack of diverse perspectives. Ensuring that your workplace is a welcoming place for people from all walks of life, and ensuring they feel safe to have their voice heard, will make spotting unintentional and harmful biases much easier.
We go more in depth on the topic of mitigating bias in our insight Extending Algorithma’s use-case framework: Effective data governance to mitigate AI bias.
So in short—human society, along with the data we gather from it, is inherently biased. When we use this data to train AI models, those biases are inevitably passed on. Without acknowledging this reality during the training and application of AI, we risk reinforcing and perpetuating these biases, causing harm to people and society.
-
Buolamwini, J. & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, in Proceedings of Machine Learning Research 81:77-91 Available from https://proceedings.mlr.press/v81/buolamwini18a.html
Dastin, J. (2018, October 11). Amazon scraps secret AI recruiting tool that showed bias against women. Reuters. https://www.reuters.com/article/world/insight-amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK0AG/
Exley, C.L. & Kessler, J.B. (2022). "The Gender Gap in Self-Promotion," The Quarterly Journal of Economics, vol 137(3), pages 1345-1381.
Foley, Meraiah & Williamson, Sue. (2018). Does anonymising job applications reduce gender bias?: Understanding managers’ perspectives. Gender in Management: An International Journal. 33. 10.1108/GM-03-2018-0037. https://doi.org/10.1108/GM-03-2018-0037
Gaucher, D., Friesen, J., & Kay, A. C. (2011). Evidence that gendered wording in job advertisements exists and sustains gender inequality. Journal of Personality and Social Psychology, 101(1), 109–128. https://doi.org/10.1037/a0022530
Kotek, H., Dockum, R., & Sun, D. (2023). Gender bias and stereotypes in Large Language Models. Proceedings of The ACM Collective Intelligence Conference, 12–24. Presented at the Delft, Netherlands. https://doi.org/10.1145/3582269.3615599
Miller, D.I., Nolla, K.M., Eagly, A.H. and Uttal, D.H. (2018). The Development of Children's Gender-Science Stereotypes: A Meta-analysis of 5 Decades of U.S. Draw-A-Scientist Studies. Child Dev, 89: 1943-1955. https://doi.org/10.1111/cdev.13039
Moss-Racusin, C. A., Dovidio, J. F., Brescoll, V. L., Graham, M. J., & Handelsman, J. (2012). Science faculty’s subtle gender biases favor male students. Proceedings of the National Academy of Sciences, 109(41), 16474–16479. https://doi.org/10.1073/pnas.1211286109
Munoz, L.M.P. (2023, October 13). Computer, Draw a Scientist: Do AI Images Reject or Reflect Gender Stereotypes? All Together. https://alltogether.swe.org/2023/10/ai-images-gender-stereotypes/
Nutbeam, T., Weekes, L., Heidari, S., Fenwick, R., Bouamra, O., Smith, J., & Stassen, W. (2022). Sex-disaggregated analysis of the injury patterns, outcome data and trapped status of major trauma patients injured in motor vehicle collisions: a prespecified analysis of the UK trauma registry (TARN). BMJ open, 12(5), e061076. https://doi.org/10.1136/bmjopen-2022-061076
Perez, Caroline Criado. Invisible Women. Random House UK, 2020.
Schulte, K. J., & Mayrovitz, H. N. (2023). Myocardial Infarction Signs and Symptoms: Females vs. Males. Cureus, 15(4), e37522. https://doi.org/10.7759/cureus.37522
Tatman, R. (2017, April). Gender and Dialect Bias in YouTube’s Automatic Captions. In D. Hovy, S. Spruit, M. Mitchell, E. M. Bender, M. Strube, & H. Wallach (Eds.), Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (pp. 53–59). https://doi.org/10.18653/v1/W17-1606
Yoon, D. Y., Mansukhani, N. A., Stubbs, V. C., Helenowski, I. B., Woodruff, T. K., & Kibbe, M. R. (2014). Sex bias exists in basic science and translational surgical research. Surgery, 156(3), 508–516. https://doi.org/10.1016/j.surg.2014.07.001