Introduction
Wikipedia is a huge database of information with a large range of uses. From reading about every political scandal your favourite celebrity has been involved in, to learning about the origins of spaghetti, the information is at your fingertips. This breadth of data has recently proved useful for training Large Language Models (LLMs) on a variety of tasks. Any bias in Wikipedia could therefore prove disastrous for these models if it is not handled correctly.
Wikispeedia is a game built off Wikipedia, in which players are tasked to navigate from one Wikipedia article to another, using only the hyperlinks in the articles. The aim is to get to the target article in as few clicks as possible. You can play the game here.
In this project, we look at bias in the representation of countries in the Wikispeedia dataset, and whether it is reflected in players' gameplay patterns. We examine players' behaviour to see whether it mirrors the bias already present in the dataset, or whether it takes on a different or more extreme bias. A bias in a dataset such as Wikipedia would be detrimental to new models trained on it, and also to people trying to gather objective information from a source that is inherently biased.
The data
The data that we have from the game is as follows:
- articles.tsv
- Containing all navigable articles in the game.
- categories.tsv
- The category and sub-categories that each article falls into.
- links.tsv
- Containing all links between articles. It should be noted that these links are directed: A -> B does not imply B -> A.
- paths_finished.tsv
- Containing all paths that were finished by players in the game.
- paths_unfinished.tsv
- Containing all paths that were unfinished by players in the game.
- shortest-path-distance-matrix.txt
- A matrix containing the shortest distances between all articles.
We decided to investigate whether there is a representation bias towards certain countries or world regions, both in the articles that the game is based on and in how people play the game. Our analysis looks at the following aspects of a country's representation:
- Article Length: The number of words in each article.
- In-/Out-degree: The number of links into and out of an article.
- Sentiment of the article: How positive or negative the language used in an article is.
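To make the in-/out-degree definition concrete, here is a minimal sketch of how the two counts could be derived from a directed link list such as links.tsv. It assumes a tab-separated layout with one source and one target per row and comment lines starting with '#'; the helper names and the sample rows are hypothetical and not taken from our actual pipeline.

```python
import csv
import io
from collections import Counter


def read_links(handle):
    """Yield (source, target) pairs from a tab-separated file of directed links.

    Assumption: blank lines and lines starting with '#' carry no data.
    """
    data_lines = (
        line for line in handle
        if line.strip() and not line.startswith("#")
    )
    for source, target in csv.reader(data_lines, delimiter="\t"):
        yield source, target


def degrees(links):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in links:
        out_degree[source] += 1  # link leaving the source article
        in_degree[target] += 1   # link arriving at the target article
    return in_degree, out_degree


# Hypothetical sample standing in for links.tsv.
sample = io.StringIO(
    "# comment line\n"
    "France\tParis\n"
    "Paris\tFrance\n"
    "Wine\tFrance\n"
    "France\tEurope\n"
)
in_degree, out_degree = degrees(read_links(sample))
print(in_degree["France"], out_degree["France"])  # 2 2
```

Because the links are directed, the two counters are kept separately: an article can be linked to often without linking out much itself, and vice versa.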
What countries are in our dataset?
In this project, we followed the country classifications as outlined by the UN. These countries can be seen in the map below.
The map can be viewed separated either by the regions used in our analysis or by the economic regions outlined by the UN. Of course, certain countries have huge populations compared to others, and there is serious economic disparity throughout the world. In Western society we sometimes see a 'West is best' mentality, and we look into whether this is evident in our dataset.
The bar chart below shows us how countries are distributed in each income band, with a slightly higher number of high-income countries, and slightly lower in the low-income category.
Let's have a closer look at the articles!
The articles describing the countries of the world vary greatly in their lengths. The box plot below shows the distribution of article lengths for each country in our dataset.
We see countries such as Argentina, Peru and Spain with articles of more than 70,000 words, and conversely countries such as Martinique and Anguilla with fewer than 8,000 words. Overall, the longest articles tend to come from Western Europe, Western Asia and Northern America. This would make sense if article length corresponded to something inherent to the country, such as population or area. However, below we can see a tendency for high-income countries to have longer articles than their lower-income counterparts.
This disparity is worth noting, as it may reflect a bias in which higher-income countries have more information in their articles. This may in turn be the root of the other biases in the dataset. It could have a variety of causes that we cannot see, such as history, where the author is from, or what political opinions they hold. Article length is an important factor to consider when looking at the number of hyperlinks each article has. Of course, we would expect a longer article to have more hyperlinks, but is this always the case?
Hyperlinks: a key ingredient to be well connected to the network of articles
The map below shows the in- and out-degree of each country's article. We can see quite a strong cluster of high in-degree articles around Europe. However, this plot does not take article length into account, so it may simply reflect the article-length bias already mentioned. To combat this, we normalized all in-degrees by the length of the article in question.
A neutral tone?
Wikipedia aims to keep a neutral point of view (NPOV) across all of its articles. In order to test this claim across the articles describing countries, we ran a sentiment analysis model across each article. Scores range from -1 to 1, with -1 being certainly negative and 1 being certainly positive.
All country articles received a sentiment score between -0.2 and 0.2, which would suggest that these articles are meeting the NPOV criteria of Wikipedia. Great news!
When the articles are grouped by economic classification and by region, different pictures emerge. We see below that there is no significant disparity in article sentiment across economic classifications.
However, some differences arise when we look at the sentiments grouped by geographical region.
Here, we can see a disparity between Northern Africa on the one hand, and Australia and New Zealand and Western Europe on the other. This could suggest that certain regions of the world are portrayed in a more positive light than others. We do have to consider, however, that some countries may simply have a more negative history, whether through colonisation, war, famine or any number of other things, which will in turn reduce their sentiment score. That being said, the disparities are quite small and all articles are indeed very neutral.
Ok, so the data has some inherent biases, but how do they manifest in how people play the game?
Below we look at the number of finished and unfinished paths where a country article was a target.
The region with the largest number of finished paths is Northern America, but this is not exactly a shock. We have seen that the in- and out-degree of the United States of America is huge, so the likelihood of finding something connected to your target in that article is quite high. It is more interesting to see that paths starting with countries in Micronesia and Polynesia have far fewer finished paths than their “western” counterparts.
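As a rough illustration of the counting behind a chart like this, the sketch below tallies finished and unfinished games per region and derives a finish rate. The input shape (one `(target, finished)` pair per game), the `region_of` lookup and all sample values are hypothetical simplifications; the real path files would first need to be parsed into this form.

```python
from collections import defaultdict


def finish_rate_by_region(games, region_of):
    """Return {region: (finished, unfinished, finish_rate)}.

    games: iterable of (target_article, finished_flag) pairs.
    region_of: dict mapping a country article to its region; games whose
    target is not a country article are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # region -> [finished, unfinished]
    for target, finished in games:
        region = region_of.get(target)
        if region is None:
            continue
        counts[region][0 if finished else 1] += 1
    return {
        region: (done, abandoned, done / (done + abandoned))
        for region, (done, abandoned) in counts.items()
    }


# Hypothetical games and region lookup, for illustration only.
region_of = {
    "United_States": "Northern America",
    "Canada": "Northern America",
    "Tonga": "Polynesia",
}
games = [
    ("United_States", True),
    ("United_States", True),
    ("United_States", False),
    ("Canada", True),
    ("Tonga", False),
    ("Tonga", True),
    ("Zebra", True),  # not a country article, so it is ignored
]
rates = finish_rate_by_region(games, region_of)
print(rates["Northern America"])  # (3, 1, 0.75)
print(rates["Polynesia"])         # (1, 1, 0.5)
```

Reporting the rate alongside the raw counts matters here, since a region with few games can look unsuccessful simply because it is rarely played.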
Indeed, knowing all of this, we would like to see if there is some correlation between all of these variables...
In walks regression analysis
By calculating the correlations between all of our variables, we see that the in-degree has a high correlation with game success rate. We wanted to see what, if anything, influenced the in-degree of an article. To do this, we performed linear regression analysis.
In-degree is an important attribute as it signifies how often a country appears in other Wikipedia articles. A high in-degree means that there are a lot of articles about, for example, people with a relation to this country, or historical events that link to that country. A big population is naturally linked to having more famous people with a Wikipedia article. Also, a strong economy allows a state to invest resources and time in innovation and science, which generates more content that can be published to Wikipedia and thus linked back to the country. Therefore, population size and economic classification are very likely to have an influence on the in-degree of a country's article.
Regression analysis shows that a reasonable amount of variance in the in-degrees is explained by the economic classification and the log of the population variable (R-squared = 0.462). The log is taken here due to the power law distribution of the population. When the region is considered in addition to these two variables, even more variance is explained (R-squared = 0.702)! This shows that, even if we control for the two most influential confounders, the region still has an effect on the in-degree.
Most notably, the positive effect size is largest for Northern America, Western Europe, and Australia and New Zealand, which are commonly referred to as "Western" countries. For Northern America and for Australia and New Zealand the effect is statistically significant; for Western Europe it is not. In contrast, Latin America, Northern Africa, Western Asia, Central Asia, Southern Asia and South-Eastern Asia have very large negative effect sizes, all of which are statistically significant.
This result further solidifies our theory that "Western" countries are overrepresented in the Wikispeedia dataset, even if we control for economic classification and population size.
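For readers who want to see the shape of this nested comparison, here is a minimal sketch using ordinary least squares in NumPy. It fits in-degree on log-population and economic classification, then adds region dummies and compares R-squared. All data below is synthetic and the helper names are hypothetical, so the printed values have nothing to do with the figures reported above, and this is not necessarily how our own regression was implemented.

```python
import numpy as np


def one_hot(labels):
    """Dummy-code labels, dropping the first level as the reference category."""
    levels = sorted(set(labels))
    return np.array([[float(lab == lev) for lev in levels[1:]] for lab in labels])


def r_squared(X, y):
    """Fit OLS with an intercept and return the coefficient of determination."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)


# Synthetic stand-in data (hypothetical countries, not the real dataset).
rng = np.random.default_rng(0)
n = 120
population = rng.lognormal(mean=16, sigma=1.5, size=n)
income = rng.choice(["high", "low", "middle"], size=n)
region = rng.choice(["A", "B", "C", "D"], size=n)
shift = {"A": 40.0, "B": 0.0, "C": -20.0, "D": -30.0}
region_effect = np.array([shift[r] for r in region])
in_degree = (
    10 * np.log(population)
    + 25 * (income == "high")
    + region_effect
    + rng.normal(0, 15, size=n)
)

# Base model: log-population + economic classification.
base = np.column_stack([np.log(population), one_hot(income)])
# Full model: base model + region dummies.
full = np.column_stack([base, one_hot(region)])

r2_base = r_squared(base, in_degree)
r2_full = r_squared(full, in_degree)
print(f"R-squared base: {r2_base:.3f}, with region: {r2_full:.3f}")
```

Adding predictors can never lower R-squared in OLS, so the size of the jump, together with the significance of the individual region coefficients, is what carries the argument rather than the mere fact of an increase.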
Ethical Considerations
The presence of biases in Wikipedia, such as gender biases that underrepresent women and perpetuate stereotypes, poses risks. These biases can distort public knowledge and, potentially, be significantly magnified by LLMs[1]. In addition to gender bias, biases related to ethnicity and race, among others, are notable societal issues[2].
Initially, our research aimed to categorize individuals with a Wikipedia article by ethnicity, in order to perform an analysis similar to those done on gender. However, we encountered significant technical and ethical challenges. The fluidity of ethnic groups and the personal nature of ethnic identity make accurate classification difficult, increasing the likelihood of errors. A particular ethical dilemma arose from the potential misuse of findings: if our flawed ethnicity-based analysis suggested an absence of racial bias in Wikipedia, such results could be erroneously exploited to downplay the seriousness of ethnic discrimination on the platform[3].
To navigate these complexities, we refocused our research on examining country representation within Wikipedia. This approach helps identify biases in the dataset and user navigation patterns. We hypothesize that analyzing representation by country can indirectly reveal racial biases. It's important to note that underrepresentation of certain countries does not necessarily indicate intentional racism by Wikipedia contributors. Rather, it may reflect systemic disparities, where countries with lower financial resources, often non-western, are less represented. By shifting our research question, we aim to mitigate the ethical risks associated with misinterpreting data on racial discrimination, opting for an inquiry that is more ethically sound.
Conclusion
In conclusion, our dataset and analysis suggest a bias towards high-income countries and against places like Micronesia and Polynesia. This dataset could work for an LLM aimed at the US but not for other regions, which should be considered before training a model! However, it is essential to recognize that many other unexplored factors, such as digital access disparities, may contribute to these patterns. Future research should take these multifaceted aspects into account to provide a more comprehensive understanding of biases in online knowledge repositories like Wikipedia and their implications for training machine learning models.