Detecting Bias via Swedish Parliament Speeches
Nimish Gåtam, May 12, 2018
What kinds of words do representatives use as though they were synonyms? When they say “citizen”, do they mean “fellow person”? Or do they use it interchangeably with “consumer”?
I made a tool to explore and answer these questions. Give it a try! (in Swedish)
How to use the data explorer
You pick the political party, type in a word, and you get the top 25 words with the highest similarity to the combination you typed in.
Hovering over a word shows its cosine similarity multiplied by 100 and rounded to a whole number (so a similarity of 0.5539650 is shown as 55). You can click any of the returned words to search for that word in turn, and so on.
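Here is a minimal sketch of what such a lookup could look like. It is not the tool's actual code: the toy vectors, the function names, and the choice to combine a multi-word query by averaging its vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_similar(vectors, query_words, n=25):
    """Rank all other words by cosine similarity to the query.

    Assumption: a multi-word query is combined by averaging its word vectors.
    """
    dims = len(next(iter(vectors.values())))
    query = [sum(vectors[w][i] for w in query_words) / len(query_words)
             for i in range(dims)]
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in query_words]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

def display_score(similarity):
    """Similarity as shown on hover: scaled by 100, rounded to a whole number."""
    return round(similarity * 100)

# Hypothetical toy vectors standing in for one party's model.
toy_vectors = {
    "sverige": [0.9, 0.1, 0.0],
    "norge": [0.8, 0.2, 0.1],
    "danmark": [0.7, 0.3, 0.0],
    "skatt": [0.0, 0.1, 0.9],
}

for word, sim in top_similar(toy_vectors, ["sverige"], n=3):
    print(word, display_score(sim))
```

With a real model the dictionary would hold one learned vector per word in that party's speeches, and n would stay at 25.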
Background – Word2Vec
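Word2vec is an algorithm that turns every word in a corpus into a ‘word vector’ based on the contexts that word appears in. A simplified way to picture it (the real algorithm learns much denser vectors, with far fewer dimensions than there are words): each word in the corpus becomes an axis, and each word also gets its own vector in that space.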
The ‘word vector’ part is based on taking the words surrounding the target word within a given window (I set it to 5). So if you had the following sentences as your entire corpus:
I really think cats are great
I really think dogs are great
Your dimensions would be: I, really, think, cats, dogs, are, great. The word vector for cats would be pointed towards the I, really, think, are and great dimensions. In fact, it would be pointed equally towards all of those dimensions since those words occur equally within its context.
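The two-sentence example can be reproduced with a simple count of context words. Note that this is a counting simplification of the picture described above, not the actual word2vec training procedure; the function name is made up for this sketch.

```python
from collections import Counter

corpus = [
    "I really think cats are great",
    "I really think dogs are great",
]
WINDOW = 5  # words considered on each side of the target word

def context_counts(sentences, window):
    """For every word, count how often each other word appears within the window."""
    counts = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.setdefault(target, Counter()).update(neighbours)
    return counts

counts = context_counts(corpus, WINDOW)
dimensions = ["I", "really", "think", "cats", "dogs", "are", "great"]

# One number per dimension: how strongly the word "points" along that axis.
cats_vector = [counts["cats"][d] for d in dimensions]
dogs_vector = [counts["dogs"][d] for d in dimensions]
print(cats_vector)  # [1, 1, 1, 0, 0, 1, 1]
print(dogs_vector)  # [1, 1, 1, 0, 0, 1, 1]
```

Both vectors come out identical, because cats and dogs appear in exactly the same contexts in this tiny corpus.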
The cool part is that these are still 1-dimensional lines, even if they exist in multidimensional space, so you can find the angle between them.
You can even take the cosine of the angle between them. If they’re pointed the exact same way (which, in this case would mean they occur in the exact same contexts), the cosine of the angle between their word vectors would be 1. If they’re completely independent, you’ll get a cosine of 0. If they’re inversely related, you’ll get a cosine of -1.
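The three cases can be checked with a few lines of plain Python. The vectors here are made up purely to hit each case.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction (length does not matter, only the angle).
same = cosine([1, 1, 0], [2, 2, 0])
# No shared dimensions at all.
independent = cosine([1, 0, 0], [0, 1, 0])
# Exactly opposite directions.
opposite = cosine([1, 1, 0], [-1, -1, 0])

print(round(same, 6), round(independent, 6), round(opposite, 6))
```

Up to floating-point rounding, these come out as 1, 0 and -1 respectively.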
When this is done on real data sets with unbiased words, we do actually see that words with similar contexts have very strong semantic relationships. The vectors even allow for semantic math of sorts (most famously, king - man + woman ≈ queen). Here is an excellent post that goes into greater detail.
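The ‘semantic math’ is usually written as king - man + woman ≈ queen: subtract one vector, add another, and look for the nearest remaining word. The sketch below uses hand-made two-dimensional toy vectors (roughly ‘royalty’ and ‘gender’) so the arithmetic is easy to follow; a real model learns its vectors from data, and the dimensions are not labelled like this.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors: (royalty, gender). Purely illustrative.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}

def analogy(vectors, positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [sum(vectors[w][i] for w in positive)
              - sum(vectors[w][i] for w in negative)
              for i in range(dims)]
    # The query words themselves are excluded from the candidates.
    candidates = {w: v for w, v in vectors.items() if w not in positive + negative}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy(toy, positive=["king", "woman"], negative=["man"]))  # queen
```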
Interpreting the results
Cosine similarity can range from -1 to 1, but the top results mostly come back as values between 0 and 1, so it’s very easy to (mistakenly) interpret them as percentages. They’re not. That having been said, though, I needed a way to show visually that some words were ‘more’ similar than others. So I treated the similarity metric as a scaling factor and made the more similar words bigger.
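One way the scaling could work is sketched below. The base size, the maximum extra size, and the clamping of negative values are assumptions for illustration, not the values the tool uses.

```python
def font_size(similarity, base_px=12, max_extra_px=36):
    """Map a cosine similarity to a font size in pixels.

    Values are clamped to [0, 1], so negative similarities get the base size.
    """
    clamped = max(0.0, min(1.0, similarity))
    return base_px + max_extra_px * clamped

for sim in (0.25, 0.5, 1.0):
    print(sim, font_size(sim))
```

The size here is only a visual cue for ordering; a word drawn twice as large is not ‘twice as similar’ in any meaningful sense.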
The ‘meaningfulness’ of this depends on the size of the corpus and the relative frequency of that word in the corpus. Frequently used words, like “Sweden”, give good results, but less frequent words might show a strong similarity metric just because they were used together a handful of times.
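A quick way to spot words that are too rare to trust is to count them before looking at similarities. The sketch below is one possible check, with a made-up token list and an arbitrary threshold of 5; it is not something the tool itself does.

```python
from collections import Counter

def rare_words(tokens, min_count=5):
    """Return the words that occur fewer than min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count < min_count}

# Made-up token list: one frequent word and two infrequent ones.
tokens = ["sverige"] * 40 + ["fiende"] * 2 + ["tolvåring"] * 1

print(sorted(rare_words(tokens)))  # ['fiende', 'tolvåring']
```

Any similarity involving a word flagged this way deserves extra suspicion, since a handful of shared contexts is enough to make two rare words look closely related.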
The only way to know how similar is ‘similar enough’ is to sanity-check against an expert who knows Swedish parliamentary speeches, history, and data really well. I don’t have access to such an expert, so I left it all as-is.
The Swedish parliament has all of its speech data available here. I downloaded the XML for speeches between 2006 and early 2018. I did some basic cleanup (capitalization, punctuation, etc.) and created groupings by political party.
Obviously, there are many caveats to this approach. Some political parties are older than others, the memberships change over time, my cleanup is overly simplistic (no stemming, and no combining of word forms in general), this might not be enough data, etc.
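The cleanup and grouping step might look roughly like the sketch below. It assumes the speeches have already been pulled out of the XML as (party, text) pairs; the sample sentences are invented, and the XML parsing itself is left out because the file layout is not described here.

```python
import re
from collections import defaultdict

def clean(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    text = text.lower()
    # \w matches Unicode letters, so å, ä and ö are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

def group_by_party(speeches):
    """Collect the cleaned token lists of all speeches under their party."""
    grouped = defaultdict(list)
    for party, text in speeches:
        grouped[party].append(clean(text))
    return dict(grouped)

# Invented sample data, standing in for the parsed XML.
speeches = [
    ("S", "Sverige behöver fler jobb!"),
    ("M", "Skatten måste sänkas, herr talman."),
    ("S", "Vi försvarar välfärden."),
]

by_party = group_by_party(speeches)
print(by_party["S"])
```

Each party's list of token lists can then be used to train a separate model for that party.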
All that having been said though, the initial results are interesting and fun.
The end result is, well, not very clear-cut.
Some associations make sense, such as Sweden being in the same semantic category as Norway, Denmark etc.
There are also some of the expected associations (xenophobia in the xenophobic party, environmentalism in the environmentalist party etc.) but not as many as I would have guessed.
A vast majority, though, is noise and random chance. We see ‘enemy’ and ‘12-year-old’ paired up for the Social Democrats. I doubt they see 12-year-olds as enemies. It’s most likely that these two words are very infrequent, and happened to have similar contexts the small handful of times they were used.
Still, it’s kind of cool when you put in something like ‘cultural heritage’ (kulturarv) and see that every political party wants to protect it (försvar), but they all have different ideas of what it is.
- Code on GitHub.