Getting a word list from WordNet

A good selection of words is essential for a fun drawing-and-guessing game; they must be neither too easy nor too difficult. However, I’m thoroughly lazy, so I was not going to compile a word list by hand.

Instead, I used Princeton WordNet, which is essentially a graph linking English words to each other in various relationships. (I’ll be focusing on English initially, but won’t be losing sight of other languages!) To make the following comprehensible, I’ll have to explain some more about WordNet’s structure.

The main construct is the synset, of which WordNet contains about 118,000. A synset (“synonym set”) does not directly represent a word, but rather a “concept”. For example, the “fly.n.01” synset represents the concept of a certain type of two-winged insect. The “fly.n.03” synset, on the other hand, represents the concept of an opening in trousers closed by a zipper. Likewise, “fly.v.01” (‘v’ of course standing for ‘verb’, where previously ’n’ represented ’noun’) represents the concept of moving through the air.

Each synset contains one or more lemmas, which are the actual words used in everyday language to represent the concept. The insect mentioned above has just the lemma “fly”, but the notion of moving through the air has two lemmas: “fly” and “(to) wing”.
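For concreteness, here’s what the examples above look like through NLTK’s WordNet interface, which is one common way to access WordNet from Python (the output comments are WordNet’s own content):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

fly_insect = wn.synset('fly.n.01')
print(fly_insect.definition())   # two-winged insects characterized by active flight

fly_verb = wn.synset('fly.v.01')
print(fly_verb.lemma_names())    # ['fly', 'wing']
```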

Finally, there are relationships between synsets. One of the most important is the hypernym relation, which essentially points to a superset: “insect” is a hypernym of “fly”, “organism” is a hypernym of “insect”, and eventually all nouns trace back to the root hypernym “entity” along one or more paths. The opposite relation is hyponymy: the set of all hyponyms of the “insect” synset is the set of all insects. This taxonomy is complete for nouns, but makes little sense for most other types of words.
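In NLTK terms, walking the taxonomy looks roughly like this:

```python
from nltk.corpus import wordnet as wn

# Up: follow hypernyms from "fly" until we reach the root.
synset = wn.synset('fly.n.01')
while synset.hypernyms():
    synset = synset.hypernyms()[0]  # a synset can have several parents; take the first
print(synset.name())  # entity.n.01

# Down: the transitive closure of hyponyms of "insect" is the set of all insects.
insects = set(wn.synset('insect.n.01').closure(lambda s: s.hyponyms()))
print(len(insects))
```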

The plan

Initially, I planned to extract various metrics per word:

  • Word type. Nouns are straightforward, but verbs, adjectives and adverbs should probably be excluded entirely.
  • Number of connections to other words. Well-connected words might be easier.
  • Depth in the taxonomy. Deeper words are more specific so might be harder.
  • Word length. Longer words might be harder to guess.
  • Whatever else I can come up with.

Then I’d manually classify a few hundred words on a scale of 1-5 (easy to impossible) and correlate the results with the metrics above. If I found a strong correlation, I’d have a way of estimating the difficulty of a word without tedious manual work. I could even generate separate lists for easy/medium/hard games.
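As a sketch, the per-word metrics could be pulled out of NLTK roughly like this; the dictionary keys and the exact choice of metrics are mine, not a fixed API:

```python
from nltk.corpus import wordnet as wn

def metrics(synset):
    """Candidate difficulty metrics for one synset (metric names are mine)."""
    word = synset.lemma_names()[0]
    return {
        'pos': synset.pos(),                                   # 'n', 'v', 'a', 'r'
        'connections': len(synset.hypernyms()) + len(synset.hyponyms()),
        'depth': synset.min_depth(),                           # shortest path to the root
        'length': len(word),
        'word': word,
    }

print(metrics(wn.synset('fly.n.01')))
```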

Tricky tricky

As it turned out, this approach did not work very well, because most graph algorithms from the otherwise excellent NetworkX library were too slow to run on a graph this size. And from what I learned afterwards, I don’t think it would have worked well even if they had been fast. For example, “eagle” would have been a fun word, but its metrics are similar to those of “Kamchatkan sea eagle”, which… well, you get the point.

A problem is that WordNet is full of jargon. A large portion of it consists of Latin plant names, which are fortunately easy to filter because they trace back to a common hypernym. But then there are chemical elements, diseases, legal terminology… the list goes on.
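For instance, a filter could test whether a synset lives under a known jargon-heavy subtree. The anchor synset below (“taxonomic_group.n.01”, which covers genera and families) is my guess at where the Latin names hang, not something taken from my actual program:

```python
from nltk.corpus import wordnet as wn

def has_ancestor(synset, ancestor):
    """True if `ancestor` appears anywhere up the hypernym chains of `synset`."""
    return ancestor in synset.closure(lambda s: s.hypernyms())

# Hypothetical anchor: genera, families etc. all sit under "taxonomic group".
taxon = wn.synset('taxonomic_group.n.01')
print(has_ancestor(wn.synset('genus_rosa.n.01'), taxon))  # True
```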

But all these words are quite rare. If we look at how many lemmas (representations in common speech) a synset has, would that not give a notion of how well-known the synset is? Well, yes. Here are some of the synsets which have the largest sets of synonyms: “die”, “besotted”, “buttocks”, and of course “sleep together”. Again, you get the point.
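That question is quick to check (the slice size is arbitrary):

```python
from nltk.corpus import wordnet as wn

# Rank all synsets by how many lemmas (synonyms) they carry.
by_synonym_count = sorted(wn.all_synsets(),
                          key=lambda s: len(s.lemmas()),
                          reverse=True)
for synset in by_synonym_count[:10]:
    print(len(synset.lemmas()), synset.name(), synset.lemma_names()[:4])
```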

Another problem is that some words are overly specific, like “Kamchatkan sea eagle”. By looking at the distance of a synset to the root synset “entity”, could we avoid getting too specific? Well, yes, but at the shallow end we get words like “physical entity”, “thing” and “object”. On the other end of the spectrum are “striped marlin”, “king mackerel” and “Florida pompano”, which are obviously way too specific for the average Joe unless he’s a fisherman. The taxonomy is deeper for some concepts than others, so absolute depth is not meaningful.
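Depth is equally easy to query; in NLTK, `min_depth()` is the length of the shortest hypernym path from a synset to the root:

```python
from nltk.corpus import wordnet as wn

for name in ('entity.n.01', 'object.n.01', 'eagle.n.01'):
    synset = wn.synset(name)
    print(synset.min_depth(), name)
```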

Final trade-off

As fully automatic classification was too hard, I turned to a somewhat more manual approach. The hypernym relation still seemed pretty useful, since it allowed me to disqualify large groups of words at a time. So I wrote a Python program that did the following (a sketch in code follows the list):

  • Discard all words with a capital letter in them. These are often obscure places and people.
  • For each synset, compute the set of hyponyms (e.g. for “insect” we get the set of all insects).
  • From biggest to smallest set, show me the name of the synset, some example words, and let me decide if it should be kept or discarded.
  • When done, go through all synsets that remain, and use the first lemma from the synset in the word list. (This is generally the most common one.)
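A minimal sketch of that loop, assuming NLTK; the prompt format and variable names are mine, and the real program also let me revisit already-excluded synsets:

```python
import random
from nltk.corpus import wordnet as wn

def hyponyms(synset):
    """Transitive closure of the hyponym relation."""
    return set(synset.closure(lambda s: s.hyponyms()))

# Keep only synsets whose primary lemma is all lower-case
# (capitalised words are mostly obscure places and people).
candidates = {s for s in wn.all_synsets('n') if s.lemma_names()[0].islower()}

excluded = set()
# Biggest hyponym set first, so one decision covers many words at once.
for synset in sorted(candidates, key=lambda s: len(hyponyms(s)), reverse=True):
    subs = hyponyms(synset) & (candidates - excluded)
    if not subs:
        continue
    sample = ', '.join(s.lemma_names()[0]
                       for s in random.sample(list(subs), min(4, len(subs))))
    if input(f'{synset.name()}? {len(subs)} hyponyms, like {sample}... keep? ') != 'y':
        excluded |= subs | {synset}

# The first lemma is generally the most common rendering of the concept.
words = sorted({s.lemma_names()[0].replace('_', ' ') for s in candidates - excluded})
```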

This top-down approach was still not perfect, but it allowed me to cut out large chunks of useless words quickly and keep refining the set. The conversation went something like this:

  • “Entity? 74347 hyponyms, like pinball, bitterness, finisher, trinectes…”
    Keep, or we’ll have nothing left.
  • “Physical entity? 39556 hyponyms, like agamete, mealybug, four-pounder, down…”
    Generally keep; we’ll cut the crap later.
  • “Abstraction? 38669 hyponyms, like perdicidae, tophus, rebuke, conceivableness…”
    Hell no!
  • “Object? 29581 hyponyms, like barroom, whitlow grass, amphibian, bunny…”
    Looks good!

And so on, all the way down to specific categories like “bread” (you’d be amazed how many types of bread there are!). I also had it show me synsets that were already excluded, so I could choose to exclude all animals except mammals, for instance. (Not that I did this. “Barbary ape”, “entellus” and “proboscidean” hardly make for a fun game either.)

The final list contains 19643 words and is still far from perfect. Many words that would be fun have been excluded, but I doubt anyone will miss them. I’ll provide a “skip” button for hopeless cases. Also, the word list will get better over time, because of…

Adaptivity

Every game will be logged. Was the word guessed? How long did it take? How many guesses? Was it skipped entirely? Given enough rounds with a particular word, we’ll have a pretty good idea of how hard it is to guess, and we can move it to a word list of a different difficulty level if appropriate. With about 20,000 words to go through at an average of (say) 20 seconds per word, that’s roughly 400,000 seconds, so it’ll take less than five days of continuous play before each word has been seen once.
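As a sketch of the per-word bookkeeping I have in mind (the field names, the weights and the 60-second cap are all placeholder assumptions, not anything decided yet):

```python
from dataclasses import dataclass

@dataclass
class WordStats:
    plays: int = 0
    solved: int = 0
    skipped: int = 0
    total_seconds: float = 0.0  # summed guessing time of solved rounds

def difficulty(s: WordStats) -> float:
    """Crude score in [0, 1]; higher means harder. The weights are guesses."""
    if s.plays == 0:
        return 0.5  # no data yet: assume medium difficulty
    miss_rate = 1 - s.solved / s.plays
    skip_rate = s.skipped / s.plays
    avg_time = s.total_seconds / max(s.solved, 1)
    return 0.5 * miss_rate + 0.2 * skip_rate + 0.3 * min(avg_time, 60) / 60
```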

Given enough data, we can even correlate difficulty with the user’s geographical location. If there’s a strong correlation, we might be looking at some kind of cultural bias and the word should probably be included in a regional word list only.