Around The World, Part 19: Constructing languages

I’m having a bit of an “off” day today, so let’s do something fun, even if it’s rather low priority for the game: generating foreign languages.

The idea is that the player starts out in a region where they speak the local language, which will be represented as English (or some other real-world language, if the game ever gets translated). However, as the player ventures out and discovers new civilizations, they’ll run into a language barrier. Text like quest descriptions and gossip will be shown in incomprehensible gibberish instead, until the player’s character learns the local language, or hires someone to translate. At that point, the foreign text is partially or entirely replaced by English. Of course, I don’t want to use real-world languages in my fictional world, so: we need an algorithm to produce languages!

A more immediate reason to write a language generator is for naming cities, seas, coasts, islands and the like. Right now, city names are just drawn from a list of real-world ports, which is a bit weird at best.

Prior work

A fairly well known generator for place names is due to Martin O’Leary. He has a great description on his website, and published the source code on GitHub.

I’ve experimented with O’Leary’s algorithm in the past, but I was never entirely happy with the results. One problem might have been that I was trying to do two things at once:

  1. Generating languages
  2. Generating words in a generated language

Repeatedly, I ran into the problem that as I added more diversity to words within a language, each language became less distinctive and recognizable, resulting in a uniform soup of words.

But a more fundamental issue was that generated words looked too artificial to me, which is probably due to the rigid phoneme structure that the algorithm uses. So instead, I’ll start with a more classic approach.

Markov chains

Unlike Martin O’Leary’s generator, a Markov chain needs to be created from input data, for example a large amount of text. If we feed it English text, it’ll generate words that look more or less like English (or, in many cases, are English). Since we don’t have any text in a language we haven’t yet generated, this sounds useless, but bear with me – I have some Ideas.

For our purposes, you can think of a Markov chain as a table that contains, for each combination of preceding letters of some fixed length, the probability distribution of the letter that follows. For example, one row in the table could look like this:

...
ba | 50% d, 20% g, 10% r, 10% t, 10% y
...

This particular table is based on counting trigrams, i.e. combinations of three letters. In this case, the trigrams bad, bag, bar, bat and bay were encountered in the input text. The left column contains the leftmost two letters of the trigram, and the right column indicates how frequently each next letter was encountered. There are special indicators for start-of-word and end-of-word, which can be treated just like other letters.

Creating the chain

To produce the Markov chain, we need some input data. I want a diverse set of languages, so after some searching, I settled on the UDHR corpus, which contains the Universal Declaration of Human Rights translated into 372 languages. This text is rather short: only 1783 words in English. For a normal Markov chain word generator, this would be pretty limiting. But for my Ideas, it shouldn’t matter.
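Counting the n-grams is only a handful of lines. Here's a minimal sketch in Python of how such a table can be built (simplified; reading and cleaning the corpus is omitted):

    from collections import Counter, defaultdict

    START, END = "^", "$"

    def build_chain(words, order=1):
        """For each prefix of `order` letters, count how often each next letter follows."""
        chain = defaultdict(Counter)
        for word in words:
            letters = [START] * order + list(word.lower()) + [END]
            for i in range(order, len(letters)):
                prefix = "".join(letters[i - order:i])
                chain[prefix][letters[i]] += 1
        return chain

    # Tiny example matching the trigram row shown earlier (order=2 means two
    # preceding letters, i.e. trigrams; order=1 gives the bigram table below).
    chain = build_chain(["bad", "bag", "bar", "bat", "bay"], order=2)
    print(chain["ba"])  # Counter({'d': 1, 'g': 1, 'r': 1, 't': 1, 'y': 1})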

Here’s the full table for the English version, if we count bigrams (combinations of two letters):

^ | a 270, t 256, o 168, i 101, r 96, s 96, e 93, h 92, f 89, p 82, b 69, w 58, c 57, d 43, n 41, m 32, l 27, u 21, g 17, j 7, v 4, k 3
a | n 174, l 128, r 93, t 92, s 68, c 30, $ 20, m 17, v 16, g 16, i 13, d 9, w 9, b 8, y 7, u 2, f 1, p 1, k 1
b | e 52, l 18, y 13, i 7, a 5, o 4, j 4, s 3, u 2, r 1
c | l 45, o 43, e 43, t 38, h 32, i 30, a 19, $ 12, u 10, r 9, c 5, k 5, y 1
d | $ 181, e 45, i 34, o 23, u 15, a 12, l 3, s 3, v 2, g 2, r 2, y 1, h 1
e | $ 363, r 144, n 106, d 84, s 66, c 51, v 40, l 39, e 38, a 32, m 23, t 21, q 16, i 13, f 12, p 7, x 7, o 6, g 3, y 3, k 2, b 1
f | $ 98, r 39, o 31, e 16, u 16, a 11, f 9, i 4
g | h 61, $ 33, e 22, a 12, n 10, r 8, i 8, u 4, s 3, o 3, t 1
h | e 174, a 75, i 60, t 56, $ 36, o 25, u 14, r 3, m 1, y 1, n 1
i | n 135, o 109, t 92, g 71, s 69, c 66, e 25, m 24, v 22, a 22, l 19, r 14, f 8, d 6, b 6, p 5, z 4, $ 1, h 1
j | u 6, o 5, e 4
k | $ 7, i 4, e 3, s 3, n 1
l | $ 120, e 74, l 55, i 47, a 29, y 23, o 15, d 12, t 8, u 7, f 4, v 2, s 2
m | e 49, a 30, i 23, $ 21, o 17, s 17, p 13, m 8, b 7, u 4
n | $ 171, d 142, t 83, a 59, e 58, g 41, i 34, c 34, s 32, o 24, y 19, j 4, h 2, n 2, u 2, l 2, m 2, k 1, v 1, f 1
o | n 189, $ 97, f 95, r 94, m 48, u 36, t 35, c 22, p 19, l 13, d 9, o 8, s 8, y 7, g 6, w 6, v 6, b 4, i 2, h 1, k 1
p | r 43, e 43, l 16, o 12, u 11, a 8, i 6, m 4, $ 4, p 3, t 1, s 1
q | u 16
r | e 128, i 100, $ 85, t 53, y 53, o 43, a 42, s 25, d 11, n 11, m 9, r 8, v 6, b 5, g 5, k 5, u 4, f 4, p 4, c 4, l 2, h 2
s | $ 214, e 47, t 46, h 33, o 32, s 24, u 16, i 15, a 11, p 11, c 9, l 3, d 2, r 1, y 1
t | h 199, i 161, $ 121, o 90, e 75, y 37, a 35, s 33, r 24, l 11, u 10, t 5, w 1, m 1
u | n 42, r 29, m 17, a 17, l 16, t 16, c 14, s 11, b 9, d 6, i 6, p 5, g 4, e 3, f 1
v | e 76, i 10, a 8, o 5
w | h 22, i 18, o 12, $ 10, e 6, a 5, n 2
x | i 2, p 2, e 2, $ 1
y | $ 127, o 31, m 3, r 1, w 1, l 1, s 1, i 1
z | a 3, e 1

I’m using ^ to indicate the start of the word and $ for the end of the word.

Generating words

Now let’s generate a word using this table, picking the most likely letters each time:

  • From the start-of-word indicator ^, the most likely next letter is a.
  • From a, the most likely letter is n, forming an.
  • From n, the most likely letter is $, so we are done.

Indeed, an is a rather common English word. If we had started with t instead, can you see where we are most likely to end up?
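In code, this walk is just a loop of table lookups. Here's a minimal sketch (building on the build_chain sketch above), which also supports the more interesting variant: picking the next letter at random, weighted by its count, instead of always taking the most likely one:

    import random

    # START, END and build_chain are from the earlier sketch.
    def generate_word(chain, order=1):
        """Walk the chain from the start marker until the end marker is drawn."""
        letters = [START] * order
        while True:
            prefix = "".join(letters[-order:])
            counts = chain[prefix]
            # Weighted random pick; use max(counts, key=counts.get) instead to
            # always take the most likely letter, as in the example above.
            next_letter = random.choices(list(counts), weights=counts.values())[0]
            if next_letter == END:
                return "".join(letters[order:])
            letters.append(next_letter)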

Now let’s write some code to automate this process, and also randomly select the next letter according to the weighting, instead of just picking the most common one. Here are 10 words produced by the table above:

no
ofessinyoory
tofen
thal
frtitopof
as
eraricolalees
con
berthigior
derthirghtiredendacorouly

… Okay. Not exactly English. What if we use trigrams instead of bigrams? The table is too large to show in full, but here’s some output:

and
the
beitradefoull
mation
ance
weedomine
in
artion
torights
a

Better. Four of those are actual English words, the rest look like parts of words glued together, but that’s fine. Let’s try it on some more languages. I’ve added capitalization and punctuation manually.

Language | Generated sentence
Dutch | Dat van welken eeft sond overwel echijn hetzen stikerrichtens opvolkel.
Spanish | Que la hagará destaral mérionesu plese humad nal humanacionalgual pentodada.
Finnish | Tei kaiskosa ja ja mihmisteettä arti la ja liseen ja.
Serbian | I člata privo poranjudje zemljšavo mednosti ste i ovešanda ne.
Zulu | No kuzwenhlama iswakekhelulundle nomba basivikathwazo olokwayimigazilele isizinke isizwe wona nombenza.
Chinese | 育 方 他 努 所 普 照 定 家 任.

Looks quite reasonable to me. I don’t think I’ll use non-Latin-based alphabets, but it’s neat to see that the algorithm correctly inferred that Chinese words mostly consist of a single character.

(Fun fact: DeepL translates the generated mock Chinese text as “The United States of America is the only country in the world that has a universal mandate for the protection of human rights.” So my code, in all its simplicity, is already on par with modern LLMs when it comes to hallucinating! But the sentence is clearly ambiguous, because Google instead translates it as “He strives to make the best use of his talents and to fulfill his family responsibilities”… Chinese is as wonderful as it is mysterious to me.)

The first Idea

Now, how do we get from these Markov chains for real languages to Markov chains for generated languages that don't actually exist? Here, my Idea is to not create a Markov chain of letters, but of letter classes: vowels and different types of consonants. When generating words, we substitute each class with a letter from that class. This should reduce the resemblance to the input text, and hopefully combine the strengths of Martin O'Leary's algorithm (pronounceable, plausible words) with those of Markov chains (natural word structures).

I’ll start with these letter classes, grouped roughly using my very limited knowledge of phonetics:

Label | Letters
V | vwf
M | mnñ
T | td
P | pb
L | l
S | szjščćž
K | ckq
G | grhx
A | aeiouyáéíóúàèìòùâêîôûäëöüï

Linguists will recoil in horror at this table: I’m assuming a 1:1 correspondence between spelling and pronunciation here, which is very wrong in general. But since the generated languages will be fictional, nobody will know how to pronounce them anyway.

Now before we create the Markov chain from our input text, we first replace each letter in the text by its class label. This yields a much shorter table, which hopefully captures the word structure (phonotactics) of the input language. For English:

Preceding | Following
AA | M 176, T 53, L 43, G 29, S 24, $ 22, K 11, P 11, V 10, A 2
AG | A 168, $ 86, G 79, T 64, M 30, S 25, P 11, V 10, K 9, L 2
AK | A 81, L 39, T 36, G 23, $ 14, K 10
^S | A 50, G 30, T 14, P 4, L 3, K 2
^T | G 160, A 139
^V | A 93, G 58
^^ | A 653, T 299, G 205, P 151, V 151, S 103, M 73, K 60, L 27
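In code, this is a preprocessing step in front of the same chain-building as before: map each letter to its class label, and separately remember, per language, how often each actual letter occurs within each class. A sketch (with the class table abbreviated; the accented vowels are omitted for brevity):

    from collections import Counter, defaultdict

    LETTER_CLASSES = {
        "V": "vwf", "M": "mnñ", "T": "td", "P": "pb", "L": "l",
        "S": "szjščćž", "K": "ckq", "G": "grhx", "A": "aeiouy",
    }
    CLASS_OF = {letter: label
                for label, letters in LETTER_CLASSES.items()
                for letter in letters}

    def to_classes(word):
        """Replace each letter by its class label; unclassified characters are dropped."""
        return "".join(CLASS_OF[c] for c in word.lower() if c in CLASS_OF)

    def letter_weights(words):
        """Per class label, count how often each actual letter occurs in this language."""
        weights = defaultdict(Counter)
        for word in words:
            for c in word.lower():
                if c in CLASS_OF:
                    weights[CLASS_OF[c]][c] += 1
        return weights

    # The chain is now built on class strings instead of the words themselves:
    # chain = build_chain([to_classes(w) for w in words], order=2)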

Of course, the quality of generated words is pretty hard to judge now:

GAPSAMKA
GAAT
AMT
TG
PGA
AVALTAM
TGALAALA
VAGA
AGAGTA
AVAL

To map these class labels back to actual letters, we use the occurrences of letters in the source language as weights. For example, in English, ‘e’ is the most common vowel, so it’ll have the largest probability of being chosen if class ‘A’ is encountered.

GAPSAMKA -> repsancu
GAAT -> hiet
AMT -> unt
TG -> dr
PGA -> pha
AVALTAM -> evaltom
TGALAALA -> thiloela
VAGA -> vere
AGAGTA -> ehirda
AVAL -> owel

It certainly doesn’t look like English anymore, but it does look somewhat like a real language.
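Rendering a generated class string back into letters is then a weighted draw per class, using those per-language letter counts. A sketch:

    import random

    def render(class_word, weights):
        """Turn a string of class labels back into letters, weighted per source language."""
        letters = []
        for label in class_word:
            counts = weights[label]
            letters.append(random.choices(list(counts), weights=counts.values())[0])
        return "".join(letters)

    # e.g. render("GAPSAMKA", letter_weights(english_words)) might come out as "repsancu"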

Can we still distinguish our input languages from each other?

Language | Translated sentence
English | Hepsanco ongs ofe pa ren rot reco beind elesois ba.
Dutch | Hovt fheegerelinoden ta vigste den sen en vaen ieetin iam.
Spanish | Pocrancoenicin proenpracon depdaruehosunil lo ensdes inena ne si o por.
Finnish | Sehviltaan jolnaisee soin ara aos vili alli votu takaon ioldervyte.
Serbian | Jvhća ci dake ejan tgutzeni jve prapag o jen avirlosovo.
Zulu | Zeantahla aselolalu na omgulinpe ekgemte nha ezujwezini oswancywy la wuntgokuto.

We still get some of the double vowels typical of Dutch and Finnish, and the rhythm of the Spanish words also seems familiar. English is all over the place, but that's fair, considering it's something of a bastard language.

However, I’m not thrilled about phrases like “pocrancoenicin proenpracon depdaruehosunil”. Those are altogether too many long words in a sequence.

Word lengths

Fortunately, I don’t want to generate entirely random sentences; I want to “translate” English into these generated languages. That lets us use the original English sentence structure as a basis, which doesn’t have such word sequences. So let’s start with a famous English sentence:

It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.

For each word, we’ll generate a word in our target language that has the same length. I’m just doing this brute-force by generating words until one of the right length comes up (allowing a little leeway if this takes too many attempts):

Language | Translated sentence
English | Dr bo en fenec tholerefend fritemyh, tach mah peamni tga uh hisedehtaos na li arer treseut, roun he ri thes en jer er.
Dutch | Ge in ca aweri everdied entermekgtih, tico en ainven ofd hol ollosejhan di av bisn injalto, acht an an folj in in eint.
Spanish | Ra lu ó cecom cenpinamenete esórvecre, dede o dochoi dos ce laprecopiál ta e ale vinstaz, heho se ne con la am lasal.
Finnish | An os tä kakina parsikejta kiimeittevyson, esin ke käatan han nos tiklunnäsvyn hoam te etun ehdanta, ahten ta ut naat on dän keyn.
Serbian | I an i jvono drjulmarlen epstiti, nuvo u ojonsu sui vo sugide na i ali miluvok, slodo a eno nina si ta pidu.
Zulu | Le wu ki inhu ecazikekelwy kwytshakhabu, ongo eo akwale eenwu na unbhedhutvi kwy mu okga aswoche, aswa ma nha ano esi awo langi.
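The brute-force length matching is as simple as it sounds; roughly:

    def word_of_length(target, make_word, attempts=100):
        """Generate words until one hits the target length; fall back to the closest one."""
        best = make_word()
        for _ in range(attempts):
            word = make_word()
            if len(word) == target:
                return word
            if abs(len(word) - target) < abs(len(best) - target):
                best = word
        return best

    # make_word would be something like:
    #   lambda: render(generate_word(class_chain, order=2), weights)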

In real languages, of course, word lengths are not the same between languages. Some, like Greenlandic, are famous for their long words – the longest Greenlandic word in our input text is “suliffissaaruttoornaveersaartitaanissamut” at 41 letters. Other languages might consist mainly of shorter words. The distribution of word lengths is a significant part of each language’s characteristic look and feel.

To incorporate word length into our generator, we’ll start by tallying how often each word length occurs in the input text:

(Chart: distribution of word lengths for English and Greenlandic)

Now, rather than just drawing random samples from this distribution, I would prefer that short words in the input sentence remain short in the translation, and long words remain long. So we want to somehow “remap” word lengths from English to Greenlandic. We can do that by looking at quantiles. For example, say we want to know the appropriate word length for the translation of the 5-letter word “truth” into Greenlandic. First we calculate: what proportion of English words are shorter than 5 letters? Let’s say it’s 30%. We then generate a word such that (roughly) 30% of Greenlandic words are shorter than that word. Let’s see if it works:

Language | Translated sentence
English | Dr bo en fenec tholerefend fritemyh, tach mah peamni tga uh hisedehtaos na li arer treseut, roun he ri thes en jer er.
Dutch | Ren sen ca tikgap werivelcanekeahenbeg entermekgtih, hienr ep oevenr eem ofd jirveredelazde rie lat vejn eateewd, vomjm wra ran oafel nen vom hemhe.
Spanish | Ra lu ó irduro cenpinamenete intesteea, qorda o dochoi baro on pehtelicil ne a toin dadroccóe, aserud la am tyine de i upah.
Finnish | Syan enui an eramion attinellodaselu ogtemkoättistima, seenoin ha sästaetta allaäte han tiklunnäsvyn hoam suen joenjam pätassaltta, ehdanta ahten oon teunnat naat on oulten.
Serbian | I an i iparat šadodisamevnu majbesnapejte, nuvo o ikasta jrinto ti etnujanajesi šo a potća bravirač, nina in am žlijo ku a moko.
Zulu | Nahle mhulule inhu ecazikekelwy mzamhaputhakwuna kwytshakhabu, enhylali lola oaudhule ajekwonka ikacge asuwezwumgle aakula axlu emebagte nokuthenpha, iswucusu lelolu bosanga kvemhike nhalu awo ecalwenhu.

It seems to do the trick. In particular, mock Finnish and mock Zulu have become considerably more representative of their origins.
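The remapping itself is a small computation: find what fraction of the English words are shorter than the word we're translating, then look up the length at that same fraction in the target language's distribution. A sketch, assuming we keep a sorted list of word lengths per language:

    import bisect

    def remap_length(english_length, english_lengths, target_lengths):
        """Map a word length through the quantiles of two (sorted) length distributions."""
        # What proportion of English words is shorter than this one?
        quantile = bisect.bisect_left(english_lengths, english_length) / len(english_lengths)
        # Look up the target-language length at (roughly) the same proportion.
        index = min(int(quantile * len(target_lengths)), len(target_lengths) - 1)
        return target_lengths[index]

    # e.g. remap_length(5, english_lengths, greenlandic_lengths) for the word "truth"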

Supercalifragilisticexpialidocious

Oh, did someone say Greenlandic? Alright, let’s have it:

Language | Translated sentence
Greenlandic | Tiaqarnit inatenraaotila unat eijsataluunmuit alitiunimniasagsuritriluanakait qaemnetirsinasasesuitugsiitaissi, tinmgofvaimalli senuq pernuiffainmiinaqqanmaq iittiuvfaqkit qaniaremi inmappamnainnainniqalluat pississok pasat uppagaiteautaanulit ugniqissannassinnut, anmriunnatsiniq anuitirtaq pasat turanissiinmut tiavfakirat mutit pasjanikiaraaq.

However, there’s a subtlety that I’m ignoring here: languages with long words typically pack more meaning into each single word. For example, the aforementioned “suliffissaaruttoornaveersaartitaanissamut” appears to mean “protection against unemployment”, so it does the work of three English words all by itself. Even then, Greenlandic is relatively verbose: the entire Universal Declaration of Human Rights text contains 58% more characters than its English counterpart.

I don’t want to get into such merging of words, but I do want to scale the word length back a bit, so that on average a generated Greenlandic word is 58% longer than the original English. It seems to work – words are still long, but no longer outrageously so:

Language | Translated sentence
English | Dr bo en fenec tholerefend fritemyh, tach mah peamni tga uh hisedehtaos na li arer treseut, roun he ri thes en jer er.
Greenlandic | Akanit kalla mi tigninnik eessutnakusinngat isunalilliirutonamu, aronnrit ania taqirfuk umagtat qigat tiqalarirnililli pasat taq pissutet anuitirtaq, anaila inniat mutit nakugeemut tiutat tiak samakcali.

The second Idea

For each input language, we now have:

  1. a Markov chain for letter classes,
  2. the probability of each letter within each letter class,
  3. the distribution of word lengths.

The second Idea is that we can mix and match these between languages. For example, this could produce a language with the rhythm of Italian, the letters of Turkish, and the word lengths of Greenlandic.

For that to work well, we need some more languages in the mix. So I hand-picked over 30 languages that can reasonably be rendered in an extended Latin alphabet, trying to cover a broad spectrum of language families based on this website. For Abkhaz, Arabic, Chinese, Greek, Japanese, Russian and Ukrainian, I used the uroman Python package and cleaned up the output manually.
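With those three ingredients extracted per source language, mixing and matching is just a matter of drawing each ingredient from an independently chosen source. A sketch of the data structure involved (the names here are illustrative, not the game's actual code):

    import random
    from dataclasses import dataclass

    @dataclass
    class Language:
        chain: dict           # letter-class Markov chain (ingredient 1)
        letter_weights: dict  # per-class letter frequencies (ingredient 2)
        word_lengths: list    # sorted word-length distribution (ingredient 3)

    def mix_language(source_languages):
        """Draw each of the three ingredients from an independently chosen source."""
        a, b, c = (random.choice(source_languages) for _ in range(3))
        return Language(chain=a.chain,
                        letter_weights=b.letter_weights,
                        word_lengths=c.word_lengths)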

After implementing the random mixing and matching, we have over 50,000 possible fictional languages. We can now also make each language generate its own name!

Language | Origin | Translated sentence
Sashhomuz | Abkhaz + Hani + Uzbek | Ae ti es eshi uaahahu uhutalido, aee yi dgoq ydeo cho onashulexds aqx ti aaio ilaqhu, ojny yim din oshi ea eb eliq.
Drli | Hakha Chin + Serbian + English | Me um i orto rmokinraoldo domrmir, nuk or rnoika ćai lo rmecno li i lavr kaimok, airr li li duuk ng a nrad.
Lotolis | Hungarian + Russian + Finnish | Nic mon vo elazuro otelidenecmeoll satihasapamily, bolicc a zsosihen tisnyn sa kudisetereht kit i viledes yllantaho, osatsyk ilu lin ssen lin suh ventab.
Brvah | Irish + Turkish + Uzbek | Ir eys ü ardrem ırseedheas ütarrriihsethıç, krin ğü ketril bian emi bıekrenhse hkr t iıne ereeh, beıl lyn bra anti rus an fart.
Nespasto | Quechua + English + Lithuanian | Osfes hoam as asllenam cicremaocope cieuacoscillis, llasko to vifehonto toenes com boscaaetetu tece os ronuma nanacal, cenan min cobi moiene es cona llince.
Mahgi | Edo + Xhosa + Kurdish | Ei ao en wanfan nege agpanvan, kan e kaghe ohaen ni ohaqho kgy e uxeu inwigpo, uabe su ta awbe no u inwen.
Tahiwk | Hani + Huasteco + Kikongo | Má nal jac baaik piulnákkoil noaanák, cyal bé céewik caaw ka jilpa bá bi pyjyl maéwc, lakhha ia la táawk i sa lakjek.

I like it! These languages are fairly plausible, yet reasonably distinctive.

It would be possible to mix even harder, for example by choosing each letter class from a different language, or by constructing a new Markov chain from the trigrams of multiple languages merged together. But I'll leave it at this for now.

Place names

Now that we can generate languages, and words within each language, it’s a small step to produce names of cities. Most of the time, these will be single words, but sometimes we’ll want to generate a composite name like “Dar es Salaam” or “The Hague”. I did have to impose a limit of 20 letters for this not to get out of hand.
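One way to handle composites (a sketch only; make_word stands in for the per-language word generator, and the 80/20 split between single and composite names is just a guess for illustration) is to occasionally glue two generated words together with a space or a hyphen, retrying whenever the result exceeds the 20-letter cap:

    import random

    def city_name(make_word, max_len=20):
        """Usually a single word, occasionally a two-part composite, capped at 20 letters."""
        while True:
            parts = 1 if random.random() < 0.8 else 2
            sep = random.choice([" ", "-"])
            name = sep.join(make_word().capitalize() for _ in range(parts))
            if len(name) <= max_len:
                return name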

Language | Origin | Generated city names
Sashhomuz | Abkhaz + Hani + Uzbek | Znexkha, Aiylolo, Omyiheop, Yeyhi, Amimeh, Esunsou, Aihegai-Amuihui, Ieno-Ilejhe, Om-Osamimaala, Ipshi
Drli | Hakha Chin + Serbian + English | Rlenkrim, Tiasojal, Krot-Mre-Ninta, Kgaor-Denrkan, Liolla, Kaomrtir, Diulrni, Kop Jijela, Nintumr, Zoldin
Lotolis | Hungarian + Russian + Finnish | Zsynintak, Zzihoga, Sayhek, Venkiom, Sasso, Va-Valtosas, Ellak, Zssat, Nentos, Tok-Kolehik
Brvah | Irish + Turkish + Uzbek | Rcrtriymta, Us-Rımtram, Kht-Iaimy, Ahkitr, Iıhfarb, Bhaapha, Deutr, Leene, Lianmü, Etr-Rğenna
Nespasto | Quechua + English + Lithuanian | Sebayni, Ogane, Lloes, Cintoe-Ates, Ebeshiscoe, Lenes, Comeno Sas Cafpeo, Cgoon-Llonim, Mynocanaj, Chinemim
Mahgi | Edo + Xhosa + Kurdish | Goaemwon, Idymnwyna, Fewon, Unnwun-Iwboe, Ovonwan, Uluhu, Agaunkufu, Aewba, Hhum Nizukheke, Igei-Iwenfaunae
Tahiwk | Hani + Huasteco + Kikongo | Jsal, Siak-Cáwk, Nekkawc, Xial, Kakyw-Naek-Keaow, Aac Lácáyl, Baol, Aí Keuw, Jakbail Taw Naáktak, Jjylcea

I can just imagine the player finally learning Sashhomuz in the city of Aiylolo, only to be confounded when the next city over, Sayhek, turns out to speak Lotolis instead…