Listing cities for Orbital Express

For my game Orbital Express, I need a list of cities that can serve as targets for the player to aim at. We’re trying to select cities that…

  • … are somewhat uniformly spread across the globe. That way, you won’t end up always having to aim for the same densely populated areas like Europe.
  • … players are likely to have heard of. Otherwise, they feel cheated.

First, I had to find a list of candidate cities. That turned out to be easy enough: geonames.org offers a text file with all “cities” over 15000 inhabitants. It contains some 23000 entries, each with latitude, longitude and population. Cities can also be tagged as capitals of primary or secondary regions (roughly, countries or provinces).

Attempt 1: Population

Large cities have more people, so they must be more well-known, right? So as a first try, let’s take the most populous cities. Here’s the top-10 by that metric:

22315474 Shanghai
13076300 Buenos Aires
12691836 Mumbai
12294193 Mexico City
11716620 Beijing
11624219 Karachi
11174257 İstanbul
11090314 Tianjin
11071424 Guangzhou
10927986 Delhi

Not too bad, right? However, it turns out that China is full of very populous places whose names I couldn’t even pronounce. We want at least 100 cities, so let’s look a bit further down the list. Here are 90-100:

3144909 Casablanca
3129228 Zibo
3121275 Zhongshan
3120282 Durban
3093980 Changsha
3043532 Kabul
3029372 Ürümqi
3000000 Caracas
2935744 Pune
2894504 Sūrat

Now maybe it’s just me, but half of those I’ve never even heard of.

Perhaps it helps if we take into account whether a city is a capital. The problem here is: a capital of what? If we take only capitals of countries, that means only one city will be listed in all of the US, all of China and all of Russia, while Europe as a patchwork of little countries gets overrepresented. So we want to include some secondary capitals as well, but only if they’re big enough. I fiddled around with such rules a bit, until it was as good as I thought it would get.

But there was also the consideration of a uniform spread across the world, which was obviously not served by this approach. So I added a culling step based on position, according to this algorithm: while the number of cities is greater than 100, find the two cities closest to each other, and eliminate the least populous of them. This worked fairly well, but in sparsely populated areas it still tended to leave cities that were unlikely to be known.

What we really want is not population, but “well-knownness”. And what better source for this than the internet? Maybe I could get the number of Google search results for each city, or the PageRank. Unfortunately, Google Search doesn’t have an API, and screenscraping it is against the terms of service (and they will block you for it; I tried in a distant past).

But there is a free and open source of data available: Wikipedia. You can download all of Wikipedia easily via BitTorrent, although it’s a bit large. Unfortunately, the data doesn’t include the number of inbound links per page. Fortunately, somebody has preprocessed and uploaded exactly this.

So how does the list look when we consider just the number of inbound links from other Wikipedia pages? Here’s the top 10:

1237 Mandalay
986 Tungi
984 Oberasbach
968 Zvolen
918 Plovdiv
839 Monheim am Rhein
839 Marmagao
839 Brasília
781 Atwater
780 Svobodnyy

I don’t know what is going on there, but clearly there is something very wrong with this approach. And it’s exactly the same problem that made search engines in the 90s so bad: spamming links to your page would boost its popularity, regardless of the quality of those links.

Attempt 3: Wikipedia PageRank

In the late 90s, some clever folks figured out how to do better, by also taking into account the ranking of the pages that link to the page in question. This is the idea behind the PageRank algorithm that made Google successful. Could we apply that to Wikipedia?

Of course, someone already has, and made the data available. I was concerned that a top-10000 would not be enough, but my fear was ungrounded; the intersection between the GeoNames dataset and the PageRank list still contained some 400 entries. Curious about the top 10?

-3.639 London
-3.658 New York City
-3.761 Mexico
-3.814 Paris
-3.909 Los Angeles
-3.996 Chicago
-4.013 Ontario
-4.034 Hong Kong
-4.057 Rome
-4.07 Singapore

Now we’re talking! Many of these are definitely cities I would want to include in the game. All but two.

Fixing it up

You may have spotted Ontario in the list above. But isn’t Ontario a province, not a city? Well, yes and no. Turns out, there are 11 places named Ontario in the US alone. Since the Wikipedia dataset just contains page titles, without describing the type of object, one of these Ontarios) got a ride on the popularity of the Ontario, Canada page. Similarly, Mexico is not Mexico City, as you might have assumed.

It gets worse. I ran into apparently famous places named Bear, Euclid, Salt, University and – my favourite – Nokia.

And although there is only one place in the database by the name of Ontario, there are no fewer than six San Franciscos. Fortunately this case is easy: we can just take the most populous one. But for fixing up other Ontario-like cases, I had to add a lower bound of 200,000 inhabitants, which seemed to filter out most of the anomalies.

And, of course, this dataset comes from the English Wikipedia only. You may have been wondering whether that will create a balanced view of the world. Indeed it doesn’t. The list produced after filtering out smaller cities still contains a great many in the United Kingdom, like Bristol, Southampton and Leeds. To counteract this effect, I finally bit the bullet and wrote some visualization code, so I could see where the hot spots were. As expected: the UK, the US, and to a lesser degree, all of Europe. I fixed this by requiring some regions on the map to have a slightly higher PageRank to qualify.

And now I finally had a list that made me happy.

Creating levels

As I described previously, the game will consist of five difficulty levels, implemented as “rings” around the player’s location. The initial plan was to do something like:

Level 1: up to 1000 km
Level 2: 1000-2000 km
Level 3: 2000-4000 km
Level 4: 4000-8000 km
Level 5: 8000 km and beyond

However, what if you live in a remote place like Iceland? There may not even be another city within 1000 km. So I tackled this differently.

First, we remove all cities that are too close to you. There are two reasons for this. Firstly and perhaps counterintuitively, they’re too hard to hit, because the throttle control works by holding it longer for more power, and these require so little that even a light tap would be too much. Secondly, because the visualization shows the entire Earth, you wouldn’t be able to see what you’re doing anyway. (I can zoom in only slightly, because of the limited resolution of the Earth texture I’m using.)

After removing the nearby cities, the remaining ones are sorted by increasing distance, and grouped into percentiles instead of distance bands:

Level 1: 0%-20%
Level 2: 20%-40%
Level 3: 40%-60%
Level 4: 60%-80%
Level 5: 80%-100%

The upshot of this is that each level always contains the same number of cities, regardless of where you live in the world. The drawback is that the game might be a bit easier or harder, depending on where you are. I think I can live with that.