Performance measurements

After a week partly filled with lots of food and other Christmas celebrations, I’m back on track. I’ve done some polishing to make the game more game-like.

There’s now a neat countdown at the start of the game. You can start racing before the countdown ends, but there’s a little door in the way, like at the entrance of a real supermarket. If you start too early, you’ll slam into it and lose your momentum. Start too late, and you’ll lose precious time.

Each opponent now has a label showing the amount of cash they’ve collected. If the opponent is off the screen, an arrow points you to him, again with the cash label. This should help you target your efforts, and weapons.

If you finish while some of your opponents are still racing, the text “Wait …” appears in the centre of the screen. At that point, I want to make the camera follow random opponents that haven’t finished. At the end of the game, the text “Race over” appears for a few moments, before you go to the results screen.

You will see all this next time I make a movie, but for now, I’d like to talk about something else: performance.

Current framerates are between 30 fps and 40 fps. More than enough to be playable, but keep in mind that the Nexus One is one of the faster phones out there. I figured I need 60 fps to still be playable on slower devices; that comes down to about 17 ms (milliseconds) per frame.

The Android SDK comes with a neat profiling tool: TraceView. In the code, you simply write Debug.startMethodTracing(). This starts writing a file to the SD card, which you can copy off to the PC for offline analysis. Here’s what the timeline for the game looks like:

TraceView timeline

You can see there are four threads, one of which is active at a given moment. All the work is currently done on the GLThread. In between, you see some context switching to other threads for event handling. This happens about once every 20 ms, and takes a whopping 2 ms. This is a major pain in the ass for Android game developers in general, and fortunately Google listened: Android 2.3 (Gingerbread) has supposedly improved this. Not that anyone is running Gingerbread yet, though … heck, my phone came straight from Google and even I am still on 2.2.

The green blocks are buffer swaps. These are probably synchronized to the screen’s refresh rate, so there’s not much I can do about them.

The purple blocks next to those are calls to the native code of the physics engine. These take 2 to 5 ms each frame. Nothing to worry about, look how small the blocks are! But wait … tracing slows down the code to about 6 fps! These blocks would look a lot bigger if the game had been running at full speed, because native code is not slowed down. It seems we’re getting a skewed picture here.

This is why I decided to put in some instrumentation of my own. Modelled after a thing I’d previously written in C++, I built a nice class that allows me to time a section of code like this:


Besides a string, it also accepts anything with a toString method. And it works with nested sections, showing and sorting these correctly. All with no noticeable impact on the game’s framerate: most processing is done at the end, while writing to the file. The output looks something like this:

Timing of 1293 frames, average 33.3 f/s, 30.01 ms/f
 1293 c    1 c/f  33971.2 ms  26.27 ms/f  26.27  ms/c  onDrawFrame
 1293 c    1 c/f  27843.2 ms  21.53 ms/f  21.53  ms/c    MasterRenderPhase
 1292 c    0 c/f  23679.5 ms  18.31 ms/f  18.33  ms/c      WorldRenderPhase
 1292 c    0 c/f  18701.3 ms  14.46 ms/f  14.47  ms/c        NormalRenderPhase
 1292 c    0 c/f  16758.2 ms  12.96 ms/f  12.97  ms/c          renderAll
16960 c   13 c/f   3893.1 ms   3.01 ms/f   0.23  ms/c            c_shelves
17468 c   13 c/f   3570.3 ms   2.76 ms/f   0.20  ms/c            tile4
 8516 c    6 c/f   3058.7 ms   2.37 ms/f   0.36  ms/c            can
 1292 c    0 c/f   1005.8 ms   0.78 ms/f   0.78  ms/c          cull
 1292 c    0 c/f    899.1 ms   0.70 ms/f   0.70  ms/c          sort
 1292 c    0 c/f   2023.8 ms   1.57 ms/f   1.57  ms/c        OverlayRenderPhase
 1292 c    0 c/f   1662.9 ms   1.29 ms/f   1.29  ms/c          renderAll
 3876 c    2 c/f    340.0 ms   0.26 ms/f   0.09  ms/c            arrow_opponent
 1292 c    0 c/f   1805.9 ms   1.40 ms/f   1.40  ms/c        BlendingRenderPhase
 1293 c    1 c/f   3525.0 ms   2.73 ms/f   2.73  ms/c    physics
 1293 c    1 c/f   2147.6 ms   1.66 ms/f   1.66  ms/c    timeSteps

Interesting! So 22 ms of the 30 ms frame time is eaten up by rendering; actual simulation takes about 4 ms, and the rest is probably event handling overhead. Nothing new so far. But the timings for individual game models, shown below renderAll, are more informational.

OpenGL draw calls can either be vertex-bound or fill-bound. If they are vertex-bound, the bottleneck is in processing vertices; if fill-bound, the bottleneck is in colouring the actual pixels. Android devices are generally heavily fill-bound. However, the actual processing doesn’t have to happen while glDrawArrays (or similar) is executing; there is usually a pipeline that messes up the measurements.

To interpret the values above, we need some knowledge about these models:

  • c_shelves are the brown boxes, consisting of 10 triangles, with a visible area of 2 square metres.
  • tile4 are floor tiles, 2 triangles each, with visible area of 16 square metres.
  • can are the little tin cans, 48 triangles (yikes), with only a tiny visible area.

It would seem that, the more vertices, the longer the call takes. Let’s make a quick’n’dirty plot out of this:

Rendering time as a function of triangle count

Does that look like a straight line, or what? We can deduce a few things from this:

  • The overhead for a call to glDrawArrays is about 0.18 ms, so we’d do well to batch our geometry and send it off in a single call. Especially static geometry (floors, walls) would benefit from this. Since for simple models, the overhead is more than 80%, this should give a huge speedup.
  • The visible area hardly matters at all for the time it takes glDrawArrays to complete. This particular call appears vertex-bound. However, if the time is not spent there, where is it? Some of it might be in the buffer swap, but that takes no more than 3 ms for the entire frame. Probably the GPU goes off on its own to fill the pixels in the background, which is good news.
  • See that anomalous little mark down at 0.1 ms? That’s the arrow_opponent model (2 triangles), which is in a different render phase. That means it’s drawn with lighting disabled, which saves a few cycles per vertex on lighting calculations.

Interesting stuff. Now, if you’ll excuse me, I have some geometry to batch.