Optimization story

The fluid simulation was beginning to produce results of decent quality. However, it was still far too slow. Most of the screenshots I’ve shown so far were taken on a 64x64 grid, which barely ran in real-time even on my fast Intel i7 machine. For a full-screen game, I’d need at least 128x128, and preferably 256x256. As I noted before, doubling the grid size requires about ten times as much computational power. Clearly, some optimizations were in order.

The main problem with the type of solver that I’m using is the CFL condition. It roughly dictates that fluid must not flow faster than one grid cell per time step. This means that the time step is bounded by the velocities we expect: if I want fluid to traverse the entire 256x256 grid in a second (which is not very fast), I need time steps no larger than 1/256 of a second!
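
In formula form, with grid spacing Δx and maximum fluid speed u_max (taking a unit-sized domain for concreteness, so the fluid crosses all 256 cells in one second):

Δt ≤ Δx / u_max = (1 / 256) / 1 = 1/256 s ≈ 0.004 s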

Worse, if this “speed limit” is broken, the simulation explodes, which is quite unacceptable in a game. So my first step was to build a speed limiter: if velocities approach the maximum speed, they are gently nudged down, so that the maximum is never actually reached. This works in the sense that it prevents explosions, but the fluid still breaks up into pieces. I figured it was better than nothing; keeping velocities low would have to be part of the level design.
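
In sketch form, such a limiter might look like this (d_u and d_v stand in for the velocity matrices, and the exact falloff curve is an illustration of the idea, not the game’s actual code):

#include <cmath>

// Smoothly limit a speed to stay strictly below max_speed: identity below
// 90% of the limit, then asymptotically approach (but never reach)
// max_speed. Continuous, with slope 1 at the threshold, so there is no
// visible kink where the limiter cuts in.
float soft_limit(float speed, float max_speed) {
	float threshold = 0.9f * max_speed;
	if (speed <= threshold) return speed;
	float excess = speed - threshold;
	float room = max_speed - threshold;
	return threshold + room * excess / (excess + room);
}

// Applied per cell:
for (size_t y = 0; y < d_ny; ++y) {
	for (size_t x = 0; x < d_nx; ++x) {
		float u = d_u(x, y), v = d_v(x, y);
		float speed = std::sqrt(u * u + v * v);
		if (speed > 0.0f) {
			float scale = soft_limit(speed, max_speed) / speed;
			d_u(x, y) = u * scale;
			d_v(x, y) = v * scale;
		}
	}
}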

At the start of my work, the simulation ran at 59 frames per second (fps) for a 128x128 grid. The simulation time step of 0.001 second required that I somehow crank this up to about 1000 fps. I had some work to do…

The bottleneck

I started by using a profiler (gprof, though now I know better and use callgrind) to determine the performance bottlenecks. Unsurprisingly, the main slowdown was the pressure solver, which accounted for about 80% of the running time. The pressure solver computes pressures that result from the fluid motion; these pressures are then applied to the fluid motion to keep the fluid from compressing or expanding. This is an essential step and cannot be circumvented, so it needs to be optimized.

The solver I began with was a pretty simple-minded iterative Gauss-Seidel solver. Essentially, it races through the matrix one cell at a time, adjusts the cell in the right direction based on the neighbouring cells, and repeats this until the changes become sufficiently small. The code I started with looked as follows:

// One Gauss-Seidel sweep over the interior cells: d_b is the right-hand
// side, d_w/d_e/d_s/d_n hold the stencil coefficients towards each
// neighbour, and d_x is the solution being refined in place.
float delta = 0.0f;
for (size_t y = 1; y < d_ny - 1; ++y) {
	for (size_t x = 1; x < d_nx - 1; ++x) {
		float diff = d_b(x, y)
			- d_x(x, y)
			- d_w(x, y) * d_x(x - 1, y) - d_e(x, y) * d_x(x + 1, y)
			- d_s(x, y) * d_x(x, y - 1) - d_n(x, y) * d_x(x, y + 1);
		d_x(x, y) += diff;
		delta += diff * diff; // accumulated squared change, for convergence
	}
}

This was repeated until delta was sufficiently small.
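
Schematically, the outer loop around this sweep looked like the following (the tolerance and iteration cap are illustrative values, and gauss_seidel_sweep is just a made-up name for the double loop above):

float const tolerance = 1e-5f;
int const max_iterations = 100;
for (int i = 0; i < max_iterations; ++i) {
	float delta = gauss_seidel_sweep(); // one sweep, as shown above
	if (delta < tolerance)
		break; // converged
}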

How could this be improved? The computations are all fairly trivial, but they access a lot of data. My hunch was that the bottleneck was memory bandwidth. However, some quick calculations showed that all the data involved fits easily into the CPU cache, so no relatively slow round-trips to main memory were required. And even if the hunch had been true, there would have been little I could do about it: all the memory accesses were needed.

The small stuff

So I turned to micro-optimizations. The first step was to get rid of expressions like d_x(x, y), which access individual matrix elements: computing the address from the x and y indices requires a multiplication and an addition. Instead, I used linear indices that are simply incremented as the loops advance. My code (with the delta computation removed) now looked like this:

// Linear indices of cell (1, 1) and its four neighbours.
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
	for (size_t x = 1; x < d_nx - 1; ++x) {
		d_x[ic] = d_b[ic]
				- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
				- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
		++ic; ++iw; ++ie; ++is; ++in;
	}
	// Skip the two boundary cells between the end of this row and the
	// start of the next.
	ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}

This got me up to 70 fps, but I was running out of options. So I asked on StackOverflow. That question quickly became my most popular question ever, and it resulted in many great answers.

One pretty straightforward improvement was to replace the array indexing with pointers straight into the matrix memory. This brought me up to 76 fps:

// Pointers to cell (1, 1) (and its neighbours) in each matrix.
float const
	*b = &d_b(1, 1),
	*w = &d_w(1, 1), *e = &d_e(1, 1), *s = &d_s(1, 1), *n = &d_n(1, 1),
	*xw = &d_x(0, 1), *xe = &d_x(2, 1), *xs = &d_x(1, 0), *xn = &d_x(1, 2);
float *xc = &d_x(1, 1);
for (size_t y = 1; y < d_ny - 1; ++y) {
	for (size_t x = 1; x < d_nx - 1; ++x) {
		*xc = *b
			- *w * *xw - *e * *xe
			- *s * *xs - *n * *xn;
		++b;
		++w; ++e; ++s; ++n;
		++xw; ++xe; ++xs; ++xn;
		++xc;
	}
	b += 2;
	w += 2; e += 2; s += 2; n += 2;
	xw += 2; xe += 2; xs += 2; xn += 2;
	xc += 2;
}

Pipelining

Another interesting improvement was suggested by a friend of mine, who has more experience with this kind of stuff. Since we traverse the cells in “reading order”, and read from neighbouring cells, there is a pipelining issue here. Look at this statement:

		*xc = *b
			- *w * *xw - *e * *xe
			- *s * *xs - *n * *xn;

Now, the pointer xw always points to the element just before the one that xc is writing to. So we are reading a value that has just been written. If that write is still in the pipeline, the read has to be stalled until it is processed. A simple rearrangement of the terms in this expression brought me from 76 fps to 91 fps!

		*xc = *b
			- *e * *xe
			- *s * *xs - *n * *xn
			- *w * *xw; /* has just been written to:
			               process last for faster pipelining */

OpenMP

But wait, I have a multicore machine! Many people do nowadays. My Intel i7 CPU has 4 cores, and with hyperthreading it presents itself as 8. Why not use some more of that power? So I turned to OpenMP. I was pleasantly surprised: I could parallelize this code by compiling with the gcc flag -fopenmp and adding one line before the first for loop:

#pragma omp parallel for num_threads(4)

However, there is a problem here. Because the Gauss-Seidel method both reads from and writes to the same matrix, the different threads get in each other’s way. The simpler Jacobi method does not have this problem: it reads from one matrix and writes to another. However, Jacobi converges only half as fast. But on four cores, it should still come out ahead, right? Unfortunately not. I don’t know why, but this parallelization resulted in a net gain of practically zero.
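
For reference, the Jacobi sweep that I parallelized looked roughly like this (d_x_new is my name here for the second matrix; std::swap comes from <utility>):

#pragma omp parallel for num_threads(4)
for (int y = 1; y < (int)d_ny - 1; ++y) {
	for (int x = 1; x < (int)d_nx - 1; ++x) {
		// All reads go to the old matrix d_x, so the threads never see
		// each other's writes.
		d_x_new(x, y) = d_b(x, y)
			- d_w(x, y) * d_x(x - 1, y) - d_e(x, y) * d_x(x + 1, y)
			- d_s(x, y) * d_x(x, y - 1) - d_n(x, y) * d_x(x, y + 1);
	}
}
std::swap(d_x, d_x_new); // old matrix becomes scratch space for the next sweep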

Work smarter, not harder

Another suggestion was to use so-called over-relaxation, turning the solver into a Successive Over-Relaxation (SOR) method. The idea is that you exaggerate each adjustment by a factor ω, for example 1.4, to achieve faster convergence. It is like walking close to a cliff: if you exaggerate too much, the solver blows up, and the optimal value of ω depends on the matrix, which changes every time step. But conservative values turned out to give faster convergence without explosions, which was a big win: I could now get away with 10 iterations where I previously needed 50. This brought the framerate up to 97 fps.
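
In terms of the first code listing, the change is a single line (ω written out as omega; 1.0f would give plain Gauss-Seidel back):

float const omega = 1.4f; // over-relaxation factor
// ... same loops as before ...
		float diff = d_b(x, y)
			- d_x(x, y)
			- d_w(x, y) * d_x(x - 1, y) - d_e(x, y) * d_x(x + 1, y)
			- d_s(x, y) * d_x(x, y - 1) - d_n(x, y) * d_x(x, y + 1);
		d_x(x, y) += omega * diff; // exaggerate the adjustment
		delta += diff * diff;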

Pointer aliasing

There’s another issue that comes up when you work through pointers. Since the compiler does not know whether two pointers point to different memory, it has to assume that a write through one pointer may become visible through another. This is called aliasing, and it prevents the compiler from caching values in registers.

Both gcc and Visual C++ support the __restrict keyword to prevent this. Adding the __restrict modifier to a pointer declaration tells the compiler: “Hey, nobody else is going to write to the memory I’m pointing to, optimize!” Since I knew that all my pointers point to different locations in the matrices, I slapped __restrict on some pointer declarations. This gave me an immediate speed boost to 117 fps.
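
Applied to the earlier pointer version, the declarations become something like this (only the coefficient pointers are shown; the xw/xe/xs/xn pointers all walk the same matrix as xc, so they genuinely alias each other and need more care):

// These matrices are only ever read, and none of them overlaps the
// solution matrix, so promising the compiler "no aliasing" is safe here.
float const *__restrict b = &d_b(1, 1);
float const *__restrict w = &d_w(1, 1), *__restrict e = &d_e(1, 1);
float const *__restrict s = &d_s(1, 1), *__restrict n = &d_n(1, 1);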

Denormalized values

For my speed measurements, I had temporarily disabled the convergence check of the solver: it now always ran for a fixed number of iterations. This, I figured, should give me a steady framerate that did not depend on the particular state of the simulated fluid. Imagine my surprise when I noticed that the simulation started at 117 fps, but slowly dropped over time. Intrigued, I let it run for a while. After some time, the framerate was no higher than 10 fps…

I was at a loss, so I turned to StackOverflow once more. As it turned out, floating-point numbers can become “denormalized”: when a value gets too close to 0 to be represented in the normal floating-point format, the hardware switches to a special reduced-precision representation instead of rounding straight to 0 (so-called gradual underflow). These denormalized values are orders of magnitude slower to process.

Since my solver was doing so many iterations, it would continue long after having converged to within floating-point accuracy, causing more and more of these almost-zero values to creep into the computations.

Luckily, there is an instruction to change the CPU’s behaviour: instead of storing denormalized values, these can simply be flushed to 0. Unfortunately, there is no standard library function for this. On Visual C++, we can do this:

_controlfp(_DN_FLUSH, _MCW_DN); // new value first, then the mask

On gcc, we need some inline assembly. This was my first x86 assembly ever:

int mxcsr;
__asm__("stmxcsr %0" : "=m"(mxcsr) : :);
mxcsr |= (1 << 15); // set bit 15: flush-to-zero mode
__asm__("ldmxcsr %0" : : "m"(mxcsr) :);

Although this did not increase the initial framerate, it now remained nicely constant over time.
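
For completeness: the same bit can also be set without inline assembly, through the SSE intrinsics header that both gcc and Visual C++ ship:

#include <xmmintrin.h>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); // sets MXCSR bit 15, as above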

With some more small tuning here and there, I brought the framerate up another notch, to 123 fps.

More pipelining

Reading up on matrix solvers, I stumbled upon another variant of the SOR solver I was using. I had been running through the matrix row by row, updating values as I went. But it is also possible to update in a checkerboard pattern: first run through all the white squares, then through all the black ones. This completely removes the write/read dependency I ran into earlier. With this so-called red/black SOR scheme in place, the framerate increased to 134 fps.
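
In code, the sweep splits into two passes over interleaved cells; within a pass, every neighbour a cell reads belongs to the other colour, so nothing that is read has just been written. A sketch (using omega from the SOR step):

for (int pass = 0; pass < 2; ++pass) {
	for (size_t y = 1; y < d_ny - 1; ++y) {
		// One pass handles the cells where (x + y) is odd, the other
		// pass the even ones.
		for (size_t x = 1 + (y + pass) % 2; x < d_nx - 1; x += 2) {
			float diff = d_b(x, y)
				- d_x(x, y)
				- d_w(x, y) * d_x(x - 1, y) - d_e(x, y) * d_x(x + 1, y)
				- d_s(x, y) * d_x(x, y - 1) - d_n(x, y) * d_x(x, y + 1);
			d_x(x, y) += omega * diff;
		}
	}
}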

Convergence checking

I re-added the check whether the solver had converged, aborting the procedure early if the change was sufficiently small. I knew that this would give a big speed boost, but since I had removed the check myself at the beginning, it doesn’t really count.

I also removed some superfluous rendering. At the 59 fps I started with, rendering every frame is not a problem. However, rendering at more than 60 fps is useless, since the average TFT monitor cannot refresh any faster (not to mention the human eye).

After these obvious improvements, the final result ran at around 450 fps.

Conclusion

“Work smarter, not harder” is the first rule of optimization, and this project drove it home once again. Even though pathological things like denormalized floats can make a huge difference, those are corner cases. The real speed boost comes from better algorithms. If you have less to do, it matters less how fast you do it!