Fluid solver on the GPU
Work on this project has been standing still for some time while I was working on another project. But this week I picked up work where I left off: making the fluid simulation even faster. Since the SOR solver I was using lends itself well to parallelization, and video cards are good at running parallel programs, I tried to run the solver on the video card (GPU).
Because I knew I was going to need all this eventually, I wrote some nice classes to encapsulate OpenGL objects like textures, shaders and framebuffers. I then loaded the solver’s matrices into textures and wrote a GLSL fragment shader to represent the inner loop body. Because a shader cannot write to a texture that it is also reading from, I had to abandon the SOR algorithm and go back to the more simple-minded Jacobi method that I’d previously been running on the CPU. Jacobi parallelizes better, but converges more slowly. However, I expected the much faster floating-point units of the GPU to compensate for this setback.
To my surprise, the GPU version did not make much of a difference. The likely cause is the convergence check: to know whether the algorithm converged, we need to keep track of the sum of the changes over the entire matrix. I wrote these into a separate texture. We then want to compute the sum of all the values in that texture. This can be done on the GPU with a ping-pong approach, reducing the texture size by a factor 2 in each iteration, eventually ending up with a single pixel that you download to the CPU. Alternatively, we can download the entire initial texture to the CPU and compute the sum there. I found that the CPU version was actually faster than the ping-ponging.
Unfortunately, all this business with downloading stuff from the GPU slowed down the algorithm so much that the speed gain of running on the GPU was negated. Running everything on the CPU was less of a hassle and equally fast, so I reverted this change. Still, not all time was wasted: this experiment resulted in a nice clean framework that I could reuse both for calculations and for rendering.