N-Body Simulation

This was written to demonstrate and test the compute toolchain/implementation I'm currently working on for my master's thesis (https://github.com/a2flo/floor). With it, the same C++17 code can be compiled to CUDA/PTX, OpenCL/SPIR/SPIR-V, Metal and Vulkan/SPIR-V, so it runs on a multitude of GPUs and CPUs on different platforms. To achieve this, I'm using a modified clang/llvm/libc++ 4.0 toolchain and a layer of host and device side code that makes it possible to address everything the same way. This demo in particular shows the use of local/shared memory buffers, local memory barriers, OpenGL buffer sharing and loop unrolling, and it demonstrates that high performance computing is indeed possible with this toolchain.

The N-body simulation is largely based on http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html with some additional optimizations.
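
As a rough illustration of the tiled all-pairs approach described in that chapter, here is a minimal CPU-only C++ sketch. It is not the code of this demo: the `Body` layout, the `nbody_step` name, the gravitational constant being folded into the masses, the squared softening term and the simple Euler-style update with damping are all assumptions made for illustration. On a GPU, each tile would be staged in local/shared memory and guarded by barriers; here a tile is just a range of the inner loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical body layout, chosen for readability only.
struct Body {
    float x, y, z;    // position
    float mass;       // mass (gravitational constant assumed folded in)
    float vx, vy, vz; // velocity
};

// Advances all bodies by one step using tiled all-pairs gravity.
// Assumes softening > 0 so that the i == j term is finite and contributes zero.
inline void nbody_step(std::vector<Body>& bodies, std::size_t tile_size,
                       float softening, float damping, float dt) {
    assert(tile_size > 0 && softening > 0.0f);
    const std::size_t n = bodies.size();
    std::vector<float> ax(n, 0.0f), ay(n, 0.0f), az(n, 0.0f);
    const float eps2 = softening * softening;

    // Outer loop over tiles: on a GPU this range would be loaded into
    // local/shared memory once and then reused by every work-item.
    for (std::size_t tile = 0; tile < n; tile += tile_size) {
        const std::size_t tile_end = std::min(tile + tile_size, n);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = tile; j < tile_end; ++j) {
                const float dx = bodies[j].x - bodies[i].x;
                const float dy = bodies[j].y - bodies[i].y;
                const float dz = bodies[j].z - bodies[i].z;
                const float dist2 = dx * dx + dy * dy + dz * dz + eps2;
                const float inv_dist = 1.0f / std::sqrt(dist2);
                const float s = bodies[j].mass * inv_dist * inv_dist * inv_dist;
                ax[i] += dx * s;
                ay[i] += dy * s;
                az[i] += dz * s;
            }
        }
    }

    // Integrate: update velocity, apply damping, then move the body.
    for (std::size_t i = 0; i < n; ++i) {
        Body& b = bodies[i];
        b.vx = (b.vx + ax[i] * dt) * damping;
        b.vy = (b.vy + ay[i] * dt) * damping;
        b.vz = (b.vz + az[i] * dt) * damping;
        b.x += b.vx * dt;
        b.y += b.vy * dt;
        b.z += b.vz * dt;
    }
}
```

Calling `nbody_step(bodies, 512, 0.01f, 0.9983f, dt)` once per frame would correspond to one iteration. In this sketch the tile size only changes how the inner loop is partitioned, not the result.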

More information on N-body simulations: https://en.wikipedia.org/wiki/N-body_simulation

Code for this demo: https://github.com/a2flo/floor_examples/tree/master/nbody

Current performance stats (in benchmark mode):
* P6000: ~8400 gflops (--count 262144 --tile-size 512)
* GP100: ~7600 gflops (--count 262144 --tile-size 512)
* GTX 970: ~2770 gflops (--count 131072 --tile-size 256)
* GTX 780: ~2350 gflops (--count 131072 --tile-size 512)
* GTX 1050 Ti: ~1675 gflops (--count 262144 --tile-size 256)
* R9 285: ~850 gflops (--count 131072 --tile-size 64)
* GTX 750: ~840 gflops (--count 65536 --tile-size 256)
* GT 650M: ~375 gflops (--count 65536 --tile-size 512)
* HD 530: ~242 gflops (--count 65536 --tile-size 128)
* HD 4600: ~235 gflops (--count 65536 --tile-size 80)
* i7-6700: ~195 gflops (--count 32768 --tile-size 1024)
* HD 4000: ~165 gflops (--count 32768 --tile-size 128)
* iPhone A10: ~131 gflops (--count 32768 --tile-size 512)
* i7-5820K: ~105 gflops (--count 32768 --tile-size 8)
* i7-4770: ~80 gflops (--count 32768 --tile-size 8)
* i7-3615QM: ~38 gflops (--count 32768 --tile-size 8)
* i7-950: ~29 gflops (--count 32768 --tile-size 4)
* iPhone A8: ~28 gflops (--count 16384 --tile-size 512)
* iPad A7: ~20 gflops (--count 16384 --tile-size 512)

Stats from this video:
* N = 131072, damping = 0.9983, softening = 0.01
* since this is an O(n^2) algorithm, this results in 131072^2 = 17179869184 body/body interactions per iteration
* the initial body setup is a hollow sphere (or on-sphere), with body velocities pointing towards the center
* with rendering and video capturing, performance is degraded a little and one iteration of this simulation took about 175ms (w/o rendering/capturing it would be ~155ms)
* with N = 65536 this runs in realtime on a GTX 780 (~38ms per iteration with rendering)
* the 1x runtime of this video is slightly above 1 hour; the video is shown at 16x speed-up, with camera rotations at 3x (to not cause that much confusion ;))
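
The numbers above can be turned into an interaction rate with a few lines of C++. This is only a sketch of the arithmetic: the function names are made up for illustration, and no conversion to flops is attempted because the number of flops counted per interaction is not stated here.

```cpp
#include <cassert>
#include <cstdint>

// Body/body interactions per iteration of an all-pairs O(n^2) simulation.
inline std::uint64_t interactions_per_iteration(std::uint64_t n) {
    return n * n;
}

// Interactions per second, given the measured time per iteration in milliseconds.
inline double interactions_per_second(std::uint64_t n, double ms_per_iteration) {
    assert(ms_per_iteration > 0.0);
    return static_cast<double>(interactions_per_iteration(n)) /
           (ms_per_iteration / 1000.0);
}
```

With N = 131072 this gives 17179869184 interactions per iteration, which works out to roughly 98 billion interactions per second at 175ms per iteration and roughly 111 billion at 155ms.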
Hüseyin Tuğrul BÜYÜKIŞIK : Cool compute video. Would you mind telling that how did you count flops in kernel code? For example, did you count rsqrt/fma as 1 flop? Did you count only compute time or did you include drawing time too? How many flops per particle-particle? I'm just trying to compare my RX550 with OpenCL nbody to this and to know if I'm calculating right.
Resbi Zhou : Amazing work! I've been doing something similar since last year......
PinkySuavo : It looks like galaxies superclusters, wtf. Aren't these simulations giving us a look on how the universe was created?
TruncateCar3 : Very cool! Question for the creator: are you taking Dark Matter and Dark Energy into account or are you just using Newtonian physics? Thanks!
Javier S. : This video deserves millions of views. Thanks for sharing.

