In this post I test several compiler options and switches that could make the particle system run faster.
Read on to see how I got a performance improvement of around 20%!
The Series
- Initial Particle Demo
- Introduction
- Particle Container 1 - problems
- Particle Container 2 - implementation
- Generators & Emitters
- Updaters
- Renderer
- Introduction to Software Optimization
- Tools Optimizations (this post)
- Code Optimizations
- Renderer Optimizations
- Summary
Plan
- Start
- Floating-point semantics mode
- Enable enhanced instruction set
- Additional switches
- Something missing
- Bonus: GCC (mingw)
- Wrap up and What's next
Start
We are starting with those numbers (Core i5 Sandy Bridge):
count | tunnel | attractors | fountain |
---|---|---|---|
151000 | 229.5 | 576.25 | 451.625 |
161000 | 465.813 | 727.906 | 541.453 |
171000 | 527.227 | 790.113 | 582.057 |
181000 | 563.028 | 835.014 | 617.507 |
191000 | 596.754 | 886.877 | 653.938 |
Core i5 Ivy Bridge:
count | tunnel | attractors | fountain |
---|---|---|---|
151000 | 283.5 | 646.75 | 527.375 |
161000 | 555.688 | 812.344 | 629.172 |
171000 | 628.586 | 879.293 | 671.146 |
181000 | 670.073 | 932.537 | 710.768 |
191000 | 709.384 | 982.192 | 752.596 |
(time in milliseconds)
The above results come from running 200 'frames' of the particle system's update method. No rendering, only CPU work. count means the number of particles in a given system. You can read more about this benchmark in the previous post.
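The measurement described above could be sketched roughly as follows. This is only an illustration, not the actual benchmark code: `DummySystem` and `timeUpdatesMs` are hypothetical names, and the dummy update is a stand-in for the real particle system.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real particle system: one float per particle.
struct DummySystem {
    std::vector<float> pos;
    std::vector<float> vel;

    explicit DummySystem(std::size_t count) : pos(count, 0.0f), vel(count, 1.0f) {}

    // One simulation step: simple Euler integration, CPU work only.
    void update(float dt) {
        for (std::size_t i = 0; i < pos.size(); ++i)
            pos[i] += vel[i] * dt;
    }
};

// Runs `frames` updates and returns the elapsed wall-clock time in milliseconds.
template <typename System>
double timeUpdatesMs(System& system, int frames, float dt) {
    assert(frames > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int frame = 0; frame < frames; ++frame)
        system.update(dt);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as `timeUpdatesMs(system, 200, 1.0f / 60.0f)` on a system with 191000 particles would correspond to one cell of the tables above.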
And the Visual Studio configuration:
- Optimization: /O2
- Inline Function Expansion: Default
- Favor Size or Speed: Neither
- Whole program optimization: Yes
- Enable enhanced instruction set: not set
- Floating point model: /fp:precise (default)
Of course, we want to make the above results faster. I am also curious which Visual Studio compiler options offer potential performance improvements.
Floating-point semantics mode
By default Visual Studio uses the /fp:precise floating-point semantics mode. It produces reasonably fast code with safe and accurate results. All calculations are done in the highest available precision. The compiler can reorder instructions, but only when it does not change the final value.
In particle system simulation we do not need so much precision. This is not a complex and accurate physics simulation, so we could trade precision for performance. We use floats only and small errors usually won't be visible.
With /fp:fast the compiler relaxes its rules so that it can apply more optimizations automatically. Computations are usually performed in lower precision, so we do not lose time converting to and from 80-bit precision. Additionally, the compiler can reorder instructions, even if that changes the final result a bit.
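Why can reordering change the result at all? Float addition is not associative. Here is a minimal sketch, assuming a strict floating-point mode and a target that evaluates float arithmetic in single precision (for example x64 with SSE). The function names are made up for illustration.

```cpp
#include <cassert>

// Evaluates (a + b) + c, rounding to float after each addition.
float sumLeftToRight(float a, float b, float c) {
    const float ab = a + b;
    return ab + c;
}

// Evaluates a + (b + c): same terms, different grouping.
float sumRightToLeft(float a, float b, float c) {
    const float bc = b + c;
    return a + bc;
}

// Small self-check: the two groupings disagree for these inputs
// (assumes single-precision evaluation without fast-math).
void checkGroupingMatters() {
    assert(sumLeftToRight(1.0e8f, -1.0e8f, 1.0f) !=
           sumRightToLeft(1.0e8f, -1.0e8f, 1.0f));
}
```

With a = 1.0e8f, b = -1.0e8f and c = 1.0f, the first grouping gives 1.0f and the second gives 0.0f, because the 1.0f is lost when it is added to -1.0e8f. Under /fp:precise the compiler has to keep the grouping as written; under /fp:fast it is free to pick either one.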
By switching from /fp:precise to /fp:fast I got the following results:
Core i5 Sandy Bridge
count | tunnel | attractors | fountain |
---|---|---|---|
171000 | 497.953 | 700.477 | 535.738 |
181000 | 533.369 | 744.185 | 569.092 |
191000 | 565.046 | 787.023 | 601.512 |
Core i5 Ivy Bridge
count | tunnel | attractors | fountain |
---|---|---|---|
171000 | 597.242 | 823.121 | 635.061 |
181000 | 635.53 | 872.765 | 675.883 |
191000 | 674.441 | 924.721 | 713.86 |
So the improvement ranges from around 5% (tunnel) to around 11% (attractors on Sandy Bridge).
Enable enhanced instruction set
Since SIMD instructions have been available for quite a long time, it would be wise to use them as well. According to the wiki:
- SSE2 appeared in Pentium 4 - 2001 or in AMD's Athlon 64 - 2003
- SSE4 appeared in Intel Core microarchitecture - 2006 or in AMD's K10 - 2007
- AVX has been available since Sandy Bridge (2011) or AMD's Bulldozer (2011)
Unfortunately, in my case adding /arch:SSE2 does not make a difference. It appe
But when I used /arch:AVX the timings were a bit better:
Core i5 Sandy Bridge
count | tunnel | attractors | fountain |
---|---|---|---|
171000 | 429.195 | 608.598 | 460.299 |
181000 | 460.649 | 647.825 | 490.412 |
191000 | 489.206 | 688.603 | 520.302 |
Core i5 Ivy Bridge
count | tunnel | attractors | fountain |
---|---|---|---|
171000 | 529.188 | 746.594 | 570.297 |
181000 | 565.648 | 792.824 | 605.912 |
191000 | 593.956 | 832.478 | 640.739 |
This time the improvement over the baseline is around 20% on Sandy Bridge and around 15% on Ivy Bridge. Of course, /fp:fast is also enabled.
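BTW: When I used /arch:AVX2 the application crashed :) That is expected: AVX2 first appeared in Haswell, so neither Sandy Bridge nor Ivy Bridge can execute those instructions.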
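A crash like this can be avoided by checking at startup what the CPU supports. Below is a rough sketch, not code from the demo: `SimdLevel`, `detectSimdLevel`, `canRun` and `requireSimd` are hypothetical names. It relies on the `__builtin_cpu_supports` builtin of GCC/Clang on x86; on other compilers (including MSVC, where `__cpuidex` from `<intrin.h>` would be needed instead) it conservatively reports no SIMD support.

```cpp
#include <cassert>

// The CPU-detection builtins exist only in GCC/Clang builds for x86 targets.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define HAS_CPU_BUILTINS 1
#else
#define HAS_CPU_BUILTINS 0
#endif

// Ordered so that a higher value implies all lower ones.
enum class SimdLevel { Scalar = 0, Avx = 1, Avx2 = 2 };

// Returns the highest level the running CPU reports.
// Falls back to Scalar when the builtins are unavailable (assumption: be conservative).
SimdLevel detectSimdLevel() {
#if HAS_CPU_BUILTINS
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return SimdLevel::Avx2;
    if (__builtin_cpu_supports("avx"))  return SimdLevel::Avx;
#endif
    return SimdLevel::Scalar;
}

// True when the running CPU offers at least the required level.
bool canRun(SimdLevel required) {
    return static_cast<int>(detectSimdLevel()) >= static_cast<int>(required);
}

// Debug-build guard; a real program would report an error instead of asserting.
void requireSimd(SimdLevel required) {
    assert(canRun(required) && "CPU lacks the instruction set this build needs");
}
```

Calling `requireSimd(SimdLevel::Avx2)` at the top of `main` in an AVX2 build would fail early with a readable message on a Sandy Bridge or Ivy Bridge machine, rather than crashing somewhere in the update loop.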
Additional switches
I've tried other compiler switches as well: Inline Function Expansion, Favor Size or Speed, Whole program optimization. Unfortunately, they made almost no difference in terms of performance.
Something missing?
Hmm… but what about auto vectorization and auto parallelization? Maybe they could help? Why not use those powerful features as well? In fact, it would be better to rely on the compiler to do most of the job instead of manually rewriting the code.
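In Visual Studio (since VS 2012) there are two important features here: auto-vectorization and auto-parallelization (/Qpar). As the names suggest, they should automatically use vector instructions and distribute work among the other cores. Note that only the parallelizer has an enabling switch; the auto-vectorizer has no separate switch of its own and is already active at /O2.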
I do not have much experience with those features, but in my case they simply did not help and I got no performance improvement.
To know what is going on with the 'auto' features you have to use the additional /Qvec-report and /Qpar-report compiler options. The compiler will then show which loops were vectorized or parallelized, and where it had problems. On MSDN there is a whole page that describes all the possible issues that can block the 'auto' features.
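For reference, here is the kind of loop an auto-vectorizer usually handles well. It is a hypothetical updater written for illustration, not the demo's actual code, and it assumes positions and velocities are stored as plain, separate float arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical updater: one Euler step over separate float arrays.
void eulerUpdate(float* pos, const float* vel, std::size_t count, float dt) {
    // Unit-stride access, no branches, and no dependency between iterations:
    // the shape of loop auto-vectorizers are designed for.
    for (std::size_t i = 0; i < count; ++i)
        pos[i] += vel[i] * dt;
}

// Convenience wrapper over std::vector; both arrays must have the same length.
void eulerUpdate(std::vector<float>& pos, const std::vector<float>& vel, float dt) {
    assert(pos.size() == vel.size());
    eulerUpdate(pos.data(), vel.data(), pos.size(), dt);
}
```

Building such a file with /O2 /Qvec-report:2 should make the compiler print, for each loop, either that it was vectorized or a reason code explaining why not. Whether this particular loop is actually vectorized depends on the compiler version and on what it can prove about pointer aliasing.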
I definitely need to look closer at those powerful 'auto' features and figure out how to use them properly.
BTW: What is the difference between auto vectorization and enable enhanced instruction set options?
Bonus: GCC (mingw) results
Although compiling the full particle demo (with graphics) with a different compiler would be quite problematic, there is no such problem with 'cpuTest'. This benchmark is only a simple console application, so I've managed to rebuild it using GCC (MinGW version). Here are the results:
32bit, Ivy Bridge
GCC 4.8.1, -march=native -mavx -Ofast -m32 -std=c++11 -ffast-math
count | tunnel | attractors | fountain |
---|---|---|---|
151000 | 230.000 | 508.000 | 415.000 |
161000 | 439.500 | 646.750 | 494.375 |
171000 | 493.688 | 694.344 | 531.672 |
181000 | 534.336 | 748.168 | 568.584 |
191000 | 565.792 | 798.396 | 613.198 |
64bit, Ivy Bridge
-march=native -mavx -Ofast -m64 -std=c++11 -ffast-math
count | tunnel | attractors | fountain |
---|---|---|---|
151000 | 251.000 | 499.500 | 406.750 |
161000 | 459.875 | 622.438 | 473.719 |
171000 | 505.359 | 672.180 | 510.590 |
181000 | 539.795 | 714.397 | 546.199 |
191000 | 576.099 | 764.050 | 579.525 |
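It seems that the GCC optimizer does a better job than Visual Studio: for attractors at 191000 particles on Ivy Bridge, the 64-bit GCC build takes 764.050 ms vs 832.478 ms, which is around 8% faster!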
Wrap up & What's Next
This was quite fast: I tested several Visual Studio compiler switches, and it turned out that only the floating-point mode and the enhanced instruction set options improved performance in a visible way.
Final results:
CPU | count | tunnel | attractors | fountain |
---|---|---|---|---|
Sandy | 191000 | 489.206 (-18.02%) | 688.603 (-22.36%) | 520.302 (-20.44%) |
Ivy | 191000 | 593.956 (-16.27%) | 832.478 (-15.24%) | 640.739 (-14.86%) |
In the end there is a speed-up of around 20% for Sandy Bridge and around 15% for Ivy Bridge. This is definitely not a huge factor, but still quite nice. It took only several clicks of the mouse! ;)
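The percentages in the table are the relative change against the baseline timings from the Start section. A tiny helper shows the arithmetic; the name `changePercent` is made up for this example.

```cpp
#include <cassert>

// Relative change of `optimizedMs` against `baselineMs`, in percent.
// Negative values mean the optimized build is faster.
double changePercent(double baselineMs, double optimizedMs) {
    assert(baselineMs > 0.0);
    return (optimizedMs - baselineMs) / baselineMs * 100.0;
}
```

For the Sandy Bridge tunnel column at 191000 particles, `changePercent(596.754, 489.206)` gives about -18.02, which matches the table.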
Question: Do you know other useful Visual Studio/GCC compiler options that could help in this case?
Next time, I will try to show how to further improve the performance by using SIMD instructions. By rewriting some of the critical code parts we can utilize even more of the CPU power.
Want to Help and Test?
Just as an experiment, it would be nice to compile the code with gcc or clang and compare the results, or to run it on a different CPU. If you want to help, the repository is on GitHub, and if you have timings please let me know.
The easiest way is to download exe files (should be virus-free, but please double check!) and save the results to a txt file.