Channel: Bartek's coding blog

Flexible particle system - Tools optimization


In this post I will test several compiler options and switches that could make the particle system run faster.

Read on to see how I reached a performance improvement of around 20%!

The Series

Plan

Start


We start with these numbers (Core i5 Sandy Bridge):

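| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 151000 | 229.5   | 576.25     | 451.625  |
| 161000 | 465.813 | 727.906    | 541.453  |
| 171000 | 527.227 | 790.113    | 582.057  |
| 181000 | 563.028 | 835.014    | 617.507  |
| 191000 | 596.754 | 886.877    | 653.938  |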

Core i5 Ivy Bridge:

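| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 151000 | 283.5   | 646.75     | 527.375  |
| 161000 | 555.688 | 812.344    | 629.172  |
| 171000 | 628.586 | 879.293    | 671.146  |
| 181000 | 670.073 | 932.537    | 710.768  |
| 191000 | 709.384 | 982.192    | 752.596  |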

(time in milliseconds)

The above results come from running 200 'frames' of the particle system's update method. There is no rendering, only CPU work. count is the number of particles in a given system. You can read more about this benchmark in the previous post.

And the Visual Studio configuration:

  • Optimization: /O2
  • Inline Function Expansion: Default
  • Favor Size or Speed: Neither
  • Whole program optimization: Yes
  • Enable enhanced instruction set: not set
  • Floating point model: /fp:precise (default)

Of course, we want to make the above results faster. I also wonder which Visual Studio compiler options offer potential performance improvements.

Floating-point semantics mode

By default, Visual Studio uses the /fp:precise floating-point semantics mode. It produces reasonably fast code while keeping results safe and accurate. All calculations are done in the highest available precision. The compiler can reorder instructions, but only when that does not change the final value.

In particle system simulation we do not need so much precision. This is not a complex and accurate physics simulation, so we could trade precision for performance. We use floats only and small errors usually won't be visible.

With /fp:fast the compiler relaxes its rules so that more optimizations can be applied automatically. Computations are usually performed in lower precision, so no time is lost converting to and from 80-bit precision. Additionally, the compiler can reorder instructions, even if that changes the final result a bit.

By switching from /fp:precise to /fp:fast I got the following results:

Core i5 Sandy Bridge

| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 497.953 | 700.477    | 535.738  |
| 181000 | 533.369 | 744.185    | 569.092  |
| 191000 | 565.046 | 787.023    | 601.512  |

Core i5 Ivy Bridge

| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 597.242 | 823.121    | 635.061  |
| 181000 | 635.53  | 872.765    | 675.883  |
| 191000 | 674.441 | 924.721    | 713.86   |

So the improvement is around 5%, or even 11% in the best case (attractors on Sandy Bridge).

Enable enhanced instruction set

Since SIMD instructions have been available for quite a long time, it would be wise to use those options as well. According to wiki:

  • SSE2 appeared in Pentium 4 - 2001 or in AMD's Athlon 64 - 2003
  • SSE4 appeared in Intel Core microarchitecture - 2006 or in AMD's K10 - 2007
  • AVX has been available since Sandy Bridge (2011) or AMD's Bulldozer (2011)

Unfortunately, in my case adding /arch:SSE2 does not make a difference. It appe

But when I used /arch:AVX, the timings were a bit better:

Core i5 Sandy Bridge

| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 429.195 | 608.598    | 460.299  |
| 181000 | 460.649 | 647.825    | 490.412  |
| 191000 | 489.206 | 688.603    | 520.302  |

Core i5 Ivy Bridge

| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 171000 | 529.188 | 746.594    | 570.297  |
| 181000 | 565.648 | 792.824    | 605.912  |
| 191000 | 593.956 | 832.478    | 640.739  |

This time the improvement is around 20% on Sandy Bridge and around 15% on Ivy Bridge, measured against the starting numbers. Of course, /fp:fast is also enabled.

BTW: When I used /arch:AVX2 the application crashed :)

Additional switches

I've tried using other compiler switches: Inline Function Expansion, Favor Size or Speed, Whole program optimization. Unfortunately, I got almost no difference in terms of performance.

Something missing?


Hmm… but what about auto vectorization and auto parallelization? Maybe they could help? Why not use those powerful features as well? In fact, it would be better to rely on the compiler to do most of the job instead of manually rewriting the code.

In Visual Studio (since VS 2012) there are two important options: /Qvec and /Qpar. Those options should, as the names suggest, automatically use vector instructions and distribute work across multiple cores.

I do not have much experience using those switches, but in my case they simply do not work and I got no performance improvement.

To find out what is going on with the 'auto' switches, you have to add the /Qvec-report and /Qpar-report compiler options. The compiler will then show which loops were vectorized or parallelized, and where it ran into problems. On MSDN there is a whole page that describes all the possible issues that can block the 'auto' features.

I definitely need to look closer at those powerful 'auto' features and figure out how to use them properly.

BTW: What is the difference between auto vectorization and enable enhanced instruction set options?


Bonus: GCC (mingw) results

Although compiling the full particle demo (with graphics) with a different compiler would be quite problematic, there is no such problem with 'cpuTest'. This benchmark is only a simple console application, so I managed to rebuild it using GCC (MinGW version). Here are the results:

32bit, Ivy Bridge

GCC 4.8.1, -march=native -mavx -Ofast -m32 -std=c++11 -ffast-math
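| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 151000 | 230.000 | 508.000    | 415.000  |
| 161000 | 439.500 | 646.750    | 494.375  |
| 171000 | 493.688 | 694.344    | 531.672  |
| 181000 | 534.336 | 748.168    | 568.584  |
| 191000 | 565.792 | 798.396    | 613.198  |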

64bit, Ivy Bridge

-march=native -mavx -Ofast -m64 -std=c++11 -ffast-math
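| count  | tunnel  | attractors | fountain |
|--------|---------|------------|----------|
| 151000 | 251.000 | 499.500    | 406.750  |
| 161000 | 459.875 | 622.438    | 473.719  |
| 171000 | 505.359 | 672.180    | 510.590  |
| 181000 | 539.795 | 714.397    | 546.199  |
| 191000 | 576.099 | 764.050    | 579.525  |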

It seems that the GCC optimizer does a noticeably better job than Visual Studio: for attractors with 191000 particles on Ivy Bridge, 764.050 ms versus 832.478 ms!

Wrap up & What's Next

This was quite fast: I tested several Visual Studio compiler switches, and it turned out that only the floating-point mode and the enhanced instruction set options improved performance noticeably.

Final results:

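| CPU   | count  | tunnel            | attractors        | fountain          |
|-------|--------|-------------------|-------------------|-------------------|
| Sandy | 191000 | 489.206 (-18.02%) | 688.603 (-22.36%) | 520.302 (-20.44%) |
| Ivy   | 191000 | 593.956 (-15.66%) | 832.478 (-14.77%) | 640.739 (-15.15%) |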

In the end there is a speedup of around 20% on Sandy Bridge and around 15% on Ivy Bridge. This is definitely not a huge factor, but still quite nice. It took only a few clicks of the mouse! ;)

Question: Do you know other useful Visual Studio/GCC compiler options that could help in this case?

Next time, I will try to show how to further improve the performance by using SIMD instructions. By rewriting some of the critical code parts we can utilize even more of the CPU power.

Want to Help and Test?

Just as an experiment, it would be nice to compile the code with GCC or Clang and compare the results, or to run it on a different CPU. If you want to help, the repository is on GitHub, and if you have timings, please let me know.

The easiest way is to download exe files (should be virus-free, but please double check!) and save the results to a txt file.
