Finally, I managed to finish the adventure with my particle system! This time I’d like to share some thoughts about improvements in the OpenGL renderer.
The code got simpler and I gained a small performance improvement.
The Series
- Initial Particle Demo
- Introduction
- Particle Container 1 - problems
- Particle Container 2 - implementation
- Generators & Emitters
- Updaters
- Renderer
- Introduction to Software Optimization
- Tools Optimizations
- Code Optimizations
- Renderer Optimizations
- Summary
Where are we?
As I described in the post about my current renderer, I use quite a simple approach: copy position and color data into the VBO buffer and then render particles.
Here is the core code of the update proc:
glBindBuffer(GL_ARRAY_BUFFER, m_bufPos);
ptr = m_system->getPos(...);
glBufferSubData(GL_ARRAY_BUFFER, 0, size, ptr);
glBindBuffer(GL_ARRAY_BUFFER, m_bufCol);
ptr = m_system->getCol(...);
glBufferSubData(GL_ARRAY_BUFFER, 0, size, ptr);
The main problem with this approach is that we need to transfer data from system memory to the GPU. The GPU needs to read that data, whether it is explicitly copied into GPU memory or read directly through GART, before it can use it in a draw call.
It would be much better to keep everything on the GPU side, but that is too complicated at this point. Maybe in the next version of my particle system I’ll implement it completely on the GPU.
Still, we have some options to increase performance when doing CPU to GPU data transfer.
Basic Checklist
- Disable VSync! - OK
- Quite easy to forget, but without this we could not measure real performance!
- Small addition: do not overuse blocking code like timer queries. Done badly, it can really spoil performance: reading a query result that is not ready yet stalls the CPU until the GPU has produced it!
- Single draw call for all particles - OK
- Doing one draw call per particle would obviously kill performance!
- Using Point Sprites - OK
- An interesting test at Geeks3D showed that point sprites are faster than the geometry shader approach: up to 30% faster on AMD cards and between 5% and 33% faster on NVidia GPUs.
- Of course point sprites are less flexible (do not support rotations), but usually we can live without that.
- Reduce size of the data - Partially
- I send only pos and col, but I am using full FLOAT precision and 4 components per vector.
- Risk: we could reduce vertex size, but that would require doing conversions. Is it worth it?
The numbers
Memory transfer:
- In total I use 8 floats per vertex/particle. If a particle system contains 100k particles (not that many!) we transfer 100k * 8 * 4 bytes = 3200k bytes = ~3MB of data each frame.
- If we want to use more particles, like 500k, it’ll be around 15MB each frame.
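The arithmetic above can be written down as a tiny helper. This is only a sketch: `bytesPerFrame` and `kFloatsPerParticle` are names made up for illustration, and it assumes the layout described above (vec4 position plus vec4 color, 4-byte floats).

```cpp
#include <cstddef>

// Hypothetical helper, not part of the particle system:
// bytes uploaded to the GPU each frame for `particleCount` particles.
// Assumes vec4 position + vec4 color stored as 32-bit floats.
constexpr std::size_t kFloatsPerParticle = 8; // 4 (pos) + 4 (col)

constexpr std::size_t bytesPerFrame(std::size_t particleCount)
{
    return particleCount * kFloatsPerParticle * sizeof(float);
}

// Checked at compile time (holds wherever float is 4 bytes).
static_assert(bytesPerFrame(100000) == 3200000,
              "100k particles: 3.2 MB, roughly 3 MiB");
static_assert(bytesPerFrame(500000) == 16000000,
              "500k particles: 16 MB, roughly 15 MiB");
```

At 60 frames per second, the 500k case works out to about 960 MB of uploads per second, which is why the transfer path matters.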
Computation:
In my last CPU performance tests I got the following numbers: one frame of simulations for each effect (in milliseconds).
particle count | tunnel | attractors | fountain |
---|---|---|---|
500k | 5.02 | 6.54 | 5.15 |
Now we need to add the GPU time + memory transfer cost.
Our Options
As I described in detail in the posts about Persistent Mapped Buffers (PMB), I think it’s obvious we should use this approach.
Other options, like buffer orphaning or plain mapping, might work, but I think the code would be more complicated.
We can simply use PMB with 3x the buffer size (triple buffering), and that will probably give the best performance gain.
Here is the updated code:
The creation:
const GLbitfield creationFlags = GL_MAP_WRITE_BIT |
GL_MAP_PERSISTENT_BIT |
GL_MAP_COHERENT_BIT |
GL_DYNAMIC_STORAGE_BIT;
const GLbitfield mapFlags = GL_MAP_WRITE_BIT |
GL_MAP_PERSISTENT_BIT |
GL_MAP_COHERENT_BIT;
const unsigned int BUFFERING_COUNT = 3;
const GLsizeiptr neededSize = sizeof(float) * 4 *
count * BUFFERING_COUNT;
glBufferStorage(GL_ARRAY_BUFFER, neededSize,
nullptr, creationFlags);
mappedBufferPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0,
neededSize, mapFlags);
The update:
float *posPtr = m_system->getPos(...);
float *colPtr = m_system->getCol(...);
const size_t maxCount = m_system->numAllParticles();
// just a memcpy into the region of the current buffer
float *mem = m_mappedPosBuf + m_id*maxCount * 4;
memcpy(mem, posPtr, count*sizeof(float) * 4);
mem = m_mappedColBuf + m_id*maxCount * 4;
memcpy(mem, colPtr, count*sizeof(float) * 4);
// m_id - id of current buffer (0, 1, 2)
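The offset arithmetic in the update code can be isolated into a small helper. The sketch below is an illustration only: `RingRegions` and its members are hypothetical names, and it assumes one region holds `maxCount` vec4 elements, as in the snippet above.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the `m_id * maxCount * 4` arithmetic.
struct RingRegions
{
    std::size_t maxCount;    // capacity of one region, in particles
    std::size_t regionCount; // 3 for triple buffering
    std::size_t id;          // region written during the current frame

    RingRegions(std::size_t maxParticles, std::size_t regions)
        : maxCount(maxParticles), regionCount(regions), id(0) {}

    // Offset of the current region in floats (4 floats per vec4).
    std::size_t floatOffset() const { return id * maxCount * 4; }

    // The same offset in bytes.
    std::size_t byteOffset() const { return floatOffset() * sizeof(float); }

    // Index of the first vertex of the current region, usable as the
    // `first` argument of glDrawArrays(GL_POINTS, first, count).
    std::size_t firstVertex() const { return id * maxCount; }

    // Call once per frame, after the draw call has been issued.
    void advance() { id = (id + 1) % regionCount; }
};
```

The important detail is that the draw call has to read from the same region that was just written, so the write offset and the `first` vertex index should come from the same `id`.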
My approach is quite simple and could be improved. Since I have a pointer to the mapped memory, I could pass it to the particle system. That way I would not have to memcpy it every time.
Another thing: I do not use explicit synchronization. This might cause some issues, but I haven’t observed any. Triple buffering should protect us from race conditions. Still, in real production code I would not be so optimistic :)
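For completeness, here is a rough sketch of what explicit synchronization could look like: one fence per buffer region, created after the draw call and waited on before that region is written again. This is not code from the renderer. `RegionLocks` and the `FenceApi` policy type are hypothetical; in a real OpenGL build the policy would wrap `glFenceSync`, `glClientWaitSync` and `glDeleteSync`.

```cpp
#include <array>
#include <cstddef>

// Sketch of per-region locking for a persistently mapped ring buffer.
// `FenceApi` is a hypothetical policy type. With real OpenGL it would map to:
//   create()   -> glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
//   wait(f)    -> loop on glClientWaitSync(f, GL_SYNC_FLUSH_COMMANDS_BIT, timeout)
//   destroy(f) -> glDeleteSync(f)
template <typename FenceApi, std::size_t RegionCount = 3>
class RegionLocks
{
public:
    using Fence = typename FenceApi::Fence;

    explicit RegionLocks(FenceApi api = FenceApi()) : m_api(api) {}

    // Call before writing into region `id`: blocks until the GPU has
    // finished the commands that last read from that region.
    void waitBeforeWrite(std::size_t id)
    {
        if (m_active[id]) {
            m_api.wait(m_fences[id]);
            m_api.destroy(m_fences[id]);
            m_active[id] = false;
        }
    }

    // Call right after issuing the draw call that reads region `id`.
    void lockAfterDraw(std::size_t id)
    {
        if (m_active[id])
            m_api.destroy(m_fences[id]); // drop a stale fence, if any
        m_fences[id] = m_api.create();
        m_active[id] = true;
    }

private:
    FenceApi m_api;
    std::array<Fence, RegionCount> m_fences{};
    std::array<bool, RegionCount> m_active{}; // all false initially
};
```

With three regions the wait should almost never block, because the fence being waited on was created two frames earlier. That is probably why skipping the synchronization appears to work.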
Results
Initially (AMD HD 5500), in frames per second:
particle count | Tunnel | Attractors | Fountain |
---|---|---|---|
500k | 77.5 | 79.9 | 99.6 |
1mln | 36.3 | 29.4 | 43.2 |
After (frames per second, with the gain over the initial version in parentheses):
particle count | Tunnel | Attractors | Fountain |
---|---|---|---|
500k | 81.5 (~5.2%) | 82.2 (~2.9%) | 107.2 (~7.6%) |
1mln | 39.9 (~9.9%) | 31.8 (~8.2%) | 47.2 (~9.3%) |
Reducing vertex size optimization
I tried to reduce the vertex size. I even asked a question on StackOverflow:
How much perf can I get using half_floats for vertex attribs?
We could use GL_HALF_FLOAT or use vec3 instead of vec4 for position. And we could also use RGBA8 for color.
Still, after some basic tests, I did not get much performance improvement, maybe because I spent a lot of time on the conversions.
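To make the conversion cost concrete, here is a minimal sketch of the RGBA8 idea: packing float colors into one byte per channel on the CPU. The function names are invented for this example, and this is not the code used in the tests mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: one color channel from float [0, 1] to 8 bits.
inline std::uint8_t toUnorm8(float v)
{
    const float clamped = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(clamped * 255.0f + 0.5f); // round to nearest
}

// Packs `count` RGBA colors (4 floats each) into 4 bytes each.
// `dst` must have room for count * 4 bytes.
inline void packColorsRGBA8(const float *src, std::uint8_t *dst, std::size_t count)
{
    for (std::size_t i = 0; i < count * 4; ++i)
        dst[i] = toUnorm8(src[i]);
}

// On the GL side the attribute would then be declared as normalized bytes,
// e.g. glVertexAttribPointer(index, 4, GL_UNSIGNED_BYTE, GL_TRUE, 0, nullptr),
// so the shader still receives a vec4 in the [0, 1] range.
```

This shrinks color from 16 to 4 bytes per particle (32 to 20 bytes in total), but the loop touches every channel of every particle each frame, which is exactly the kind of conversion work that can eat the gain from the smaller transfer.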
What’s Next
The system with its renderer isn’t that bad. On my machine I can get a decent 70..80 FPS for 0.5 million particles! For a 1 million particle system it drops to 30..45 FPS, which is also not that bad!
But definitely, the plan is to move to the GPU side.
Resources
- Persistent Mapped Buffers - my two recent posts:
- From “The Hacks Of Life” blog, VBO series:
- Double-Buffering VBOs - part one
- Double-Buffering Part 2 - Why AGP Might Be Your Friend - part two
- One More On VBOs - glBufferSubData - part three
- When Is Your VBO Double Buffered? - part four