In part 2 of the article about persistent mapped buffers I share results from the demo app.
I've compared the single, double and triple buffering approaches for persistent mapped buffers. Additionally, there is a comparison with the standard methods: glBuffer*Data and glMapBuffer.
Note:
This post is the second part of the article about Persistent Mapped Buffers,
see the first part here - introduction
Demo
Github repo: fenbf/GLSamples
How it works:
- the app shows a number of rotating 2D triangles (wow!)
- triangles are updated on the CPU and then sent (streamed) to the GPU
- drawing is based on the glDrawArrays command
- in benchmark mode I run the app for N seconds (usually 5 s) and then count how many frames were rendered
- additionally, I measure a counter that is incremented each time we need to wait for a buffer
- vsync is disabled
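The benchmark measurement described above boils down to dividing a frame counter by the elapsed time. Here is a minimal sketch of that arithmetic. The helper name `averageFps` is hypothetical and not taken from the demo; it assumes the elapsed time is given in milliseconds, which is the unit glutGet(GLUT_ELAPSED_TIME) reports.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the demo): average frames per second
// for a run that rendered `frameCount` frames in `elapsedMs` milliseconds.
double averageFps(std::size_t frameCount, int elapsedMs)
{
    assert(elapsedMs >= 0);   // a negative duration makes no sense
    if (elapsedMs == 0)
        return 0.0;           // avoid dividing by zero for an empty run
    return static_cast<double>(frameCount) * 1000.0 / elapsedMs;
}
```

Since vsync is disabled, this number is limited only by how fast the CPU and GPU can push frames, which is what makes it usable as a score.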
Features:
- configurable number of triangles
- configurable number of buffers: single/double/triple
- optional syncing
- optional debug flag
- benchmark mode (quit app after N seconds)
Code bits
Init buffer:
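// one range holds the positions of all triangles (3 vertices each)
size_t bufferSize{ gParamTriangleCount * 3 * sizeof(SVertex2D) };
if (gParamBufferCount > 1)
{
    // one range per buffer; 'begin' is a vertex index, not a byte offset
    bufferSize *= gParamBufferCount;
    gSyncRanges[0].begin = 0;
    gSyncRanges[1].begin = gParamTriangleCount * 3;
    gSyncRanges[2].begin = gParamTriangleCount * 3 * 2;
}
// the same flags are used to create the immutable storage and to map it
flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_ARRAY_BUFFER, bufferSize, 0, flags);
// map once and keep the pointer for the whole lifetime of the buffer
gVertexBufferData = (SVertex2D*)glMapBufferRange(GL_ARRAY_BUFFER,
    0, bufferSize, flags);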
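The init code above writes the three range offsets by hand. As a sketch only, the same layout can be computed for any number of ranges. The helpers `computeRangeBegins` and `totalBufferSize` are hypothetical names, and the two-float layout of `SVertex2D` is an assumption based on the `x` and `y` members used in the demo.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the demo's vertex type; the two-float layout is an assumption.
struct SVertex2D { float x, y; };

// Hypothetical helper: first vertex index of each range when one buffer is
// split into `bufferCount` equal ranges of `triangleCount` triangles each.
// The values are vertex indices (as in gSyncRanges[i].begin), not bytes.
std::vector<std::size_t> computeRangeBegins(std::size_t triangleCount,
                                            std::size_t bufferCount)
{
    assert(bufferCount >= 1);
    const std::size_t verticesPerRange = triangleCount * 3;
    std::vector<std::size_t> begins(bufferCount);
    for (std::size_t i = 0; i < bufferCount; ++i)
        begins[i] = i * verticesPerRange;
    return begins;
}

// Hypothetical helper: total size in bytes of the storage holding all ranges.
std::size_t totalBufferSize(std::size_t triangleCount, std::size_t bufferCount)
{
    return triangleCount * 3 * sizeof(SVertex2D) * bufferCount;
}
```

For 2000 triangles and triple buffering this gives the ranges 0, 6000 and 12000, the same values the hand-written assignments produce.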
Display:
void Display() {
    glClear(GL_COLOR_BUFFER_BIT);
    gAngle += 0.001f;

    // wait until the GPU is done with the range we are about to overwrite
    if (gParamSyncBuffers)
    {
        if (gParamBufferCount > 1)
            WaitBuffer(gSyncRanges[gRangeIndex].sync);
        else
            WaitBuffer(gSyncObject);
    }

    // first vertex of the range used in this frame
    size_t startID = 0;
    if (gParamBufferCount > 1)
        startID = gSyncRanges[gRangeIndex].begin;

    // update vertices on the CPU, writing directly into the mapped memory
    for (size_t i(0); i != gParamTriangleCount * 3; ++i)
    {
        gVertexBufferData[i + startID].x = genX(gReferenceTrianglePosition[i].x);
        gVertexBufferData[i + startID].y = genY(gReferenceTrianglePosition[i].y);
    }

    glDrawArrays(GL_TRIANGLES, startID, gParamTriangleCount * 3);

    // place a fence after the draw call that reads from this range
    if (gParamSyncBuffers)
    {
        if (gParamBufferCount > 1)
            LockBuffer(gSyncRanges[gRangeIndex].sync);
        else
            LockBuffer(gSyncObject);
    }

    // move on to the next range
    gRangeIndex = (gRangeIndex + 1) % gParamBufferCount;

    glutSwapBuffers();
    gFrameCount++;

    // benchmark mode: quit after the allowed time has passed
    if (gParamMaxAllowedTime > 0 &&
        glutGet(GLUT_ELAPSED_TIME) > gParamMaxAllowedTime)
        Quit();
}
WaitBuffer:
void WaitBuffer(GLsync& syncObj)
{
    if (syncObj)
    {
        while (1)
        {
            // the timeout is given in nanoseconds, so each call waits at most 1 ns
            GLenum waitReturn = glClientWaitSync(syncObj,
                GL_SYNC_FLUSH_COMMANDS_BIT, 1);
            if (waitReturn == GL_ALREADY_SIGNALED ||
                waitReturn == GL_CONDITION_SATISFIED)
                return;

            gWaitCount++; // the counter
        }
    }
}
Test Cases
I've created a simple batch script that:
- runs the tests for 10, 100, 1000, 2000 and 5000 triangles
- for each triangle count, runs the following configurations (5 seconds each):
- persistent_mapped_buffer single_buffer sync
- persistent_mapped_buffer single_buffer no_sync
- persistent_mapped_buffer double_buffer sync
- persistent_mapped_buffer double_buffer no_sync
- persistent_mapped_buffer triple_buffer sync
- persistent_mapped_buffer triple_buffer no_sync
- standard_mapped_buffer glBuffer*Data orphan
- standard_mapped_buffer glBuffer*Data no_orphan
- standard_mapped_buffer glMapBuffer orphan
- standard_mapped_buffer glMapBuffer no_orphan
- in total: 5 * 10 * 5 sec = 250 sec
- no_sync means that there is no locking or waiting for the buffer range. That can potentially cause a race condition and even an application crash - use it at your own risk! (In my case nothing bad happened, maybe a little bit of dancing vertices :) )
- 2k triangles use 2000*3*2*4 bytes = 48 kbytes per frame. This is quite a small amount. In the follow-up to this experiment I'll try to increase it and stress the CPU-to-GPU bandwidth a bit more.
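The two back-of-the-envelope numbers above (bytes streamed per frame and total benchmark time) can be reproduced with a small sketch. The helper names are hypothetical, and the `SVertex2D` layout of two 4-byte floats is an assumption that matches the 2*4 factor in the formula.

```cpp
#include <cassert>
#include <cstddef>

// Assumed vertex layout: two 4-byte floats, matching the 2*4 in the formula.
struct SVertex2D { float x, y; };
static_assert(sizeof(SVertex2D) == 8, "expected two packed 4-byte floats");

// Hypothetical helper: bytes written to the buffer in a single frame.
std::size_t bytesPerFrame(std::size_t triangleCount)
{
    return triangleCount * 3 * sizeof(SVertex2D);
}

// Hypothetical helper: total run time of the batch script in seconds.
int totalBenchmarkSeconds(int triangleCounts, int configurations,
                          int secondsPerRun)
{
    assert(triangleCounts >= 0 && configurations >= 0 && secondsPerRun >= 0);
    return triangleCounts * configurations * secondsPerRun;
}
```

With 5 triangle counts, 10 configurations and 5 seconds per run this gives 250 seconds, and 2000 triangles give 48000 bytes per frame.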
Orphaning:
- for glMapBufferRange I add the GL_MAP_INVALIDATE_BUFFER_BIT flag
- for glBuffer*Data I call glBufferData(NULL) and then make a normal call to glBufferSubData.
Results
All results can be found on github: GLSamples/project/results
100 Triangles
GeForce 460 GTX (Fermi), Sandy Bridge Core i5 2400, 3.1 GHz
Wait counter:
- Single buffering: 37887
- Double buffering: 79658
- Triple buffering: 0
AMD HD5500, Sandy Bridge Core i5 2400, 3.1 GHz
Wait counter:
- Single buffering: 1594647
- Double buffering: 35670
- Triple buffering: 0
Nvidia GTX 770 (Kepler), Sandy Bridge i5 2500K @ 4 GHz
Wait counter:
- Single buffering: 21863
- Double buffering: 28241
- Triple buffering: 0
Nvidia GTX 850M (Maxwell), Haswell i7-4710HQ
Wait counter:
- Single buffering: 0
- Double buffering: 0
- Triple buffering: 0
Chart: all GPUs
Chart: with Intel HD4400 and NV 720M
2000 Triangles
GeForce 460 GTX (Fermi), Sandy Bridge Core i5 2400, 3.1 GHz
Wait counter:
- Single buffering: 2411
- Double buffering: 4
- Triple buffering: 0
AMD HD5500, Sandy Bridge Core i5 2400, 3.1 GHz
Wait counter:
- Single buffering: 79462
- Double buffering: 0
- Triple buffering: 0
Nvidia GTX 770 (Kepler), Sandy Bridge i5 2500K @ 4 GHz
Wait counter:
- Single buffering: 10405
- Double buffering: 404
- Triple buffering: 0
Nvidia GTX 850M (Maxwell), Haswell i7-4710HQ
Wait counter:
- Single buffering: 8256
- Double buffering: 91
- Triple buffering: 0
Chart: all GPUs
Chart: with Intel HD4400 and NV 720M
Summary
- Persistent Mapped Buffers (PMB) with triple buffering and no synchronization seem to be the fastest approach in most tested scenarios.
- Only the Maxwell (850M) GPU has issues with that: it is slow for 100 tris, and for 2k tris it's better to use double buffering.
- PMB with double buffering seems to be only a bit slower than triple buffering, but sometimes the 'wait counter' was not zero, which means we needed to wait for the buffer. Triple buffering never had to wait in these tests, so no synchronization is needed there.
- Using double buffering without syncing might work, but we might expect artifacts (this needs more verification).
- Single buffering (PMB) with syncing is quite slow on NVIDIA GPUs.
- Using glMapBuffer without orphaning is the slowest approach.
- Interestingly, glBuffer*Data with orphaning seems to be comparable to PMB, so old code that uses this approach might still be quite fast!
TODO: use Google Charts for better visualization of the results
Please Help
If you'd like to help, you can run the benchmark on your own machine and send me (bartlomiej DOT filipek AT gmail ) the results.
Windows only. Sorry :)
Go to benchmark_pack and execute the batch file run_from_10_to_5000.bat:
run_from_10_to_5000.bat > my_gpu_name.txt
The script runs all the tests and takes around 250 seconds.
If you are not sure whether your GPU supports the ARB_buffer_storage extension, you can simply run persistent_mapped_buffers.exe alone and it will show you potential problems.