Buffer Streaming in OpenGL

I spent a bit of time recently designing a sprite batching system using the (more) modern OpenGL 3.x core functionality rather than the old immediate mode stuff. I thought I would share some of my observations about working with it. My first approach to the problem consisted of allocating a single, large dynamic vertex buffer object, which I mapped via glMapBuffer() to copy in data for each batch of sprites. I used a static index buffer that I locked once and filled at initialization time, since I only ever needed to draw quads. This worked fine, except it was wicked slow.

For one, MetaVoxel uses alpha blending very heavily, requiring everything to be drawn back to front. As you might expect, this results in an output sensitive batch size, because it relies on the next quad requiring an identical rendering state. Sorting by texture isn’t possible, unfortunately. Additionally, each time I mapped the buffer, I would map the entire thing all at once–even for small batches. Anyway, it didn’t take much to bring things to a crawl. Profiling revealed a vast majority of time spent in the driver, waiting on glMapBuffer.

Clearly, I was doing something wrong. The key issue turned out to be buffer synchronization between the CPU and GPU. I found some great resources below that go into detail explaining how to optimize this.

http://hacksoflife.blogspot.com/2012/04/beyond-glmapbuffer.html

http://www.opengl.org/wiki/Buffer_Object_Streaming

I won’t rehash the details of how buffer orphaning works, see the above articles if you’re interested. The basic idea is that the driver is very conservative and will happily stall waiting for a buffer to flush to the GPU before letting you write to it again. You either need to coax the driver to allocate a new buffer for you, or implement a ring buffer scheme yourself. I ended up implementing the following techniques that got things running smoothly.

    I used glMapBufferRange to map only the required amount of data per batch, specifying the GL_MAP_INVALIDATE_RANGE_BIT.

  • I created a ring buffer of vertex buffer objects, mapping each one in sequence with each new batch. Rather than coaxing the driver to give me a fresh set of data each time, I implemented it myself.
  • I specified GL_STREAM_DRAW as the driver hint.

When I tested this on several machines, I saw vastly better performance–indicating less synchronization between the CPU and GPU.

Advertisements