Some rough iPad OpenGL Performance and Techniques notes…

I took notes on various tests I was doing while trying to figure out the best way to draw a large map of grid squares.  I did a lot of tests and it was a good introduction to OpenGL.  But while these experiments were fun, Geek was getting impatient to actually create a game already, and besides, the general advice seems to be to use a game engine for your first few games rather than write your own (amusing how few game developers follow this advice though! :-)).

So I decided to investigate Cocos2d-iphone, since we are only playing around with 2D games at this point. Since this post is coming out after Geek’s tutorial on cocos2d-iphone tilemaps (a part 3 to Ray Wenderlich’s great set of tutorials), any reader following along already knows that we’ve found cocos2d-iphone to be pretty fun and capable so far.

Anyway, there are some useful bits of information in these notes, so rather than trash them because I’ve moved on and don’t feel like polishing them into a smooth blog post, I’m just going to post the raw notes here (so I can refer back to them if nothing else).  If you only want to read polished blog posts, you might want to skip this one.  🙂  If you like OpenGL nitty-gritty tidbits, FPS trade-offs, and the like, this might be of interest.

So in a previous post on OpenGL I’d said this:

Some performance data:

Just as a test, I set my test app to allocate 4800 of the 32 x 32 pixel uncompressed GL_RGBA, GL_UNSIGNED_BYTE  textures.  Oh, and one big texture atlas that is 128 x 128 pixels, same depth etc.

This test is drawing 64,516 Point Sprites (point size is 16) using 8 different textures at 45 Core Animation Frames Per Second (CAFPS). Note that only 768 of them are visible on screen; the rest are offscreen and culled by OpenGL.  The points are x,y as GL_SHORTs in a single large vertex buffer and grouped by texture.  They are then drawn with eight glDrawArrays calls each frame (with the appropriate texture bound before each call), and GL_COORD_REPLACE_OES is true so the textures are drawn mapped over the points.
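A rough sketch of the "grouped by texture" layout described above, in plain C with no GL calls so it can be run anywhere. The `Sprite` and `Batch` types and the `group_by_texture` function are hypothetical names for illustration, not the code from the test app; the idea is just a counting sort that leaves each texture's points contiguous so each texture needs one `first`/`count` pair.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TEXTURES 8

/* Hypothetical sprite record: map position plus the texture it uses. */
typedef struct { short x, y; unsigned char tex; } Sprite;

/* A contiguous run of points in the vertex array sharing one texture. */
typedef struct { int first, count; } Batch;

/* Writes x,y pairs (2 shorts per sprite) into out_xy grouped by texture
   and fills batches[t] with the first/count range for texture t. */
void group_by_texture(const Sprite *sprites, size_t n,
                      short *out_xy, Batch batches[NUM_TEXTURES])
{
    int next[NUM_TEXTURES];
    int t, offset = 0;
    size_t i;

    /* Pass 1: count how many sprites use each texture. */
    for (t = 0; t < NUM_TEXTURES; t++) batches[t].count = 0;
    for (i = 0; i < n; i++) {
        assert(sprites[i].tex < NUM_TEXTURES);
        batches[sprites[i].tex].count++;
    }

    /* Pass 2: turn the counts into starting offsets. */
    for (t = 0; t < NUM_TEXTURES; t++) {
        batches[t].first = offset;
        next[t] = offset;
        offset += batches[t].count;
    }

    /* Pass 3: scatter each sprite into its texture's range. */
    for (i = 0; i < n; i++) {
        int slot = next[sprites[i].tex]++;
        out_xy[2 * slot]     = sprites[i].x;
        out_xy[2 * slot + 1] = sprites[i].y;
    }
}
```

Each frame would then bind texture t and call glDrawArrays(GL_POINTS, batches[t].first, batches[t].count) once per texture, which is where the eight draw calls come from.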

Actually, just for fun I made that selectable and without the VBO use the frame rate goes down to 40 CAFPS.  That’s good to know.

Showing 24-32MB Resource Bytes (Debug/Release, ?) and 69% Resource utilization.  Huh, that doesn’t change even when I only allocate 8 of the 32×32 pixel textures..  odd.

Anyway, ES 1.1, obviously.  iPhone OS 3.2 on Wi-Fi iPad.

Data from some older test code:  I just ran another piece of code that drew the 32×32 pixel “points” as two-triangle triangle-strips via glDrawElements, setting the color with a call to glColor4ub only when it changed.  Same 254×254 map with the same number actually on screen.   This is with a VBO for the vertexes, but not for the triangles or the colors.

Only 6 CAFPS!  Yow!  Only 4% RDU, but 100% CPU usage and 96% of that is in the glDrawElements call…   Clearly having the colors in a buffer would help some, but it’s interesting to see just how bad it can be…   Since we are drawing individual elements, we can actually be smart and only draw the ones that are visible on screen – making that change gets us back to 59/60 CAFPS.   Interesting set of tradeoffs.  Drawing with a mechanism that draws batches of things is faster per thing, but harder to draw only what is needed.  Drawing individual elements is much slower, but it is much easier to limit to just those things that are on-screen.  Also, limiting it to the things on screen seems to have dropped the CPU usage to essentially zero… which is suspicious.   Somehow it _knows_ that nothing is changing visually?  magic! hmm.  more to learn, clearly!
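For reference, a minimal sketch of the "only draw the ones that are visible" calculation, assuming a scroll offset measured in map pixels from the top-left corner and a view that overlaps the map. `TileRange` and `visible_tiles` are made-up names, not the code from my test app.

```c
#include <assert.h>

/* Inclusive range of grid squares that intersect the view. */
typedef struct { int col0, row0, col1, row1; } TileRange;

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* scroll_x, scroll_y: map-space pixel at the view's top-left corner.
   view_w, view_h:     view size in map pixels (already divided by zoom).
   Assumes the view overlaps the map; results are clamped to the map. */
TileRange visible_tiles(int scroll_x, int scroll_y, int view_w, int view_h,
                        int tile_size, int map_cols, int map_rows)
{
    TileRange r;
    assert(tile_size > 0 && view_w > 0 && view_h > 0);
    assert(map_cols > 0 && map_rows > 0);

    r.col0 = clampi(scroll_x / tile_size, 0, map_cols - 1);
    r.row0 = clampi(scroll_y / tile_size, 0, map_rows - 1);
    /* Last pixel inside the view is at scroll + size - 1. */
    r.col1 = clampi((scroll_x + view_w - 1) / tile_size, 0, map_cols - 1);
    r.row1 = clampi((scroll_y + view_h - 1) / tile_size, 0, map_rows - 1);
    return r;
}
```

On a 1024×768 screen with 32-pixel squares and no scroll offset this gives columns 0-31 and rows 0-23, i.e. the 768 on-screen squares mentioned earlier; scrolled by half a square it grows to 33×25 because partially visible squares count too.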

So I wonder if it’s just making fewer calls to gl* routines (i.e., drawing a bunch of elements with a single call, either via GL_POINTS or GL_TRIANGLE_STRIP) or having the data in the GPU?  hmm.  There is also the difference between glDrawElements vs glDrawArrays…   more testing is called for!

Converted this last sample to use glDrawTexture (the glDrawTex*OES calls from the OES_draw_texture extension) for each grid square and we are down to 5 CAFPS and 100% CPU usage, even drawing just the ones on screen.  Calculating the vertexes and texture coordinates that are visible every time is clearly expensive.



I’ve tried a bunch of different OpenGL approaches now, including GL Point Sprites (fast, but with some serious limitations, as noted in an earlier post), glDrawTexture calls for each grid square (slow due to passing all the data across to the video card/chip, as noted above), glDrawElements with triangle strips with the vertexes in buffers (VBOs) (fast, but it seems you can’t give each grid square its own atlas texture coordinates on a continuous triangle strip, because neighboring squares share vertexes and each vertex carries only one texture coordinate, which I didn’t realize at first), and glDrawElements with triangles (but you have some map size limitations because the element indexes passed to the call are at most unsigned shorts in ES 1.1, so one call can address only 65,536 vertexes). Oh, and tests with the entire map in GL buffers and drawing all of it with one call, versus drawing just the visible portion, often requiring multiple calls.
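To put numbers on that index limit, here is a small back-of-the-envelope sketch. The function names are made up for illustration; the only hard fact is that an unsigned short index can address vertexes 0 through 65,535.

```c
#include <assert.h>

/* Indexes 0..65535 are addressable with GL_UNSIGNED_SHORT. */
#define MAX_USHORT_VERTICES 65536L

/* Vertex count when neighboring squares share their corners
   (one position and one texture coordinate per corner). */
long shared_vertex_count(long cols, long rows)
{
    assert(cols >= 0 && rows >= 0);
    return (cols + 1) * (rows + 1);
}

/* Vertex count when every square gets its own four corners, which is
   what per-square atlas texture coordinates require. */
long unshared_vertex_count(long cols, long rows)
{
    assert(cols >= 0 && rows >= 0);
    return 4 * cols * rows;
}

int fits_in_ushort_indices(long vertex_count)
{
    return vertex_count <= MAX_USHORT_VERTICES;
}
```

With shared corners the 254×254 map needs 65,025 vertexes and just fits. With four vertexes per square it needs 258,064, and the largest square map that fits in a single indexed call is 128×128.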

The best so far that I’ve found for this large map of grid squares, with texture mapping for each grid square and supporting zoom out to 25K triangles visible on screen:  using glDrawArrays with a dynamically constructed set of vertexes and texture coordinates stored in vertex buffers (VBOs), and only updating those when the scroll position (or zoom) changes.  One call to glDrawArrays for the whole visible map.  Textures are in a texture atlas, and by being careful about the number of OpenGL state changes I can get 55-60 fps at natural size with 1500 triangles on screen (32×32 grid size, 2 triangles for each grid square), 45-60 fps zoomed out with 6K triangles on screen (16×16), and 30-40 fps with 25K triangles on screen (8×8) but nearly 50 fps when the map is stationary.  The lower ends of the ranges are when scrolling, because I have to update the VBO data on the card, which I’m doing by updating the existing buffer and not by allocating new buffers.
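A sketch of what "dynamically constructed set of vertexes and texture coordinates" can look like, in plain C without the GL upload. The assumptions here are mine for illustration: a 128×128 atlas holding a 4×4 grid of 32×32 images, one byte per map square naming its atlas image, and texture coordinates stored as shorts in texels. `build_visible_mesh` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 32, ATLAS_COLS = 4, VERTS_PER_TILE = 6 };

/* Fills pos and uv (2 shorts per vertex each) with two triangles per
   visible square in the inclusive range col0..col1, row0..row1.
   map holds one atlas image number (0..15) per square, row-major.
   Returns the number of vertexes written. */
size_t build_visible_mesh(const unsigned char *map, int map_cols,
                          int col0, int row0, int col1, int row1,
                          short *pos, short *uv, size_t max_verts)
{
    /* Corner offsets for the two triangles of a unit square. */
    static const short corner[VERTS_PER_TILE][2] = {
        {0, 0}, {1, 0}, {0, 1},
        {1, 0}, {1, 1}, {0, 1}
    };
    size_t n = 0;
    int row, col, k;

    for (row = row0; row <= row1; row++) {
        for (col = col0; col <= col1; col++) {
            int img = map[row * map_cols + col];
            /* Top-left texel of this image inside the atlas. */
            short u0 = (short)((img % ATLAS_COLS) * TILE);
            short v0 = (short)((img / ATLAS_COLS) * TILE);

            assert(n + VERTS_PER_TILE <= max_verts);
            for (k = 0; k < VERTS_PER_TILE; k++, n++) {
                pos[2 * n]     = (short)((col + corner[k][0]) * TILE);
                pos[2 * n + 1] = (short)((row + corner[k][1]) * TILE);
                uv[2 * n]      = (short)(u0 + corner[k][0] * TILE);
                uv[2 * n + 1]  = (short)(v0 + corner[k][1] * TILE);
            }
        }
    }
    return n;
}
```

The arrays only need rebuilding when the visible range changes. They can then be copied into the existing buffers with glBufferSubData and drawn with a single glDrawArrays(GL_TRIANGLES, 0, n) call, where n is the returned vertex count.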

Unfortunately, in my current most successful approach I’m getting texture bleed (when part of the next atlas image gets incorporated into the edge of the current one), which I haven’t solved yet – possibly I need to put space between the textures on the atlas and/or do something about mip-mapping.  It might also have to do with changing the scale of the texture space to be 0-128 instead of 0-1.0 so I could use shorts as texture coords instead of floats -> saving 2 bytes per texture coordinate! For the 25K triangle test case, that means 300K of data that doesn’t have to go across to the graphics card or into buffers!  (25,000 triangles * 3 vertexes/triangle * 2 coords/vertex * 2 bytes per coord saved).
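One commonly suggested mitigation for atlas bleed, which I have not verified against this code, is to pull each image's texture coordinates in by half a texel so that filtering never samples the neighboring image. The sketch below uses float coordinates in the 0-1.0 range, since half a texel can't be expressed as an integer on the 0-128 short scale. `UVRect` and `inset_uv` are hypothetical names.

```c
#include <assert.h>

/* Texture coordinates of one atlas image in the 0..1 range. */
typedef struct { float u0, v0, u1, v1; } UVRect;

/* px, py:     top-left texel of the image inside the atlas.
   w, h:       image size in texels.
   atlas_size: width and height of the (square) atlas in texels.
   Each edge is moved half a texel toward the image center. */
UVRect inset_uv(int px, int py, int w, int h, int atlas_size)
{
    UVRect r;
    assert(atlas_size > 0 && w > 0 && h > 0);
    r.u0 = (px + 0.5f) / atlas_size;
    r.v0 = (py + 0.5f) / atlas_size;
    r.u1 = (px + w - 0.5f) / atlas_size;
    r.v1 = (py + h - 0.5f) / atlas_size;
    return r;
}
```

This trades away the 2-bytes-per-coordinate saving, so padding the atlas images may be the better fit if shorts are kept.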


One other thing I thought to try was to render-to-texture and use glDrawTexture once for the whole visible background, instead of dynamically filling the VBO buffers and calling glDrawArrays with triangles and texture coordinates to draw lots of texture-mapped triangles.  Scrolling/zooming will be the hard case: it might be slower because we’d be drawing once to the big texture, and then a second time drawing that texture to the visible surface… Seems like that’s effectively two copies.  I’m guessing that it’ll be faster otherwise because the render should be faster (one big quad, i.e. a two-triangle triangle-strip, instead of potentially 25K texture-mapped triangles).

