I was recently optimizing some OpenGL ES 2.0 shaders for iOS/Android, and it was funny to see how performance tricks that were cool in 2001 are making a comeback. Here’s a small example of starting with a normal-mapped Blinn-Phong shader and optimizing it to run several times faster. Most of the clever stuff below was actually done by ReJ, props to him!

Here’s the small test scene I’ll be working with: just a single plane with albedo and normal map textures.

I’ll be testing on an iPhone 3GS with iOS 4.2.1. The timer is started before glClear() and stopped after a glFinish() call that I added just after drawing the mesh.

Let’s start with an initial naïve shader version:

  #ifdef VERTEX
  attribute vec4 a_position;
  attribute vec2 a_uv;
  attribute vec3 a_normal;
  attribute vec4 a_tangent;
  uniform mat4 u_mvp;
  uniform mat4 u_world2object;
  uniform vec4 u_worldlightdir;
  uniform vec4 u_worldcampos;
  varying vec2 v_uv;
  varying vec3 v_lightdir;
  varying vec3 v_viewdir;
  void main()
  {
  	gl_Position = u_mvp * a_position;
  	v_uv = a_uv;
  	// Build the object space -> tangent space rotation
  	vec3 bitan = cross (a_normal.xyz, a_tangent.xyz) * a_tangent.w;
  	mat3 tsprotation = mat3 (
  		a_tangent.x, bitan.x, a_normal.x,
  		a_tangent.y, bitan.y, a_normal.y,
  		a_tangent.z, bitan.z, a_normal.z);
  	// Bring light direction and camera position into object space
  	vec3 objLightDir = (u_world2object * u_worldlightdir).xyz;
  	vec3 objCamPos = (u_world2object * u_worldcampos).xyz;
  	vec3 objViewDir = objCamPos - a_position.xyz;
  	v_lightdir = tsprotation * objLightDir;
  	v_viewdir = tsprotation * objViewDir;
  }
  #endif
  #ifdef FRAGMENT
  precision highp float;
  uniform vec4 u_lightcolor;
  uniform vec4 u_matcolor;
  uniform float u_spec;
  varying vec2 v_uv;
  varying vec3 v_lightdir;
  varying vec3 v_viewdir;
  uniform sampler2D u_texcolor;
  uniform sampler2D u_texnormal;
  void main()
  {
  	vec4 albedo = texture2D (u_texcolor, v_uv) * u_matcolor;
  	vec3 normal = texture2D (u_texnormal, v_uv).rgb * 2.0 - 1.0;
  	vec3 halfdir = normalize (normalize(v_lightdir) + normalize(v_viewdir));
  	float diff = max (0.0, dot (normal, v_lightdir));
  	float nh = max (0.0, dot (normal, halfdir));
  	float spec = pow (nh, u_spec);
  	vec4 c = albedo * u_lightcolor * diff + u_lightcolor * spec;
  	gl_FragColor = c;
  }
  #endif

Should be pretty self-explanatory to anyone who’s familiar with tangent space normal mapping and the Blinn-Phong BRDF. Running time: 24.5 milliseconds. On iPhone 4’s Retina resolution, with four times as many pixels to shade, this would be about 4x slower!

What can we do next? On mobile platforms, using an appropriate precision for each variable is often very important, especially in a fragment shader. So let’s go and add highp/mediump/lowp qualifiers to the fragment shader.
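The listing for this step is not reproduced here, so what follows is only a sketch of how the qualified fragment shader might look. The choice of qualifier for each variable is an assumption (lowp for colors and unit-length directions, mediump for texture coordinates and the specular exponent), not necessarily the exact version that was timed.

```glsl
#ifdef FRAGMENT
// Assumed precision choices: lowp for colors and normalized directions,
// mediump for UVs and the specular exponent.
uniform lowp vec4 u_lightcolor;
uniform lowp vec4 u_matcolor;
uniform mediump float u_spec;
varying mediump vec2 v_uv;
varying lowp vec3 v_lightdir;
varying lowp vec3 v_viewdir;
uniform sampler2D u_texcolor;
uniform sampler2D u_texnormal;
void main()
{
	lowp vec4 albedo = texture2D (u_texcolor, v_uv) * u_matcolor;
	lowp vec3 normal = texture2D (u_texnormal, v_uv).rgb * 2.0 - 1.0;
	// Sum of two lowp vectors, normalized at lowp
	lowp vec3 halfdir = normalize (normalize(v_lightdir) + normalize(v_viewdir));
	lowp float diff = max (0.0, dot (normal, v_lightdir));
	lowp float nh = max (0.0, dot (normal, halfdir));
	mediump float spec = pow (nh, u_spec);
	lowp vec4 c = albedo * u_lightcolor * diff + u_lightcolor * spec;
	gl_FragColor = c;
}
#endif
```

Since every float variable carries its own qualifier, no default "precision ... float;" statement is needed in this version.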

15 milliseconds! But… the rendering is wrong; everything turned white near the bottom of the screen.

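Turns out PowerVR SGX (the GPU in all current iOS devices) really does mean “low precision” when we add two lowp vectors and normalize the result. lowp only has to cover the range from -2 to 2, and the sum of two unit vectors can have a squared length of up to 4, so the normalize overflows. Let’s try promoting one of them to medium precision with a “varying mediump vec3 v_viewdir”.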

16.3 milliseconds, not too bad! We still have pow() computed in the fragment shader, and that one is probably not the fastest operation there…

Almost a decade ago, a very common trick was to use a lookup texture to do the lighting. For example, a 2D texture indexed by (N.L, N.H). Since all lighting data would be “baked” into the texture, it does not even have to be Blinn-Phong; we can prepare faux-anisotropic, metallic, toon-shading or other fancy BRDFs there, as long as they can be expressed in terms of N.L and N.H. So let’s try creating a 128×128 RGBA lookup texture and using that.
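As a rough sketch of the lookup idea, the fragment shader could replace the max() and pow() calls with a single texture fetch. Several things here are assumptions made for illustration: the sampler name u_texbrdf is hypothetical, the texture is assumed to store the diffuse term in RGB and the specular term (with the exponent already baked in) in alpha, and it is assumed to be bound with CLAMP_TO_EDGE wrapping.

```glsl
#ifdef FRAGMENT
uniform lowp vec4 u_lightcolor;
uniform lowp vec4 u_matcolor;
varying mediump vec2 v_uv;
varying lowp vec3 v_lightdir;
varying mediump vec3 v_viewdir;
uniform sampler2D u_texcolor;
uniform sampler2D u_texnormal;
uniform sampler2D u_texbrdf; // hypothetical name: 128x128 RGBA lookup texture
void main()
{
	lowp vec4 albedo = texture2D (u_texcolor, v_uv) * u_matcolor;
	lowp vec3 normal = texture2D (u_texnormal, v_uv).rgb * 2.0 - 1.0;
	mediump vec3 halfdir = normalize (normalize(v_lightdir) + normalize(v_viewdir));
	mediump float nl = dot (normal, v_lightdir);
	mediump float nh = dot (normal, halfdir);
	// Assumed layout: RGB = diffuse term, A = specular term.
	// With CLAMP_TO_EDGE, negative coordinates read the edge texel,
	// which stands in for the max (0.0, ...) of the original shader.
	lowp vec4 l = texture2D (u_texbrdf, vec2 (nl, nh));
	lowp vec4 c = albedo * u_lightcolor * vec4 (l.rgb, 1.0) + u_lightcolor * l.a;
	gl_FragColor = c;
}
#endif
```

With the exponent baked into the texture, the u_spec uniform is no longer needed; changing shininess would mean regenerating or swapping the lookup texture.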