This week I would like to give a brain dump about shader compilers. The word shader is quite ambiguous, and can refer to either an algorithm that implements a BRDF/BSDF, or to a program that’s executed by a graphics API (OpenGL/DirectX) on graphics hardware. The article title refers to the former, even though the latter is still massively important.
The canonical work on shaders is Shade Trees by Cook[pdf]. It describes a user-supplied AST for shading calculations, rather than a preset collection. Shaders are supplied as source code, and compiled by a compiler with special intrinsics for graphics (dot(), specular(),…).
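To make the "AST for shading calculations" idea concrete, here is a minimal sketch of a shade tree in C++. It is not Cook's system: the node types (`Constant`, `Diffuse`, `Multiply`), the `ShadingContext` struct and the `make_half_diffuse` helper are all hypothetical names invented for illustration, and a real system would parse the tree from shader source rather than build it by hand.

```cpp
#include <algorithm>
#include <memory>
#include <utility>

// Hypothetical minimal vector type, for illustration only.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Inputs available to every node at one shading point (assumed layout).
struct ShadingContext {
    Vec3 normal;    // unit surface normal
    Vec3 to_light;  // unit vector pointing toward the light
};

// A node of the tree: leaves produce values, inner nodes combine children.
struct Node {
    virtual ~Node() = default;
    virtual float eval(const ShadingContext& ctx) const = 0;
};

struct Constant : Node {
    float value;
    explicit Constant(float v) : value(v) {}
    float eval(const ShadingContext&) const override { return value; }
};

// Lambert term: max(0, N . L).
struct Diffuse : Node {
    float eval(const ShadingContext& ctx) const override {
        return std::max(0.0f, dot(ctx.normal, ctx.to_light));
    }
};

struct Multiply : Node {
    std::unique_ptr<Node> lhs, rhs;
    Multiply(std::unique_ptr<Node> a, std::unique_ptr<Node> b)
        : lhs(std::move(a)), rhs(std::move(b)) {}
    float eval(const ShadingContext& ctx) const override {
        return lhs->eval(ctx) * rhs->eval(ctx);
    }
};

// Builds the user-supplied tree "0.5 * diffuse".
inline std::unique_ptr<Node> make_half_diffuse() {
    return std::make_unique<Multiply>(std::make_unique<Constant>(0.5f),
                                      std::make_unique<Diffuse>());
}
```

The point of the sketch is that the renderer only knows about `Node::eval`; which nodes exist and how they are wired together is up to the user.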
Building Block Shaders gives some hints for implementing shade trees.
Specializing Shaders presents optimizations of shaders written in a C-like language targeting the CPU, with details on caching and precomputation of uniforms. Ironically, a lot of these things came back later with DirectX’s preshaders, precalculation of LUTs, … in the SM 2.0 era.
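The "precomputation of uniforms" idea can be sketched roughly as follows. This is not the paper's algorithm or DirectX's preshader mechanism, just a hand-written illustration of the transformation they automate: any subexpression that depends only on uniforms is hoisted out of the per-pixel code and evaluated once. The `Uniforms` struct and all function names are assumptions made up for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical uniforms: constant for every pixel of one draw.
struct Uniforms {
    float time;
    float gain;
};

// Naive version: the uniform-only subexpression is recomputed per pixel.
inline void shade_naive(const Uniforms& u, const std::vector<float>& albedo,
                        std::vector<float>& out) {
    out.resize(albedo.size());
    for (std::size_t i = 0; i < albedo.size(); ++i) {
        out[i] = albedo[i] * (u.gain * (0.5f + 0.5f * std::sin(u.time)));
    }
}

// The "preshader" part: depends on uniforms only, so it runs once per draw.
inline float precompute_pulse(const Uniforms& u) {
    return u.gain * (0.5f + 0.5f * std::sin(u.time));
}

// Specialized version: the per-pixel work shrinks to a single multiply.
inline void shade_specialized(const Uniforms& u,
                              const std::vector<float>& albedo,
                              std::vector<float>& out) {
    const float pulse = precompute_pulse(u);
    out.resize(albedo.size());
    for (std::size_t i = 0; i < albedo.size(); ++i) {
        out[i] = albedo[i] * pulse;
    }
}
```

A specializing compiler would find `precompute_pulse` by itself, by tracking which values in the shader depend on per-pixel inputs and which do not.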
Abstract shade trees presents a system that “weaves” independent code fragments implementing different systems (shadowing, shading) into a runnable fragment program. It has a node/graph interface.
Shader Metaprogramming describes a high-level system that emits low-level GPU code. For instance, if/else can be emulated by multiplication by 0 on platforms that do not support it natively.
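A minimal sketch of that if/else emulation, written in plain C++ rather than in the paper's system (the function names are hypothetical): the condition becomes a 0/1 mask, both sides are evaluated, and the mask selects the result.

```cpp
// Reference version with a real branch.
inline float shade_branchy(float n_dot_l, float lit, float shadowed) {
    if (n_dot_l > 0.0f) {
        return lit;
    }
    return shadowed;
}

// Branch-free version: the comparison yields 1.0f or 0.0f, and the
// side that was "not taken" is multiplied by 0.
inline float shade_select(float n_dot_l, float lit, float shadowed) {
    const float mask = static_cast<float>(n_dot_l > 0.0f);
    return mask * lit + (1.0f - mask) * shadowed;
}
```

Note the cost of the trick: both sides are always computed, and a NaN or infinity on the unused side would still leak into the result, since 0 times infinity is NaN.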
Libsh was a metaprogramming environment for shaders that could generate GPU programs for different hardware (assembly-level GPU code at that time).
User-Configurable Automatic Shader Simplification generates simpler, less accurate versions of fragment programs for LODing by transforming the original AST. The important thing for me is that we are also interested in optimizations that produce not “exactly the same output” but “perceptually similar but faster” output, which violates an important presupposition of compiler optimization (to good effect).
Frostbite terrain rendering[pdf] shows an ad hoc approach to generating GPU programs, in this case for terrain layers. It is significantly different from ubershaders, as the generation was driven from C++ code rather than from a driver preprocessor.
The entire SIGGRAPH 2011 Course: Compiler Technology for Rendering is very valuable.
ispc is a low-level SPMD (single program, multiple data) compiler built on LLVM. The assumption that a function runs many times on adjacent data allows for better vectorization.
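The SPMD assumption can be imitated in plain C++ to show why it helps. This is not ispc syntax and not how ispc is implemented; `program`, `launch` and `kGangSize` are hypothetical names. The "program" is written as if for one element, and the launcher runs it over fixed-size gangs of adjacent elements, which gives a compiler a fixed-trip inner loop over contiguous memory that it can map to SIMD lanes.

```cpp
#include <cstddef>

// The "single program": written as if it handled exactly one element.
inline float program(float x, float scale) {
    return x * scale + 1.0f;
}

// Assumed gang width; a real compiler would pick it from the target's
// SIMD register width.
constexpr std::size_t kGangSize = 8;

// Runs `program` for `count` adjacent elements.
inline void launch(const float* in, float* out, std::size_t count,
                   float scale) {
    std::size_t base = 0;
    // Full gangs: every lane does the same work on neighbouring data.
    for (; base + kGangSize <= count; base += kGangSize) {
        for (std::size_t lane = 0; lane < kGangSize; ++lane) {
            out[base + lane] = program(in[base + lane], scale);
        }
    }
    // Remainder: leftover elements that do not fill a whole gang.
    for (; base < count; ++base) {
        out[base] = program(in[base], scale);
    }
}
```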
LunarGLASS shows reuse of compiler optimizations for different source languages (GLSL, ?…) to different backends (GLSL, CPU).
Ogre3D has shader generation capabilities.
RenderAnts, the GPU Reyes renderer, has to use shader analysis to compute memory requirements for temporary storage.
Rant
The main problem seems to be that we want to write a lot of functions “that will run a bunch of times on a lot of data” on different hardware (GPU vs CPU, cache line sizes, scalar vs vector…). One problem is that the data layout is highly dependent on how we want to call the functions, and we have to reference that layout in the code of our functions. The whole Data Driven movement shows us how important this stuff is for speed. Sometimes we want Arrays of Structs (AoS), sometimes Structs of Arrays (SoA), depending on complex criteria such as branch prediction and data access patterns (“hot” vs “cold” parts of the data). And of course we want things like “oh, you can just cram a normalized vector into two 16-bit values by storing only z’s sign, but don’t bother with that if we have an unused extra 16 bits lying around somewhere”, or “well, right now we want AABBs defined by min/max, later by centre/half-dimensions, maybe. Only here though.” If we use C/C++, this data problem translates into code problems: AoS can use SIMD on some parts of the code, while SoA uses a directly translated SIMD version that runs multiple scalar paths in parallel (ignoring branches), and for the GPU we need something else entirely.
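A small sketch of how the layout leaks into the code, using a made-up particle type (all names are hypothetical): the two functions compute the same thing, but each has to be written against one specific layout.

```cpp
#include <cstddef>
#include <vector>

// Array of Structs: each particle's fields sit next to each other.
struct ParticleAoS {
    float x, y, z, mass;
};

// Struct of Arrays: each field is stored contiguously across particles.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Converts one layout into the other.
inline ParticlesSoA to_soa(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.mass.reserve(aos.size());
    for (const ParticleAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.mass.push_back(p.mass);
    }
    return soa;
}

// AoS version: reads 4 bytes out of every 16, skipping x, y and z.
inline float total_mass_aos(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const ParticleAoS& p : particles) {
        sum += p.mass;
    }
    return sum;
}

// SoA version: reads one dense array, which is friendlier to caches and SIMD.
inline float total_mass_soa(const ParticlesSoA& particles) {
    float sum = 0.0f;
    for (float m : particles.mass) {
        sum += m;
    }
    return sum;
}
```

Switching layouts means rewriting every function like `total_mass_*` by hand, which is exactly the kind of mechanical transformation one would rather leave to a compiler.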
To me, it seems that C/C++/GLSL is a pretty poor match for what we’re trying to do, and we cope with it because at the very least it allows us to access the low-level details and manage them ourselves. What really surprised me is that one of the big advantages of deferred rendering is a nice decoupling of shader responsibilities: e.g. there’s a GPU program for animating and storing normals, a GPU program for evaluating lights, and a GPU program for evaluating BRDFs. Before that, *shudder*, we had ubershaders. Of course, this was not the reason for choosing deferred rendering (you can easily see it as a form of early z without rendering geometry twice, as it only evaluates lighting and BRDFs for visible pixels, or as a way to avoid writing any code that checks which lights affect an object), but it’s a damn nice property in hindsight.
Of course, there’s no silver bullet. It seems a lot of people are exploring how we can run shaders through a compiler (LLVM being a winner here) and “magically” transform them into something we want (e.g. AoS vs SoA). I think this works well for 2 reasons:
- The compiler (and optional runtime) are limited in scope because the problem is very domain specific. What we’re saying (very simplified) is: this function takes some uniforms, some values interpolated from vertices, might access some well-defined data (texture), might call some intrinsic functions with no side effects, and writes 32 bytes. It doesn’t write any other data, and no other thread will stomp around the memory it’s trying to access.
- Well, the shader is written in something *like* C, so please: parse it, run some target-independent optimizations, transform the AST into the form that uses the data as we want it stored (and maybe do some other stuff that inspects the shader’s AST and reasons about it), run the target-dependent optimizations, generate code and run it. Thank you LLVM for not letting me code any of this myself, except for the parts that apply to my particular case.
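The restricted contract from the first point can be written down as a C++ signature. This is only an illustration with hypothetical types (`Uniforms`, `Varyings`, `Texture`, `Output`), not any real API: everything the shader reads arrives through const references, and the only thing it produces is a fixed 32-byte value.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Uniforms { float exposure; };  // constant across one draw
struct Varyings { float u, v; };      // interpolated per pixel

// Read-only, well-defined data: a single-channel texture.
struct Texture {
    int width;
    int height;
    std::vector<float> texels;  // row-major, width * height entries

    // Nearest-neighbour lookup, clamped to the texture edges.
    float fetch(float u, float v) const {
        const int x = std::min(std::max(static_cast<int>(u * width), 0),
                               width - 1);
        const int y = std::min(std::max(static_cast<int>(v * height), 0),
                               height - 1);
        return texels[static_cast<std::size_t>(y) *
                          static_cast<std::size_t>(width) +
                      static_cast<std::size_t>(x)];
    }
};

// The only thing one invocation writes.
struct Output { float values[8]; };
static_assert(sizeof(Output) == 32, "a shader invocation writes 32 bytes");

// No globals, no writes through pointers, no side effects: the result
// depends only on the arguments.
inline Output shade(const Uniforms& uniforms, const Varyings& varyings,
                    const Texture& texture) {
    const float c = texture.fetch(varyings.u, varyings.v) * uniforms.exposure;
    Output out{};  // zero-initialize all 8 floats
    out.values[0] = c;
    out.values[1] = c;
    out.values[2] = c;
    out.values[3] = 1.0f;
    return out;
}
```

Because `shade` cannot observe how or in what order it is called, whatever drives it is free to reorder invocations, batch them, or change how inputs and outputs are stored.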
I’m mostly interested in playing with this problem in the context of Domain Specific Languages and Active Libraries with LLVM. Expect a series of posts on #AltDevBlogADay in the nearish future.
This post is rife with mistakes and omissions, feel free to comment about them, or contact me directly, and they will be corrected.
This was supposed to be the second post on loose octrees, but because I’m not happy with it yet, it will be postponed another 2 weeks.