My new side project with DirectX 11 is something that has been bothering me for a while: chaining compute shaders with variable-length output. Chaining compute shaders that output the same amount of work is trivial; in fact, you’d likely just merge the code into one uber compute shader.
One thing shader authors routinely take advantage of is that they can do some simple work in the vertex shader to filter out data points by placing them outside the frustum. That gives them a cheap and effective way to remove data that would otherwise go on to the pixel shader, or show up as a branch if this were all happening in a single shader. By splitting the work across multiple stages, the shader cores are all able to stay busy instead of several of them idling in a discard branch.
Which brings me to the main question of this blog post,
How can I chain compute shaders to consume variable amounts of data to duplicate this same staging capability?
So here is our scenario: you have a FilterComputeShader (FilterCS) and a HardWorkComputeShader (HardWorkCS). The FilterCS uses an AppendStructuredBuffer to output some amount of data less than or equal to the incoming amount of data. You place one or more branch statements in your FilterCS so that, by the time you get to the HardWorkCS, you’ve got all your shader cores being utilized and no branching there (hopefully).
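For concreteness, here is roughly what the host-side setup for that append buffer might look like. This is just a sketch, not the project’s actual code: the element struct, the capacity, and the names (device, context, filteredBuffer, filteredUAV) are all placeholder assumptions, and I’m assuming you already have an ID3D11Device* device and ID3D11DeviceContext* context.

```
// Sketch: a structured buffer with an append UAV for FilterCS's output.
struct FilteredPoint { float position[3]; float payload; }; // placeholder element type
const UINT kMaxElements = 4096;                             // placeholder capacity

D3D11_BUFFER_DESC bufDesc = {};
bufDesc.ByteWidth           = sizeof(FilteredPoint) * kMaxElements;
bufDesc.Usage               = D3D11_USAGE_DEFAULT;
bufDesc.BindFlags           = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
bufDesc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bufDesc.StructureByteStride = sizeof(FilteredPoint);

ID3D11Buffer* filteredBuffer = nullptr;
device->CreateBuffer(&bufDesc, nullptr, &filteredBuffer);

// The UAV needs the APPEND flag so the HLSL AppendStructuredBuffer gets the
// hidden counter that Append() increments.
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format             = DXGI_FORMAT_UNKNOWN;
uavDesc.ViewDimension      = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.NumElements = kMaxElements;
uavDesc.Buffer.Flags       = D3D11_BUFFER_UAV_FLAG_APPEND;

ID3D11UnorderedAccessView* filteredUAV = nullptr;
device->CreateUnorderedAccessView(filteredBuffer, &uavDesc, &filteredUAV);

// When binding for the FilterCS dispatch, reset the hidden counter to 0.
const UINT initialCount = 0;
context->CSSetUnorderedAccessViews(0, 1, &filteredUAV, &initialCount);
```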
After appending some amount of data onto this buffer, how will we know how many threads or thread groups to spawn after the FilterCS runs?
ID3D11DeviceContext::CopyStructureCount
You can use it to copy the count from the append buffer into some other buffer on the GPU. You could then read that value back from the GPU and use it in another call to Dispatch, but forcing a sync point between the CPU and GPU isn’t a good idea. Which is why there is this:
ID3D11DeviceContext::DispatchIndirect
You can pass the same buffer you just copied the append buffer’s count into as input to this function, which then spawns that many thread groups.
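The argument buffer for DispatchIndirect is just three UINTs (the X, Y, and Z thread-group counts), and it has to be created with the D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS flag. Here is a rough sketch of the naive version, again with made-up names (indirectArgsBuffer, filteredUAV, device, context):

```
// Three UINTs: ThreadGroupCountX, Y, Z. X gets overwritten on the GPU;
// Y and Z stay at 1.
const UINT initialArgs[3] = { 0, 1, 1 };

D3D11_BUFFER_DESC argsDesc = {};
argsDesc.ByteWidth = sizeof(initialArgs);
argsDesc.Usage     = D3D11_USAGE_DEFAULT;
argsDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;           // so a compute shader can write it
argsDesc.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS; // required for *Indirect calls

D3D11_SUBRESOURCE_DATA initData = {};
initData.pSysMem = initialArgs;

ID3D11Buffer* indirectArgsBuffer = nullptr;
device->CreateBuffer(&argsDesc, &initData, &indirectArgsBuffer);

// After FilterCS has run: copy the append buffer's hidden counter into
// ThreadGroupCountX, then launch HardWorkCS with that many groups.
context->CopyStructureCount(indirectArgsBuffer, 0, filteredUAV);
context->DispatchIndirect(indirectArgsBuffer, 0);
```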
For this to work, your compute shaders would all need to be defined with [numthreads(1, 1, 1)]. That way the number of threads matches the number of thread groups.
However, part of utilizing our shader cores effectively will be making sure we aren’t defining our compute shaders with [numthreads(1, 1, 1)]. Ideally the number of threads matches or is some multiple of the size of the GPU’s wavefront. If you don’t do this you’ll be vastly underutilizing the GPU.
Crap…
So how do we get around this without adding a sync point between the CPU and the GPU?
Another compute shader! :D
```
// NumThreadGroupSize must match the X dimension of HardWorkCS's [numthreads].
// (Assumed here to come in as a compile-time define; 64 is a placeholder.)
#define NumThreadGroupSize 64

RWBuffer<uint> dispatchBuff : register(u0);

[numthreads(1, 1, 1)]
void UpdateIndirectDispatchBuffer()
{
    // Round the element count up to a whole number of thread groups.
    // Integer round-up instead of ceil(): dispatchBuff[0] is a uint, so a
    // plain division would truncate before ceil() ever saw a fraction.
    dispatchBuff[0] = (dispatchBuff[0] + NumThreadGroupSize - 1) / NumThreadGroupSize;
}
```
I know I said [numthreads(1, 1, 1)] was bad, but here it makes sense: we’re only spawning a single thread group and thread to make this quick fixup to the buffer we’re going to use for DispatchIndirect. Instead of telling it the number of threads to dispatch, we tell it the number of thread groups to dispatch, based on the number of threads per thread group in the x dimension.
So now the code flow looks like this (a rough host-side sketch follows the list):
- Dispatch(FilterCS);
- CopyStructureCount();
- Dispatch(UpdateIndirectDispatchBuffer);
- DispatchIndirect(HardWorkCS);
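Strung together on the CPU side, that might look something like the following. All the names (filterCS, updateIndirectDispatchBufferCS, hardWorkCS, the buffers and UAVs from the earlier sketches, inputGroupCountX) are placeholders for whatever the real project ends up using.

```
// Rough host-side sequence, using the placeholder names from the sketches
// above. indirectArgsUAV is an R32_UINT UAV over indirectArgsBuffer
// (creation not shown). Note there is no CPU/GPU sync point anywhere.
const UINT resetCount = 0;         // zero the append buffer's hidden counter
const UINT keepCount  = (UINT)-1;  // -1 = leave any existing counter untouched
const UINT inputGroupCountX = 256; // however many groups the unfiltered input needs

// 1. FilterCS: read the full input, Append() only the survivors.
context->CSSetShader(filterCS, nullptr, 0);
context->CSSetUnorderedAccessViews(0, 1, &filteredUAV, &resetCount);
context->Dispatch(inputGroupCountX, 1, 1);

// 2. Copy the append buffer's hidden counter into ThreadGroupCountX.
context->CopyStructureCount(indirectArgsBuffer, 0, filteredUAV);

// 3. Fixup CS: convert "surviving element count" into "thread group count".
context->CSSetShader(updateIndirectDispatchBufferCS, nullptr, 0);
context->CSSetUnorderedAccessViews(0, 1, &indirectArgsUAV, &keepCount);
context->Dispatch(1, 1, 1);

// 4. Unbind the args UAV from u0 before the same buffer is read as indirect
//    arguments, then launch HardWorkCS: one thread per surviving element,
//    rounded up to whole groups. (HardWorkCS's own bindings are omitted.)
ID3D11UnorderedAccessView* nullUAV = nullptr;
context->CSSetUnorderedAccessViews(0, 1, &nullUAV, &keepCount);
context->CSSetShader(hardWorkCS, nullptr, 0);
context->DispatchIndirect(indirectArgsBuffer, 0);
```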
I haven’t finished the side project this is part of, so I don’t know yet whether the overhead of the AppendStructuredBuffer plus two additional Dispatch calls actually beats doing it all in a single compute shader with a branch. My current assumption is that this method will be far superior in cases where you’re filtering.
If the branch were simply choosing between hard work X and hard work Y, there’s obviously no win. But given the choice of hard work X or no work, your best bet is to handle each stage in a separate Dispatch, or so I think. The performance results of the side project this is part of will probably make for a good post.