OpenCL interop tutorial: update the world transform of instanced meshes on the GPU

Instapaper Text

OpenCL interop tutorial: update the world transform of instanced meshes on the GPU

Previously I blogged here about convex collision detection using the separating axis test and Sutherland Hodgeman clipping. In a few upcoming blog postings I want to discuss how to perform the collision detection and simple particle/rigid body dynamics on the GPU for fun.

The result of rigid body dynamics is a world transform for each object: the position and orientation. The world position and the orientation (represented as a quaternion) can both be stored as a float4/Vector4 when we use 16-byte alignment.

We could render individual objects on the GPU by making calls to the OpenGL or Direct3D API from the CPU side, updating the world transform for each object and rendering them one by one.

If we run the particle or rigid body simulation on the GPU, this means that we have to transfer the world transform from GPU to main memory, making the calls on CPU, which transfers those same transforms back to GPU. There is a more efficient way to do this, and this is called OpenCL interop: sharing a single buffer between graphics API and OpenCL. DirectCompute and Direct3D has a similar feature, but in this article we use OpenCL and OpenGL.

Preparation:

Store all the transforms in a Vertex Buffer Object (VBO) on the GPU
A GLSL shader that renders instanced meshes using the transforms from this VBO
Initialize OpenCL

During the rendering of each frame:

call glFinish to make sure the GPU doesn’t access the VBO
Acquire the OpenGL VBO as an OpenCL buffer
Execute some OpenCL kernels that compute and update the transforms in this buffer
Release the OpenCL buffer back to OpenGL
Render all instances using a single call using glDrawElementsInstanced

I’ll briefly go over important bits in the C/C++ implementation for the OpenCL/GL interop, there is link at the end of this article where you can download all code.

Vertex Buffer Object

All instances share the same mesh data, a vertex buffer and index buffer, while each of them has its own transform. For this example I store all this data in a single VBO with a memory layout as follows:

The code to initialize this VBO is

glGenBuffers(1, &cube_vbo);
 
  glBindBuffer(GL_ARRAY_BUFFER, cube_vbo);
 
  char* dest=  (char*)glMapBuffer( GL_ARRAY_BUFFER,GL_WRITE_ONLY);
 
  memcpy(dest,cube_vertices,sizeof(cube_vertices));
 
  float* positions = (float*)(dest+sizeof(cube_vertices));
 
  float* orientations = (float*)(dest+sizeof(cube_vertices) + POSITION_BUFFER_SIZE);
 
  //initialize the positions/orientations
 
  glUnmapBuffer( GL_ARRAY_BUFFER);

GLSL vertex shader

For the GLSL vertex shader we need to apply the world transform, position and quaternion. As we use a single VBO and the positions and orientations are stored after the vertices, we need to specify the memory layout so the shader know where to find the position/orientation. For OpenGL this offset can be passed on using glVertexAttribPointer and layout. For OpenCL you can pass this offset as a kernel parameter.

There is a separate project in the solution called GLSL_instancing, to test this instancing. The implementation is in InitShaders();

Initializing OpenCL

Initializing OpenCL happens as usual, except for interop we need to pass the current OpenGL context:

glCtx = wglGetCurrentContext();
 
  glDC = wglGetCurrentDC();
 
   
 
  cl_context_properties cps[7] = {0,0,0,0,0,0,0};
 
  cps[0] = CL_CONTEXT_PLATFORM;
 
  cps[1] = (cl_context_properties)platform;
 
  cps[2] = CL_GL_CONTEXT_KHR;
 
  cps[3] = (cl_context_properties)pGLContext;
 
  cps[4] = CL_WGL_HDC_KHR;
 
  cps[5] = (cl_context_properties)pGLDC;
 
  int ciErrNum;
 
  cl_context retContext = clCreateContextFromType(cps, CL_DEVICE_TYPE_GPU,NULL,NULL,&ciErrNum);

There is a separate project called OpenCL_initialize to show the initialization. In the OpenCL_GL_interop project, the initialization happens in the InitCL method.

Acquiring and releasing the buffer

For each frame, before rendering all the instances, you can acquire the graphics buffer, run the OpenCL kernels and release the buffer with the following code snippet:

glFinish();
 
  cl_mem clBuffer = g_interopBuffer->getCLBUffer();
 
  cl_int ciErrNum = CL_SUCCESS;
 
  ciErrNum = clEnqueueAcquireGLObjects(g_cqCommandQue, 1, &clBuffer, 0, 0, NULL);
 
  oclCHECKERROR(ciErrNum, CL_SUCCESS);
 
  int numObjects = NUM_OBJECTS;
 
  int offset = (sizeof(cube_vertices) )/4;
 
  ciErrNum = clSetKernelArg(g_interopKernel, 0, sizeof(int), &offset);
 
  ciErrNum = clSetKernelArg(g_interopKernel, 1, sizeof(int), &numObjects);
 
  ciErrNum = clSetKernelArg(g_interopKernel, 2, sizeof(cl_mem), (void*)&clBuffer );
 
  size_t	numWorkItems = workGroupSize*((NUM_OBJECTS + (workGroupSize-1)) / workGroupSize);
 
  ciErrNum = clEnqueueNDRangeKernel(g_cqCommandQue, g_interopKernel, 1, NULL, &numWorkItems, &workGroupSize,0 ,0 ,0);
 
  oclCHECKERROR(ciErrNum, CL_SUCCESS);
 
  ciErrNum = clEnqueueReleaseGLObjects(g_cqCommandQue, 1, &clBuffer, 0, 0, 0);
 
  oclCHECKERROR(ciErrNum, CL_SUCCESS);
 
  clFinish(g_cqCommandQue);

Note that there is a more efficient way of synchonizing the buffers instead of using glFinish and clFinish using the ARB_cl_event extension. I’ll implement and talk about that another time.

The OpenCL Kernel

The OpenCL kernel for this example is extremely simple, it just adds 0.01 to the y component of all positions:

__kernel void interopKernel( const int startOffset, const int numNodes, __global float *g_vertexBuffer)
 
  {
 
  	int nodeID = get_global_id(0);
 
  	if( nodeID < numNodes )
 
  	{
 
  		g_vertexBuffer[nodeID*4 + startOffset+1] += 0.01;
 
  	}
 
  }

This will be replaced by particle or simple rigid body dynamics with collision detection in an upcoming posting.

The screenshot at the beginning of the article is taken from the sample code: a OpenGL/glut demo that can render 125.000 instanced cubes in realtime, around 70 frames per second on my laptop with Radeon 6570M with GPU drivers supporting OpenCL 1.1. When increasing this to 1 million cubes it still runs at ‘interactive’ rates of a few frames per second. The demo also works on my NVIDIA GTX 260 desktop with OpenCL 1.0.

Download

You can download the full source code and precompiled executables from my experiments repo at github. The code is available under the permissive zlib license. To build it, click on buildvs2008.bat to generate Visual Studio 2008 project files using premake4 and then open the buildmsvc2008MySolution.sln. The glut demo is only tested on Windows 7 with Visual Studio. Feel free to contribute a port for your favorite OpenCL platform, such as Linux or Mac OSX.

#AltDevBlogADay

Erwin Coumans

OpenCL interop tutorial: update the world transform of instanced meshes on the GPU