Recently, I have been working on a vector math library for Dart. Boringly, I named it Dart Vector Math. The latest version can be found on GitHub. My two biggest goals for Dart Vector Math are the following:

  • Near-100% GLSL-compatible syntax, including the awesome vector shuffle syntax and flexible construction of vectors and matrices.
  • Performance in terms of both CPU time and memory usage / garbage collection load.

Aside from a couple of quirks, Dart Vector Math is GLSL syntax compatible. It is possible to copy and paste GLSL code into Dart and, after a couple of tweaks, have it compile with Dart Vector Math. This makes debugging shader code easy.

Since Dart is a garbage-collected language, being optimal in terms of space means avoiding the creation of lots of short-lived objects. To facilitate that, Dart Vector Math offers many functions that work directly on already allocated vectors and matrices.

This weekend I started to look at CPU performance of Dart Vector Math versus glMatrix.dart (a Dart port of glMatrix, the current champ of JavaScript vector math libraries). The initial results are heavily in favour of Dart Vector Math:

Matrix Multiplication
Avg: 14.59 ms Min: 10.161 ms Max: 22.927 ms (Avg: 14590 Min: 10161 Max: 22927)

Matrix Multiplication glmatrix.dart
Avg: 283.353 ms Min: 272.062 ms Max: 287.988 ms (Avg: 283353 Min: 272062 Max: 287988)

mat4x4 inverse
Avg: 28.289 ms Min: 21.019 ms Max: 34.891 ms (Avg: 28289 Min: 21019 Max: 34891)

mat4x4 inverse glmatrix.dart
Avg: 318.909 ms Min: 315.435 ms Max: 325.831 ms (Avg: 318909 Min: 315435 Max: 325831)

vector transform
Avg: 4.324 ms Min: 2.811 ms Max: 14.859 ms (Avg: 4324 Min: 2811 Max: 14859)

vector transform glmatrix.dart
Avg: 144.431 ms Min: 138.263 ms Max: 153.798 ms (Avg: 144431 Min: 138263 Max: 153798)

The code for 4×4 matrix multiplication in Dart Vector Math and glMatrix is practically identical, so on closer inspection the above numbers didn't make much sense. There is one key difference: Dart Vector Math uses a native Dart object to store the matrix, while glMatrix uses a Float32Array as storage. Digging into the disassembly, I discovered that indexing into a Float32Array is a slow path in the VM right now, skewing the results against glMatrix.dart. Not that big of a deal; Dart is a new language and the VM needs time to mature.

Once the performance issue with Float32Arrays is fixed, I want Dart Vector Math to use them, for two reasons. First, they take up 50% less space (single- vs. double-precision floats). Second, WebGL needs Float32Arrays for uniform data, which means the matrix is eventually going to end up inside a Float32Array anyway; might as well keep it in one the whole time. There is no CPU performance benefit from using a Float32Array as storage, because every operation results in the floats being promoted to doubles, operated on, and then stored back as floats.

My intention to move to Float32Array got me thinking, and I ended up asking myself: why doesn't the browser offer an API for common vector math operations on Float32Arrays, implemented efficiently with SIMD instruction sets? I am not sure why it is not offered, but I ended up spending the weekend implementing one for the Dart VM.

The API follows:

class SimdFloat32Array {
   static matrix4Inv(Float32List dst, int dstIndex, Float32List src, int srcIndex, int count);
   static matrix4Mult(Float32List dst, int dstIndex, Float32List a, int aIndex, Float32List b, int bIndex, int count);
   static transform(Float32List M, int Mindex, Float32List v, int vIndex, int vCount);
}

I do not want anyone to get hung up on the specific API or naming convention (let’s avoid bikeshedding). My three biggest goals for this API are the following:

  • Offer the important operations used by vector math libraries
  • Operate directly on floats instead of promoting to doubles
  • Design for bulk processing

So far I have exposed three of the important operations, but there are many more. Each of these functions is backed by an SSE implementation that operates directly on the Float32Array data. Notice that each method takes a count parameter; this allows a single call to do bulk work.
The results of my implementation were very encouraging:

Matrix Multiplication SIMD
Avg: 8.702 ms Min: 8.475 ms Max: 9.217 ms (Avg: 8702 Min: 8475 Max: 9217)

mat4x4 inverse SIMD
Avg: 7.107 ms Min: 6.89 ms Max: 7.754 ms (Avg: 7107 Min: 6890 Max: 7754)

vector transform SIMD
Avg: 6.415 ms Min: 6.204 ms Max: 7.006 ms (Avg: 6415 Min: 6204 Max: 7006)

Aside from the vector transform operation (I think my SSE vector transform code is just slow), I got speedups between 2x and 4x over Dart Vector Math.

Does this have legs? I hope so, but it’s not my call. If you see value in exposing this acceleration architecture into the browser, speak up.

Anticipating some questions:

What about JavaScript? The API would be easy to expose in JavaScript.

What about hardware without SIMD instruction sets? Probably not an issue since ARM, x86, and PPC have excellent SIMD instruction sets. Other platforms can implement the API using scalar floating point instructions.

What about other browsers? Again, this API would be easy to expose if it gained support.

Fast vector math operations are a requirement if we are going to start writing amazing games in the browser, and I hope my proposal can help make that possible.