In this generation of consoles, motion controls have come to prominence. The most ambitious implementation is Microsoft’s Kinect. Marketed with the phrase “You are the controller,” its aim is to free us from the tyranny of the button by providing a more natural way to interact with our entertainment.

Using a camera as an input device creates some interesting new opportunities for developers. Acting out context-sensitive movements, like climbing a rope, can add an additional layer of immersion to a game. It also introduces new challenges: the lack of tactile feedback can make the fact that you’re miming movements even more apparent.

For my first post at #AltDev I will walk through creating a simple interaction using a camera with only the image data. By this I mean I’m working purely in 2D, without accounting for depth or attempting to build a skeleton. Computer vision is a rather hefty subject, so I’ll be glossing over the finer points and treating this as a primer to the material. The main goal is to talk about how the interaction is created rather than delving into vision algorithms, which I can go into in a future post, so let me know in the comments if that’s something you’re interested in.

So with that said, let’s begin.

Determining the Foreground

So you’re sitting at your desk reading this article. You peer over your monitor and see a coworker standing off in the distance. How did you know he wasn’t there the entire time? Well, the last time you looked he wasn’t there, so he must have shown up sometime between then and now.

Background subtraction is a camera applying this same logic. The gist of the operation is entirely in the name. If you take an image of a space with no people in it, you have your background. You then compare this image to the current image, and any difference between the two indicates that something has appeared in the foreground. How that difference is determined depends on the algorithm used, as implementations vary in their speed and accuracy.

Example background subtraction

After background subtraction is applied, the results need to be filtered. Small changes can be ignored by thresholding the differences, and eroding and dilating the image can clean up any remaining noise. Once complete, we have an image containing the people in view of the camera.
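Here’s a minimal sketch of the idea using OpenCV in Python. The background image, threshold value, and kernel size are all illustrative assumptions; a real system would want a background model that adapts to lighting changes.

```python
import cv2

# Grab a shot of the empty scene once at startup; this is our background.
# ("background.png" is a placeholder; averaging several frames works better.)
background = cv2.cvtColor(cv2.imread("background.png"), cv2.COLOR_BGR2GRAY)

def foreground_mask(frame_bgr, diff_threshold=25):
    """Binary mask of the pixels that differ from the stored background."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    # Per-pixel absolute difference against the background image.
    diff = cv2.absdiff(gray, background)

    # Threshold away small changes (sensor noise, slight lighting shifts).
    _, mask = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)

    # Erode to kill speckles, then dilate to fill small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.erode(mask, kernel)
    mask = cv2.dilate(mask, kernel)
    return mask
```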

Collision Detection

Imagine you have a cup of coffee in front of you (feel free to substitute your beverage of choice). The catch is that every time you reach for it, your hand passes right through it. How would that make you feel? Frustrated, to say the least.

This isn’t something you’d expect to happen in the real world, and those expectations carry forward into the virtual one. If you can’t interact with the world, you don’t feel like you’re a part of it. You’re essentially a ghost at that point.

Adding collision detection and response grounds the person in the world. The actual implementation of the collision system is beyond the scope of this article, but whatever form it takes, it shouldn’t allow an object to enter the silhouette of the person.
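One simple approach (an assumption on my part, not the only way to do it) is to treat the foreground mask itself as the collider: before moving an object, test whether its new position would overlap any silhouette pixels.

```python
import numpy as np

def circle_overlaps_mask(mask, cx, cy, radius):
    """True if any silhouette (foreground) pixel lies inside the ball's circle."""
    h, w = mask.shape
    ys, xs = np.ogrid[:h, :w]
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    return bool(mask[inside].any())

def step_ball(mask, pos, vel, radius, dt):
    """Integrate the ball; a crude response just refuses moves into the silhouette."""
    new_pos = (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)
    if circle_overlaps_mask(mask, int(new_pos[0]), int(new_pos[1]), radius):
        return pos  # blocked, so the ball can rest on the person
    return new_pos
```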

At this point an interaction like this should be possible, with the person using their silhouette to manipulate the object.

Holding up an object

But what should occur when something like this happens?

Striking a ball

Collision detection doesn’t tell the full story here. To react properly in a situation like this, the movement of the silhouette needs to be accounted for.

Optical Flow

Like a person in a room with a strobe light, a camera views the world in discrete time steps. While we can’t determine everything that happened between two frames, we do have enough information to make an educated guess. In the above example we can see that the person’s hand moved, and by examining its change in position we can infer how it moved. What we are computing is known as the optical flow.

Unseen intermediate frame

The optical flow is a measure of the motion between two frames. By examining how individual pixels change from one image to the next, movement can be detected and velocities assigned. How the flow is computed depends on the algorithm chosen; like background subtraction, there are multiple methods to estimate the flow, which vary in their speed and accuracy.
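As an example, here’s how you might compute a dense flow field with Farnebäck’s method in OpenCV. It’s one of several estimators you could pick, and the parameter values below are common defaults rather than tuned numbers.

```python
import cv2

def dense_flow(prev_gray, curr_gray):
    """Per-pixel (dx, dy) motion between two grayscale frames, in pixels/frame."""
    return cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        0.5,   # pyramid downscale per level
        3,     # number of pyramid levels
        15,    # averaging window; larger = smoother but blurrier flow
        3,     # iterations per pyramid level
        5,     # pixel neighborhood for polynomial expansion
        1.2,   # Gaussian sigma for that expansion
        0)     # flags
```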

Applying flow to the object

By applying the velocity from the optical flow, the ball behaves as one would expect after being struck.
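To transfer that motion to the ball, one option (again a sketch of my own, not necessarily the only approach) is to average the flow vectors over the silhouette pixels in contact with it and add the result to the ball’s velocity.

```python
import numpy as np

def flow_velocity_at_contact(flow, mask, cx, cy, radius):
    """Average flow vector over the silhouette pixels touching the ball."""
    h, w = mask.shape
    ys, xs = np.ogrid[:h, :w]
    contact = ((xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2) & (mask > 0)
    if not contact.any():
        return (0.0, 0.0)                 # nothing touching the ball this frame
    dx, dy = flow[contact].mean(axis=0)   # flow is (H, W, 2): per-pixel (dx, dy)
    return (float(dx), float(dy))         # impart this to the ball's velocity
```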

Putting it Together

Collision detection gives a sense of presence in the world, while optical flow captures larger movements, mimicking their effect on the objects in the scene. Combining the two provides an interaction that feels realistic despite not having anything physical to interact with. A rough sketch of how the pieces fit together follows, and the results can be seen in the video below.
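Wiring the earlier sketches into a per-frame loop might look something like this (foreground_mask, dense_flow, flow_velocity_at_contact, and step_ball are the hypothetical helpers defined above):

```python
import cv2

cap = cv2.VideoCapture(0)                    # default webcam
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

ball_pos, ball_vel, BALL_RADIUS = (320.0, 100.0), (0.0, 0.0), 20

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    mask = foreground_mask(frame)            # who's in front of the camera
    flow = dense_flow(prev_gray, gray)       # how they moved since last frame

    # Transfer silhouette motion to the ball, then integrate with collision.
    dx, dy = flow_velocity_at_contact(flow, mask, int(ball_pos[0]),
                                      int(ball_pos[1]), BALL_RADIUS)
    ball_vel = (ball_vel[0] + dx, ball_vel[1] + dy)
    ball_pos = step_ball(mask, ball_pos, ball_vel, BALL_RADIUS, dt=1.0)

    prev_gray = gray
```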