The first time I stepped in front of a depth camera was almost a year ago now. We had a reference version of a PrimeSense camera that is closely related to the final hardware that went into Kinect. The first thing I got to do was make a stick figure guy move around on the screen. It was captivating to see him match my movements, even with the occasional arm through my chest, Kali-Ma style.

Those first days were filled with lots of experimentation because everything was new to us in this world of full body motion gaming. Which reminds me…if you ever want to see a cool effect, go grab a large mirror and hold it in front of a depth camera at an angle; now you’re really playing with portals!


With all the time I’ve spent around these cameras, I wanted to capture some thoughts on the problems of developing games and software driven by full body motion input.

Unnatural User Input

Lately I’ve come to find the term “Natural User Input” a bit of a misnomer. There are still many technological and human hurdles that have to be overcome with time and good ideas before the interaction is truly natural. The problem with natural is that it’s different for everyone, which generally forces you to make it unnatural for some group of people. Also, with the limitations of the current technology, you will often find yourself making unnatural concessions just to make something work.

A great example of this is getting detected by the camera, often referred to in the office as “Doing the Truffle Shuffle”. Some skeleton SDKs require a pose or gesture before they will detect you as an active user. For example, OpenNI has the “Psi” pose. Some ask you to wave your hand. Some, like Kinect, just work. Even so, many games layer their own logic on top about when a user can join, and that logic is highly varied and currently unnatural because there isn’t one consistent way yet.

Another good example of this is turning. If I asked you to turn, how would you do it?

Q: Would you naturally turn away from the TV?
A: No, then you couldn’t see the TV.

Q: So would you turn your whole body and continue to face the TV, or just your shoulders?
A: If you turn your body naturally you’ll occlude half of your body, making it harder to detect other actions simultaneously (Walk + Turn). Also, skeleton SDKs have varying levels of success tracking shoulder angle, and occluding a shoulder usually causes the tracked shoulder positions to move around.

Q: What about if we let the hands determine turning, moving them left to move left, right to move right?
A: Good in some contexts, like skiing and horseback riding. It’s very unnatural when walking around. It also prevents you from using the hands to do other things at the same time. It’s also very hard to hold for long periods of time if you have to keep them there.

Q: How about leaning left to turn left, leaning right to turn right?
A: It’s great from a technological standpoint. It won’t ever occlude any part of the body. Very easy to do for all users. Very easy to hold for long durations. Can be combined with many other actions. However, it’s completely unnatural.
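
To make the leaning option concrete, here is a minimal sketch of how a lean could be turned into a turn rate. Everything in it is an assumption for illustration: the function name, the 2D (x, y) joint positions with x to the player’s right and y up, and the dead zone and maximum lean angles. It is not taken from any particular skeleton SDK.

```python
import math


def lean_turn_rate(hip_center, shoulder_center,
                   dead_zone_deg=5.0, max_lean_deg=25.0):
    """Map sideways torso lean to a turn rate in [-1, 1].

    hip_center and shoulder_center are hypothetical (x, y) joint positions,
    with x pointing to the player's right and y pointing up.
    Negative results mean "turn left", positive mean "turn right".
    """
    dx = shoulder_center[0] - hip_center[0]
    dy = shoulder_center[1] - hip_center[1]
    # Angle of the torso away from vertical; 0 when standing upright.
    lean_deg = math.degrees(math.atan2(dx, dy))

    # Ignore small leans so camera jitter and normal swaying don't turn you.
    if abs(lean_deg) < dead_zone_deg:
        return 0.0

    # Scale the rest of the range to 0..1 and clamp at the maximum lean.
    span = max_lean_deg - dead_zone_deg
    magnitude = min((abs(lean_deg) - dead_zone_deg) / span, 1.0)
    return math.copysign(magnitude, lean_deg)
```

The dead zone is the part I would expect to tune the most with untrained users, since how far people sway while standing “still” varies a lot.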

The best advice I can give here is to get people to test out your ideas. I can’t tell you how many times I thought I had solved a problem only to see a tester or coworker break it almost immediately. If you can help it, find new people to try out the game. We refer to them as untrained users around the office. With these systems you’ll find that over time the system trains you back. You learn just the right movements without thinking about it, which leads to a false sense of improvement in your gesture detection code.

I haven’t seen it happen yet, but I suspect many motion games in the future will actually ship with multiple ways of handling the same input and let users select the one they prefer, in the same way we have inverted controls and different control schemes today.


The cameras are not perfect and they’re mapping a physical space to some finite number of pixels. Surfaces that poorly reflect infrared, other infrared sources (like the sun), and even the manner in which the cameras define a contiguous surface can cause variations from frame to frame, leading to lots of jaggy, shifting edges on objects. This jitter changes the apparent volume of an object, so the calculated positions of the bones in a skeleton shift too.

So you’ve got to find a way to smooth out the data without adding lag to the propagation of player movements onto the character. The best way we’ve found is with a predictive filter. These filters average old frames in with the current frame’s data while simultaneously predicting N (usually 1) frames forward in time. The only drawback is that they end up overshooting and undershooting the actual curve of motion, because they predict that the motion will continue in the same direction. Luckily this largely goes unnoticed by users.
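
As a rough illustration of the idea, here is a sketch using double exponential smoothing (Holt’s method), which is one common way to build a filter that smooths and predicts at the same time. I’m not claiming it is the exact filter any given SDK or game uses; the class name and the default parameter values are assumptions.

```python
class PredictiveJointFilter:
    """Smooths one joint position and predicts a few frames ahead.

    Uses double exponential smoothing: a smoothed position ("level") plus a
    smoothed per-frame velocity ("trend"). All names and defaults here are
    illustrative assumptions.
    """

    def __init__(self, smoothing=0.5, trend_smoothing=0.5, predict_frames=1):
        self.alpha = smoothing          # weight given to the newest raw sample
        self.beta = trend_smoothing     # weight given to the newest velocity
        self.predict_frames = predict_frames
        self.level = None
        self.trend = None

    def update(self, raw):
        """Feed one raw position (any tuple of numbers), get a filtered one."""
        raw = tuple(float(v) for v in raw)
        if self.level is None:
            # First frame: nothing to average with yet, assume no motion.
            self.level = raw
            self.trend = tuple(0.0 for _ in raw)
        else:
            prev_level = self.level
            # Blend the raw sample with where we expected the joint to be.
            self.level = tuple(
                self.alpha * r + (1.0 - self.alpha) * (l + t)
                for r, l, t in zip(raw, prev_level, self.trend))
            # Blend the newly observed velocity with the old one.
            self.trend = tuple(
                self.beta * (nl - pl) + (1.0 - self.beta) * t
                for nl, pl, t in zip(self.level, prev_level, self.trend))
        # Project forward to hide the lag that the averaging introduced.
        return tuple(l + self.predict_frames * t
                     for l, t in zip(self.level, self.trend))
```

If you feed this a joint that moves steadily and then stops dead, the output keeps travelling for a few frames before settling back, which is the overshoot described above.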

Generally Avoid

The amount you should avoid each of these varies across cameras and skeleton SDKs, but generally speaking this is my own list of things you should try to avoid.

  • Small Motions – Detecting them is very difficult because they are very easy to confuse with noise.
  • Holding hard poses – It’s hard to hold your arms out for extended periods of time.
  • Motions near the body – Occlusion problems, bone loss.
  • Fast motions – Most of the consumer grade depth cameras right now are running at 30 FPS. It’s very easy to move faster than the segmentation / skeleton prediction code is willing to bet you’ve moved, and it will happily ignore your motion.
  • Extreme poses – Poses most people would have trouble making. Not just because people have trouble making them, but because most of the skeleton SDKs are not trained for unusual body positions.
  • Sitting – It is generally not handled well across skeleton SDKs. The overall skeleton becomes a lot less trustworthy.

That’s Normal Right?

All the skeleton SDKs I’ve used so far generally don’t return anything other than the rawest of the raw bone positions. That is generally a good thing; you wouldn’t want them to hide the raw data from you. However, it tends to result in moments when your hand will penetrate your chest, your knee will flip backwards and you’ll have your leg behind your back.

So it becomes important to try to avoid these events by using joint constraints. Even though the skeleton SDKs usually report bone confidence numbers, that confidence isn’t based on how a normal human can move. It’s based on whether they can clearly see something they think is a body part. If they can, they will happily report that they’re 100% confident your leg has driven itself up into your chest.
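
Here is a small sketch of what one such constraint could look like: a check for a knee that has flipped backwards, used to override the SDK’s confidence. It is a simple heuristic under stated assumptions, not a full inverse kinematics solution. The function names, the tolerance value, and the idea of passing in a unit-length `forward` vector for the direction the player faces are all hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def knee_is_plausible(hip, knee, ankle, forward, tolerance=0.15):
    """Return False when the shin bends the wrong way around the knee.

    hip, knee and ankle are 3D positions; forward is a unit vector pointing
    the way the player faces. A real knee only lets the ankle swing behind
    the line of the thigh, so a shin that swings forward of it is suspect.
    """
    thigh = _sub(knee, hip)
    shin = _sub(ankle, knee)
    thigh_len = math.sqrt(_dot(thigh, thigh))
    shin_len = math.sqrt(_dot(shin, shin))
    if thigh_len == 0.0 or shin_len == 0.0:
        return False  # degenerate bones are never trustworthy

    thigh_dir = tuple(v / thigh_len for v in thigh)
    # Remove the part of the shin that simply continues along the thigh.
    along = _dot(shin, thigh_dir)
    sideways = tuple(s - along * d for s, d in zip(shin, thigh_dir))
    # Positive means the ankle has swung in front of the thigh line.
    forward_bend = _dot(sideways, forward) / shin_len
    return forward_bend <= tolerance


def constrained_confidence(sdk_confidence, hip, knee, ankle, forward):
    """Keep the SDK's confidence only when the leg pose is humanly possible."""
    if knee_is_plausible(hip, knee, ankle, forward):
        return sdk_confidence
    return 0.0
```

When the check fails, one option is to hold the last plausible pose for that limb rather than trusting the new frame.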


Timing is very difficult. The user has to predict how long he is going to take to move, while at the same time accounting for how long the avatar will take to move, plus how long the gesture detection will take to detect his action. That makes it very hard for him to predict when he has to jump or duck or move to the side.

In these situations, feedback that he has done the right thing, as well as how long he has left to do the right thing, can be important. One handy trick when you would otherwise have to compress the time available to play back an animation is Bullet-Time. Imagine a player running and jumping hurdles. There’s an unknown zone where, once the player has entered it, there is not enough time left to play back the jump animation without it looking bizarrely sped up. With Bullet-Time, however, if you detect the gesture just in time, you can slow down time long enough to play back the animation and also indicate to the user, “Hey, you almost missed that one”. Bullet-Time is also handy for just giving the user more time to make a split second decision, and then as soon as they’ve made it, you speed back up.
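
The arithmetic behind that trick is small enough to sketch. This assumes the jump animation keeps playing at normal speed while the rest of the world is slowed; the function name, parameter names and the minimum scale are made up for illustration.

```python
def bullet_time_scale(time_to_obstacle, animation_length, min_scale=0.2):
    """Return the world time scale needed to fit an animation in.

    time_to_obstacle: seconds (at normal speed) until the avatar reaches
        the hurdle, measured at the moment the gesture was detected.
    animation_length: seconds the jump animation needs at normal speed.
    min_scale: floor so the world never appears to freeze completely.

    A result of 1.0 means no slowdown is needed.
    """
    if animation_length <= 0 or time_to_obstacle >= animation_length:
        return 1.0  # plenty of time, play everything at full speed
    # Slowing the world by this factor stretches the remaining time so the
    # animation fits without being sped up.
    return max(time_to_obstacle / animation_length, min_scale)
```

For example, a gesture detected 0.25 seconds before the hurdle with a 0.5 second jump animation gives a scale of 0.5, so the world runs at half speed until the jump finishes. Any result below 1.0 is also a natural trigger for the “you almost missed that one” feedback.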

Just a Little Bit Closer…

Depth perception is another frustrating problem. Users have a really poor sense of how far away the objects they can interact with are from their avatar. Luckily there are many ways around this problem.

  • Depth Cues – Shadows do a great job of helping to show distance as you get closer to an object.
  • UI Visual Cues – Visual feedback that you can now interact with the object is important. If I’m playing a volleyball game, changing the halo around the ball from red to green to indicate I can now jump and hit it can be valuable feedback, because it’s hard estimating how high my character can jump, or when they can jump.
  • Camera Angle – Camera angle is everything. Having the right angle to the object can make it much easier to tell depth.
  • Audio Cues – I don’t see these get used very often, but sound is a great way to indicate action is required, or success or failure on the user’s part.