Sound Designers are faced with a difficult proposition. Add sound effects, music and dialogue to a game, set rules that determine how those sounds are panned, attenuated and processed in real-time (both individually and with respect to one another), and then hope that the player will hear the resulting ‘mix’ over a system capable of reproducing the spatial accuracy, dynamic range and frequency response of the original content. It’s a difficult proposition… at best!
During my time on the engine team at Insomniac, I put considerable effort into trying to ‘fix’ the mix: building an advanced ducking system, improving distance attenuation and panning curves, expanding dynamic range and listener focus using an ‘HDR Audio’ algorithm, introducing near-field rendering techniques to improve presence and envelopment, improving the reverberation and occlusion algorithms to enhance the perception of indirect sound in geometric spaces, and creating a live mixing system that allows sound designers to move continuously between mixer snapshots based on gameplay context. All of these tools added to the sound designer’s palette, giving them the ability to make the choices they needed to bring the sound ‘experience’ to the player.
But something critical was still missing…
Sound source post-processing in the game world relies on an accurate mapping to the real-world ‘listening’ space. And simply put… this doesn’t exist for audio in games (at least, not without rendering a binaural mix to headphones). Whereas image content can be mapped to a television screen (with very little loss in the world-space-to-screen-space translation), audio content is ultimately being piped through a set of misplaced, misdirected low-quality loudspeakers.
So, we need some way to guarantee at least a baseline of consistency between 2D and 3D sound localization in the game world and 2D and 3D sound rendering in the playback environment, both for sound designers and for players.
We can do this by introducing a loudspeaker analysis and pre-processing step, prior to audio engine initialization, mixing and playback. The idea here is to capture the number, location and radiation pattern of loudspeakers in 3D space (azimuth, elevation, distance and orientation), as well as the dynamic range, frequency/phase/magnitude response, and impulse response of the loudspeakers in the playback environment. Given these parameters (which could be captured in real-time prior to initializing the audio engine, or stored in a user profile and reloaded on-demand), the audio engine can then be fed the appropriate information it needs to properly ‘correct’ (‘skew’) source rendering parameters before mixing the results to the destination loudspeakers.
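To make that concrete, a captured profile might look something like the sketch below. The type and field names here are illustrative only, not the actual engine’s data structures:

```cpp
// Hypothetical sketch of a per-loudspeaker measurement profile that an
// analysis pass could hand to the audio engine at initialization time.
#include <vector>

struct LoudspeakerProfile {
    float azimuthDegrees;        // measured horizontal angle, 0 = front-center
    float elevationDegrees;      // measured vertical angle, 0 = ear height
    float distanceMeters;        // measured distance from the listening position
    float orientationDegrees;    // facing direction relative to the listener
    float trimGainDb;            // level correction relative to the reference channel
    bool  isFullRange;           // false for low-frequency emitters (subwoofers)
    std::vector<float> impulseResponse;  // measured in-room impulse response
};

struct PlaybackProfile {
    float sampleRate;
    std::vector<LoudspeakerProfile> speakers;  // one entry per physical output channel
};
```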
Without these parameters, the audio engine is forced to render to an abstract, idealized set of perfectly positioned, perfectly responding loudspeakers. Any variation in the real (physical) layout from the idealized layout goes unaccounted for. Unfortunately, this is how game audio engines work today. This leaves much to be desired, both for the player and the sound designer.
The simplest solution we’ve implemented is to fill out a user-generated profile with the loudspeaker positions, orientations and response patterns provided by a television manufacturer. Our more advanced solution synthesizes an all-pass autocorrelation sequence (a probe signal), plays the probe signal through each loudspeaker in series, and extracts the 3D location and response of each loudspeaker by de-convolving the probe from the room response captured by a microphone. The resulting analysis data includes AOA (angle of arrival) per loudspeaker (translated to azimuth and elevation angles), relative distance (with respect to the other loudspeakers), frequency/phase/magnitude response, impulse response and polarity. Furthermore, we can determine which loudspeakers are full-range, and which loudspeakers are low-frequency emitters (aka subwoofers).
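As a rough illustration of the extraction step: because the probe’s autocorrelation approximates an impulse, de-convolution can be approximated by cross-correlating the microphone capture with the probe. The peak of the result gives the onset delay (and therefore relative distance), and the tail approximates the loudspeaker-plus-room impulse response. This is a naive time-domain sketch with hypothetical helper names; a production analyzer would do this in the frequency domain:

```cpp
#include <cstddef>
#include <vector>

// Cross-correlate the room capture with the probe signal. With an
// impulse-like probe autocorrelation, the result approximates the
// impulse response of the loudspeaker + room path.
std::vector<float> crossCorrelate(const std::vector<float>& capture,
                                  const std::vector<float>& probe)
{
    std::vector<float> ir(capture.size(), 0.0f);
    for (std::size_t lag = 0; lag < capture.size(); ++lag) {
        float acc = 0.0f;
        for (std::size_t n = 0; n < probe.size() && lag + n < capture.size(); ++n)
            acc += capture[lag + n] * probe[n];
        ir[lag] = acc;  // peak location ~ time of flight; shape ~ impulse response
    }
    return ir;
}

// Convert the onset delay to a distance, assuming ~343 m/s speed of sound.
float delayToDistanceMeters(std::size_t delaySamples, float sampleRate)
{
    return 343.0f * static_cast<float>(delaySamples) / sampleRate;
}
```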
The azimuth angles are used to initialize the panning algorithm in the audio engine. This means that misplaced loudspeakers are actually accounted for when panning to the output. For example, let’s say the front-right loudspeaker is placed +40 degrees to the right. In the idealized layout, the front-right loudspeaker is assumed to be located +30 degrees to the right (a 10 degree difference in physical layout). If the panning algorithm uses the idealized layout, and a sound source is panned to a position that translates to +30 degrees to the right, the entire signal will be sent to that single loudspeaker, which physically sits at +40 degrees, so the sound images 10 degrees off target. If, however, the panning algorithm uses the real (physical) layout, then the sound source will properly sit at +30 degrees, panned 3/4 of the way toward the front-right loudspeaker. While a 10 degree difference in position may not seem significant, the problem becomes more apparent when a sound source smoothly pans across the plane in the direction of the loudspeaker. In this case, panning without speaker correction will result in a Doppler shift for a constant-velocity pan, and a skewed mix of the entire sound scene in the direction of the misplaced loudspeaker. Add to this the misplacement of multiple loudspeakers in a surround setup and you get a completely skewed mix. Sadly, the idealized layout is the current state of the art in game audio rendering.
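Here’s a minimal sketch of pair-wise, constant-power panning driven by measured azimuths rather than idealized ones. The function and names are illustrative, not a specific engine’s panner:

```cpp
#include <cmath>

struct PairGains { float left; float right; };

// Pan a source between two adjacent loudspeakers located at the *measured*
// angles leftAzimuth/rightAzimuth (degrees). sourceAzimuth should lie between them.
PairGains panBetween(float sourceAzimuth, float leftAzimuth, float rightAzimuth)
{
    const float pi = 3.14159265f;
    // Fraction of the way from the left speaker to the right speaker.
    float t = (sourceAzimuth - leftAzimuth) / (rightAzimuth - leftAzimuth);
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    // Constant-power law: gains trace a quarter circle, so left^2 + right^2 == 1.
    return { std::cos(t * 0.5f * pi), std::sin(t * 0.5f * pi) };
}

// Example from the text: center measured at 0 degrees, front-right measured
// at +40 degrees, source intended at +30 degrees:
//   panBetween(30.0f, 0.0f, 40.0f)
// pans the source 3/4 of the way toward front-right, so it still images at
// +30 degrees on the physical layout.
```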
The elevation angles are used to correct for loudspeakers which may be placed at unequal heights with respect to the rest of the layout. For example, if the player has a center loudspeaker that is lower or higher in elevation than the front-left and front-right loudspeakers (e.g., the loudspeaker is on the bottom of the television set), the correction system will perform height virtualization to elevate the center loudspeaker signal back to the idealized position. Failing to do this would result in skewed panning: a sound which should pan on the azimuth plane from front-left to front-right would incorrectly dip or rise toward the center loudspeaker’s physical elevation, rather than holding its intended position on the plane. With correction, a sound source panned to 0 degrees front-center will seem to emanate from that exact location in space… un-elevated.
The relative speaker distances are used to apply a fixed delay to the output of each loudspeaker channel. These subtle delays correct for asymmetries in speaker distance. In essence, these delays place loudspeakers onto the idealized unit sphere, where each channel is equidistant from the listener.
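A sketch of that delay calculation, assuming a speed of sound of roughly 343 m/s (the helper name is hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Convert measured speaker distances into fixed per-channel output delays so
// every loudspeaker effectively sits on the same virtual sphere.
std::vector<std::size_t> distanceAlignmentDelays(const std::vector<float>& distancesMeters,
                                                 float sampleRate)
{
    const float speedOfSound = 343.0f;  // m/s, room-temperature approximation
    // The farthest loudspeaker needs no delay; nearer ones are delayed so that
    // all channels arrive at the listening position at the same time.
    float farthest = *std::max_element(distancesMeters.begin(), distancesMeters.end());
    std::vector<std::size_t> delays;
    delays.reserve(distancesMeters.size());
    for (float d : distancesMeters) {
        float seconds = (farthest - d) / speedOfSound;
        delays.push_back(static_cast<std::size_t>(seconds * sampleRate + 0.5f));
    }
    return delays;
}
```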
A related correction is for speaker orientation. Inconsistencies in the rotation of loudspeakers (e.g., speakers placed facing downward in a television enclosure) are corrected so as to conform to a directional, listener-centric phase response.
Relative loudspeaker gains are corrected using simple equalization per output channel. For example, if the front-left and front-right loudspeakers are bigger and louder than the center loudspeaker, then sounds placed in the center channel (e.g., dialogue) will be modified with equal-power gain compensation. These coefficients are also used by the panning algorithm. So if a sound pans through the quieter, smaller center loudspeaker, the adjacent loudspeakers will help contribute more focused gain to the sound as it moves across the plane.
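One simple convention for deriving those per-channel trims is sketched below. Whether quieter channels are boosted or louder channels are attenuated is a design choice (an assumption here, not something the analysis dictates); this sketch attenuates down to the quietest channel to avoid over-driving small loudspeakers:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Derive a linear trim gain per output channel from measured playback levels
// (in dB), so every loudspeaker reaches the listener at a matched level.
std::vector<float> channelTrimGains(const std::vector<float>& measuredLevelsDb)
{
    // Use the quietest channel as the reference and attenuate the others to it.
    float reference = *std::min_element(measuredLevelsDb.begin(), measuredLevelsDb.end());
    std::vector<float> gains;
    gains.reserve(measuredLevelsDb.size());
    for (float level : measuredLevelsDb)
        gains.push_back(std::pow(10.0f, (reference - level) / 20.0f));  // dB -> linear
    return gains;
}
```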
The frequency response of each loudspeaker can also be corrected with respect to the room response. If, for example, a loudspeaker rings at a certain set of frequencies due to undesirable room modes, we can ‘whiten’ those sub-bands of the response to more closely match the loudspeaker to its natural response in an anechoic chamber (an echo-free room). Note that we’re not attempting to push a loudspeaker beyond its natural response; rather, we’re enabling the loudspeaker to output its natural response pattern, given an ideal set of room conditions.
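A single band of that correction could be implemented as a peaking-EQ cut at the ringing frequency. The sketch below uses the widely published ‘Audio EQ Cookbook’ biquad formulas; a real correction filter would cascade several of these per output channel, with the frequencies, Q values and cut depths derived from the measured response:

```cpp
#include <cmath>

struct Biquad { float b0, b1, b2, a1, a2; };  // normalized coefficients (a0 == 1)

// Peaking EQ ("Audio EQ Cookbook" form), used here as a cut at a room-mode frequency.
Biquad makePeakingCut(float sampleRate, float modeFrequencyHz, float q, float cutDb)
{
    const float pi = 3.14159265f;
    float A     = std::pow(10.0f, cutDb / 40.0f);        // cutDb is negative for a cut
    float w0    = 2.0f * pi * modeFrequencyHz / sampleRate;
    float alpha = std::sin(w0) / (2.0f * q);
    float a0    = 1.0f + alpha / A;
    return {
        (1.0f + alpha * A) / a0,
        (-2.0f * std::cos(w0)) / a0,
        (1.0f - alpha * A) / a0,
        (-2.0f * std::cos(w0)) / a0,
        (1.0f - alpha / A) / a0,
    };
}

// Example: a 6 dB cut at a ringing 120 Hz room mode, Q of 4, 48 kHz output:
//   Biquad notch = makePeakingCut(48000.0f, 120.0f, 4.0f, -6.0f);
```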
Together, the loudspeaker analyzer and suite of correction routines combine to create a system that ensures accurate audio reproduction in the playback environment, both for sound designers and players. Integrating such a system into a game engine dramatically improves how faithfully the sound designer’s intent reaches the player.