A lot of this research was done at NASA Ames in Mountain View. That's how I became familiar with it; I met some of the people working on it (a very long time ago). They had a lab where you could sit at the center of a sphere of speakers while they ran experiments on spatial perception of audio. It was really interesting. Broadly speaking, headphones should outperform surround speakers for spatial perception in absolute terms, but it is much easier to generate quality spatial perception with cheap speakers than with cheap headphones because of how the signal interacts with our biology (see below).
In short, two "microphones" (your ears) are not enough to place a sound in 3-space, which makes audio illusions and perception gaps possible. Beyond time-of-flight and amplitude differentials, the human ear acts like a notch filter whose notch frequency changes as a function of angle of incidence. We don't consciously hear the notch, but the brain uses it to infer angle in a plane. This has significant limitations, e.g. it doesn't work well for unfamiliar sounds with novel spectral signatures because the brain can't discriminate between a notch that is part of the source's natural spectrum and one created by the ear. It is possible to synthesize audio that breaks this part of our brain by constructing a set of cues that violate the laws of nature -- it is pretty uncomfortable. Creating spatial perception through signal processing has a couple of limitations (a toy sketch of the basic cues follows the two points below):
First, every human ear has a unique notch-filter pattern. Spatial audio over headphones, which partially bypass the ear's own notch filtering, works best when a microphone is literally inserted into your ear canal and your unique notch-filtering pattern is measured with test signals. Those measurements can be fed into the software to create more accurate spatial cues for your unique ears. The perceived result is qualitatively different. There is no universal algorithm that works for everyone.
Second, with headphones the relationship between your ears and the sound sources doesn't change. In nature, animals either reorient their ears to sweep the notch-frequency cues (humans have vestigial musculature for this), or, as humans do, move their heads to modulate both the notch-frequency and time-of-flight cues. With headphones, the sound sources turn with your head, so head motion produces no cues.
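Here is a minimal sketch of the three static cues described above: interaural time difference, interaural level difference, and an elevation-dependent spectral notch. The head geometry and the linear notch-frequency mapping are illustrative assumptions standing in for a listener's unique, measured pattern, not real HRTF data.

```python
# Toy binaural renderer: ITD + ILD + elevation-dependent pinna notch.
# Numbers below are illustrative assumptions, not measured HRTF values.
import numpy as np
from scipy.signal import iirnotch, lfilter

FS = 48_000              # sample rate, Hz
HEAD_RADIUS = 0.0875     # rough human head radius, m
SPEED_OF_SOUND = 343.0   # m/s

def render_binaural(mono, azimuth_deg, elevation_deg):
    """Apply crude spatial cues to a mono signal.

    azimuth_deg: 0 = straight ahead, positive = source to the right.
    elevation_deg: 0 = ear level, positive = above the head.
    """
    az = np.radians(azimuth_deg)

    # Interaural time difference via Woodworth's spherical-head formula.
    itd_s = (HEAD_RADIUS / SPEED_OF_SOUND) * (abs(az) + abs(np.sin(az)))
    delay = int(round(itd_s * FS))

    # Interaural level difference as a crude broadband gain tilt toward
    # the near ear (real ILDs are strongly frequency dependent).
    ild_db = 6.0 * abs(np.sin(az))
    near = 10 ** (+ild_db / 40)
    far = 10 ** (-ild_db / 40)

    # Elevation cue: one spectral notch swept with elevation. Pinna notches
    # do live roughly in the 6-10 kHz region, but this linear mapping is a
    # stand-in for the unique per-listener pattern discussed above.
    notch_hz = 6000.0 + 4000.0 * (elevation_deg + 90.0) / 180.0
    b, a = iirnotch(notch_hz, Q=8, fs=FS)
    filtered = lfilter(b, a, mono)

    lead = np.concatenate([filtered, np.zeros(delay)])
    lag = np.concatenate([np.zeros(delay), filtered])
    if azimuth_deg >= 0:   # source on the right: right ear leads
        left, right = lag * far, lead * near
    else:
        left, right = lead * near, lag * far
    return np.stack([left, right], axis=1)

# Example: a half-second noise burst placed 45 degrees right, 30 degrees up.
noise = np.random.default_rng(0).standard_normal(FS // 2)
stereo = render_binaural(noise, azimuth_deg=45, elevation_deg=30)
```

Even this toy version makes the first limitation concrete: the notch mapping is hard-coded, so it will read as "above" for some listeners and as a vague timbre change for others.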
To make natural-sounding spatial audio work on headphones, the audio source needs to detect changes in head orientation in real time and apply the appropriate DSP to the raw audio (sketched below). This is less of a problem with surround-sound speaker systems because head motion provides these spatial cues naturally. I haven't tried it, but Apple's real-time head tracking plausibly provides the necessary DSP inputs to produce a spatial model that tracks as well as or better than external speakers. Where external speakers fall short is that, unless you are in a carefully acoustically treated space, the room itself injects all kinds of spectral, temporal, and amplitude artifacts that unpredictably degrade the spatial cues in the audio.
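A sketch of that head-tracked loop: the virtual source stays fixed in the room by subtracting the listener's head yaw from the source azimuth before each block is rendered. `render_binaural()` and `FS` come from the previous sketch, and the yaw stream here is simulated, standing in for real IMU/head-tracker input.

```python
# Head-tracked rendering loop: re-render each block against the latest
# head orientation so the virtual source stays world-anchored.
import numpy as np

BLOCK = 1024  # samples per processing block

def head_tracked_stream(mono, source_az_deg, yaw_at):
    """Render mono audio block by block, compensating for head yaw."""
    out = []
    for start in range(0, len(mono) - BLOCK + 1, BLOCK):
        block = mono[start:start + BLOCK]
        # Turning the head right by N degrees moves a world-anchored
        # source N degrees left relative to the head.
        rel_az = source_az_deg - yaw_at(start)
        rendered = render_binaural(block, rel_az, elevation_deg=0)
        out.append(rendered[:BLOCK])  # a real renderer would crossfade
                                      # blocks to avoid boundary clicks
    return np.concatenate(out)

# Simulated slow, continuous head turn of 30 degrees over two seconds --
# exactly the kind of sub-snap rotation a tracker has to follow smoothly.
sig = np.random.default_rng(1).standard_normal(FS * 2)
yaw_at = lambda start: 30.0 * start / len(sig)
stereo = head_tracked_stream(sig, source_az_deg=60, yaw_at=yaw_at)
```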
> I haven't tried it, but Apple's real-time head tracking plausibly provides the necessary DSP inputs to produce a spatial model that tracks as well as or better than external speakers.
Apple's spatial audio head tracking is unreliable at best. I can fool the system easily by rotating my head side to side at a speed slower than a snap rotation.
Even when the tracking is working properly, it sounds like the simulated surround sound from home theater receivers of 20 years ago.
> It is possible to synthesize audio that breaks this part of our brain by constructing a set of cues that violate the laws of nature -- it is pretty uncomfortable
Do you need special equipment to do it, or can it be done on a PC with speakers? I'd be quite interested in listening to this.