The main problem being that "soundstage" remains undefined.
But I actually think that it's the one term that has the best chance of becoming operationally defined in the future. Intuitively, we all know what "soundstage" is: if you're blindfolded in a typical room and someone talks to you, there's a pretty good chance you'll be able to move near that person's location by ear alone.
The fundamental problem is that headphones are only a small part of the chain of elements needed to reconstruct all the cues we use, in that situation, to locate that voice in that space.
I think that an interesting test would be to design a virtual space, ask people to move to a sound's location without any visual cues, and score them. Think blindfolded video game. But I'm not certain that this can become a reality without a combination of at least a few of the following, and perhaps all of them to be truly convincing: object-based formats, individualised HRTFs, head-tracking, and headphones with a predictable FR at someone's eardrum.
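A minimal sketch of how such a test could be scored, assuming a simple 2-D virtual room where each trial records the true source position and where the listener ended up. The half-metre "hit" radius and all the names here are illustrative, not part of any established protocol:

```python
import math

def localization_error(source_xy, listener_xy):
    """Distance (in virtual-room metres) between the true source
    position and where the blindfolded listener stopped."""
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    return math.hypot(dx, dy)

def score_trials(trials, radius=0.5):
    """Fraction of trials where the listener stopped within `radius`
    of the source -- one possible scoring rule among many."""
    hits = sum(1 for src, end in trials
               if localization_error(src, end) <= radius)
    return hits / len(trials)

# Two hypothetical trials: one near-miss within the radius, one well outside.
trials = [((1.0, 0.0), (0.9, 0.1)),
          ((0.0, 2.0), (1.0, 1.0))]
print(score_trials(trials))  # prints 0.5
```

Averaging the raw error distances instead of counting hits would give a finer-grained score, and the same idea extends to scoring the listener's final facing direction as an angular error.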