I just so happened to write a Reddit post about "resolution": why it's not really a thing, and why frequency response (FR) is by far the most important factor.
First of all, any good headphones produce distortion that is low enough not to be audible. This is helpful because it simplifies our reasoning. Now that we've put distortion aside for a second, your question can be divided into two parts:
1. Headphones/IEMs being minimum-phase systems means FR fully describes their output.
Minimum-phase is a mathematical concept that applies to far more than loudspeakers. In this case, it means the driver's movement (and the resulting sound pressure at the ear drum) tracks the input signal without lagging behind any more than its FR dictates: the phase response, and therefore the whole time-domain behavior, follows directly from the magnitude response. This means there is no such thing as "attack", "decay", "driver speed", etc. as a property separate from FR, because anything the driver does in the time domain is already captured by the FR. Now, you can totally get a subjective sense of "this headphone has bad decay compared to this other headphone", but what you're describing is a difference between their FRs.
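To make the minimum-phase claim a bit more concrete, here is a minimal sketch in Python with NumPy. It assumes a made-up 3-tap minimum-phase filter standing in for a driver, throws away the phase, rebuilds it from the magnitude alone (real-cepstrum folding), and compares. The filter coefficients and the helper name `minimum_phase_from_magnitude` are illustrative assumptions, not measurements of any real headphone.

```python
import numpy as np

def minimum_phase_from_magnitude(magnitude):
    """Rebuild the complex response of a minimum-phase system from |H| alone.

    `magnitude` is |H| sampled on a full FFT grid of even length N.
    Uses real-cepstrum folding, a discrete form of the Hilbert-transform
    relation between log-magnitude and phase.
    """
    n = len(magnitude)
    cepstrum = np.fft.ifft(np.log(magnitude)).real
    fold = np.zeros(n)
    fold[0] = 1.0          # keep the zero-quefrency term
    fold[1:n // 2] = 2.0   # double the causal part
    fold[n // 2] = 1.0     # keep the Nyquist term once
    return np.exp(np.fft.fft(cepstrum * fold))

# Hypothetical "driver": zeros at 0.5 and -0.3, both inside the unit circle,
# so the filter is minimum-phase.
h = np.convolve([1.0, -0.5], [1.0, 0.3])

n_fft = 1024
H_true = np.fft.fft(h, n_fft)

# Only the magnitude is passed in; the phase is discarded.
H_rebuilt = minimum_phase_from_magnitude(np.abs(H_true))

phase_error = np.max(np.abs(np.angle(H_true) - np.angle(H_rebuilt)))
h_rebuilt = np.fft.ifft(H_rebuilt).real[:len(h)]

print("max phase error (rad):", phase_error)
print("original impulse response:", h)
print("rebuilt impulse response: ", h_rebuilt)
```

The rebuilt phase and impulse response match the originals to within rounding error. If the system were not minimum-phase (say, the same filter plus a pure delay), the magnitude would be unchanged but the rebuilt phase would no longer match, which is why the minimum-phase property is what lets FR stand in for the time domain.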
"Wait, you're saying the headphone tracks the input signal, how come different headphones sound different?" Well, how a headphone changes the input signal is called frequency response, but the point is the "not lagging behind" thing (the output is stable and time-invariant).
2. Since FR is what matters, this means FR is what determines the subjective sense of resolution and soundstage. How can that be?
So what does this mean for headphones like Stax, where the driver is super thin and lightweight? Surely the lighter membrane can accelerate faster, stop moving faster, and can therefore track every little nook and cranny of the analog waveform of the recording, thus extracting more details, right?
Well yes, but just about any driver can move fast enough to do that. They are minimum-phase, remember? The mass of the driver is important when designing a headphone because it affects FR, but that's something the headphone designers need to worry about, not us.
CD quality cuts off at 22.05 kHz (half of its 44.1 kHz sample rate), not to mention the hearing of most adults cuts off below that. This means any driver that can do that many wiggles per second (at a usable amplitude or volume) is fast enough to extract every "detail" in the recording. Even an entry-level headphone like a Koss or a cheap Grado can do this. You want your driver to go faster? Sure, there are headphones that can reach up to 40 kHz or whatever, but it doesn't really matter because you can't hear details that would require more than 20 kHz to describe.
"Ok, sure. If two drivers are playing a 20kHz tone at the same volume, their drivers are moving at the same speed. However, that's just a static sine wave, not a complex musical signal with multiple frequencies of varying amplitudes!"
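How the driver reacts to a complex signal like music, which has multiple frequencies of varying amplitudes, is exactly what "frequency response" describes. As long as distortion is low (the driver behaves linearly), music is just a sum of sine waves, and the driver treats each component the same way it would if that component were played on its own.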
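A minimal sketch of that superposition argument, assuming the driver behaves as a linear time-invariant filter (which is what "distortion is inaudibly low" amounts to here). The impulse response `driver_ir` and the two test tones are made-up values for illustration only.

```python
import numpy as np

fs = 48_000                      # sample rate in Hz (assumed)
t = np.arange(4800) / fs         # 100 ms of signal

tone_a = np.sin(2 * np.pi * 200 * t)          # a bass component
tone_b = 0.3 * np.sin(2 * np.pi * 5000 * t)   # a quieter treble component

# Hypothetical driver impulse response (not from any real headphone).
driver_ir = np.array([0.6, 0.3, 0.1, -0.05])

def play(signal):
    """Pass a signal through the (linear) driver model."""
    return np.convolve(signal, driver_ir)

together = play(tone_a + tone_b)            # the "complex musical signal"
separately = play(tone_a) + play(tone_b)    # each tone on its own, then summed

max_difference = np.max(np.abs(together - separately))
print("max difference:", max_difference)
```

The difference is zero up to floating-point rounding: for a linear driver, knowing how it handles each frequency on its own already tells you how it handles all of them at once.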
When reviewers say X headphone is more detailed than Y, they are describing differences in FR, sometimes with price bias or other cognitive biases thrown in the mix. The classic example of this is Stax, not because the drivers are these super light membranes with extremely low distortion, but because they generally lack sub-bass and have exaggerated upper midrange and treble, things that get associated with the sense of "detail".
"But wait, we can EQ two headphones to have the same FR and they won't sound the same! You can't EQ Stax levels of detail into an HD650, therefore there must be something other than FR at work here!"
That's because we're not actually EQ'ing to the same FR. If two headphones had the exact same FR at the ear drum (not just on a measurement rig, but at your actual ear drums), they would sound the same. This is impossible to do in practice because a) the measurement rig's ears aren't shaped like your individual ears, which affects the FR in the treble, b) simply taking a headphone off your head and putting it back on will change the FR in the treble due to imprecise seating, and c) the bass response depends on how tight a seal you get on your head vs. on the measurement rig. These are all frequency response differences, mind you.
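Here is a rough sketch of why "EQ'd to the same FR" on a rig is not the same as "same FR at your ear drum". All numbers are invented for illustration, and for simplicity it assumes the target headphone behaves on your head exactly as it does on the rig, which would not hold in reality either.

```python
import numpy as np

freqs_hz = np.array([20, 100, 1000, 3000, 8000, 12000])

# Hypothetical rig measurements in dB (made-up numbers).
source_rig_db = np.array([-6.0, -1.0, 0.0, 4.0, -2.0, -5.0])
target_rig_db = np.array([0.0, 0.5, 0.0, 8.0, 3.0, 1.0])

# EQ that makes the source match the target *on the rig*.
eq_gain_db = target_rig_db - source_rig_db

# Hypothetical deviation of the source on a real head vs. the rig:
# some seal loss in the bass, seating and ear-shape effects in the treble.
on_head_offset_db = np.array([-3.0, -0.5, 0.0, 0.5, 2.5, -4.0])
source_on_head_db = source_rig_db + on_head_offset_db

residual_on_rig_db = source_rig_db + eq_gain_db - target_rig_db
residual_on_head_db = source_on_head_db + eq_gain_db - target_rig_db

for f, rig, head in zip(freqs_hz, residual_on_rig_db, residual_on_head_db):
    print(f"{f:>6} Hz  rig error {rig:+.1f} dB   on-head error {head:+.1f} dB")
```

On the rig the error is zero everywhere, but on the head the whole on-head offset survives the EQ untouched. That leftover is still an FR difference, just one the rig never saw.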
Oratory1990 has mentioned a few things that a headphone needs in order to respond well to EQ:
- perform reliably, with repeatable seal across multiple users
- easily obtain the amount of seal that it was designed for (rip glasses users or people with large beards)
- have good quality control = little unit variation and no channel imbalance
- have a relatively smooth FR free from high-Q artifacts (sharp peaks and dips)
- deform the pinna as little as possible
- have few reflections inside the earcup, especially those that lead to destructive interference. You can't fix a notch in the FR with EQ (non-flat excess group delay).
- have suitably low distortion
Most headphones do not meet all of these conditions, every one of which affects FR, so their FR will be a pain to EQ accurately. What I'm trying to explain is that there will always be an FR difference when comparing two headphones, even with EQ. Therefore, there doesn't "need" to be some other variable at play, and indeed in blind tests, FR tracks very closely with listener preference, while no other metric does.
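The destructive-interference point from the list can be sketched with a simple model: the direct sound plus one delayed, slightly weaker reflection. This is only a toy model under stated assumptions (a single reflection, a 3 cm extra path, a reflection gain of 0.9, speed of sound 343 m/s); real earcups are messier, and the function name `reflection_response` is made up for this example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def reflection_response(freq_hz, extra_path_m, reflection_gain):
    """Complex response of direct sound plus one delayed reflection."""
    delay_s = extra_path_m / SPEED_OF_SOUND
    return 1.0 + reflection_gain * np.exp(-2j * np.pi * freq_hz * delay_s)

extra_path_m = 0.03      # hypothetical extra distance travelled by the reflection
reflection_gain = 0.9    # hypothetical strength of the reflection

# The first notch sits where the reflection arrives half a cycle late.
first_notch_hz = SPEED_OF_SOUND / (2 * extra_path_m)
notch_depth_db = 20 * np.log10(
    abs(reflection_response(first_notch_hz, extra_path_m, reflection_gain))
)

# Reseating the headphone so the path is 2 mm longer moves the notch.
shifted_notch_hz = SPEED_OF_SOUND / (2 * (extra_path_m + 0.002))

print(f"first notch: {first_notch_hz:.0f} Hz, depth {notch_depth_db:.1f} dB")
print(f"after a 2 mm change in path: {shifted_notch_hz:.0f} Hz")
```

With these made-up numbers the notch lands around 5.7 kHz and is about 20 dB deep, and a 2 mm change in geometry moves it by a few hundred Hz. A narrow EQ boost tuned to the first position would miss the notch after reseating and add a peak next to it instead.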
With IEMs, everything I said applies, except it's even simpler. They don't interact with the outer ear, only with the ear canal. Depending on the shape of your individual ear canals, the treble response will be affected, which crinacle has covered in his article about interpretation of FR graphs. You can have quite significant variations in the treble response just by inserting IEMs differently. This may affect one's subjective notion of "detail" and "resolution". Are your ear canals identical to mine? Are you sure the two of us are not just hearing a different FR?
Now for the soundstage thing. When it comes to speakers in a room, soundstage size is determined by the directivity characteristics of the speaker, which in turn affect how the sound reflects around the room and back to the listener's ears.
In headphones, it's basically frequency response. The "room" is the interior of each ear cup, so does that mean the ear cup reflections are responsible for soundstage? Well... ear cup reflections affect group delay, which in turn affects FR, as the point from oratory1990 above explained. There have been attempts at measuring soundstage, to questionable degrees of accuracy (see RTings' attempts). We know it is affected by the FR of the headphone and by how your ears change the FR that reaches your ear drum (called HRTF/PRTF), although there also seems to be a "trick" where large earcups that do not touch your ears contribute to the perception of this effect (think HD800 vs HD650).