ABX is strong at revealing amplitude/frequency differences that show up under instant A/B switching. It's much like two swatches of colour: they're not easy to tell apart if they are not side by side in our vision, but bring them together & they can be easily differentiated. So successful ABX testing relies on quick A/B switching while the sounds are still in echoic memory (the duration of which is debatable & varies with cognitive load).
One of the issues I see with ABX testing is that, as Amir said, if we can't identify a specific short section of an audio track that reveals the difference, we are most likely not going to identify a difference at all. I've seen it stated that just going with our "feeling" of a difference between tracks can be enough to show that a difference exists, but I have my doubts about the efficacy of this approach.
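Whichever listening strategy is used, short-segment or "feeling", the resulting score can at least be checked against pure guessing. Here is a minimal sketch in Python, assuming the usual model where every trial is an independent 50/50 guess; the function name `abx_p_value` is made up for illustration & doesn't come from any ABX tool.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of scoring at least `correct` out of `trials` by guessing alone.

    Assumes each trial is an independent coin flip (p = 0.5), which is the
    standard null hypothesis for an ABX run.
    """
    if trials < 0 or not 0 <= correct <= trials:
        raise ValueError("need 0 <= correct <= trials")
    # Count the outcomes with `correct` or more hits, then divide by
    # the total number of equally likely guess sequences (2 ** trials).
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


if __name__ == "__main__":
    print(round(abx_p_value(12, 16), 4))
```

For example, 12 correct out of 16 gives a probability of about 0.038 of doing that well by guessing. Note that this only says whether a difference was detected; it says nothing about which listening approach is better at detecting it.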
Identifying a specific audio section for comparison is the equivalent of finding a colour swatch or segment somewhere across two videos that we can use to A/B. This is not simple for video & even more difficult for audio because of the way the two perceptions differ: we can freeze a video & compare frame stills, but we can't do that with audio. With audio we are being asked to find the section that differs in a constantly changing stream.
So the first problem I see is zeroing in on the areas that might reveal differences & examining them. This requires prior knowledge/experience of the sound of distortions/artifacts/issues & the discipline to search for & find such differences in the two audio streams. A related problem is that we are all limited in our experience of what artifacts/distortions sound like; without training in what to listen for, we are limited in what we can differentiate.
The next problem I see with ABX is that many of the differences between modern audio playback systems can't be pinned to a short audio segment; they are more about the quality of the perceptual illusion. You may find what I have to say next too controversial for here & think it should be in "Fight Club", so I'll just mention it here & maybe do a full exposition of my position in Fight Club. I'm of the opinion, based on my experience, that the typical differences between modern playback systems at the higher end are differences in the quality of the illusion each system creates: things like soundstage depth, solidity, etc. (I don't want to get into too much controversial stuff here). My main point is that to differentiate between playback versions we need to listen to a longer segment of audio to judge these elements, & hence ABX is an unsuitable tool for evaluating these differences.
Sorry if this is the wrong place to post this, Amir. I know it's not about ABX stats, but then some of the other posts aren't either. It arose from the thoughts that came to me when I read the various posts in here. Feel free to move it if you think it should be elsewhere.