
Limitations of blind testing procedures

Status
Not open for further replies.

oivavoi

Major Contributor
Joined
Jan 12, 2017
Messages
1,035
Likes
418
#1
Thought I'd start a new thread about this. For quite some time I've been intrigued by blind testing, and whether that's the be-all and end-all when it comes to audible differences in audio (or preferences). A couple of years ago, when I was still struggling in the passive quagmire, this led me to sell off a very expensive amplifier and buy a very cheap one. After all, there shouldn't be audible differences between amplifiers according to ABX tests, right?

But somehow, the speakers didn't sound as right with the new and cheap amplifier as with the more expensive one. I was never sure whether to trust my own hearing on that, or assume that it was just a placebophile illusion. Then I got active monitors, and that rendered the whole discussion moot for the time, for me at least.

But now I've been thinking about it again, as I might change my system, and probably will buy some new electronics. Should one trust blind tests blindly - or not? The subjectivists, of course, have long been skeptical of blind tests, because they invalidate basically all of their beliefs. And I guess that almost all of us on this forum share the belief that sighted listening can be unreliable. For me, that's something that goes without saying. I will never trust my own sighted listening without hesitation - nor do I trust the sighted listening of others, unless it can be backed up by measurements and/or reasonable theories about what happens that don't violate any fundamental physical laws. But even though sighted listening is unreliable, does that mean that blind listening is necessarily completely reliable? Lately I've been inclined to say no.

----------------------

The subjectivist criticism of blind testing has been that it is "unnatural", and is different from "normal" listening, and that this masks real and objective differences. Another line of criticism has been that the brain might confuse things - that the brain mixes everything into a diffuse sensory soup when trying to compare listening impressions analytically. This is the view of Siegfried Linkwitz, for example, and the blogger The Rational Audiophile. I would say that both of these theories are possible. But until recently, I wasn't aware of any systematic data to back this up. However, I recently came across this peer-reviewed article from 2013, by two French researchers:
https://hal-institut-mines-telecom....ile/index/docid/842647/filename/APAC_5172.pdf

In that article, they run an experiment comparing different procedures for blind testing of preferences between loudspeakers. Procedure one is the most common one in ABX tests, in which short excerpts are played one after another. Procedure two is closer to normal listening: long excerpts are played, and the listeners can switch themselves. Here, however, the volume was fixed and matched. In the third procedure, the listeners heard long excerpts AND could adjust the volume themselves. This procedure was deemed the closest to normal listening.

The very intriguing finding was that procedure 3 turned out to be the most discriminating. Under this procedure, there were statistically significant differences in ratings between loudspeakers that weren't there under procedures 1 and 2. This means that loudspeakers that could not be discriminated quality-wise under the ordinary testing procedure actually were perceived as different when the test resembled "normal" listening. The researchers also explain that this was not because of any loudness mismatch - the test subjects didn't use the volume control that much, even though they could. It was the mere possibility of changing the volume that made them able to discriminate more finely between the loudspeakers.

That raises the question: Are there other kinds of objective differences that may be heard under normal listening, but that disappear under some blind test procedures?

-----------------------

What's my take-away? OK, assuming that this study is valid, of course - hopefully it will be replicated. I think blind tests are valuable when it comes to identifying the differences that are most obvious, and which might be the most important when listening. If a difference is objectively minimal, and never shows up in an ABX test, that's also an indication that it might not be that important. Still: increasingly I'm taking a positive ABX result as a confirmation, but I don't necessarily take a negative ABX result as a disconfirmation.

So I no longer take for granted that an objective difference that has yet to be ABXed is inaudible. I am even more concerned with measurements and objective differences, and less so with the results of ABX tests which explore audibility. I also trust my subjective listening a tiny bit more than before, even though I still try to be aware of my biases.

I have therefore, for example, gone back to streaming from Tidal in hifi/CD quality, even though most ABX tests show that few people can ABX lossless from 320 kbps. If I ever need to drive a speaker system with external amps again (with active crossovers, of course), I think I'll probably get amps that measure really well (like Hypex), and not settle for the cheapest thing I find in a garage. In a way I'm indulging myself in some audiophilia nervosa.
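For what it's worth, the raw data rates behind that choice are easy to work out; a back-of-the-envelope sketch (this is the uncompressed PCM rate, not a claim about what Tidal actually transmits after lossless compression):

```python
# Raw data rate of CD-quality PCM vs a 320 kbps lossy stream.
sample_rate = 44_100   # samples per second
bit_depth = 16         # bits per sample
channels = 2           # stereo

cd_kbps = sample_rate * bit_depth * channels / 1000   # kilobits per second
print(f"CD audio: {cd_kbps:.1f} kbps, lossy: 320 kbps, "
      f"ratio ~{cd_kbps / 320:.2f}x")
```

So the lossy encoder is throwing away roughly three quarters of the raw bits; whether the remainder is audibly transparent is exactly what the ABX tests are arguing about.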

-----------------------

Any comments or thoughts? I'd love input on this - I'm not very entrenched in my views on this matter.
 
Last edited:

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
1,782
Likes
705
Location
UK
#2
Indeed, I too am highly sceptical of listening tests. At the end of the day, I notice that for all the listening tests in the world, we are still at the position of the basic aim of hi-fi being the proverbial "straight wire with gain". Listening tests don't tell us anything that we can't simply work out on paper about amplifier design or speaker driver design etc., and are so fragile and unstable that they are likely just to confuse matters.
 

oivavoi

Major Contributor
Joined
Jan 12, 2017
Messages
1,035
Likes
418
#3
I think that listening tests can be useful to a certain degree, though. I think that the research program of Toole and Olive has been valuable, for example, and I have no doubt that the world of hifi is better off as a result. If there is a problem in the general world of hifi, it's not that there are too many listening tests, but rather that there are too few.

The problem is when one loses sight of the limitations of these kinds of listening tests. Like the thing with the time domain - assumed by many to be of no significance, partially on the basis of listening tests. But now, finally, more fine-tuned tests are showing that time domain actually may be very important. So my problem is not with listening tests or "audio science" per se, but rather that some of us objectivists (myself included) might have been too uncritical in dealing with the research that has been done so far.
 

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
1,782
Likes
705
Location
UK
#4
The problem is when one loses sight of the limitations of these kinds of listening tests. Like the thing with the time domain - assumed by many to be of no significance, partially on the basis of listening tests. But now, finally, more fine-tuned tests are showing that time domain actually may be very important. So my problem is not with listening tests or "audio science" per se, but rather that some of us objectivists (myself included) might have been too uncritical in dealing with the research that has been done so far.
It was listening tests that perpetuated the whole "phase doesn't matter" meme; listening tests using passive speakers (DSP hadn't been invented) and which were based on an assumption that listening in mono is just as revealing as listening in stereo because other historical experiments based on passive speakers found that some small number of people couldn't tell the difference. These historical results are still used today to tell people that they can't possibly be hearing superior performance from modern systems because "phase doesn't matter".

The person who had never heard of these ancient experiments would not hesitate to design his modern system to minimise time domain/phase errors, and wouldn't need a listening test to justify it.
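One way to make the phase question concrete is with an allpass filter: it leaves the magnitude spectrum exactly unchanged while rotating phase, so any audible difference it introduces is a pure time-domain effect. A minimal sketch in Python (the first-order section and the coefficient 0.7 are illustrative choices, not taken from any of the historical experiments):

```python
import cmath

def allpass_response(a, w):
    """Frequency response of a first-order allpass H(z) = (a + z^-1) / (1 + a*z^-1),
    evaluated on the unit circle at angular frequency w (radians/sample)."""
    z_inv = cmath.exp(-1j * w)
    return (a + z_inv) / (1 + a * z_inv)

# Magnitude stays at exactly 1 at every frequency; only the phase changes.
for w in (0.1, 0.5, 1.0, 2.0, 3.0):
    H = allpass_response(0.7, w)
    print(f"w = {w:.1f}: |H| = {abs(H):.6f}, phase = {cmath.phase(H):+.3f} rad")
```

If a listener can tell the allpassed signal from the original, the magnitude response alone cannot explain it - which is exactly what is at stake in the "phase doesn't matter" claim.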
 

DonH56

Major Contributor
Technical Expert
Patreon Donor
Joined
Mar 15, 2016
Messages
1,604
Likes
936
Location
Monument, CO
#5
IME/IMO the biggest problems are trying to use a DBT to prove "better", when all they are really designed to do reliably is show "different", and the bias of the listeners. By the latter I mean that different listeners listen for different things, and trained listeners gave different results than untrained listeners, at least a bazillion or so years ago (1) when I last really delved into blind, double-blind, and ABX testing (i.e. running them, participating in them, and crunching the data).

FWIWFM - Don

(1) OK, maybe not quite that long ago, but everybody exaggerates.
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
11,763
Likes
3,467
Location
Seattle Area
#6
Great question and great starter post.

One of the problems with ABX tests is the hobby-conducted ones, where mistakes are made in protocol. Blind testing is not easy and requires a ton of effort to do right. As you mention, the likes of Olive and Toole do it well, but the casual audiophile groups often do not. An example of where one goes wrong is selecting people without critical listening abilities, or content that is not revealing. It is trivial to arrive at negative outcomes this way, which sadly is very common in hobby-run tests.

The study you point to (thank you!) confirms what I have experienced. I like to be in full control of volume, length of clip being played, when to switch, etc. I do these things when I take ABX tests on my own computer. Sometimes I zoom into a quiet part (like a reverb trail) and need to compensate with much higher volume. I know for sure that if you take away these options, I will not be able to get a positive outcome.

Some people call foul on that. They say if it requires so much focus and volume fiddling, then the difference isn't worth it. I disagree. When we see these results we want them to apply to everyone. There are people who hear better than me so it is important that if there is an audible difference -- no matter how small -- we find it. And importantly fix it. Then we can be assured of transparency.

Ultimately, though, even the best tests have flaws. This is why it is important to triangulate with how things work and with measurements. That three-way analysis then gives an extremely high-confidence answer.

Just like a court case, we cannot ever achieve certainty. That isn't the standard in a murder case, so it should not be here either. We simply need to rise to the level of "beyond reasonable doubt."
 

Blumlein 88

Major Contributor
Joined
Feb 23, 2016
Messages
3,238
Likes
1,381
#7
I actually read this yesterday when looking for info about best methods to level match speakers.

Firstly, the long presentations were only 30 seconds; the short ones, 5 seconds.

The ratings were for levels of preference between two speakers at a time. Like Don, I find ABX best to see if a difference is real and less effective for determining 'better' sound.

It was interesting that one of their conclusions was that assessment of the speakers was stable across procedures and excerpts, yet the data showed that two of the three musical excerpts favored different speakers under two of the three assessment procedures used.

The main difference from their test, to me, is that having to listen to consecutive presentations in full, even short ones, is not as effective as being able to switch between them at any time. The two long presentations allowed that and the short one did not. If they had also done a short presentation allowing switching at the listener's choosing, it would have untangled things much more clearly. I also think short is better for determining whether a difference exists. I would not be surprised if a quality or preference judgement takes longer. I also noticed the orchestral musical excerpt was less discriminating than simpler music - exactly what other research has shown, and exactly the opposite of the conventional audiophile wisdom.

The procedure where listeners got to choose volumes resulted in listeners on average choosing 2.2 dB louder. Listeners were told to choose and match volumes before proceeding with the test of the musical excerpts. It doesn't surprise me that speakers found different under the long procedure with fixed volume could be found different with higher precision when slightly louder.

I also wonder if the higher-rated speaker in procedure 3 (where volume was set by listeners) was also the naturally louder speaker. People can mess about with volume levels and get prejudiced by the naturally louder one pretty quickly. We also know setting by ear usually only gets you within 1 dB or so. They said they recorded level settings for each test; I wish they had charted those to show how much they differed.
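For scale, those dB figures translate to linear ratios as 10^(dB/20) for amplitude and 10^(dB/10) for power, so the 2.2 dB average offset is roughly a 29% difference in sound pressure. A quick sketch:

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a voltage / sound-pressure ratio."""
    return 10 ** (db / 20)

def db_to_power_ratio(db):
    """Convert a level difference in dB to a power ratio."""
    return 10 ** (db / 10)

# The offsets discussed above: 1 dB (matching by ear) and 2.2 dB (listener-chosen).
for db in (0.5, 1.0, 2.2):
    print(f"{db:>3.1f} dB -> amplitude x{db_to_amplitude_ratio(db):.3f}, "
          f"power x{db_to_power_ratio(db):.3f}")
```

Which is a reminder of why level matching is done with a meter rather than by ear: even the 1 dB you can reliably hit by ear is a 12% amplitude difference.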
 
Last edited:

Blumlein 88

Major Contributor
Joined
Feb 23, 2016
Messages
3,238
Likes
1,381
#8
Another comment about your experience with the amplifiers: if you don't mind, what speaker and amp did you initially have, and which cheap amp did you switch to? Amps, in my opinion, still need matching with speakers. Speakers are not simple resistive loads (other than some Maggies), and therefore amps aren't always interchangeable based on the simpler power and distortion specs. It can get a bit more involved when connecting to transducers (speakers or headphones).
 

oivavoi

Major Contributor
Joined
Jan 12, 2017
Messages
1,035
Likes
418
#9
It was listening tests that perpetuated the whole "phase doesn't matter" meme; listening tests using passive speakers (DSP hadn't been invented) and which were based on an assumption that listening in mono is just as revealing as listening in stereo because other historical experiments based on passive speakers found that some small number of people couldn't tell the difference. These historical results are still used today to tell people that they can't possibly be hearing superior performance from modern systems because "phase doesn't matter".

The person who had never heard of these ancient experiments would not hesitate to design his modern system to minimise time domain/phase errors, and wouldn't need a listening test to justify it.
But isn't that the way science is supposed to function? You start to measure something, you build theories, and then it turns out that you were wrong because better measurements and better theories come along.

An example from an academic field that is somewhat closer to my own: Most research up until now has concluded that it may actually be healthy to drink a very moderate amount of alcohol every day. These moderate drinkers seemed to be better off, even when controlling for other things. The most recent research, that employs much more sophisticated statistical methods, has shown that these results were probably wrong. Alcohol consumption seems to have a net negative effect on health, even in moderate amounts. It is reasonable to assume that the early research on alcohol and health has led at least some people to consume more alcohol than they would have done otherwise. This has probably been moderately negative for their health, all things considered. Should we then conclude that it was negative that this research was done? No, I wouldn't say so. That's how science works, I think. Sometimes you're wrong. Then it's time to accept that, and move on :)
 
Last edited:

oivavoi

Major Contributor
Joined
Jan 12, 2017
Messages
1,035
Likes
418
#10
I actually read this yesterday when looking for info about best methods to level match speakers.
...
Thanks, excellent comments! I need to give them some thought.

Concerning your question on the amp and speakers: the speakers were a pair of quasi-omnis, the Enigma 5 from Heed. No imaging or clarity to speak of, but very sociable and friendly in the living room. The amp was a Hegel H300, from a Norwegian company known for excellent quality. Built like a tank. I exchanged that for the cheapest of the cheap, the A500 from Behringer. Which was fully functional, and the watt/price ratio was unbeatable.
 
Last edited:

oivavoi

Major Contributor
Joined
Jan 12, 2017
Messages
1,035
Likes
418
#11
The study you point to (thank you!) confirms what I have experienced. I like to be in full control of volume, length of clip being played, when to switch, etc. I do these things when I take ABX tests on my own computer. Sometimes I zoom into a quiet part (like a reverb trail) and need to compensate with much higher volume. I know for sure that if you take away these options, I will not be able to get a positive outcome.

Some people call foul on that. They say if it requires so much focus and volume fiddling, then the difference isn't worth it. I disagree. When we see these results we want them to apply to everyone. There are people who hear better than me so it is important that if there is an audible difference -- no matter how small -- we find it. And importantly fix it. Then we can be assured of transparency.
Thanks Amir, good comments. I think I agree. It seems like your approach is this, then: assuming some people can reliably hear a difference, it's worth pursuing? I'm inclined to buy into that. It's possible that when you put together a lot of small differences/improvements, which might not be individually audible to most people in a blind test, the combined result of all these improvements might nevertheless be perceived as so good that it can be differentiated in a listening test.

Of course, sometimes it's also a practical question of cost, and of when things are good enough. Hi-res music, even though a very select group of people have been able to ABX it, carries a considerable extra cost and needs much more storage space. I also think that 16-bit, when done right, is very good sonically. But then there are other areas: excellent amplifiers with close to zero distortion (Hypex etc) are available for not too much money these days. Why not use them? Time coherence in loudspeakers, difficult to achieve in the passive era, is possible because of DSP - even though it may not be noticed in some blind tests by many people. So why not just do it, when it's easily doable?

My reason for going from Tidal HiFi to ordinary Tidal, for example, was that I wasn't able to consistently ABX the difference between 16-bit lossless and 320 kbps. Or rather, my results skewed towards identification, but weren't statistically significant. But now I'm thinking: what the heck. There IS a difference, and some people spot it. I could probably do it as well if I trained myself. I'll just go for maximal transparency and quality. It costs me two beers a month.
 

Sal1950

Major Contributor
The Chicago Crusher
Joined
Mar 1, 2016
Messages
3,581
Likes
788
Location
Central Fl
#12
Some people call foul on that. They say if it requires so much focus and volume fiddling, then the difference isn't worth it.
How would that be relevant? Who's to say what's "worth it"? If you can reliably identify which is which, you've proven they aren't identical. Now, identifying which is the more accurate or "better" is a different kettle of fish, one that proper measurements should be able to settle.
Or am I off course here?
 

oivavoi

Major Contributor
Joined
Jan 12, 2017
Messages
1,035
Likes
418
#13
The procedure where listeners got to choose volumes resulted in listeners on average choosing 2.2 dB louder. Listeners were told to choose and match volumes before proceeding with the test of the musical excerpts. It doesn't surprise me that speakers found different under the long procedure with fixed volume could be found different with higher precision when slightly louder.

I also wonder if the higher-rated speaker in procedure 3 (where volume was set by listeners) was also the naturally louder speaker. People can mess about with volume levels and get prejudiced by the naturally louder one pretty quickly. We also know setting by ear usually only gets you within 1 dB or so. They said they recorded level settings for each test; I wish they had charted those to show how much they differed.
I think you highlight a potential problem with the study here. It should be a simple matter of protocol for the researchers to say whether the preferred speaker was played louder than the others, though. It might be, however, that the other speakers distorted more easily when the volume increased.
 

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
1,782
Likes
705
Location
UK
#14
If all it takes is a dB or two of volume increase to make an inferior speaker sound superior, then why don't we all just turn up our stereos by a dB or two?

And then that leads to the other thread about level matching speakers for listening tests. At moderate volumes, a speaker with a 'loudness curve' applied may sound 'better' than a flat speaker. So which is the better speaker? It's at that moment that the objectivists can swing into action with measurements to show that, without doubt, the flatter speaker is the better one. Surely a listening test will prove it etc....
 

BE718

Major Contributor
forum experimenter
Patreon Donor
Joined
Mar 1, 2016
Messages
1,857
Likes
621
#15
A timely thread.

I have just performed a blind listening test with a group of audiophiles using MQA material. To be honest, the test didn't have enough samples to be genuinely statistically representative, but yet again it confirmed my views on individuals' real abilities to discern differences with audio equipment.

This isn't the first test I have done, but it yielded the same result. People are nowhere near as good at discriminating as they think they are, and the differences that people claim to hear are greatly exaggerated (speakers maybe excluded). As soon as you put in even the most unobtrusive controls and remove the visual cues, the audiophile abilities evaporate.

I think the point for me is that if you have to go to the nth degree to hear the difference - if you need months of exposure and a Zen-like state, in familiar surroundings with fair maidens soothing you - then said difference can't be particularly large or significant. Often this is the sort of excuse that is trotted out for why people fail listening tests. Of course it's up to the individual to decide what's significant, but still...

BTW, the MQA results were interesting. This was a 2L recording, Hoff Ensemble, 96 kHz/24-bit vs the MQA version, played on a Meridian Explorer2; the rest of the system was high-end PS Audio / Audio Physics. ABX-wise, all pretty random choices from the 10 people, except for one guy who got it 100% right. Again I have to stress there weren't enough samples to draw a statistically confident conclusion. A second question was asked: which one did they prefer? A solid preference for the non-MQA!
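As a rough sanity check on that one perfect scorer: the probability of a perfect run under pure guessing is easy to compute exactly with the binomial distribution. A minimal sketch (the 8-trial count is a made-up illustration; the post doesn't say how many trials each listener did):

```python
from math import comb

def abx_p_value(correct, trials):
    """One-sided exact binomial p-value: P(X >= correct) under pure guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# A perfect score on 8 hypothetical trials, for a single listener...
p = abx_p_value(8, 8)
# ...vs the chance that at least one of 10 independent guessers goes perfect.
p_any_of_10 = 1 - (1 - p) ** 10
print(f"single listener p = {p:.4f}, at least one of 10 guessers: {p_any_of_10:.3f}")
```

Note how the "at least one of 10" figure is an order of magnitude larger than the single-listener p-value, which is one concrete way of saying there weren't enough samples to single out one lucky-looking participant.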
 
Last edited:

Blumlein 88

Major Contributor
Joined
Feb 23, 2016
Messages
3,238
Likes
1,381
#16
A timely thread.

I have just performed a blind listening test with a group of audiophiles using MQA material. To be honest, the test didn't have enough samples to be genuinely statistically representative, but yet again it confirmed my views on individuals' real abilities to discern differences with audio equipment.

This isn't the first test I have done, but it yielded the same result. People are nowhere near as good at discriminating as they think they are, and the differences that people claim to hear are greatly exaggerated (speakers maybe excluded). As soon as you put in even the most unobtrusive controls and remove the visual cues, the audiophile abilities evaporate.

I think the point for me is that if you have to go to the nth degree to hear the difference - if you need months of exposure and a Zen-like state, in familiar surroundings with fair maidens soothing you - then said difference can't be particularly large or significant. Often this is the sort of excuse that is trotted out for why people fail listening tests.

BTW, the MQA results were interesting. This was a 2L recording, Hoff Ensemble, 96 kHz/24-bit vs the MQA version, played on a Meridian Explorer2; the rest of the system was high-end PS Audio / Audio Physics. ABX-wise, all pretty random choices from the 10 people, except for one guy who got it 100% right. Again I have to stress there weren't enough samples to draw a statistically confident conclusion. A second question was asked: which one did they prefer? A solid preference for the non-MQA!
Let me clarify what I think you are saying. So, other than one guy, random results of MQA vs non-MQA for hearing a difference, but for preference the group scored above random levels?
 

BE718

Major Contributor
forum experimenter
Patreon Donor
Joined
Mar 1, 2016
Messages
1,857
Likes
621
#17
Let me clarify what I think you are saying. So, other than one guy, random results of MQA vs non-MQA for hearing a difference, but for preference the group scored above random levels?
Yes
 

tomelex

Addicted to Fun and Learning
Patreon Donor
Joined
Feb 29, 2016
Messages
657
Likes
169
Location
So called Midwest, USA
#18
I do not see how anyone who understands the amplifier-speaker connection at an electrical level can say that all amps sound the same. I could agree that amps within a certain style of output electronics (class A, class A/B with feedback, push-pull vs single-ended) will be harder to discern; there are technical reasons they behave differently in controlling a real-world loudspeaker vs a test resistor. Anyhow, a spectrum test clearly shows how they are different for all to see with just one or two tones, and in the real audio world people test with multiple tones etc.

I have done amateur blind tests at least 50 times. Everything makes a difference; it's just a matter of how loud you play the music or what the device is. Level matching speakers to one another is virtually impossible, so finding that speakers sound different is easy; electronics, that gets damn hard. However, power amps among the electronics make a noticeable difference. And I say this when comparing comparably designed stuff - clearly we are not comparing a Radio Shack Archer stereo with an upscale Yamaha or Boulder amp or electronics.

Hearing differences for me is easier with switching about every 10 seconds at most. The repeat function on the CD player is great; I can repeat one part of a track over and over, which makes it real easy to hear differences. As has been said, hearing differences does not make one a more brilliant audiophile, and it does not automatically mean that you prefer one over the other. The thing about ABX, if doing electronics and not speakers, is, for me: if you can't identify a difference within 10-second switches, you are never going to hear the differences. I can always hear a difference within that time frame.

When I test changes to my audio devices' internal components, I can for example get my wife to come in and hook up my switch for me, so I don't know which position is which device; then I just decide in the end which position of the switch sounds better to me. I tell you, the most expensive part does not always equate to a preference in sound.

The short answer, for me, is that speakers are a preference thing; there is no way to level match them across the band, and while you can hear differences, the difference you prefer is best evaluated by listening over some period of time once you are down to the two speakers you like. That's because our mood, amount of sleep, stress, etc. all play into how we hear, and over time we get averaged out, so that the speakers we actually choose (which are not really changing in any way of concern) will be the ones we can always listen to.

Above all, taking away sighted bias absolutely takes away a whole load of things that affect our thought process, and allows for a test on merits, not price or bling. If you really are pursuing the best sound for you, really the best sound, then you have to listen blind or you will fool yourself. AND, if you don't know what you want or like, if you really don't know yourself, you are fooling yourself again. The experience, like a lot of things in life, is knowing yourself first; only then can you SATISFY yourself... cheers.
 

BE718

Major Contributor
forum experimenter
Patreon Donor
Joined
Mar 1, 2016
Messages
1,857
Likes
621
#19
Great question and great starter post.

One of the problems with ABX tests is the hobby-conducted ones, where mistakes are made in protocol. Blind testing is not easy and requires a ton of effort to do right. As you mention, the likes of Olive and Toole do it well, but the casual audiophile groups often do not. An example of where one goes wrong is selecting people without critical listening abilities, or content that is not revealing. It is trivial to arrive at negative outcomes this way, which sadly is very common in hobby-run tests.

The study you point to (thank you!) confirms what I have experienced. I like to be in full control of volume, length of clip being played, when to switch, etc. I do these things when I take ABX tests on my own computer. Sometimes I zoom into a quiet part (like a reverb trail) and need to compensate with much higher volume. I know for sure that if you take away these options, I will not be able to get a positive outcome.

Some people call foul on that. They say if it requires so much focus and volume fiddling, then the difference isn't worth it. I disagree. When we see these results we want them to apply to everyone. There are people who hear better than me so it is important that if there is an audible difference -- no matter how small -- we find it. And importantly fix it. Then we can be assured of transparency.

Ultimately, though, even the best tests have flaws. This is why it is important to triangulate with how things work and with measurements. That three-way analysis then gives an extremely high-confidence answer.

Just like a court case, we cannot ever achieve certainty. That isn't the standard in a murder case, so it should not be here either. We simply need to rise to the level of "beyond reasonable doubt."
Whilst I absolutely agree with your points regarding the folly of hobby testing, it has to be better than sighted and guaranteed-biased comparisons.
 

Blumlein 88

Major Contributor
Joined
Feb 23, 2016
Messages
3,238
Likes
1,381
#20
Whilst I absolutely agree with your points regarding the folly of hobby testing, it has to be better than sighted and guaranteed-biased comparisons.
Any controls help. While I doubt I fully convinced anyone, I have pierced their opinions a few times just by insisting we carefully match levels. With two pieces of gear that were thought to be night-and-day different, once level matched even believers were taken aback and admitted, "well, it might not be as large a difference as I thought". This was sighted. It was with people who knew me and had some respect for, or at least familiarity with, my opinions. Perhaps without that personal connection people would still cling heavily to their biased experiences.
 