
Relevance of Blind Testing

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,500
Likes
4,132
Location
Pacific Northwest
... If there is a difference then you should make 20 out of 20.
For obvious differences, you certainly will. And for differences you can't really hear, results will be in the noise, resembling random guessing. Test sensitivity refers to how well the test can detect differences in between these extremes: results that are not 100%, yet still consistently better than random guessing. This happens because our perception gets less reliable near minimum thresholds. There's a point where one can just barely hear the difference: it's so subtle you occasionally get it wrong, yet you still get it right more often than random guessing would. The difficulty is that pure random guessing occasionally gets lucky too, so how do you tell the two apart?
PS: the answer is to do more trials, which tests whether results slightly better than guessing are consistent, thus raising the confidence %. But the increased number of trials should be split across different sessions to prevent listener fatigue from impairing the results.
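As a small illustrative sketch of the "more trials" point, using the same 50/50 binomial arithmetic as the code later in this thread (the 60% hit rate and the trial counts below are arbitrary examples; math.comb needs Python 3.8+):

Code:
import math

# Chance of scoring k or more out of n purely by guessing (p = 0.5 per trial)
def guess_tail(k, n):
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

# The same 60% hit rate, observed over more and more trials
for k, n in [(6, 10), (12, 20), (30, 50), (60, 100)]:
    print(f"{k}/{n}: chance of guessing this well = {guess_tail(k, n):.3f}")

The printed chance of matching that score by luck shrinks as the trial count grows even though the hit rate stays at 60%, which is why doing more trials (split across sessions to avoid fatigue) raises the confidence.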
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,500
Likes
4,132
Location
Pacific Northwest
Python may be easier for people to use. For educational purposes some may want a naive derivation of the math, instead of just calling a library function. Here is the Python 3 version of the prior Java code:

Code:
import math

# Returns the confidence (as a fraction, 0-1) of getting trialsCorrect of trialsTotal:
# the probability that pure guessing scores below trialsCorrect
def getPercentile(trialsCorrect, trialsTotal):
    pTile = 0.0
    if trialsTotal <= 0:
        return pTile
    # Sum the probabilities of every outcome below trialsCorrect:
    # 0 of N, 1 of N, ..., trialsCorrect-1 of N
    for i in range(0, trialsCorrect):
        pTile += computePtile(i, trialsTotal)
    return pTile

# Compute the probability of getting exactly n of total trials correct
# Note: exactly n, not "n or more"
# Assumes each guess has a 50% (1/2) chance of being correct
def computePtile(n, total):
    return math.factorial(total) / (math.factorial(n) * math.factorial(total - n)) \
            * math.pow(0.5, n) \
            * math.pow(0.5, total - n)
Just paste that into a Python 3 interpreter and call it like this, passing any numbers you want:
>>> print(getPercentile(15,20))
0.9793052673339844
That means getting 15 of 20 trials correct corresponds to 97.9% confidence, or about a 2.1% chance of doing that well (or better) purely by guessing.
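As a cross-check, the same cumulative probability can be computed from exact binomial coefficients in the standard library (math.comb needs Python 3.8+); this is just an alternative sketch of the same math, not a replacement for the derivation above:

Code:
import math

def getPercentileExact(trialsCorrect, trialsTotal):
    # P(fewer than trialsCorrect correct answers in trialsTotal fair 50/50 guesses)
    below = sum(math.comb(trialsTotal, i) for i in range(trialsCorrect))
    return below / 2 ** trialsTotal

print(getPercentileExact(15, 20))   # should agree with getPercentile(15, 20)

For those who already have SciPy installed, scipy.stats.binom.cdf(14, 20, 0.5) gives the same number.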

You can test the inner function like this:
Code:
>>> for i in range(0, 11):
...     print(computePtile(i,10))
... 
0.0009765625
0.009765625
0.0439453125
0.1171875
0.205078125
0.24609375
0.205078125
0.1171875
0.0439453125
0.009765625
0.0009765625
You can see the probabilities are symmetric, as they should be. It's a binomial distribution. The intuition behind this symmetry is that guessing all 10 wrong is as unlikely as guessing all 10 right, guessing exactly 1 right (meaning getting 9 wrong) is as unlikely as guessing exactly 1 wrong (meaning getting 9 right), etc. The center (5 right, 5 wrong) is the most likely because it has the greatest number of different combinations that lead to it. And the probabilities all sum to 1.0, as they must, since the outcomes are mutually exclusive and together cover every possible result.
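A quick way to check those last two claims (symmetry and summing to 1.0) with the computePtile function above:

Code:
>>> print(sum(computePtile(i, 10) for i in range(11)))   # ~1.0, up to floating-point rounding
>>> print(computePtile(3, 10), computePtile(7, 10))      # a symmetric pair, same value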
 

MattHooper

Master Contributor
Forum Donor
Joined
Jan 27, 2019
Messages
7,372
Likes
12,367
Recently, I watched a video of John Atkinson's RMAF 2018 presentation "50 years being an audiophile", and he told one interesting story among other things. He said that he attended a DBT of amplifiers in the 70s, if I remember correctly, where they basically concluded that all amps sound the same. But after a while, having a regular Quad amp, he found that he didn't enjoy listening to music anymore. So, he got himself a new system, reignited his passion for music listening, and deduced that DBT has some flaws.
That story got me thinking about what happens after blind testing. I am not really questioning DBT, because I know it is an essential scientific tool.
But let's say you have a sighted test of two amps (could be preamps, DACs... doesn't matter) and you like the sound of amp A better than amp B. Now you do a DBT of those same two amps, and most likely you can't hear the difference. They sound the same. Now, when you do the sighted test again, will you like the sound of amp A better or not? Will you now be illuminated (for lack of a more appropriate word) and never have bias again when sighted testing? Or will you have bias, but not for those two particular amps?

Yeah, I've had the same thoughts about JA's story as many here have voiced. It's pretty question-begging.

However, you've raised the type of question I've brought up too: how does the individual handle the results of sighted/DBT listening?

For me: knowledge of bias keeps me cautious about making strong objective claims based on subjective experience.

But then there is the issue of the decisions we make personally.

I use blind testing sometimes to very helpful effect. Other times I don't bother because it's a hassle, or I just don't care enough, or I actually don't mind whether what I'm hearing is placebo/bias or not.

For instance, decades ago I tried some sort of footers under my CD player, back when they became all the rage. I was skeptical, but seemed to perceive a slight, pleasing sonic difference. But, again, I was also skeptical. Nobody could give me a satisfying explanation as to why the footers ought to change the sound of a CD player at all. But they were cheap and I had already paid for them. Hence, I just left them under the player. I figured, yeah, probably a placebo-type effect, but I'll take it. Not much skin off my nose. (I eventually sold my CD player, and wouldn't bother with any such tweak now.)

Other times when it was something more expensive, I did some blind testing - e.g. high end AC cables, video cables etc. My experience led me to save money on high end cables :)

But then there is the issue of speakers, which for me brings in your question more acutely. Like I've mentioned several times on the forum, I found myself enjoying certain speaker designs that would be unlikely to "win" in blind testing against something like a Revel speaker. And having auditioned the Revel speakers and the other brands numerous times, I preferred the other brand. So what do I do with that? It's possible I'm in the outlier category where I would in fact prefer that speaker if blind tested. Or the odds are that I would select the Revel speaker under blind conditions. If I presume that it's most likely I'd prefer a Revel speaker under blind conditions - something I can't test - do I buy the Revel? My personal decision is: no. I go with the one I prefer in sighted conditions. Because whatever the cause, I can't seem to shake the effect. I just find myself much more enthusiastic about listening to the other speakers. If part of that derives from some bias effects, so be it, I guess. I mean, I either spend big bucks on the speaker that didn't seem to move me in sighted auditions, or on the one that did. It seems a bit more of a risk to think, "Well, under the conditions I'll actually use the speaker I preferred speaker B, but since I may prefer speaker A under blind conditions, I'll buy that one instead and hope to heck I'll suddenly like it more."

But this is a really personal take, depending on one's experience-derived criteria and goals. Some people aren't so interested in paying attention to "the sound of a speaker" so much as they are interested in simply knowing a component is as low-distortion/neutral as possible. And once they get that, they can just accept however music sounds through the speaker. Which is completely reasonable too, given that mindset.
 

tmtomh

Major Contributor
Forum Donor
Joined
Aug 14, 2018
Messages
2,808
Likes
8,235

This is all very reasonable in my view.

IMHO the crucial issue with sighted comparisons is this: if sighted preferences are different from blind preferences, that would suggest that something about the sighted preference is not "true," in the sense that it has nothing to do with an intrinsic aspect of the gear's sound reproduction.

Now, there's nothing wrong with that: if a speaker looks really cool, and that triggers some kind of reaction in your brain that makes you feel happier when you listen to it - or maybe even makes you focus ever so slightly more on the sound when it's playing, leading you to actually hear more detail - then why not go for it and buy that speaker?

The question, though, is how durable the sighted-listening preference is. If the speaker's performance doesn't change over time, but you get used to its appearance and are no longer particularly tickled by it, then in the above example the sighted preference would disappear. Worse yet, you might never know it disappeared, because you would have owned the speaker for several months or years and it would have been a long time since you last compared it with other speakers. Instead of noticing the placebo effect wearing off, you might instead experience the common audiophile upgrade itch without knowing exactly why, e.g. "I've always loved these speakers, and I still do, but I wonder what else might be out there - is there another step forward I can take in the journey?"

I am not saying you personally are in this scenario. I'm just saying that's where my question and concern about sighted preferences lie.
 

magicscreen

Senior Member
Joined
May 21, 2019
Messages
300
Likes
177
For obvious differences, you certainly will. And for differences you can't really hear, results will be in the noise, resembling random guessing. Test sensitivity refers to how well the test can detect differences in between these extremes: results that are not 100%, yet still consistently better than random guessing. This happens because our perception gets less reliable near minimum thresholds. There's a point where one can just barely hear the difference: it's so subtle you occasionally get it wrong, yet you still get it right more often than random guessing would. The difficulty is that pure random guessing occasionally gets lucky too, so how do you tell the two apart?
PS: the answer is to do more trials, which tests whether results slightly better than guessing are consistent, thus raising the confidence %. But the increased number of trials should be split across different sessions to prevent listener fatigue from impairing the results.
Listener fatigue
I can always do five or six good tries in a row, but after that I usually make a mistake and stop the session.
So can I add them up and have one successful blind test instead of three failed ones?
So I can hear the difference between DACs after all:
5/6 + 6/7 + 6/7 = 17/20 :)
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,852
Likes
37,807
Sure you can add them up like that. Assuming the test was done right. Double blind, level matched with a voltmeter etc. etc.
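For what it's worth, if the pooled 17-of-20 score is taken at face value (whether pooling is legitimate with this kind of stopping rule is exactly what the next posts argue about), the getPercentile function from earlier in the thread puts a number on it:

Code:
>>> print(getPercentile(17, 20))   # roughly 0.9987, i.e. about a 0.13% chance of guessing that well or better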
 

TSB

Active Member
Joined
Oct 13, 2020
Messages
189
Likes
294
Location
NL
Listener fatigue
I can always do five or six good tries in a row, but after that I usually make a mistake and stop the session.
So can I add them up and have one successful blind test instead of three failed ones?
So I can hear the difference between DACs after all:
5/6 + 6/7 + 6/7 = 17/20 :)
Stopping your test based on a result totally invalidates it.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,852
Likes
37,807
I don't think he stopped it on a result. He stopped due to fatigue. Now, if he started doing the test and kept going until he eventually hit a run of 5 straight, and picked only those, then yes, that is cherry-picking results.

If he does 5 in a row each day for a few days and that gets him 5 of 5 each time, then adding those up is quite reasonable; there's nothing wrong with it.
 

TSB

Active Member
Joined
Oct 13, 2020
Messages
189
Likes
294
Location
NL
I can't square that interpretation with the description he gave: "after that I usually make a mistake and stop the session". The 5/6, 6/7, 6/7 (always one failure at the end) points in the same direction.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,852
Likes
37,807
Nevertheless, if he did go each session until he got a single miss, and did so for the three sessions as described, those results would pass the threshold for significance of him hearing something different for real.

Now if the results were 1 of 2, 2 of 3, and 2 of 3, you'd be right: do that long enough and it looks significant when it isn't. You are also correct that if his method is to go until he misses one and then stop, adding them together can provide a false picture.
 

SoundAndMotion

Active Member
Joined
Mar 23, 2016
Messages
144
Likes
111
Location
Germany
If he does 5 in a row each day for a few days and that gets him 5 of 5 each time, then adding those up is quite reasonable; there's nothing wrong with it.
Agreed.
Nevertheless, ...
No. No nevertheless...
He was talking about stopping on a result, which you agreed was wrong:
... if his method is to go until he misses one and then stop, adding them together can provide a false picture.
As you said:
Sure you can add them up like that. Assuming the test was done right. Double blind, level matched with a voltmeter etc. etc.
One of those etc.'s must include:
Stopping your test based on a result totally invalidates it.
 

TSB

Active Member
Joined
Oct 13, 2020
Messages
189
Likes
294
Location
NL
It's possible to conduct a single experiment where you terminate the test based on the first failure.
The probability of 5 correct guesses followed by 1 wrong guess is (0.5)^5 * (0.5) = 0.0156. That might be considered statistically significant in itself. Let me repeat: this assumes you do a single sequence of guesses only. It's your first and only experiment.

We run into problems when you conduct multiple series of tests. You can't add together the numbers for the different tests if you've used failure as a stopping criterion. Intuitively: because you've limited the number of failures per test to 1, adding these 1s together doesn't mean anything.
But: you're also not allowed to do multiple series of tests and then calculate probabilities using just 1 of them. That would be cherry-picking your data.

Conclusion: combining the probabilities for tests that were stopped using the first failure as a criterion is more complicated than just adding the numbers. And you're not allowed to calculate using just one of the tests.

All of this assumes you recorded all your test series. You're never allowed to discard data after looking at the results. Judging from the loose description, I doubt that was the case here.
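One way to see what "more complicated than just adding the numbers" means in practice is to compare the pooled score against the null distribution of the actual stop-at-first-miss design, for example by simulation. The sketch below is purely illustrative: the three-session structure, the per-session safety cap, and the choice of pooled proportion as the test statistic are assumptions for the example, not anything prescribed here.

Code:
import random

# Simulate one session of pure guessing that stops at the first miss
# (capped at max_trials as a safety limit).
def simulate_session(max_trials=50):
    correct = 0
    for _ in range(max_trials):
        if random.random() < 0.5:
            correct += 1
        else:
            return correct, correct + 1   # hits, total trials (last trial was the miss)
    return correct, correct

# Pooled proportion correct across several such sessions of pure guessing.
def pooled_proportion(n_sessions=3):
    hits = trials = 0
    for _ in range(n_sessions):
        h, t = simulate_session()
        hits += h
        trials += t
    return hits / trials

# How often does pure guessing, under this stopping rule, look at least as good
# as the observed 17/20?
observed = 17 / 20
runs = 100000
as_good = sum(pooled_proportion() >= observed for _ in range(runs))
print("Monte Carlo p-value under the stop-at-first-miss design:", as_good / runs)

Comparing that number with the fixed-n binomial p-value for 17 of 20 shows how much, or how little, the stopping rule changes the picture in this particular case.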
 

SoundAndMotion

Active Member
Joined
Mar 23, 2016
Messages
144
Likes
111
Location
Germany
Using failure as a stopping criterion is a form of significance-chasing or p-chasing, which has been the topic of several recent papers. It is important to set the number of trials before starting, unless using an established adaptive method with accepted stopping criteria (e.g. up/down, PEST, Quest or Psi), which is not applicable to ABX.
There can be several reasons to take a fatigue break, but basing the break on responses (results) is not allowed. We use:
- the subject requests a fatigue break, and
- a forced short break after a specific time (15-30 min, depending on the task)
- a forced break of a day or more after a specific time or block of trials (1-2 hrs, depending on the task)
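On setting the number of trials before starting: one quick way to do that is to work out, for a candidate trial count, how many correct answers would be needed to clear the chosen confidence target. A small sketch, assuming Python 3.8+ for math.comb and using a 95% target purely as an example:

Code:
import math

# P(k or more correct out of n by pure guessing)
def guess_tail(k, n):
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Smallest score that clears the target for a fixed, pre-declared n
def required_correct(n, alpha=0.05):
    for k in range(n + 1):
        if guess_tail(k, n) <= alpha:
            return k
    return None   # n is too small; even a perfect score isn't enough

for n in (8, 10, 12, 16, 20):
    print(n, "trials ->", required_correct(n), "correct needed")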
 

TSB

Active Member
Joined
Oct 13, 2020
Messages
189
Likes
294
Location
NL
Those are best practices, but it's definitely possible to design an experiment using failure as a stopping criterion.
 

TSB

Active Member
Joined
Oct 13, 2020
Messages
189
Likes
294
Location
NL
There is no contradiction here: his experiment is invalid, but it's possible to design a valid experiment using failure as a stopping criterion. If you want, I can give you a little statistics 101 ;)
 

MRC01

Major Contributor
Joined
Feb 5, 2019
Messages
3,500
Likes
4,132
Location
Pacific Northwest
...I can always do five or six good tries in a row, but after that I usually make a mistake and stop the session.
So can I add them up and have one successful blind test instead of three failed ones?
...
I was going to answer but the others beat me to it. :)
PS: did those 3 tests really fail? Each was better than random guessing. Whether they failed depends on your target confidence %tile.
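For reference, the getPercentile function from earlier in the thread gives about 89% confidence for 5 of 6 and about 94% for 6 of 7 (the exact values are 57/64 and 15/16), so each session beats guessing but none of them reaches a 95% target on its own:

Code:
>>> print(getPercentile(5, 6))   # 0.890625
>>> print(getPercentile(6, 7))   # 0.9375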
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,852
Likes
37,807
Up down testing is based upon finding the point of failure.
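For readers unfamiliar with the term, up/down testing usually refers to an adaptive staircase: the stimulus difference is made smaller after correct responses and larger after misses, so the track hovers around a threshold level rather than running for a fixed score. Below is a minimal sketch of the common 2-down/1-up variant; the listener model and step sizes are made up purely for illustration, and a real test would present actual stimuli:

Code:
import random

# Toy listener model (an assumption for the sketch): the larger the stimulus
# difference 'level', the more often the response is correct.
def simulated_listener(level):
    p_correct = 0.5 + 0.45 * min(max(level, 0.0), 1.0)
    return random.random() < p_correct

# 2-down/1-up staircase: two correct in a row -> make it harder, any miss ->
# make it easier. This rule converges near the level giving ~70.7% correct.
def two_down_one_up(start=1.0, step=0.1, max_trials=200):
    level, streak, last_direction = start, 0, 0
    reversal_levels = []
    for _ in range(max_trials):
        if simulated_listener(level):
            streak += 1
            if streak == 2:
                streak = 0
                if last_direction == +1:
                    reversal_levels.append(level)   # direction change: up -> down
                last_direction = -1
                level = max(level - step, 0.0)
        else:
            streak = 0
            if last_direction == -1:
                reversal_levels.append(level)       # direction change: down -> up
            last_direction = +1
            level += step
    recent = reversal_levels[-6:]   # average the later reversals as the estimate
    return sum(recent) / len(recent) if recent else level

print("Estimated threshold level:", two_down_one_up())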
 

SoundAndMotion

Active Member
Joined
Mar 23, 2016
Messages
144
Likes
111
Location
Germany
I used up/down testing for some threshold studies, and finding the "point of failure" is a non sequitur. How did you mean it?
 