Relevance of Blind Testing

MRC01 · Nov 18, 2020

andreasmaaan said:
...
Let me make one last argument. At the moment, we've arbitrarily broken this 21-trial test into 3 x 7-trial tests. Why not keep going, and break it down into 21 x 1-trial tests? What are the consequences of that according to the logic you're applying to the 3 x 7-trial tests?

If you agree with me that treating any n-trial test as n separate tests is the wrong approach, how can you justify in this case choosing 7 as the relevant number of trials in each sub-test? It seems completely arbitrary to me. And then impossible to avoid falling into a problem similar to Zeno's Paradox...

Yep, that's the same conundrum that I had in mind for the "independent events" approach.

Blumlein 88 said:
... Let us try this thought experiment. ... Does this make sense?

I agree the "aggregate the trials" approach seems reasonable. Here's the conundrum it leads to. Perhaps we can resolve this.

Scenario: On day 1, Mary takes an ABX test consisting of 7 trials and gets 5 correct. On day 2, she takes another ABX test consisting of 8 trials and gets 5 correct. On day 3, she takes another ABX test consisting of 6 trials and gets 5 correct.
All 3 tests were properly performed and had the same selections for A and B.

Now the test agency doesn't tell Mary how many she got right. They only tell her the confidence scores of 77.3%, 63.7% and 89.1%. She asks us, what is the overall confidence of her results across all 3 tests? According to the "aggregate the trials" approach, it is impossible for us to compute this. To do that we must know the test details: the trials and scores of each test.

So the conundrum is: we know the confidence of each test, yet we can't compute their joint probability? That doesn't seem right. Why can't we compute the joint probability? Event 1 is "Mary was guessing on test 1", its probability is 22.7%. Event 2 is "Mary was guessing on test 2", its probability is 36.3%, etc. These events are independent and their joint probability should be .227 * .363 * .109 = 0.9% chance that Mary was guessing on all 3 tests, or 99.1% confidence that she was not guessing on at least 1 of the tests.
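For reference, the figures in this scenario can be reproduced with a short Python sketch. It assumes each "confidence" is one minus the one-sided binomial tail probability of scoring at least that well by pure guessing (p = 0.5). The names `tail_probability` and `sessions` are illustrative only, not part of any test agency's tooling.

```python
from math import comb

def tail_probability(correct: int, trials: int) -> float:
    """P(at least `correct` right out of `trials`) for a pure guesser (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Mary's three sessions as (correct, trials), taken from the scenario.
sessions = [(5, 7), (5, 8), (5, 6)]

tails = [tail_probability(c, n) for c, n in sessions]
for (c, n), p in zip(sessions, tails):
    print(f"{c}/{n}: tail = {p:.3f}, confidence = {1 - p:.1%}")

# Multiply the three tails, as in the "independent events" arithmetic.
product = 1.0
for p in tails:
    product *= p
print(f"product of the three tails = {product:.4f}")
```

The tails come out at about 0.227, 0.363 and 0.109, and their product at about 0.009, matching the numbers quoted in the scenario.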

andreasmaaan · Nov 18, 2020

MRC01 said:
So the conundrum is: we have 3 different / independent events, why can't we compute the joint probability? Event 1 is "Mary was guessing on test 1", its probability is 22.7%. Event 2 is "Mary was guessing on test 2", its probability is 36.3%, etc. These events are independent and their joint probability should be .227 * .363 * .109 = 0.9% chance that Mary was guessing on all 3 tests, or 99.1% confidence that she was not guessing on at least 1 of the tests.

Yes, but let's imagine this happened: Mary was guessing on one of the tests, but was not guessing on the other two.

Does this place us within the 3.9% probability that she was guessing on all three tests, or the 96.1% probability that she was not guessing on all three tests?

Blumlein 88 · Nov 18, 2020

MRC01 said:
Yep, that's the same conundrum that I had in mind for the "independent events" approach.

I agree the "aggregate the trials" approach seems reasonable. Here's the conundrum it leads to. Perhaps we can resolve this.

Scenario: On day 1, Mary takes an ABX test consisting of 7 trials and gets 5 correct. On day 2, she takes another ABX test consisting of 8 trials and gets 5 correct. On day 3, she takes another ABX test consisting of 6 trials and gets 5 correct.
All 3 tests were properly performed and had the same selections for A and B.

Now the test agency doesn't tell Mary how many she got right. They only tell her the confidence scores of 77.3%, 63.7% and 89.1%. She asks us, what is the overall confidence of her results across all 3 tests? According to the "aggregate the trials" approach, it is impossible for us to compute this. To do that we must know the test details: the trials and scores of each test.

So the conundrum is: we know the confidence of each test, yet we can't compute their joint probability? That doesn't seem right. Why can't we compute the joint probability? Event 1 is "Mary was guessing on test 1", its probability is 22.7%. Event 2 is "Mary was guessing on test 2", its probability is 36.3%, etc. These events are independent and their joint probability should be .227 * .363 * .109 = 0.9% chance that Mary was guessing on all 3 tests, or 99.1% confidence that she was not guessing on at least 1 of the tests.

Now you are changing the conditions once again. We have apples and oranges, and you have just added grapefruit, and now you are asking why grapefruit doesn't sort out like apples. The answer is that no one expected it to. You are also mixing permutations and combinations, while our statistics are set up to tell us about combinations, and then wondering why the predictions for permutations aren't the same.

You also can't say whether or not Mary was guessing. A certain percentage of the time at random Mary will give responses as if she knows for certain even though she is guessing.

MRC01 · Nov 18, 2020

So what you're saying is that in the above example, it is impossible to compute Mary's overall confidence across all tests.
That is, if all we know is that Mary took 3 tests at 77.3%, 63.7% and 89.1% confidence respectively, we can't compute her overall confidence across all 3 tests. To do that we need more information: namely, the trials (X of Y) on each test.
Is that right?

andreasmaaan · Nov 18, 2020

MRC01 said:
So what you're saying is that in the above example, it is impossible to compute Mary's overall confidence across all tests.
That is, if all we know is that Mary took 3 tests at 77.3%, 63.7% and 89.1% confidence respectively, we can't compute her overall confidence across all 3 tests.

That's my view, yeh.

The confidence level we've been using is valid only in respect of a binomial distribution. It can't be applied to three binomial distributions as though they were one, because this would effectively impose additional restrictions on the ordering of outcomes, which would cause the resulting distribution to no longer be binomial.

MRC01 said:
To do that we need more information: namely, the trials (X of Y) on each test.
Is that right?

To be pedantic, we'd only need to know the total n and k - not the specific n and k for each test. But essentially yes, this is my (layperson's) view, for the reasons mentioned.
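As a rough check of the pooled approach, here is a minimal Python sketch that sums Mary's trials from the scenario (15 correct of 21) and computes the one-sided binomial tail for a pure guesser. The helper name `tail_probability` is illustrative, and treating the pooled sessions as a single binomial test assumes the trials are independent and were all run the same way.

```python
from math import comb

def tail_probability(correct: int, trials: int) -> float:
    """P(at least `correct` right out of `trials`) for a pure guesser (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Mary's three sessions as (correct, trials), taken from the scenario.
sessions = [(5, 7), (5, 8), (5, 6)]

total_correct = sum(c for c, _ in sessions)  # 15
total_trials = sum(n for _, n in sessions)   # 21

pooled = tail_probability(total_correct, total_trials)
print(f"{total_correct}/{total_trials}: tail = {pooled:.3f}, confidence = {1 - pooled:.1%}")
```

Only the totals enter the calculation, so the per-session split is not needed. The pooled tail comes out at about 0.039, i.e. roughly 96.1% confidence.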

RayDunzl · Nov 18, 2020

This thread has become misnamed.

Old - "Relevance of Blind Testing"

change to

New - "Relevance of Statistical Analysis With Small Populations and Moving Targets"

Blumlein 88 · Nov 18, 2020

MRC01 said:
So what you're saying is that in the above example, it is impossible to compute Mary's overall confidence across all tests.
That is, if all we know is that Mary took 3 tests at 77.3%, 63.7% and 89.1% confidence respectively, we can't compute her overall confidence across all 3 tests. To do that we need more information: namely, the trials (X of Y) on each test.
Is that right?

No, I'm not saying that.

We know Mary took three tests. We can compute her probability of guessing for any possible outcome. However, you keep wanting to choose a particular singular possible result for each sub-test and then wonder why that gives a different probability result than if you do the calculation for all the aggregate possible results. It is because you are computing different parameters.

You are cherry-picking the answer ahead of time, and confusing that with predictive probability. If you specify each portion, then you are looking at a much less common possible result. That, however, doesn't imply a higher confidence level in Mary's chances overall, only for the specific results you are arbitrarily choosing.

I'm at a loss at this point for how to illustrate this in a way that removes the sticking point in your thinking.

We can model every possible result Mary can have in a test you describe and then see how often the exact result you give us occurs. But if you then compare that with other different specific results the probabilities differ. We can figure it either way, but when A is not B then A will not equal B.

Pio2001 · Nov 19, 2020

I think that the key point is that the method of analysis must be defined before the test, and not changed after the test according to the result. That's cheating.

It is because you saw that Mary's results were evenly distributed across the three sessions that you got the idea of combining these as three independent results.
Without that information, maybe you wouldn't have thought about this possibility.

A similar example exists in many real-life ABX meetings. For example, three people decide to meet and to ABX two amplifiers. We agree about which score is a success, and which is a failure.
And then, another listener comes in, and another, and oh, that's cool, can I come too? Sure, bring your friends!...
Then you have to explain to everyone that the score that was agreed to be a success is now a failure because of the much higher probability that someone, among all the attendees, reaches this score. And that proving that two amplifiers have a different sound is much more difficult with 10 listeners than with one.

The usual answer to this is "no problem: please aggregate the scores of everyone".
And then you have to explain that in this case, if one of them has a score of 10/10 and everyone else 5/10, it will be a failure. Is this really what we want?

That's why the protocol of the test, and the conditions for success must be carefully decided before the test starts.
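To put rough numbers on both effects, here is a minimal Python sketch. It assumes 10 trials per listener and a pre-agreed pass mark of 9 of 10. Both values are hypothetical, chosen only for illustration, as are the function names.

```python
from math import comb

def tail_probability(correct: int, trials: int) -> float:
    """P(at least `correct` right out of `trials`) for a pure guesser (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def chance_someone_passes(listeners: int, correct: int, trials: int) -> float:
    """P(at least one of `listeners` independent guessers reaches the pass mark)."""
    single = tail_probability(correct, trials)
    return 1 - (1 - single) ** listeners

# Hypothetical protocol: 10 trials each, pass mark 9 of 10.
for listeners in (1, 3, 10):
    p = chance_someone_passes(listeners, correct=9, trials=10)
    print(f"{listeners:2d} listener(s): chance a guesser passes = {p:.1%}")

# Pooling instead: one listener at 10/10 and nine at 5/10 gives 55 of 100.
pooled = tail_probability(10 + 9 * 5, 10 * 10)
print(f"pooled 55/100: tail = {pooled:.3f}")
```

Under these assumptions, the chance that some pure guesser reaches the pass mark grows from about 1% with one listener to about 10% with ten, while the pooled 55/100 score has a tail of roughly 18% and so would not count as a success.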

MRC01 · Nov 19, 2020

Pio2001 said:
I think that the key point is that the method of analysis must be defined before the test, and not changed after the test according to the result. ...

Yep.

Pio2001 said:
... Then you have to explain to everyone that the score that was agreed to be a success is now a failure because of the much higher probability that someone, among all the attendees, reaches this score. And that proving that two amplifiers have a different sound is much more difficult with 10 listeners than with one. ...

That's a double-edged sword. If one person's raw score is slightly better than guessing, he was probably just lucky. But if 10 people each score slightly better than guessing, it's less likely to be luck. Raw scores slightly better than guessing warrant more confidence if they are repeated or consistent.

This is why the confidence percentiles aren't linear. 6 right of 10 is only 62% confident. But 60 right of 100 is 97% confident.
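These two figures can be checked with a minimal Python sketch, assuming "confidence" means one minus the one-sided binomial tail for a pure guesser (p = 0.5). The function name `confidence` is illustrative.

```python
from math import comb

def confidence(correct: int, trials: int) -> float:
    """1 - P(at least `correct` right out of `trials`) for a pure guesser."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials
    return 1 - tail

# Same 60% raw score, very different confidence once the trial count grows.
for correct, trials in [(6, 10), (60, 100)]:
    print(f"{correct}/{trials}: confidence = {confidence(correct, trials):.1%}")
```

The same 60% hit rate gives roughly 62% confidence over 10 trials and roughly 97% over 100.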

andreasmaaan · Nov 19, 2020

MRC01 said:
That's a double-edged sword. If one person's raw score is slightly better than guessing, he was probably just lucky. But if 10 people each score slightly better than guessing, it's less likely to be luck.

The number of people is not relevant in this case. Rather, it's the number of trials. If one person does 20 trials and gives X correct responses, the probability that they were guessing is the same as if 10 people do 20 trials between them and give X correct responses in total.

OTOH, if in a group of 10 subjects who each do an adequate number of trials there is only one subject whose correct response rate suggests they were not guessing, that may be relevant.

It all depends on whether we want to know (a) whether there exists anyone who can discern a difference or (b) whether on average people can discern a difference.

TSB · Nov 19, 2020

Pio2001 said:
I think that the key point is that the method of analysis must be defined before the test, and not changed after the test according to the result. That's cheating.

It is because you saw that Mary's results were evenly distributed across the three sessions that you got the idea of combining these as three independent results.
Without that information, maybe you wouldn't have thought about this possibility.

A similar example exists in many real-life ABX meetings. For example, three people decide to meet and to ABX two amplifiers. We agree about which score is a success, and which is a failure.
And then, another listener comes in, and another, and oh, that's cool, can I come too? Sure, bring your friends!...
Then you have to explain to everyone that the score that was agreed to be a success is now a failure because of the much higher probability that someone, among all the attendees, reaches this score. And that proving that two amplifiers have a different sound is much more difficult with 10 listeners than with one.

The usual answer to this is "no problem: please aggregate the scores of everyone".
And then you have to explain that in this case, if one of them has a score of 10/10 and everyone else 5/10, it will be a failure. Is this really what we want?

That's why the protocol of the test, and the conditions for success must be carefully decided before the test starts.

Nice example.

More specific to this discussion, the definition of success (hypothesis under investigation) has to be chosen before looking at the data. See here for an explanation.

Newman · Dec 1, 2020

BaaM said:
....you also have to take into consideration that the vast majority of audio blind tests result in failure, no matter what you compare....

Firstly, "no matter what you compare" is completely wrong.

Secondly, an audio blind test that concludes "no difference" should not be said to result in "failure".

Thirdly, the only reason so many blind tests conclude "no difference" is that people think most blind tests would be pointless because the difference is so obvious to them that they think there is nothing to test, so these tests never happen. They always want to test something they think is under debate, like interconnects, or power cords, or not-broken amps. Hence the "vast majority" of "no difference" blind test outcomes. It's caused by the nature of the selection criteria that people want to adopt.

BostonJack said:
When making equipment choices, I believe in factoring in feelings or inclinations that would be ascribed to sighted bias in a listening test. Not paying attention to what brings you satisfaction is like going to the doctor and thinking "I don't really have any evidence that this doctor is capable of curing me, so I better discount him or her". Better to embrace the placebo effect, believe in the physician, swallow the tiny bitter pills with confidence. After all, placebo effects are the route to healing in many instances.

Perfectly happy with that. In fact I have made similar suggestions in the past.

What irks me is the very many people who, having made a personal observation from casual listening to a piece of gear, think it was all in the sound waves and 'go into print' with their claims about 'obvious audible differences'. And it's not just enthusiastic audiophiles in the forums: look at almost any audio gear review in the media. They give the very, very clear impression that they are talking about how the sound waves have been changed by the gear, and that they are hearing it.

Peeves me right off.

pkane said:
If you've participated in any discussions with subjectivists on the value of blind testing, the claim is invariably made that blind tests, including ABX, are not as sensitive as long-term evaluations. Common claims are made about fatigue, unnatural fast switching, the pressure of taking a test, the need to "absorb subconsciously" the differences over time, etc.

Don't forget the left-brain/right-brain excuse.

SIY · Dec 1, 2020

Newman said:
Don't forget the left-brain/right-brain excuse.

Or alpha waves. I had to admit that was a new one for me, but Max Townshend never lacked for creativity in his scams.

Pio2001 · Dec 1, 2020

Personally, I don't like placebo.

When I'm buying a device, I want it to work even if I'm not looking at it. I'm not interested in something that stops working when I don't know that it is working.

Also, placebo creates addiction. If I need a fancy cable in order to appreciate good sound, it means that if the fancy cable is not there, then even if I'm listening to the best system in the world, all I'm going to hear is crap? No way! I want to appreciate good sound when I hear it, and not only when a $1000 interconnect is there.

3dbinCanada · Dec 23, 2020

polmuaddib said:
Thanks. So if you look at it from a subjectivist view, when choosing a new hifi component, it is important to do a sighted test. Because, in the end, you are gonna look at it and listen to it and attribute all those qualities that the brain assigns to the sound.
Objectively two devices can sound the same, even measure the same, but you do hear a difference when you know what component you are listening to, and this difference can't be measured. But it is there all the same. It is in our heads, but it is real, right?

Thinking like this is dangerous: it is the basis of all snake oil in the audio industry. It wrongly gives credence to biased listening tests and to incorrect claims by audiophools that their hearing is more sensitive than test equipment.

Wes · Dec 23, 2020

it is important to do a sighted 'session' - do it after you have a group of units that do not sound any different based on DBT, or at min. SBT

but that session should include more than just looks - evaluate control functionality and ergonomics

then buy the one you like

Newman · Dec 24, 2020

@Wes That’s not what ‘sighted session’ means: they are talking about listening sessions, not visual appraisal and UI.

Look, I agree with @polmuaddib that, when you are listening at home to your Hifi, you are doing it ‘sighted’, and your ‘perceptual programs’ are running, so it’s self-defeating to ignore the experiences they are creating for you. And you can’t turn them off with willpower, just because you know it’s not in the sound waves. I endorse anyone buying what creates the most valued experiences for them, sonically. Be warned, though: a wonderful sonic experience created today by things other than sound waves, can be fickle (here today, gone tomorrow, for any number of reasons).

What I don’t endorse is people confusing such experiences with sound waves, and rushing into print to tell the world how much better product X is at creating sound waves. THAT is what feeds the snake oil industry, @3dbinCanada

Wes · Dec 24, 2020

Yes - I know (unfortunately)

3dbinCanada · Dec 24, 2020

Newman said:
@Wes That’s not what ‘sighted session’ means: they are talking about listening sessions, not visual appraisal and UI.

Look, I agree with @polmuaddib that, when you are listening at home to your Hifi, you are doing it ‘sighted’, and your ‘perceptual programs’ are running, so it’s self-defeating to ignore the experiences they are creating for you. And you can’t turn them off with willpower, just because you know it’s not in the sound waves. I endorse anyone buying what creates the most valued experiences for them, sonically. Be warned, though: a wonderful sonic experience created today by things other than sound waves, can be fickle (here today, gone tomorrow, for any number of reasons).

What I don’t endorse is people confusing such experiences with sound waves, and rushing into print to tell the world how much better product X is at creating sound waves. THAT is what feeds the snake oil industry, @3dbinCanada

I see that we agree.

3dbinCanada · Dec 24, 2020

Wes said:
it is important to do a sighted 'session' - do it after you have a group of units that do not sound any different based on DBT, or at min. SBT

but that session should include more than just looks - evaluate control functionality and ergonomics

then buy the one you like

Once you realize that AVRs sound the same without DSP and room correction engaged, then I fully agree with what you are saying.
