
Limitations of blind testing procedures

OP
oivavoi

Major Contributor
Forum Donor
Joined
Jan 12, 2017
Messages
1,721
Likes
1,934
Location
Oslo, Norway
But this is a distraction. We are still at the point of not being able to say that people don't lose (some of) their audio-discerning abilities when they know they are taking part in an experiment. As long as this is true, audiophiles will never believe what your tests show (and possibly nor should they). The consensus seems to be that it is "highly unlikely" that we lose our listening abilities, but without any attempt to demonstrate whether that is true, we cannot know.

My main "problem" with ABX tests is not that I know that I'm being tested, but rather the thing with sensory smearing. It's like my brain smears the sensory inputs all over, and everything becomes a soup where I'm only able to differentiate between the really big things. To really be able to perform well at ABX tests, I think you need to train at it like Amir has done. I know a couple of other Norwegian guys as well, who really take blind testing seriously for their home hifis. They have been able to ABX dacs for example. I'm not even close to doing that.

Anyway, there are people who try to investigate scientifically how different testing procedures work... I recently came across this AES conference paper:
http://www.denismartin.net/wp-content/uploads/2017/01/Martin-et-al.pdf

In scientific disciplines that employ experiments (i.e. in those disciplines where the papers don't sound like gibberish to me), I often find conference papers to be more valuable than peer-reviewed journal papers, because of the publication bias toward positive findings.

Anyway: a group of researchers from Canada tried to see whether other kinds of tests could reveal differences that simple blind testing could not. In blind tests on audio engineers, the statistical limit for hearing differences between bitrates seems to be 256 kbps - beyond that, it sounds the same to most people when consciously doing ABX tests. So these guys wanted to see whether audio engineers would perform mixing tasks differently when mixing on 256 kbps material versus uncompressed 16-bit material, when they didn't know what they were listening to - and didn't know what the experiment was actually about. The hypothesis is that if there are indeed perceptually meaningful differences here, then audio engineers will mix in a somewhat different way on 256 kbps material than on 16-bit material.

They had 27 participants and didn't find any statistically significant results (that would be difficult with so few participants, I think). Still, the general "direction" of the findings is in line with the hypothesis that there are perceptually meaningful differences between 256 kbps and 16-bit. But the differences were not very large.
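
To make the small-sample point concrete, here is a minimal sketch of the arithmetic (a simplification: it assumes one binary detected/not-detected outcome per participant with 50% chance under the null, which is not exactly the paper's mixing-task design):

```python
# Minimal sketch: what 27 participants can and cannot show under a binomial model.
from math import comb

N = 27

def p_at_least(k, n, p=0.5):
    """One-sided binomial tail: P(X >= k) when each trial succeeds with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Smallest success count that reaches one-sided p < .05 under pure guessing:
threshold = next(k for k in range(N + 1) if p_at_least(k, N) < 0.05)
print(f"{threshold}/{N} correct ({threshold/N:.0%}) needed for p < .05")  # 19/27, ~70%

# Statistical power: if the true detection rate were 60%, how often would an
# experiment of this size reach that threshold?
print(f"Power at a true 60% detection rate: {p_at_least(threshold, N, p=0.6):.0%}")  # ~18%
```

In other words, with 27 participants even a true 60% detection rate would reach significance less than one time in five, so a null result says little either way.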

My take-aways from this study are:
a) that 256 kbps is just fine for listening to music;
b) that there probably are perceptually meaningful differences between 256 kbps and 16-bit;
c) that it lends some very, very small support to the theory that ABX tests may mask objective differences that might be perceived under other circumstances.
 

Thomas savage

Grand Contributor
The Watchman
Forum Donor
Joined
Feb 24, 2016
Messages
10,260
Likes
16,298
Location
uk, taunton
I'm talking about the distortion generated by the playback system - this is easy to hear with most systems by moving your head close to the tweeter on one side while playback is at elevated levels. If this has a harsh, spitty, irksome, unpleasant, "distorted" quality, especially with 'challenging' recordings, well, that's distortion. That's something that goes away when the system is in good shape ...
No doubt there are known, measurable and understood distortions that affect imaging and/or can be used to affect the presentation...

These mystery distortions that have no name and have only been discovered in your imagination have no relevance to the discussion here.

So I don't want to read any more about them... unless you have managed to quantify them and show their effect in the real world.
 

Blumlein 88

Grand Contributor
Forum Donor
Joined
Feb 23, 2016
Messages
20,524
Likes
37,057
My main "problem" with ABX tests is not that I know that I'm being tested, but rather the thing with sensory smearing. It's like my brain smears the sensory inputs all over, and everything becomes a soup where I'm only able to differentiate between the really big things. To really be able to perform well at ABX tests, I think you need to train at it like Amir has done. I know a couple of other Norwegian guys as well, who really take blind testing seriously for their home hifis. They have been able to ABX dacs for example. I'm not even close to doing that.

Anyway, there are people who try to investigate scientifically how different testing procedures work... I recently came across this AES conference paper:
http://www.denismartin.net/wp-content/uploads/2017/01/Martin-et-al.pdf

In scientific diciplines where they employ experiments (i.e. in those disciplines where the papers don't sound like gibberish to me), I often find conference papers to be more valuable than published peer-review papers, due to the publication bias to generate positive findings.

Anyway: A group of researchers from Canada tried to see whether other kinds of tests could reveal differences that simple blind testing could not reveal. In blind tests on audio engineers, the statistical limit for hearing differences between bitrates seems to be 256 kbs - beyond that, it sounds the same to most people when consciously doing ABX-tests. So these guys wanted to see if audio egineers would perform mixing tasks differently when mixing on 256 kbs and 16-bit material, when they didn't know what they were listening to - and didn't know what the experiment was actually about. The hypothesis is that if there are indeed differences here that are perceptually meaningful, then audio engineers will mix in a somewhat different way when mixing on 256 kbs as compared to 16-bit.

They had 27 participants, and didn't find any statistically significant results (that would be difficult with so few participants, I think). Still, the general "direction" of the findings are in line with the hypothesis that there are diffences between 256 kbs and 16-bit which are perceptually meaningful. But the differences were not very large.

My take-away from this study is:
a) that 256 kbs is just fine for listening to music
b) that there probably are perceptually meaningful differences between 256 kbs and 16-bit
c) it lends some very very small support to the theory that ABX test may mask objective differences that may perceived under other circumstances

Interesting test they came up with. Yet the results don't contradict the idea that 256 kbps is nearly or completely indistinguishable from WAV. Since conventional A/B and ABX testing generated that conclusion, this doesn't support the idea that the testing methodology interferes. Being most generous and trying to lean that way as much as possible, none of the tests reach the p < .05 level, so any interference with perception from doing ABX testing is very, very small.

I would be all for other ways to do testing. ABX isn't stressful if you do some of them until you are comfortable with the procedure. It is, however, tedious in the extreme. Also, what you can reliably test this way is limited.

I also wonder this: suppose the same tests were done again, with people trained to be comfortable with ABX testing, using fast switching and very short segments. Nothing else would have to change, other than people picking short segments and comparing with immediate switching. Maybe the results would be superior.
 
OP
oivavoi

Major Contributor
Forum Donor
Joined
Jan 12, 2017
Messages
1,721
Likes
1,934
Location
Oslo, Norway
Blumlein 88 said:
Interesting test they came up with. Yet the results don't contradict the idea that 256 kbps is nearly or completely indistinguishable from WAV. Since conventional A/B and ABX testing generated that conclusion, this doesn't support the idea that the testing methodology interferes. Being most generous and trying to lean that way as much as possible, none of the tests reach the p < .05 level, so any interference with perception from doing ABX testing is very, very small.

I would be all for other ways to do testing. ABX isn't stressful if you do some of them until you are comfortable with the procedure. It is, however, tedious in the extreme. Also, what you can reliably test this way is limited.

I also wonder this: suppose the same tests were done again, with people trained to be comfortable with ABX testing, using fast switching and very short segments. Nothing else would have to change, other than people picking short segments and comparing with immediate switching. Maybe the results would be superior.

I agree that 256 kbps is nearly indistinguishable from WAV for music listening, not to mention 320 kbps. The only reason I pay for Tidal HiFi is that I'm a deluded audiophile who pays for things I probably don't need.

And you're completely right that none of the tests reach the p < .05 level. But having done some statistics myself, I know how difficult it is to reach significance with a very small n. This doesn't "prove" anything (but then, what does?). But I do think that the results at least don't invalidate the theory that ABX testing may lead people to overlook objective differences that can be perceived under some circumstances.

And I don't think that ABX tests are a problem in themselves. Quite the contrary: I think they are valuable for some things. But I do think that one needs to train at it, to become comfortable with the procedure etc.
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,370
Likes
234,432
Location
Seattle Area
oivavoi said:
Anyway, there are people who try to investigate scientifically how different testing procedures work... I recently came across this AES conference paper:
http://www.denismartin.net/wp-content/uploads/2017/01/Martin-et-al.pdf
That's an interesting paper. Thanks for posting it. I finished reading it just now. Alas, it seems they did not have a proper appreciation for how psychoacoustic lossy audio compression works. Briefly, a lot of the distortion is dynamic and varies millisecond to millisecond. The codec jumps from transparency to loss of fidelity in each "frame" of audio. Static tools like the ones they presented to testers cannot be aimed at those moments. Nor would a blind test where the testers don't know what needs fixing allow them to do that.
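
To put rough numbers on "millisecond to millisecond" (standard MPEG-1 Layer III parameters, not figures from the paper):

```python
# An MP3 frame spans 1152 samples per channel, so at CD rate the codec's
# bit-allocation decisions -- and thus its momentary lapses -- sit on a ~26 ms grid.
SAMPLES_PER_FRAME = 1152   # MPEG-1 Layer III frame length
SAMPLE_RATE = 44_100       # Hz, CD-rate audio

print(f"One frame = {1000 * SAMPLES_PER_FRAME / SAMPLE_RATE:.1f} ms")   # ~26.1 ms
print(f"Decisions per second = {SAMPLE_RATE / SAMPLES_PER_FRAME:.1f}")  # ~38.3
```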

They did get decent results on the collapse of stereo separation, which exists in MP3 (but not in other lossy codecs). But even that inference was weak.

Using secondary effects, as they tested, changes the dynamics of the test (for the better, given the "fun" comment from one of the testers), but I think it makes the task far more difficult. A trained listener could have nailed a lot of the comparisons with statistical significance.
 

Thomas savage

Grand Contributor
The Watchman
Forum Donor
Joined
Feb 24, 2016
Messages
10,260
Likes
16,298
Location
uk, taunton
amirm said:
That's an interesting paper. Thanks for posting it. I finished reading it just now. Alas, it seems they did not have a proper appreciation for how psychoacoustic lossy audio compression works. Briefly, a lot of the distortion is dynamic and varies millisecond to millisecond. The codec jumps from transparency to loss of fidelity in each "frame" of audio. Static tools like the ones they presented to testers cannot be aimed at those moments. Nor would a blind test where the testers don't know what needs fixing allow them to do that.

They did get decent results on the collapse of stereo separation, which exists in MP3 (but not in other lossy codecs). But even that inference was weak.

Using secondary effects, as they tested, changes the dynamics of the test (for the better, given the "fun" comment from one of the testers), but I think it makes the task far more difficult. A trained listener could have nailed a lot of the comparisons with statistical significance.
Trained listeners don't make up a statistically relevant part of the populace, though. If you have to be trained to tell the difference, then the difference is irrelevant in the wider context of the average consumer.

Despite what many audiophiles might assume, spending $50,000 on an amp does not constitute 'training' :D
 
OP
oivavoi

Major Contributor
Forum Donor
Joined
Jan 12, 2017
Messages
1,721
Likes
1,934
Location
Oslo, Norway
amirm said:
That's an interesting paper. Thanks for posting it. I finished reading it just now. Alas, it seems they did not have a proper appreciation for how psychoacoustic lossy audio compression works. Briefly, a lot of the distortion is dynamic and varies millisecond to millisecond. The codec jumps from transparency to loss of fidelity in each "frame" of audio. Static tools like the ones they presented to testers cannot be aimed at those moments. Nor would a blind test where the testers don't know what needs fixing allow them to do that.

They did get decent results on the collapse of stereo separation, which exists in MP3 (but not in other lossy codecs). But even that inference was weak.

Using secondary effects, as they tested, changes the dynamics of the test (for the better, given the "fun" comment from one of the testers), but I think it makes the task far more difficult. A trained listener could have nailed a lot of the comparisons with statistical significance.

Thanks, Amir! Interesting response.

If you were to design a similar test... what kind of tasks would you have them do?
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,370
Likes
234,432
Location
Seattle Area
Thomas savage said:
Trained listeners don't make up a statistically relevant part of the populace, though. If you have to be trained to tell the difference, then the difference is irrelevant in the wider context of the average consumer.
We go back to the comment I made earlier: I care about transparency, and about knowing for a fact that we have achieved it for all people, all content, and all situations. I think as audiophiles we should aspire to that even if the general public doesn't care or can't hear the difference.
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,370
Likes
234,432
Location
Seattle Area
oivavoi said:
Thanks, Amir! Interesting response.
If you were to design a similar test... what kind of tasks would you have them do?
An automated tool that would isolate areas of a song that might be revealing would help a lot. I do this manually right now: I "know" what the weakness could be given the test being performed, and zoom into those segments. There is a lot of trial and error, though, which makes the job boring and time-consuming. A tool should be able to mimic what I do based on different impairments.
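
A minimal sketch of what such a tool might look like (an illustration of the idea, not Amir's actual workflow; the function and its parameters are hypothetical, and raw difference energy is only a crude stand-in for a proper psychoacoustic model):

```python
# Sketch: rank short windows of a track by the RMS of the difference between the
# reference signal and its processed/lossy version, to suggest "revealing" segments.
# Assumes mono float arrays that are already sample-aligned and level-matched;
# real codecs add a decoder delay that would have to be compensated first.
import numpy as np

def revealing_segments(reference, processed, sample_rate, window_s=0.4, top_n=5):
    """Return (start_sec, end_sec, rms_db) for the windows with the largest error."""
    n = min(len(reference), len(processed))
    diff = reference[:n] - processed[:n]
    win = int(window_s * sample_rate)
    n_windows = n // win
    frames = diff[:n_windows * win].reshape(n_windows, win)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))          # error energy per window
    rms_db = 20 * np.log10(np.maximum(rms, 1e-12))       # avoid log(0)
    top = np.argsort(rms_db)[::-1][:top_n]               # loudest-error windows first
    return [(i * window_s, (i + 1) * window_s, float(rms_db[i])) for i in sorted(top)]

# Hypothetical usage, with `orig` and `lossy` as mono float arrays at 44.1 kHz:
#   for start, end, level in revealing_segments(orig, lossy, 44100):
#       print(f"{start:7.1f}-{end:7.1f} s   diff RMS {level:6.1f} dBFS")
```

A real tool would weight the difference by masking thresholds rather than raw energy, for exactly the frame-by-frame reasons described above, but even this crude ranking would cut down the trial and error of hunting for candidate segments by ear.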

Take a situation where I just run out of time and patience to find those segments. I would then vote for transparency even though there could have been revealing differences to me in that same test.

I remember taking Ethan Winer's transparency test, where he looped the same track over and over through ADC and DAC cycles. The first time, I could not tell the difference because I did not spend enough time on it. The second time (a year or two later), I managed to pass the test with ease by focusing on what could be different.

If we want to declare transparency for all people, we owe it to ourselves to find any and all audible differences. Otherwise we would be making declarations that could be false for others, who may even have better hearing than trained listeners.
 

Thomas savage

Grand Contributor
The Watchman
Forum Donor
Joined
Feb 24, 2016
Messages
10,260
Likes
16,298
Location
uk, taunton
amirm said:
We go back to the comment I made earlier: I care about transparency, and about knowing for a fact that we have achieved it for all people, all content, and all situations. I think as audiophiles we should aspire to that even if the general public doesn't care or can't hear the difference.
I agree... however, the culture of the audiophile makes folks feel like they are missing out on so much, when in real terms they are not.

Mostly, perceived fidelity exists in our own minds; as you often remark, we mark our own work. We anoint ourselves experts, but just how many of us can detect such differences? A truth well avoided by most, IMO.
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,370
Likes
234,432
Location
Seattle Area
Yes, as I mentioned yesterday (I think), blind testing to teach audiophiles a reality lesson about what is really audible is different from investigating the true fidelity of our systems and pipelines of music.
 

Thomas savage

Grand Contributor
The Watchman
Forum Donor
Joined
Feb 24, 2016
Messages
10,260
Likes
16,298
Location
uk, taunton
amirm said:
Yes, as I mentioned yesterday (I think), blind testing to teach audiophiles a reality lesson about what is really audible is different from investigating the true fidelity of our systems and pipelines of music.
Of course... one is academic and the other resonates more. In my mind it's like sports cars: the average Joe drives well within the performance envelope, so a racing driver highlighting the handling differences between two models is, while useful, not that meaningful to the majority of users. Still, you might well overspend getting the 'better' model, only to consign it to a life of never realising its full potential.

That's not to say that, from the manufacturer's R&D perspective, expert testing is not necessary; the end user is just that, though, and not involved in critical analysis.

Maybe we should all be thoroughly tested, and then and only then buy the appropriate level of audio gear :D
 

Fitzcaraldo215

Major Contributor
Joined
Mar 4, 2016
Messages
1,440
Likes
632
amirm said:
An automated tool that would isolate areas of a song that might be revealing would help a lot. I do this manually right now: I "know" what the weakness could be given the test being performed, and zoom into those segments. There is a lot of trial and error, though, which makes the job boring and time-consuming. A tool should be able to mimic what I do based on different impairments.

Take a situation where I just run out of time and patience to find those segments. I would then vote for transparency even though there could have been revealing differences to me in that same test.

I remember taking Ethan Winer's transparency test, where he looped the same track over and over through ADC and DAC cycles. The first time, I could not tell the difference because I did not spend enough time on it. The second time (a year or two later), I managed to pass the test with ease by focusing on what could be different.

If we want to declare transparency for all people, we owe it to ourselves to find any and all audible differences. Otherwise we would be making declarations that could be false for others, who may even have better hearing than trained listeners.
I agree completely. Audible differences with much equipment these days seem to be steadily shrinking. If listening tests are done with music, it takes me a long time to identify the specific momentary passages, some very brief, where there are differences to focus on. Often, minutes' worth of music can fly by with nothing in the way of noticeable comparative differences. I am not optimistic that there is a way to predict those revealing passages, except that they were pinpointed in prior comparative listening sessions. It might also be true that the passages I find revealing are not the same ones others would choose.

I am not saying I am a great DBT test subject, and I personally hate the experience. I agree there are many pitfalls. I think test subjects need to be trained in test-taking strategies; otherwise their results are likely random noise, unless the differences are fairly obvious. But if differences are obvious, we don't need DBTs.
 

Cosmik

Major Contributor
Joined
Apr 24, 2016
Messages
3,075
Likes
2,180
Location
UK
Just thought it worth linking to this post by Fitzcaraldo in another thread that describes a major limitation of blind testing:
That initial perception of "hearing things you did not know were on the recording before" with the second signal path or DUT tends to stay with you as a preference....

...Of course, it is just a refocusing of your listening attention the second time through the same musical passage. I suspect that even in a double-blind test, where the A and B choices were really identical while being announced to the test subjects as different, there would be a statistical preference for the second choice listened to.
Blind testing removes the bias caused by knowledge of what is being listened to, but it leaves a vacuum for the listener's imagination to fill. Tiny audible differences will be buried in this "brain noise". Increasing the sample size will not change this.

This is another major difference between the genuine science of objective measurements (voltmeters don't have imaginations) and the pseudoscience of asking a human consciousness "How do you feel?".
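
Fitzcaraldo's "prefer the second listen" conjecture quoted above is easy to simulate (a toy model: the 60% second-interval bias is an invented parameter, not a measured one):

```python
# Toy simulation: A and B are identical, but each vote goes to whichever
# item was played second with probability BIAS.
import random
from math import comb

BIAS = 0.6
random.seed(1)

def p_at_least(k, n):
    """One-sided binomial tail under a fair (50/50) null."""
    return sum(comb(n, i) * 0.5 ** n for i in range(k, n + 1))

for n in (20, 100, 1000):
    second_votes = sum(random.random() < BIAS for _ in range(n))
    print(f"n={n:5d}: {second_votes / n:.0%} chose the second item, "
          f"one-sided p = {p_at_least(second_votes, n):.3g}")
# A bigger sample does not wash the bias out; it only makes the spurious
# "preference" for the second item more statistically convincing.
```

Which is the point: a larger n sharpens whatever systematic 'brain noise' is present instead of averaging it away.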
 

Jakob1863

Addicted to Fun and Learning
Joined
Jul 21, 2016
Messages
573
Likes
155
Location
Germany
Jinjuku said:
A larger sample = more stable results and teasing out of overall trends.

Jinjuku said:
In the cable burn-in I certainly did ask a clear question. Again, you are welcome to poke holes in the method that is based on a claim.

Jinjuku said:
It's not meant to, since sighted listening is inherently flawed.

Jinjuku said:
But I'm not evaluating it from the viewpoint of absurdity. I'm using the claimant's own words. AGAIN, poke a hole in my cable burn-in evaluation method.

You talked about a listener's claim and a test of the listener's ability. Later you've said you would have liked to have more participants.
Therefore I pointed out that testing a larger group of people will not help with respect to the listener's ability that you wanted to test.

Of course a larger sample helps, but that means letting the listener do more trials.
As Cosmik already pointed out, the listener knows he is being tested, and - to get a result that allows one to draw further conclusions - you have to run more trials. I'm sure you notice that the test you've proposed is in fact different from what "sighted listeners are already doing".

That does not mean it is an invalid approach or an impossible task, just that it is different.

Sighted listening isn't inherently flawed, but it is impossible to show that it is valid. Most people do controlled listening tests to confirm (or to evaluate) things they have noticed during sighted listening.
 

Jakob1863

Addicted to Fun and Learning
Joined
Jul 21, 2016
Messages
573
Likes
155
Location
Germany
If I were to paraphrase your comment it would be "There is a long history of using the scientific method to test humans. Therefore it is possible to scientifically test every aspect of being human".

But that would not reflect my intention. You can never be sure, as this sort of test delivers only probabilities, not proof.

But larger samples, positive and negative controls etc. can't demonstrate that the listening 'abilities' (and this is a slightly strange concept in the context of listening to music for pleasure) of all participants aren't impaired by the awareness of being in a test.

Of course, but first of all, we usually know that most likely not _all_ people will suffer from this sort of problem. That is the rationale behind the approach of using larger samples. Certainly that will not help with respect to a single listener being tested.

Positive controls - choosing appropriate ones is a difficult matter in itself - help by showing that a listener's sensitivity/ability is sufficient under test conditions. Of course it is still possible that a listener's ability is impaired by the awareness of being in a test (meaning he could perhaps do even better), but he will still be able to reach good results.

As for the "slightly strange concept...": yeah, it is, but that makes it all the more important to know what the experimenter really wants to test and to choose the best-fitting test concept. Testing for difference is a different beast than testing for preference, and a feeling of more pleasure is a different task again.

It is further complicated because listening to the reproduction of music is a multidimensional task, and if an experimenter knows neither the difference between two DUTs nor the detection abilities of the listeners he is using, things get tough.
 

Jinjuku

Major Contributor
Forum Donor
Joined
Feb 28, 2016
Messages
1,278
Likes
1,180
Jakob1863 said:
You talked about a listener's claim and a test of the listener's ability. Later you've said you would have liked to have more participants.

You've failed to point out any mutual exclusivity.

Jakob1863 said:
Therefore I pointed out that testing a larger group of people will not help with respect to the listener's ability that you wanted to test.

And it has absolutely nothing to do with the price of tea. I'm happy to test a single claimant.

Jakob1863 said:
Of course a larger sample helps, but that means letting the listener do more trials.

In the case of cable burn-in they can go through as many permutations as they want.

Jakob1863 said:
As Cosmik already pointed out, the listener knows he is being tested, and - to get a result that allows one to draw further conclusions - you have to run more trials. I'm sure you notice that the test you've proposed is in fact different from what "sighted listeners are already doing".

No, the test is EXACTLY the same as what the sighted listeners are already doing. If you had a leg to stand on, you would have pulled the method apart already instead of just spouting off, and pointed out how shipping cables to a claimant to swap out and use in their system is not the same as the claimant ordering cables, swapping them into their system, and claiming the Emperor has new clothes. There are cable outfits that actually send out pre-burned-in cables.

Jakob1863 said:
Sighted listening isn't inherently flawed, but it is impossible to show that it is valid. Most people do controlled listening tests to confirm (or to evaluate) things they have noticed during sighted listening.

Sighted listening is inherently and completely flawed:
 

Jakob1863

Addicted to Fun and Learning
Joined
Jul 21, 2016
Messages
573
Likes
155
Location
Germany
amirm said:
It is actually worse than that. The experiment uses the rules of their world to generate what is guaranteed to be a false outcome. Science says that having such a long switchover time will eliminate detection of small differences. By giving them the rope to hang themselves, they do exactly that. :) In other words, if hypothetically there were small differences, they would not be heard in that test.

Amirm, I beg to differ. If a listener claims to hear a difference under the usual, longer switching times, and someone wants to know whether there is some evidence for the claim, then evaluating by using shorter switching times would not help.
Like every other human ability, transferring information into long-term memory storage is not perfect, but the relationship between the degree of a difference and the ability to store this difference isn't well understood or examined.
Especially in the case of a multidimensional experience such as listening to music.

And it is a matter of practical relevance. A difference that a listener can't remember after a few seconds will obviously not be of much relevance to this listener in everyday music listening, as he would always have forgotten it by the next time.
 

Jakob1863

Addicted to Fun and Learning
Joined
Jul 21, 2016
Messages
573
Likes
155
Location
Germany
My main "problem" with ABX tests is not that I know that I'm being tested, but rather the thing with sensory smearing. It's like my brain smears the sensory inputs all over, and everything becomes a soup where I'm only able to differentiate between the really big things. To really be able to perform well at ABX tests, I think you need to train at it like Amir has done. I know a couple of other Norwegian guys as well, who really take blind testing seriously for their home hifis. They have been able to ABX dacs for example. I'm not even close to doing that.<snip>

Maybe the ABX protocol does not suit your personal abilities; as there is no need to use ABX, you could switch over to another protocol, maybe A/B paired comparisons.
Training under the specific test conditions is usually a very good idea, because participating in a controlled listening test is very different from "normal" listening.

Researchers found out roughly 60 years ago that test results differ with the protocol (comparing ABX to A/B) and related the divergence to the different internal mental processes involved.

That's why training and the use of positive controls are so important.
 

Jakob1863

Addicted to Fun and Learning
Joined
Jul 21, 2016
Messages
573
Likes
155
Location
Germany
Jinjuku said:
You've failed to point out any mutual exclusivity.
And it has absolutely nothing to do with the price of tea. I'm happy to test a single claimant.

Now we are back at the beginning. You've sent a set of cables to a listener and got his correct response.
And now? Have you thereby tested his claim, and would you confirm his ability to hear a difference between a "burned-in" cable and a "fresh" cable?

Jinjuku said:
In the case of cable burn-in they can go through as many permutations as they want.

That does not help; you still have only one correct answer to a single trial.
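
To put numbers on that (a minimal sketch, assuming a fair coin under the null):

```python
# One correct answer is exactly a coin flip; only a run of consecutive correct
# calls makes chance an implausible explanation.
for k in range(1, 8):
    p = 0.5 ** k
    note = "  <- first run below p = .05" if p < 0.05 <= 0.5 ** (k - 1) else ""
    print(f"{k} consecutive correct: p = {p:.4f}{note}")
# A single correct trial leaves a 50% chance of pure luck; five in a row is
# the first result a conventional p < .05 criterion would call significant.
```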

Jinjuku said:
Sighted listening is inherently and completely flawed:

The McGurk effect isn't an appropriate example in our context; not all humans experience it, and a newer study drew the conclusion that it depends on training, too.
 