Can You Trust Your Ears? By Tom Nousaine

j_j · Oct 26, 2017

amirm said:
And to make sure everyone is following the argument, here is a sample from HA Forum link above:

View attachment 9423

These are "MOS" (mean opinion score) tests where users rate fidelity on scale of 1 to 5. There is no ABX test that says whether a difference exists at all.

This is the style of testing that is used in development of lossy codecs. Not ABX as I have mentioned repeatedly.

That appears to be MUSHRA test methodology.

Sensitive tests use DIFFERENCE testing, not MOS testing, for non-transparent systems.

In MUSHRA, you'll find most of the systems that aren't at the very top would get completely blasted to bits by ABC/HR, and provide 100% ABX scores.

Not a fan of MUSHRA - Nope.

It is also possible that the test you pointed out is the result of many separate tests, each one using the CCIR impairment scale, which is a difference scale, not an MOS scale.

Hard to tell without more information.

But ABX is what you use if you want to determine if there is ANY audible difference at all. I suspect a typo up in the quote.
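To make the "is there ANY audible difference" question concrete, here is a minimal sketch of how an ABX run is commonly scored: a one-sided binomial test against pure guessing. The function name `abx_p_value` and the 12-of-16 figures are illustrative assumptions, not something taken from the tests discussed in this thread.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` answers right out of `trials`
    when the listener is purely guessing (p = 0.5 per trial)."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Count every outcome with `correct` or more hits, divide by all outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


if __name__ == "__main__":
    # Hypothetical run: 12 correct identifications in 16 trials.
    print(f"p = {abx_p_value(12, 16):.4f}")
```

A small p-value says guessing is an unlikely explanation for the score. A large one does not prove the difference is inaudible, only that this run did not show it.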

j_j · Oct 26, 2017

amirm said:
Arny said development of codecs uses ABX. I said it did not.

Internally, actually, we used ABX as well as ABC/HR (no MUSHRA), in other words, 'detection' or "distance" testing but no MOS testing in developing codecs. It depends on what you're trying to accomplish, and sometimes "is this detectable" is what you need to know inside the guts of a codec. It's kind of complicated.

amirm · Oct 26, 2017

j_j said:
Internally, actually, we used ABX as well as ABC/HR (no MUSHRA), in other words, 'detection' or "distance" testing but no MOS testing in developing codecs. It depends on what you're trying to accomplish, and sometimes "is this detectable" is what you need to know inside the guts of a codec. It's kind of complicated.

When you say you did, do you mean AT&T? Inside Microsoft and in countless external codec shoot outs I participated in, it was never ABX. And MOS scoring was quite common in many tests.

I also don't remember any AES papers of lossy codecs using ABX testing. Admittedly it has been a while. I just went back and did a search and first few are all non-ABX tests: http://www.aes.org/tmpFiles/elib/20171026/8367.pdf

http://www.aes.org/e-lib/browse.cfm?elib=11262

http://www.aes.org/e-lib/browse.cfm?elib=5396

http://www.aes.org/e-lib/browse.cfm?elib=7127

BS1116 is what I mentioned was the key specification for such testing and it doesn't have anything to do with ABX.

I will stop here. The message is quite clear that ABX is not a common test in development of lossy audio codecs. While this doesn't rule out people using it, it just isn't the method of choice as Arny said.

DonH56 · Oct 26, 2017

For those (like me) who have forgotten how MUSHRA works: https://en.wikipedia.org/wiki/MUSHRA

I was never professionally involved with audio codec testing but my memory as a "follower" is that MUSHRA was fairly extensively used, at least publicly and in marketing, to differentiate among various lossy codecs. E.g. how few bits could they get away with and still be acceptable. I.e. how we got into this mess of highly lossy music that persists even though storage and network bandwidth have vastly improved since then.

My WAG's - Don

j_j · Oct 26, 2017

amirm said:
When you say you did, do you mean AT&T? Inside Microsoft and in countless external codec shoot outs I participated in, it was never ABX. And MOS scoring was quite common in many tests.

I also don't remember any AES papers of lossy codecs using ABX testing. Admittedly it has been a while. I just went back and did a search and first few are all non-ABX tests: http://www.aes.org/tmpFiles/elib/20171026/8367.pdf

BS1116 is what I mentioned was the key specification for such testing and it doesn't have anything to do with ABX.

I will stop here. The message is quite clear that ABX is not a common test in development of lossy audio codecs. While this doesn't rule out people using it, it just isn't the method of choice as Arny said.

You forget. I didn't work on any codecs at MS. So yes, AT&T. One does not usually write papers about research testing, unless it's necessary to document an outcome.

BS1116 and .1 are DIFFERENCE tests, not MOS tests. This is an important distinction, really.

MUSHRA is a modified MOS test.

Jakob1863 · Oct 26, 2017

The ITU recommendations for both methods are fortunately downloadable free of charge:

http://www.itu.int/rec/R-REC-BS.1116

https://www.itu.int/rec/R-REC-BS.1534/en

different methods for different aims.....

j_j · Oct 26, 2017

Jakob1863 said:
The ITU recommendations for both methods are fortunately downloadable free of charge:

http://www.itu.int/rec/R-REC-BS.1116

https://www.itu.int/rec/R-REC-BS.1534/en

different methods for different aims.....

Indeed, some of which attempt to force a listener to summarize a multidimensional perception in one value, which I strongly object to.

Jakob1863 · Oct 26, 2017

j_j said:
Indeed, some of which attempt to force a listener to summarize a multidimensional perception in one value, which I strongly object to.

Some people even wonder why, in preference tests involving multidimensional evaluation, transitivity isn't necessarily a given, so I share the objection.

Ron Party · Oct 27, 2017

BE718 said:
Apocalypse in 9/8

At this point, the drums enter, with the rhythm section striking out a pattern using the unusual metre of 9 beats to the bar (expressed as 3+2+4). The lyrics employ stereotypical apocalyptic imagery, alternating with an organ solo from Banks (played in 4/4 and 7/8 time signatures against the 9/8 rhythm section)

My favorite song of all time, bar none.

fas42 · Oct 27, 2017

Just had a quick glance at the BS.1116-3 document - right up my alley! Magic word there - "impairment" ... the goal, Grade 5.0 - Imperceptible ... 99.9999..% of playback systems fail, badly, in this regard - use of the 'right' test signal, say a musical piece which is a particularly complex mix of sound elements, with high dynamic range, played at sufficiently high volume, will trigger a perceptible impairment of the potentially realisable quality ... every time.

As long as a system can be trivially tripped up by this type of "testing" it doesn't deserve to be considered high fidelity ...

j_j · Oct 27, 2017

fas42 said:
Just had a quick glance at the BS.1116-3 document - right up my alley! Magic word there - "impairment" ... the goal, Grade 5.0 - Imperceptible ... 99.9999..% of playback systems fail, badly, in this regard - use of the 'right' test signal, say a musical piece which is a particularly complex mix of sound elements, with high dynamic range, played at sufficiently high volume, will trigger a perceptible impairment of the potentially realisable quality ... every time.

As long as a system can be trivially tripped up by this type of "testing" it doesn't deserve to be considered high fidelity ...

I think you need to read BS1116. It's obvious from your definition of "fail" that you don't understand what the test is, how it works, or what it reports back.

What is your hidden reference for this test you refer to?

fas42 · Oct 27, 2017

j_j said:
I think you need to read BS1116. It's obvious from your definition of "fail" that you don't understand what the test is, how it works, or what it reports back.

What is your hidden reference for this test you refer to?

Fair enough to fully read BS1116 - the intent is clear, to me, that distortion is deliberately, knowingly included, at certain levels - perhaps from data compression - until the subjects are able to detect impairment - I just apply this in the greater context, when listening to systems playing back "uncontaminated" material.

My reference for this is having heard some clip, or track, being played back with minimal impairment at some point - I know how well it was recorded. Example: a rock track where the drummer is constantly striking the cymbals through the piece; Grade 0 - "Is the drummer hitting the cymbals, I can't hear it?" through "Kitchen pots and pans being whacked" to Grade 5 - the delicate, clear shimmer of the instrument is reproduced, fully intact.

Cosmik · Oct 27, 2017

fas42 said:
Fair enough to fully read BS1116 - the intent is clear, to me, that distortion is deliberately, knowingly included, at certain levels - perhaps from data compression - until the subjects are able to detect impairment - I just apply this in the greater context, when listening to systems playing back "uncontaminated" material.

My reference for this is having heard some clip, or track, being played back with minimal impairment at some point - I know how well it was recorded. Example: a rock track where the drummer is constantly striking the cymbals through the piece; Grade 0 - "Is the drummer hitting the cymbals, I can't hear it?" through "Kitchen pots and pans being whacked" to Grade 5 - the delicate, clear shimmer of the instrument is reproduced, fully intact.

Couldn't you just... measure your system? I can tell from this thread that listening tests are regarded by many as the Swiss Army knife of measurements. People become listening test enthusiasts, because they think they can resolve "multidimensional" mysteries such as why people seem to like the sound of valve amplifiers and vinyl over the perfection of digital and solid state. Or they can prove "their opponents" ('subjective' audiophiles) wrong - a very strong motivation for the scientific listening test 'community'. But of course, it may be nothing to do with the sound in the first place. In effect, the listening test enthusiasts have already fallen for the subjective audiophiles' schtick by even dignifying their claims with a test.

Lossy compression is one area where listening tests are needed to check it works - which is why it is so often raised as an example of the usefulness of listening tests - but even so, the test can only fine tune a system that was designed on paper based on pretty straightforward logic (not saying I'd have thought of it, though!). A workable lossy compression system could be designed without any listening tests at all.

At the end of the day, it is odd that lossy compression seems to be such an obsession for high performance audio professionals. It is as though they are stuck in the year 2000.

Jakob1863 · Oct 27, 2017

Cosmik said:
Couldn't you just... measure your system? I can tell from this thread that listening tests are regarded by many as the Swiss Army knife of measurements.

Isn't it obvious? Which way - without listening tests - could somebody decide which sort of measurement is important and which degree (or kind) of deviation from a predefined "perfect" measurement result could be nevertheless acceptable?

The facts/premises should stand firm:
1.) a stereo reproduction system (for example a two channel system) represents a great deviation from the original soundfield, when reproducing it, but our ear/brain tag team is able to compensate for a lot of these deviations
2.) listeners to these reproduction systems are complex individuals that might respond in quite a different way to these deviations
3.) listening to music is a multidimensional experience
4.) different deviations/errors can lead to the same perceived impression

1.) - 4.) together are imo the reason why the aforementioned transitivity isn't guaranteed in listening tests.

So, if you accept the necessity in the case of "lossy codecs" why don't you accept the necessity in general, as reproduction usually is just a "lossy version" of an original?
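The point about transitivity can be shown with a toy example. This is only a sketch under invented assumptions: three hypothetical systems X, Y and Z, each scored on three made-up perceptual dimensions, with a listener who prefers whichever system wins on more dimensions. None of the numbers come from a real test.

```python
from itertools import permutations

# Hypothetical scores on three invented dimensions
# (say: imaging, tonal balance, dynamics). Higher is better.
ratings = {
    "X": (3, 1, 2),
    "Y": (2, 3, 1),
    "Z": (1, 2, 3),
}


def prefers(a: str, b: str) -> bool:
    """True if system `a` beats system `b` on more dimensions than it loses."""
    wins = sum(ra > rb for ra, rb in zip(ratings[a], ratings[b]))
    losses = sum(ra < rb for ra, rb in zip(ratings[a], ratings[b]))
    return wins > losses


def preference_cycles() -> list:
    """All orderings (a, b, c) where a beats b, b beats c, and c beats a."""
    return [
        (a, b, c)
        for a, b, c in permutations(ratings, 3)
        if prefers(a, b) and prefers(b, c) and prefers(c, a)
    ]


if __name__ == "__main__":
    for a, b, c in preference_cycles():
        print(f"{a} > {b}, {b} > {c}, yet {c} > {a}")
```

Each pairwise choice is perfectly reasonable on its own, yet the three together form a loop, so no single ranking of the systems exists. Forcing a one-number score onto data like this hides that structure.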

Cosmik · Oct 27, 2017

Jakob1863 said:
Isn't it obvious? Which way - without listening tests - could somebody decide which sort of measurement is important and which degree (or kind) of deviation from a predefined "perfect" measurement result could be nevertheless acceptable?

The facts/premises should stand firm:
1.) a stereo reproduction system (for example a two channel system) represents a great deviation from the original soundfield, when reproducing it, but our ear/brain tag team is able to compensate for a lot of these deviations
2.) listeners to these reproduction systems are complex individuals that might respond in quite a different way to these deviations
3.) listening to music is a multidimensional experience
4.) different deviations/errors can lead to the same perceived impression

1.) - 4.) together are imo the reason why the aforementioned transitivity isn't guaranteed in listening tests.

So, if you accept the necessity in the case of "lossy codecs" why don't you accept the necessity in general, as reproduction usually is just a "lossy version" of an original?

What you are advocating is the system designed by listening test. This is very similar to the idea of training artificial neural networks.

The ANN can be regarded as a black box that will transform inputs to outputs, and can be 'trained' to implement any arbitrary multidimensional function. Really, the sky's the limit: you could create your dream system that will transform your recordings in exactly the way you like them best. All you have to do is to train the network to give you the sound you like.... But at that moment, you realise that there is no such thing as "the sound you like", because it varies. And the job of training the network from scratch would be immense. Instead, you would set the network up to at least start from 'linear' and then fine tune it. What are the chances, do you think, that on average you would have a strong preference for anything but 'linear' at the end of the day? At the same time it is easy to see how a bias in the training examples could completely screw the thing up.

Design by listening test is, effectively, a version of training a neural network, but extremely slowly and based on very sparse 'training data'. People who work in neural nets usually start out very enthusiastic but then eventually come to understand that the network is reflecting their choice of training & testing data rather than adding new insights. In order to get an effective network they need to be able to provide an even distribution of training & testing data, and in order to ensure this, they need to understand the essence of the data. If they do this, they may as well engineer the system using conventional methods!

In audio, the current state of the art (as it always has been) is the linear system. Two channels is accepted by convention, or deliberate design, as sufficient. Is there a better method, that listening tests could reveal? I don't think so. As such, listening tests are probably just confirmation of what we already know - and much less sensitive than measuring instruments - and possibly less sensitive than a relaxed person listening for enjoyment (controversial! but we cannot know the answer).

oivavoi · Oct 27, 2017

I do believe listening tests could tell us whether three channel would be superior to two channel for subjective image perception, for example. Intuitively, three channel has always seemed to me like a more logical way of reproducing music than two channel. Floyd Toole mentions a couple of experiments about this in his book (I only read the 2nd edition), but it seemed to me like the results were inconclusive at the time.

Fitzcaraldo215 · Oct 27, 2017

Cosmik said:
In audio, the current state of the art (as it always has been) is the linear system. Two channels is accepted by convention, or deliberate design, as sufficient. Is there a better method, that listening tests could reveal? I don't think so. As such, listening tests are probably just confirmation of what we already know - and much less sensitive than measuring instruments - and possibly less sensitive than a relaxed person listening for enjoyment (controversial! but we cannot know the answer).

Yes, 2 channel stereo was adopted as a conventional standard in the 1950's, and it remains the widely accepted paradigm. I know of no formal, published scientific listening tests that reveal that multichannel music reproduction from discretely recorded multichannel sources is preferred over stereo by test subjects. As you say, it would just be confirmation of what many already know to be obvious. Therefore, such scientific tests are not published. No one has an interest in rigorously attempting to prove the obvious. Such tests would be superfluous, laughably so.

At the advent of color TV, I also do not think that there were scientific studies to "prove" conclusively that people statistically preferred color to black/white. Not saying multichannel audio sound is quite as obvious as that, but I think you catch my drift. Maybe such formal listener preference studies of stereo vs. mono were published, but I doubt it for the same reasons.

Where rigorous scientific listener preference studies of sound reproduction are useful is in identifying many smaller, less obvious differences that were not previously "known" or which were controversial. And, indeed, many of the issues we now routinely deem as "known" might not be known were it not for such earlier studies on human test subjects.

We often forget or are oblivious to the painstaking research and testing that underlies what we now deem as self evident. If you take the long view, some things in audio were not so obvious at one time, and many were not routinely accepted until adequate scientific testing on human subjects was done.

And, there are many mysteries in the realm of psychoacoustics that measurement instruments cannot measure, but which can have significant impact, including on audio system design. There still are many mysteries to be solved, I believe. And, there have even been some accepted conventions proven by newer research to be inferior in terms of listener preference.

As Toole says, two ears and a brain are not the same as an omni mike. Again, I highly recommend his latest book, Sound Reproduction - 3rd Edition for many useful insights into the research on acoustics and psychoacoustics, his own and others.

j_j · Oct 27, 2017

fas42 said:
Fair enough to fully read BS1116 - the intent is clear, to me, that distortion is deliberately, knowingly included, at certain levels - perhaps from data compression - until the subjects are able to detect impairment - I just apply this in the greater context, when listening to systems playing back "uncontaminated" material.

No, that is not what BS1116 says. It is an ABC/HR system, where there are two goals: to discover which of B and C is the "hidden reference" (A is the reference), and then to describe how different the other one is.

Unless you are able to switch systems clicklessly while sitting in the same seat, with prompt switching, etc, you aren't running anything like BS1116.
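For anyone who wants to see the mechanics, here is a rough sketch of how a single ABC/HR trial can be reduced to a difference grade. It assumes the usual five-point impairment scale (5.0 = imperceptible, 1.0 = very annoying) and the convention of subtracting the grade given to the hidden reference from the grade given to the item under test. The `Trial` class and its field names are hypothetical, made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    hidden_ref: str  # which of "B" or "C" was secretly the reference
    grade_b: float   # listener's grade for B, 1.0 to 5.0
    grade_c: float   # listener's grade for C, 1.0 to 5.0


def difference_grade(trial: Trial) -> float:
    """Grade of the item under test minus grade of the hidden reference.

    Negative: the listener heard an impairment in the item under test.
    Zero: no difference was reported.
    Positive: the listener graded the hidden reference as the worse one,
    i.e. picked the wrong stimulus.
    """
    if trial.hidden_ref not in ("B", "C"):
        raise ValueError("hidden_ref must be 'B' or 'C'")
    if trial.hidden_ref == "B":
        return trial.grade_c - trial.grade_b
    return trial.grade_b - trial.grade_c


if __name__ == "__main__":
    # Hypothetical trial: B is the hidden reference, C is the coded signal.
    print(difference_grade(Trial(hidden_ref="B", grade_b=5.0, grade_c=3.5)))
```

Note that every grade is relative to a reference presented in the same trial, which is why this is a difference test and not an MOS test.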

j_j · Oct 27, 2017

Cosmik said:
Couldn't you just... measure your system?

Measure for what? Frequency response of direct signal? Frequency response of diffuse signal? Degree of diffusion of diffuse signal? Direct to diffuse ratio? That as a function of frequency? Distortion? Room interaction (many measurements there), etc.

Remember, two channel playback is a very, very ROUGH approximation of an original soundfield. Steinberg and Snow proved in 1933 (YES, 1933) that ***THREE*** channels were absolutely necessary for proper rendering of the front soundstage, and that 2 was not sufficient.

So we have an illusion made by a flawed system.

Which illusion do you PREFER? Tell me that? Can you describe that in measurements? Just for starters.

j_j · Oct 27, 2017

Cosmik said:
In audio, the current state of the art (as it always has been) is the linear system.

That's because the ear is a kind of spectrum analyzer, and adding new tones to a signal is a very annoying kind of impairment.

Two channels is accepted by convention, or deliberate design, as sufficient.

Not even close. Not even in the dugout, let alone in the actual ballpark. Steinberg and Snow took that apart in 1933. There was an argument about 2 vs. 3 then, with jingoistic advertisers ranting quite maliciously about "you only have two ears", and even then, missing the basic physics of the situation that were already known.

3 channels is rock-bottom MINIMUM for a front soundstage, with no envelopment or depth. Sorry, but that's been firmly established for going on a century now.

Is there a better method, that listening tests could reveal? I don't think so. As such, listening tests are probably just confirmation of what we already know - and much less sensitive than measuring instruments - and possibly less sensitive than a relaxed person listening for enjoyment (controversial! but we cannot know the answer).

So then, you do understand (obviously you do NOT) that different people prefer different illusions from their stereo system. Some like direct, some like totally diffuse, some like a mix. That's just one element for starters. I am not going to even describe this in one paragraph.

And for the "two ears" foolishness, our heads move. That alone is a refutation.
