
Can You Trust Your Ears? By Tom Nousaine


Jakob1863

Addicted to Fun and Learning
I'm sorry, but you need to read more carefully. Instant, clickless switching is required. ABC/hr is the paradigm, not something else. These issues rest on the basic length of loudness and auditory feature memory. Unless you're a computer or some other species, this isn't in any debate.

I think fas42's argument was more along the lines that there is no proof that the ABC/HR method is the most sensitive test protocol among the many others.
Which is sort of universally true, as there is no "one size fits all" approach; it always depends on the hypotheses being tested and on the usual constraints, as resources aren't endless.
Reading the recommendation is important, as it covers - albeit in short form - a lot of important factors in the design of experiments, and it even emphasizes that it is well worth reading additional material and looking for competent external consultants, as testing is a complex task.

Most of the material incorporated is well founded on earlier scientific research, but despite that, the recommendation regarding the length of the audio samples is a bit vague, as the recommended length is 10 s to 20 s.

I have written a bit in an earlier thread about the different models of memory, and I think this recommended time span is inconsistent with the usual models (including the working-memory approach), especially considering the assumed FIFO characteristic.

Precoda and Meng proposed in an AES convention paper to use only sample lengths of less than 5 s, based on the assumption that the impact of categorical memory could thereby be minimised.

As the emotional response to music is an important factor, it is IMO unclear whether short sample lengths can or will lead to the same response as longer excerpts.
Of course that need not matter for certain types of artifacts, but for a quality/difference assessment beyond those it surely will.

Up to now I have not been able to find publications about experiments in which a multidimensional evaluation was done with varying degrees of difference and varied sample lengths, to find out which memory time spans can be achieved.

We know that categorization is the important factor in the transfer from "short-term memory" to "long-term storage", but it seems that there exists a somewhat shorter or different path for auditory input.

Have I missed some relevant publications, and could you help with some citations?
 

Blumlein 88

Grand Contributor
Forum Donor
Some abandoned work from the 1990's suggests that NEAR-coincident works even better. Coincident does not provide a wide listening area or proper perspective as the listener's head moves.

I wish I had links to the articles. Those I have in mind put near coincident behind coincident and ahead of wider spacing. They did have the listener in a central position so they didn't get into how wide the listening area would be. They were concerned primarily with accurate placement of small musical groups during playback.

There are a lot more published articles (at least that I am aware of) testing various surround recording methods. In those, the wider spacing seems to be preferred when it is a test of listener preference. Results on accuracy seem very mixed, with different tests reporting different methods as better. For preference, a Fukada tree (3 cardioids spaced 1 to 2 meters apart in a triangle, with the left and right cardioids pointing to the sides) seems to do well in more than one test. This is for the front three channels.
 

fas42

Major Contributor
Again, until you can state a falsifiable hypothesis, which the above is not, you are stating nothing but an unsupported belief. This is 'audio science forum'.
I would suggest it's highly falsifiable - recreating the 1933 experiment with two systems, one of "high enough standard" and one which is not: play recordings of people speaking in various positions, with the systems behind a curtain, and note where the talkers are perceived to be located. If the vast majority of subjects don't detect any variation between the systems, then it's falsified.
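To make that concrete, here is a minimal sketch of how such a curtain test might be scored, assuming each subject gives a forced yes/no judgement of whether the two systems place the talkers differently (so pure guessing is 50%); the subject counts below are invented:

```python
from math import comb

def binomial_p_value(hits: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact probability of `hits` or more positive responses
    out of `trials` if every subject were purely guessing."""
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(hits, trials + 1))

# Invented numbers: 23 of 30 subjects report that the two systems place
# the talkers differently.
p = binomial_p_value(hits=23, trials=30)
print(f"p = {p:.4f}")  # ~0.003, so "no audible variation" would be rejected
```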
 

fas42

Major Contributor
I have seen listening tests of stereo microphone configurations and their spatial accuracy where crossed figure 8s followed by mid-side recordings resulted in listeners able to most accurately place the performer's location in the recording. Various spaced techniques including spaced pairs of omni's fared much less well. I seem to recall the spaced omnis were the worst.
In the absence of "high enough quality" playback this would work - something like how stage productions do it in the visual field, with lighting. The audience are "told" what is currently important by a spotlight hitting the stage where the key action is, because they are too far away from the actors, etc, to be able to "read" the subtleties of body movement, etc.

My experience is that sufficient fine detail in the sound field reproduction does allow at least some people to accurately "read" the aural clues, without enhanced "spotlighting".
 

j_j

Major Contributor
Audio Luminary
Technical Expert
BTW, what's your opinion on two channel omnis, JJ? I've been a lone proponent of that kind of speaker here. For me, spatialization is one of the key reasons why I like them.
The answer to that is much more than a paragraph. It depends on the room, the way the material was recorded, etc.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
I would suggest it's highly falsifiable - recreating the 1933 experiment with two systems, one of "high enough standard" and one which is not: play recordings of people speaking in various positions, with the systems behind a curtain, and note where the talkers are perceived to be located. If the vast majority of subjects don't detect any variation between the systems, then it's falsified.

Define "high enough". Until you do, you're simply engaging in tautology.
 

j_j

Major Contributor
Audio Luminary
Technical Expert
I wish I had links to the articles. Those I have in mind put near coincident behind coincident and ahead of wider spacing. They did have the listener in a central position so they didn't get into how wide the listening area would be. They were concerned primarily with accurate placement of small musical groups during playback.

The work happened JUST before AT&T Research went *poof*. There is one preliminary report, and no more as a result. That is in, I think, 2000 AES from LA, as a convention preprint. I ought to know, I wrote it, but I don't have the reference handy.

It was reviewed in several places, but of course reviews are reviews.

http://www.onhifi.com/features/20010615.htm
https://www.stereophile.com/content/wheres-real-magazine-we-see-it-february-2001
 

Blumlein 88

Grand Contributor
Forum Donor
The work happened JUST before AT&T Research went *poof*. There is one preliminary report, and no more as a result. That is in, I think, 2000 AES from LA, as a convention preprint. I ought to know, I wrote it, but I don't have the reference handy.

It was reviewed in several places, but of course reviews are reviews.

http://www.onhifi.com/features/20010615.htm
https://www.stereophile.com/content/wheres-real-magazine-we-see-it-february-2001

I have read those reviews. I even read your preprint at one point (preprint #5202, with Y.H. Lam). I also read later work that referred to your work, from these authors:
Enzo De Sena, Hüseyin Hacıhabiboǧlu and Zoran Cvetković

http://www.desena.org/multichannel/Ambisonics_2_Int_Symp_2010.pdf

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.337.7632&rep=rep1&type=pdf

http://epubs.surrey.ac.uk/813317/13/hacihabiboglu_etal2017.pdf

From the descriptions it sounded like a no-brainer that someone would make use of it. I take it AT&T or some offshoot of it owns the rights, and sadly no one has made it available? Is there anything similar available or in the works somewhere?
 

Arnold Krueger

Active Member
Scepticism of ABX is justified if the protocol is used as often reported, which means without training of participants and without using positive controls (and negative ones as well, which of course is of no importance in the case of negative results).
As already mentioned/cited earlier, people noticed back around the 1960s that ABX tests delivered inferior results compared to A/B tests; the differences were attributed to the different mental processes involved, which made the task more difficult for participants in ABX tests.

The fallacy here is the continued, tired reiteration of the truism that there is nothing one man can invent that others can't either improve, on the one hand, or screw up, on the other.

In the end, I only have control over what I do.

I strongly tend to publish only results that I consider interesting, and so I post only a fraction of the ABXing that I do.

For example, I've known since the late 1970s that good DACs generally ABX no better than random guessing when compared against no signal processing at all, or the purest possible technical equivalent of "the straight wire". The only ABX test of DACs that I would find interesting enough to publish would be something that most informed people would consider to be impossible, so that is a huge area of testing that I have intentionally excluded myself from publishing about, even though I do retest this hypothesis from time to time.

Amir often speaks from ignorance. His comments about published tests do not stand the test of time (in this case, times past), which anybody can verify for themselves. From approximately 2000 to 2007 I operated a website called www.pcabx.com, about a third of which one can find on the Wayback Machine. On that site, I published a free Windows ABX comparator (probably one of the first ever to be seen by the public).

I also published the .wav files, many in 24/96 format, for conducting literally dozens of different ABX experiments, probably 100 or more. Each included a full set of graduated listener-training files. It goes without saying that the whole site's contents were tested by me and my associates, such as the late Tom Nousaine. Literally tens of thousands of copies of that first ABX comparator, and probably over a million copies of the various test files, were downloaded. There does not seem to be anything like it before or since, so I take some credit for the hundreds of thousands of ABX tests that it seems to have stimulated.

Amir will have to admit that the one ABX test sequence he did under my auspices (which related to jitter) included abundant and effective listener training. It was patterned after the hundreds of different ABX tests from my now-departed www.pcabx.com site.

However, there is a relevant argument that the biggest offenders against the long-stated requirement for listener training (going back to no later than the mid-80s) are people who are already prejudiced against ABX because of their emotional, and often financial, stake in audio panaceas that are actually generally ineffective, such as cables. Many people, even against my direct advice, have used this kind of thing as their "training", and this was true while www.pcabx.com was in operation.

There is also an argument that ABX testing leads to self-training. It surely was that way for its developers. You see, unlike sighted evaluations, if you don't train yourself to listen effectively for things that should be audible according to the findings of science (as opposed to the findings of audiophilia), you don't get statistically significant results. Plain and simple! Back in the late 1970s we had many amplifiers right before us (it was the tail end of tubed amps being mainstream) that did and did not sound different, for example. We learned how to listen with such things.
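For anyone unfamiliar with how an ABX run is scored, here is a minimal sketch of the underlying logic. This is not the PCABX comparator, just the bookkeeping such tools are built on, assuming a forced choice of A or B for X on every trial:

```python
import random
from math import comb

def abx_run(trials: int, answer_fn) -> int:
    """Run `trials` ABX trials; `answer_fn(trial_number)` must return 'A' or 'B'
    (in a real comparator the listener hears A, B and X before answering)."""
    correct = 0
    for t in range(trials):
        x = random.choice("AB")   # hidden, random assignment of X
        if answer_fn(t) == x:
            correct += 1
    return correct

def p_value(correct: int, trials: int) -> float:
    """Probability of doing this well or better by guessing (50% per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

# Example: a listener who is purely guessing rarely reaches significance.
score = abx_run(16, lambda t: random.choice("AB"))
print(score, "/ 16 correct, p =", round(p_value(score, 16), 3))
```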
 

Arnold Krueger

Active Member
And to make sure everyone is following the argument, here is a sample from HA Forum link above:

[Attachment: MOS listening test results from the HA forum thread linked above]


These are "MOS" (mean opinion score) tests where users rate fidelity on scale of 1 to 5. There is no ABX test that says whether a difference exists at all.

This is the style of testing that is used in development of lossy codecs. Not ABX as I have mentioned repeatedly.

Great example of finding a false conclusion from good data. It happens to the best!

Amir does not seem to understand the difference between testing for different purposes. The tests shown were summary results that were arrived at after various, but probably lengthy, development sessions. ABX is pretty useless for these summary comparisons of flawed alternatives. ABX can be very good for development. ABX can be very good when there is a mix of sonically perfect and imperfect products, such as comparing tubed amplifiers and good SS amplifiers.

The way this works is that extensive listening sessions are used to find critical passages where the encoder screws up. Then the coder listens to the critical passage and enhances his code so that the artifact is managed. Then he runs a suite of critical passages to make sure that he didn't break what used to work.

What JJ and I are talking about is ABX as a development tool, something a developer uses to debug code with. Obviously, unless you are writing a paper or a course about detailed encoder development, you don't publish much about this. About as high as I've ever seen this sort of thing percolate is the paper from Apple a few years back about processing files for download on their music site.
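A minimal sketch of that critical-passage regression loop, with the encoder, the artifact score, the file names and the threshold all as hypothetical stand-ins (in real development the score comes from trained listeners, e.g. ABX or ABC/HR sessions, not from a function call):

```python
# Hypothetical passage names, scoring function and threshold; in real work
# the "artifact score" would come from trained listeners, not from code.
CRITICAL_PASSAGES = ["castanets.wav", "harpsichord.wav", "male_speech.wav"]
THRESHOLD = 1.0   # maximum tolerable artifact score per passage (arbitrary units)

def encode_decode(passage: str, encoder_version: str) -> str:
    """Stand-in for running the codec under test; returns a decoded file path."""
    return f"decoded/{encoder_version}/{passage}"

def artifact_score(original: str, decoded: str) -> float:
    """Stand-in for a listening result (ABX/ABC-HR) or an objective metric."""
    return 0.5

def regression_suite(encoder_version: str) -> bool:
    """Re-run every known critical passage after a fix and flag regressions."""
    failures = []
    for passage in CRITICAL_PASSAGES:
        decoded = encode_decode(passage, encoder_version)
        score = artifact_score(passage, decoded)
        if score > THRESHOLD:
            failures.append((passage, score))
    for passage, score in failures:
        print(f"REGRESSION: {passage} scored {score:.2f} (limit {THRESHOLD})")
    return not failures

if regression_suite("encoder-after-pre-echo-fix"):
    print("No regressions: the fix did not break what used to work.")
```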
 


Cosmik

Major Contributor
The way this works is that extensive listening sessions are used to find critical passages where the encoder screws up. Then the coder listens to the critical passage and enhances his code so that the artifact is managed. Then he runs a suite of critical passages to make sure that he didn't break what used to work.
A very familiar concept to anyone who develops neural networks or other heuristic pattern-recognition systems - but much, much slower. Maybe a lifetime of listening tests translates into the AI scientist's first week of work, on a high with the apparent universal power of what they are working with. After a little while they realise the limitations and rein in their enthusiasm.

Here, the usual system is to train the network on a set of data, and test it in parallel on other data - you watch two error lines hopefully descending together. The theory is that where they begin to diverge, with the testing data error beginning to rise, the network is being over-trained, and training should stop just prior to there. Of course, in such a scheme, the error is being measured automatically, so thousands of iterations can be run in seconds. Having to do the same thing using humans and ABX tests would be rather more tedious!

So perhaps the network is 98% correct on the training data and 80% correct on the test data. You then gather some new data, perhaps a camera in a location you've never tried before. Suddenly the thing fails to work. You realise that it wasn't really detecting what you thought it was detecting. So you feed a selection of this new data to the test and training batches and try again. This time you achieve 98% and 78%. You bring in some more data...

In the end, you begin to realise that the problem is many-dimensioned, and there is a conflict between the simplicity of a practical network - which is too 'blunt' to provide good discrimination, and a complex network for which you can't provide enough data, plus your data is 'lumpy' and not distributed evenly enough for the network to learn to generalise ('multi-dimensionally interpolate') properly.

In the end, you abandon the neural network idea, learn to understand what the system has to do from first principles, and engineer it conventionally. I imagine that lossy codecs could only have been created this way, and the listening tests are just playing at the margins.
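That stopping rule - watch the two error curves together and stop just before they diverge - is essentially early stopping. A toy sketch of the decision logic, with synthetic loss curves standing in for a real training run:

```python
# Toy illustration of the early-stopping rule described above: track the best
# validation error seen so far and stop once it has not improved for a while.
def early_stop(train_loss, val_loss, patience=3):
    """Return the epoch at which training should have stopped."""
    best_epoch, best_val = 0, float("inf")
    for epoch, v in enumerate(val_loss):
        if v < best_val:
            best_epoch, best_val = epoch, v
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Synthetic curves: training error keeps falling, validation error turns up
# (the over-training point described above).
train = [1.0 / (e + 1) for e in range(30)]
val = [1.2 / (e + 1) + 0.002 * max(0, e - 12) ** 2 for e in range(30)]

stop = early_stop(train, val)
print(f"validation error bottomed out at epoch {stop}; stop training there")
```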
 

Arnold Krueger

Active Member
Cosmik, if wishes were fishes.

The people who write encoder code are generally very well-informed about the relevant first principles. Trouble is, there is still no way in the current SOTA of AI for AI to write good code like this. If there was, they would be there first...

Does it get windy and cold in that ivory tower? Your posts seem to reflect that! ;-)
 

Cosmik

Major Contributor
Does it get windy and cold in that ivory tower? Your posts seem to reflect that! ;-)
I can't help it - I think I'm an ideas person; I can only get interested in ideas. I'm also fairly practical and hands-on, but the low level practical stuff is hopefully a means to an end.

When a discussion heads down the path of the practical stuff obscuring the ideas, or being elevated to the status of an idea itself, I have to chip in with one of my (apparently) lofty posts! In this case, I see the iterative listening test procedure you describe as a direct analogy of neural network training, down to the distinction between training data (the 'network' being adjusted to reduce an error that has been heard) and the testing data (checking nothing has been broken = testing performance against other examples).

Technically, rather than the coder checking nothing has been broken, shouldn't the whole test be repeated in a double blind test? What he has changed *will* affect everything else to some extent - it is not just a question of broken or not broken. As such, it would be easy to see how this would never end: every artefact the coder fixes may result in another popping up elsewhere, or a general reduction in quality versus data compression performance - and that assumes that listening tests are always reliable and repeatable. And even if you keep introducing new examples of Tibetan nose flute or whatever, you'll never know that there isn't a complete disaster waiting around the corner - unless the system is fully understood anyway, in which case the listening tests are almost superfluous..?
 

Arnold Krueger

Active Member
I can't help it - I think I'm an ideas person; I can only get interested in ideas. I'm also fairly practical and hands-on, but the low level practical stuff is hopefully a means to an end.

When a discussion heads down the path of the practical stuff obscuring the ideas, or being elevated to the status of an idea itself, I have to chip in with one of my (apparently) lofty posts! In this case, I see the iterative listening test procedure you describe as a direct analogy of neural network training, down to the distinction between training data (the 'network' being adjusted to reduce an error that has been heard) and the testing data (checking nothing has been broken = testing performance against other examples).

The problem is that among real-world practitioners, we know from experience that seemingly good ideas are worth about a dime a dozen in the wild, and only become worth the powder it takes to blow them to #&!! by being sifted through the sieve of practical experience. And it's pretty easy to figure out who has that and who does not.

Technically, rather than the coder checking nothing has been broken, shouldn't the whole test be repeated in a double blind test?

Didn't I say that? Pretend that I said it as clear as day, read what I said again while thinking that I said it but you didn't notice.


What he has changed *will* affect everything else to some extent - it is not just a question of broken or not broken.

In general, false. One of the goals of good programming style is to structure the solution in a modular fashion and partition the modules so that fixes to different problems don't, as much as possible, step on each other.

The general audiophile version of this myth is "Everything Matters". The short answer is that finding out what does matter and what doesn't is a basic part of being a successful practitioner. Of course, in a placebo-driven world like audiophilia, so-called "proofs" that everything matters rule.

As such, it would be easy to see how this would never end: every artefact the coder fixes may result in another popping up elsewhere, or a general reduction in quality versus data compression performance - and that assumes that listening tests are always reliable and repeatable.

Ahh, d@## you can't even discuss things a little without introducing the excluded middle fallacy. Every artifact? Please see previous comments.

So, how big of a trust fund or settlement allows one to survive in the midst of such idealism, unreality and inexperience? ;-)
 

watchnerd

Grand Contributor
This strikes a deep chord with me (ok, no pun intended). When I listen to most music I am definitely, in my head, playing lead guitar, then the next moment I'm playing the bass, then I'm playing the drums, then I'm playing the guitar again, etc. LOL, I'm never picturing myself as the singer :) I see and feel the frets on the guitar neck, I feel my thumb, index and middle fingers plucking the strings on the bass.

I play trombone and bass (acoustic and electric) and find myself doing that with genres that I actively played in (jazz, classical, blues, a bit of funk). Sometimes I seem to even mentally fill in progressions before they happen, then get shocked when the artist goes in a different direction. I also think I mentally fill in sounds I can't necessarily actually hear.

Interestingly, for genres in which I have little to no hands-on experience (classical Chinese or Japanese music, gamelan, EDM, liturgical chant) I don't find myself "listening as a musician".
 

svart-hvitt

Major Contributor
A very familiar concept to anyone who develops neural networks or other heuristic pattern-recognition systems - but much, much slower. Maybe a lifetime of listening tests translates into the AI scientist's first week of work, on a high with the apparent universal power of what they are working with. After a little while they realise the limitations and rein in their enthusiasm.

Here, the usual system is to train the network on a set of data, and test it in parallel on other data - you watch two error lines hopefully descending together. The theory is that where they begin to diverge, with the testing data error beginning to rise, the network is being over-trained, and training should stop just prior to there. Of course, in such a scheme, the error is being measured automatically, so thousands of iterations can be run in seconds. Having to do the same thing using humans and ABX tests would be rather more tedious!

So perhaps the network is 98% correct on the training data and 80% correct on the test data. You then gather some new data, perhaps a camera in a location you've never tried before. Suddenly the thing fails to work. You realise that it wasn't really detecting what you thought it was detecting. So you feed a selection of this new data to the test and training batches and try again. This time you achieve 98% and 78%. You bring in some more data...

In the end, you begin to realise that the problem is many-dimensioned, and there is a conflict between the simplicity of a practical network - which is too 'blunt' to provide good discrimination, and a complex network for which you can't provide enough data, plus your data is 'lumpy' and not distributed evenly enough for the network to learn to generalise ('multi-dimensionally interpolate') properly.

In the end, you abandon the neural network idea, learn to understand what the system has to do from first principles, and engineer it conventionally. I imagine that lossy codecs could only have been created this way, and the listening tests are just playing at the margins.

I think what you describe is the power of «rule of thumb».
 

svart-hvitt

Major Contributor
Meaning classic reproducibility of results by other experimenters, as in chemistry or physics?

Real-life audio evaluation by ear is much, much more complex than in the normal hard sciences, due to (1) humans being involved and (2) the complexity of recreating listening scenarios (for example, room-speaker interaction).

That audio evaluation is complex doesn't mean it's impossible. But researchers need to know (a lot about) what they're doing. It cannot always be outsourced to «nurses taking blood samples».
 

watchnerd

Grand Contributor
Real-life audio evaluation by ear is much, much more complex than in the normal hard sciences, due to (1) humans being involved and (2) the complexity of recreating listening scenarios (for example, room-speaker interaction).

Up to a point. Neutrino and gravitational-wave detection is pretty hard, even compared to studying humans. :)
 