I'm not sure. I've looked over his materials and I haven't found anything definitive, but whenever he describes the procedure, it's as seeking the difference between the original and a degraded version. So I would assume it's not a L/R null.
SINAD is just THD+N inverted. There are plenty of studies on the latter, but nothing definitive, because it's an engineering metric and not designed with psychoacoustic considerations in mind.
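For anyone reading along, "inverted" here means only a sign flip in dB: both figures express the same ratio of distortion-plus-noise to the fundamental. A quick sketch (the function name is my own, not from any standard):

```python
import math

def thdn_percent_to_sinad_db(thdn_percent):
    """Convert a THD+N figure, given as a percentage of the
    fundamental, to SINAD in dB. Same ratio, opposite sign:
    SINAD(dB) = -20 * log10(THD+N ratio)."""
    ratio = thdn_percent / 100.0
    return -20.0 * math.log10(ratio)

# 0.001 % THD+N corresponds to 100 dB SINAD
print(thdn_percent_to_sinad_db(0.001))  # 100.0
```

Which is why neither number tells you anything the other doesn't.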
The same is true of the Df metric. The graphs you quoted aren't from third party studies. They were made by Serge. (This next part below is less for you than for others reading this thread.)
He created tests using voluntary online submissions. Visitors to his website are prompted to download a zip file, chosen automatically and at random from his database of test files. The visitor is given no information about the DUT. Inside is one short wav file with two segments: the original digital file (taken from EBU's SQAM, for example) and a recording of the same material from the DUT (or through a psychoacoustic encoder like AAC), with the order of the two segments randomized. So it's a good setup considering it's a one-man operation.
View attachment 74701
The user then inputs the name of the zip file and answers two questions:
View attachment 74700
The answers are then automatically uploaded to his database and tallied.
These aren't controlled listening tests, but he's received over 10,000 responses so far. So he's able to calculate and interpolate values for audibility curves, which is how he's come up with the figures I quoted in my post above.
But that is the extent of the evidence correlating to audibility. It's been said more than once that his metric does not reference the math developed for psychoacoustic analysis. What you have in the end is measurement of a DUT using a commercially available ADC, and analysis of the resulting digital file. As he says:
View attachment 74705
Let's say that's not a problem and that Df can be calculated accurately (Serge says that, due to quantization noise, the maximum accuracy for Df is limited to -140dB, and that only in the digital domain). It's still just a measurement of device output.
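To give a rough idea of what that kind of measurement involves, here's a toy sketch of a difference-level figure: the RMS of the residual relative to the RMS of the reference, in dB. This is my own simplification, not Serge's actual algorithm, and it assumes the two signals are already time-aligned and level-matched, which the real computation has to handle itself:

```python
import numpy as np

def difference_level_db(reference, recorded):
    """Toy difference-level figure: RMS of the residual
    (recorded minus reference) relative to the RMS of the
    reference, expressed in dB. Assumes both arrays are
    already time-aligned and level-matched."""
    residual = recorded - reference
    return 20.0 * np.log10(np.sqrt(np.mean(residual ** 2)) /
                           np.sqrt(np.mean(reference ** 2)))

# A perfect device would return -inf; a residual 60 dB below
# the signal yields a figure of about -60.
```

Note that everything the DUT does wrong, whatever its cause, lands in that one residual.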
Regarding what I said before about compression: look at how the values are grouped:
View attachment 74695
Serge considers a Df of -50 transparent, and these devices apparently occupy the middle ground in what looks like two different statistical modes (let's call them the unpolished-brass mode and the brick mode). Many appear to fall within 1dB of each other. Serge considers a 1.5dB to 2dB difference in Df subjectively significant, but it's not clear what "significant" really means here. You could say that all devices falling within that range "sound the same". But is that accurate?
SINAD can show the same value for wildly different spectra, audible and inaudible alike. Is that also the case for the Df metric? For a collection of devices with very degraded signals (all within a 2dB range around a Df of -5dB, for example), can you get results which are numerically similar but audibly very different due to spectral differences?
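To make that concern concrete, here's a toy illustration (mine, not Serge's): two residuals with identical RMS, and therefore identical SINAD- or Df-style figures, but with very different spectra:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 48000
# Residual A: broadband noise, scaled to an RMS of 0.001
a = rng.standard_normal(n)
a *= 0.001 / np.sqrt(np.mean(a ** 2))
# Residual B: a single 3 kHz tone at the same RMS
b = (0.001 * np.sqrt(2)) * np.sin(2 * np.pi * 3000 * np.arange(n) / 48000)

# Both residuals measure identically...
print(np.sqrt(np.mean(a ** 2)), np.sqrt(np.mean(b ** 2)))  # both ~0.001
# ...but a pure tone at this level is far more audible than
# the same energy spread across the whole band.
```

Any single-number metric that collapses the residual to an amplitude will treat these two as equivalent.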
From this post of yours I'll highlight this:
Regarding possible clipping of the DUT: the signal itself is soft-clipped pink noise. We might be seeing the results of IMD and images aliased into the audible band, or it might be an internal overflow error. The fact that Df measures and conflates "all possible sources of degradation" doesn't serve us well: as many have said here, everything is collapsed into a single amplitude difference. There is no way to subtract, for example, the degradation introduced by his ADC recorder from that of the DUT given only the two values and arrive at a valid answer. It is not as mathematically straightforward as adding and subtracting dB.
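To spell out why the dB arithmetic doesn't work: even in the best case, where the two degradation sources are uncorrelated, their levels combine as powers, not as dB values. And if the residuals are correlated, as null residuals often are, even this power sum is wrong, since they can partially cancel or reinforce. A sketch of the uncorrelated case (my own, for illustration):

```python
import math

def power_sum_db(a_db, b_db):
    """Combined level of two *uncorrelated* degradation sources,
    each given in dB: convert to power, add, convert back.
    Correlated residuals do not obey this at all."""
    return 10.0 * math.log10(10 ** (a_db / 10.0) + 10 ** (b_db / 10.0))

# An ADC floor at -80 dB and a DUT degradation at -80 dB combine
# to about -77 dB, not -160 and not -80.
print(round(power_sum_db(-80, -80), 1))  # -77.0
```

Going the other way, recovering the DUT's own contribution from the combined figure, requires knowing the ADC's residual and assuming zero correlation, neither of which we have.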
He has posted for example these spectrograms as a way to visualize differences in DUT behaviour:
View attachment 74712
View attachment 74713
The question to ask is: what are we seeing? The reason I asked before about all of those kinds of null combinations is that it is absolutely unclear what a colour shift or a numerical shift means. We have, through his measurements (assuming accuracy), very definitive presentations of "better" and "worse" which are undefinable and uninterpretable because they are self-referential. So we turn to his listening-test metrics. There we find statistical significance, albeit a significance which cannot be related to established psychoacoustic metrics. It is like entering a world closed in on itself.
And I think that is more or less his conclusion as well. His philosophy of "honest audio" is that manufacturers could, if they wanted to, produce completely transparent devices right now, but they don't, lacking the incentive:
View attachment 74714
View attachment 74715
He does not seem interested in understanding the results in terms of traditional measurements. His goal, rather, is to push for better engineering that exceeds the Df thresholds and all the others (THD/IMD/SNR), bypassing the psychoacoustic side of things entirely.
So it seems like this metric would be most useful for designers, given how much time they spend on evaluation; they could dig into what's really happening. Doing Df measurements on the top tier of devices would be interesting too, but that's the kind of thing that could be more readily considered if the metric were more established and there were more than just Amir doing the testing work.