I'm not sure. I've looked over his materials and I haven't found anything definitive, but whenever he describes the procedure, it's as seeking the difference between the original and a degraded version. So I would assume it's not a L/R null.
SINAD is just THD+N inverted. There are plenty of studies on the latter, but nothing definitive, because it's an engineering metric and not designed with psychoacoustic considerations in mind.
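For anyone reading along, "inverted" here means only a sign flip in dB: both figures express the same ratio of distortion-plus-noise to the fundamental. A quick sketch (the function name is my own, not from any standard):

```python
import math

def thdn_percent_to_sinad_db(thdn_percent):
    """Convert a THD+N figure, given as a percentage of the
    fundamental, to SINAD in dB. Same ratio, opposite sign:
    SINAD(dB) = -20 * log10(THD+N ratio)."""
    ratio = thdn_percent / 100.0
    return -20.0 * math.log10(ratio)

# 0.001 % THD+N corresponds to 100 dB SINAD
print(thdn_percent_to_sinad_db(0.001))  # 100.0
```

Which is why neither number tells you anything the other doesn't.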
The same is true of the Df metric. The graphs you quoted aren't from third party studies. They were made by Serge. (This next part below is less for you than for others reading this thread.)
He created tests using voluntary online submissions. Visitors to his website are prompted to download a zip file, chosen automatically and at random from his database of test files. The visitor is given no information about the DUT. Inside is one short wav file with two segments: the original digital file (taken from EBU's SQAM, for example) and a recording of the same material from the DUT (or through a psychoacoustic encoder like AAC), with the order of the two segments randomized. So it's a good setup considering it's a one-man operation.
View attachment 74701
The user then inputs the name of the zip file and answers two questions:
View attachment 74700
The answers are then automatically uploaded to his database and tallied.
These aren't controlled listening tests, but he's received over 10,000 responses so far. So he's able to calculate and interpolate values for audibility curves, which is how he's come up with the figures I quoted in my post above.
But that is the extent of the evidence correlating to audibility. It's been said more than once that his metric does not reference the math developed for psychoacoustic analysis. What you have in the end is measurement of a DUT using a commercially available ADC, and analysis of the resulting digital file. As he says:
View attachment 74705
Let's say that's not a problem and that Df can be calculated accurately (Serge says that, due to quantization noise, the maximum accuracy for Df is limited to -140dB, and that only in the digital domain). It's still just a measurement of device output.
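To give a rough idea of what that kind of measurement involves, here's a toy sketch of a difference-level figure: the RMS of the residual relative to the RMS of the reference, in dB. This is my own simplification, not Serge's actual algorithm, and it assumes the two signals are already time-aligned and level-matched, which the real computation has to handle itself:

```python
import numpy as np

def difference_level_db(reference, recorded):
    """Toy difference-level figure: RMS of the residual
    (recorded minus reference) relative to the RMS of the
    reference, expressed in dB. Assumes both arrays are
    already time-aligned and level-matched."""
    residual = recorded - reference
    return 20.0 * np.log10(np.sqrt(np.mean(residual ** 2)) /
                           np.sqrt(np.mean(reference ** 2)))

# A perfect device would return -inf; a residual 60 dB below
# the signal yields a figure of about -60.
```

Note that everything the DUT does wrong, whatever its cause, lands in that one residual.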
Regarding what I said before about compression: look at how the values are grouped:
View attachment 74695
Serge considers a Df of -50 transparent, and these devices apparently occupy the middle ground in what looks like two different statistical modes (let's call them the unpolished-brass mode and the brick mode). Many appear to fall within 1dB of each other. Serge considers a 1.5dB to 2dB difference in Df subjectively significant, but it's not clear what "significant" really means here. You could say that all devices falling within that range "sound the same". But is that accurate?
SINAD can show the same value for wildly different spectra, audible and inaudible alike. Is that also the case for the Df metric? For a collection of devices with very degraded signals (all within a 2dB range around a Df of -5dB, for example), can you get results which are numerically similar but audibly very different due to spectral differences?
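To make that concern concrete, here's a toy illustration (mine, not Serge's): two residuals with identical RMS, and therefore identical SINAD- or Df-style figures, but with very different spectra:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 48000
# Residual A: broadband noise, scaled to an RMS of 0.001
a = rng.standard_normal(n)
a *= 0.001 / np.sqrt(np.mean(a ** 2))
# Residual B: a single 3 kHz tone at the same RMS
b = (0.001 * np.sqrt(2)) * np.sin(2 * np.pi * 3000 * np.arange(n) / 48000)

# Both residuals measure identically...
print(np.sqrt(np.mean(a ** 2)), np.sqrt(np.mean(b ** 2)))  # both ~0.001
# ...but a pure tone at this level is far more audible than
# the same energy spread across the whole band.
```

Any single-number metric that collapses the residual to an amplitude will treat these two as equivalent.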
From this post of yours I'll highlight this:
Regarding possible clipping of the DUT: the signal itself is soft-clipped pink noise. We might be seeing the results of IMD and images aliased into the audible band, or it might be an internal overflow error. The fact that Df measures and conflates "all possible sources of degradation" doesn't serve us well: as many have said here, everything is collapsed into a single amplitude difference. There is no way to subtract, for example, the degradation introduced by his ADC recorder from that of the DUT given only the two values and arrive at a valid answer. It is not as mathematically straightforward as adding and subtracting dB.
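To spell out why the dB arithmetic doesn't work: even in the best case, where the two degradation sources are uncorrelated, their levels combine as powers, not as dB values. And if the residuals are correlated, as null residuals often are, even this power sum is wrong, since they can partially cancel or reinforce. A sketch of the uncorrelated case (my own, for illustration):

```python
import math

def power_sum_db(a_db, b_db):
    """Combined level of two *uncorrelated* degradation sources,
    each given in dB: convert to power, add, convert back.
    Correlated residuals do not obey this at all."""
    return 10.0 * math.log10(10 ** (a_db / 10.0) + 10 ** (b_db / 10.0))

# An ADC floor at -80 dB and a DUT degradation at -80 dB combine
# to about -77 dB, not -160 and not -80.
print(round(power_sum_db(-80, -80), 1))  # -77.0
```

Going the other way, recovering the DUT's own contribution from the combined figure, requires knowing the ADC's residual and assuming zero correlation, neither of which we have.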
He has posted for example these spectrograms as a way to visualize differences in DUT behaviour:
View attachment 74712
View attachment 74713
The question to ask is: what are we seeing? The reason I asked before about all of those kinds of null combinations is that it is absolutely unclear what a colour shift or a numerical shift means. We have, through his measurements (assuming accuracy), very definitive presentations of "better" and "worse" which are undefinable and uninterpretable because they are self-referential. So we turn to his listening-test metrics. There we find statistical significance, albeit a significance which cannot be related to established psychoacoustic metrics. It is like entering a world closed in on itself.
And I think that is more or less his conclusion as well. His philosophy of "honest audio" is that manufacturers could, if they wanted to, produce completely transparent devices right now, but they don't, lacking the incentive:
View attachment 74714
View attachment 74715
He does not seem interested in understanding the results in terms of traditional measurements. His goal, rather, is to push for better engineering that exceeds the Df thresholds and all the others (THD/IMD/SNR), bypassing the psychoacoustic side of things entirely.
So it seems like this metric would be most useful for designers, given how much time they spend on evaluation; they could dig into what's really happening. Doing Df measurements on the top tier of devices would be interesting too, but that's the kind of thing that could be more readily considered if the metric were more established and there were more than just Amir doing the testing work.