A better test would be the following. Suppose a user preferred equipment A over equipment B after a long period of listening (say 2-4 weeks) under various conditions. To test the reliability of that preference, give the person the two devices in random sequence, for the same period and type of usage, but blind, so that they cannot tell which brand is in use. Each time, record which device they preferred. If the preferences are not skewed toward the originally selected device to a statistically significant degree, then you can say the selection process was unreliable.
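To make "statistically significantly skewed" concrete, here is a minimal sketch of one way to check it, assuming each blind round is an independent trial and that a listener with no real preference would pick the original device half the time. The function name, the round count, and the 8-of-10 outcome are hypothetical, chosen only for illustration; a one-sided binomial test is one reasonable choice here, not the only one.

```python
from math import comb


def one_sided_binomial_p(successes: int, trials: int, chance: float = 0.5) -> float:
    """Probability of at least `successes` hits in `trials` rounds by pure chance."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical outcome: over 10 blind rounds, the listener preferred the
# originally chosen device 8 times.
p_value = one_sided_binomial_p(successes=8, trials=10)
print(f"p = {p_value:.4f}")  # p = 0.0547
```

With these made-up numbers, 8 out of 10 gives p of about 0.055, which just misses the conventional 0.05 threshold, while 9 out of 10 would give about 0.011. A small number of rounds therefore needs a very lopsided result before the original preference can be called reliable.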
A test performed on a single person is statistically insignificant, and performing such a test on a group of people would be impractical.