That is the point of the whole topic: comparing upsampling from image processing as an argument that lost detail cannot be recreated. The latest techniques come close to the original. If we did the same for music, training neural networks on reduced data and checking how close they come to the original, would we not get a much better result? Quite likely not in real time, but maybe as a preprocessing stage?
No, there is no image processing issue involved in the ACTUAL process of audio upsampling. No. Imaging works in the spatial domain. Sound works in a peculiar time/frequency domain dictated by how the human cochlea is known (tested, verified, and understood) to function. Yes, you have to worry about images and aliases in the signal, but that's a completely different thing, which you can see discussed below in some detail.
Image detail requires "making up" information based on SPATIAL cues that must be eliminated (pixelation) and other cues that should be carried through (image edges). The information is spatial in character, and conversion to the frequency domain may be useful as a processing step, but is not the key to understanding the perception of the image.
Audio is frequency based. There is one, I repeat ONE, issue: do not muck up the spectrum. If you double the sampling rate, do not add anything and do not take anything away, because there is no detectable feature to be "inserted", barring some very young ears listening. Even if there were, the structure of audio signal creation makes "guessing right" much more difficult.
As a result, the examination of actual audio upsampling is one of the very few things in the audio domain for which least-mean-squares error is actually important. What do you even MEAN "compared to the original"? What original do you have in mind?
With PCM you ***GET*** all of the original in the bandwidth you started with. Your idea does not even fit into the reality of the process.
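To make the least-mean-squares point concrete, here is a minimal sketch in plain Python. It is not taken from the talk linked in this post, and every name in it (`upsample_2x`, `half_width`, the Blackman-windowed sinc, the three test tones) is an assumption chosen for illustration. It samples a band-limited signal at 48 kHz, upsamples it by 2, and compares the result against the same signal sampled directly at 96 kHz.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi*x) / (pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def blackman(d, half_width):
    """Continuous Blackman window on [-half_width, half_width]; 1 at d = 0."""
    if abs(d) >= half_width:
        return 0.0
    r = math.pi * d / half_width
    return 0.42 + 0.5 * math.cos(r) + 0.08 * math.cos(2.0 * r)


def upsample_2x(x, half_width=32):
    """Double the sampling rate by windowed-sinc interpolation.

    Illustrative only: a real converter would use a designed polyphase
    filter. Output sample m sits at time m/2 in input-sample units.
    """
    y = [0.0] * (2 * len(x))
    for m in range(len(y)):
        t = m / 2.0
        lo = max(0, math.ceil(t - half_width))
        hi = min(len(x) - 1, math.floor(t + half_width))
        acc = 0.0
        for n in range(lo, hi + 1):
            d = t - n
            acc += x[n] * sinc(d) * blackman(d, half_width)
        y[m] = acc
    return y


def tones(num_samples, fs, freqs=(1000.0, 5000.0, 15000.0)):
    """Band-limited test signal: equal-level sines, all below 24 kHz."""
    return [
        sum(math.sin(2.0 * math.pi * f * k / fs) for f in freqs) / len(freqs)
        for k in range(num_samples)
    ]


def error_db(fs=48000.0, n=512, half_width=32):
    """Squared error of the upsampled signal vs. direct 2x sampling, in dB."""
    x = tones(n, fs)             # what the PCM stream contains
    ref = tones(2 * n, 2.0 * fs)  # the same signal sampled at twice the rate
    y = upsample_2x(x, half_width)
    # Score only samples where the interpolator has full support.
    lo, hi = 2 * half_width, 2 * (n - half_width)
    err = sum((y[m] - ref[m]) ** 2 for m in range(lo, hi))
    sig = sum(ref[m] ** 2 for m in range(lo, hi))
    return 10.0 * math.log10(max(err, 1e-300) / sig)


if __name__ == "__main__":
    print("error relative to direct 96 kHz sampling: %.1f dB" % error_db())
```

On a band-limited input like this, the residual comes down to the interpolation filter's own passband ripple and stopband leakage. Nothing is being "guessed", because the 48 kHz samples already determine everything in the original bandwidth.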
If you mean fixing the output of a perceptual codec (the equivalent of reducing pixel count in an image, to some poorly equatable extent) then you're arguing about something that has exactly ZERO to do with upsampling, or downsampling.
So, look, don't condescend to me here, by telling me what upsampling means. Yes, I know about both image and audio, and the two problems are simply not the same in any reasonable regard.
Ditto downsampling, by the way.
Here, this is how sampling rate conversion works, and why least mean squares matters for audio. I haven't done the same for imaging, because it's MUCH more in its infancy (the work you show is reasonable for images), partly because of the rather substantially different perceptual constraints, and partly because I prefer to work on audio.
So read this for audio sampling rate conversion, covering both upsampling AND downsampling.
https://www.aes-media.org/sections/pnw/pnwrecaps/2016/jjsrc_jan2016/
Scroll up at http://www.aes-media.org/sections/pnw/pnwrecaps/index.htm if you need some updating on how the ear works.