Wow and flutter tend to show up as pitch modulations with a characteristic rate and depth of their own, within a limited range (wow is the slow drift, roughly below 4 Hz; flutter is the faster jitter above that). If you look at a spectrogram you could probably learn to pick them out by eye, so maybe Capstan uses a machine learning approach based on that.
For example, with music, if the modulation is less than a semitone, you can simply compare the same notes across time, extract the pitch modulation curve, and then reverse it.
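To make that concrete, here's a rough numpy sketch of the idea on a synthetic signal: estimate the instantaneous pitch, treat the deviation from the median as the wow curve, and undo it by resampling. The 440 Hz test tone, the 2 Hz / ±3% modulation, and the Hilbert-transform pitch tracker are all stand-ins of mine, not anything Capstan actually does:

```python
import numpy as np

sr = 16000
t = np.arange(2 * sr) / sr

# Hypothetical test signal: a 440 Hz tone with 2 Hz "wow" of +/-3% depth
speed = 1 + 0.03 * np.sin(2 * np.pi * 2 * t)
x = np.sin(2 * np.pi * 440 * np.cumsum(speed) / sr)

# Instantaneous frequency from the analytic signal (Hilbert transform via FFT)
X = np.fft.fft(x)
X[len(X) // 2 + 1:] = 0          # zero the negative frequencies
X[1:len(X) // 2] *= 2
inst_freq = np.diff(np.unwrap(np.angle(np.fft.ifft(X)))) * sr / (2 * np.pi)

# The wow curve is the deviation from the (assumed steady) median pitch
ratio = inst_freq / np.median(inst_freq)

# Reverse it: integrate the speed ratio to recover the "true" time axis,
# then resample the audio back onto a uniform grid of that axis
true_t = np.cumsum(ratio) / sr
y = np.interp(t[:-1], true_t, x[:-1])
```

A real recording would need a robust polyphonic pitch tracker and smoothing of the wow curve, but the reverse step, a variable-rate resample, is the same.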
With voice, the harmonics won't be the same from phrase to phrase, but for a given person speaking on the recording the formant frequencies stay in a roughly fixed range, so if you can detect those, you can extract the pitch modulation as well.
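Formant detection is classically done with linear prediction (LPC): the complex roots of the prediction polynomial sit near the vocal-tract resonances. A rough numpy sketch on a synthetic "vowel" — white noise pushed through two resonators at 700 and 1200 Hz, which are made-up stand-ins for a speaker's formants, not values from the post:

```python
import numpy as np

sr = 8000
# Hypothetical vowel-like signal: white-noise excitation through two
# two-pole resonators (the fixed "formants" of our pretend speaker)
rng = np.random.default_rng(0)
x = rng.standard_normal(sr)
for f in (700.0, 1200.0):
    r, w = 0.97, 2 * np.pi * f / sr
    a1, a2 = -2 * r * np.cos(w), r * r        # resonator denominator coefficients
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] - a1 * (y[n - 1] if n >= 1 else 0.0) \
                    - a2 * (y[n - 2] if n >= 2 else 0.0)
    x = y

# LPC via the autocorrelation method: solve the normal equations R a = r
order = 8
ac = np.correlate(x, x, "full")[len(x) - 1 : len(x) + order]
R = np.array([[ac[abs(i - j)] for j in range(order)] for i in range(order)])
a = np.linalg.solve(R, ac[1 : order + 1])     # predictor coefficients

# Formant estimates = angles of the upper-half-plane roots of A(z)
roots = np.roots(np.concatenate(([1.0], -a)))
roots = roots[np.imag(roots) > 0]
formants = sorted(np.angle(roots) * sr / (2 * np.pi))
```

With the pitch of the excitation wobbling but the resonator frequencies fixed, tracking those root angles over short frames gives you a reference that doesn't move with the wow, which is the point of the formant idea above.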