As others have figured out, this is due to high-core-count/high-memory GPUs consuming a good amount of power as soon as the compute cores get any real load.
Stats from my former 3080, with tons of stuff running plus one video stream decoded by the CPU and just rendered by the GPU:
(stats are one row per second)
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk pviol tviol fb bar1 sbecc dbecc pci rxpci txpci
# Idx W C C % % % % MHz MHz % bool MB MB errs errs errs MB/s MB/s
0 46 45 - 28 19 0 0 810 510 0 0 1178 8 - - 0 318 154
0 46 45 - 27 17 0 0 810 510 0 0 1178 8 - - 0 170 144
0 46 45 - 28 19 0 0 810 525 0 0 1178 8 - - 0 170 152
0 46 45 - 23 17 0 0 810 525 0 0 1178 8 - - 0 169 142
0 46 45 - 25 17 0 0 810 510 0 0 1178 8 - - 0 169 144
0 46 45 - 24 17 0 0 810 510 0 0 1178 8 - - 0 178 155
46 W at 810 MHz memclock and 510-525 MHz GPU clock. As soon as I put a compute load on it (in this case a custom video decoder for the mentioned video stream):
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk pviol tviol fb bar1 sbecc dbecc pci rxpci txpci
# Idx W C C % % % % MHz MHz % bool MB MB errs errs errs MB/s MB/s
0 111 48 - 11 3 0 5 9251 1785 0 0 1612 11 - - 0 131 255
0 111 48 - 10 2 0 5 9251 1785 0 0 1612 11 - - 0 132 242
0 110 48 - 11 3 0 5 9251 1785 0 0 1612 11 - - 0 19 127
0 111 48 - 11 3 0 5 9251 1785 0 0 1612 11 - - 0 7 130
0 109 47 - 6 1 0 5 9251 1785 0 0 1610 11 - - 0 0 140
0 108 47 - 5 1 0 5 9251 1785 0 0 1610 11 - - 0 3 127
0 108 47 - 5 1 0 5 9251 1785 0 0 1610 11 - - 0 10 127
Memory clocks up to 9251 MHz and GPU to 1785 MHz, resulting in ~110 W power consumption, even though the load percentages are just 5-11%.
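For reference, per-second stats like the ones above are what nvidia-smi dmon prints out. A minimal sketch of capturing them from Python; the -s field selection here is an assumption and the exact columns depend on driver version and GPU:

    import subprocess

    # Sample GPU stats once per second, 10 samples, similar to the tables above.
    # -s picks the stat groups (p = power/temp, u = utilization, c = clocks,
    # v = power/thermal violations, m = FB/BAR1 memory, e = ECC/PCIe errors,
    # t = PCIe rx/tx throughput); -d is the interval in seconds, -c the sample count.
    subprocess.run(
        ["nvidia-smi", "dmon", "-s", "pucvmet", "-d", "1", "-c", "10"],
        check=True,
    )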
Since I already had scripts running that change CPU behaviour depending on what I have open, I just figured out which frequencies were adequate to keep my desktop and video acceleration running and limited my GPU to those. If I start some machine learning stuff or a game, the limits are removed (a rough sketch of the idea follows the 4090 numbers below). Numbers from my current 4090 (yeah, cost half a kidney):
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk pviol tviol fb bar1 rxpci txpci
# Idx W C C % % % % % % MHz MHz % bool MB MB MB/s MB/s
0 59 51 - 6 1 0 2 0 0 5001 810 0 0 1906 15 344 21
0 59 51 - 6 1 0 2 0 0 5001 810 0 0 1906 15 191 34
0 59 51 - 6 1 0 2 0 0 5001 810 0 0 1905 15 177 21
0 59 51 - 6 1 0 2 0 0 5001 810 0 0 1905 15 340 21
0 59 51 - 6 1 0 2 0 0 5001 810 0 0 1910 15 175 15
0 59 51 - 6 1 0 2 0 0 5001 810 0 0 1910 15 172 35
Memory capped at 5001 MHz and GPU at 810 MHz, resulting in 59 W power consumption. There aren't that many steps for the memory clocks, so the step below 5001 MHz is 810 MHz for memory too, which is too slow to handle two video streams and my desktop smoothly; I had 600 MHz there earlier. Could probably lower it now after the GPU upgrade, when I think about it. Note the memory usage and rxpci/txpci: I have more stuff running right now than when I measured the wattage before; I'm pretty sure these settings resulted in ~52-53 W on the 3080.
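What my scripts boil down to is locking the clocks to a low range by default and removing the limits again when a heavy workload starts. A simplified sketch of the idea (not my actual scripts): -lgc/-lmc need root and a reasonably new GPU/driver, and the clock values are just illustrative, so pick steps your card actually supports.

    import subprocess

    def nvsmi(*args):
        # Thin wrapper around nvidia-smi; clock locking requires root.
        subprocess.run(["nvidia-smi", *args], check=True)

    def low_power_mode():
        # Lock GPU and memory clocks to a low range that still keeps the
        # desktop and video decoding smooth. The min,max MHz values are
        # illustrative; pick steps your card actually supports.
        nvsmi("-lgc", "210,810")   # --lock-gpu-clocks
        nvsmi("-lmc", "405,810")   # --lock-memory-clocks (newer GPUs only)

    def full_power_mode():
        # Remove the limits again before games or machine learning runs.
        nvsmi("-rgc")              # --reset-gpu-clocks
        nvsmi("-rmc")              # --reset-memory-clocks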
I guess the power profiles in nvidia-settings, or some other tool, can do the same. To optimize this you would set clocks manually, check how low you can go while everything still works as intended, and set that as the maximum in the profile.
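To see which clock steps are actually available (and why the next memory step below 5001 MHz is all the way down at 810 MHz), nvidia-smi can dump every supported memory/graphics clock combination. A minimal sketch; the output layout varies a bit between driver versions:

    import subprocess

    # Dump the supported clock table and keep only the memory-clock lines;
    # each "Memory : X MHz" entry is one selectable step (the graphics clocks
    # supported at that step follow it in the full output).
    out = subprocess.run(
        ["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.strip().startswith("Memory"):
            print(line.strip())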
An unfortunate downside of having 10k+ core GPUs that have to run in sync regardless of how many of the cores we actually need for the current load. By default it probably also runs at pretty high GPU clocks to avoid stutter under intermittent (game) loads, which would look really bad in benchmarks.
edit:
Messing around a bit with the frequency limits, it seems an 810 MHz memclock is fine for video decoding after the GPU upgrade (and a few driver upgrades too, btw), so I was able to cut another 9 W.
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk pviol tviol fb bar1 rxpci txpci
# Idx W C C % % % % % % MHz MHz % bool MB MB MB/s MB/s
0 50 45 - 13 13 0 10 0 0 810 810 0 0 2161 15 42 9
0 50 45 - 10 11 0 10 0 0 810 810 0 0 2176 15 285 20
0 51 45 - 11 12 0 10 0 0 810 810 0 0 2187 15 15 8
0 50 45 - 10 11 0 10 0 0 810 810 0 0 2187 15 25 15
0 49 45 - 10 12 0 11 0 0 810 810 0 0 2185 15 34 10
Behaviour seems a bit random within a few watts though: switching between video streams on Twitch (all 1080p, 8000 kbps), it's sometimes stable at 50±1 W and sometimes at 52±1 W. Maybe it depends on the encoder settings and thus how the stream has to be decoded, even though it's the same codec and bitrate.
edit2: Never mind, the extra power usage on some streams (52±1 W vs 50±1 W) was the stream-switching bumping the GPU temperature over 60 °C and thus spinning up the GPU fans: