Noise has many components: shot noise (also called Poisson noise; its magnitude is essentially the square root of the signal), read noise, dark current, and quantization noise.
The dominant factor is, by far, shot noise: that's why darker areas or underexposures are always noisier than lighter areas or properly exposed images.
More light is always better in terms of noise.
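The "more light is better" point falls out of the Poisson statistics directly: noise grows as the square root of the signal, so SNR grows as the square root too. A back-of-the-envelope sketch (the photon counts are made-up illustrative numbers):

```python
import math

def shot_noise_snr(photons):
    """Photon arrival is Poisson-distributed: noise = sqrt(signal),
    so SNR = signal / sqrt(signal) = sqrt(signal)."""
    return photons / math.sqrt(photons)

# A shadow region catching 100 photons vs a highlight catching 10,000:
print(shot_noise_snr(100))     # SNR = 10.0
print(shot_noise_snr(10_000))  # SNR = 100.0
```

A 100x brighter patch is only 10x cleaner in SNR terms, which is exactly why shadows and underexposures look so much noisier.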
Smaller pixels (say 3 µm) used to be much worse than bigger ones (say 6 µm) because the combined sensing area of four 3 µm pixels was smaller than that of a single 6 µm pixel: the photosensitive area did not cover the full pixel surface (column and row separation, plus the per-pixel CMOS readout electronics). That disadvantage has now been greatly reduced thanks to microlenses, back-side illumination, and the like.
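To see why the per-pixel overhead hurt small pixels more, here is a toy fill-factor comparison. The fill-factor values are illustrative assumptions, not measured specs for any real sensor:

```python
# Toy fill-factor model: sensing area = pixel area x fraction that is
# actually photosensitive. Wiring and electronics take a roughly fixed
# border, which eats a larger fraction of a small pixel.
def sensing_area(pitch_um, fill_factor, n_pixels=1):
    return n_pixels * (pitch_um ** 2) * fill_factor

big   = sensing_area(6.0, fill_factor=0.70)              # one 6 µm pixel
small = sensing_area(3.0, fill_factor=0.45, n_pixels=4)  # 2x2 of 3 µm
print(big, small)  # ~25.2 µm² vs ~16.2 µm² — the small pixels lose

# With microlenses/BSI pushing the effective fill factor near 1,
# the same total footprint collects the same light either way:
print(sensing_area(6.0, 0.95), sensing_area(3.0, 0.95, 4))  # both ~34.2 µm²
```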
The other area where large pixels had an edge was dynamic range (they had a greater electron well "depth", i.e., full-well capacity). Binning solves that problem, but only if it doesn't add noise, and that depends to a large extent on read noise.

Initially, the CCD architecture and the way it was read out provided a big advantage: the charge is summed on-chip before readout, so 2x2 binning pays the read noise only once, whereas digitally binning four CMOS pixels combines four separate reads, and the read noise adds in quadrature to about 2x the per-pixel value. That stopped mattering when CMOS read noise became so low that even 2x it was lower than the CCD's read noise. You still get a small read-noise penalty for four small pixels vs one large one, though.

The next generation of small-pixel sensors could even improve on larger pixels in terms of dynamic range, because the deep-trench "walls" between the wells will be (or already are) used to add capacity. That should be the revenge of the small pixel.
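The CCD-vs-CMOS binning difference is easy to put in numbers. The read-noise figures below (8 e⁻ for an older CCD, 1.5 e⁻ for a modern CMOS) are plausible round numbers chosen for illustration:

```python
import math

def binned_read_noise(read_noise_e, n_pixels, charge_domain):
    """Effective read noise after binning n_pixels together.
    charge_domain=True  -> CCD-style binning: charge is summed before
                           readout, so read noise is paid only once.
    charge_domain=False -> digital (CMOS) binning: each pixel is read
                           first, so the read noises add in quadrature."""
    if charge_domain:
        return read_noise_e
    return read_noise_e * math.sqrt(n_pixels)

# Old-school CCD (say 8 e-) vs modern CMOS (say 1.5 e-), 2x2 binning:
print(binned_read_noise(8.0, 4, charge_domain=True))   # 8.0 e-
print(binned_read_noise(1.5, 4, charge_domain=False))  # 3.0 e- — still wins
```

Even paying the quadrature penalty, the low-read-noise CMOS comes out well ahead of the binned CCD.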
Dark current becomes an issue for long exposures, or if the camera overheats (as many recent top-of-the-line models tend to do).
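A common rule of thumb (sensor-dependent, so treat the doubling constant and rates below as illustrative assumptions) is that dark current roughly doubles every ~6 °C:

```python
def dark_current(ref_rate_e_per_s, ref_temp_c, temp_c, doubling_c=6.0):
    """Rule-of-thumb dark-current model: rate doubles every ~6 °C.
    The exact doubling temperature varies from sensor to sensor."""
    return ref_rate_e_per_s * 2 ** ((temp_c - ref_temp_c) / doubling_c)

# A sensor at 0.1 e-/s at 25 °C that heats up to 43 °C:
rate = dark_current(0.1, 25, 43)
print(rate)        # 8x the dark current
print(rate * 120)  # ~96 e- of dark signal accumulated in a 2-minute exposure
```

That is why a hot camera body hurts long exposures far more than short ones: the dark signal (and its shot noise) scales with both temperature and exposure time.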
Read noise remains a concern for some applications (ultra-high-speed imaging, among others).
There are some cases where, if you're after optimal resolution, pixel size vs focal length matters because you want to sample the lens MTF at the Nyquist rate, but those cases don't apply to everyday photography.
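For reference, the sensor's Nyquist frequency follows directly from the pixel pitch: one line pair needs at least two pixels. A quick sketch (the pitches match the two sensors mentioned below):

```python
def nyquist_lp_per_mm(pixel_pitch_um):
    """Sensor Nyquist frequency in line pairs per mm:
    one line pair needs two pixels, so f_Nyq = 1 / (2 * pitch)."""
    return 1000.0 / (2.0 * pixel_pitch_um)

print(nyquist_lp_per_mm(3.76))  # ~133 lp/mm for a 3.76 µm pitch
print(nyquist_lp_per_mm(11.5))  # ~43 lp/mm for an 11.5 µm pitch
```

Lens MTF at 133 lp/mm is a demanding target, which is why this only matters when chasing the last bit of resolution.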
Still, in today's world, the very dominant factor is the ability to gather light: lens aperture, total sensing area, QE (because you need to convert those photons into electrons), and well capacity are what matter most.
A single 3.76 µm CMOS pixel of a Sony A7R IV laughs at the 11.5 µm CCD pixels of the original Canon 1D.