We'll refer to electrostatic loudspeakers as ESLs or stats, much as cone speakers are called dynamics.
Quick comparison between dynamics and stats:
1) Stats use electrostatic force to vibrate a nearly weightless membrane. Dynamics use electromagnetic force to vibrate a cone.
2) The force on a stat's membrane is spread out evenly, so the membrane moves as one. Cones are pushed from the middle and vibrate in ways other than the ideal in-out motion.
3) The force on a stat's membrane is simply proportional to the voltage at all times, so linearity and low distortion occur naturally. The position of a cone, however, is a complex function of current plus half a dozen continually changing parameters, making linearity and low distortion impossible under typical operating conditions.
4) A stat's membrane is as thin as a twentieth of a sheet of paper, far lighter than the air it is vibrating against, so very little power is used to vibrate it. A cone and its driver coil are at least 100 times as heavy and sprung by a spider and surround, so most of the power goes to move all that.
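Point 3 can be made concrete with a minimal sketch. It assumes the common push-pull arrangement, in which the membrane carries a fixed charge and sits between two stators driven by the signal voltage; the text above does not spell this out, and the function name and example values are hypothetical. In that idealized case (uniform field, edge effects ignored, and leaving aside the signal-independent attraction toward the nearer stator), the signal-driven force is F = Q * V / d, where Q is the membrane charge, V the signal voltage across the stators, and d the stator-to-stator gap. Membrane position does not appear in the expression.

```python
def push_pull_force(charge_c, signal_v, stator_gap_m):
    """Signal-driven force (newtons) on a constant-charge membrane.

    Idealized model (an assumption, not taken from the text):
    a uniform field E = V / d between the stators acts on a fixed
    charge Q, so F = Q * E. Membrane position is not an input.
    """
    field_v_per_m = signal_v / stator_gap_m
    return charge_c * field_v_per_m


# Hypothetical example values, chosen only for illustration.
CHARGE_C = 1e-7        # fixed charge on the membrane, coulombs
STATOR_GAP_M = 4e-3    # stator-to-stator spacing, metres

for volts in (-2000, -1000, 0, 1000, 2000):
    force = push_pull_force(CHARGE_C, volts, STATOR_GAP_M)
    print(f"{volts:+6d} V -> {force:+.4f} N")
```

Doubling the voltage doubles the force, and reversing the voltage reverses it, which is the proportionality point 3 describes.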
An ESL's highly linear driving force is spread out evenly over the entire membrane, and the membrane is extremely light -- lighter than the air it is pushing.
As a result, the membrane moves with near-perfect coherency and accuracy, remaining flat and moving without a trace of breakup. The upshot is that the sound waves made by the membrane exhibit practically no harmonic or intermodulation distortion, and can be given a frequency response that is both flat and smooth.
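To put the "lighter than the air" remark in rough numbers, here is a back-of-envelope sketch. Every input is an assumption rather than a figure from the text: office paper about 0.1 mm thick, a polyester film membrane with a density of about 1390 kg/m^3, and air at about 1.2 kg/m^3. Under those assumptions a membrane one twentieth the thickness of paper weighs roughly 7 grams per square metre, about the same as a layer of air 6 mm thick over the same area.

```python
# Assumed values (not taken from the text).
PAPER_THICKNESS_M = 100e-6   # typical office paper, about 0.1 mm
FILM_DENSITY_KG_M3 = 1390.0  # typical polyester (PET) film
AIR_DENSITY_KG_M3 = 1.2      # air near room temperature at sea level


def membrane_areal_mass(thickness_m, density_kg_m3=FILM_DENSITY_KG_M3):
    """Mass per unit area of the membrane, in kg/m^2."""
    return thickness_m * density_kg_m3


def equivalent_air_layer(areal_mass_kg_m2, air_density_kg_m3=AIR_DENSITY_KG_M3):
    """Thickness (m) of an air layer with the same mass per unit area."""
    return areal_mass_kg_m2 / air_density_kg_m3


thickness = PAPER_THICKNESS_M / 20          # "a twentieth of a sheet of paper"
areal_mass = membrane_areal_mass(thickness)
air_layer = equivalent_air_layer(areal_mass)

print(f"membrane thickness: {thickness * 1e6:.1f} micrometres")
print(f"membrane mass:      {areal_mass * 1e3:.2f} g per square metre")
print(f"equal-mass air:     {air_layer * 1e3:.1f} mm thick layer")
```

How much air the membrane actually has to move depends on panel size and frequency, so this is only an order-of-magnitude comparison.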
Interestingly and importantly, the way an ESL converts an audio signal to sound is the exact inverse of how a recording microphone converts sound into an audio signal. In a microphone, pressure creates voltage, and in an ESL, voltage creates pressure. This contributes to the exceptional accuracy of ESLs. It is not the case for cone speakers, where electrical current supplies non-linear force to a multiple spring-mass-damper system.
Lastly, because an ESL better represents a recording's phase information, an even greater increase in realism occurs with recordings that have been made in a way that conserves a performance's initial phase information.
The situation is very different with cone speakers. One generally hears substantial contrasts in tone when listening to different designs. A large part of the explanation is that a heavy, spring-loaded coil pushes and pulls the center of a high-mass radiating surface. There are practical limits to how accurately the coil movement can follow an audio signal, and then to how well the cone movement can follow the coil movement.
A cone speaker's resulting lags, displacements and resonances alter the sound in several ways, and add sound that was not part of the original recording. These effects can be minimized, but not eliminated or even made inaudible. Each cone speaker design reflects a particular set of choices and trade-offs made while engineering around limitations that ESLs do not have.