I wanted to write something about this, as some of my recent posts have dealt with loudness (in terms of digital files) as well as SPL and measured pressure. However, I haven't really discussed how we perceive loudness, which comes down to how the ear operates (as opposed to how a measuring system does).
Mind you, this is not meant to be an exhaustive look at Human hearing; that's an incredibly complicated topic (as those working in the field can attest), and I'm sure that I will gloss over many concepts, but here's my Reader's Digest version.
The Way-Back Machine (Fletcher-Munson Equal Loudness Contours):
Many decades ago, research was done to try to better understand how we perceive loudness as a function of frequency. This was done by having a large cross section (and sample population) of people come in and listen to sounds at controlled (known) levels. Effectively, the procedure was to play a known reference tone that the subject could easily hear (as indicated by the subject raising a hand). Next, a series of tones at different frequencies was played, to find the level at which each tone was (according to the participant) equal in loudness to the reference tone just presented.
That is, a participant indicates that they can hear the tone being played. Then, after a brief period of silence, the next tone (different frequency) is played (with the researchers controlling the gain of the signal, and thus the energy imparted to the ears). If the participant does not think it is as loud as the previous tone, they indicate as much. Adjustments are then made to the gain of the second tone (keeping the gain of the reference tone constant), and the second tone is presented again, followed by the reference tone, to allow the participant to make a quick A / B comparison. When the perceived loudness of the 'other' tone is equal to that of the reference tone, the participant indicates this, and the level at which it had to be presented goes into the data pool.
Still with me?
This same process was then repeated many times over, with the gain of the reference tone adjusted to different known levels and the other tones presented in the same fashion as before.
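If you like to think in code, here is a minimal sketch of the kind of level-matching loop described above, written as a simple 1-up / 1-down staircase. The `subject_says_louder` callback is a hypothetical stand-in for the human listener; the real experiments used calibrated playback and more careful adaptive procedures.

```python
# A toy version of the level-matching loop described above, in the
# spirit of a 1-up / 1-down staircase. The subject_says_louder()
# callback is a hypothetical stand-in for the human listener.

def match_loudness(subject_says_louder, ref_level_db, start_level_db,
                   step_db=2.0, trials=40):
    """Nudge the test tone's gain until it hovers around the level
    the subject judges equal in loudness to the reference tone."""
    level = start_level_db
    for _ in range(trials):
        if subject_says_louder(level, ref_level_db):
            level -= step_db   # test tone judged louder: turn it down
        else:
            level += step_db   # test tone judged quieter: turn it up
    return level               # ends near the equal-loudness point
```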
After collecting all of these data, they were plotted against the loudness of the reference tone. What the researchers found was that as the reference tone grew softer and softer (quieter and quieter), and as you marched down the frequency scale, a given tone had to be presented at a much higher level to be perceived as being as loud as the reference tone than it did when the reference tone was significantly louder.
In other words...the frequency response of the Human ear depends upon the level (the intensity) of the sound presented to the ear. If this seems a bit confusing to you, that's understandable; the easiest way to resolve it is to look at a plot of the equal loudness contours that Fletcher and Munson created. There you can see that the frequency response of the Human ear tends to flatten out (become more uniform) as the intensity level of the sound increases.
This is probably where the non-linearity of the ear was first really documented, but even if not, the Fletcher-Munson Equal Loudness Contours have since become a sort of touchstone for scientists, engineers, and audiologists alike.
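As an aside for the code-inclined: the familiar A-weighting curve used in sound level meters was loosely derived from the 40-phon equal-loudness contour, so it makes a handy numerical stand-in for the ear's low-level frequency response. Here is a small sketch of the standard IEC 61672 formula (the +2.0 dB term pins the curve to 0 dB at 1 kHz):

```python
import numpy as np

def a_weighting_db(f):
    """Standard IEC 61672 A-weighting in dB: heavy attenuation at low
    frequencies, a slight boost around 2-4 kHz, 0 dB at 1 kHz."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * np.log10(ra) + 2.0  # +2.0 dB normalizes to 0 dB at 1 kHz

print(a_weighting_db([100.0, 1000.0, 4000.0]))  # approx. [-19.1, 0.0, +1.0]
```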
Some Time-Dependent Properties of Human Hearing - Integration Time:
OK, now what follows is a highly condensed version of another aspect of how we hear. In the bit about equal loudness contours above (i.e. the Fletcher-Munson curves), the study focused on tones being presented for a few seconds or so at a time - and at a constant level. However, the Human ear has an integrating mechanism; that is, it tends to average what we hear, depending upon how long we hear it.
Hmmm...so what you're saying is that how long I hear something affects how I perceive it?
Yes.
Let's take a simple example.
Suppose you hear a very short burst of white noise at a given intensity level. The white noise, as played back by whatever system we happen to be using (headphones, speakers, etc.), is known to produce 70 decibels at your ear canals. Now, if you are allowed to hear this sound for only a fraction of a second, you will not perceive it as being as loud as if you were allowed to hear that same white noise for something closer to a full second (or more).
Nevertheless, in this example, the white noise is always being presented at 70 dB, and yet, as the sound plays for longer and longer periods of time, you perceive the sound as being louder. What's interesting here is that (in this example), from a mathematical perspective, the actual sound pressure level is independent of how long the sound plays - but - that is not how we perceive the loudness of the sound. Rather, the loudness that we perceive is a function not only of the intensity level, and not only of frequency...but also...of how long the sound exists.
Interestingly enough, this also means that as we are subjected to sustained sounds (i.e. music, speech, environmental noise, etc.), what we perceive in terms of loudness lags (in time) with respect to the present. That is, as something changes in loudness, our assessment of the loudness will be based on what has just happened, and not what is happening right now. Again, this should kinda-sort-mostly-in-a-way make sense, because if the ear requires time to integrate sounds that are just being presented to it, it would also seem to follow that this integration time plays a role on the 'back side' of this perceptual porch.
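A common first-order way to model this kind of behavior in code is a 'leaky integrator' - an exponentially-weighted average of the signal's power. This is only a sketch, and the 100 ms time constant below is an illustrative value, not a measured property of the ear:

```python
import numpy as np

def perceived_level_db(x, fs, tau=0.1):
    """Exponentially-weighted average of signal power: a crude
    first-order stand-in for the ear's temporal integration.
    tau is the integration time constant in seconds (illustrative)."""
    alpha = 1.0 - np.exp(-1.0 / (fs * tau))
    p, out = 0.0, np.empty(len(x))
    for i, sample in enumerate(x):
        p += alpha * (sample * sample - p)  # power estimate lags the input
        out[i] = p
    return 10.0 * np.log10(np.maximum(out, 1e-12))

fs = 8000
def burst(ms):  # white noise burst of the given duration, then silence
    return np.concatenate([np.random.randn(int(fs * ms / 1000)),
                           np.zeros(fs // 2)])

# Same instantaneous level, different durations: the short burst never
# drives the integrator as high as the long one does.
print(perceived_level_db(burst(20), fs).max())   # noticeably below 0 dB
print(perceived_level_db(burst(500), fs).max())  # close to 0 dB
```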
Masking:
Most of us are familiar with one form of masking that we experience every day. There are two main types of masking - temporal (time) masking, and frequency masking. Let's start with temporal masking.
Suppose that you are working in your garage - a fairly reverberant space - on a summer's day. Things are relatively quiet; you can hear the birds singing off in the distance, but as you take a step, you knock over the push-broom that was up against the wall, and as it falls upon the hard concrete, it creates a very high-intensity (perhaps 120 dB) but very brief 'snap'. Because of this, for a fraction of a second after the broomhandle strikes the concrete, you will not be able to hear the quiet sounds (i.e. the birdsongs) until your ear recovers from this sudden and (compared to the birdsong) huge level event.
Additionally, in this example, the impact of the broomhandle on the concrete is, mathematically speaking, a bit like a delta function. In an ideal delta function (the Dirac delta in continuous time, or its DSP cousin, the Kronecker or unit-sample delta), the duration of the impulse is essentially zero, and when you take the frequency transform of it, you get a spectrum that is flat from "DC to daylight". This is never realized in physical systems, but the principle still applies, and the resulting spectrum approximates a flat one. That is, the impact of the broomhandle creates a sound whose spectrum comprises (nearly) all audible frequencies. As such, your ear reacts not to a tone, or a note on a piano (or a series or group of notes), but instead to a very brief but very loud burst of something that approximates white noise.
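You can verify the flat-spectrum claim numerically in a couple of lines:

```python
import numpy as np

# The discrete unit impulse (Kronecker delta) has a perfectly flat
# magnitude spectrum - equal energy in every frequency bin.
delta = np.zeros(1024)
delta[0] = 1.0
spectrum = np.abs(np.fft.rfft(delta))
print(spectrum.min(), spectrum.max())  # both 1.0: flat from 'DC to daylight'
```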
Temporal masking is one of the principles employed in compression codecs (such as mp3, ogg, or any other 'lossy' codec). That is, a mathematical algorithm based on these temporal attributes of our ears throws away content right after a comparatively loud event - the justification being that you would never have heard it in the first place, and thus there is no need to keep it. This 'loss' of content is where the adjective "lossy", as applied to codecs, comes from.
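Real codecs use elaborate psychoacoustic models, but purely as a toy illustration (and emphatically not any real codec's actual model), you can picture post-masking as a threshold that starts near the loud event's level and decays back down over something on the order of 100-200 ms:

```python
def post_masking_threshold_db(masker_db, t_ms, decay_ms=150.0, quiet_db=0.0):
    """Toy post-masking curve (NOT any real codec's psychoacoustic model):
    right after a loud event, the audibility threshold sits near the
    masker's level, then falls back to the threshold in quiet over
    roughly decay_ms milliseconds."""
    frac = min(t_ms / decay_ms, 1.0)
    return masker_db - (masker_db - quiet_db) * frac

# 30 ms after a 120 dB 'snap', anything below ~96 dB is assumed
# inaudible in this toy model, so an encoder could discard it:
print(post_masking_threshold_db(120.0, 30.0))  # -> 96.0
```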
Frequency masking is a different thing, but masking nonetheless. In frequency masking, if one tone is significantly higher in level than another tone that is close to it in frequency, you will not hear the second, quieter tone...or, if there is a broadband sound (think of the air coming from a hair dryer held close to your ear), this too can mask a great deal. There's more here too - the spacing of the two tones (i.e. how far apart they are in terms of frequency) plays a role, as do their respective (and absolute) levels.
While the example of the tones may not necessarily be intuitive, you probably have had an experience that's related to this. Think about the times when you have been trying to hear one thing, but another louder (or broader spectrum) sound prevents you from hearing it. That's masking. In many instances, the 'other' sound is measurable, but the presence of the louder sound renders the quieter sound inaudible to you.
For example, you're in your car, driving along at a steady speed on a smooth, level, and quiet road; the winds are light and variable. However, you decide that you need more airflow from the cabin cooling system, so you increase the fan speed. When you do, you note that there are one or possibly two high-pitched tones (related to the blade-passage frequency of the fan, and its harmonics)...but you are still a bit uncomfortable, so you turn the cooling fan up to a higher speed. With the temperature at a comfortable level, you now notice that the tone(s) are no longer audible. Instead, all you really hear is the random flow noise coming from the vents (the 'shhhhh' or 'whoooosh' type of sound). However, if you had a spectrum analyzer handy and took a long enough sample, and averaged it, you might very well see the tones standing proud above the random noise floor - they are measurable...but not audible.
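Here is a sketch of that 'measurable but not audible' point, using scipy to average a long spectrum the way an analyzer would. The 3.2 kHz tone frequency and its amplitude are made-up illustrative values:

```python
import numpy as np
from scipy.signal import welch

fs = 44100
t = np.arange(10 * fs) / fs                  # ten seconds of 'cabin' audio
noise = np.random.randn(t.size)              # broadband fan / flow noise
tone = 0.1 * np.sin(2 * np.pi * 3200 * t)    # quiet blade-passage tone
# (3.2 kHz and the 0.1 amplitude are made-up illustrative values)

# A long, averaged spectrum, like leaving the analyzer running a while:
f, pxx = welch(noise + tone, fs=fs, nperseg=8192)
tone_bin = np.argmin(np.abs(f - 3200))
print(10 * np.log10(pxx[tone_bin] / np.median(pxx)))  # tone stands ~10 dB proud
```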
Like temporal masking, this is another element of the perceptual codecs used to digitally compress files (and I'm not speaking of level-compression here...that's a related thing, but different). This is how a 160 kbps mp3 (for example) can end up being only about 10% the size of the uncompressed wav file from which it was derived - because mountains of the data in the original file are being thrown away (and thrown away forever), based on these mathematical models of the human ear.
As far as the information being "thrown away forever", what I mean is that even though you can convert an mp3 back to a wav file (for instance to burn an audio-format CD), the data that were thrown out during the mp3 encoding process are NOT put back into the wav file. Yes, the size of the .wav file is now as you would expect (governed by the sample rate, precision (16 or 24 bit), time, and number of channels), but the lost information cannot be recovered.
Just how severe these losses are is related to the bitrate: higher bitrates equate to less compression, and thus do not yield the file-size reduction that lower bitrates can achieve. Conversely, if you want the compressed file to be small, then you have to surrender fidelity.
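The ~10% figure from a couple of paragraphs back follows directly from the bitrates - here's the back-of-the-envelope version:

```python
# Back-of-the-envelope check of the ~10% figure for a 160 kbps mp3:
wav_bps = 44100 * 16 * 2   # CD audio: sample rate x bit depth x channels
mp3_bps = 160_000          # 160 kbps
print(mp3_bps / wav_bps)   # ~0.113, i.e. roughly 10% of the original size
```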
In contrast, truly lossless codecs (such as FLAC and others) are to music / speech files what things like PKZIP (remember that?) were to document encoding. No one would have accepted a compression algorithm that changed the content of a written document...I mean...can you imagine what would happen if, upon unzipping a compressed word document or text file, the words themselves, the punctuation, or the case of certain letters had been altered? FLAC (and other truly lossless codecs) is a lot like PKZIP in that sense, as it allows for on-the-fly reconstruction of exactly what was in the .wav file. However...the penalty? Yep, it doesn't reduce file size anywhere near what a 160 kbps mp3 would (in rough numbers, at 160 kbps you get about 90% reduction in file size, but with FLAC you only get about 45%).
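The PKZIP analogy is easy to demonstrate with any general-purpose lossless compressor; here, zlib from Python's standard library round-trips bytes exactly, just as FLAC round-trips audio samples:

```python
import zlib

original = b"The quick brown fox jumps over the lazy dog. " * 100
packed = zlib.compress(original)

# Lossless means bit-exact reconstruction - nothing is thrown away:
assert zlib.decompress(packed) == original
print(len(packed) / len(original))  # highly repetitive text packs very well
```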
In closing, let me say that I'm pretty sure that I glossed over a lot of particulars about the issues surrounding masking phenomena and such. However, I wanted to cover the main aspects of the (immensely complicated) subject in a way that hopefully made them a bit more tangible.