(Page 2 of 2 pages for this article < 1 2)
Monday, August 25, 2008
QuickTime Quickies
A couple of non-intuitive hacks for QuickTime audio
Picture Perfect
My clients often send me a QuickTime movie as reference picture. I build my mix against it, send audio back to them for their notes, make changes, send a revised version, and so on until everybody’s happy.
But I don’t send them movies. That’s a waste of time and bandwidth. Since the picture stays the same through each generation, why not just send the track? The client doesn’t have to be particularly technical (or run an NLE) to marry track and picture. They can do it all in the tiny QuickTime Pro app.
The basic moves are simple:
- Open the track I’ve sent them in the QuickTime Player.
- Choose Select All from the Edit menu (or use a keyboard shortcut). Then Edit:Copy. They can close the audio file now; they won’t need it any more.
- Open their original movie file, the one with pictures they sent me.
- Choose Select All on that file.
- Choose Edit:Add to Movie.
They now have a movie with two soundtracks - the original, and the one I just sent. Then, if they choose Window:Show Movie Properties, they get a list of all the streams in the movie… including both tracks as separate entries. Either one can be disabled with a checkbox, or completely deleted.
This only works if both the picture and the new track are the same length; otherwise, there can be sync problems. But it’s no big deal, since I know how long their original file was. And sync problems aren’t intractable. If all else fails, you can sync in QuickTime Pro using leader and a 2-pop. But it takes longer.
You can also stack multiple tracks in the same movie - different musical treatments, or versions of a mix - and select among them from the Movie Properties window.
Some of my clients send videotapes instead of files, but this process can still work. It just means I have to digitize their picture (in QuickTime Pro) and pass it along with the first mix. After that, we use the above steps.
FlexTime
There are some other hacks visible in that last screenshot.
Note that what the client sent me wasn’t a QuickTime Movie at all, but Windows Media… and I’m on a Mac. No problem. Just add the excellent Flip4Mac components, and QuickTime does Windows files.
And note that what I sent them was an mp3. QuickTime doesn’t encode that format, but it can read it. Since my audio software can render mixes directly into mp3 (as well as into full res files), I save a step by sending clients that version instead of transcoding to another format. Like I said, I’m a lazy soundie.
Friday, August 22, 2008
Free Book
I seem to be a premium…
Tuesday, August 19, 2008
Vacuum Packed
Compress audio files without losing quality? You can, if you measure them the right way.
Dawn of Delta
To understand how the better way works, think about what audio data actually means.
Here’s a simplified diagram of how a sound gets digitized. We’re looking at half a sine wave at about -3 dBFS. Five times during that half-wave, we measure its voltage (blue lines) on a scale between 0 and roughly 32,000. The results - numbers between 2601 and 20,400 - are shown in the drawing.
Why ~32,000? Because 16 bit sound allows roughly 65,000 possible values. Half of those are reserved for negative numbers (the other half of the wave). Why only about 20,000 as our maximum voltage? So other sounds can be louder, up to 0 dBFS.
Why only 5 measurements? To simplify the drawing. If this were our 1 kHz test file, there’d be 24 of them.
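The arithmetic behind those three answers is easy to check. This is only a sketch: the 48 kHz sample rate is my assumption (it is the rate that gives 24 samples per half-wave of a 1 kHz tone), and the variable names are made up for illustration.

```python
BITS = 16
SAMPLE_RATE_HZ = 48_000   # assumption: the rate that yields 24 samples per half-wave
TONE_HZ = 1_000

total_values = 2 ** BITS                 # every value a 16-bit sample can take
max_positive = total_values // 2 - 1     # largest positive sample; the rest are negative or zero

# Peak sample value of a sine wave whose peak sits at -3 dBFS
peak_minus_3_dbfs = round(max_positive * 10 ** (-3 / 20))

# Number of samples taken during one half-cycle of the test tone
samples_per_half_wave = SAMPLE_RATE_HZ // TONE_HZ // 2

print(total_values, max_positive, peak_minus_3_dbfs, samples_per_half_wave)
```

The computed peak comes out a little above the largest number in the drawing, which is expected: a sample doesn’t have to land exactly on the top of the wave.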
The highest numbers in our drawing (and their negative equivalents for the other half-wave) require 16 bits to store. If we could somehow make those numbers somewhat smaller, we could store them with fewer bits… saving file space.
It’s not difficult. Here’s the exact same wave. Only instead of noting the value of each sample by itself, we write down just the difference from the previous sample… mathematicians call it the delta.
Same wave, same numbers, just written differently. These smaller numbers will need fewer bits to store!
It’s like if you gave walking directions in two different ways:
Turn right at your front door, go two blocks, turn left, go four blocks, right again one block and you’re there.
or
Turn right at your front door, go two blocks, turn left and keep going until you’re six blocks from home, right again until you’re seven blocks from home...
Both directions will get you to the same place, but the first version is simpler.
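Here is a minimal sketch of the same idea in code. The half-wave below is generated from a formula, not copied from the drawing, and the function names are hypothetical. It writes each sample as a difference from the previous one, rebuilds the original from those differences, and compares how many bits the widest number needs in each form.

```python
import math

def delta_encode(samples):
    """Store each sample as its difference from the previous one (the first is measured from 0)."""
    deltas = []
    previous = 0
    for s in samples:
        deltas.append(s - previous)
        previous = s
    return deltas

def delta_decode(deltas):
    """Rebuild the samples by keeping a running total of the deltas."""
    samples = []
    total = 0
    for d in deltas:
        total += d
        samples.append(total)
    return samples

def bits_needed(value):
    """Bits for a signed integer of this size: magnitude bits plus one sign bit."""
    return abs(value).bit_length() + 1

# Illustrative half sine wave: 24 samples peaking near 23,000 (not the drawing's values)
wave = [round(23_000 * math.sin(math.pi * (n + 0.5) / 24)) for n in range(24)]
deltas = delta_encode(wave)

assert delta_decode(deltas) == wave        # same wave, nothing lost

raw_bits = max(bits_needed(s) for s in wave)
delta_bits = max(bits_needed(d) for d in deltas)
print(raw_bits, delta_bits)
```

For a smooth wave like this one, the deltas need fewer bits than the raw samples. A signal that jumps around more would narrow that gap.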
Back when desktop computers didn’t have enough power for psycho-acoustic algorithms, this is how audio data compression was done. You can still select it in most audio programs: QuickTime IMA, or Microsoft ADPCM. The delta measurements were arbitrarily limited to 4 bits instead of 16, for 1/4 the data.
Running our test files through IMA gives us the expected 75% reduction (plus a few bytes for overhead):
The only problem with this scheme was that it wasn’t necessarily lossless. Sometimes, samples are more than 4 bits apart. In the case of sudden loud or high frequency sounds, delta numbers would lag behind the proper sample value for a fraction of a second. This would create a soft, short burst of noise around the signal.
As soon as computers were able to handle the more efficient and better sounding masking algorithms, delta encoding was mostly abandoned.
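The lag is easy to reproduce with a toy coder. To be clear, this is not the real IMA or Microsoft ADPCM algorithm (those adapt their step size as they go). It is a simplified fixed-step version, with a step size I picked arbitrarily, meant only to show what happens when a jump is too big for a 4-bit delta.

```python
def clamp_delta_encode(samples, bits=4, step=512):
    """Toy fixed-step delta coder: each delta is squeezed into a signed `bits`-bit code."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -8..7 for 4 bits
    codes, decoded = [], []
    estimate = 0
    for s in samples:
        code = round((s - estimate) / step)
        code = max(lo, min(hi, code))       # a big jump won't fit, so it gets clamped
        estimate += code * step             # this is what the decoder will reconstruct
        codes.append(code)
        decoded.append(estimate)
    return codes, decoded

# A sudden loud jump followed by a steady level (illustrative values)
samples = [0, 0, 20_000, 20_000, 20_000, 20_000, 20_000, 20_000, 20_000]
codes, decoded = clamp_delta_encode(samples)
errors = [s - d for s, d in zip(samples, decoded)]

print(decoded)
print(errors)
```

The decoded values climb toward 20,000 a few thousand at a time instead of jumping there at once. The error during that climb is the short burst of noise described above.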
Delta is Ready...
But Moore’s Law still rules, and desktop computers keep getting more powerful.
Modern computers can look at a signal and predict the total delta between individual samples, no matter how big a jump. They’re fast enough to check the guesses, and go back and refine them until they’re accurate. They note the rules for that guesswork in the file, and voilà:
Reasonable shrinkage, with perfect recovery when you open the file. Here are the numbers for two implementations, Apple Lossless and FLAC (Free Lossless Audio Codec):
Scroll down for an easier to read version.
Signals that are easier to predict will shrink more. That’s why the sinewave loses about 90% (much smaller and more accurate than the original delta method).
But even more complex signals, like our voice/music mix, can shrink 50%. That’s with absolutely no signal loss. When you open the compressed file, it’s a perfect clone of the original.
And unlike most mp3, silences don’t waste much space at all (since the delta remains 0 during a silence). So if you’re sending stems, or individual tracks with pauses, you see even greater shrinkage.
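A rough sketch of the predict-and-correct idea follows. The predictor here is a simple fixed one that extends a straight line through the last two samples. Real encoders such as FLAC and Apple Lossless choose and tune their predictors per block and then pack the leftovers efficiently, so treat this as an illustration under my own assumptions (48 kHz sample rate, hypothetical function names), not as either format’s actual method.

```python
import math

def predict(history):
    """Guess the next sample by extending the line through the last two samples."""
    if len(history) >= 2:
        return 2 * history[-1] - history[-2]
    if len(history) == 1:
        return history[-1]
    return 0

def encode(samples):
    """Keep only the residual: the actual sample minus the predicted one."""
    return [s - predict(samples[:n]) for n, s in enumerate(samples)]

def decode(residuals):
    """Re-run the same predictor and add each residual back."""
    samples = []
    for r in residuals:
        samples.append(predict(samples) + r)
    return samples

# 1 kHz sine at an assumed 48 kHz sample rate, peaking near 23,000
sine = [round(23_000 * math.sin(2 * math.pi * 1_000 * n / 48_000)) for n in range(480)]
residuals = encode(sine)
assert decode(residuals) == sine                      # a perfect clone

largest_sample = max(abs(s) for s in sine)
largest_residual = max(abs(r) for r in residuals[2:])  # skip the two warm-up samples
print(largest_sample, largest_residual)

silence = [0] * 100
print(set(encode(silence)))                           # silence leaves nothing but zeros
```

The residuals for the sine wave are far smaller than the samples themselves, and silence produces only zeros. Both are cheap to store, and decoding gets every original sample back.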
Eye Candy
I’ve thrown some numbers around in this article. They may be easier to grasp as a chart:
The files compressed with IMA (delta 4:1) are nicely shrunk. But remember, this is a lossy compression… sudden jumps in the waveform get noisy. The mixed track doesn’t shrink quite as much under Apple Lossless or FLAC, but it does end up considerably smaller than the Zipped version. And the process is as transparent - or lossless - as Zipping a Word doc before you email it.
FLAC is actually capable of greater compression, because its guesses can be fine-tuned. Here’s a typical FLAC control panel:
But Apple Lossless (and an equivalent setting in Windows Media) are a lot easier to use, and give almost as good results. So next time you have to get small, take a trip to the Delta.
Next Time: A couple of unintuitive shortcuts that can speed up sending audio-for-video in QuickTime.
Saturday, August 16, 2008
Living with (Data) Loss
mp3 and its cousins are a fact of life… here’s how to get the most out of them
Magic Part II… The Takeaway
If you couldn’t hear it in the first place, is it really missing?
There’s a song about stars at night being big and bright (clap four times) in the heart of Texas. But while that state is certainly nice, we know those same stars also live over Ohio and New York. If you’ve got a good imagination, you can visualize being in Times Square, looking up, and seeing the same starry sky.
But while you can imagine it, you’ll probably never see it. Having all those bright signs around makes your eyes less sensitive to starlight, and the atmosphere over Times Square further confuses things by reflecting ‘earthlight’ back at you.
As you learned in my last blog entry, human ears get similarly desensitized when they hear other sounds at nearby frequencies (the nerves can’t handle the data). Add that to the natural ‘blurring’ of even the best playback systems and acoustics (equivalent to those atmospheric reflections), and it’s no wonder we’re sometimes blind to certain details in a recording.
I’ll assume you read that blog entry, or already knew how spectral and temporal masking work. If not (nyah, nyah): you’ll just have to trust me.
Framed!
When you run a signal through an mp3 or similar encoder, the algorithm first breaks the audio into frames, lasting up to a few milliseconds each.
These frames have nothing to do with video frames in the same file. Their length is determined primarily by the data rate - or amount of compression - you’ve chosen. Lower rate files, with more compression, use longer frames.
Each frame is boosted so its loudest wave reaches 0 dBFS. This is to take advantage of every bit during processing. The amount of boost is noted with the frames, so they can be restored to their original volume on playback.
(That’s why normalizing or boosting a raw audio file doesn’t make compression any more efficient. A louder file might be easier for users to hear after it’s been decoded, but that’s a different issue.)
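The per-frame boost can be sketched like this. It follows the description above and is not taken from any encoder’s actual code; the function names and sample values are made up.

```python
FULL_SCALE = 32_767  # largest positive 16-bit sample

def normalize_frame(frame):
    """Boost a frame so its loudest sample reaches full scale; return the gain used."""
    peak = max(abs(s) for s in frame)
    if peak == 0:
        return list(frame), 1.0          # silent frame: nothing to boost
    gain = FULL_SCALE / peak
    return [s * gain for s in frame], gain

def restore_frame(boosted, gain):
    """Undo the boost on playback, using the gain that was noted with the frame."""
    return [round(s / gain) for s in boosted]

quiet_frame = [120, -340, 800, -1_000, 650, -90]   # illustrative samples
boosted, gain = normalize_frame(quiet_frame)
print(gain, restore_frame(boosted, gain))

# Boosting the raw file first changes the noted gain, not the boosted frame
louder_frame = [2 * s for s in quiet_frame]
boosted_louder, gain_louder = normalize_frame(louder_frame)
print(gain_louder)
```

Since both versions of the frame end up at the same level inside the encoder, the pre-boosted file gives it nothing extra to work with.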
The algorithm looks at each frame and measures how much energy the frame has at different frequencies. The number of frequency bands is a trade-off: more bands allow tighter masking, but require sharper filters that respond more slowly. The mp3 format uses up to 512 bands; other compression systems use more or fewer.
- If a particular band is silent during the frame, the process notes it and doesn’t waste any more data there.
- If a band is loud, it reduces the number of bits. The loud signal will mask noises at the same frequency.
- If a band is soft, it’s processed with more bits, unless there’s a masking sound in an adjacent band. Then it assumes the band won’t be heard, and deletes it entirely.
- The resulting audio is run through a data packer similar to WinZip or Stuffit. Normal audio is too complex to compress well in these systems, but they do a good job with the simpler data-reduced frames.
The common mp3 algorithm uses this scheme. How good it sounds depends on how well the encoder has been written, and on the bitrate chosen. The newer AAC algorithm couples it with a quick look at adjacent frames to see if temporal masking will hide even more details. For a given bitrate, a good AAC will sound better than a good mp3.
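The band-by-band rules in the list above can be sketched as a toy bit allocator. Every threshold and bit count here is a number I invented for illustration. Real encoders use a detailed psychoacoustic model, not three fixed cutoffs.

```python
SILENCE_DB = -90.0      # hypothetical: at or below this, the band counts as silent
LOUD_DB = -20.0         # hypothetical: at or above this, the band counts as loud
MASK_MARGIN_DB = 30.0   # hypothetical: a neighbor this much louder masks the band

def allocate_bits(band_levels_db, few_bits=4, many_bits=12):
    """Return a bit count per band, following the rules in the list above."""
    allocation = []
    for i, level in enumerate(band_levels_db):
        # The bands immediately below and above this one, where they exist
        neighbors = band_levels_db[max(0, i - 1):i] + band_levels_db[i + 1:i + 2]
        if level <= SILENCE_DB:
            bits = 0                      # silent band: spend nothing
        elif level >= LOUD_DB:
            bits = few_bits               # loud band masks its own coarse-coding noise
        elif any(n - level >= MASK_MARGIN_DB for n in neighbors):
            bits = 0                      # soft band hidden by a loud neighbor: drop it
        else:
            bits = many_bits              # soft but audible: code it carefully
        allocation.append(bits)
    return allocation

levels = [-96.0, -45.0, -10.0, -50.0, -60.0, -96.0]   # illustrative frame, dB per band
allocation = allocate_bits(levels)
print(allocation)
```

In this made-up frame, only the loud band and one unmasked soft band get any bits. The lossless packing step would then work on that much smaller pile of data.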
Your First Choices
The most critical setting in a lossy compression scheme, including mp3 encoders, is the bitrate. Lower bitrates mean longer frames, increasing the chance that masking sounds won’t last the whole time. The result is noise and a flangey or chirping effect.
Which bitrate you consider low, and how much noise or distortion is acceptable, depends on the application. But if you do things right, broadcast-quality sound can be achieved at 128 kbps. One of the most important factors is which encoder you use. Even in a standardized format like mp3, there are multiple trade-offs that program designers have to make.
Commercial encoders are usually better designed in this respect than freeware. Because they’ve also paid licensing fees to the Fraunhofer Institut - inventors of the mp3 format - commercial publishers may have had more access to inner workings of the system. But at least one free encoder, the open-source LAME library, is also very good.
It makes sense to use a high-quality encoder. Other things will help as well:
- If you have to encode at a low bitrate, get rid of high frequencies first. Apply a low-pass filter at 8 kHz to 12 kHz (or use a good sample-rate converter to lower the rate to 22 kHz, which filters sounds above 10 kHz). The moderate dullness this imparts will be less objectionable than low bitrate noises.
- Don’t try to help the high-frequency filtering by boosting just below the Nyquist Limit, even though many encoders or sample rate converters give you this choice with a “Preserve Highs” option. It wastes precious bits on unimportant sounds, and can increase the chance of flanging or chirping.
- Don’t use extreme broadcast-style level compression, particularly multiband compression. This makes it harder for the algorithm to tell the difference between important sounds and those that can be lost.
- Speech is harder to encode than music because it changes faster. The most common distortion at low bitrates is a reverberation-like noise tail on the words. It can be lessened by lowering the number of bands in the encoder, which lets the internal filters respond faster. (Most encoders don’t let you control the number of bands, but many let you select a “speech” optimization. It does the same thing.)
- Higher background noise levels also increase problems with encoding. Start with the cleanest possible recording.
- The above note does not mean you can use a noisy recording if you run it through most Noise Reduction plug-ins first. The two algorithms fight each other.
While we’re at it, the encoder might give you some other choices as well:
Stereo or joint stereo: Most algorithms expect the left and right channels of a stereo pair to be similar. This is usually true in music. A joint stereo mode encodes only major differences between the channels, particularly at high frequencies, freeing up more of the bitrate for better quality. But ambiences and crowd sounds can be very different on the left and right, if the space isn’t reverberant and there are lots of spread-out sources. With these sounds, “joint stereo” pushes things toward the center.
Variable bitrate: This option, also known as VBR, can both reduce file size and improve the sound. The algorithm uses different bitrates for each frame, depending on how many are needed. This avoids wasting bits on pauses or easy-to-encode passages.
VBR works best on simpler or slower-moving sources, including a lot of new age or classical music. It presents little advantage on faster and highly processed sounds, such as most pop styles, because the maximum bitrate must be used for most frames.
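A quick size calculation shows why. The clips below are entirely made up: each is a list of (seconds, kbps) segments, and the only real fact in use is that 1 kbps means 1,000 bits per second.

```python
def size_bytes(segments):
    """Total size for a list of (seconds, kbps) segments; 1 kbps = 1,000 bits per second."""
    return sum(seconds * kbps * 1_000 / 8 for seconds, kbps in segments)

# Illustrative 60-second clip with pauses and gentle passages
cbr = [(60, 160)]                          # constant 160 kbps throughout
vbr = [(20, 32), (25, 96), (15, 160)]      # pauses, easy passages, busy passages

# A dense pop mix that needs the top rate almost all the time
vbr_pop = [(2, 96), (58, 160)]

print(size_bytes(cbr), size_bytes(vbr), size_bytes(vbr_pop))
```

With these numbers the gentle clip shrinks by more than 40 percent under VBR, while the dense mix saves barely 1 percent.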
Lossy: The Next Generation
When you convert a compressed file back to 16-bit linear audio, something will be missing. If you encode it again, the algorithm has a harder time finding details that can be safely deleted. Noise and distortion build up with each subsequent pass.
If you must go through multiple encodings, stay with the highest bitrates possible. If the final release format will be at a low bitrate, don’t apply it until the last step.
There is some evidence that multiple generations through the same compressor sound worse than the same number of generations through a variety of algorithms.
What have they done to my song?
Want to hear exactly what the mp3 algorithm takes away from voice or music, when you do it properly? No hype, no simulation… but a scientific experiment you can replicate on your desktop. It’s at my website.
Next time: how lossless encoders shrink files without sacrificing any data.
Thursday, August 14, 2008
Hearing What’s Not There
Sometimes, making data disappear can be acceptable
Magic Part I… The Mask
How your brain gets around a neural traffic jam
Our ears are not particularly precise sensors.
It’s not for lack of trying: each ear has close to 30,000 nerves on the basilar membrane, tuned to respond to different pitches. But those membranes aren’t like giant organ keyboards, with specific nerves for every tone we ever hear. That would be too much data for the brain to process efficiently.
Instead, when we hear a tone at a particular frequency, a group of nerves centered around that pitch fire. How many nerves go off depends on the volume and other factors. A loud sound triggers more nerves. The brain interprets these groups as a specific pitch and volume.
The nerves aren’t spread out linearly. They’re more concentrated at frequencies where sounds tend to be important, and sparse at the extremes of the band.
In other words, the first audio data compression systems were human. They evolved in our inner ears and auditory cortex.
This has been known for years, and has been measured across very large populations. It’s generally called the Threshold of Hearing. Pitches above the threshold get heard. Those below it don’t.
You can express it with a graph:
Low frequencies are on the left, mids in the middle, highs on the right. (Those calibrations are logarithmic because that’s how we hear pitch.) The vertical decibels are calibrated relative to the frequency where most people’s ears are the most sensitive, around 3.5 kHz. You could consider 0 dB on this chart to be true 0 dB SPL - the nominal threshold of hearing - or any other normal listening level.
The important thing isn’t the calibrations; it’s what happens at the heavy brown line. That’s how the threshold varies with frequency, in most people. (The line is pretty accurate, given my drawing abilities; there are more rigorous ones elsewhere on the Web.)
At 3.5 kHz, the short, green bar at 15 dB is louder than the threshold. It gets heard. But the red bars at 50 Hz and 15 kHz are ignored, even though they’re louder. In fact, most people can’t detect a very high or very low pitch until it gets some 40 dB louder than one they could comfortably hear in the mid-range!
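If you’d rather check this with numbers than with my drawing, there is a widely used approximation of the threshold curve, usually credited to Terhardt. The sketch below uses that formula, not the brown line in my chart, and the 30 dB level for the two red-bar frequencies is an illustrative guess on my part.

```python
import math

def threshold_db_spl(freq_hz):
    """Approximate threshold of hearing in quiet (Terhardt's formula), in dB SPL."""
    k = freq_hz / 1000.0
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)

def is_audible(freq_hz, level_db_spl):
    """A tone gets heard only if it is louder than the threshold at its frequency."""
    return level_db_spl > threshold_db_spl(freq_hz)

# (frequency in Hz, level in dB SPL): two 30 dB tones at the extremes, one 15 dB tone in the middle
for freq, level in [(50, 30), (3_500, 15), (15_000, 30)]:
    print(freq, level, round(threshold_db_spl(freq), 1), is_audible(freq, level))
```

By this formula the softer 3.5 kHz tone is audible while the louder tones at 50 Hz and 15 kHz are not, and the threshold at the extremes sits more than 40 dB above the one in the mid-range.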
There seems to be good evolutionary reason for this. While roaring predators are louder than human speech, the most important parts of intelligibility are around 3.5 kHz. That’s where it would be most advantageous to understand your neighbor’s shouts, even if there’s a tiger nearby. (More about these frequencies in an earlier blog entry.)
The darned line keeps moving
How many nerves get involved for a particular tone depends on volume, and is constantly being adjusted by our ears. That’s necessary. A nearby jet plane hits your ear with about 10,000,000,000 times more pressure than the quietest tones used in a hearing test. But it means there’s even more data compression going on in your head.
All this efficiency comes with a sacrifice. Because louder sounds use larger groups of nerves, and the threshold is constantly being adjusted, softer sounds at nearby frequencies can’t get through at the same time. Neural pathways that would normally respond are already busy.
The effect can be thought of like this:
When something loud enough comes along (blue bar, about 40 dB at 2 kHz), it drags the threshold with it. The green bar from our previous drawing - and a slightly louder one I added at 1 kHz - don’t get heard, even though they’re above the normal threshold.
The actual amount of masking varies with the frequency, volume, and overall timbres of the sounds, but it’s always there. It gets broader at the extremes of the bands, where nerve bundles are more spread out. A 250-Hz sound, 25 dB above the threshold, ties up so much neural activity that a simultaneous 200-Hz sound that’s 10 dB softer actually disappears.
After-images (and pre-images) in your ear
One of the magic tricks described in the Nature Reviews article is The Great Tomsoni’s Colored Dress Change. His assistant appears in a white dress, which he says he’ll turn red. Her white spotlight goes out and a red one comes on. He makes a joke, the audience laughs, and he tells the booth to change the light back. When the spot turns white again, her dress is made of red fabric!
I write audio tutorials, so you’ll have to read the article to see how he does it. But I’ll give you an audio-based hint. Nerves are chemical, and chemicals have to recover after they’ve been fired. This results in a time-based masking as well.
In this drawing, frequency doesn’t matter. A long loud tone is sounded (blue bar, lasting 180 ms or about 6 frames), and it drags the threshold up to match. But look at what happens in the 50 ms or so after the tone: nerves are still recovering, so the threshold stays up. In fact, the brain even forgets nearby pitches that happened 20 ms or so before the tone, because its pathways get overwhelmed!
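The timing can be sketched as a simple window test, using the rough 20 ms and 50 ms figures above. This is an all-or-nothing simplification: in a real ear the threshold rises and falls gradually and depends on how loud both sounds are.

```python
PRE_MASK_MS = 20    # from the text: about 20 ms before the tone
POST_MASK_MS = 50   # from the text: about 50 ms after the tone

def is_temporally_masked(event_ms, tone_start_ms, tone_end_ms):
    """True if a soft sound at event_ms falls inside the raised-threshold window."""
    return tone_start_ms - PRE_MASK_MS <= event_ms <= tone_end_ms + POST_MASK_MS

tone_start, tone_end = 100, 280   # a 180 ms loud tone (the start time is arbitrary)
for t in (70, 85, 200, 320, 340):
    print(t, is_temporally_masked(t, tone_start, tone_end))
```

Soft sounds at 85, 200, and 320 ms fall inside the window and get masked. The ones at 70 and 340 ms fall outside it.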
What it all means
These two effects - loudness and temporal masking - are the basis behind perceptual encoders like mp3, AAC, and Dolby Digital. Our hearing mechanisms can’t hear certain sounds, so bits in a compressed audio file don’t get wasted on them. They’re also the basis behind most noise reduction algorithms, but we’ll save that for a future series.
Of course, you know there’s a lot of bad perceptual encoding going on. Sounds get thrown away that the brain should be hearing, and we miss them. And bad choices during the encoding can add artifacts that make things even worse. But it’s not the encoding’s fault… it’s the user’s.
Next article: What these compression algorithms actually do, and how to make them do it more efficiently. It’s usually not what’s on your encoder’s menus. Here’s a link.
Technical note:
This masking effect was researched with multiple tones sounding together, at normal and moderately high listening levels. A different set of curves, Fletcher-Munson, was taken with single tones over a much wider range of volumes. It’s similar, but suggests that relative low-frequency sensitivity increases as sounds get much louder, and the difference disappears by 140 dB SPL (threshold of pain). The high end loss remains fairly consistent at any volume.
Fletcher-Munson is the basis behind ‘loudness compensation’ switches on some hifi amps (along with some dubious assumptions about recording levels and speaker efficiency). It’s also the very real reason why movies are mixed on dub stages that are calibrated to theatrical levels.
Both phenomena can be reconciled. But unless you regularly listen to extremely wideband sounds at painful levels, it’s not important here. If you do listen that way - considerably above OSHA recommendations for even short bursts - you’ll permanently damage your ears very quickly. Then you probably won’t hear anything at all.
Friday, August 01, 2008
Deep Throat, Cetacean
What whales consider sexy… and what’s really going on in the audio band.
Monday, July 28, 2008
Sour Notes
The music revolution will not be televised.
Sunday, July 27, 2008
Required (Re)reading
A short essay can turn you into a better filmmaker.
Tuesday, July 22, 2008
Rolling Your Own
A free utility lets you assemble audio tools in an instant. It’s also fun to play with.
Sunday, July 13, 2008
This month, Peas
A famous, funny outtakes tape is worth another listen.
Wednesday, July 09, 2008
Wrong, wrong!
Tuesday, July 08, 2008
Make Her Sound Like I Love Her
Sexual attractiveness may be partly a question of ear candy.
Monday, July 07, 2008
Time Out of Joint
Fixing lipsync for humans… and others