From time to time, I get a call from a customer who recorded audio for picture
on a concert or reality show, using MetaCorder with prosumer audio gear, and who
wonders why the audio drifts when post-synced to picture.
This article is an attempt to explain why recording audio for picture requires a
high level of accuracy, often more than is available from prosumer audio gear,
and ways to achieve it.
One of the challenges when designing any location multi-track rig is
balancing ergonomics, stability, audio quality and timebase accuracy. The nature
of reality television production, with its long, unscripted recording, requires
a special emphasis be placed on timing accuracy. This is not just because
these long takes need to be post-synchronized to picture, but also because of
the nature of the Broadcast Wave file itself and how it stores and represents
time code as an abstract calculation.
Digital audio is recorded by sampling the level of the analog audio waveform
at regular intervals. The rate at which the audio is sampled is of course known
as the "sampling rate" and the standard for film and video production is
generally 48 kHz. If we do a little math, we can calculate that with a sampling
rate of 48 kHz, the audio must be sampled 48 times every thousandth of a second
(ms). The result of inaccuracies in the sample rate clock is audio drift. When
audio alone is recorded, minor drifts are virtually undetectable. When audio
must be synchronized against picture, however, even minor drifts can create
chaos in the editing room, with the result being that for any given shot there
will be more or less audio, depending on the nature of the drift.
Compounding this problem is the way in which time code is stored in the
Broadcast Wave file: unlike "analog" linear time code, which is recorded
continuously on its own track, BWF files have a "time stamp" at the beginning of
the file, which is the number of samples past midnight (00:00:00:00). When
the file is played back on a digital workstation, time code is calculated by
adding the number of samples contained in the time stamp to the number of
samples that have been played back in the file at that given point. If the audio
was recorded with a drifting timebase, then the time code will always be
drifting relative to the camera time code, and this drift will increase the
farther into the audio file the workstation plays back. This will be true
regardless of the accuracy of the time code clock - the time code is represented
accurately only once, when the record button is pressed. The longer the take
(and therefore the more drift), the more inaccurate the time code will be when
the file is played back.
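To make the calculation concrete, here is a minimal Python sketch of how a workstation might derive time code from the BWF time stamp. The function name `timecode_at`, the 30 fps non-drop frame rate and the 10:00:00:00 start time are my own assumptions for illustration, not how any particular workstation is implemented.

```python
SAMPLE_RATE = 48_000  # nominal rate the workstation assumes on playback
FPS = 30              # assumption: 30 fps non-drop, for simple arithmetic


def timecode_at(time_reference, samples_played,
                sample_rate=SAMPLE_RATE, fps=FPS):
    """Return HH:MM:SS:FF for a playback position in a BWF file.

    time_reference is the file's time stamp (samples past midnight);
    samples_played is how far into the file playback has progressed.
    """
    total_samples = time_reference + samples_played
    total_frames = total_samples * fps // sample_rate
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours = (total_seconds // 3600) % 24
    minutes = (total_seconds // 60) % 60
    seconds = total_seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"


# Hypothetical take that starts at 10:00:00:00 and runs one real hour.
start = 10 * 3600 * SAMPLE_RATE
nominal = 3600 * SAMPLE_RATE                  # samples from a perfect clock
fast = nominal + nominal * 50 // 1_000_000    # clock running 50 ppm fast

print(timecode_at(start, nominal))  # 11:00:00:00
print(timecode_at(start, fast))     # 11:00:00:05
```

The recorder with the fast clock wrote extra samples during the same real hour, so the calculated time code at the end of the file reads five frames later than the camera's, even though the time stamp at the head of the file was correct.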
When you multiply the amount of audio and video recorded by the number of
cameras and number of tracks being recorded in a given reality production, even
small drifts become enormously costly (at least in terms of time) to constantly
correct. Of course, other things go wrong during production that can cause time
code issues as well, and in my opinion that makes it all the more imperative to
start with a completely stable timebase for the audio recording.
Let's look at how drift plays out with a high end prosumer mixer/FireWire
audio interface, such as a
Mackie Onyx 1640 or
PreSonus StudioLive. Both of these products are excellent value, combining
good sound quality and ease of use with low price. However, they lack one
important feature: the ability to sync to an external clock such as wordclock.
The end user is therefore required to use the device's internal sample rate
clock. The component that drives the sample rate in any digital audio interface
is a crystal controlled oscillator, and its accuracy is measured in parts per
million, or ppm. Every manufacturer has to make compromises to get their product
to market, and since these interfaces were designed mainly for music recording,
the crystal typically chosen is specced with an accuracy of 50 ppm, which is
good for even high end music recording. But when using such a product to
record sound for picture, that number tells a very different story:

An oscillator with an accuracy of 50 ppm translates to a timebase drift of
about 0.05 ms, or 2.4 samples, for every second recorded. Remember, MetaCorder
rigs are often left recording for a few hours at a time, but for the sake of
simplicity, let's say that the rig is making a one hour recording. Multiplied
out, 2.4 samples per second becomes 8,640 samples per hour. Since there are
about 1,600 samples per video frame (at roughly 30 frames per second), this
equates to 5.4 frames per hour of drift, and that's both audio and time code
drift. Of course, some individual units may be more accurate than 50 ppm (the
spec indicates the maximum oscillator drift), but without the ability to sync to
an external source, the Mackie and PreSonus mixers tie the customer's hands.
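The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `drift` is mine, and it assumes a 48 kHz sample rate and a frame rate of exactly 30 fps (1,600 samples per frame), the same rounding used in the text.

```python
def drift(ppm, seconds, sample_rate=48_000, fps=30):
    """Worst-case drift for a clock with the given ppm accuracy.

    Returns (samples, milliseconds, frames) accumulated over `seconds`.
    """
    samples = sample_rate * ppm * seconds / 1_000_000
    milliseconds = samples / sample_rate * 1000
    frames = samples / (sample_rate / fps)  # 1,600 samples per frame at 30 fps
    return samples, milliseconds, frames


samples, ms, frames = drift(ppm=50, seconds=3600)
print(f"{samples:.0f} samples, {ms:.0f} ms, {frames:.1f} frames per hour")
# 8640 samples, 180 ms, 5.4 frames per hour
```

Changing `seconds` to match a real take length shows how quickly the error grows: a three hour recording at the same 50 ppm would be off by about 16 frames.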
There are a few ways to ensure your audio recordings will be accurate enough
for use with picture:
- Use a master wordclock generator with high accuracy and low jitter. Two
examples are the
Rosendahl Nanosync HD and the
Brainstorm Electronics DCD-8. Both devices also have the added benefit of
being able to sync from not just external word clock but external video
sources as well. The Nanosync can natively generate video sync signals and
timecode (ensuring perfect phase accuracy between video, timecode and word
clock), while the DCD-8 can optionally generate video sync. These features are
ideal for multicamera video shoots. The Rosendahl Nanosync HD specifies a crystal
accurate to 0.5 parts per million, which translates to a drift of 0.054 frames
per hour, or roughly 1 frame in 18 hours. That is 100 times more accurate than
the typical mixer with a built-in FireWire interface. The Rosendahl or Brainstorm
would then supply wordclock to the audio interface. Just remember to set the
interface to external sync!
- Along the lines of option #1, you can use a professional audio recorder
designed for film and television production to supply wordclock. The
Sound Devices 788T, for example, is specified with a crystal capable of
being tuned to 0.2 ppm.
- Use an audio interface designed with film and television applications in
mind. The
Metric Halo 2882 and
ULN-2 2d interfaces, for example, are specified with an accuracy of 5 ppm
(and are often more accurate in practice). The
RME Fireface 800 with the TCO (video and Time Code) option is another
example, and is the only interface that can natively generate sampling rates
of 47952 Hz and 48048 Hz, which are useful in some film workflows.
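To compare these options on the same footing, it can help to turn each ppm figure into the recording time it takes to slip one full frame. The sketch below uses the accuracy figures quoted in this article and assumes 30 fps; the function name `hours_until_one_frame` and the labels in the table are mine, and the numbers are worst-case spec values rather than measurements.

```python
def hours_until_one_frame(ppm, fps=30):
    """Hours of recording before worst-case drift reaches one video frame."""
    frame_seconds = 1 / fps
    seconds = frame_seconds / (ppm / 1_000_000)
    return seconds / 3600


# Accuracy figures as quoted in the article (specs, not measurements).
clocks = {
    "Typical prosumer mixer/interface": 50,
    "Metric Halo 2882 / ULN-2": 5,
    "Rosendahl Nanosync HD": 0.5,
    "Sound Devices 788T (tuned)": 0.2,
}

for name, ppm in clocks.items():
    print(f"{name:34s} {ppm:5} ppm  {hours_until_one_frame(ppm):6.2f} h per frame")
```

At 50 ppm a full frame of drift can accumulate in about 11 minutes, while at 0.5 ppm it takes roughly 18.5 hours, which is longer than most shooting days.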
Remember, no matter what your workflow is, the most important element is to
test it. For some productions, any audio drift can be dealt with by simply
varispeeding the audio to match the picture. Other productions may find that
solution intolerable and require frame-accurate audio and timecode.