The 2013 Hollywood Post Alliance Tech Retreat started today with a presentation from Charles Poynton on “High[er] Frame Rates”. Charles couldn't attend in person, as his Canadian passport had expired unnoticed, so we enjoyed him via a live webinar instead (which had all the “cone of silence” problems most webinars do!).
Charles, an imaging scientist, discussed perceptual and display-technology issues that need to be considered as a basis for understanding high frame rate (HFR) presentation. As is usually the case with my Tech Retreat coverage, what follows is a rough transcription / condensation of his remarks.
High frame rates? It's Peter Jackson's fault; “he had the balls to release a film at 48fps”.
Figuring out and measuring motion perception is a tricky thing; harder than figuring out spatial resolution. Multiple pathways in the brain decode motion; “it's super-complicated”. It's so complicated we shouldn't really rush into HFR imaging before we understand it better.
A slit-scan image, shown as an example of an odd motion artifact
“Is perception discrete or continuous?” This discussion has been ongoing for at least a decade. We think it takes 20-50 msec to process visual data, but we don't know if this is “frame by frame” or continuous. We know 48 fps is different, but we're not sure exactly why. It may be a little like MPEG: the only way to seriously evaluate images is to look at 'em–we can't really do it algorithmically. “And that may be all we get.”
How do we see things in motion? A huge amount of the performance of HFR video is due to eye tracking (Charles discussed eye-tracking research and drew diagrams; see figure).
One of the papers Charles discussed, with “telestrator” sketches atop it.
Charles laid out the Science / Craft / Art division of color grading, then suggested replacing the “science” controls on a display with a 4-position switch for temporal rendering: black-frame insertion, interlace, etc., as on the Sony BVM-series LCDs.
Monitor controls are science, the proc amp is craft, and the grading console is art.
The “science” controls for motion: interlace settings, dark-frame insertion, etc.
But these controls affect the picture the “craft” and “art” folks are watching, even though the picture being fed to the display is unchanged. “Can we be surprised that the production material looks different from the material in distribution, if we don't know how those display controls are set?” Today, the reference display is missing that standardized temporal rendering: we've finally standardized gamma, but not time rendering. Ideally we'd include metadata on temporal processing during mastering and the final display could match it, but until then, there's a danger of temporal mis-rendering. Cinema production, post, and display are all done at a professional, controlled level; HD may or may not be as well controlled (a good example of a pro HD display chain is Mark's Met Opera broadcasts, shown in theaters). The consumer display is entirely uncontrolled.
So, The Hobbit: shot at 48fps, shown at 48fps. How will this translate to Blu-ray? The Met Opera 'casts run at 60 images/sec and are shown the same way.
Sampling theory was reviewed in 1D, then extended to 2D (as on an image plane). You can't really sample the world with square pixels; you need a weighted (Gaussian) sampling function. Digitization = sampling + quantization. A camera will typically use 12-14 bits; getting it down to 10 bits is complicated.
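To make the bit-depth point concrete, here's a minimal Python sketch (my own illustration, not Charles's) of requantizing hypothetical 14-bit linear sensor samples to 10 bits; the naive truncation is exactly the kind of shortcut that makes “getting it down to 10 bits” complicated in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 14-bit linear samples from one scanline of a camera sensor.
sensor_14bit = rng.integers(0, 2**14, size=1920)

# Naive requantization to 10 bits: just drop the 4 least-significant bits.
# Real pipelines add gamma/log encoding and dithering first, which is part of
# why getting down to 10 bits is complicated.
naive_10bit = sensor_14bit >> 4

# Slightly better: add random dither before truncating, trading banding for noise.
dithered_10bit = np.clip((sensor_14bit + rng.integers(0, 16, size=1920)) >> 4, 0, 1023)
```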
As Charles said: it's complicated!
Flash rate: you need ~48 Hz in the cinema, ~60 Hz in the living room, ~85 Hz in the office to fuse flicker. It depends on the duty cycle of light emission (Joyce Farrell): with a short duty cycle, flicker is more noticeable; at 50% it's less so; at 100% it's not noticeable at all. CCFL backlights don't flicker [not necessarily so, and PWM-controlled LED backlights may also flicker; it depends on driver frequency in both cases], so LCDs don't flicker.
It's better to use N bits to grayscale-modulate a pixel than to use the same number of bits to control N binary subpixels. Alvy Ray Smith: “A Pixel is not a Square! A Pixel is not a Square! A Pixel is not a Square!” A pixel is a point, not an area (2D sampling); how does this translate to 3D (X, Y, time)? “Smooth pursuit” eye movement, as the eye follows moving subjects of interest, causes the simple extension of 2D sampling theory and reconstruction to temporal sampling to fail. The fovea is about 1 degree wide; that's where fine detail is seen. Outside that area, detail falls off, but the eye can move to another area to see detail.
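A quick back-of-the-envelope illustration (mine, not Charles's) of the grayscale-vs-subpixel claim: N bits modulating one pixel yield 2^N distinct levels, while the same N bits switching N binary subpixels on or off yield only N+1 distinct total-light levels once the eye averages them:

```python
N = 8
grayscale_levels = 2**N        # N bits modulating one pixel: 256 distinct brightness levels
subpixel_levels  = N + 1       # N on/off subpixels averaged by the eye: only 9 levels (0..N lit)
print(grayscale_levels, subpixel_levels)
```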
An experiment involved eye-tracking a subject reading text on a 24×80 character screen. A degree away from the gaze point, software changed all the characters to Xes, and the subject couldn't detect the change! But of course we can't fit all our image observers with eye-trackers, so we need to maintain detail through the scene.
Saccades (rapid eye jumps from fixation to fixation) at 4-8 Hz. Dwell on gaze points around 100msec. Microsaccades / tremor (beyond the scope of today's talk). The key for motion imaging is that light hitting the fovea is integrated along the smooth pursuit track.
Question: A bird flies past a tree. If in frame 1 it's entirely to one side of the tree, and in frame 2 it's entirely to the other side, did it fly in front of or behind the tree?
CMOS sensors: 3 transistors per photosite give a rolling shutter. A global shutter requires an extra transistor, thus a 33% increase in complexity to get a global shutter.
A 1908 photo taken with a focal-plane shutter (same artifact as a rolling shutter).
Backwards-turning wagon wheel due to insufficient temporal sampling. Poynton's 1996 paper on “Motion Portrayal, Eye Tracking, and Emerging Display Technology”. Triggered by looking at early DLP chips and how they made images. Charles showed a diagram of 24fps film being interlaced-scanned with 3:2 pulldown; this leads to judder (due to some frames being “50% longer” than others), which is better than spatial disturbances within the frame (if 3:2 pulldown weren't used, and the frame change happened without regard to the video scanning). Another example: dot-matrix LED signs are typically row-sequentially illuminated; if the text on the sign is moved laterally (crawled), you'll see a slant in the text as your eye tracks the text moving across the sign.
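For reference, here's a minimal Python sketch (my illustration) of the 3:2 cadence Charles diagrammed: each 24fps film frame is held for alternately three and two 60i fields, so some frames persist 50% longer than others, which is where the judder comes from:

```python
def pulldown_32(film_frames):
    """Map 24fps film frames onto 60i fields with a 3:2 cadence."""
    fields = []
    for i, frame in enumerate(film_frames):
        fields.extend([frame] * (3 if i % 2 == 0 else 2))
    return fields

print(pulldown_32(["A", "B", "C", "D"]))
# ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D']: 4 film frames become 10 fields
```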
Next: How display technologies affect motion presentation…
Difference in displays: CRTs look sharp under eye tracking, as the illumination time of the phosphor spot is ~100 microseconds. Double-projected film images show double-imaging; LCDs are constantly illuminated (give or take on/off times), so eye-tracking an image across the screen causes visual smear: the image is static for the entire frame duration, but your eye moves across it. (DLPs and plasmas use pulse-width-modulated pixels; these are very complicated to model.)
Capture timings on the left, display timing on the right.
There's also background strobing: a moving element on film doubles up (due to the two-blade projector shutter), and the background blurs (as you track the FG object). On a CRT, both FG and BG are sharp, but the background strobes (due to the mostly-off duty cycle). Imagine a ball rolling past a picket fence; if it's shot with a short shutter time, you'll also see aliasing (fence pickets moving backwards, like the backwards-spinning wagon wheel). Thus you want some temporal aperture on both the acquisition and display sides (anecdotally, 1/3 to 1/4 of the time interval, or around 90-120 degree shutters).
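The shutter-angle equivalence is just arithmetic; here's a small sketch (using the anecdotal 1/3 and 1/4 figures from above) converting a temporal-aperture fraction to degrees:

```python
def shutter_angle(open_fraction):
    """Shutter angle = fraction of the frame interval during which light is integrated, times 360."""
    return 360.0 * open_fraction

print(shutter_angle(1/3))   # 120 degrees
print(shutter_angle(1/4))   # 90 degrees
```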
Let's set up “HD Ping Pong”: a 16-pixel-wide ball traverses the 1920-pixel-wide screen in 1 second. The ball moves 32 pixels per frame at 60fps (30 degrees/second of smooth pursuit; smooth pursuit can run up to 200 degrees/sec). Eye-tracking the ball on an LCD, it appears 32+16 pixels wide by 16 pixels high! What to do? Boost fps?
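The arithmetic behind that smear, as a quick Python sketch (assuming a full-frame sample-and-hold LCD and perfect smooth pursuit):

```python
screen_px  = 1920    # screen width in pixels
ball_px    = 16      # ball width in pixels
traverse_s = 1.0     # ball crosses the screen in one second
fps        = 60

px_per_frame = screen_px / traverse_s / fps     # 32 pixels of motion per frame
smear_px     = px_per_frame + ball_px           # tracked on a hold-type LCD, the ball
print(px_per_frame, smear_px)                   # reads as ~48 pixels wide
```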
A different thought experiment: move a Windows cursor around an HD screen in a circle, 1 revolution per second: about 1000 pixels in diameter, or 3142 pixels/sec of circumferential motion. Charles imagines a fixed gaze, but a moving display, such that the moving cursor seems fixed at the gaze point. Imagine also a window between viewer and gaze point, so you won't see the rest of the display. You'll see whatever blurring / trailing occurs, without having to actually track a moving target. (Right now this is being done with camera tracking: a moving camera, not a moving display.)
On plasma and DLP (PWM displays), motion causes textural and color artifacts (dynamic / false contours; color fringing and contouring). Pulse-width modulation could also be called pulse-count modulation: what counts is how much time the pixel is lit, whether through wider pulses or more of 'em.
Original DLP modulation simply used binary-scaled bit widths: the MSB (bit 7) drives 50% of the duty cycle, bit 6 drives 25%, etc. It worked in linear light, so it still needed gamma compensation upstream. But with motion, each pixel strobes out a pattern across the retina: the retina doesn't integrate light at a single pixel, but the time history across (say) 32 pixels.
Simplistic DLP modulation method: scaled binary pulse times.
If one pixel's value is 128 (binary 1000 0000), all its light comes out in the first half of the time interval; 127 (0111 1111) comes out in the second half.
One pixel at code value 128, the next at code value 127, and what their time histories are.
As your eye tracks from one to the other, you'll see a double-bright (255) contour where it crosses between them (moving the other way, you'll see a black contour).
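Here's a small Python simulation (my own sketch of the simplified MSB-first scheme described above, not actual DLP firmware) that reproduces the bright/dark contour by integrating the two pixels' time histories along an eye track that crosses from one to the other during a frame:

```python
import numpy as np

BITS = 8
SLOTS = 2**BITS - 1   # 255 unit time slots per frame

def time_history(code):
    """Light output (0/1) per unit slot for one frame, using MSB-first
    binary-weighted pulse widths (the simplistic scheme described above)."""
    h = np.zeros(SLOTS, dtype=int)
    t = 0
    for bit in range(BITS - 1, -1, -1):     # bit 7 (widest pulse) first
        width = 2**bit
        if (code >> bit) & 1:
            h[t:t + width] = 1
        t += width
    return h

a, b = time_history(128), time_history(127)
half = SLOTS // 2

# Eye crosses from the 128 pixel to the 127 pixel during the frame:
print(a[:half].sum() + b[half:].sum())   # ~255: double-bright contour
# Eye crosses the other way, from 127 to 128:
print(b[:half].sum() + a[half:].sum())   # ~0: dark contour
```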
Newer DLPs “bit split” to distribute the energy throughout the frame; instead of the MSB triggering the entire first half of the frame, it may be split into four pulses throughout the interval. This greatly reduces (but doesn't eliminate) the visibility of contouring (same thing in plasma is called subframing). Higher-value bits are bit-split; lesser ones are semi-randomly distributed.
There's a limit to the possible subdivision due to setup/load times on the pixels. DLP is basically a 1 bit SRAM topped with a 1 bit “mechanical memory” in the mirrors themselves. Transfer of the SRAM “memory” to the mirror memory is very fast, but it's not instantaneous. Charles estimates a 200 microsecond load time for a 2K array; the minimum bit-cell width is about 10 microseconds, so load time is 20x the minimum bit cell time; there's a loss of efficiency since (obviously) not all of the load time can be “hidden” behind a memorized display.
PDP (plasma) is much worse: you can't load in the background while a pixel is displayed. 1 microsecond to address a row × 1080 rows is about 1 msec to load the panel: 1/16 of the frame time is taken up with loading (at 60fps). Thus plasmas will never do high frame rates.
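The load-time arithmetic from the last two paragraphs, written out (the numbers are Charles's estimates, not datasheet figures):

```python
# DLP: array load time vs. the shortest usable bit cell.
dlp_load_us     = 200        # ~200 microseconds to load a 2K mirror array
dlp_min_cell_us = 10         # ~10 microsecond minimum bit-cell width
print(dlp_load_us / dlp_min_cell_us)    # load time is ~20x the shortest bit cell

# Plasma: row addressing can't be hidden while the panel is lit.
rows           = 1080
row_address_us = 1.0
frame_us       = 1e6 / 60                   # ~16,667 microseconds per frame at 60 fps
load_us        = rows * row_address_us      # ~1,080 microseconds to address the panel
print(load_us / frame_us)                   # ~0.065, roughly 1/16 of the frame time
```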
Back to DLP: could you compute the PWM pulse train so that the integrated eye tracks would appear correctly? Different viewers have different eye tracks; compute, say, the 64 most likely tracks? We don't really have a reliable way to do that, and even if we could it's a really difficult “inverse” problem (computationally tricky to suss out what the inputs should be for the desired outputs).
LCD and AMOLED: panels with row & column drivers. In HD, 1080 row select switches; 5760 column driver DACs (1920x R,G,B). The row selects feed the analog column values into each pixel's capacitor, which stores the value.
Basic layout of a 1920×1080 LCD or AMOLED display panel.
Thus the panel is refreshed scanline by scanline. If the top-to-bottom refresh time is fast enough, you can repaint the panel at 120Hz, 240Hz, etc., possibly with synthesized in-between frames (as is done in consumer displays). MPEG-2 and H.264 compute their motion vectors to minimize data rate, not to convey “true” motion, so you can't really use decoded motion-vector data to synthesize these in-betweens; it might be nice for HFR to also send true motion vectors.
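As a rough illustration of what those in-between synthesizers do, here is a deliberately simplified Python sketch that interpolates a halfway frame from a single global motion vector; real consumer interpolators estimate per-block or per-pixel vectors and handle occlusions, and as noted above the vectors in an MPEG-2/H.264 stream aren't a reliable substitute:

```python
import numpy as np

def halfway_frame(prev, nxt, mv):
    """Synthesize a frame midway between prev and nxt, given one global motion
    vector mv = (dy, dx) describing motion from prev to nxt (toy model only)."""
    dy, dx = mv
    fwd = np.roll(prev, (dy // 2, dx // 2), axis=(0, 1))     # previous frame pushed forward half the vector
    bwd = np.roll(nxt, (-dy // 2, -dx // 2), axis=(0, 1))    # next frame pulled back half the vector
    return ((fwd.astype(np.float32) + bwd) / 2).astype(prev.dtype)
```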
Since LCD and AMOLED are essentially 100% duty-cycle displays, eye-tracking causes motion blur. Sony's OLED displays use a 25% duty-cycle; after painting the picture, 1/4 of the frame time later Sony paints black. This reduces motion-tracking blur considerably without reducing brightness too much. With LCDs, you can flash the backlight to reduce the duty cycle and minimize blur.
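A first-order model of that trade-off (my sketch, reusing the ping-pong ball's 32 pixels of motion per frame): the tracking blur scales with the fraction of the frame for which the image is held lit:

```python
def tracking_blur_px(px_per_frame, duty_cycle):
    """Approximate smear, in pixels, from eye-tracking on a hold-type display."""
    return px_per_frame * duty_cycle

print(tracking_blur_px(32, 1.00))   # ~32 px of smear with a full-frame hold (typical LCD)
print(tracking_blur_px(32, 0.25))   # ~8 px with a 25% duty cycle (the Sony OLED approach)
```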
To conclude: it seems obvious, but you can't judge motion portrayal by looking at a still frame! It's a bit like Stereo 3D: psychophysically we don't really understand it yet. We shouldn't rush into HFR before we understand it better.
It must be evaluated visually.
It's subjective.
The HD / video / digital cinema community hasn't done a great job on motion standardization: we still have interlace; we don't convey 3:2 pulldown and scene-change metadata (so consumer displays have to work very hard–millions of transistors–to figure this out); we didn't include 1080p60 in ATSC Table 3; we haven't standardized temporal characteristics of reference displays.
(BT.1886, March 2011, standardized gamma at 2.4. That should have been part of Rec.709, or NTSC in 1953!)
We should standardize a reference display including temporal characteristics, and consider a plan to migrate away from interlace in production.
YouTube's preferred rate is 30.00 Hz. 29.97 to 30.00 Hz is a “higher” frame rate!
Afterwards, Mark Schubin followed up with “the look of HFR”, the presentation that was “supposed to be given in the bar with Charles after his talk”.
NHK at NAB 2012 showed a clear difference between 60fps and 120fps at 8K. But the real-world targets the 8K camera was shooting looked much clearer still… which is why the BBC proposes 300 fps.
IBC 2012: viewers were twice as sensitive to lip-sync errors in 3D as in 2D (due to alternate-eye presentation, at twice the one-eye rate?).
What is reality? “The Arrival of a Train” caused consternation (silent B&W); 78 rpm Edison records were thought indistinguishable from live performance. Do we want reality? Or storytelling?
Disclaimer: I'm attending the Tech Retreat on a press pass, but aside from that I'm paying my way (hotel, travel, food).