The show floor itself was filled with the usual collection of VR hardware, motion capture companies, and VFX and 3D software (including the excellent new release of Fusion 9–stay tuned for an in-depth review shortly). But the real story was found within the dozens of technical papers presented during the conference sessions. Many of these have bizarre academic titles like “Variance-Minimizing Transport Plans for Inter-Surface Mappings”, “Harmonic Global Parametrization With Rational Holonomy” and “Spherical Orbifold Tutte Embeddings.” (I’m not making these up–they’re straight out of the Siggraph program.)
However, several of the sessions were much more direct in their application. For example, three of the lectures I attended involved researchers recreating facial animation from a single audio source, with no video at all. Some of the results were remarkably organic.
To me, one of the most interesting sessions was titled “Computational Video Editing for Dialogue-Driven Scenes.” In other words: how a computer can edit your scene for you.
Now, the good news is that Michael Kahn and Paul Hirsch aren’t out of a job just yet. Their assistant editors, however, may well be. The most surprising thing about the paper is that the edit the computer came up with wasn’t horrible. I’d encourage you to check out the short overview video below. It’s not going to win an Academy Award, but it was actually better than a lot of the mediocre editing that constantly makes it out into the real world.
The Stanford paper in a nutshell
Here’s how the system works: audio analysis aligns the various takes with different portions of the script and different speakers. Facial recognition both identifies the actors in a given shot and determines the kind of shot, based on the size of the face relative to the frame (e.g. a face close to the full height of the frame would be considered a close-up or extreme close-up, while a small face in frame would be considered a wide). The system is actually pretty simple in its current implementation; with modern machine learning algorithms it would be relatively trivial to train such a system to identify and tag shots as OTS, cutaways, shots with camera bumps to be avoided, and so on.
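To make that concrete, here’s a rough sketch of the shot-typing idea in Python. The thresholds, the FaceDetection structure and the labels are my own illustrative assumptions, not anything taken from the Stanford implementation:

```python
# Minimal sketch of shot-type tagging from face-detection results.
# Thresholds and the FaceDetection structure are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FaceDetection:
    actor: str           # identity returned by facial recognition
    face_height: float   # face bounding-box height, in pixels
    frame_height: float  # frame height, in pixels

def classify_shot(face: FaceDetection) -> str:
    """Label a shot by how much of the frame height the face fills."""
    ratio = face.face_height / face.frame_height
    if ratio > 0.7:
        return "extreme close-up"
    if ratio > 0.4:
        return "close-up"
    if ratio > 0.2:
        return "medium"
    return "wide"

# Example: a face filling half the frame height reads as a close-up.
print(classify_shot(FaceDetection("Actor A", 540, 1080)))  # -> "close-up"
```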
Then idioms (the authors’ term for cinematic language) are applied to arrive at the desired edit. These idioms include: no jump cuts, start wide, and focus on a specific actor for emotional emphasis. Obviously all kinds of other idioms could be implemented.
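Here’s a hedged sketch of how idioms might be expressed in code, as penalty terms scored over a candidate sequence of shots. The Shot fields, the weights and the brute-force comparison are purely illustrative assumptions on my part; the paper’s actual formulation is more sophisticated than this:

```python
# Illustrative sketch: scoring a candidate edit (one shot per line of
# dialogue) against a few idioms. Weights and structure are assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Shot:
    take_id: int
    shot_type: str      # e.g. "wide", "medium", "close-up"
    visible_actor: str  # actor most prominent in frame

def edit_cost(sequence: List[Shot], emphasized_actor: str) -> float:
    cost = 0.0
    # Idiom: start wide to establish the scene.
    if sequence and sequence[0].shot_type != "wide":
        cost += 2.0
    # Idiom: avoid jump cuts (consecutive near-identical framings
    # from the same take).
    for prev, cur in zip(sequence, sequence[1:]):
        if prev.take_id == cur.take_id and prev.shot_type == cur.shot_type:
            cost += 3.0
    # Idiom: favor close-ups of the actor we want to emphasize.
    emphasis = sum(1 for s in sequence
                   if s.visible_actor == emphasized_actor
                   and s.shot_type in ("close-up", "extreme close-up"))
    cost -= 0.5 * emphasis
    return cost

# Lower cost = better edit; compare candidate sequences and keep the best.
candidates = [
    [Shot(1, "wide", "A"), Shot(2, "close-up", "A"), Shot(3, "medium", "B")],
    [Shot(2, "close-up", "A"), Shot(2, "close-up", "A"), Shot(3, "medium", "B")],
]
best = min(candidates, key=lambda seq: edit_cost(seq, emphasized_actor="A"))
```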
The AI assistant editor
To reiterate: I’m not suggesting that a computer will replace the nuanced shaping of emotion that makes great editors great. But think for a moment about the job of an assistant editor: sorting through footage, breaking out and arranging shots by shot type, and sometimes creating baseline edits from which another editor can begin. All of these tasks are performed extremely well by this prototype project from Stanford.
Ultimately, where this and similar technologies are disruptive is in eroding the traditional entry-level path into the industry. With less need for assistant editors, how does any editor get in the door to start working on high-level content in the first place?
Like all disruptions in life, you can see this evolution of technology as a threat or an opportunity. It threatens to take away a slew of traditional jobs from the industry, but it also opens the door to individual creativity by removing more of the tedium that gets in the way of the creative work.
Other takeaways from Siggraph
A few other observations from Siggraph: the film & video VR bubble, as predicted by many writers here, has burst. I found that even the most avid evangelists from last year’s show were willing to concede to me that narrative storytelling in VR at a Hollywood level lacks a financial model. Most of the VR technology vendors are focused on VR and AR as gaming platforms or for industrial applications (arch viz, mechanical repair, etc.).
In the drone world, an interesting paper on multi-view drone cinematography showed promise for using drones to replace traditional Steadicam and dolly shots, both indoors and outdoors. This includes automatic avoidance of actors (i.e. not slicing your talent up with copter blades) and coordination between drones so they don’t show up in each other’s shots. The indoor flight paths are still quite noisy due to air perturbations, but with cameras getting smaller and lighter I could see this becoming a viable alternative to dolly and jib work within the next couple of years.
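As a toy illustration of that second constraint, here’s a simple field-of-view check a planner might run to keep one drone out of another’s shot. The geometry and the 90-degree FOV are my own simplified assumptions, not the paper’s actual planner:

```python
# Illustrative check: does another drone fall inside a camera drone's
# view cone? Positions, direction vectors and FOV are assumptions.
import math

def in_field_of_view(cam_pos, cam_dir, other_pos, fov_degrees=90.0):
    """Return True if other_pos lies within the camera's view cone."""
    to_other = [o - c for o, c in zip(other_pos, cam_pos)]
    dist = math.sqrt(sum(v * v for v in to_other))
    norm = math.sqrt(sum(v * v for v in cam_dir))
    if dist == 0 or norm == 0:
        return True  # degenerate case: treat as visible to be safe
    cos_angle = sum(a * b for a, b in zip(to_other, cam_dir)) / (dist * norm)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return angle <= fov_degrees / 2

# A planner could reject or re-route waypoints where this returns True
# for any other drone in the scene.
print(in_field_of_view((0, 0, 2), (1, 0, 0), (5, 1, 2)))   # True: in shot
print(in_field_of_view((0, 0, 2), (1, 0, 0), (-5, 0, 2)))  # False: behind camera
```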
Light field capture and display technologies were also big at Siggraph. One particularly interesting capture method came from a paper titled “Light Field Video Capture Using a Learning-Based Hybrid Imaging System.” It essentially showed the ability to combine a cheap Lytro camera with a standard DSLR to create effective light field video at a fraction of the price and bulk of existing methods. The paper mainly addressed using the technique for defocusing and rack focus in post, but it seems like it could also be used for pulling quality keys without the need for a green screen. And while the researchers combined the Lytro with a DSLR, there’s no reason the same technique couldn’t be used with a higher-end camera.
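To illustrate that keying idea, here’s a minimal sketch of pulling a matte from a per-pixel depth map, the kind of data light field video makes available. The arrays, thresholds and feathering are illustrative assumptions on my part, not something demonstrated in the paper:

```python
# Illustrative depth-based keying: threshold a per-pixel depth map
# instead of chroma. Values and feathering are assumptions.
import numpy as np

def depth_key(depth_map: np.ndarray, near: float, far: float,
              softness: float = 0.1) -> np.ndarray:
    """Return an alpha matte (0..1) keeping pixels between near and far (meters)."""
    alpha = np.ones_like(depth_map, dtype=np.float32)
    # Feather both ends of the depth range so the matte isn't a hard cut.
    alpha *= np.clip((depth_map - near) / softness + 1.0, 0.0, 1.0)
    alpha *= np.clip((far - depth_map) / softness + 1.0, 0.0, 1.0)
    return alpha

# Example: keep subjects between 1.5 m and 3.0 m, feathered over 10 cm.
depth = np.random.uniform(0.5, 10.0, size=(1080, 1920)).astype(np.float32)
matte = depth_key(depth, near=1.5, far=3.0)
```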
The age of AI
One thing ran throughout the technical papers: almost every one used machine learning or deep learning (artificial intelligence algorithms) somewhere in its processing pipeline. We don’t have to worry about Terminators or being plugged into a Matrix just yet, but there is a very real threat to jobs in highly technical areas previously considered immune.
I always feel like technological advancement is a mixed bag. I love having a smartphone, but I also remember a day when people left you alone to get a solid few hours of work done without having to constantly reply to emails, text messages and voicemails. At the end of the day, while I can be nostalgic for those less connected times, I’m not going to be able to stop the artificial intelligence revolution. The clear message: artificial intelligence is coming to our workplace, and we need to start adapting.