Is Generative AI for Video Ready for Prime Time Production?

By now, everyone has heard of Generative AI (artificial intelligence) tools like MidJourney, Stable Diffusion, and ChatGPT and how they’re impacting the world. But are they actually as disruptive as the headlines suggest? And if so, how might we use the potential of AI and machine learning to power our own projects?

There’s a lot of theoretical chatter about what AI might or might not be capable of, but to really answer these questions you’ve got to use these new tools in a real-world environment. So that’s exactly what I’ve done.

With a group of like-minded and very technical friends, I created a project that combined traditional and virtual production techniques with cutting-edge AI tools aimed at generating new forms of media. As a team, our goal was to establish how far we could push these new tools, whether they’re capable of delivering viable results, and what they might allow us to achieve on an extremely limited budget.

Before we get too deep into the process, let’s begin with some definitions of AI terms and the current AI filmmaking tools.

AI terms

Artificial Intelligence has been around longer than you might realize. It was first recognized as an academic discipline back in 1956 at Dartmouth College. Initial incarnations included computers that could play checkers, solve math problems, and communicate in English. Development slowed after a while but was renewed in the 1980s with updated AI analysis tools and innovations in robotics.

This continued through the early 2000s as AI tools solved complex theorems, and concepts such as machine learning and neural networks took form through well-funded data mining companies like Google.

All this development set the stage for Generative AI, which is what most people mean when they use the term AI today. Generative AI is a system capable of generating text, images, or other media in response to natural language prompts.

Generative AI models learn the patterns and structure of training data and then generate new data with similar characteristics. In layperson’s terms, Generative AI tools imitate human reasoning and responses based on models derived from examples created by humans. Unless otherwise stated, when we say AI in the rest of this article, we mean Generative AI.

Now that we’ve zeroed in on Generative AI, let’s look at some critical terms of interest to filmmakers.

  • Algorithm: a set of instructions that tells a computer what to do to solve a problem or make a decision.
  • Computer vision: an AI that can understand and interpret images or videos, such as recognizing faces or objects.
  • Deep learning: a type of machine learning that uses many-layered neural networks to learn from large amounts of data and make decisions without being explicitly programmed.
  • Discriminator: In a GAN, the discriminator is the part that judges whether something created by the generator is real or fake, helping both parts improve over time.
  • Generative Adversarial Network (GAN): GANs are like an art contest between a creator and a judge. The creator makes art, and the judge decides if it’s real or fake, helping both improve over time.
  • Generator: in a GAN, the part that creates new images, like drawing a picture of a person who doesn’t exist or creating new artwork inspired by famous artists.
  • Inpainting: the process of retouching or completely replacing parts of a generated image.
  • Large language model (LLM): an algorithm that uses deep learning techniques and massive data sets to understand, summarize, generate, and predict new content.
  • Machine learning: a method for computers to learn from data and make decisions without being specifically programmed.
  • Natural language processing: a part of AI that helps computers understand, interpret, and generate human languages, like turning spoken words into text or answering questions in a chatbot.
  • Neural networks: AI systems inspired by how our brains work, with many small parts called “neurons” working together to process information and make decisions.
  • Outpainting: extending a generated image beyond its original borders with a second generation.
  • Prompt engineering: Carefully crafting or choosing the input (prompt) you give to a machine learning model to get the best possible output.
  • Seed: the number used to initialize a particular generation; reusing the same seed and prompt lets anyone reproduce the same image or create controlled variations of it (see the sketch after this list).
  • Segmenting: the process of dividing an image into multiple regions, grouping together the pixels that belong to the same class.
  • Style weight: the degree to which the style reference influences the generated video relative to the input video.
  • Structural consistency: the degree to which the generated video retains the structure of the input reference.
  • Training data: a collection of data such as text, images, and sound used to train an AI model to generate new examples.
  • Upscale: an AI method of increasing an image’s resolution by analyzing its contents and regenerating them at a higher resolution.
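To make a couple of these terms concrete, here is a minimal sketch of how a prompt and a seed interact in the open-source Stable Diffusion ecosystem, using the Hugging Face diffusers library (not one of the tools we used on the shoot; the model name, prompt, and settings are illustrative assumptions):

```python
# A minimal sketch of seed-based reproducibility with the open-source diffusers
# library. The model ID, prompt, and settings are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a rescue worker on a motorcycle racing toward a burning cathedral, cinematic lighting"

# Fixing the seed makes the generation repeatable: the same prompt and seed
# reproduce the same image, which is what lets you iterate on controlled variations.
generator = torch.Generator(device="cuda").manual_seed(1234)
image = pipe(prompt, generator=generator, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("motorcycle_test.png")
```

Changing only the prompt while keeping the seed fixed is the essence of prompt engineering: you can compare outputs knowing the underlying randomness hasn’t shifted beneath you.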

AI tools

There are other interesting terms related to AI, but this list gets you into the ballpark for filmmaking. Let’s continue with a survey of some AI filmmaking tools. Some of these tools are commercial front-ends to open-source/academic products.

Text-to-image generators create highly detailed, evocative imagery using a variety of simple text prompts.

  • MidJourney: a well-known image generator.
  • Stable Diffusion: a similar tool to MidJourney.
  • Adobe Firefly: Adobe’s take on AI is well-integrated with Photoshop and offers a familiar professional interface compared to some of the more abstract alternatives.
  • Cuebric: combines several different AI tools to generate 2.5D environments for use as backgrounds on LED volumes.

Text-to-video generators build on text-to-image by expanding their capabilities to moving-image outputs. Some also include video-to-video generators, which transform existing video clips into new clips based on a style reference. For example, you could take footage of someone walking down a street, add a reference image of a different city, and the tool will attempt to make the video look like that reference (the sketch after the list below illustrates the underlying idea).

  • Runway and Kaiber offer text-to-video and video-to-video modes. We anticipated that most of our work would involve these kinds of tools.
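The commercial tools hide the machinery behind a web interface, but the underlying video-to-video idea can be approximated frame by frame with open-source image-to-image generation. The sketch below is our own rough analogy using the diffusers library, not Runway’s or Kaiber’s actual pipeline; the file names, prompt, and settings are hypothetical. The strength parameter plays roughly the same role as style weight versus structural consistency: low values preserve the input frame, high values hand the image over to the prompt.

```python
# A rough, frame-by-frame approximation of video-to-video restyling with
# open-source image-to-image generation. This is an analogy, not how Runway
# or Kaiber actually work; file names and settings are hypothetical.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "Notre-Dame cathedral engulfed in flames at night, embers in the air, cinematic"

for i in range(1, 25):  # restyle the first 24 frames of an exported image sequence
    frame = Image.open(f"input_frame_{i:04d}.png").convert("RGB").resize((768, 432))
    # strength behaves like a style weight: ~0.3 keeps most of the original frame's
    # structure, ~0.8 hands most of the image over to the prompt.
    out = pipe(prompt=prompt, image=frame, strength=0.55, guidance_scale=7.5).images[0]
    out.save(f"styled_frame_{i:04d}.png")
```

Because each frame is generated independently in this naive version, the results flicker and morph from frame to frame, which is essentially the consistency problem we kept running into with the commercial tools.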

Additional tools

NeRF: Neural Radiance Fields, or NeRFs, are a subset of AI used to create 3D models of objects and locations from a collection of photographs, acting as a sort of supercharged photogrammetry.

Luma.AI and Nvidia Instant NeRF simplify the capture and processing of NeRFs, creating highly realistic and accurate 3D models and locations using stills from cameras or phones.

AI motion capture/VFX

  • Wonder Studio: a visual effects platform that takes source footage of a person, rotoscopes them out of the footage, and replaces them with a CG character matched to their movements, without tracking markers or manual intervention.
  • Move.AI: derives accurate motion capture from multiple cameras without needing expensive motion capture hardware, capture suits, or tracking markers.

The team

I have a group of filmmaker friends I’ve known for years. We met at an intensive film production workshop and bonded over a whirlwind summer of pre-production, production, and post-production. They’re all successful filmmakers in various disciplines, including production, writing, directing, visual effects, editing…it’s a long list.

So when I discussed with my filmmaking friends how AI tools might transform the way movies are made, they were keen to test them out together on a real-world project.

The plan

The idea was to have two of my colleagues come out to my place in San Francisco (which includes a 16’ x 9’ LED wall) for a three-day test project. We’d shoot live-action with different cameras and techniques and then process them with various AI tools. We wanted to see if we could create a high-end result with minimal resources and learn if these AI tools held any realistic promise for filmmaking democratization.

First, we needed a story. One of our team members is a writer/director/editor, and was generous enough to offer a few pages from an existing script to use in this shoot. (She requested I keep her anonymous for this article, so we’ll call her Kim.) In this script segment, a rescue worker is trying to retrieve secret antiquities from the cathedral at Notre-Dame in Paris during the 2019 fire—an ambitious undertaking to be sure.

The other collaborator is my good friend, Keith Hamakawa. Keith works as a visual effects supervisor based in Vancouver, Canada. He’s got tons of experience overseeing live-action and transforming it into impressive visual effects for shows like CW’s The Flash, Supernatural, and Twilight. He has also set up virtual production LED stages in Vancouver.

We tried different techniques for capturing live-action footage to see how well each would translate into good AI material. Our three primary modes of production were virtual production with an LED wall, a green screen, and live action captured on location. Our shoot took place the week of June 19th, 2023—important to note because, at the current pace of evolution, things may have changed by the time you read this.

Our experiments

Experiment #1: LED wall with fixed camera, medium close-up

We started with a sequence in which our protagonist is rushing to Notre Dame on a motorcycle. The shot was set up with me as the actor, seated in a fixed chair in front of the LED wall and wearing a somewhat goofy spaceman helmet. We projected driving plates on the wall behind the actor and captured the shot with a Blackmagic URSA 4.6K camera fitted with Zeiss Compact Prime CP.2 lenses on a pedestal tripod.

Next, we processed this footage through Runway and used its Gen-1 video-to-video generation tool. We fed a news photo of Notre Dame on fire as a background image, with a full-frame image of a guy on a motorcycle composited over it as our reference. Finally, we hit Generate Video and awaited the results.

 

Interestingly, this first test produced one of the best results of the entire shoot. Runway picked up on both the foreground and background styles we were looking for. It transposed my helmet into an appropriately high-tech motorcycle helmet. It also transformed the daylight driving footage into appropriately intense, fire-filled city backgrounds. Although the humanoid figure had some odd artifacts around the mouth area, in general, we liked the results.

Experiment #2: LED wall with moving camera

For this shot, Kim played our main character, peering over a wall as the cathedral burned in the background on the LED wall. This time we shot handheld on an iPhone in 4K and projected an Unreal Engine 3D background featuring a forest fire environment. Honestly, it looked pretty cool on its own as a shot, so we figured AI would make it even better.

This time we fed the results into Runway with a similar reference image of Notre Dame on fire. We weren’t sure if it was the change in the camera format, the moving image, or the amount of movement in the LED screen background, but Runway couldn’t seem to produce the desired results.


It treated the LED background as a flat object, like a sign vs. a moving 3D object. We chalked this up to the possibility that very few people have an LED wall and are using Runway, so it might not have a lot of successful examples to refer to. Also, having a complex camera movement where the main actor’s face shifts quite a bit during the shot in size and orientation seemed to throw off the AI.

Experiment #3: Green screen

For our next experiment, we tried a green screen setup, because displaying green on an LED wall and pulling a suitable key is elementary. We created a shot similar to the seated driving shot we had done previously. We captured our actor seated in a fixed medium shot on a green background, again with the URSA, and then composited it over a background driving shot. The compositing was rushed, but it looked acceptable.

When we fed this shot into Runway, we added a reference image of Thor from the Marvel movies to see what would happen. In this shot, the AI again struggled with the concept of a flat CG background with a physical foreground. It did a fine job of transforming the actor to look somewhat like the Thor-style reference. But it struggled to detect the motion of the composited background plate. Instead, it treated it as a fixed location with moving objects. So it looked like a person sitting in the middle of a field with clouds or moving bushes shooting by. Weird and somewhat interesting though not what we wanted to see.

 

Our initial three experiments brought us to the conclusion that working with an LED wall was a mixed bag—at least with Runway. Combining live-action physical elements with LED or green screen confused the AI.

Also, it was a lovely day outside, so we decided to shoot some scenes in real-life locations and see how the AI would deal with a shot composed entirely of real elements without composited or projected backgrounds.

Experiment #4: Location shoot

As we went outside, we chose a different section of our test script to attempt. In this scene, the main character is supposed to rise out of the Seine River and approach the burning cathedral. We walked to a nearby college campus that had a large church. The church was undergoing renovations, much as Notre-Dame was at the time of the 2019 fire, so it seemed like a good stand-in.

We shot on the iPhone with me as the actor again, this time pulling myself out of a small fountain with the church in the background. For this experiment, we tried Kaiber to compare its results against Runway’s. For each attempt, we fed in our new clip along with various reference images drawn from news photography of the original blaze.

These attempts produced some trippy images but also some reasonably promising results. Runway and Kaiber both did a fine job of creating and styling the background. Runway was more abstract and also shifted the time of day to night. Kaiber looked more like an illustration than cinematography but was also visually sharper. Kaiber did a good job of adding fire effects and smoke plumes, although instead of billowing naturally, they sort of vibrated and shook in place.

 

Both tools struggled with a consistent appearance for the actor. Runway turned me into a spaceman/firefighter, but one that morphed/mutated wildly during the shot. Kaiber’s results were less abstract but also morphed throughout the shot.

I’m a fiend for wearing a flat cap because my skin burns like a vampire’s in the sun. And for some reason, both tools seemed to change their mind along the way as to what to transform my hat into. Runway made it into a firefighter’s helmet with my real face occasionally shining through. Kaiber made it closer to the real hat, but it also shifted between several different styles as I pulled myself out of the fountain. Parts of Kaiber’s shot looked pretty good, but the results were not consistent across the shot’s duration.

Experiment #5: Additional on-location shoot with more camera movement

Most of our shots so far involved the actor in a medium or close-up shot, more or less facing toward the camera with minimal camera movement. While many shots in a movie are precisely these sorts of shots, we also wanted to see what AI would do with the kind of wider master and establishing shots that a movie would contain.

At this point, we realized we probably wouldn’t get enough consistent looks across our shots to make a complete scene, so we abandoned the script and just shot various complex shots as we walked around.

We fed a couple of these shots into both Kaiber and Runway and hit the same limitations: portions of the results were interesting and close to what we wanted, but the background and, even more so, the actor would mutate and transform throughout the shot in ways we couldn’t predict or cancel out. These are fascinating experiments, but not tools you could use consistently to produce anything that isn’t meant to look impressionistic or fantastical.

Experiment #6: Text-to-video

Although almost everything we did involved live-action experiments with video-to-video, we thought we’d try one more avenue. Just to see what was possible with no source footage at all, we also tried some text prompts using Runway’s Gen-2 text-to-video generator.

 

While this consistently produced interesting results, the output style varied widely regardless of the reference imagery and parameters. Sometimes it also produced odd deformations of the human characters, so hopefully there will be an option to tamp that down at some point.

This could be useful for storyboarding or brainstorming concepts. But it didn’t seem as useful for final imagery because you want your characters and settings consistent from shot to shot across a sequence. Sure, we had some success getting some consistency by reusing seeds and reference imagery, but the results were still challenging to repeat on a reliable basis.

 

Experiment #7: Hybrid approach using Wonder Studio combined with Runway/Kaiber

By this point in the shoot, it dawned on us that using AI video-to-video tools to circumvent the need for expensive sets and visual effects was probably not feasible, at least not with current tools. They showed us a lot of promise and occasional glimpses of what we sought. Still, they weren’t ready to consistently deliver the results you would expect from a digital content creation tool on a professional level.

But it did make us think, “What if we leaned into the strengths of these tools and leveraged them?” Instead of trying to transform the entire frame, we could transform the backgrounds and use other AI tools to extract our foreground characters and composite them conventionally. Given the results we’d seen so far, this hybrid approach seemed like it might work.

For this experiment, we took a shot where I’m walking down a hill in Golden Gate Park toward the camera (iPhone), coming from a wide shot to a medium shot, and then I walk out of the frame as the camera slowly pans.

 
Creating an alpha mask and cleaning the plate.

We added another AI tool, Wonder Studio, from Wonder Dynamics. Wonder Studio is billed less as a generative AI product and more as an AI motion capture, rotoscoping, and compositing tool. The idea was to use Wonder Studio to extract me from the original shot and provide a clean background plate. Then we’d use Runway video-to-video to style that clean plate and finally use the matte from Wonder Studio to composite me back over the newly regenerated plate.

This hybrid approach was slower and more labor-intensive than pure video-to-video. To give you an idea, we’d typically get results out of Runway or Kaiber in a couple of minutes. With Wonder Studio, the time to process a shot would range from 30 to 45 minutes. Remember, that time included motion tracking, rotoscoping the actor out of the plate, and if desired, creating a new CG character following the actor’s markerless mocap.

For perspective, all that work would take human VFX artists several hours, possibly days, to accomplish.

 
Our back plate

With Runway only having to deal with an empty background plate devoid of humans, the results were much more consistent. Also, the motion of the camera and the shifting background were perfectly mirrored in the AI-generated background. We got several different styles, all of which looked pretty cool.

Next, we took the alpha matte Wonder Studio created to composite the actor back into the new AI-generated backgrounds. The results looked good: no more abstract, morphing foregrounds. The background matched up nicely with the framing and camera movement. Our composite was a little ragged but could have been made perfect with more time and effort.
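For anyone curious what that composite boils down to, here is a minimal per-frame sketch of the standard “A over B” merge, with hypothetical file names standing in for image sequences exported from the tools above:

```python
# A minimal sketch of the per-frame alpha composite: the actor from the original
# plate goes over the AI-restyled background, using the matte as the blend mask.
# File names are hypothetical placeholders for exported image sequences.
import cv2
import numpy as np

fg = cv2.imread("original_plate_0001.png").astype(np.float32)      # frame with the actor
bg = cv2.imread("ai_styled_plate_0001.png").astype(np.float32)     # restyled clean plate
matte = cv2.imread("actor_matte_0001.png", cv2.IMREAD_GRAYSCALE)   # white where the actor is

alpha = (matte.astype(np.float32) / 255.0)[..., None]  # normalize to 0-1, add a channel axis

# Standard "A over B": keep the actor where the matte is white, the new background elsewhere.
comp = fg * alpha + bg * (1.0 - alpha)
cv2.imwrite("composite_0001.png", np.clip(comp, 0, 255).astype(np.uint8))
```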

 

It’s worth noting that we were using FCPX for our compositing here. It’s likely that a dedicated compositing application such as After Effects or Nuke would have yielded better results. That said, our hybrid approach delivered the most satisfying and controllable results of the whole shoot. And with that, our three-day Generative AI meets virtual production shoot was a wrap.

General observations from the shoot

We found the results from the AI tools to be predictably unpredictable. We struggled to determine why some results were close to our desired effect, and others were far off. No matter which parameters we tweaked, the results were challenging to bring into line in a consistent manner.

For these tools to earn a place in a proper film production pipeline, they’ll need intuitive UI controls with granular control over the outputs, akin to a traditional 3D modeling or compositing app. Right now, abstract controls lead to abstract imagery: fun to play with, and perhaps enough for a surreal commercial or music video, but not ready for professional projects across the board. That said, we expect this to evolve quickly.


Another general observation was that the AI tools were most adept at medium close-ups taken from a selfie perspective. They struggled to track an actor moving laterally through a shot or changing relative size, especially when combined with a free-moving camera.

Most movies are shot in a variety of shot sizes, with a camera shifting in multiple directions, so there’s plenty of room for improvement here. Hopefully, as more people use these tools, they’ll get better at processing various shot sizes and camera movements.

Second opinion(s)

After we completed the shoot, we also recorded a debriefing to discuss the results and compare our observations on how useful these tools are now for real production pipelines and where they could potentially evolve to be more so. Here are some quotes from that session:

Keith Hamakawa, VFX Supervisor

“The first place AI generation tools like Runway or Kaiber could slide into is short form. You won’t use it for a whole movie, but because it’s better at creating backgrounds, perhaps matte painters will be in danger of losing gigs or at least needing to work with these tools. You’ll say, ‘I need a futuristic cyberpunk cityscape with flying blimps in the background,’ and it’ll generate that with a moving camera ready to go.”

“It reminds me of the movie Waking Life, which is rotoscoped animation over footage shot on prosumer video cameras. AI renders look a little like that, only now you won’t need a team of 150 animators doing the labor-intensive work.”

“Even on the cutting edge of virtual production, we’re dealing with technology that’s had decades to mature. AI Generative video is not even a toddler yet, but maybe in ten years, it could advance to something where there isn’t a need for a film crew or a sound stage.”

“As Danny DeVito in Other People’s Money once said, ‘Nobody makes buggy whips anymore, and I’ll bet the last company that made buggy whips made the best buggy whips you ever saw.’ AI will replace jobs in the movie business and create new ones in their place.”

Kim, Writer/Director/Editor

“I expected a far greater level of control and interaction with the UI. There was no pathway to get from one place to another. As a filmmaker, you’re usually not living in the world of abstract storytelling where consistency doesn’t matter.”

“AI opens many potential doors for actors—assuming it’s not abused—with the ability to play de-aged or creature characters without wearing heavy prosthetics. I can’t imagine they’ll miss spending six hours in a makeup chair.”

“These tools would be good for anyone who needs to pitch a project to a studio or investors. You can visually create the story that goes with your pitch. When these tools improve, you can have a very professional-looking product that sells your idea without hiring a crew to shoot it in the first place.”

“The current state of AI reminds me of my first opportunity to work with an experienced film editor on Avid, who had never worked in a digital space. I got promoted to first assistant editor because I’d been trained in digital editing. Everything changed so fast from film editing to digital, and it never went back. That didn’t happen because it made the film any better. It happened because the producers realized they would spend less money cutting digital vs. on film.”

Where does that leave us?

Generative AI, as it relates to filmmaking, is currently in its infancy, but where might it develop soon and over the long term? Based on our experiences, you could envision several possible scenarios:

  1. A massive transformation of how films are made with the potential to disrupt everything—from what jobs continue to exist to who controls the means of production. Some historical examples of this level of disruption: sound, color, optical-to-digital compositing, film-to-digital, and virtual production.
  2. AI tools fail to coalesce into anything game-changing. Examples of this are 3D, 3D, and 3D, which appeared and disappeared in at least three significant cycles in the ‘50s, ‘80s, and early 2000s. Each time they were positioned as the next big thing for filmmaking, but ultimately failed to sustain mainstream success.
    3D repeatedly failed because the audience’s appetite isn’t strong enough to justify the increased production costs and the discomfort of wearing glasses. It’s a great gimmick, but audiences seem satisfied for the most part watching 2D entertainment. Perhaps this form of entertainment will finally go mainstream when the technology exists for glasses-free 3D via holography or some other method. Who knows?
  3. AI lands somewhere in the middle of the first two extremes. In this outcome, some areas of cinematic production are heavily impacted and transformed by AI, while others are relatively unchanged.

This seems like a more realistic outcome because it’s already happened to an extent. Areas such as editing, visual effects, previsualization, script research, etc., are already imbued with various forms of AI and will likely continue on that trajectory.

On the hype cycle

Here’s another way to look at the trajectory of AI and really all potentially transformative/disruptive technologies, courtesy of the Gartner hype cycle.

Based on this projection, it’s fair to say that generative AI is currently approaching (possibly even sitting at) the peak of inflated expectations on Gartner’s Hype Cycle.

Final thoughts (for now)

It’s been said by folks in the know that an AI will not necessarily take your job, but a competitor who’s mastered an AI tool just might. AI could be highly disruptive and has already been described as the fourth industrial revolution by various sources.

An industrial revolution happens when a new technology radically alters society via changes to the composition of the labor force and changes in living conditions, some good, some bad, but primarily favorable over the long term. Some of the prior technologies that sparked industrial revolutions were the steam engine, electricity, and electronics/IT.

So after reading about our experiences, what can you do to surf the wave of AI and not be inundated by it? I’d suggest that you learn everything you can about the AI tools that affect your chosen field. And if you’ve always dreamed of a different career but were afraid to rock the boat because you’re already established, AI might be the catalyst you need to jump ship.

Don’t buy into the hype or the fear-mongering surrounding AI in the news. Most of that is there to sell you something or make you afraid while you click through ads. That said, I’m not discounting the serious ethical, regulatory, and legal issues around AI that still need to be resolved.

Under the current rules of the United States Copyright Office, works generated solely by AI are not copyrightable. So factor that into what you create with these tools and how you plan to use those creations. Also, the training data used to develop some models, such as ChatGPT and MidJourney, comes from the open internet.

The artists whose work appears in that training data are not credited, compensated, or even acknowledged for its unauthorized use. Whether that use triggers licensing rules or falls under Fair Use has not been decided in the courts or via legislation.

We’re currently in the Napster phase of AI, and we’re already starting to see the regulation, litigation, and product innovation that will ultimately land us in the Spotify phase of AI. By that, we mean AI content will be classified, licensed, and credited to the source/training data it uses, much like how Adobe’s Firefly draws from a legally acquired dataset.

I’ve learned over the years covering movie production that it is a constantly evolving science experiment. Visionary artists like Walt Disney, George Lucas, and James Cameron embraced new technologies, molded them to tell never-before-imagined stories, and made their careers successful in the process. The naysayers who wanted to put the genie back in the bottle ultimately retired, quit, or just disappeared in the inevitability of progress.

This is an inflection point, and fortune favors the bold. So keep your eye on products as they evolve, and hold these tools up against your own workflows. If you can find a way to put AI to work for you, now’s the time!

Noah Kadner

Noah Kadner is the virtual production editor at American Cinematographer and hosts the Virtual Production podcast. He also writes the Virtual Production Field Guide series for Epic Games. Kadner has broad experience in cutting-edge production and post-production workflows and spent a number of years working internally at Apple. If you’re looking for advice or a turnkey virtual production LED volume in your facility, you can contact Noah via The Virtual Company.