AI Assistance: Boost Engagement for Free with Premiere Pro’s Speech-to-Text

If you watched the launch videos at Adobe MAX 2020 you probably noticed a few trends forming during the product demos.

Mobile and social were (and always will be) focal points, but Artificial Intelligence (AI) and Machine Learning (ML)—or Sensei as Adobe has chosen to brand them—took the stage in a number of surprising ways.

As always, a lot of the airtime was given over to Photoshop, which added a bundle of Sensei-driven tools called Neural Filters that include image upscaling, sky replacement, and portrait aging. But while turning the clock forward on your face is fun, and swapping that blown-out skyline with a stock sunset makes a landscape prettier, it’s hard to see much commercial value in these tools. For that, you should be looking at the least talked-about AI feature in Premiere Pro—Speech-to-Text.

Let’s take a look at why you might want it, how you might use it, and whether or not this machine learning tool can augment your productivity.

We don’t talk about that

Let’s take a moment to recall that this is not Adobe’s first attempt at releasing a tool for converting recorded audio into editable text. Speech Analysis was added to Premiere Pro back in 2013. It was…not great.

When I tested it back then, the best description for the results it produced would be word salad.

Premiere Pro Speech-to-Text is not Adobe’s first attempt.

But to be fair, the same was also true of other software at that time. Google’s auto-transcription for YouTube videos was just as unreliable. As one commenter put it, “in my experience it does such a bad job that the time it’d take me to correct it is considerably more than the time it’d take me to transcribe it myself.”

And that, in a nutshell, was the problem. So it wasn’t really surprising that Adobe pulled Speech Analysis from public release in 2014, and stayed silent on the matter until the fabulous Jason Levine brought it back into the spotlight in 2020.

Different strokes

The motivation for automatically generating captions is likely to depend on your business perspective.

For example, companies like Google and Facebook want it because it makes video indexable and searchable, allowing us to find content inside videos (and for them to sell ad slots based on the context).

But for video producers and distributors, the need for captions is probably coming from a different place.

Accessibility

The laws around accessibility are different across the world, but the closest we have to a global standard is the Web Content Accessibility Guidelines (WCAG) published by the World Wide Web Consortium (W3C). It’s worth noting that WCAG 2.x requires captions for prerecorded video with audio at even its lowest conformance level (Success Criterion 1.2.2, Level A), so media without captions fails the rating process outright.

In the US, the FCC has already made it a legal requirement that all TV content broadcast in America must be closed captioned, and any subsequent streaming of this content falls under the same rules.

And while it’s true that content that is uniquely broadcast over the Internet falls outside of these regulations, legislation including the Americans with Disabilities Act (ADA) has already been successfully used as the basis for lawsuits against streaming platforms like Netflix and Hulu.

So these days it’s probably safer to assume that captions are required by law in the country/state where you operate than to find out the hard way.

Social Media

While meeting accessibility requirements is an excellent justification for captioning, it’s also beneficial to audiences who don’t suffer from hearing loss, especially when it comes to video in social media.

Muted autoplay is quickly becoming the norm for video in scrolling social feeds, and it’s estimated that as much as 85 percent of video views are taking place with the sound turned off. So if you want to improve the signal-to-noise ratio of your social media content, captions are now an essential part of the process.

Global reach

And for those of us working with global markets, it’s long been known that captions are the easiest way to repurpose your film and video content for audiences who speak a different language. (Certainly a lot less involved than dubbing and ADR.)

While some translation services can work directly from the original media, offering a caption file in the original language can help to speed the process up.

There are, of course, other reasons why captions are quickly becoming an essential component of media production, and it’s not just because of the memes.

But while the needs might change from business to business, the fundamental benefit is the same—captioning your media will help you reach a larger audience. And that’s good for everyone.

Let’s start the show

To get started, open the project to be captioned in Premiere Pro and have the target sequence active in the Timeline view.

Depending on how you’ve structured your edit, a small amount of preparation might be beneficial before moving forward.

For example, if you’ve laid out multiple vocals on separate tracks, or if you have a mix of vocals and SFX/music on the same track, you should spend some time tagging vocal clips as Dialogue using the Essential Sound panel. (You can also choose to mute any unwanted tracks on the Timeline if that’s easier.) This will let Premiere Pro know which assets to include in the exported audio that it analyzes later on.

Also, if you don’t want to create captions for the entire sequence, you should set sequence In and Out points by moving the playhead to the required positions and hitting the I and O keys respectively. (Note that the Work Area Bar isn’t used for this feature.)

You can limit transcription range by setting sequence In and Out points.

When you’re ready, open the Text panel (Window > Text) and hit the Transcribe sequence button.

Your options at this point are straightforward. You can choose to export a mixdown of just the clips you’ve tagged as Dialogue, you can pick Mix to create a mixdown of the entire sequence, or you can select a specific track to export from the drop-down menu.

Select Transcribe Sequence to start the speech analysis process

At present, there’s no way to select multiple audio tracks for the mixdown, which could be irksome if you have multiple speakers on separate tracks. For now, just mute the tracks you don’t want to include and choose the Mix option.

Speech-to-Text supports an impressive selection of languages that covers most of the world’s population. Notable exceptions are Arabic, Bengali, and Indonesian, but it’s interesting to see both US and UK variants of English. (As a UK ex-pat living in Australia, the latter scores bonus points with me.) However, I can only comment on the effectiveness of the tool in English.

Speech-to-Text supports a wide selection of languages.

It’s interesting to note that Sensei’s ability to identify different speakers—which was the default behavior in the beta—now requires consent, and isn’t available in Illinois, presumably due to privacy concerns.

Speech-to-Text now requires that you opt in to PID-related functions.

The transcription process is relatively fast, with a four-minute test project featuring dual speakers taking around two minutes, and an hour-long sequence taking 24 minutes, which indicates a turnaround time of roughly 40 to 50 percent of the runtime.

But Speech-to-Text is (mostly) cloud-based and it’s impossible to predict what speeds might be like if the entire Adobe Creative Cloud membership suddenly starts chewing up Sensei’s compute cycles at the same time. That said, even if job queuing becomes necessary, you and your workstation will at least be free to make that coffee or catch up with other tasks in the meantime.

Premiere Pro’s automatic transcription at work.

Get back to work

When Sensei is finished with your audio, the Transcript tab of the Text panel will be populated with the results.

And while your mileage may vary, I have to say that I was impressed with the accuracy of the tests I ran. The beta version that I first tested was good—the public version is even better.

If you opt in to speaker profiling, Sensei recognizes multiple speakers, identifying them as Speaker 1, Speaker 2, etc. If you opt out, it will simply list Unknown next to the paragraph segments.

Either way, you can name them by clicking on the ellipsis in the left column of the Transcript tab and selecting Edit Speakers.

You can assign names to the speakers identified by Speech-to-Text.

This tool lets you manually fix instances where Sensei has incorrectly identified speakers with similar-sounding voices, and it’s worth taking the time to do this now before moving on to the caption creation stage.

The same is true for transcript cleanup. Unless you’ve been extremely fortunate with your Speech-to-Text analysis, there will be errors in your transcript. These are more likely in recordings with a more conversational delivery, background noise, non-dictionary words like company names, or multiple speakers talking across each other.

And while you’ll be able to edit the text after it’s converted to captions, you should correct the transcript before you get to the next step. This is because Premiere Pro treats the transcript and subsequent captions as separate data sources—so making changes to one will have no effect on the other.

So take the time to get your transcript right as it will be the source from which all of your captions will be created.

Getting around

Adobe has implemented some extremely useful features to help you navigate the video and transcript at the same time.

To begin with, Premiere Pro already has a Captions workspace that divides your screen into Text, Essential Graphics, Timeline, Project Bins, and Program, though you might want to tweak things to suit your preference. For me, it looks like this…

It’s worth taking some time to rearrange (and save) your workspace.

Once you’re set up, finding your way around is straightforward.

For example, moving the playhead to a new position in the timeline will automatically cue the transcript to the corresponding location, to the extent that the word being spoken at that point beneath the playhead is highlighted blue in the Transcript panel. Any text that lies ahead of the playhead position is colored gray, which is a helpful addition to the UX.

Similarly, playing or scrubbing the timeline will update the transcript view to keep pace with the playhead.

Premiere Pro will track the transcript as you play your timeline.

And it also works in reverse, so selecting any word in the Transcript panel will automatically move the playhead and video preview to the corresponding time in the sequence. It can be a little slow to respond at times—possibly because it’s talking to Adobe’s servers—but it’s a highly effective approach, nonetheless.

There’s also a Search box in the top corner, which lets you jump to words and phrases in the transcript, as well as a Replace function should you need to fix repeated errors.

Speech-to-Text’s search tool lets you locate and jump to specific points in the timeline.
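
One way to picture the link between the transcript and the playhead is as a list of words that each carry a timestamp. The sketch below illustrates that idea in Python. It is not a description of how Premiere Pro actually stores its transcript, and the `Word` type, the `find_phrase` function, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the sequence (assumed unit)


def find_phrase(words, phrase):
    """Return the start time of every occurrence of `phrase` in `words`."""
    target = phrase.lower().split()
    if not target:
        return []
    hits = []
    for i in range(len(words) - len(target) + 1):
        # Compare case-insensitively and ignore trailing punctuation.
        window = [w.text.lower().strip(".,!?") for w in words[i:i + len(target)]]
        if window == target:
            hits.append(words[i].start)
    return hits


# Hypothetical transcript data, for illustration only.
transcript = [
    Word("Welcome", 0.0), Word("to", 0.4), Word("the", 0.5), Word("show.", 0.6),
    Word("The", 1.5), Word("show", 1.7), Word("starts", 2.0), Word("now.", 2.4),
]

# Each hit is a time the playhead could jump to.
jump_points = find_phrase(transcript, "the show")
```

Because every word has its own timestamp, the same structure works in both directions: a search hit gives you a time to jump to, and a playhead time tells you which word to highlight.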

Best practice

At this stage, you’ll probably do most of your navigation in the Transcript panel; selecting a word, hitting Space to start playback, comparing what you’re hearing with what you’re reading, then stopping and double-clicking on the text to make any changes.

Based on my experience, your changes will most likely center on punctuation and sentence structure, rather than fixing incorrect words. And despite Sensei’s best efforts, you’ll still need to put in the work to get things to a caption-ready state.

And this is to be expected. Natural language processing is incredibly hard. After you factor in accents, dialect, mannerisms, tone, and emphasis, even human beings struggle with it. So expecting perfect results from a machine is unrealistic. (I’d strongly recommend you turn YouTube’s automatic captions on for the following video example.)

(This is a great alternative – https://youtu.be/Gib916jJW1o)

So approach this stage with an open mind, a fresh cup of coffee, and a comfy chair. And if you need guidance on the best practices for caption creation, you might want to read through the BBC’s subtitle guidelines first.

Also, remember that the transcript data is saved in the Premiere Pro project file, so you can come back to it later if you need to. You can also export the transcript as a separate, but proprietary .prtranscript file, though it’s not clear what the benefit of this approach might be.

Ready?

When you’re confident that your transcript is as clean as you can make it, then go ahead and hit the Create captions button.

You’ll be given a bunch of options here, including the ability to apply Styles (assuming that you’ve previously created some). You can define the maximum character length and minimum duration of your captions, set them to Double or Single line, and even set the number of frames you want to insert between them.
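
To get a feel for how these settings interact, here is a minimal Python sketch that groups timed words into captions using a maximum character length, a minimum duration, and a gap in frames. It is an illustration only, not Adobe's algorithm; the `Caption` type, the `build_captions` function, its parameter names, the frame rate, and the sample words are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start: float  # seconds
    end: float    # seconds
    text: str


def _close(group):
    """Turn a run of (text, start, end) word tuples into one caption."""
    return Caption(group[0][1], group[-1][2], " ".join(w[0] for w in group))


def build_captions(words, max_chars=42, min_duration=1.0, gap_frames=0, fps=25.0):
    """Group (text, start, end) word tuples into captions."""
    captions, current = [], []
    for word in words:
        candidate = " ".join(w[0] for w in current + [word])
        # Start a new caption when the next word would exceed max_chars.
        if current and len(candidate) > max_chars:
            captions.append(_close(current))
            current = []
        current.append(word)
    if current:
        captions.append(_close(current))

    gap = gap_frames / fps  # convert the frame gap to seconds
    for i, cap in enumerate(captions):
        # Stretch short captions to the minimum duration...
        cap.end = max(cap.end, cap.start + min_duration)
        # ...but never into the next caption, and keep the requested gap.
        if i + 1 < len(captions):
            cap.end = min(cap.end, captions[i + 1].start - gap)
    return captions


# Hypothetical word timings, for illustration only.
words = [
    ("Captions", 0.0, 0.5), ("help", 0.5, 0.8), ("you", 0.8, 1.0),
    ("reach", 1.0, 1.4), ("a", 1.4, 1.5), ("larger", 1.5, 2.0),
    ("audience.", 2.0, 2.8),
]
captions = build_captions(words, max_chars=20, min_duration=1.5, gap_frames=2)
```

Note the design choice in the second loop: when the minimum duration and the gap conflict, this sketch lets the gap win. A real tool might resolve that differently.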

If you’re not sure what you want at this stage, I’d suggest that you pick the “Subtitle” format from the drop-down, make sure that the Create from sequence transcript radio button is selected, and leave the rest at their default values.

If you’re not sure what caption format you need, stick to the default values.

I’m not going to spend a great deal of time discussing the different caption formats that Speech-to-Text offers. Partly because I’m not an expert in the differences, and you’ll know your project requirements better than I do. But mostly because it doesn’t matter that much.

This is because Premiere Pro’s Speech-to-Text keeps your transcript data intact and adds your captions to a separate track in the sequence timeline. (This is a huge improvement over Premiere Pro’s first attempt at captions, which incorporated the caption track into the video layer.)

Thanks to this, you can generate captions in as many different formats as you need. Even retroactively, should your project get sold into a territory that uses a different standard. There doesn’t appear to be a limit on how many caption tracks you can add, and the format used for each caption track is clearly labeled.

If things start to get cluttered, you can toggle track visibility using the CC button in the timeline view.

Caption tracks are labeled with their format and can be hidden/revealed with the CC button.

If you’re working with foreign language captions, this aspect of the UI could be extremely useful, as it has the potential to let you build caption layers for as many languages as you need in the same sequence timeline. There are limitations to this approach, which I’ll get to later, but speaking from personal experience, I welcome this wholeheartedly.

So go ahead and hit that Create button, and watch as your transcript is chunked up and laid out in the format of your choice.

Another round

If you have any experience in caption creation, you’ll know that good captions require a surprising amount of finesse.

It’s not as simple as breaking the dialogue into sentences and showing them on-screen for as long as it takes the speaker to say them.

You have to deconstruct what’s spoken into short, intelligible sections that can be read without drawing too much attention away from the visuals. Punctuation is incredibly important, and line breaks can mean the difference between comprehension and confusion. And to be fair, Speech-to-Text seems to do a reasonable job of this.

However, to comply with captioning standards like the FCC’s, you need to convey noise and music to the fullest extent possible. And while it’s unreasonable to expect Sensei to start labeling noises and music (for now, at least), your captioning software should allow you to incorporate information beyond dialogue.

One at a time, please

Unfortunately, Speech-to-Text is limited to a single track with no scope for overlapping elements.

This means that there’s no way to easily incorporate simultaneous speakers or add sound or music identifiers over dialogue. (I tried adding these to a second caption track, but you can only enable visibility for one track at a time.)

So if FCC compliance is needed for your project, then you might need to hand this job off to a different caption solution. But even then, you could still use Speech-to-Text to get you most of the way, then export the results to a text or SRT (SubRip) file for import into a different tool.
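
Part of what makes SRT a convenient hand-off format is that it's plain text: each caption is a numbered block with a start and end timestamp, followed by the caption text and a blank line. The Python sketch below shows that structure. The `srt_timestamp` and `to_srt` function names and the sample captions are hypothetical, and this is not how Premiere Pro writes its files.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used by SubRip."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(captions):
    """Render (start, end, text) tuples as the contents of an .srt file."""
    blocks = []
    for index, (start, end, text) in enumerate(captions, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # A blank line separates one caption block from the next.
    return "\n".join(blocks)


# Hypothetical captions, including a sound identifier added by hand.
sample = [(0.0, 1.5, "Hello"), (2.0, 3.25, "[MUSIC]")]
srt_text = to_srt(sample)
```

Since the file is just text, adding a sound or music identifier in another tool can be as simple as inserting one more numbered block.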

Split the difference

Once you get down to the business of editing the captions generated by Speech-to-Text, Premiere Pro’s workflow makes a lot of sense.

Sentences are broken into short, single-line segments that will fit on even the smallest of screens without line-wrapping. And you can choose to merge or split these further if they don’t quite work in their current state.

Adding new captions is also possible, assuming that there’s space to do so (the default for the inserted caption is three seconds, and you can end up overwriting existing captions if you’re not careful here). 

You can use Premiere Pro’s timeline tools to adjust captions in the same way as clips.

Captions also behave like any other asset in the timeline. So you can adjust their In and Out points by dragging clip handles, link them to video clips, split them with the Razor tool, or even perform slip, slide, ripple, and roll edits. 

So if you already know your way around the Premiere Pro toolset, your existing skills will stand you in good stead here.

Fixing it in post-post

There is, however, a track editing limitation that’s unique to captions.

While you can select and manipulate multiple video, audio, or image tracks at the same time, only one caption track can be active at any time. If you need to adjust multiple caption tracks in different formats, you’ll have to do it one track at a time.

But this feels like splitting hairs. Given that the captioning process typically takes place long after the edit is locked and approved, the need to make changes across multiple caption formats should be a fringe scenario.

Open or closed?

Premiere Pro offers a wide range of formatting tools for your captions, including the ability to save styles and apply them to future projects.

You can adjust font, color, shadow, outline, and background options, as well as position, text alignment, and usable caption area. And these can be assigned to individual captions, or across the entire caption track.

Closed captioning

But the extent to which you can change the appearance of your captions depends on whether you intend to deploy them as open or closed.

Closed captions are stored as separate files—also known as sidecar files—and can be toggled on and off by the viewer during playback.

Closed captions can be exported to a selection of sidecar files.

Most of the formatting for closed captions is handled by the playback system, so formatting options are limited (and Premiere Pro will only display functions that are supported by your chosen caption format). But, despite the name, closed captions are easier to change after being finalized as they’re usually a simple text or XML file.

Open captioning

In contrast, open captions are “burnt in” to the video, so they’re always visible (regardless of the playback platform or device) and you can format them however you see fit.

It also means that you can create a single version of the captioned video that will play on all video platforms.

But the trade-off here is that your captions can’t be changed without re-rendering and redistributing the entire video. And, if you’re working with multiple languages, you’d have to create entirely new videos for each language instead of a more manageable set of caption tracks.

It’s also worth noting that open captions will resize along with the video, so if your audience is looking at a piece of 16×9 media in portrait view on a mobile device, there’s a chance that your captions might become too small to read.
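
A quick back-of-the-envelope calculation shows how much burnt-in text can shrink. The Python sketch below assumes the player scales the video to fit the screen while preserving its aspect ratio; the pixel dimensions and the `displayed_text_height` function are illustrative assumptions, and real devices add pixel-density factors that are ignored here.

```python
def displayed_text_height(video_w, video_h, screen_w, screen_h, text_px):
    """Height of burnt-in text after the video is scaled to fit the screen."""
    # Fit the whole frame on screen without changing its aspect ratio.
    scale = min(screen_w / video_w, screen_h / video_h)
    return text_px * scale


# A 1920x1080 video with 54 px caption text, on a 1080x1920 phone screen.
landscape = displayed_text_height(1920, 1080, 1920, 1080, 54)
portrait = displayed_text_height(1920, 1080, 1080, 1920, 54)
```

With these example numbers, the portrait view scales the frame to 9/16 of its width, so the caption text shrinks by the same factor.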

On this basis, you might think there’d be no compelling reason to opt for open captions on your video content. But if you’re publishing to social media, then you might not want to rely on the automatic captioning tools that are currently your only option on platforms like Instagram or TikTok.

Also, some social platforms only allow you to add captions at the same time as you upload the video, which makes scheduling or auto-posting video content with captions impossible. So open captioning can still be a viable option.

Finishing up

Looking at the current version, it seems as though your export options have been reduced to an EBU N19 file or a plain-text SubRip (SRT) file; the MacCaption VANC MCC format and the Embed in output file option found in the beta are no longer available.

This isn’t as limiting as it sounds, though, as EBU serves most streaming and broadcast services, and SRT covers most online and social video platforms.

Options to export to SRT or text file can be found in the Text panel.

What we’re not seeing is the ability to export only the caption track from Premiere Pro’s export tool or Adobe Media Encoder, so you need to render out at least an audio file in order to get an XML caption file.

Given that you can export to .srt and .txt files from the Captions panel, this seems odd, and it’s likely to change in the future.

Open captions can be “burned in” to your video on export

If you want open captions, you can just pick the Burn Captions Into Video option. And of course, if you want to create multiple exports in different formats, you can queue them up in Adobe Media Encoder for batch export. Just make sure that you set the required caption track’s visibility in the timeline first.

Multiple formats can be queued for batch export in Adobe Media Encoder

What’s missing?

While testing the beta, I noted some areas where Adobe might improve this tool before releasing it to the public and, with one small exception, they’re still “missing.” So here’s my wishlist:

  1. Adjustable font size in the Transcript and Captions panels.
    The text size is currently defined by the system settings, and there are times I wanted to dial the font size up to make things easier to read while editing the transcript.
  2. Script import.
    If you’re working with scripted material, then Speech-to-Text could, in theory, skip the transcription process and focus on timing, instead. This would allow you to quickly convert what you already have into a caption-ready format. (YouTube already has this.)
  3. Custom formatting based on speaker.
    While you can identify the speakers in the transcript, there’s no way to automatically add that data to your captions. And if you’re captioning scene by scene, it might be useful to have custom caption placement for speakers who are always going to be on a particular side of the frame.

But is it worth it?

I can’t say what your experience with Premiere Pro’s Speech-to-Text might be.

Is it one-button automation for all your captioning needs? Of course not. And I believe we’re still a long way from building a system that can handle this complex and infinitely variable task without some kind of human intervention.

But for me, this tool became a standard inclusion in my toolkit before it even left beta.

If pressed, I’d estimate that it’s cut the time it takes to caption content to about a third of what it was before. It’s not the only option available—Otter.ai will export transcripts to the .srt caption format, Digital Anarchy has a Premiere Pro plugin called Transcriptive, and of course, you can pay companies to do the job for you—but all of these have a cost component, while Speech-to-Text is currently free to use.

It all comes back to that comment I included at the beginning of this article—is it easier to use Speech-to-Text than it would be to transcribe it yourself? For me, the answer is a very firm yes. So if you’re looking at finding a better way to add accessibility and greater audience engagement to your video projects, Premiere Pro Speech-to-Text is definitely worth a look.

(And if you’re looking for more content on working with audio in Premiere Pro, check out Premiere Pro Mixing Basics and Premiere Pro Audio Tools.)

Thank you to Laurence Grayson for contributing this article.

Laurence is a Sydney-based media producer and editor. After a long haul as the Head of Creative for a well-known global software company, he’s now heading a media production startup specializing in 360-degree content. He also has a patent for an augmented reality marker, which makes his mother very proud - even though she doesn't really understand what it does. Click here to check out YouThere.Media's website.

Interested in contributing?

This blog relies on people like you to step in and add your voice. Send us an email: blog at frame.io if you have an idea for a post or want to write one yourself.