An Open Letter to Our Customers
In response to the recent outages Frame.io has experienced, I wanted to personally acknowledge the issues you’ve encountered and transparently share the steps we’re working on to improve the situation.
First, it goes without saying that our product’s uptime and availability are absolute top priorities. We know many of our customers depend on Frame.io for time-sensitive, mission-critical work.
Over the past several months, the number of downtime incidents has increased, resulting in multiple outages that have sometimes lasted for hours. This is completely unacceptable.
What’s been going on?
The scale that our product needs to handle has increased steadily over the years and sharply over the past 12 months. Some of the early architectural patterns of our product were not able to handle this new scale, and we have been hard at work rebuilding critical parts of our systems.
Behind the scenes, the Frame.io engineering team has been architecting significant enhancements to our backend, which will prepare the product for the next level of scale and prevent the downtime incidents we’ve been experiencing over the past several weeks.
While not the type of information we typically share with customers, I’m including a detailed list of initiatives that are in flight to improve the situation.
What we’ve already addressed
- We have expanded our media pipeline and transcoding capacity by incorporating resources across multiple regions.
- We’ve made improvements to performance and transactional behavior around uploads. We’ve also moved the asynchronous events generated by uploads onto a new, dedicated job system.
- We’ve deployed improvements to the performance of our socket service that powers updates and presence across our applications, reducing unneeded load on our infrastructure.
- We’ve made significant reliability and performance improvements to the archiving process while upgrading it to our new job system.
- We’ve optimized database performance on digest jobs, which provide consolidated notifications for your projects.
What’s in progress
Today we have multiple work streams targeting ongoing improvement and are working across several engineering disciplines concurrently. The majority of the efforts focus on a few key themes:
Database infrastructure improvements
We’re making a significant overhaul of our database infrastructure, including improving and tuning our connection pooling, multiplexing, and caching configurations to ensure we’re getting the most out of our data layer. This will reduce load and shorten the time it takes to complete requests.
We are currently in beta with a number of performance improvements to API subsystems such as storage calculation, which reduce overall load for upload and archival workflows. We are also moving other event-heavy workloads, such as asset management operations, onto our new job system, and working on the next revision of our activity and bundling architecture.
Expanding infrastructure capacity across regions
We are continuing to expand on the work we began when we moved our media pipelines across regions, and will be looking into further separating our infrastructure for storage, data, and compute across regions to improve both capacity and resiliency.
Underneath these themes we presently have nine work streams in progress, and our infrastructure and backend engineers remain focused on resolving the issues that you’ve experienced recently.
I will be actively involved to ensure that we restore our performance to the level of reliability you’ve counted on for years. Thank you for your ongoing patience and support as we work toward reestablishing your trust in Frame.io.