Calm — Healthcare and Life Sciences

About

Calm is a leading global health and wellness brand with the No. 1 app for sleep, meditation, and relaxation.

The Challenge

Limited in-house capacity left Calm’s Kubernetes environment vulnerable, and a major outage revealed the cost of under-resourced infrastructure.

As Calm scaled its wellness platform, the company realized it did not have enough internal resources to manage and evolve its Kubernetes infrastructure. To fill this gap, it engaged a senior DevOps engineer from Toptal. Soon after, Calm’s self-managed control plane failed, causing a two-day outage that underscored the risks of operating such a complex environment without additional expertise. Each hour of downtime was estimated to cost the business $40,000 in lost subscriptions.

The Solution

Toptal rebuilt Calm’s infrastructure on EKS in just three days, turning a crisis into lasting improvement.

When the outage hit, the Toptal engineer advised Calm to rebuild on Amazon EKS rather than repair the corrupted control plane. Calm agreed, and in only three days he delivered a new cluster with an AWS-managed control plane that immediately stabilized operations and reduced the risk of future failures. To ensure long-term resilience, he scripted a new networking layer in Terraform, introduced IAM authorization for secure access, and moved cluster configurations into source control. These changes gave Calm a modern, automated environment that restored confidence in production and accelerated development cycles.

The Outcome

Toptal helped stabilize production, improve reliability, and reduce Calm’s exposure to costly downtime.

The move to an AWS-managed control plane delivered immediate benefits. Stability improved across production, networking performance became more dependable, and developers experienced faster server responses, which shortened delays in deployment workflows and boosted overall productivity. Over the following six months, Calm saw its infrastructure perform with greater consistency, giving engineers the confidence to focus on growth initiatives rather than infrastructure recovery. Downtime had been estimated to cost $40,000 per hour, which puts the two-day outage alone at roughly $1.92 million. By reducing the risk of a repeat, the company lowered its operational exposure while gaining a scalable foundation for continued expansion.
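
The $1.92 million figure follows directly from the two numbers given in this case study. The short sketch below reproduces the arithmetic. It assumes losses accrue evenly around the clock for the full two days, and the function and constant names are illustrative rather than anything Calm or Toptal used.

```python
# Figures taken from the case study.
HOURLY_COST_USD = 40_000  # estimated lost subscriptions per hour of downtime
OUTAGE_DAYS = 2           # length of the control plane outage


def downtime_cost(days: int, hourly_cost: int = HOURLY_COST_USD) -> int:
    """Estimate outage cost, assuming losses accrue evenly 24 hours a day."""
    return days * 24 * hourly_cost


if __name__ == "__main__":
    # 2 days * 24 hours * $40,000 per hour
    print(f"Estimated outage cost: ${downtime_cost(OUTAGE_DAYS):,}")
```

The same function can be used to size shorter incidents, for example a single day of downtime under the same hourly estimate.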