Shutting down staging makes production stronger

Running servers in the cloud ain’t cheap. It’s one of those operational costs that sneak up on you.

It always starts small: we need a web server. And a database. And a Redis cache. And High Availability. And two staging environments.

Before you know it, the monthly cloud provider bill maxes out your credit card.

While production environments are often required to run 24/7, staging environments are severely underused. If a development team works between 8h and 19h, that’s 11 hours of activity per day, or 55 out of the 168 hours in a week. Only running staging servers during those 55 hours yields a cost saving of roughly 67%. For teams with a lot of staging servers, this adds up quickly!
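For anyone who wants to plug in their own working hours, here’s that back-of-the-envelope math as a tiny Python snippet:

    # Back-of-the-envelope savings: staging only runs during working hours
    # (11 hours a day, 5 days a week) instead of 24/7.
    hours_per_week = 24 * 7       # 168
    active_hours = 11 * 5         # 55, assuming an 8h-19h working day
    saving = 1 - active_hours / hours_per_week
    print(f"{saving:.0%}")        # roughly 67%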

But the cost reduction is only one of the benefits. Stopping and starting your environment on a schedule is also a great way to reduce complexity.

Turning off a single server is trivial. Rebooting a set of interconnected services is often far more complicated. What happens to events that are still being processed? Are there race conditions when the system boots?

When teams investigate whether they can shut down their staging environments, they are often surprised by how brittle their system is. A lot of the complexity in modern software development is hidden in the interplay between services. The Kafka cluster needs to start before the web server. Read models need to be rebuilt because data gets lost when the system shuts down. Transactionality between microservices turns out to be flaky. Someone needs to manually requeue messages from the Dead Letter Queue.

If such a system ever has to reboot because of an incident, it’s going to be a Saturday to remember.

All of these flaws are opportunities to improve the overall stability of the system. By addressing these “interplay problems”, we simplify our architecture and put the system in a position where it can stop and start on its own without breaking.
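What does fixing an interplay problem look like in practice? As a rough sketch (the broker address and launch command below are made up), a small startup wrapper can make the “Kafka before the web server” dependency explicit instead of relying on luck with boot order:

    # Sketch: wait until the Kafka broker accepts connections before launching
    # the web server. Hostname, port, and launch command are placeholders.
    import socket
    import subprocess
    import sys
    import time

    KAFKA_HOST, KAFKA_PORT = "kafka.staging.internal", 9092   # assumed broker address
    WEB_SERVER_CMD = ["python", "-m", "myapp.server"]          # assumed launch command

    def wait_for(host: str, port: int, timeout_s: int = 120) -> bool:
        """Poll a TCP port until it accepts connections or the timeout expires."""
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            try:
                with socket.create_connection((host, port), timeout=2):
                    return True
            except OSError:
                time.sleep(2)  # broker not up yet, retry shortly
        return False

    if __name__ == "__main__":
        if not wait_for(KAFKA_HOST, KAFKA_PORT):
            sys.exit(f"Kafka at {KAFKA_HOST}:{KAFKA_PORT} never became reachable")
        # Only start the web server once its dependency is actually available.
        sys.exit(subprocess.call(WEB_SERVER_CMD))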

If you have a brittle system, shut down the staging environment and try to start it again. List all the issues you run into and fix them over time. Once you can reboot manually, it’s time to schedule a weekly maintenance window. Every Wednesday at 19h the system goes down. Every Thursday at 8h it goes back up again. After a few weeks of running like this, you’ve built up confidence and can set up an automated schedule for weeknights and weekends.
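The automation itself doesn’t need to be fancy. As a minimal sketch, assuming the staging servers are EC2 instances tagged Environment=staging, a script like this can be triggered from cron or any scheduler with “stop” at 19h and “start” at 8h:

    # Sketch: stop or start every instance tagged as part of staging.
    # Assumes AWS EC2 and an "Environment" tag with value "staging".
    import sys
    import boto3

    ec2 = boto3.client("ec2")

    def staging_instance_ids() -> list[str]:
        """Find every instance tagged as part of the staging environment."""
        reservations = ec2.describe_instances(
            Filters=[{"Name": "tag:Environment", "Values": ["staging"]}]
        )["Reservations"]
        return [i["InstanceId"] for r in reservations for i in r["Instances"]]

    if __name__ == "__main__":
        action = sys.argv[1]  # "stop" or "start"
        ids = staging_instance_ids()
        if action == "stop":
            ec2.stop_instances(InstanceIds=ids)
        elif action == "start":
            ec2.start_instances(InstanceIds=ids)

The same idea works with managed databases or container services; the point is that the whole environment goes down and comes back up without a human in the loop.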

Every interplay problem that gets introduced from then on gets caught the very next morning. The hardening of your staging environment rubs off on production.

Cost savings, reduced complexity, and increased robustness.

What’s not to like?