Cloud was the blockchain of the early 2010s. If you wanted something to sound cutting edge, you would label it as "The Cloud". Now, the cloud is mature, and if you want it to sound cutting-edge you say it involves AI, VR, voice control, or of course, blockchain (we'll cover those in other posts).
The cloud made many promises, including cost-effectiveness, scalability, agility, durability, and programmatic control. Those promises were difficult and expensive to realize at the beginning of the decade, but they have become more achievable over time as the ecosystem of technologies associated with the cloud has matured and become less costly, and therefore accessible to more of our customers.
Although Earthling had been following the emergence of the cloud prior to 2010, we transitioned our hosting from local co-location to Rackspace in 2011. Rackspace provided an easy-to-use, self-service web console for spinning up cloud servers on demand, and excellent support by chat, ticket, and phone. The main advantage was that server setup time for our web application clients went from days with the co-location partners to minutes with Rackspace. However, we were still setting up single servers per application, and the pricing was only marginally less than a VMware VPS at the local co-location provider.
For the most part, we would scale servers vertically to meet capacity, and occasionally create separate servers for the database and web server. Generally speaking, customers running on these servers did not have the traffic to warrant load-balanced web servers. And while reliability was pretty good and backup mechanisms were automated, most of the maintenance, disaster recovery, and scaling was still manual. The cost of additional servers, automation tooling, and programming was still more than the cost of manual setup and maintenance.
This meant that server downtime was dealt with manually. Because an outage can originate in any of several layers (client-to-host internet connectivity, DNS, the underlying host, the OS, the web server process, the database, disk space, memory, CPU, and so on), the problem might have to be run past several people before the appropriate action was taken. Usually, the solution was a restart of the offending service, which might only take a few seconds. Rarely, there was a problem with the underlying infrastructure and a cloud server would have to be migrated, which could take significantly longer for larger applications. In either case, we weren't achieving the self-healing infrastructure promised by the cloud.
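The layered triage described above can be sketched as a script that checks each layer in order and stops at the first failure. This is an illustrative sketch, not a tool we shipped; the hostnames, port, and timeout are hypothetical.

```python
import socket
import urllib.error
import urllib.request

def diagnose(host, port=80):
    """Walk the layers that can take a site down, reporting the first failure."""
    # Layer 1: DNS -- can the hostname be resolved at all?
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return "DNS: cannot resolve host"
    # Layer 2: network/process -- does anything accept a TCP connection?
    try:
        with socket.create_connection((ip, port), timeout=5):
            pass
    except OSError:
        return "Network: host unreachable or web server process down"
    # Layer 3: application -- does the app return a sane HTTP response?
    try:
        urllib.request.urlopen(f"http://{host}:{port}/", timeout=5)
    except urllib.error.HTTPError as e:
        if e.code >= 500:
            return "App: server error (check database, disk space, memory)"
        return "OK"  # a 4xx still means the web server itself is responding
    except urllib.error.URLError:
        return "App: no HTTP response (check web server process)"
    return "OK"
```

Even a crude check like this localizes the fault to a layer, which is most of the work of deciding who needs to act.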
In 2014 and 2015, we implemented load-balanced solutions for customers who required high availability, and even worked on some more advanced cloud-based data processing workflows that followed a queue-based worker model. Those load-balanced solutions were still constructed somewhat manually, often required non-trivial work to handle replication, and database failover was manual. Still, these solutions were becoming more affordable.
In 2016, however, we started to see the technology and economics shift to the point where we reconsidered the notion that small sites were fine on single servers because they don't need load balancing for performance and high availability is too costly. We found ourselves managing a menagerie of cloud servers of different vintages across both Rackspace and Amazon, for multiple clients with varying uptime and performance requirements, and we started to account for the time spent dealing with hosting support requests. Downtime, though usually brief, was killing our development teams.
Meanwhile, cloud infrastructure was cheap and getting cheaper. On the technology side, key components for containerizing applications and orchestrating those containers across multiple hosts were becoming widely available as open source offerings, reducing the cost of automation. Cloud database-as-a-service offerings were also maturing: Amazon's RDS and then Aurora offerings decreased the setup and maintenance cost of running single- and multi-node database clusters to near zero. We also hired our first specialized DevOps Architect, so that we could apply these technologies for more of our clients.
Our current operating paradigm holds that all clients deserve high availability, because downtime costs everyone money, and the cost of achieving a target level of availability (99.9% in our case) is now economically feasible even for relatively small sites.
We spent the fall of 2016 architecting a new multi-tenant containerized hosting platform and released it in the spring of 2017. We have continued to migrate our clients from our legacy cloud hosting solutions to the new platform throughout the year and will likely continue to do so through 2018.
On the new platform, sites are containerized using Docker. An orchestrator based on Rancher, an open source orchestration system, ensures that containers are running on at least two hosts. Hosts in autoscaling groups are configured programmatically using AWS CloudFormation and Salt. A load balancer directs traffic between the two containers, and the containers talk to a database with a read replica that supports automatic failover. Several other components make the system work, including Redis for session caches and network volume mounts for shared file storage. We also have a caching and security layer in front of the load balancer, and a deployment pipeline that builds the apps/sites from source control.
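The core idea behind "ensures that containers are running on at least two hosts" is a reconciliation loop: compare the desired replica count with what is actually running and schedule containers to close the gap. The sketch below illustrates that idea in miniature; it is not Rancher's API, and the `Host` model, service name, and scheduling policy are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    containers: set = field(default_factory=set)

def reconcile(service: str, hosts: list, min_replicas: int = 2) -> list:
    """One pass of an orchestrator-style reconciliation loop: make sure
    `service` runs on at least `min_replicas` distinct hosts."""
    running = [h for h in hosts if service in h.containers]
    actions = []
    # Prefer hosts not yet running the service, least-loaded first.
    candidates = sorted((h for h in hosts if service not in h.containers),
                        key=lambda h: len(h.containers))
    for h in candidates:
        if len(running) >= min_replicas:
            break
        h.containers.add(service)  # in a real system: start a container here
        running.append(h)
        actions.append(f"start {service} on {h.name}")
    return actions

# If a host dies (taking its replica with it), the next pass restores the count:
hosts = [Host("host-a", {"web"}), Host("host-b"), Host("host-c")]
print(reconcile("web", hosts))
```

Run continuously, a loop like this is what turns "restart the offending service" from a manual support ticket into routine, automatic behavior.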
All of this was designed to keep downtime to a minimum and make recovery as automatable as possible when part of the system goes down. Now that we have automation, scalability, agility, durability, and cost-effectiveness, we are finally achieving the promises of the cloud.