When a node failed in production for one of our biggest clients, it didn’t result in multiple support tickets, hours of manual tracking, or development work to correct the issue. Instead, the system healed itself within seconds of the outage without a single service disruption. To simply say that this was a win for our team would be an understatement. This level of continuity is the result of years of design and development work spent planning for exactly this type of situation. In this blog post, Kent Brown, CTO, and Mark McWilliams, Director of Engineering, share how we approach design for high availability and resiliency at scale in PortX.
Our team spent several years meticulously engineering our own internal API hosting platform to be highly reliable and self-healing because we had to. Aside from following industry best practices, it was the only way we could enable API calls to PortX at scale while maintaining a relatively small development team. Today, we leverage that same hosting platform to deliver high availability (HA) for our customers to host their APIs. PortX runs Kubernetes at scale across dozens of clusters, processing millions of transactions per day. For one customer alone, the platform processes 2 million API requests daily. We have maintained our service quality objectives without a single outage caused by an infrastructure problem in over 2 years since going into production.
Here, we’ll share about our team’s design philosophy, the industry best practices we follow, and what you can expect down the road from the PortX Platform. To start, we feel it’s important to address a misconception in the industry about cloud-native software.
The myth of the cloud
Cloud-based infrastructure tools have introduced countless new opportunities and capabilities for the software industry. These tools are like a giant set of Legos to be handcrafted into something beautiful. The key is how to go about piecing them together. If you choose the next block based on which one is closest or the prettiest color, you will end up with something that either doesn’t function or can’t stand up on its own. In the same way, just putting up a server or deploying a database in the cloud doesn’t guarantee high availability in the slightest.
We have worked with customers who architected clever solutions by migrating their infrastructure to the cloud. Over time, however, they realized the solution was too complex for their team to manage alone. We discovered multiple reliability and security issues, and their uptime percentage was dropping as a result.
Another misconception is that spinning up Amazon RDS (Relational Database Service) means a database will never go down. Not so. Postgres is still Postgres, and it is single-master, which means there will be a brief outage for every upgrade, maintenance window, or hardware failure.
Cloud-hosted solutions sound easier to manage in theory. The reality is that the stack is only getting more complex. While it’s true that cloud architecture solves many industry challenges, it also raises the level of sophistication and skillset required to manage it well. Your developers must be intimately familiar with tools like Docker, Kubernetes, Terraform, Helm, and all of the other layers of abstraction. This is especially true for building financial-grade software.
“IAP” (Infrastructure as Philosophy)
Philosophically, our approach is to start by defining all of the points of failure. Then we design and engineer to ensure that those failure points can’t break the system. We follow this thought process in everything we do on the product team. Admittedly, this means that we have to design with a bit of paranoia over everything that could go wrong. That is why everything we build is self-healing. If an instance is terminated mid-task, AWS will provision a new one, and that instance will bootstrap itself back into the same state it was in when the termination occurred.
When we designed the platform, we implemented HA at all levels of the stack. Thankfully, Kubernetes does a great job of managing HA at the container level. But there are a lot of other infrastructure pieces to PortX. All PortX environments span at least two availability zones for redundancy, and each service runs a minimum of two replicas in each cluster, allowing it to fail over automatically and ensuring high availability and self-healing.
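PortX’s actual manifests aren’t shown here, but a minimal sketch of what a two-replica, zone-spread service looks like in Kubernetes might be (all names are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api          # illustrative name, not an actual PortX service
spec:
  replicas: 2                # minimum of two replicas per cluster
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      # Spread replicas across availability zones so the loss of one
      # zone cannot take down every copy of the service.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: example-api
      containers:
        - name: api
          image: example/api:1.0   # placeholder image
```

With a spec like this, Kubernetes replaces any terminated replica automatically and keeps the copies in separate zones, which is what makes the “containers can come and go” model work.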
Yes, we are a little paranoid. But we sleep well at night.
It’s all in the [many] details
There is no one thing you do to ensure resilience. There is no single pattern or server configuration that makes a system highly available. It’s a holistic and comprehensive way of thinking through every layer of the stack.
Deployment with Flux
If you were to take a look at our PortX environments IaC (infrastructure as code) GitHub repo, you would see a beautiful history of every single deployment we’ve made into production. This is because we use Flux for our deployments. With all infrastructure configuration in GitHub, Flux makes it easy to roll back a deployment if something goes wrong. All of PortX’s Flux repos act as our audit logs because we practice GitOps for everything. GitOps doesn’t just mean making changes in Git and hoping the right things happen; it means a tool like Flux continuously reconciles the running system against what Git declares. Check out the definition published by Weaveworks, the company that created Flux. We also touch on GitOps in the PortX overview blog and will dive deeper into it in an upcoming post.
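To make the reconciliation loop concrete, here is a sketch of the two Flux resources that drive this pattern. The repo URL, names, and paths are placeholders, not PortX’s actual configuration:

```yaml
# A Flux source pointing at an infrastructure repo (URL is a placeholder).
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: portx-infra
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example-org/portx-infra
  ref:
    branch: main
---
# Flux continuously reconciles the cluster against this path in Git.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: portx-infra
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: portx-infra
  path: ./clusters/production
  prune: true                # remove resources that were deleted from Git
```

Because the cluster always converges on what the branch declares, a rollback is just a `git revert`, and the Git history doubles as a deployment audit log.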
Highly available infrastructure on demand
Our financial services customers expect the PortX platform to be highly available and secure. Running in the cloud means we need to handle sudden compute instance termination, network and storage failures, routine cloud maintenance, and regular security threats. This means identifying failure modes and creating redundancy through all levels of the technology stack.
By leveraging the Kubernetes Operator pattern, PortX offers customers a suite of common cloud architecture components like SQL databases, caches, and message brokers. These components provide our developers and tenants with pre-built, highly available implementations of common infrastructure, such as databases, Redis, and DNS, following best practices for security, redundancy, and observability. We’ve found this combination of GitOps and the Operator pattern to be a big win: it enforces infrastructure best practices while letting developers create and manage their own infrastructure alongside their other Kubernetes resources.
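PortX’s internal CRDs aren’t public, so as an illustration only, requesting a highly available database through an operator typically means declaring a custom resource like this hypothetical one and letting the operator do the provisioning:

```yaml
# Hypothetical custom resource; the API group, kind, and fields are
# illustrative, not PortX's actual schema.
apiVersion: infra.example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  instances: 3               # operator provisions a primary and two standbys
  storage: 50Gi
  backups:
    schedule: "0 2 * * *"    # nightly backups, handled by the operator
```

The developer commits a few declarative lines; the operator encodes the hard parts (replication, failover, backups, monitoring) once, and every instance inherits them.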
With containers and Kubernetes, we’ve become used to the idea that individual containers can come and go and the redundancy features of Kubernetes will ensure that the application remains available. But what about the cluster itself? Even a cluster configured with highly available nodes is susceptible to human error or other configuration issues. We prefer to treat Kubernetes clusters as immutable, replaceable resources that can come and go at any time, just like a container. This greatly reduces the risk and burden of cluster-wide changes such as Kubernetes version upgrades.
Mirrored environments in dev and prod
Additionally, we maintain dev and prod environments that are identical. By doing so, we are able to work out any pains in dev so that we don’t have to find them in production. For this reason, we consider all of our environments to be production from a technical standpoint. To our team, there is no difference between these environments and, as a result, no surprises when moving from dev to stage to prod.
If something is wrong, we want to know about it before our customers do. Our monitoring proactively tells us what is wrong without requiring us to go looking for it. Every PortX environment has a dedicated Slack channel where all alerting and monitoring is logged for each deployment. The idea is to provide a single source of information for the people responsible for issues in a given environment; the information comes to them. Traditional logging requires sifting through data to find the answer, whereas error management sends you exceptions with the relevant context intact. For this reason, our team prefers instrumenting applications with automated error management solutions in addition to traditional logging.
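The post doesn’t name a specific alerting stack, but as one common way to implement per-environment Slack routing, a Prometheus Alertmanager configuration might look like this (webhook URL and channel are placeholders):

```yaml
# Illustrative Alertmanager config, not PortX's actual setup.
route:
  receiver: env-slack
  group_by: ['alertname', 'namespace']
receivers:
  - name: env-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook
        channel: '#portx-prod-alerts'                   # per-environment channel
        send_resolved: true    # post a follow-up when the alert clears
```

Routing every alert for an environment into one channel gives on-call engineers a single place to watch, rather than a dashboard they have to remember to check.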
High availability is expensive. It took years’ worth of development from our team. We traded the quick and easy profits of a haphazardly built product for the confidence of building financial-grade infrastructure from the ground up. And cloud-native architecture and services have allowed us to build something that couldn’t be done with traditional software development. If you deploy on PortX, you receive the benefit of many years of investment in high availability following cloud best practices, so you can focus on your application. Today, our team is working to fine-tune PortX so you can dial reliability up or down to balance cost against your actual reliability requirements.