CNCF Tech for Self-Service Developer Platforms

What does it mean to be self-service?

If the term self-service conjures up imagery of a gas station with self-service pumps, there is a good reason for it. I think there are a lot of parallels between the humble gas station and developer platforms. Both are gateways to infrastructure. Both can be self-service.

Gasoline is flammable. It stands to reason that some may be concerned the average person isn’t qualified to pump their own gas. Yet self-service gas pumps are ubiquitous in the United States. Consider the many features of a gas pump that make self-service possible. There are bollards to prevent someone from driving into the pump. The nozzles are different sizes for gasoline and diesel. The hoses have a special connector that disconnects if someone drives away with the nozzle still in their car to prevent fuel from spilling. There is a payment interlock on the pump to prevent someone from fueling without paying. The list goes on.

I happen to live near one of the few places in the United States that require pump attendants at gas stations: the great state of Oregon. Oregon requires pump attendants because of the safety concerns I stated above. In reality, the gas pumps in Oregon are just as capable of self-service as they are anywhere else. In fact, whenever it is too hot, too cold, too smoky, or too whatever, the Governor declares a moratorium on requiring pump attendants for the safety of the workers and allows customers to fill their tanks on their own.

Something similar exists at companies that ship software. Environments can break. The best way to break an environment is to change it. Some are understandably concerned about letting environments change without some level of oversight. However, the faster environments can be changed, the faster a company can innovate. Oversight does not scale. Usually, this oversight falls on operational teams (Ops, DevOps, SRE, etc). In many cases, these teams become the gatekeepers to change. This leads to frustration and friction between operational teams and anyone who has to go through them to ship some code. I have observed this in my own personal experience as a software engineer, and I have heard this sentiment in many of the interviews I have conducted with other software engineers.

Every company that ships software has a developer platform. Only the degree of formality around the developer platform differs from one company to the next. For example, a company that has a complicated topology of virtual machines connected together with SSH tunnels for the purpose of maintenance and upgrades has a developer platform, even if it is not very formal. If every contributor has equal privileges to make changes to the production environment, the developer platform is self-service in its most extreme form. Like a gas station without any of the safety features for self-service, it is just a matter of time before everything blows up.

The pillars of a well-governed self-service developer platform

I think it is interesting to talk about self-service governance in the context of CNCF projects. The name of the most well-known CNCF project, Kubernetes, is Greek for helmsman, pilot, or governor. It is also the root of the term cybernetics. To me, that means automation. Automation is what makes an effective self-service platform possible. Safe and simple self-service is the key to a good developer experience.

What does it mean to be well governed?

In a well-governed system, it is safe to make mistakes because the system will correct itself or prevent the mistake from happening in the first place. This is accomplished with policy automation. Policy automation creates rails instead of gatekeepers.
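
As a rough illustration of rails rather than gatekeepers, the Python sketch below shows both behaviors: preventing a mistake outright and correcting one automatically. It is not modeled on any particular policy engine. The manifest shape, the required `team` label, the `apply_policy` function, and the default memory limit are all hypothetical.

```python
import copy
from typing import List, Optional, Tuple

# Hypothetical default applied when a container omits a memory limit.
DEFAULT_MEMORY_LIMIT = "256Mi"


def apply_policy(manifest: dict) -> Tuple[Optional[dict], List[str]]:
    """Return (possibly corrected manifest, messages); None means rejected."""
    result = copy.deepcopy(manifest)  # never modify the caller's manifest
    messages: List[str] = []

    # Prevent: a workload without an owning team is rejected outright.
    labels = result.get("metadata", {}).get("labels", {})
    if "team" not in labels:
        return None, ["rejected: metadata.labels.team is required"]

    # Correct: fill in a missing memory limit instead of blocking the change.
    for container in result.get("spec", {}).get("containers", []):
        limits = container.setdefault("resources", {}).setdefault("limits", {})
        if "memory" not in limits:
            limits["memory"] = DEFAULT_MEMORY_LIMIT
            messages.append(f"corrected: set memory limit on {container['name']}")

    return result, messages
```

Nobody has to review the change by hand in either case: the submitter gets an immediate rejection with a reason, or the change goes through with the gap filled in.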

How do you make such a system?

From my perspective, this type of system needs infrastructure as code, infrastructure automation, and policy automation. These pillars of a well-governed self-service platform must sit on a foundation of simplicity. The figure below visualizes what this means.

This figure also shows the many CNCF technologies that support these pillars. At the foundation are tools for simplicity like Knative, Backstage, and Buildpacks. Policy automation is supported by Kyverno or Open Policy Agent, infrastructure automation is supported by Argo and Operator Framework, and infrastructure as code is supported by Crossplane and Helm.

A closer look at each tool

Tools for developer platform simplicity
  • Backstage
    • What is it? Backstage is a tool for building developer portals. A developer portal is the entry point to your developer platform. Backstage formalizes developer portals by bringing software cataloging, project scaffolding, and technical documentation under one roof.
    • How does it help? I think the most compelling thing about Backstage is its ability to reduce the number of decisions someone has to make to start a new project and ship it to production. It is a powerful tool for enabling replatforming of legacy code.
  • Buildpacks
    • What is it? Buildpacks abstracts away the tedium of containerizing applications.
    • How does it help? With Buildpacks, applications can be containerized without writing a Dockerfile, eliminating a significant hurdle to shipping a containerized app to production.
  • Knative
    • What is it? Knative is a serverless platform for Kubernetes.
    • How does it help? In a nutshell, serverless applications with Knative abstract away all of the details of shipping a containerized application to production. Knative functions use Buildpacks to produce images. With Knative functions, developers do not need in-depth knowledge of Kubernetes or containers. Knative functions even abstract away the need to know the HTTP library for the language used to create the function.
Tools for Infrastructure as Code
  • Crossplane
    • What is it? Crossplane is a tool for provisioning and composing infrastructure using the Kubernetes API.
    • How does it help? Infrastructure frequently lives outside of Kubernetes. For example, an organization may use managed database solutions like AWS RDS. Crossplane provides an interface to provision these types of resources with Kubernetes API objects, freeing developers from learning cloud-specific provisioning tools or tools like Terraform. With Crossplane, everything is Kubernetes.
  • Helm
    • What is it? Helm is a tool for defining, installing, and upgrading applications in Kubernetes with infrastructure as code.
    • How does it help? Helm dramatically simplifies the process of installing complicated applications.
Tools for Infrastructure Automation
  • ArgoCD
    • What is it? ArgoCD is a continuous deployment tool for Kubernetes.
    • How does it help? Big picture: combined with the other tools described in this article, like Crossplane, ArgoCD can manage not only what is deployed to other clusters but the clusters themselves. It offers one central interface to manage a developer platform. ArgoCD is the embodiment of GitOps. Infrastructure as code describes the desired state of the world, and ArgoCD ensures that the current state matches the desired state across your deployment environments.
  • Operator Framework
    • What is it? A framework for building Kubernetes operators.
    • How does it help? Operators automate the lifecycle management of an application. Think of Kubernetes operators as functions of a site reliability engineer codified into an application which manages the deployment of another application.
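
The desired-state idea behind GitOps tools and operators can be sketched as a reconcile loop. This is a toy model, not how ArgoCD or any real operator is implemented. Plain dictionaries stand in for the desired state (what the infrastructure as code declares) and the current state (what is actually running), and the `plan` and `reconcile` functions are hypothetical names.

```python
def plan(desired: dict, current: dict) -> list:
    """Compute the actions needed to make the current state match the desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: dict, current: dict) -> dict:
    """Apply the plan to an in-memory copy and return the converged state."""
    state = dict(current)
    for action, name in plan(desired, current):
        if action == "delete":
            del state[name]
        else:
            # Both "create" and "update" converge on the declared spec.
            state[name] = desired[name]
    return state


if __name__ == "__main__":
    desired = {"api": {"replicas": 3}, "db": {"replicas": 1}}
    current = {"api": {"replicas": 2}, "cache": {"replicas": 1}}
    print(plan(desired, current))
    print(reconcile(desired, current) == desired)
```

A real controller runs this comparison continuously, so drift introduced by hand is reverted on the next pass rather than waiting for someone to notice it.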
Tools for Policy Automation (AKA Governance & Compliance)
  • OPA
    • What is it? A general-purpose policy engine.
    • How does it help? If the description “general-purpose policy engine” sounds a little vague, I agree. OPA ships with its own domain-specific language (DSL) called Rego. I think an alternative description is policy-based automation. OPA requires explicit integrations with applications.
  • Kyverno
    • What is it? “Kubernetes Native Policy Management”
    • How does it help? If OPA is a general-purpose policy engine, Kyverno is a special-purpose policy engine for Kubernetes. What I find most compelling about Kyverno is the ability to enforce requirements for Kubernetes resources that would typically be scrutinized in code reviews, freeing infrastructure teams from the need to review every pull request against the infrastructure as code repository. Kyverno has many policies you can fetch off the shelf, like the best practices policies, which cover a number of common rules infrastructure teams typically want to enforce.
Tools for Observability
  • Prometheus
    • What is it? Prometheus is an ecosystem of monitoring components for time series metrics. The main component is the Prometheus server, which collects and stores time series data. Additionally, there is Alertmanager, a Pushgateway for short-lived jobs, and a myriad of metrics exporters and client libraries.
    • How does it help? Prometheus is a common, if not de facto, metrics format for Kubernetes. There are many off-the-shelf dashboards built around these metrics. You can even track cloud costs in Prometheus with tools like OpenCost.
  • Cortex
    • What is it? Multi-tenant long-term storage for Prometheus.
    • How does it help? The Prometheus server is meant for short-lived or ephemeral metrics, with a set retention period typically on the order of hours or days. Cortex offers long-term retention of Prometheus metrics with multi-tenancy in mind.
  • Thanos
    • What is it? Long-term storage for Prometheus.
    • How does it help? Thanos is very similar to Cortex. Cortex places a greater emphasis on multi-tenancy.
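
To make the Prometheus metrics format mentioned above a little more concrete, here is a small sketch that renders a counter in the Prometheus text exposition format without using any client library. The metric name `deployments_total`, its label, and the `render_counter` helper are invented for illustration. A real application would normally use one of the official client libraries, which also handle escaping of label values.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render a counter as Prometheus text exposition lines.

    samples maps a tuple of (label, value) pairs to a numeric sample value.
    Label values are assumed to need no escaping (no quotes, backslashes,
    or newlines); this sketch does not escape them.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        selector = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "deployments_total",
        "Total deployments.",
        {(("team", "payments"),): 4},
    ), end="")
```

Because the format is plain text with one sample per line, exporters, dashboards, and long-term stores can all share the same metrics.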

Who needs all of this?

While there are many ways to build a developer platform, the important thing to keep in mind is that it will develop organically if it is not developed thoughtfully. Lack of formality eventually manifests in friction. For example, when governance and compliance requirements are introduced, individuals are frequently installed as gatekeepers to maintain compliance, and the notion of self-service goes out the window. Suddenly, getting things shipped involves far too many individuals, and the development cycle slows to a crawl.

I cannot think of a single company I have worked for that would not have benefited from investing in formalizing a self-service developer platform. I have worked for companies ranging from those small enough to fit in a five-thousand-square-foot office to those large enough to have their own campus, and just about every size in between. Every company I have worked for needed and actively demanded a self-service platform to increase the rate of software delivery.

Conclusion – It’s all about the developer experience

Self-service, simplicity, and safety are all about improving the developer experience. One of the first things that drew me to Kubernetes was the fundamental reshaping of the developer experience. Even with the crude methods I used at that time to deploy to and interact with Kubernetes, I saw how much easier it was to build a developer platform to suit the needs of the organization.

Kubernetes and the CNCF have created a wellspring of tools for building a better developer experience by building a better developer platform. Choosing CNCF technology affords an organization a great deal of flexibility. With care, a CNCF-based developer platform can be portable between cloud providers with minimal changes. For organizations already invested in CNCF technology like Kubernetes, extending the adoption of CNCF projects is a no-brainer.