Kubernetes Operators - Part 1
One of the many great things about Kubernetes is its customizability by default - it is pretty easy to make Kubernetes work for you. :) At work we were dealing with clients who needed to automate the creation of multiple app instances, and this is exactly where the operator model shines. It lets us easily manage the lifecycle of those app instances, fully automated. And the bonus: it is also a pretty good way to understand how Kubernetes works under the hood.
This is the very first article in a series where I want to show you what operators are all about and how to use them. This article is meant to give a rough theoretical overview. The following parts will then be more practical – showing you in a tutorial style how to build an operator that automatically deploys example apps.
So let’s get started with the theory :)
What are Kubernetes operators?
The official Kubernetes documentation gives us the following definition:
Kubernetes is designed for automation. Out of the box, you get lots of built-in automation from the core of Kubernetes. You can use Kubernetes to automate deploying and running workloads, and you can automate how Kubernetes does that.
Kubernetes’ operator pattern concept lets you extend the cluster’s behaviour without modifying the code of Kubernetes itself by linking controllers to one or more custom resources. Operators are clients of the Kubernetes API that act as controllers for a Custom Resource.
In short: to build an operator you define a Custom Resource Definition (CRD) to model the API and implement a controller that watches Custom Resource (CR) instances and performs the reconciliation logic to create or update the underlying Kubernetes resources.
What operators can be used for
Basically, they can help us conduct a broad range of routine tasks we want to avoid doing manually, e.g.:
- deploying an application on demand (like we needed at work :))
- taking and restoring backups of an application’s state
- handling upgrades of an application’s code alongside related changes such as database schemas or extra configuration settings
- simulating failures in all or parts of your cluster to test its resilience
How operators can be built
The nice thing is: operators can be built in basically every language / runtime for which there is a client for the Kubernetes API. So if you are eager to build it in C or .NET – well, you could ;) As far as I know, many operators are built in Go or Python, but you could also use Haskell, Java, JS, Perl or Ruby. There are even community-driven client libraries.
When scanning the internet looking for blogs with operator examples, you will quite likely stumble across kubebuilder and kopf, but there are more depending on the language you are using. More details can be found here.
The bits and pieces - CRD, CR & reconciliation logic
As mentioned above, there are three main components to look at when talking about building an operator. Let’s have a closer look.
Custom Resource (CR)
In general, a resource within Kubernetes is nothing more than an endpoint in the Kubernetes API which stores a collection of objects of a specific kind – for example, the Pod resource contains the collection of all Pod objects in your cluster.
When building operators, we define a custom resource, so we are adding our own type as an extension to the Kubernetes API. You can think of it as inventing a new first‑class citizen next to Pods, Deployments & friends – for example an AppInstance, a BackupJob or a Database.
The custom resource itself is just data. It only describes the desired state. On its own it does not create any Pods, Deployments or other workloads - that is where the controller comes in.
A very small (and simplified) example of a custom resource could look like this:
apiVersion: apps.example.com/v1
kind: AppInstance
metadata:
  name: demo-app
spec:
  replicas: 3
  image: ghcr.io/example/demo:latest
This object says: “Dear cluster, I want an AppInstance called demo-app with 3 replicas running this image.” It is then up to the operator to make that happen.
CustomResourceDefinition (CRD)
So where does Kubernetes learn what an AppInstance even is? This is what the CustomResourceDefinition (CRD) is for.
The CRD is another Kubernetes resource that:
- registers your new type (group, version, kind) with the API server
- defines what fields are allowed in spec and status (via an OpenAPI schema)
- optionally adds validation rules, defaults and extra behaviour (like subresources)
Only after the CRD is created will the API server accept and store objects of that new kind. You can then run kubectl get appinstances just like you would run kubectl get pods.
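To make the blueprint idea a bit more tangible, here is a rough sketch of what a CRD for the AppInstance example might contain. I wrote it as a Python dict instead of YAML so that it can be checked in code. The group, kind, plural name and the two spec fields come from the example above; the scope, the required field, the minimum value and the validate_spec helper are my own assumptions for illustration. The real schema check is done by the API server, not by code like this.

```python
# Hypothetical CRD for the AppInstance example, mirroring the layout of an
# apiextensions.k8s.io/v1 manifest as a plain Python dict.
APP_INSTANCE_CRD = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    # The CRD name has to be <plural>.<group>.
    "metadata": {"name": "appinstances.apps.example.com"},
    "spec": {
        "group": "apps.example.com",
        "scope": "Namespaced",  # assumption: AppInstances live in a namespace
        "names": {
            "kind": "AppInstance",
            "singular": "appinstance",
            "plural": "appinstances",
        },
        "versions": [
            {
                "name": "v1",
                "served": True,
                "storage": True,
                "subresources": {"status": {}},
                "schema": {
                    "openAPIV3Schema": {
                        "type": "object",
                        "properties": {
                            "spec": {
                                "type": "object",
                                "required": ["image"],  # assumption
                                "properties": {
                                    "replicas": {"type": "integer", "minimum": 1},
                                    "image": {"type": "string"},
                                },
                            },
                            "status": {
                                "type": "object",
                                "x-kubernetes-preserve-unknown-fields": True,
                            },
                        },
                    }
                },
            }
        ],
    },
}

PY_TYPES = {"integer": int, "string": str}


def validate_spec(crd: dict, spec: dict) -> list:
    """Toy stand-in for the API server's schema check; returns error messages."""
    version = crd["spec"]["versions"][0]
    schema = version["schema"]["openAPIV3Schema"]["properties"]["spec"]
    errors = [f"spec.{f} is required" for f in schema["required"] if f not in spec]
    for field, value in spec.items():
        rule = schema["properties"].get(field)
        if rule is None:
            continue  # unknown fields are simply ignored in this sketch
        if type(value) is not PY_TYPES[rule["type"]]:
            errors.append(f"spec.{field} must be of type {rule['type']}")
        elif "minimum" in rule and value < rule["minimum"]:
            errors.append(f"spec.{field} must be >= {rule['minimum']}")
    return errors
```

Calling validate_spec with the spec of the demo-app example returns an empty list, while a spec without an image or with replicas set to a string returns one message per problem.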
In short:
- CRD: the blueprint / type definition (what is valid?)
- CR: a concrete instance of that type (what do we want right now?)
Reconciliation logic
Now that we can store our desired state, we need a piece of code that watches these custom resources and turns them into real Kubernetes objects. Hello, controller! 👋
The heart of a controller is the reconciliation loop. Very simplified, it does something like this:
- Watch for changes to your custom resources (create, update, delete).
- For each change, read the latest desired state from the CR.
- Look at the actual state in the cluster (existing Deployments, Services, PVCs, …).
- Compute the difference and apply the necessary changes so that actual state matches desired state.
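The four steps above can be sketched in a few lines of Python. This is only a toy model under some loud assumptions: the cluster is a plain dict that maps a name to the settings of one Deployment, the watch is simulated by a list of CR objects, and deletions are left out. A real controller would talk to the Kubernetes API instead.

```python
def reconcile(cr: dict, cluster: dict) -> str:
    """One reconcile pass for a single AppInstance (in-memory toy model)."""
    name = cr["metadata"]["name"]
    # Step 2: read the latest desired state from the CR.
    desired = {"replicas": cr["spec"]["replicas"], "image": cr["spec"]["image"]}
    # Step 3: look at the actual state (None means there is no Deployment yet).
    actual = cluster.get(name)
    # Step 4: compute the difference and apply the necessary change.
    if actual is None:
        cluster[name] = dict(desired)
        return "created"
    if actual != desired:
        cluster[name] = dict(desired)
        return "updated"
    return "unchanged"


def run(events: list, cluster: dict) -> list:
    """Step 1, simulated: a real controller receives these CRs from a watch."""
    return [reconcile(cr, cluster) for cr in events]
```

Feeding the same demo-app CR into run twice against an empty cluster gives "created" followed by "unchanged"; changing spec.replicas afterwards gives "updated".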
Important: this is not a one‑time “if‑else” script. The controller runs this reconcile step over and over again. If someone manually deletes a Deployment that belongs to your AppInstance, the next reconcile run will notice the drift and recreate it. This is what makes operators self‑healing and robust.
Good reconciliation logic is:
- idempotent - running it twice has the same effect as running it once
- level‑based - it aims for a final state instead of reacting to single events only
- focused - it only touches the resources that belong to its CRs
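Here is a small sketch of how these three properties might look in code. The pods are plain dicts with a made-up owner field, which is my simplification of how Kubernetes tracks ownership. The function decides only from the current state (level-based), leaves foreign pods alone (focused), and returns the same result when run again (idempotent).

```python
def reconcile(owner: str, desired_replicas: int, pods: list) -> list:
    """Return the pod list after one pass; only pods owned by `owner` change."""
    mine = [p for p in pods if p["owner"] == owner]
    others = [p for p in pods if p["owner"] != owner]  # never touched
    used = {p["name"] for p in mine}
    index = 0
    # Level-based: look at how many pods exist now, not at past events.
    while len(mine) < desired_replicas:
        name = f"{owner}-{index}"
        if name not in used:
            mine.append({"owner": owner, "name": name})
            used.add(name)
        index += 1
    # Scale down if there are too many pods.
    return others + mine[:desired_replicas]
```

Running it twice in a row yields the same list, and removing one of the owned pods by hand is repaired by the next pass, while pods of other owners stay exactly as they were.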
Best practices
There is a lot more to building solid operators than just wiring up a CRD and a loop. A few best practices that I personally find very helpful (and that are also highlighted in this great article from Google Cloud):
- Keep the API simple and focused: Start with the minimum set of fields you really need. It is much easier to add options later than to remove them.
- Separate spec and status cleanly: Users should only write to spec. Your operator writes to status to report what is going on (conditions, observed generation, last backup timestamp, …).
- Handle failures explicitly: Things will fail - images are not found, PVCs cannot be bound, etc. Reflect that in status and surface clear error messages instead of just crashing the controller.
- Use finalizers for clean‑up: When a CR is deleted, finalizers give your operator a chance to clean up external resources (for example cloud databases or buckets) before the object is finally removed.
- Be a good Kubernetes citizen: Reuse existing building blocks where possible (Deployments, StatefulSets, Jobs) instead of inventing your own scheduling or storage logic.
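The finalizer point is probably the least intuitive one, so here is a minimal sketch of the usual flow. It relies on one fact about Kubernetes: when an object that still has finalizers is deleted, the API server only sets metadata.deletionTimestamp and keeps the object until the finalizers list is empty. The finalizer name and the delete_external callback are hypothetical, and the CR is just a dict here.

```python
FINALIZER = "apps.example.com/cleanup"  # hypothetical finalizer name


def handle_finalizer(cr: dict, delete_external) -> str:
    """Add the finalizer while the CR is alive, clean up once it is being deleted."""
    meta = cr["metadata"]
    finalizers = meta.setdefault("finalizers", [])
    if meta.get("deletionTimestamp") is None:
        # The CR is not being deleted: make sure our finalizer is registered.
        if FINALIZER not in finalizers:
            finalizers.append(FINALIZER)
            return "finalizer added"
        return "nothing to do"
    if FINALIZER in finalizers:
        # Deletion was requested: clean up first, then release the object.
        delete_external(meta["name"])
        finalizers.remove(FINALIZER)
        return "cleaned up"
    return "already cleaned up"
```

Because the finalizer is only removed after delete_external has returned, a crash in the middle of the clean-up leaves the finalizer in place and the next reconcile run simply tries again. That also means delete_external should itself be safe to call more than once.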
Following these principles keeps our operator understandable - both for our future self and for everyone else running it.
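As a small illustration of the spec / status split and of reporting failures instead of crashing, this is roughly what an operator could write into status after each reconcile run. The condition type Ready and the reason strings are assumptions I picked for this sketch; the field names follow the common Kubernetes convention for conditions.

```python
def build_status(cr: dict, error: str = None) -> dict:
    """Build the status block for one reconcile run (sketch, not a full API)."""
    return {
        # Tells users which version of the spec this status refers to.
        "observedGeneration": cr["metadata"]["generation"],
        "conditions": [
            {
                "type": "Ready",
                "status": "False" if error else "True",
                "reason": "ReconcileFailed" if error else "ReconcileSucceeded",
                "message": error or "all resources are in the desired state",
            }
        ],
    }
```

A failed run, for example build_status(cr, error="image not found"), then shows up as a Ready condition with status "False" and a readable message, which is what users will see when they describe the object.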
Summing up
Those were the basics, and all of it quite theoretical. Now it is time to get our hands dirty, so the next step (and the next article in this series) is building a very basic operator that deploys a Hello World application into our cluster.
*image created by buddy ChatGPT
