Dedicate Karpenter Nodes for Specialized Apps

We will learn how to use affinity, taints and tolerations with Karpenter NodePools/Provisioners. The goal is to dedicate Kubernetes nodes to specialized workloads.


Recently I shifted one of my EKS clusters to autoscale completely using Karpenter, keeping only one node group to run the Karpenter controller pods. It's a great alternative to Cluster Autoscaler, especially if you are entirely on AWS and do not want to worry about right-sizing your clusters. Here are the things I like:

  1. Easy provisioning of reliable spot capacity.
  2. A wide array of instance types matched to the specific requests of pods.
  3. Different configurations (NodePools/Provisioners) for different scaling needs.
  4. Automatic node updates.

Specialized Workloads

When do we need specialized nodes? Imagine all your application workloads are stateless and you are running them entirely on spot instances. Now you want to deploy an observability stack in this cluster. You definitely want this stack to be highly available, so you choose to provision it on on-demand instances.

The problem is that, by default, the Kubernetes scheduler will try to schedule pods on any node that has available capacity. It doesn't discriminate between spot and on-demand nodes. Thus, your application pods will start landing on on-demand nodes and your observability stack pods on spot nodes, and there's nothing to stop this behavior. Or is there?

Affinity, Taints and Tolerations

Kubernetes provides the ability to bind pods to specific nodes using the concepts of affinity, taints and tolerations.

Affinity is a property of a pod that makes it attracted to a node; anti-affinity does the opposite. It comes in two forms:

  1. Node affinity: the pod spec contains the labels of the node it wants to be scheduled on.
  2. Pod affinity: the pod spec contains the labels of the pods it wants to be scheduled alongside.

Note that affinity and anti-affinity do not prevent other pods, ones without this spec, from being scheduled on the same nodes. So, affinity is useful when you want to schedule pods on specific nodes based on availability zones, instance types (GPU, compute- or memory-intensive), and so on. It is not very useful in a case like the observability stack above.
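Even so, for illustration, here is a minimal sketch of both forms in a single pod spec. The labels used here (disktype=ssd, app=cache) are hypothetical and only show the shape of the fields.

apiVersion: v1
kind: Pod
metadata:
  name: affinity-example
spec:
  affinity:
    # Node affinity: only run on nodes labelled disktype=ssd
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
    # Pod affinity: run in the same zone as pods labelled app=cache
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cache
          topologyKey: topology.kubernetes.io/zone
  containers:
    - name: app
      image: nginx:alpine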

A taint is a property of a node that tells the scheduler not to (or to try not to) schedule pods on that node.

A toleration is a property of a pod that, when it matches a node's taint, tells the scheduler to ignore the restriction and allow the pod to be scheduled on the tainted node.
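As a minimal sketch, with a hypothetical dedicated=gpu key/value pair, a taint and its matching toleration look like this:

# A taint as it would appear in a node's spec:
# taints:
#   - key: dedicated
#     value: gpu
#     effect: NoSchedule

# The matching toleration in a pod's spec; key, value and effect must line up.
# (operator: Exists matches on the key alone, ignoring the value.)
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"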

Using affinity, taints and tolerations together, we can achieve our dedicated-node strategy. This official guide describes the basic process.

Using Taints and Tolerations with Karpenter

The guide shows how to taint nodes after they have been created. In our case, Karpenter creates nodes just in time, so we need to make sure that Karpenter launches our dedicated nodes with the taints already applied and respects the affinity and tolerations of pods when provisioning capacity for them.

Let's start by creating a NodePool with the appropriate labels and taints.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: observability
spec:
  template:
    metadata:
      labels:
        role: observability
    spec:
      nodeClassRef:
        name: observability
      taints:
        - key: role
          value: observability
          effect: NoSchedule
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["r"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["4"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a", "us-west-2b"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["arm64", "amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
    expireAfter: 720h

And a NodeClass for this pool:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: observability
spec:
  amiFamily: Bottlerocket
  role: "KarpenterNodeRole-my-cluster"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster-sg"
  tags:
    app: observability

This NodePool adds the taint role=observability:NoSchedule to every node it creates and attaches to the EKS cluster. It also adds the label role=observability to the node, which we can use to define affinity in our pods.
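To make this concrete, a node launched by this NodePool registers with roughly the following (trimmed) metadata. The exact set of labels on your nodes will differ; karpenter.sh/capacity-type is one of the well-known labels Karpenter applies.

apiVersion: v1
kind: Node
metadata:
  labels:
    role: observability
    karpenter.sh/capacity-type: on-demand
spec:
  taints:
    - key: role
      value: observability
      effect: NoSchedule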

Let's create a Prometheus pod as an example of an observability app that will run on nodes provisioned by this NodePool.

apiVersion: v1
kind: Pod
metadata:
  name: prometheus
  labels:
    app: prometheus
spec:
  containers:
    - name: prometheus
      image: quay.io/prometheus/prometheus:latest
      resources:
        requests:
          memory: "24Gi"
          cpu: "3500m"
      ports:
        - containerPort: 9090
  
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: role
                operator: In
                values:
                  - observability
  # For such a simple use case we can also use nodeSelector
  # instead of affinity
  # nodeSelector:
  #   role: observability
  
  tolerations:
    - key: "role"
      operator: "Equal"
      value: "observability"
      effect: "NoSchedule"

💡
An important point here is that we used requiredDuringSchedulingIgnoredDuringExecution, also known as hard affinity, because this criterion MUST be met during scheduling. If we used preferredDuringSchedulingIgnoredDuringExecution, the soft affinity, the scheduler would try to respect the affinity but would not guarantee it.
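For comparison, a soft-affinity version of the same constraint would look roughly like this; the weight (1 to 100) only influences scoring among candidate nodes:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: role
              operator: In
              values: ["observability"]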

The Karpenter controller checks the affinity and tolerations of this pod and automatically picks the NodePool with matching labels and taints. It creates new nodes as needed from that NodePool, and then the Kubernetes scheduler does its job.

Parting Words

In this post, we went through the process of adding taints and labels to Karpenter-provisioned nodes. We also saw how to add hard affinity to pods so that they get scheduled only on specific nodes.

Doing both of these steps lets us dedicate nodes to specific use cases like high-memory, compute-intensive, or GPU-centric workloads.