Scale Deployment to Zero

If a Deployment's Pods crash repeatedly, it usually indicates an issue that must be resolved manually. Removing the failing Pods and marking the Deployment is often a useful troubleshooting step. This policy watches existing Pods and, if any is observed to have restarted more than once (indicating a potential crash loop), Kyverno scales its parent Deployment to zero and writes an annotation signaling to an SRE team that troubleshooting is needed. It detects restarts by monitoring `Pod.status` for `restartCount` updates, which are performed by the kubelet. It may be necessary to grant additional privileges to the Kyverno ServiceAccount, via one of the existing ClusterRoleBindings or a new one, so it can modify Deployments. No `resourceFilter` modifications are needed if matching on `Pod` and `Pod.status`. Note: As of Kyverno 1.10, for this policy to work you must modify Kyverno's ConfigMap to remove or change the line `excludeGroups: system:nodes`, because the kubelet's status updates are made by members of the `system:nodes` group and would otherwise be ignored.
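
The two prerequisites described above (extra privileges and the ConfigMap change) might look like the following. This is a minimal sketch under assumptions that are not part of the policy itself: the ClusterRole name is hypothetical, the labels assume an installation whose release is named `kyverno` and whose background controller aggregates ClusterRoles carrying these labels, the verbs are an estimate of what mutating Deployments and listing ReplicaSets requires, and the ConfigMap is assumed to be named `kyverno` in the `kyverno` namespace. Check each against your own installation before using it.

```yaml
# Assumption: Kyverno's background controller aggregates ClusterRoles with these labels.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kyverno:scale-deployments   # hypothetical name
  labels:
    app.kubernetes.io/component: background-controller
    app.kubernetes.io/instance: kyverno
    app.kubernetes.io/part-of: kyverno
rules:
  # Lets Kyverno patch the parent Deployment (annotation and replicas).
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "update", "patch"]
  # Lets the policy's apiCall list ReplicaSets in the Pod's namespace.
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list"]
---
# Fragment only: shows the single key that matters here.
# Assumption: the ConfigMap is named "kyverno" in the "kyverno" namespace.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kyverno
  namespace: kyverno
data:
  # Previously "system:nodes"; emptied so kubelet status updates are no longer excluded.
  excludeGroups: ""
```

The ConfigMap document is a fragment, so it is safer to change that one key in the existing ConfigMap (for example with `kubectl edit`) than to replace the whole object with this file.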

Policy Definition

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: scale-deployment-zero
  annotations:
    policies.kyverno.io/title: Scale Deployment to Zero
    policies.kyverno.io/category: Other
    policies.kyverno.io/severity: medium
    policies.kyverno.io/subject: Deployment
    kyverno.io/kyverno-version: 1.7.0
    policies.kyverno.io/minversion: 1.7.0
    kyverno.io/kubernetes-version: "1.23"
    policies.kyverno.io/description: "If a Deployment's Pods are seen crashing multiple times it usually indicates there is an issue that must be manually resolved. Removing the failing Pods and marking the Deployment is often a useful troubleshooting step. This policy watches existing Pods and if any are observed to have restarted more than once, indicating a potential crashloop, Kyverno scales its parent deployment to zero and writes an annotation signaling to an SRE team that troubleshooting is needed. It may be necessary to grant additional privileges to the Kyverno ServiceAccount, via one of the existing ClusterRoleBindings or a new one, so it can modify Deployments. This policy scales down deployments with frequently restarting pods by monitoring `Pod.status` for `restartCount` updates, which are performed by the kubelet. No `resourceFilter` modifications are needed if matching on `Pod` and `Pod.status`. Note: Since version 1.10, for this policy to work you must modify Kyverno's ConfigMap to remove or change the line `excludeGroups: system:nodes`."
spec:
  rules:
    - name: annotate-deployment-rule
      match:
        any:
          - resources:
              kinds:
                - v1/Pod.status
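      preconditions:
        all:
          # Act only on UPDATE requests; requests without an operation fall back to 'BACKGROUND' and are skipped.
          - key: "{{request.operation || 'BACKGROUND'}}"
            operator: Equals
            value: UPDATE
          # Sum restartCount across all containers, defaulting to 0 when no container statuses are reported.
          - key: "{{ sum(request.object.status.containerStatuses[*].restartCount || [`0`]) }}"
            operator: GreaterThan
            value: 1
      context:
        # Name of the Pod's first owner, which is expected to be a ReplicaSet.
        - name: rsname
          variable:
            jmesPath: request.object.metadata.ownerReferences[0].name
            default: ""
        # List ReplicaSets in the Pod's namespace and read the first owner of the one named rsname, expected to be the Deployment.
        - name: deploymentname
          apiCall:
            urlPath: /apis/apps/v1/namespaces/{{request.namespace}}/replicasets
            jmesPath: items[?metadata.name=='{{rsname}}'].metadata.ownerReferences[0].name | [0]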
      mutate:
        targets:
          - apiVersion: apps/v1
            kind: Deployment
            name: "{{deploymentname}}"
            namespace: "{{request.namespace}}"
        patchStrategicMerge:
          metadata:
            annotations:
              sre.corp.org/troubleshooting-needed: "true"
          spec:
            replicas: 0
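
To try the policy, a Deployment whose container exits shortly after starting can be used to trigger restarts. This is a sketch for a test cluster only. The names, the `default` namespace, and the `busybox:1.36` image are illustrative assumptions, not part of the policy.

```yaml
# Hypothetical test workload: the container exits with an error a few seconds
# after starting, so the kubelet restarts it and restartCount climbs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crashloop-demo        # hypothetical name
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: crashloop-demo
  template:
    metadata:
      labels:
        app: crashloop-demo
    spec:
      containers:
        - name: crasher
          image: busybox:1.36
          # Sleep briefly, then fail so the container is restarted.
          command: ["sh", "-c", "sleep 5; exit 1"]
```

If the policy and its prerequisites are in place, then once the Pod's total `restartCount` exceeds 1 the Deployment should end up with `spec.replicas: 0` and the `sre.corp.org/troubleshooting-needed: "true"` annotation. Because the test Pod is removed by the scale-down, inspect the Deployment rather than the Pod.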

Related Policies

Mutate, Medium

Add Safe To Evict

The Kubernetes cluster autoscaler does not evict pods that use hostPath or emptyDir volumes. To allow eviction of these pods, the annotation cluster-autoscaler.kubernetes.io/safe-to-evict=true must be added to the pods.

Pod

Mutate, Medium

Add CAST AI Removal Disabled

CAST AI will not downscale a node that includes a pod carrying the autoscaling.cast.ai/removal-disabled="true" label. This protects sensitive workloads from being evicted, and the label can be applied to any pod to protect against unwanted downscaling. This policy mutates Jobs and CronJobs to add the removal-disabled label to protect against eviction.

Job

Cleanup, Medium

Cleanup Empty ReplicaSets

ReplicaSets serve as an intermediate controller for various Pod controllers like Deployments. When a new version of a Deployment is initiated, it generates a new ReplicaSet with the specified number of replicas and scales down the current one to zero. Consequently, numerous empty ReplicaSets may accumulate in the cluster, leading to clutter and potential false positives in policy reports if enabled. This cleanup policy is designed to remove empty ReplicaSets across the cluster within a specified timeframe (for instance, ReplicaSets created one day ago), preserving the ability to roll back to previous ReplicaSets in case of deployment issues.

ReplicaSet