Tag Archives: Kubernetes


Helm upgrade --reuse-values Fails with Nil Pointer Error After a Chart Version Bump

If you have been running a Helm chart for a while and using --reuse-values to carry your previous configuration forward on upgrades, you may have hit an error like the one below after bumping to a new chart version:

Error: UPGRADE FAILED: template: acme-web-proxy/templates/deployment.yaml:22:15: executing "acme-web-proxy/templates/deployment.yaml" at <include "acme-web-proxy.podAnnotations" .>: error calling include: template: acme-web-proxy/templates/_helpers.tpl:41:71: executing "acme-web-proxy.podAnnotations" at <include (print $.Template.BasePath "/configmap.yaml") .>: error calling include: template: acme-web-proxy/templates/configmap.yaml:18:6: executing "acme-web-proxy/templates/configmap.yaml" at <include "acme-web-proxy.metrics.config" .>: error calling include: template: acme-web-proxy/templates/configmap.yaml:34:25: executing "acme-web-proxy/templates/configmap.yaml" at <.Values.server.metrics.enabled>: nil pointer evaluating interface {}.enabled

Running the same upgrade with an explicit values file works without issue:

helm upgrade -n my-namespace acme-web-proxy acme/web-proxy \
  --version 1.5.0 \
  -f helm/acme-web-proxy-values.yaml
Release "acme-web-proxy" has been upgraded. Happy Helming!

Read on to understand why these two commands behave differently and what you can do about it.

The Issue

When upgrading a Helm chart using --reuse-values, the upgrade fails with a nil pointer error. The error traces back to a template trying to access a values key that does not exist in the stored release values, in this case .Values.server.metrics.enabled. The same upgrade succeeds when you pass a values file explicitly using -f.

The Cause

The difference comes down to how Helm builds the set of values it renders into your chart templates.

With helm upgrade --reuse-values, Helm rebuilds the values from the previous release (the overrides you supplied, layered on the defaults of the chart version installed at that time) and uses those for the upgrade. It does not start from the new chart version’s values.yaml defaults, so any key introduced in the new chart version is simply missing.

With helm upgrade -f values.yaml, Helm starts from the new chart’s values.yaml defaults and merges your file on top. Keys added in the new chart version are populated with their default values before your overrides are applied.

In the example above, chart version 1.5.0 added a new server.metrics.enabled key. The chart template accesses it directly without a nil guard:

{{- if .Values.server.metrics.enabled }}
  # metrics configuration block
{{- end }}

When you upgrade with --reuse-values, the server.metrics map does not exist in the reused values at all. Go’s template engine cannot evaluate .enabled on a nil value, so the render fails immediately.

This is expected behaviour. The Helm documentation states that --reuse-values reuses the last release’s values and merges in any overrides from --set. Merging in new chart defaults is not part of what it does.
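
One way to spot keys such as server.metrics before an upgrade is to put the values stored with the release next to the defaults of the target chart version. The sketch below wraps two standard Helm commands, helm get values and helm show values. The compare_values function name is made up for illustration, and the namespace, release, and chart names are the ones used in this post.

```shell
#!/bin/sh
# Print the values stored with a release, followed by the defaults of a target chart version.
# compare_values is a hypothetical helper name, not part of Helm.
compare_values() {
  # $1 = namespace, $2 = release, $3 = chart reference, $4 = target chart version
  echo "# user-supplied values stored with release $2"
  helm get values "$2" -n "$1" -o yaml
  echo "# defaults shipped with $3 version $4"
  helm show values "$3" --version "$4"
}

# Usage (needs access to the cluster and the chart repository):
# compare_values my-namespace acme-web-proxy acme/web-proxy 1.5.0
```

By default helm get values prints only the user-supplied values; adding --all prints the computed values for the release instead. Any key that appears in the second listing but not in the first is a candidate for the nil pointer error described above.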

The Fix

There are three approaches depending on your workflow.

Option 1: Always upgrade with an explicit values file

In my opinion, this is the most reliable approach. Keep a values file that captures every override you need and pass it on every upgrade:

helm upgrade -n my-namespace acme-web-proxy acme/web-proxy \
  --version 1.5.0 \
  -f helm/acme-web-proxy-values.yaml

Helm loads the new chart’s values.yaml defaults first and then applies your file on top. New keys get their defaults and your existing overrides stay intact.

Option 2: Supply the missing key with --set

If you want to keep using --reuse-values, you can backfill the missing key on the command line. Check the new chart’s values.yaml for the expected default and pass it in:

helm upgrade -n my-namespace acme-web-proxy acme/web-proxy \
  --version 1.5.0 \
  --reuse-values \
  --set server.metrics.enabled=false

This resolves the immediate error, but it is a fragile approach for ongoing upgrades. Each time a new chart version introduces a key that a template does not nil-guard, you will hit the same problem again.

Option 3: Use --reset-then-reuse-values (Helm 3.14+)

Helm 3.14 added the --reset-then-reuse-values flag. It resets to the new chart’s defaults first and then re-applies your previously stored overrides on top:

helm upgrade -n my-namespace acme-web-proxy acme/web-proxy \
  --version 1.5.0 \
  --reset-then-reuse-values

If you are on Helm 3.14 or later, this flag handles the new defaults problem without requiring you to maintain a full values file. You can check your Helm version with helm version.
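
If upgrades run from a script or pipeline, the version check can be automated. The following is a minimal sketch, assuming the version string has the form printed by helm version --short (for example v3.14.2+gc309b6f). The version_at_least function is a made-up helper, and the example uses a literal string so it runs without Helm installed.

```shell
#!/bin/sh
# Succeed (exit 0) when a Helm version string is at least MAJOR.MINOR.
# version_at_least is a hypothetical helper, not a Helm command.
version_at_least() (
  ver=${1#v}            # drop the leading "v"
  ver=${ver%%+*}        # drop the "+g<commit>" build suffix
  major=${ver%%.*}      # text before the first dot
  rest=${ver#*.}
  minor=${rest%%.*}     # text between the first and second dots
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
)

# Example with a literal version string:
if version_at_least "v3.14.2+gc309b6f" 3 14; then
  echo "--reset-then-reuse-values is available"
else
  echo "use -f or --set instead"
fi
```

Against a real installation, the call would look like version_at_least "$(helm version --short)" 3 14.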

Why --reuse-values is risky across chart version bumps

--reuse-values was designed for cases where you want to re-apply the same set of overrides without listing them again. It works well when upgrading within the same chart version or when a new version does not introduce any new required template values.

Once a chart adds a new key and the template author accesses it without a nil guard such as {{- if .Values.server.metrics }}, any upgrade using --reuse-values will break for anyone who does not have that key in their stored values. It is partly a chart authoring problem, but you will encounter it regardless and need to know how to unblock yourself.

The most consistent approach is to treat your values file as the source of truth and always pass -f on every upgrade. Your intent is explicit, the file is reviewable in source control, and you will not get caught out when a chart adds new keys.

Regards

Follow me on Bluesky

Dean Lewis

How to Increase CPU & Memory Limits and Set Node Selector for Splunk Operator on Kubernetes

The Issue

When deploying a Splunk instance using the Splunk Operator on Kubernetes, the default resource limits are set to 4 CPUs and 8GB of RAM. Users often want to increase these limits to better utilize available hardware resources. Additionally, users may want to schedule the Splunk pods on a specific Kubernetes node by using a nodeSelector.

However, attempts to set nodeSelector directly in the Splunk Operator’s Custom Resource (CR) manifest result in errors, and the operator does not apply the node selection as expected. This leads to deployment failures or pods not being scheduled on the desired node.

The Cause

The root cause is that the Splunk Operator’s Custom Resource Definition (CRD) for Standalone does not support the nodeSelector field inside the spec section of the CR manifest. When you try to add nodeSelector there, Kubernetes rejects the manifest with errors like:

The request is invalid: patch: Invalid value: ...: strict decoding error: unknown field "spec.nodeSelector"

This happens because the CRD schema does not define a nodeSelector field under spec, and the Splunk Operator currently does not expose node selection as a configurable setting in the CR.

The Fix

To increase CPU and memory limits for your Splunk instance, update the resources section under spec in your Splunk Standalone manifest like this:

spec:
  resources:
    limits:
      cpu: "6"              # allow up to 6 CPU cores
      memory: "12Gi"        # allow up to 12 GiB of memory

This change is supported by the operator and will apply the resource limits correctly.

For node selection, since the operator does not support setting nodeSelector in the CR, you need to manually patch the StatefulSet that the operator creates. Use the following kubectl patch command to restrict the pods to run only on a specific node (replace the hostname with your target node):

kubectl patch statefulset splunk-splunk-01-standalone -n splunk --type='merge' -p='{
  "spec": {
    "template": {
      "spec": {
        "nodeSelector": {
          "kubernetes.io/hostname": "ip-10-1-1-15.us-east-2.compute.internal"
        }
      }
    }
  }
}'

This patch adds the nodeSelector to the pod template spec of the StatefulSet, ensuring pods are scheduled only on the specified node.
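
If you need to repeat this for several nodes or StatefulSets, the patch body can be generated rather than typed by hand. The sketch below is an illustration only: node_selector_patch and pin_statefulset are made-up helper names, and the wrapper assumes kubectl is already pointed at the right cluster. Running the block as written only prints the patch; nothing is sent to a cluster.

```shell
#!/bin/sh
# Print a merge patch that pins a StatefulSet's pods to one node by hostname label.
# Both function names below are hypothetical helpers, not kubectl features.
node_selector_patch() {
  printf '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"%s"}}}}}\n' "$1"
}

# Apply the patch, then read the selector back from the pod template.
pin_statefulset() {
  # $1 = namespace, $2 = StatefulSet name, $3 = node hostname
  kubectl patch statefulset "$2" -n "$1" --type=merge -p "$(node_selector_patch "$3")" &&
    kubectl get statefulset "$2" -n "$1" -o jsonpath='{.spec.template.spec.nodeSelector}'
}

# Print the patch for the node used in this post.
node_selector_patch "ip-10-1-1-15.us-east-2.compute.internal"
```

To apply it, you would call pin_statefulset splunk splunk-splunk-01-standalone followed by the node hostname, and then run kubectl get pods -n splunk -o wide to confirm where the pod landed.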

 


Fixing “Kubernetes configuration file is group-readable or world-readable” warnings

The Issue

When using kubectl or oc you may see warnings that your Kubernetes configuration file is readable by group or by everyone.

WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/user/cluster/admin-kubeconfig
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/user/cluster/admin-kubeconfig

The Cause

The kubeconfig file has permissions that allow access for group or others. The tools expect your kubeconfig to be readable and writable only by your user.

You can confirm this with a long listing. If you see read permission for group or others, the file is too open.

ls -l /home/user/cluster/admin-kubeconfig
-rw-r--r--  1 user  staff   12345  Sep  3 14:05 /home/user/cluster/admin-kubeconfig
# ^ group and others have read access

The Fix

  1. Restrict the file permissions so only your user can read and write it.
    chmod 600 /home/user/cluster/admin-kubeconfig
  2. Optionally restrict the directory that holds the file.
    chmod 700 /home/user/cluster
  3. Verify the new permissions. The output should show owner read and write only.
    ls -l /home/user/cluster/admin-kubeconfig
    -rw-------  1 user  staff   12345  Sep  3 14:05 /home/user/cluster/admin-kubeconfig
    
  4. Consider moving the kubeconfig into your home configuration folder for easier use, then point your tools at it.
    mkdir -p ~/.kube
    mv /home/user/cluster/admin-kubeconfig ~/.kube/admin-kubeconfig
    export KUBECONFIG=~/.kube/admin-kubeconfig
    

    If you work with several kubeconfigs, you can list them in the KUBECONFIG environment variable, separated by colons.

    export KUBECONFIG=~/.kube/admin-kubeconfig:~/.kube/other.kubeconfig
  5. Keep your kubeconfig private. Do not share it, and do not commit it to a source control system.
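
The check and fix from the steps above can also be scripted, for example to run before a tool reads the file. This is a minimal sketch: is_private is a made-up helper name, and the demonstration uses a throwaway temporary file instead of a real kubeconfig.

```shell
#!/bin/sh
# Succeed (exit 0) only when the file grants no permissions to group or others.
# is_private is a hypothetical helper name.
is_private() {
  # Characters 5-10 of the `ls -l` mode string are the group and other bits.
  [ "$(ls -ld "$1" | cut -c5-10)" = "------" ]
}

# Demonstrate on a throwaway file rather than a real kubeconfig.
demo=$(mktemp)
chmod 644 "$demo"
is_private "$demo" || echo "too open: $(ls -ld "$demo" | cut -c1-10)"
chmod 600 "$demo"
is_private "$demo" && echo "private: $(ls -ld "$demo" | cut -c1-10)"
rm -f "$demo"
```

For a real file, a call such as is_private ~/.kube/admin-kubeconfig || chmod 600 ~/.kube/admin-kubeconfig tightens the permissions only when they are too open.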


Safely Clean Up Orphaned First Class Disks (FCDs) in VMware vSphere with PowerCLI

vSphere Orphaned First Class Disk (FCD) Cleanup Script

Orphaned First Class Disks (FCDs) in VMware vSphere environments are a surprisingly common and frustrating issue. These are virtual disks that exist on datastores but are no longer associated with any virtual machine or Kubernetes persistent volume (via CNS). They typically occur due to:

  • Unexpected VM deletions without proper disk clean-up
  • Kubernetes CSI driver misfires, especially during crash loops or failed PVC deletes
  • vCenter restarts or failovers during CNS volume provisioning or deletion
  • Manual admin operations gone slightly sideways!

Left unchecked, orphaned FCDs can consume significant storage space, cause inventory clutter, and confuse both admins and automation pipelines that expect everything to be nice and tidy.

🛠️ What does this script do?

Inspired by William Lam’s original blog post on FCD cleanup, this script takes the concept further with modern PowerCLI best practices.

You can download and use the latest version of the script from my GitHub repo:
👉 https://github.com/saintdle/PowerCLI/blob/saintdle-patch-1/Cleanup%20standalone%20FCD

Here’s what it does:

  1. Checks if you’re already connected to vCenter; if not, prompts you to connect
  2. Retrieves all existing First Class Disks (FCDs) using Get-VDisk
  3. Retrieves all Kubernetes-managed volumes using Get-CnsVolume
  4. Excludes any FCDs still managed by Kubernetes (CNS)
  5. For each remaining “orphaned” FCD, checks if it is mounted to any VM (even if Kubernetes doesn’t know about it)
  6. Generates a report (CSV + logs) of any true orphaned FCDs (not in CNS + not attached to any VM)
  7. If dry-run mode is OFF, safely removes the orphaned FCDs from the datastore

The script is intentionally designed for safety first, with dry-run mode ON by default. You must explicitly allow deletions with -DryRun:$false and optionally -AutoDelete.
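
The filtering in steps 2 to 6 is a set difference: start with every FCD, then drop anything CNS still tracks and anything attached to a VM. The shell sketch below illustrates only that logic and is not taken from the PowerCLI script. The find_orphans name, the file layout (one disk ID per line), and the fcd-a to fcd-d IDs are all made up for the example.

```shell
#!/bin/sh
# Print the IDs from the "all" file that appear in neither the "cns" nor the "attached" file.
# Each argument is a file with one disk ID per line. find_orphans is a hypothetical helper.
find_orphans() {
  # $1 = all FCD IDs, $2 = IDs tracked by CNS, $3 = IDs attached to a VM
  awk -v all="$1" 'FILENAME != all { seen[$0] = 1; next } !($0 in seen)' "$2" "$3" "$1"
}

# Made-up sample data.
work=$(mktemp -d)
printf '%s\n' fcd-a fcd-b fcd-c fcd-d > "$work/all"       # every FCD found
printf '%s\n' fcd-a                   > "$work/cns"       # still managed by Kubernetes (CNS)
printf '%s\n' fcd-c                   > "$work/attached"  # mounted to a VM
find_orphans "$work/all" "$work/cns" "$work/attached"     # prints fcd-b and fcd-d
rm -rf "$work"
```

Only IDs that survive both exclusions are reported, which mirrors the report-first, delete-later approach of dry-run mode.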

❗ Known limitations and gotchas

Despite our best efforts, there is one notorious problem child: the dreaded locked or “current state” error.

You may still see errors like:

The operation is not allowed in the current state.

This happens when vSphere believes something (an ESXi host, a failed task, or the VASA provider) has an active reference to the FCD. These “ghost locks” can only be diagnosed and resolved by:

  • Using ESXi shell commands like vmkfstools -D to trace lock owners
  • Rebooting an ESXi host holding the lock
  • Engaging VMware GSS to clear internal stale references (sometimes the only safe option)

This script does not attempt to forcibly unlock or clean these disks for obvious reasons. You really don’t want a script going full cowboy on locked production disks. 😅

So while the script works great for true orphaned disks, ghost FCDs are a special case and remain an exercise for the reader (or your VMware TAM and GSS support team!).

⚠️ Before you copy/paste this blindly…

Let me be brutally honest: this script is just some random code stitched together by me, a PowerCLI enthusiast with far too much time on my hands, and enhanced by ChatGPT. It’s never been properly tested in a production environment.

 


Learn KubeVirt: Deep Dive for VMware vSphere Admins

As a vSphere administrator, you’ve built your career on understanding infrastructure at a granular level: datastores, DRS clusters, vSwitches, and HA configurations. You’re used to managing VMs at scale. Now, you’re hearing about KubeVirt, and while it promises Kubernetes-native VM orchestration, it comes with a caveat: Kubernetes fluency is required. This post is designed to bridge that gap, not only explaining what KubeVirt is but also mapping its architecture, operations, and concepts directly to vSphere terminology and experience. By the end, you’ll have a mental model of KubeVirt that relates to your existing knowledge.

What is KubeVirt?

KubeVirt is a Kubernetes extension that allows you to run traditional virtual machines inside a Kubernetes cluster using the same orchestration primitives you use for containers. Under the hood, it leverages KVM (Kernel-based Virtual Machine) and QEMU to run the VMs (more on that further down).

Kubernetes doesn’t replace the hypervisor; it orchestrates it. Think of Kubernetes as the vCenter equivalent here: managing the control plane, networking, scheduling, and storage interfaces for the VMs, with KubeVirt as a plugin that adds VM resource types to this environment.

Tip: KubeVirt is under active development; always check the latest docs for feature support.

Core Building Blocks of KubeVirt, Mapped to vSphere

| KubeVirt Concept | vSphere Equivalent | Description |
|---|---|---|
| VirtualMachine (CRD) | VM Object in vCenter | The declarative spec for a VM in YAML. It defines the template, lifecycle behaviour, and metadata. |
| VirtualMachineInstance (VMI) | Running VM Instance | The live instance of a VM, created and managed by Kubernetes. Comparable to a powered-on VM object. |
| virt-launcher | ESXi Host Process | A pod wrapper for the VM process. Runs QEMU in a container on the node. |
| PersistentVolumeClaim (PVC) | VMFS Datastore + VMDK | Used to back VM disks. For live migration, either ReadWriteMany PVCs or RAW block-mode volumes are required, depending on the storage backend. |
| Multus + CNI | vSwitch, Port Groups, NSX | Provides networking to VMs. Multus enables multiple network interfaces. CNIs map to port groups. |
| Kubernetes Scheduler | DRS | Schedules pods (including VMIs) across nodes. Lacks fine-tuned VM-aware resource balancing unless extended. |
| Live Migration API | vMotion | Live migration of VMIs between nodes with zero downtime. Requires shared storage and certain flags. |
| Namespaces | vApp / Folder + Permissions | Isolation boundaries for VMs, including RBAC policies. |

KVM + QEMU: The Hypervisor Stack
