So basically, I had a customer hit by a known HP-AMS driver issue (KB2085618). The symptoms were as follows:
- Unable to vMotion a VM to another host; the migration fails with an “operation timed out” error.
- If you power off a VM, migrate it to another host and then power it on, you get the error “Could not start VMX: msg.vmk.status.VMK_NO_MEMORY”.
ESXi memory leak
The issue is caused by a memory leak in the driver that fills the ESXi host's swap memory, leaving the host unable to respond to any requests at all.
For example, even trying to enable SSH fails.
Here’s the fix: update to the latest HP-AMS driver. However, you will find that simply importing the driver into Update Manager and trying to update a host fails, because you can’t migrate your machines off it. Even if you power the machines off, the update still fails with an “Error 99” message. Googling that error leads to a community post, which in turn points to KB2043170.
The workaround is to power off your machines and reboot the host. If you log into the ESXi console directly (remotely via iLO/iDRAC, or physically), you will see that the shell can’t even start a new process:
/bin/sh: can't fork
So the host is completely locked up.
Once one host has rebooted, update it with the latest patches (including the new HP-AMS driver), then migrate your machines onto this host, power them on, and update your other hosts.
A technically simple fix, but a pain in the backside if you’re hit by the issue after the memory leak has already done its damage.
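The order of operations matters here, because no VM can be live-migrated off a leaking host. The sketch below just prints that rolling order as a checklist. It is a hypothetical helper: `Host` and `remediation_plan` are made-up names, it does not talk to vSphere or ESXi at all, and it assumes the first host to be patched has enough capacity to run every VM.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical model of an ESXi host and the VMs registered on it.
    name: str
    vms: list[str] = field(default_factory=list)


def remediation_plan(hosts: list[Host]) -> list[str]:
    """Return the rolling fix as an ordered checklist (illustrative only)."""
    if not hosts:
        return []
    first, rest = hosts[0], hosts[1:]
    steps: list[str] = []
    # 1. Live migration is impossible, so power off the VMs on the first host.
    steps += [f"power off {vm} on {first.name}" for vm in first.vms]
    # 2. The host is locked up ("can't fork"), so reboot it from the console.
    steps.append(f"reboot {first.name} from the console")
    # 3. Patch the now-clean host, including the updated HP-AMS driver.
    steps.append(f"update HP-AMS driver and patches on {first.name}")
    steps += [f"power on {vm} on {first.name}" for vm in first.vms]
    # 4. Cold-migrate everything else onto the patched host, then fix each host.
    for host in rest:
        for vm in host.vms:
            steps.append(f"power off {vm} on {host.name}")
            steps.append(f"cold-migrate {vm} to {first.name}")
            steps.append(f"power on {vm} on {first.name}")
        steps.append(f"reboot {host.name} from the console")
        steps.append(f"update HP-AMS driver and patches on {host.name}")
    return steps


if __name__ == "__main__":
    for step in remediation_plan([Host("esx01", ["vm1"]), Host("esx02", ["vm2"])]):
        print(step)
```

Patching a single host first gives you a known-good target before touching the rest of the cluster.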