Strategies for hypervisor patching -- minimize user impact

asked 2017-08-08 12:48:56 -0500

I recognize that the answer will vary by environment (maturity of tools, size of environment, uptime criticality, etc.), but I am looking for guidance on industry trends and lessons learned (pros/cons) about how others handle hypervisor patching. I am specifically interested in workload mitigation for the guest VMs. For example, if the hypervisor patching process needs to restart key processes such as KVM or Open vSwitch, the guest VMs will be impacted. From my initial thinking, there are a few options:

1) Mitigate VMs/services before patching the hypervisor. In this scenario, we must mitigate all services running on all VMs on a hypervisor before we can patch the hypervisor itself. This is the safest option but has the highest level of effort (LOE), as every VM service type running in OpenStack must be onboarded into the automated patching tool before the hypervisors can be patched safely. For this scenario, we would probably want to patch the VMs while they are being mitigated, to reduce the number of times they have to be taken offline (i.e., since we're taking the VM out of service to patch the hypervisor, we may as well patch the guest OS).

2) Patch the hypervisor without mitigating services. In this scenario, service owners need to account for hypervisor failures affecting X% of the fleet. We simply patch and force a restart of the service(s) or a reboot of the hypervisor. The larger services should be able to handle this, but the smaller services / single VMs would require work by the service owner to handle this scenario.

3) Live patch, but don't force a restart/reboot unless it's a vulnerability that would be caught by an unauthenticated scan. In this scenario, the fleet will potentially have differing versions of packages actively running, as the updates only take effect at process restart or host reboot. Except for Open vSwitch and KVM, this is relatively easy to implement. Those packages should be excluded, as patching them may immediately interrupt workloads (as is the case with Open vSwitch). We could then schedule downtime to handle the reboot (if needed).
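To make options 2 and 3 a bit more concrete, here is a rough Python sketch of the planning logic I have in mind. It is only an illustration under assumptions: the function names, the package-name prefixes, and the example host names are all made up for this post, and it only plans the work (it does not call apt or touch any host).

```python
import math

# Assumption: package-name prefixes whose upgrade may interrupt guest
# workloads. Illustrative only; adjust to whatever your hosts actually run.
DISRUPTIVE_PREFIXES = ("openvswitch", "qemu")


def split_packages(pending):
    """Option 3: split pending updates into (patch_now, defer_to_window)."""
    patch_now, deferred = [], []
    for name in pending:
        if name.startswith(DISRUPTIVE_PREFIXES):
            deferred.append(name)   # hold for scheduled downtime
        else:
            patch_now.append(name)  # takes effect at next restart/reboot
    return patch_now, deferred


def plan_batches(hosts, max_fraction):
    """Option 2: group hosts so no batch exceeds max_fraction of the fleet."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    # Round down so a batch never exceeds the agreed X%, but always
    # make progress with at least one host per batch.
    size = max(1, math.floor(len(hosts) * max_fraction))
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


if __name__ == "__main__":
    fleet = [f"hv{n:04d}" for n in range(10)]  # hypothetical host names
    for batch in plan_batches(fleet, 0.2):
        print(batch)
    print(split_packages(["openssl", "openvswitch-switch", "bash"]))
```

Here `max_fraction` stands in for the "X% of the fleet" that service owners would be asked to tolerate in option 2, and the deferred list is what would wait for the scheduled downtime in option 3.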

For a sense of scale, we're running a few thousand hypervisors in a customer-facing production environment. We are running the Ubuntu 16.04/KVM combination, in case there are any nuances.

Thanks for any guidance, ideas or just ruminations. :-)

Thanks, Jeffery
