Introduction:


Purpose:
Further optimise SystemVM template generation, lifecycle, patching, seeding and upgrades.

With https://github.com/apache/cloudstack/pull/4329 in 4.16, significant systemvmtemplate lifecycle changes were introduced to simplify systemvmtemplate install/seed and upgrades, as well as to make it cloud-init enabled for the CKS use-case. In addition, the management server ssh public key is now also patched via the cmdline string. While bundling the systemvmtemplates makes the packages turnkey (users no longer have to register/upgrade them manually), old and new pain-points remain:

  • The size of the cloudstack-management rpm/deb pkg increased by 1.5GB. This is not a functional issue, but
    it makes users wait longer for pkgs to download/install/upgrade.
  • Noredist packaging jobs require downloading 3x systemvmtemplates (for KVM, VMware and XenServer), which
    increases packaging time by at least half an hour.
  • Packer-based systemvmtemplate building takes 30-60 minutes, slowing template development/iteration.
  • The mgmt server public key is patched via the cmdline string, but systemvm.iso is still used to patch scripts and agent/jars.
  • Some newly added pkgs are installed but may not be necessary.
  • The logic for systemvm communication is highly hypervisor-dependent.
  • The built-in template is now very old and, due to openssh changes, can no longer be used.

Any major feature/change should be done in the main branch; however, optimisations and operational improvements can
be done in the current LTS branch for future minor releases.


Future work

  • Use of squashfs for in-place upgrades.
  • Unify initial patching via cloud-init+config-drive and deprecate/remove cmdline (see the sketch after this list).
  • Install/patch non-common userspace dependencies upon patching via scp/ssh (e.g. the jre deb pkg), or host a static repo/cache on the mgmt server's web server
  • Build the systemvmtemplate during mvn/packaging itself so that hosting is no longer required
  • CKS changes: since CKS is not tied to systemvmtemplates, we can:
    • Explore: move CKS away from cmdline string-based patching to user-data or config-drive alone (note the VMware disks issue); for this to happen, change cloud-early-config to treat the template as a cloud-init enabled template by default if/when the cmdline can be obtained/processed
    • Remove dockerd and docker-cli if CKS can be patched/set up using containerd alone (can reduce pkg size by 50-100MB)
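
As an illustration of the cloud-init+config-drive direction above, here is a minimal sketch of building a config-drive ISO that carries user-data; the payload keys and paths are illustrative, not the current implementation:

    # Minimal sketch, assuming the cmdline key/value payload moves into
    # cloud-init user-data served via a config drive (OpenStack datasource
    # layout). All keys/paths below are illustrative.
    mkdir -p drive/openstack/latest
    cat > drive/openstack/latest/user_data <<'EOF'
    #cloud-config
    write_files:
      - path: /var/cache/cloud/cmdline
        content: "template=domP type=ssvm host=10.0.32.10 mgmtcidr=10.0.32.0/24"
    EOF
    # cloud-init detects the drive by its 'config-2' volume label
    genisoimage -o configdrive.iso -V config-2 -J -r drive/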

Document History:

Version | Author/Reviewer | Date
1.0     | Rohit Yadav     | 20 Nov 2021




Feature Specifications

In the following sections, we define the scope and high-level spec, as well as discuss/propose the design & implementation.


SystemVM build and packaging changes:

  • Explore debootstrap, nbd and chroot based systemvm template building. A PoC shows promise, building the template in under 5 minutes (vs 30-60 minutes before); refer to the PoC tree for the export logic too.
  • Instead of bundling all 3x systemvmtemplates, we can bundle the template for one hypervisor and build/export the images for the other hypervisors using qemu-img upon cloudstack-management pkg install/upgrade (we already use qemu-img to convert the template to different hypervisor images during build):
    • Bundle the qcow2 file in the cloudstack-management pkg
    • Use an export mechanism based on qemu-img, tar and bzip2 to export to the other hypervisor formats, ova and vhd (see the sketch after this list). Check qemu-img compatibility/feasibility for all supported mgmt server distros. This can be done during automated template upgrades/updates, or when the cloudstack-management pkg is installed or upgraded.
    • In testing, we found this entire conversion process takes 10-30 seconds in total, which can save over 1GB of bandwidth on the total cloudstack-management pkg/file download
  • During packaging this can simplify things: if systemvm.iso is deprecated, then we bundle the ova only when -Dsystemvm or -P systemvm is enabled
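
For illustration, a hedged sketch of the export step using qemu-img, tar and bzip2 (file names and the pre-generated OVF descriptor are assumptions, not the final implementation):

    # Hedged sketch: export the bundled qcow2 to the other hypervisor formats.
    SRC=systemvmtemplate.qcow2

    # XenServer: qcow2 -> dynamic VHD, then compress
    qemu-img convert -f qcow2 -O vpc -o subformat=dynamic "$SRC" systemvmtemplate.vhd
    bzip2 -k systemvmtemplate.vhd    # -> systemvmtemplate.vhd.bz2

    # VMware: qcow2 -> stream-optimized VMDK, then tar it with an OVF descriptor into an OVA
    qemu-img convert -f qcow2 -O vmdk -o subformat=streamOptimized "$SRC" systemvmtemplate.vmdk
    tar -cf systemvmtemplate.ova systemvmtemplate.ovf systemvmtemplate.vmdk

This keeps only one template image in the pkg while still producing the per-hypervisor artifacts on the mgmt server itself.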


SystemVM template upgrades:

  • Trigger auto-upgrades or installs based on an ini file or some file/property that specifies the minimum required template (compare the global setting against the pkg-specific file/property); see the sketch after this list. The same can be hard-coded/specified/used by mvn, so that upgrades are not auto-triggered when they may not be necessary, especially for minor releases (say an upgrade from 4.16.0 -> 4.16.1).
  • Good to have: admins should be allowed to remove/delete old systemvmtemplates
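
A minimal sketch of the trigger check, assuming a packaged metadata file and property name (both hypothetical) and a version comparison via sort -V:

    # Minimal sketch: auto-trigger only when the packaged minimum required
    # template version is newer than the one already registered/seeded.
    # The file path and property name are hypothetical.
    min_required=$(sed -n 's/^minreq.sysvmtemplate.version=//p' /usr/share/cloudstack-management/templates/metadata.ini)
    registered="4.16.0"    # e.g. read from the corresponding global setting via API
    newest=$(printf '%s\n%s\n' "$registered" "$min_required" | sort -V | tail -n1)
    if [ "$newest" != "$registered" ]; then
        echo "systemvmtemplate upgrade required: $registered -> $min_required"
    fi

With this, a 4.16.0 -> 4.16.1 upgrade that ships the same minimum required template version triggers no template upgrade at all.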


Patching Changes:

  • Remove systemvm.iso
    • With ssh/scp set up via the mgmt server's public ssh key, sent in the cmdline string, we don't need systemvm.iso.
    • Deprecate and remove the systemvm.iso file, and move the systemvmtemplate patching logic to ssh/scp.
    • Don't remove the ISO drive in the systemvm VM spec (so it can be used in future by config drive etc.)
  • SSVM, CPVM CA/cert patching changes:
    • Using the CA framework, create certificates and pass them (public/private keys, CA cert) as key/value params in the cmdline string for all systemvms and VRs. This is doable if we know the hostname and IP address (esp. of the private and link-local nics) of the systemvms/VRs.
    • The certificates can be used in future by the VR agent, and currently by the SSVM/CPVM agents
    • Deprecate and remove the CSR-based ssvm/cpvm certificate setup method
  • Live and Automatic Patching:
    • Once ssh/scp based patching is implemented and internal Java services/APIs exist to trigger it, we could also do this on the fly, for example during minor-release upgrades or upgrades where the systemvmtemplate is not upgraded.
    • A new script/tool/utility to facilitate live patching (the tool is scp'd to the systemvm/VR too).
    • It may also be useful to introduce a new live API, or extend a current API such as upgradeRouterTemplate with a new live=true|false parameter. This can allow root-admins to manually select and live-upgrade their routers without restarting the network with cleanup=true.
    • Explore automatic upgrade/patching of routers, triggered via some checksum checks, or simply when restartNetwork is triggered without cleanup (i.e. the VR is not destroyed) but with a new parameter to enable/allow live patching of scripts/software.
    • The live patching mechanism should explore pre/post patch and validation hooks, and the ability to restore upon patching failure (heal & failsafe)
    • If live patching uses ssh, can it be used to patch/upgrade older systemvms/systemvmtemplates and VRs:
      • ssvm/cpvm: their internal nics, config etc. haven't changed in ages; in general, all that is required is to (a) update the jars, (b) update certs incl. ca-certificates, (c) maybe update the JRE, and (d) restart the cloud process
      • routers: their internal nic order hasn't changed in ages; in general it may require (a) updating the cloud scripts and (b) restarting all enabled services; however, between major ACS versions the userspace software (haproxy, apache, dnsmasq etc.) has changed significantly, so those may not be guaranteed to work.
  • Investigate and explore a multi-hop ssh jump to scp the payload to VRs/systemvms (see the sketch after this list); otherwise continue to use the hypervisor-specific patching/scp mechanisms.
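
A hedged sketch of ssh/scp-based patching over a multi-hop jump through the hypervisor host (the key path, payload name and addresses are illustrative):

    # Hedged sketch: scp the patch payload to a VR over a multi-hop ssh jump
    # (ProxyJump) through the hypervisor host; key path, payload name and
    # addresses below are illustrative.
    KEY=/var/lib/cloudstack/management/.ssh/id_rsa    # mgmt server key, its public half patched via cmdline
    JUMP=root@kvm-host1                               # hypervisor host reachable from the mgmt server
    VR=169.254.10.15                                  # VR link-local address on that host

    scp -i "$KEY" -o ProxyJump="$JUMP" systemvm-patch.tgz root@"$VR":/tmp/
    ssh -i "$KEY" -o ProxyJump="$JUMP" root@"$VR" \
        "tar xzf /tmp/systemvm-patch.tgz -C /opt && /opt/patch/apply.sh"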


NOTE:

Amongst the several changes made to the systemVM template in terms of the upgrade workflow and patching process, a few things for developers/RMs to bear in mind would be:
