Bug Reference

https://issues.apache.org/jira/browse/CLOUDSTACK-6278

 

Branch

4.5

Introduction

CloudStack Baremetal currently works only in basic networking mode. The main challenge in supporting advanced networking mode is finding a way to program VLANs when provisioning or deprovisioning a baremetal instance. VLAN programming is a simple task in virtualization, because the switches that VMs directly connect to are virtualized and all hypervisor vendors provide means to program VLANs on virtual switches. In baremetal there is no virtualization technology involved at all; baremetal instances connect to physical switches, so the only way to program a VLAN is to talk to the physical switch. Given that there are many switch vendors in the market, this feature provides a framework into which a switch vendor can plug their specific product by writing a small piece of code.

Purpose

This is the functional specification of Baremetal Advanced Networking Support, tracked as JIRA issue CLOUDSTACK-6278.

References

Document History

Date         Revision    Author        Description of the change
3/24/2014    0.1         Frank Zhang   Initial Draft

Glossary

Term         Definition
Baremetal    The technology that manages baremetal hosts using the CloudStack infrastructure
CS           CloudStack

Feature Specifications

  • put a summary or a brief description of the feature in question 
    This feature adds a CloudStack network plugin for baremetal advanced networking. With it, CloudStack can automatically program VLANs on the physical switch to which baremetal instances connect when creating or destroying a baremetal instance. The feature cannot work standalone; it needs support from the physical switch itself, either through a vendor SDK or through an in-switch agent for a whitebox switch. With this feature, baremetal instances gain the L2 isolation methods provided by CloudStack advanced networking, which is particularly useful in a public cloud that wants to offer baremetal as a service.

  • list what is deliberately not supported or what the feature will not offer - to clear any prospective ambiguities
    The following network-related API categories are supported by this feature:
    • NAT

    • VPN

    • Load Balancer

    • Firewall

    • Router

    • Network

       

          For details of these categories, please refer to the CloudStack API documentation.

  • list all open items or unresolved issues the developer is unable to decide about without further discussion
    At the time this functional spec is being written, the Dell S4810 has been chosen as the switch backend for the first version. Other switch vendors are not currently supported.

  • quality risks (test guidelines)
    To test this feature, the tester must have a physical switch supported by the first-version implementation installed at the top of the rack, and must make sure all baremetal hosts are connected to that switch.
    1. create baremetal zone with advanced networking

    2. prepare baremetal infrastructure following instructions in reference link to Baremetal Kickstart

    3. provide host-to-switch details using the addBaremetalRct API

    4. create user VMs using different user accounts and verify that they are isolated from each other

      This feature is transparent to the end user; creating a baremetal instance should feel no different from creating a virtualized instance. For the administrator, a few extra steps are needed, including setting up a baremetal advanced zone and providing the network topology between hosts and TOR switches.

  • specify supportability characteristics:
    • to troubleshoot, the admin needs access to the physical switch, either through a directly connected cable or remote SSH, depending on the physical switch vendor. Most connectivity issues can be diagnosed by logging in to the physical switch and comparing the VLAN on the port that the baremetal instance connects to with the VLAN allocated to the instance by CloudStack (a minimal sketch of this check is given after this list)
    • this feature uses a periodic task to fully sync the VLAN configuration on switch ports with the VLAN allocation in the CloudStack database, to reduce faults caused by the admin's misconfiguration of the physical switch. The task's interval can be configured through a global setting; the special value 0 disables the task
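
    A minimal sketch of the troubleshooting check above, assuming SSH access to the switch and CloudMonkey access to the management server; the addresses are placeholders and the switch command mentioned is only illustrative, since the exact CLI syntax depends on the vendor:

        # 1) Log in to the switch and inspect the VLANs on the port the baremetal host
        #    connects to, e.g. with a "show vlan"-style command at the switch prompt.
        ssh admin@192.168.0.1

        # 2) Look up the instance's guest network in CloudStack and note the VLAN it
        #    was allocated, then compare the two. <network-id> is a placeholder.
        cloudmonkey list networks id=<network-id>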

  • explain configuration characteristics:
      • baremetal.vlan.sync.interval is introduced to control the interval of the periodic task that syncs the VLAN configuration on switch ports with the VLAN allocation in the CloudStack database (an example of changing it is shown after this list)
      • some configuration may need to be done on the physical switch to enable the SDK or remote VLAN programming; this is up to the switch vendor and out of scope for CloudStack.
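
      As an illustration of the first point, the sync interval can be changed like any other global setting, for example with CloudMonkey (the value shown is arbitrary):

          # Set the full-sync interval to 600 (units as defined by the implementation);
          # the special value 0 disables the task.
          cloudmonkey update configuration name=baremetal.vlan.sync.interval value=600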

  • deployment requirements (fresh install vs. upgrade) if any
    • this feature is available both on a fresh CloudStack install and after an upgrade from an older CloudStack version

  • port security
    • we recommend that the admin lock switch ports to the baremetal MAC addresses for security. The instructions for doing this differ among switch vendors; please refer to the switch's user guide
  • virtual router
    • Baremetal uses the virtual router to provide all network services, including PXE/DHCP, SNAT, port forwarding and so forth (for a detailed list of VR functions, please see the CloudStack administrator guide). For the time being, only VMware is supported as the VR provider. The reason is that the virtual router on XenServer/KVM uses a link-local address for inter-communication, while baremetal requires a management NIC to access the internal HTTP server that stores the kickstart file and installation ISO. To mix XenServer/KVM with baremetal, the user must deploy a baremetal instance before any virtualized instance to ensure the VR is created on a VMware host.
    • For virtual router HA, there is nothing specific to baremetal; the admin can use the same means as for virtualization
    • The VR template size will be increased from 2 GB to 4 GB in order to store a sufficient number of kernel/initrd images for baremetal templates

  • LLDP is not supported in the first-version implementation

  • If a baremetal host has multiple NICs, only the NIC that sits in the same VLAN as the VR can get an IP address; however, the admin can configure the kickstart file to set static IPs, or use a DHCP server outside CloudStack, for the extra NICs

  • Baremetal has no concept of CloudStack storage, including primary storage, secondary storage, volumes, and snapshots. For details, see the functional spec of "Baremetal Kickstart Support"

  • A baremetal cluster is a simple aggregation of hosts with similar hardware. We recommend that the admin put hosts of the same cluster in the same IPMI subnet, and connect all 'guest NICs' (the NIC that plays the role of guest NIC after the guest OS is provisioned) to the same TOR switch.

  • IPv6 is not supported in this feature

  • The maximum number of concurrently deploying baremetal instances is undetermined; from our experience, we recommend not deploying more than 10 baremetal instances at the same time

Use cases

This feature is transparent to the end user. There is no change in the workflow of creating a CloudStack instance in an advanced networking zone, except that the user creates a baremetal instance instead of a virtualized instance. The workflow of creating a baremetal instance is described in the Baremetal Kickstart specification.

For admin, the workflows are:

  1. Admin creates the compute offering
  2. Admin creates a Network Offering w/PXE & DHCP services and VR as the service provider
  3. Admin sets the IP of the internal HTTP server in the network offering and in the global setting 'baremetal.internal.storage.server.ip'. The value in the network offering overrides the one in the global setting
  4. Tenant can create a network w/ the above network offering
  5. Tenant deploys BM instance
  6. CloudPlatform Management Server programs VR w/DHCP and PXE boot information
  7. CloudPlatform Management Server creates a source NAT rule so that traffic from the guest gateway with a destination IP of 'baremetal.internal.storage.server.ip' is source NATed to the management NIC. Traffic towards the internal HTTP server therefore goes through the VR's management NIC instead of its public NIC. The source NAT binds only to the guest IP of the provisioning instance, to prevent network sniffing from other VMs in the same network (a sketch of such a rule is shown after this list).
  8. Programs TOR w/necessary guest VLAN
  9. IPMI powers on the BM host
    1. Sets Host to PXE boot
    2. Restarts the host
  10. Host boots up and reaches DHCP (VR)
  11. Gets IP and PXE info using DHCP options
  12. Downloads the Linux kernel and initrd
  13. Gets KS file from PXE server (VR)
  14. BM host gets packages using the info provided from the KS file
  15. When provisioning is done, the post-provision section of the KS file runs a script which sends a notification (HTTP request) to an agent running in the VR. The agent drops the source NAT created in step 7 and notifies the CloudStack management server that provisioning is complete. The CloudStack management server also sets a TTL on the source NAT created in step 7. If the provision-done notification is not received before the TTL expires, the management server instructs the agent in the VR to drop the source NAT and treats the provisioning as a failure.
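
The source NAT in step 7 is set up inside the VR by the management server. The following is only a rough sketch, assuming it is realized with an iptables rule of roughly this shape; the addresses and interface name are placeholders, not the actual implementation:

    # Hypothetical illustration of the per-instance source NAT described in step 7.
    GUEST_IP=10.1.1.15           # guest IP of the baremetal instance being provisioned
    STORAGE_IP=10.223.110.231    # value of baremetal.internal.storage.server.ip
    MGMT_NIC=eth1                # the VR's management NIC

    # Traffic from this guest IP towards the internal storage server leaves through
    # the management NIC and is source NATed to that NIC's address.
    iptables -t nat -A POSTROUTING -s "$GUEST_IP" -d "$STORAGE_IP" \
        -o "$MGMT_NIC" -j MASQUERADE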

Work Flow Diagram:

Architecture and Design description

CloudStack advanced networking typically uses VLAN as the L2 isolation method. This is simple to achieve in a virtualized environment, as all hypervisor vendors provide means to configure VLANs on virtual switches programmatically. In the baremetal world the case is more challenging: baremetal instances connect to physical switches, and there is no generic way to program VLANs during a baremetal instance's provisioning/deprovisioning phase. Below is an architecture overview for this feature:


The main efforts are divided in four parts:

  • CloudStack Network Plugin
    This plugin hooks into the network lifecycle of a baremetal instance. In CloudStack, the orchestration code calls the NetworkGuru interface during different lifecycle phases; for example, NetworkGuru.design/NetworkGuru.implement/NetworkGuru.allocate are called when provisioning an instance, and NetworkGuru.deallocate/NetworkGuru.release are called when destroying an instance. By implementing a specific NetworkGuru, we can create our own network plugin to program the physical switch in the VM's provisioning/deprovisioning phase.

    A new BaremetalAdvancedNetworkingGuru will be created, inheriting from GuestNetworkGuru, which is used for virtualization advanced networking. When creating a baremetal instance, BaremetalAdvancedNetworkingGuru.implement instructs the Baremetal Network Backend Framework to configure the VLAN of the instance's guest network on the switch port to which the instance's destination host connects. When destroying a baremetal instance, BaremetalAdvancedNetworkingGuru.release instructs the backend framework to remove the VLAN configured in the earlier provisioning phase

  • Baremetal Network Backend Framework
    This framework is a layer between BaremetalAdvancedNetworkingGuru and the vendor-specific plugin, which runs in a standalone web process. When a request (implement/release) comes from BaremetalAdvancedNetworkingGuru, the framework (BaremetalBackendManager) converts the request into a tuple of (switch identity, switch port, host mac, vlan id) and passes it to BaremetalSwitchBackend.allocate through an HTTP call (a hypothetical example of such a call is sketched below). Based on the tuple, the backend has enough information to configure or remove a VLAN on any port of any switch. In short, BaremetalBackendManager is the decision maker: it implements all the business logic that translates CloudStack terms (baremetal instance, host, cluster, guest network) into switch terms (switch identity, switch port, host mac, vlan id). BaremetalSwitchBackend is the executor: it simply executes requests from BaremetalBackendManager using its own method (which varies by switch vendor), and it knows nothing about CloudStack terms and business logic.

    The goal of decoupling BaremetalBackendManager and BaremetalSwitchBackend is simplicity and shielding against license issues. A switch vendor can write their plugin against BaremetalSwitchBackend without understanding CloudStack business logic, and if they use a library whose license conflicts with CloudStack's, the separate web process is the shield.
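
    To make the division of labor concrete, the following sketches the kind of request BaremetalBackendManager could send to a BaremetalSwitchBackend web process. The endpoint path and JSON field names here are hypothetical illustrations; only the (switch identity, switch port, host mac, vlan id) tuple itself comes from this spec:

        # Hypothetical request shape only; endpoint and field names are illustrative.
        curl -X POST http://127.0.0.1:8080/baremetal/switchbackend/vlan \
            -H 'Content-Type: application/json' \
            -d '{
                  "switchIp":   "192.168.0.1",
                  "switchType": "Force10",
                  "port":       "tengigabytesinterface:1/3/5",
                  "hostMac":    "b8:ac:6f:9a:fa:6b",
                  "vlanId":     1201
                }'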


  • VirtualRouter
    The regular CloudStack virtual router will be enhanced to support the baremetal PXE/DHCP service. It will also transparently provide to baremetal instances all the network services that are provided to virtualized instances.
    As the virtual router needs to communicate with an internal IP address, the admin must use VMware as the supporting hypervisor to create the VR, because only the VMware VR has a management NIC in CloudStack. Baremetal uses SSH to remotely execute commands in the VR, for example to prepare the PXE server or set up the source NAT.

  • Provisioning Completion Notification
    Admin must configure the first command in the kickstart post-install section to be the script which notifies CloudStack of the completion status of provisioning. The script sends a simple HTTP request that can be issued with curl or wget, using username/password authentication. An example with curl looks like:

    curl -u username:password http://ip_of_default_gateway/baremetal/provisioningstatus?status=done

    Once the agent in the VR receives the notification, it identifies the instance by its IP address, and a notification is then sent from the VR to the management server to indicate the completion of provisioning. The management server then transitions the instance state from Starting to Running. If no notification is sent to the management server, or the notification fails to reach it for some reason (for example, a network outage), the management server shuts down the instance and transitions its state to Error after the TTL expires.

    For security reasons, the admin should keep the credentials used to send the HTTP notification request safe, and never let any customer script run before the provisioning completion notification script in the post-install section.
  • System Account: baremetal-system-account

    baremetal-system-account is created during the management server boot phase if the account does not already exist in the database. This account is used by the virtual router to send the provisioning completion notification to the management server. As CloudStack does not have a comprehensive IAM at this time, baremetal-system-account is created as a user account to grant the minimal needed permission. This account will show up on the Account page of the UI; the admin should never delete it, otherwise the management server will not receive the provisioning completion notification and will treat the provisioning as a failure after the TTL expires. If the account is deleted accidentally, restarting the management server will recreate it.

  • Dedicated plugin for physical switch vendor
    This plugin is referred to as BaremetalSwitchBackend in the paragraphs above. Depending on the switch vendor, BaremetalSwitchBackend can be an in-process plugin running in the same process space as the CloudStack management server, or an out-of-process agent running on the switch operating system.

  • Http rack configuration repo
    To program the VLAN for each baremetal instance, CloudStack must understand the network topology at least at the rack level. To gather the topology, CloudStack needs two pieces of information:
    • switch identity and credential
    • host-switch port mapping

    The switch identity and credential give CloudStack a way to access the switch; this is usually an IP address with a username/password, which vary by switch vendor. The host-switch port mapping indicates which host (identified by host MAC) connects to which switch port (identified by port number).

    Both the switch identity/credential and the host-switch port mapping can be provided by the HTTP rack configuration repo. The repo is an ordinary HTTP link whose target is a structured text file (the rack configuration text, referred to as RCT in the following content). The format is JSON.

    {
        "racks": [
            {
                "l2Switch": {
                    "ip": "192.168.0.1",
                    "username": "root",
                    "password": "password",
                    "type": "Force10"
                },
                "hosts": [
                    {
                        "mac": "b8:ac:6f:9a:fa:6b",
                        "port": "tengigabytesinterface:1/3/5"
                    }
                ]
            }
        ]
    }

    The RCT is registered into CloudStack through the addBaremetalRCT API (an invocation example is given in the Web Services APIs section below). The RCT is globally available to all baremetal zones; no redundant zone/pod/cluster information needs to be specified, as the host MAC is enough to figure out the other needed facts. When the RCT is changed on the HTTP server, simply calling addBaremetalRCT again with the same URL updates the RCT in CloudStack instantly.

    As seen in the above example, the RCT is in JSON format. The whole RCT is a map with a single entry, 'racks'. 'racks' is an array that contains a set of rack definitions. A rack definition is a map made up of two parts: an "l2Switch" map that has the fields 'ip', 'username', 'password' and 'type':

    • ip: ip address of switch management port
    • username: username used to login switch management port
    • password: password used to login switch management port
    • type: the switch vendor type. Currently the only supported vendor type is 'Force10'



    and a 'hosts' array that contains a set of host MAC/port pairs:
    • mac: the mac address of host nic that connects to switch port
    • port: the switch port identity. For Force10, the port identity is in the format 'port type colon port id'. The Force10 S4810 has three port types: gigabitethernet, tengigabitethernet, fortyGigE; the port id is defined as stackUnit/slot/port. Please refer to the S4810 user guide for detailed information. A quick way to sanity-check the RCT file before registering it is sketched below.
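
    Before registering the RCT, it can be useful to confirm that the URL is reachable from the management server and that the file is well-formed JSON. A minimal check (the URL is a placeholder):

        # Fetch the RCT and pretty-print it; a parse error here means the file is not valid JSON.
        curl -s http://rct.example.com/rack-config.json | python -m json.tool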

  • Provision Done Notification Script:
    Putting the script below at the beginning of the post-install section of the kickstart file will enable the provision done notification:

    # Parse the kernel command line to find the MAC of the NIC used for kickstart
    # (passed as the 'ksdevice' parameter by the PXE boot configuration).
    cmdline=`cat /proc/cmdline`
    for pair in $cmdline
    do
        set -- `echo $pair | tr '=' ' '`
        if [[ $1 == 'ksdevice' ]]; then
            mac=$2
        fi
    done
    # The default gateway is the virtual router; notify its agent on port 10086
    # that provisioning of the instance identified by this MAC is done.
    gw=`ip route | grep 'default' | cut -d " " -f 3`
    curl "http://$gw:10086/baremetal/provisiondone/$mac"

     

     

Web Services APIs

AddBaremetalRCT

field name    description
rctUrl        An HTTP link pointing to the RCT on an accessible HTTP server
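
A minimal invocation sketch, assuming CloudMonkey is used; the URL is a placeholder and the parameter name follows the table above:

    # Register (or re-register) the RCT; the URL must be reachable from the management server.
    cloudmonkey add baremetalrct rcturl=http://rct.example.com/rack-config.json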

UI flow

a new button to call AddBaremetalRCT:

This button should show up when the admin clicks "Baremetal Rack Configuration" in the "Select View" drop-down under "Global settings". (See Baremetal_add_RCT.png.)

a new button to call deleteBaremetalRCT:

This button will delete an existing RCT. (See Baremetal_delete_RCT.png.)

a new global setting for "baremetal.internal.storage.server.ip":

This global setting behaves the same as all other global settings; it is not clear that any UI change is needed beyond the standard Global Settings page.

IP Clearance

  • Depending on the switch vendor, IP clearance may be needed if any vendor-private library is used

Difference between baremetal basic zone and baremetal advanced zone

As CloudStack divides its networking model into basic zones and advanced zones, there are also some differences between a baremetal basic zone and a baremetal advanced zone.

  1. No need to set up an external DHCP/PXE server in an advanced zone. A baremetal basic zone (see Baremetal Kickstart) needs the admin to set up an external DHCP/PXE server and register it with CloudStack. This is not necessary for a baremetal advanced zone: the CloudStack virtual router has been enhanced to provide the PXE/DHCP service, and whenever a new baremetal instance is created, a new virtual router is created automatically by CloudStack if there is not already one in the network.

  2. A supporting hypervisor cluster is needed in an advanced zone. As the CloudStack virtual router can only be created on a hypervisor-based host, in an advanced zone a baremetal cluster needs a supporting hypervisor cluster (a VMware cluster at this time) to start the virtual router. This is not necessary for a basic zone, which uses an external DHCP/PXE server

  3. The network topology is different. Generally speaking, in a basic zone baremetal instances sit on the same layer 2 network, and the layer 3 on top of that layer 2 uses a gateway provided by the customer's infrastructure, which means baremetal instances are usually reachable from outside. In an advanced network, all baremetal instances sit behind the source NAT of the virtual router, which means they are not directly reachable unless traffic comes from the same subnet. For details of the difference in network topology between basic zones and advanced zones, please refer to the CloudStack admin guide.

 

Appendix

Appendix A:

Appendix B: 


2 Comments

  1. 1. It would be better to have a Network offering with “PXE & DHCP services provided with VR as the service provider and all other applicable services” made available out of the box.

     Is this something we plan on doing?

     

    2. Can we explicitly call out that there is no support for shared networks?

     

    3. Also it would be very helpful to document the high-level workflow involved when deploying a baremetal VM (to highlight that there would be additional calls made to the VR for PXE and for creating the source NAT, and no need for the createVolume / startVM calls that we typically have with usual VM deployments)

  2. There is a new user - baremetal-system-account - that is being introduced.

    Provide information on how this user account is being used for providing Baremetal services in Advanced zone.

    Following are some of the questions relating to the creation of this account:

    1. This account is created as a regular user and is seen when listing accounts and users. This may be confusing to admins who are used to seeing only one "admin" user created out of the box.

    2. In the event of this user being deleted, is there going to be a disruption in providing Baremetal services? How will this case be handled?

    3. Can this user be created only when a Baremetal host is being added in an advanced zone?