Catalyst SD-WAN On-Prem Design

Since release 20.12.x, Cisco has renamed the controllers as follows (old name -> new name):

vManage -> Manager

vSmart -> Controller

vBond -> Validator

vAnalytics -> Analytics

On prem SD-WAN Design Deployment

With on-prem you can deploy on ESXi or KVM, either as VMs or as containers.

The max-control-connections 0 command is used on a transport that cannot establish a control plane connection back to Manager. In other words, the WAN Edge will not communicate with Manager over that transport. Its control plane connections are formed over another transport instead, such as the Internet.

  • Manager and the Controllers use a public color on their tunnel interfaces. This ensures they always use public IP addresses to communicate with any WAN Edge device. There is no concept of color on the Validator interface.
  • Validator must communicate with the Controllers and Manager through their public addresses. This lets Validator learn those public IP addresses and pass them to the WAN Edge devices that want to join the overlay.
  • Manager and the Controllers communicate with each other via their NATed public IP addresses. This is because they use a public color and have different site IDs. If their site IDs were equal, they would communicate via their private IP addresses, bypassing the gateway for that traffic.
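
The site ID rule in the last bullet can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not how the controllers implement it. The ControlNode type, the peer_address function, and all addresses and site IDs are hypothetical, and both nodes are assumed to use a public color.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlNode:
    """Hypothetical model of a Manager or Controller with a public color."""
    name: str
    site_id: int
    private_ip: str
    public_ip: str  # NATed address seen from outside the site


def peer_address(local: ControlNode, remote: ControlNode) -> str:
    """Return the address `local` would use to reach `remote`.

    Equal site IDs: use the private address (traffic stays behind the gateway).
    Different site IDs: use the NATed public address.
    """
    if local.site_id == remote.site_id:
        return remote.private_ip
    return remote.public_ip


# Example values only (documentation address ranges).
manager = ControlNode("manager", 100, "10.0.0.10", "203.0.113.10")
controller = ControlNode("controller", 200, "10.0.0.20", "203.0.113.20")
print(peer_address(manager, controller))
```

With different site IDs, as in the example values, the function returns the Controller's public address. Giving both nodes the same site ID would return the private address instead.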

On-Premise Controller Deployment

A STUN server acts like a proxy. If the controllers are hosted privately, for example in an MPLS environment, and other branch devices sit behind an Internet transport, you could set up another Validator on the Internet that redirects those devices to the private IP addresses of the controllers. This is useful when a new device is being onboarded to SD-WAN and needs to connect back to Controllers in a private network such as MPLS.

Controller Redundancy/High Availability

Validator

  • Validator redundancy is achieved with an FQDN and multiple A records. It is recommended to deploy Validators in different geographic regions and data centres. This ensures at least one Validator is available to devices registering to join the overlay.
  • Always reference the Validator by FQDN rather than by IP address. DNS will return multiple IP addresses for the Validator FQDN, and the device will go through each one until a connection succeeds.
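
The "try each address until one works" behaviour can be sketched as follows. This is a rough illustration, not the WAN Edge implementation. The function names are hypothetical, the port value 12346 is an assumption used only for the lookup, and the connection attempt is passed in as a callable so the sketch does not pretend to perform a real DTLS handshake.

```python
import socket
from typing import Callable, Iterable, List, Optional


def resolve_a_records(fqdn: str, port: int = 12346) -> List[str]:
    """Return the distinct IPv4 addresses behind an FQDN, in resolver order.

    The port is only needed by getaddrinfo; 12346 is an assumed value.
    """
    infos = socket.getaddrinfo(fqdn, port, family=socket.AF_INET,
                               type=socket.SOCK_DGRAM)
    addresses: List[str] = []
    for info in infos:
        ip = info[4][0]  # sockaddr is (ip, port) for AF_INET
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def first_reachable(ips: Iterable[str],
                    try_connect: Callable[[str], bool]) -> Optional[str]:
    """Walk the address list and return the first IP that accepts a connection."""
    for ip in ips:
        if try_connect(ip):
            return ip
    return None  # no Validator reachable


# Usage with a stand-in connection check: pretend only the second
# Validator (documentation address) is currently up.
validator_ips = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
print(first_reachable(validator_ips, lambda ip: ip == "192.0.2.2"))
```

In a real check, resolve_a_records would supply the list and try_connect would wrap whatever reachability test fits your environment.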

Controller

  • It is recommended to place Controllers in different geographic regions if they are managed from the cloud, or in different geographic locations/data centers if deployed on-premise, to maintain proper redundancy.
  • By default, a WAN Edge router connects to two Controllers over each transport. If one of the Controllers fails, the other seamlessly takes over handling the control plane of the network.
  • Controllers maintain a full mesh of DTLS/TLS connections to each other, over which a full mesh of OMP sessions is formed. Over the OMP sessions, the Controllers stay synchronized by exchanging routes, TLOCs, policies, services, and encryption keys.
  • By default, each WAN Edge can make two control connections in VPN 0.

Controller Affinity

Essentially, you can place the Controllers into controller groups and allow failover between them. Best practice is to place Controllers in different regions/DCs, with each WAN Edge connecting to one Controller in one group and to another Controller in a different group/DC.

The following is configured on the WAN Edge router:

●     max-omp-sessions 2: the WAN Edge device can attach to up to 2 different Controllers (one OMP session is established per Controller, regardless of the number of DTLS/TLS sessions formed between the two devices).

●     max-control-connections 2: the WAN Edge device can attach to two Controllers per TLOC.

●     controller-group-list 1 2 4: indicates which controller groups the WAN Edge router belongs to, in order of preference. The router can connect to Controllers that are in the same controller group. The WAN Edge router attempts to attach to all controller groups not explicitly excluded, based on the current state of the controllers and the WAN Edge session limits. In this example, the router first attempts to connect to a Controller in group 1 and then to one in group 2 on each transport.

●     exclude-controller-group-list 3: indicates to never attach to controller-group-id 3.

If a Controller in controller-group-id 1 becomes unavailable, the WAN Edge router will attempt to connect to another Controller in controller-group-id 1. If controller-group-ids 1 and 2 are both unavailable, the WAN Edge router will attempt to connect to another available group in the controller-group-list (4), excluding controller-group-id 3 and any other group named in the exclude-controller-group-list command. If no other controller groups are listed in the controller-group-list, the router loses its connection to the overlay.
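
The failover order described above can be modelled with a small Python sketch. It is a simplification for reasoning about the configuration, not the router's actual selection algorithm. The select_controllers function, the controller names, and the assumption that at most one Controller is picked per group are all hypothetical.

```python
from typing import Dict, List, Sequence, Set


def select_controllers(available: Dict[int, List[str]],
                       controller_group_list: Sequence[int],
                       excluded_groups: Set[int],
                       max_omp_sessions: int = 2) -> List[str]:
    """Pick Controllers by walking the group list in order of preference.

    `available` maps a controller-group-id to the Controllers in that
    group that are currently reachable. Simplifying assumption: one
    Controller is chosen per group until the session limit is reached.
    """
    chosen: List[str] = []
    for group in controller_group_list:
        if len(chosen) >= max_omp_sessions:
            break
        if group in excluded_groups:
            continue  # never attach to an excluded group
        candidates = available.get(group, [])
        if candidates:
            chosen.append(candidates[0])
    return chosen  # an empty list means the router is cut off from the overlay


# Mirrors the example: controller-group-list 1 2 4, exclude group 3.
reachable = {1: ["ctrl-1a", "ctrl-1b"], 2: ["ctrl-2a"], 3: ["ctrl-3a"], 4: ["ctrl-4a"]}
print(select_controllers(reachable, [1, 2, 4], {3}))

# Groups 1 and 2 down: only group 4 is left, group 3 stays excluded.
print(select_controllers({1: [], 2: [], 3: ["ctrl-3a"], 4: ["ctrl-4a"]}, [1, 2, 4], {3}))
```

Removing ctrl-1a from group 1 makes the sketch fall back to ctrl-1b, which matches the "another Controller in the same group" behaviour in the text.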

Manager Network Management System (NMS)

  • All Manager instances in a cluster operate in active mode.
  • A cluster provides redundancy against the failure of a single Manager, but not against the failure of the whole cluster.
  • Clustering across geographic locations is not recommended because it requires 4 ms or less latency between members, so the members of a cluster should reside at the same site.
  • Redundancy beyond the cluster is achieved with an active Manager (or cluster) and a backup in standby mode.
  • As a general rule of thumb, with fewer than 2000 routers use one active Manager and another Manager in standby.
  • With more than 2000 routers, use a Manager cluster as active and another cluster in standby, in two different geographic locations.
  • Depending on the network, application visibility and statistics can be CPU intensive on Manager, thus reducing the number of WAN Edge routers supported by a single Manager.
  • To prefer a specific tunnel interface for the connection to Manager, give it a higher preference value. Try to use the highest-bandwidth link for the Manager connection and avoid cellular interfaces if possible. A value of zero means that the tunnel interface should never connect to Manager. At least one tunnel interface must have a non-zero value.
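
The sizing rule of thumb from this list can be written down as a small helper. This is a sketch of the rule as stated in this post, not an official sizing tool. The names are hypothetical, and treating exactly 2000 routers as still fitting the single-Manager design is an assumption. As noted above, heavy application visibility and statistics can lower the number a single Manager supports.

```python
from dataclasses import dataclass

# Rule-of-thumb threshold taken from the text; the boundary case is an assumption.
SINGLE_MANAGER_LIMIT = 2000


@dataclass(frozen=True)
class ManagerDesign:
    """Hypothetical summary of an active/standby Manager layout."""
    active: str
    standby: str


def recommend_manager_design(wan_edge_count: int) -> ManagerDesign:
    """Map a WAN Edge count onto the active/standby layout described above."""
    if wan_edge_count < 0:
        raise ValueError("wan_edge_count must not be negative")
    if wan_edge_count <= SINGLE_MANAGER_LIMIT:
        return ManagerDesign(active="single Manager", standby="single Manager")
    return ManagerDesign(active="Manager cluster",
                         standby="Manager cluster in a second location")


print(recommend_manager_design(800))
print(recommend_manager_design(3500))
```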

Manager clustering

  • When clustering, in addition to the two interfaces for VPN 0 and VPN 512, you need a third interface to connect and sync to the other Manager servers within the cluster. It should be at least 1 Gbps, with 10 Gbps recommended (4 ms or less latency).
  • If deploying on ESXi, use VMXNET3 adapters, as they support 10 Gbps.
  • In a cluster, the configuration and statistics services should run on at least 3 Managers, and each service must run on an odd number of Managers to ensure data consistency during write operations.
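
The preference for odd member counts is easier to see with a little arithmetic. The sketch below assumes the services rely on a simple-majority quorum for writes, which is the usual reason for this kind of rule. The function names are hypothetical.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a simple majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still running."""
    return members - quorum(members)


for size in (3, 4, 5):
    print(size, quorum(size), tolerated_failures(size))
```

Under that assumption, a 3-member service tolerates one failure and a 4-member service still tolerates only one, so the fourth member adds cost without adding write availability. Five members are needed to tolerate two failures.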

Disaster Recovery

  • Validator and Controller are stateless, so snapshots can be taken before any maintenance or config changes, or their configuration can be copied and saved if they are running in CLI mode.
  • Manager is stateful, so the backup cannot be deployed in active mode. Snapshots should be taken and the database backed up regularly.
  • When the active and backup Managers are in two different DCs, you will also have Validators and Controllers in both DCs, and these establish connections with whichever Manager is active.
    • Administrator-triggered failover (Manager cluster) (recommended): starting in the 19.2 version of Manager code, the administrator-triggered disaster recovery switchover option can be configured. Data is replicated automatically between the primary and secondary Manager clusters. When needed, a switchover to the secondary Manager cluster is performed manually.

Controller Deployment Examples

  • Minimal controller design (<= 2000 devices): this design contains 1 active and 1 standby Manager, 2 Validators, and 2 Controllers, split between two different regions.
  • A second design contains 3 Validators, 3 Controllers, and 1 active and 1 standby Manager. Controller affinity is used so that WAN Edge devices connect to the Controllers in the two closest geographical areas (North America and Europe, or Europe and Asia, as examples).
  • A third design contains 1 active and 1 standby Manager cluster, each with 3 Manager instances. One Manager in the cluster could be down and the rest of the cluster could still support the WAN Edge devices. It also includes 4 Validators and 4 Controllers, split between multiple sites within a region or globally. Controller affinity is used so that WAN Edge devices connect to Controllers in the two closest geographical areas.

https://www.cisco.com/c/en/us/td/docs/solutions/CVD/SDWAN/cisco-sdwan-design-guide.html
