Cisco Catalyst SD-WAN Analytics

As more and more Applications and workloads are being shifted to the Cloud, Enterprises are tasked with maintaining and operating these applications.

This is where Cisco SD-WAN Analytics, together with SD-WAN, can help: it provides increased visibility and insights for better control, and lets you troubleshoot issues much faster when they occur.

SD-WAN Analytics is broken into three pillars:

  • Visibility
  • Troubleshooting
  • Predictive Networking

Edge routers send their data to SD-WAN Manager, which in turn uses secure APIs to transfer the data to Analytics. Analytics is hosted in the AWS Cloud.

Visibility

The Summary Dashboard is the landing page/homepage when you log in to Manager. It provides a quick snapshot of all your applications and your network, which allows Enterprises to quickly assess the health of their overall network and applications.

It is organised into 4 key widgets:

  • Applications
  • Sites
  • Circuits
  • Users

Each widget displays the change in its metric compared to the previous time period. The time period can be from 12 hours to 1 month. You can also choose to view data for up to 7 days within the past 3 months.

The Summary Dashboard also allows the user to view a specific site instead of the whole SD-WAN network. This allows the user to see which applications are being used, the circuit health and the clients on the Service side VPN.

The Applications Dashboard allows the user to view the application QoE, with a score based on how well the application is performing.

1 – Application summary by QoE, with health score and usage statistics for each app, shown in the GUI.

2 – An aggregated view of how a group of similar applications is performing.

3 – Trending applications that are showing a drop or rise in QoE.

4 – Compare and analyse usage and QoE across different apps over time.

The visualisation diagram shows the most-used remote sites for a specific application at a specific site. It also shows how traffic is being distributed by QoE and which Colours (Transports) are being used for that application.

Troubleshooting

On the Dashboard Overview you can view the bottom-performing applications, using Salesforce as an example.

To gain further insights, you can utilise the SD-WAN Underlay Visibility feature that allows you to drill all the way down to the SD-WAN tunnel’s underlay path and view metrics such as loss and latency on a hop by hop basis. This capability will also be available on SD-WAN Manager on an on-demand basis while SD-WAN Analytics will offer a historical view of your underlay path.

Predictive Insights/WAN Insights

This is a Cisco product which is a close collaboration between SD-WAN Analytics and ThousandEyes. It is available on v20.12.x.

WAN Insights provides predictive path recommendations. It monitors network path performance over a historical period (minimum 24 hours) and then applies predictive data modelling to it. Essentially, it forecasts likely issues in the future and recommends alternate network paths for applications.

Enterprises can then take advantage of these recommendations to fine-tune their SD-WAN policies and avoid future performance issues, which results in an optimised application experience.

Forecast future WAN Bandwidth needs

Using Artificial Intelligence and Machine Learning, you can forecast bandwidth usage. The dotted line is the forecasted bandwidth for a circuit and the solid line is the actual bandwidth of the circuit. This is especially useful for MSPs or ISPs that provide transports as well as an SD-WAN managed service.

SD-WAN AppQoE

AppQoE’s main goal is to improve application Quality of Experience.

AppQoE Features in SD-WAN

BFD

BFD is used to check path liveliness and to measure the quality of the link: whether the link is up or down, and its loss, latency and jitter.

BFD runs between the WAN Edges as well as the Edge Cloud routers.

  • Within IPSec tunnels
  • Echo mode only
  • As soon as the IPsec tunnel is established, BFD is activated
  • There is NO option to disable this

BFD uses Hello intervals, poll interval and multiplier for detection.

Application Aware Routing/Enhanced Application Aware Routing

I have already written a Blog about EAAR (http://jaychou.co.uk/?p=613) – This section is based on standard AAR.

A good way to understand AAR is a scenario with multiple Transports such as Biz-Internet, Internet and LTE, where an issue with one of the Transports would impact the user experience. With AAR, you create a threshold for a specific Application, so that if the transport does not meet the SLA, traffic switches to another Transport (the fastest switchover is 10 minutes, unless you are using EAAR, where it is 10 seconds).

AAR is measured against Latency/Jitter/Loss. When the user configures the Application Aware Routing Policy, they can set the threshold that must be crossed before the transport switches over. This keeps traffic SLA-compliant across the SD-WAN fabric.

BFD is used for AAR, which has two timers:

  • BFD timer on Transport Tunnels: this defines the BFD frequency, with settings such as the BFD colour, the hello-interval in milliseconds and the multiplier.
  • BFD timer for AAR: this defines how often BFD polls all data plane tunnel statistics and is used to collect packet latency, loss, and jitter.

Forward Error Correction (FEC)

FEC uses the XOR operation. XOR-ing a value with itself gives zero, and XOR-ing a result with one of its inputs gives back the other input. For example, 1010 XOR 0110 = 1100, and 1100 XOR 0110 = 1010.

This is how FEC operates: XOR allows FEC to create a parity packet from a group of packets, and the receiver uses that parity packet to reconstruct a lost packet.

FEC provides the following:

  • Protects against packet loss
  • Operates per-tunnel
  • Supports multiple transports
  • Can be invoked as and when required
  • Applied within the Data Policy

FEC can only reconstruct 1 packet out of 4.
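The parity idea can be sketched in a few lines of Python. This is only an illustration of XOR parity over a block of four equal-length payloads; the function names (`make_parity`, `recover_lost`) and the sample payloads are assumptions made for the example, and it does not claim to mirror how a WAN Edge implements FEC internally.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """Sender side: XOR every packet in the block into one parity packet."""
    return reduce(xor_bytes, packets)


def recover_lost(received: List[Optional[bytes]], parity: bytes) -> List[Optional[bytes]]:
    """Receiver side: rebuild at most one missing packet (marked as None)."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can only rebuild one lost packet per block")
    if missing:
        survivors = [p for p in received if p is not None]
        # parity XOR all surviving packets leaves exactly the missing packet
        rebuilt = reduce(xor_bytes, survivors, parity)
        received = list(received)
        received[missing[0]] = rebuilt
    return received


block = [b"PKT1", b"PKT2", b"PKT3", b"PKT4"]
parity = make_parity(block)
damaged = [b"PKT1", None, b"PKT3", b"PKT4"]  # second packet lost in transit
restored = recover_lost(damaged, parity)
```

With a single parity packet per block, losing two packets from the same block leaves nothing to solve for, which is why the sketch raises an error in that case.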

Packet Duplication

As the name suggests, this allows duplicating packets for critical traffic/applications such as credit card or ATM transactions, and sending the duplicated packets over a second path.

This works when there is little critical traffic compared to the capacity of the network. If there are multiple circuits then SD-WAN will choose the best transport to replicate the packets to, best meaning the one with the least packet loss.

When transferring, the packets of the primary tunnel are duplicated and sent simultaneously. The secondary/duplicate tunnel is chosen based on MTU: duplication happens only if the secondary tunnel MTU is greater than or equal to that of the primary, to avoid fragmentation. The receiving router forwards whichever copy arrives first to the LAN, whether it is the duplicate or the original, and drops the other one.
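A rough sketch of the two decisions described above, written in Python. It assumes each packet carries a sequence number that lets the receiver recognise copies, and it combines the MTU rule and the lowest-loss rule into one selection function; the tunnel names, the tuple layout and both function names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def pick_secondary(primary_mtu: int, candidates: Dict[str, Tuple[int, float]]) -> Optional[str]:
    """Pick the tunnel for duplicates: MTU must be >= primary, then lowest loss wins.

    candidates maps a tunnel name to (mtu, loss_percent).
    """
    eligible = {name: loss for name, (mtu, loss) in candidates.items() if mtu >= primary_mtu}
    if not eligible:
        return None  # skip duplication rather than risk fragmentation
    return min(eligible, key=lambda name: eligible[name])


def deliver_first_copy(arrivals: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Forward the first copy of each sequence number and drop the later copy."""
    seen: Set[int] = set()
    forwarded = []
    for seq, tunnel in arrivals:
        if seq in seen:
            continue  # the other copy already reached the LAN
        seen.add(seq)
        forwarded.append((seq, tunnel))
    return forwarded


secondary = pick_secondary(
    1400, {"lte": (1300, 0.1), "internet": (1500, 0.5), "mpls2": (1400, 0.2)}
)
forwarded = deliver_first_copy(
    [(1, "primary"), (1, "secondary"), (2, "secondary"), (2, "primary")]
)
```

In this example `lte` has the lowest loss but is excluded because its MTU is smaller than the primary tunnel's, so `mpls2` is selected.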

QoS

Queuing

Queuing is used when Shaping is in use. It allows packets to sit in a queue waiting to be sent out of the egress interface. It uses Weighted Round-Robin, and when packets have to be dropped from the queue it uses Random Early Discard.

Shaping

Shaping is used when you do not want to drop a packet that exceeds the configured shaper rate. Essentially, if there are no more tokens in the bucket, the packet is placed in a queue. The queued packets are serviced using Weighted Round-Robin. Shaping is not supported on sub-interfaces.
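The token bucket behaviour can be illustrated with a toy Python model. This is a sketch under simplifying assumptions: there is a single FIFO queue (so no Weighted Round-Robin between queues), time is passed in explicitly, and the class name, rate and burst values are made up for the example.

```python
from collections import deque
from typing import List


class TokenBucketShaper:
    """Toy shaper: packets that cannot be paid for are queued, never dropped."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.queue: deque = deque()
        self.last = 0.0

    def _refill(self, now: float) -> None:
        # Tokens accumulate at the shaper rate, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def send(self, size: int, now: float) -> List[int]:
        """Offer a packet at time `now`; return the packet sizes released on egress."""
        self._refill(now)
        self.queue.append(size)
        released = []
        # Drain in arrival order while the bucket can pay for the head packet.
        while self.queue and self.queue[0] <= self.tokens:
            pkt = self.queue.popleft()
            self.tokens -= pkt
            released.append(pkt)
        return released


shaper = TokenBucketShaper(rate_bytes_per_s=1000, burst_bytes=1500)
first = shaper.send(1500, now=0.0)   # bucket is full, so the packet goes straight out
second = shaper.send(1000, now=0.5)  # only 500 tokens refilled: the packet waits
third = shaper.send(500, now=1.0)    # 1000 tokens now: the waiting packet leaves first
```

A policer would differ only in the branch where tokens run out: instead of leaving the packet in the queue, it would discard it.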

Policing

Policing is used if you want to drop traffic completely when it does not conform to the policer rate.

Link Bonding

You can bond both transport links together. This essentially means per-packet load sharing, with the receiving host reordering the packets if they are sent out of order.

Non-conforming traffic will spill over to a different circuit.

DSCP Marking and Re-marking

DSCP operates at Layer 3, so as a packet is being mapped into a forwarding class, you can rewrite its DSCP marking to another value.

COS (802.1p) Marking and Re-marking

You can remark COS (Layer 2) frames.

Path Quality and Liveliness Detection

Each WAN Edge router sends BFD Hello packets for path quality and liveliness detection, and the packets are echoed back by the receiving router. The Hello interval and multiplier together determine how many BFD packets need to be lost before the IPsec tunnel is declared down.

The number of hello intervals that fit inside the poll interval determines the number of BFD packets considered when establishing the average path quality for that poll interval.

The App Route multiplier determines the number of poll intervals used to establish the overall average path quality.
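The relationship between the four timers is simple arithmetic, sketched below in Python. The function name and the input values are illustrative assumptions, chosen so that the poll interval matches the 10-minute figure mentioned earlier; check the values configured on your own devices.

```python
def bfd_timing(hello_ms: int, bfd_multiplier: int, poll_ms: int, app_route_multiplier: int) -> dict:
    """Derive the time windows implied by the four timers described above."""
    return {
        # time without replies before the tunnel is declared down
        "tunnel_down_after_ms": hello_ms * bfd_multiplier,
        # BFD samples averaged within one poll interval
        "samples_per_poll": poll_ms // hello_ms,
        # total window behind the overall average path quality
        "sla_window_ms": poll_ms * app_route_multiplier,
    }


timing = bfd_timing(hello_ms=1000, bfd_multiplier=7, poll_ms=600_000, app_route_multiplier=6)
```

With these inputs, a tunnel is declared down after 7 seconds of lost hellos, each poll interval averages 600 samples, and the overall path quality is averaged over 60 minutes.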

TCP MSS Adjust

This is used to avoid the need to fragment packets. Routers on the WAN edge signal the appropriate MTU based on the host/application on the LAN. This is in turn forwarded to the receiving router so that it knows the MTU it needs to use.

Per-Tunnel QoS support on SD-WAN

This allows the site to dynamically adjust the sending rate of its traffic to accommodate lower bandwidth circuits at remote sites.

Cloud onRamp for SaaS

Cloud onRamp allows quality probing towards popular SaaS applications, and the WAN Edge router chooses the best-performing path towards them.

CoR works by using the following 3 components:

  • DNS Resolution
  • Performance Visibility
  • Path Selection

For example, if you had two DIA transports to the SaaS application, CoR monitors the path from the edge to the SaaS application over each of them. It then picks the path with the best-performing metrics, such as loss and delay.

Performance Visibility works by the WAN Edge sending a DNS request in VPN 0 for the pre-configured SaaS application.

DNS requests are duplicated and sent to all transports to get the application server address.

HTTP pings are sent to the application servers, such as the Salesforce servers, on both DIA links for performance measurements.

The result is a score from 0 to 10, with 10 being the best.

  • 0-5 Red
  • 5-8 Yellow
  • 8-10 Green
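As a small illustration of the scoring and path selection, here is a Python sketch. The bands above overlap at 5 and 8, so the sketch assumes a score sitting exactly on a boundary falls into the higher band; the function names, path names and scores are hypothetical.

```python
def score_colour(score: float) -> str:
    """Map a 0-10 score to a colour band.

    Assumption: a score exactly on a boundary (5 or 8) falls into the higher band.
    """
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 8:
        return "green"
    if score >= 5:
        return "yellow"
    return "red"


def best_path(scores: dict) -> str:
    """Return the transport with the highest score."""
    return max(scores, key=lambda name: scores[name])


paths = {"dia-1": 9.2, "dia-2": 6.5}
chosen = best_path(paths)
bands = {name: score_colour(s) for name, s in paths.items()}
```

Here `dia-1` would be chosen for the SaaS application, since it has the higher score and lands in the green band.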

TCP Optimisation

TCP optimisation fine-tunes the processing of TCP traffic, which in turn decreases the round-trip latency and improves throughput.

For example, suppose a client on a high-latency link such as a satellite transport is attempting to connect to a SaaS-based application or server. The TCP handshake is formed from the client to the router, where the TCP connection is terminated. The router then forms another TCP handshake with the remote router, and this WAN Edge to WAN Edge TCP handshake is cached. The remote router forms a TCP handshake with the server.

SD-AVC – Application Visibility and Control

SD-AVC essentially uses NBAR2 to classify and recognise applications. It uses DPI plus different techniques such as:

  • DNS Snooping
  • ML
  • Behavioural classification
  • Learning of main services and servers
  • Customisations

Multi-Region Fabric (formerly known as Hierarchical SD-WAN)

Currently, SD-WAN deployments are delivered as a ‘flat’ network where all Edge routers connect to each other regardless of location and country.

For example (diagram below), you could have multiple sites across Europe as well as sites in Asia. Both regions will be connected to wherever the Controller and Manager are.

Multi-Region Fabric

This is where MRF comes in, and here is how it works. The first iteration of MRF was introduced in v20.7.x.

MRF introduces new terminology and roles. To begin, we have:

  • Border Routers – These sit at the edge of each region, where the region connects to the backbone of the network/middle mile. They are responsible for the routes within the region itself.
  • Core Region – This is the middle mile, where you are expected to have a high-speed backbone network, whether you are traversing the Cloud or just huge network pipes.
  • Edge Routers – Sites with vEdge or cEdge devices.
  • Intra-region – Sites that connect and send traffic within the same region.
  • Inter-region – Sites in different regions that send traffic to the Border Router, which in turn sends it across the backbone to another region.


Another example/simplified diagram of MRF below. MRF also introduces region numbers. The Core region will always be 0. Other regions will need to connect to the Core region with Border Routers. You can compare this with OSPF Areas.

Another new addition is the introduction of Cisco Controllers (vSmarts) in each region. Traditionally you would use either one vSmart or a cluster of vSmarts to serve the whole SD-WAN network. With MRF, each region has its own Controller(s), which serve only the region they belong to.

Benefits of MRF

  • Current SD-WAN acts as a flat overlay model; essentially, sites are connected to each other by site-to-site tunnels.
  • For most use cases a flat overlay model is sufficient. However, for larger Enterprise businesses that operate globally, it introduces some limitations such as:
    • OMP limitations
    • Config complexity
    • Control policy complexity
    • A flat overlay does not scale beyond a certain number of tunnels

Secondary Regions

This feature was released in v20.8.1 and 17.8.1a for IOS-XE.

So far MRF has been introduced with basic multi-regions. With a Secondary Region you can now connect two or more Edge sites in different regions to one Secondary Region.

The example above shows three edge devices connecting directly by forming a single Secondary Region. OMP will always choose the direct path first, so the route via the Border Router path may not be installed in the forwarding table. You can disable the comparison of the number of hops so that the paths become ECMP.

Secondary Region allows:

  • Load balancing using Primary and Secondary region paths.
  • Directing specific Applications to use the Secondary Path, which could have a faster-performing underlay such as a 1Gbps leased line.

Caveats of Secondary Region

  • Applies only to Edge routers, not Border routers.
  • A router can belong to only one Secondary Region.
  • The Controller cannot be part of any primary or access regions; it is recommended to use a separate Controller for a Secondary Region.

Transport Gateway

This feature was released in v20.8.1 and 17.8.1a for IOS-XE.

Transport Gateway is used when Edge routers within a region do not have a direct connection to each other. A Transport Gateway can facilitate this by essentially bridging the two networks together.

Transport Gateway only works for IOS-XE.

Router Affinity

This feature was released in v20.8.1 and 17.8.1a for IOS-XE.

If you have multiple exit paths, you can advertise to the Border Routers to prefer one path over another.

For example, in the diagram above the DC is advertising the two subnets. With the two BRs, you could set BR1 to prefer ER1 and, should ER1 fail, fail over to ER2. For BR2 this is vice versa: it will prefer ER2, then ER1.

https://www.cisco.com/c/en/us/td/docs/routers/sdwan/configuration/hierarchical-sdwan/hierarchical-sdwan-guide/router-affinity.html

Catalyst SD-WAN On-Prem Design

Since the release of version 20.12.x, Cisco has renamed the controllers as follows:


  • vManage is now Manager
  • vSmart is now Controller
  • vBond is now Validator
  • vAnalytics is now Analytics

On prem SD-WAN Design Deployment

With On Prem you can deploy using ESXi or KVM, as VMs or Containers.

The max-control-connections 0 command is used on a transport which cannot establish a control plane connection back to Manager. In other words, the router will not communicate with Manager over that transport; instead, its control plane connections are formed over another transport, such as Internet.

  • The Manager and Controllers use a public color on their tunnel interfaces. This ensures they will always use public IP addresses to communicate with any WAN Edge devices. There is no concept of color on the Validator interface.
  • It is a requirement for Validator to communicate to Controller and Manager through their public addresses so the Validator can learn those IP addresses and pass those public IP addresses to the WAN Edge devices wanting to connect into the overlay.
  • Manager and Controller communicate to each other via their NATed public IP addresses. This is due to their public color configuration and their site ID configurations being different. If their site IDs were equal, they would be communicating via their private IP addresses, bypassing the gateway for that communication.

On-Premise Controller Deployment

A STUN server acts like a proxy. If the controllers are hosted privately, for example in an MPLS environment, and other branch devices sit behind an Internet transport, you could set up another Validator on the Internet which redirects those devices to the private IP addresses of the controllers. This is useful when a new device is being onboarded to SD-WAN and needs to connect back to the Controllers in a private network such as MPLS.

Controller Redundancy/High Availability

Validator

  • Validator redundancy is achieved using an FQDN and A records. It is recommended to spin up Validators in different geographic regions and data centres. This ensures at least one Validator will be available for devices registering to join the overlay.
  • It is always recommended to reference the Validator by FQDN instead of by IP address. In DNS there will be multiple IPs attached to the Validator FQDN, and the device will go through each IP until a successful connection is formed.

Controller

  • It is recommended to use Controllers in different geographic regions if managed from the cloud, or in different geographic locations/data centers if deployed on-premise, to maintain proper redundancy.
  • By default, a WAN Edge router will connect to two Controllers over each transport. If one of the Controllers fails, the other Controller seamlessly takes over handling the control plane of the network.
  • Controllers maintain a full mesh of DTLS/TLS connections to each other, over which a full mesh of OMP sessions is formed. Over the OMP sessions, the Controllers stay synchronized by exchanging routes, TLOCs, policies, services, and encryption keys.
  • By default each WAN Edge can make two control connections in VPN 0.

Controller Affinity

Essentially you can group the Controllers into groups and allow failover, however best practice is to place Controllers in different Regions/DCs with the WAN edge connecting to one Controller in one group and another Controller in another group/DC.

The following is configured on the WAN Edge router:

  • max-omp-sessions 2: the WAN Edge device can attach to up to 2 different Controllers (there is one OMP session established per Controller, regardless of the number of DTLS/TLS sessions formed between two devices).
  • max-control-connections 2: the WAN Edge device can attach to two Controllers per TLOC.
  • controller-group-list 1 2 4: indicates which controller groups the WAN Edge router belongs to, in order of preference. The router is able to connect to controllers that are in the same controller group. The WAN Edge router attempts to attach to all controller groups not explicitly excluded, based on the current state of the controller and the WAN Edge configuration session limits. In this example, the router first attempts to connect to a Controller in group 1 and then one in group 2 in each transport.
  • exclude-controller-group-list 3: indicates to never attach to controller-group-id 3.

If a Controller in controller-group-id 1 becomes unavailable, the WAN Edge router will attempt to connect to another Controller in controller-group-id 1. If controller-group-ids 1 and 2 are both unavailable, the WAN Edge router will attempt to connect to another available group in the controller-group-list (4), excluding controller-group-id 3 or any other group defined by the exclude-controller-group-list command. If no other controller groups are listed in the controller-group-list, the router loses connection to the overlay.
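The selection behaviour described above can be approximated with a short Python sketch. It is a simplified model, not the router's actual algorithm: it takes one reachable Controller per group in preference order until the session limit is hit, and the function name, controller names and the `reachable` structure are all hypothetical.

```python
from typing import Dict, List


def select_controllers(
    group_list: List[int],
    excluded: List[int],
    reachable: Dict[int, List[str]],
    max_omp_sessions: int,
) -> List[str]:
    """Walk the groups in preference order and take one reachable Controller from each."""
    chosen: List[str] = []
    for group in group_list:
        if len(chosen) == max_omp_sessions:
            break
        if group in excluded:
            continue  # never attach to an excluded group
        candidates = reachable.get(group, [])
        if candidates:
            chosen.append(candidates[0])
    return chosen


all_up = {1: ["ctrl-1a", "ctrl-1b"], 2: ["ctrl-2a"], 3: ["ctrl-3a"], 4: ["ctrl-4a"]}
normal = select_controllers([1, 2, 4], [3], all_up, max_omp_sessions=2)

# ctrl-1a has failed and group 2 is down entirely
partial = {1: ["ctrl-1b"], 2: [], 3: ["ctrl-3a"], 4: ["ctrl-4a"]}
degraded = select_controllers([1, 2, 4], [3], partial, max_omp_sessions=2)

# only the excluded group is still reachable
isolated = select_controllers([1, 2, 4], [3], {3: ["ctrl-3a"]}, max_omp_sessions=2)
```

In the degraded case the router falls back to the remaining Controller in group 1 and then to group 4. In the last case the list is empty, which corresponds to the router losing its connection to the overlay.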

Manager Network Management System (NMS)

  • All Managers in a cluster operate in Active mode.
  • A cluster provides redundancy against a single Manager failure, but not against failure at the cluster level.
  • Clustering across geographic locations is not recommended as it requires 4ms or less latency, so members of a cluster should reside at the same site.
  • Redundancy is achieved with an active Manager and a backup in standby mode.
  • The general rule of thumb is that for fewer than 2000 routers you run one Manager in Active and another Manager in standby.
  • For more than 2000 routers, run Manager as a cluster with another cluster in standby, in two different geographic locations.
  • Depending on the network, application visibility and statistics can be CPU intensive on Manager, thus reducing the number of WAN Edge routers supported by a single Manager.
  • To prefer a specific tunnel interface to use to connect to Manager, use a higher preference value. Try to use the highest bandwidth link for the Manager connection and avoid cellular interfaces if possible. A zero value indicates that tunnel interface should never connect to Manager. At least one tunnel interface must have a non-zero value.

Manager clustering

  • When clustering, in addition to the two interfaces for VPN 0 and VPN 512, you need a third interface to connect and sync to the other Manager servers within the cluster: at least 1Gbps, with 10Gbps recommended (latency of 4ms or less).
  • If deploying on ESXi, use VMXNET3 adapters as they support 10Gbps.
  • In a cluster, the configuration and statistics services should run on at least 3 Managers, and each service must run on an odd number of Manager instances to ensure data consistency during write operations.

Disaster Recovery

  • Validator and Controller are stateless, so snapshots can be made before any maintenance or config changes, or their config can be copied and saved if running in CLI mode.
  • Manager is stateful, so a backup cannot be deployed in active mode. Snapshots should be taken and the database backed up regularly.
  • When you have an active and a backup Manager in two different DCs, you will have Validators and Controllers in both DCs too, so the Manager will establish connections with whichever Validator and Controller responds first.
    • Administrator-triggered failover (Manager cluster) (recommended)– Starting in the 19.2 version of Manager code, the administrator-triggered disaster recovery switchover option can be configured. Data is replicated automatically between the primary and secondary Manager clusters. When needed, a switchover is manually performed to the secondary Manager cluster.

Controller Deployment Examples

  • Minimal controller design (<= 2000 devices) – this design contains 1 active and 1 standby Manager, 2 Validators, and 2 Controllers, split between two different regions.
  • A second design contains 3 Validators, 3 Controllers, and 1 active and 1 standby Manager. Controller affinity is used so WAN Edge devices connect to the Controllers in the two closest geographical areas (North America and Europe, or Europe and Asia as examples).
  • A third design contains 1 active and 1 standby Manager cluster, each with 3 Manager instances. One Manager in the cluster could be disabled and the rest of the cluster could still support the WAN Edge devices. It also includes 4 Validators and 4 Controllers, split between multiple sites within a region or globally. Controller affinity is used so WAN Edge devices can connect to Controllers in the two closest geographical areas.

https://www.cisco.com/c/en/us/td/docs/solutions/CVD/SDWAN/cisco-sdwan-design-guide.html

Enhanced Application Aware-Routing

In version 20.12, Catalyst SD-WAN introduced Enhanced Application Aware-Routing. The issue with standard AAR (Application Aware-Routing) is that it uses BFD tunnel performance measurements, so it can take from 10 to 60 minutes to detect SLA breaches and fail over. If you start tuning the performance timers then this could create false positives.

AAR is measured based on:

  • Loss – The number of BFD echoes that failed to reply
  • Latency – How long it takes to receive the BFD echo and Hello (RTT)
  • Jitter – The variation in packet arrival times, which also captures the irregularity of packets as they are transmitted and received.

With EAAR, improvements have been made which are the following:

  • Performance metrics (Loss/Latency/Jitter) are improved by introducing inline data. Inline data is traffic that is inspected at the edge of the network: instead of traffic being routed to a central location for analysis and security checks, it is inspected, and forwarding/data plane decisions are made, at the edge. Loss measurement uses per-queue Adaptive QoS metrics, which include per-queue path loss, so it can differentiate between local loss and loss on the WAN circuit. Latency is RTT, measured with a patented method that inserts metadata for measurement, whereas jitter is measured unidirectionally.
  • The performance poll-interval has decreased to a minimum of 10s; as mentioned above, the minimum for standard AAR is 10 minutes. So if a threshold is breached, it takes a matter of seconds to react instead of minutes.
  • SLA Dampening – Same principle as BGP route dampening. Essentially, if the SLA is being breached back and forth, the transport is not re-added straight away; it needs to stabilise first before being added back. This helps stability and prevents disruptions.
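The dampening idea can be sketched in Python as follows. This is an illustration of the general principle only: the 30-second window, the sample format and the function name are assumptions, and the sketch does not claim to reproduce the actual EAAR algorithm.

```python
from typing import List, Tuple


def dampened_usable(samples: List[Tuple[int, bool]], dampening_s: int) -> List[bool]:
    """For each (time_s, meets_sla) sample, report whether the path may carry traffic.

    A path is removed as soon as it breaches the SLA, and is only re-added once
    it has met the SLA continuously for `dampening_s` seconds.
    """
    usable = True
    stable_since = None  # start of the current run of good samples after a breach
    result = []
    for t, ok in samples:
        if not ok:
            usable = False
            stable_since = None  # any breach restarts the stabilisation window
        elif not usable:
            if stable_since is None:
                stable_since = t
            if t - stable_since >= dampening_s:
                usable = True
        result.append(usable)
    return result


samples = [
    (0, True), (10, False), (20, True), (30, False),
    (40, True), (50, True), (60, True), (70, True),
]
states = dampened_usable(samples, dampening_s=30)
```

Without dampening, the path would have been re-added at 20 seconds and removed again at 30 seconds. With it, the path stays out of service until it has been stable from 40 to 70 seconds.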

Caveats

EAAR is available on IOS-XE 17.12.1a. If one side has EAAR disabled or is running an older version, EAAR will not work and the tunnel will default to standard AAR.

If both sides are running 17.12.1a but EAAR is not enabled, they will default to AAR.

https://www.cisco.com/c/en/us/td/docs/routers/sdwan/configuration/policies/ios-xe-17/policies-book-xe/m-enhanced-application-aware-routing.pdf

Configuration Groups

Configuration Groups were introduced in v20.9.x to address the lack of flexibility in Feature Templates. With Feature Templates you could assign templates to different devices, but they lacked flexibility. Now, with a Configuration Group (image below from the Cisco site), you can create a group consisting of Feature Profiles that you can mix and match for each device. Features let the user add or remove items such as MPLS, LTE or Internet under the WAN Feature Profile.

In the example below, if you had groups of devices within a region that required the same Features, such as Logging and Banner, you could share the same Feature Profile across separate Configuration Groups. Likewise, you could create a single Feature Profile, with its Features, for one set or group of devices.

The example below shows:

  • Feature Profile – Transport Profile 1 – Only the West Coast Configuration Group devices are using it.
  • Feature Profile – System Profile – The Logging and Banner Features are being used by West Coast and East Coast, but in 2 different Configuration Groups.

Transport Profile = Colour or the transports you will be using such as MPLS, Business Internet, and which VPN it is under (0 Transport or Management 512).

System Profile = Consists of AAA, Banner, BFD, Logging, NTP etc

Service Profile = Consists of the Overlay Service VPNs, such as 1-500 etc

Each Feature you choose to add into a Feature Profile is called a ‘Parcel’. So if you wanted to set up Logging, you would include the Logging Parcel in the System Profile (Feature Profile).

From Cisco Website – Definitions

The Configuration Group feature provides a simple, reusable, and structured approach for the configurations in Cisco Catalyst SD-WAN.

  • Configuration Group: A configuration group is a logical grouping of features or configurations that can be applied to one or more devices in the network managed by Cisco Catalyst SD-WAN. You can define and customize this grouping based on your business needs.
  • Feature Profile: A feature profile is a flexible building block of configurations that can be reused across different configuration groups. You can create profiles based on features that are required, recommended, or uniquely used, and then put together the profiles to complete a device configuration.
  • Feature: A feature profile consists of features. Features are the individual capabilities you want to share across different configuration groups.

Restrictions for Configuration Groups

  • You can associate a device to either a configuration group or a device template, but not both.
  • You can add a device to only one configuration group.
  • You can add only one tag rule to a configuration group.
  • (Minimum supported release: Cisco Catalyst SD-WAN Manager Release 20.12.1) You can only apply the dual device configuration group to a site with two or less devices. For additional devices in the same site, use a single device configuration group.

https://www.cisco.com/c/en/us/td/docs/routers/sdwan/configuration/system-interface/ios-xe-17/systems-interfaces-book-xe-sdwan/configuration-groups.html