All posts by ciscoweirdness

About ciscoweirdness

Love digging into the weirdness of networking and bizarre behaviors. I've been working with Cisco networking equipment for over 20 years. Thus, I've seen a lot of weird stuff. I'm currently working on an ACI deployment and battling through that weirdness.

Tracker command

Cisco SD-WAN (formerly Viptela) has a tracker command that can be very helpful.

The thing is, there is one important requirement for the tracker name that is not called out anywhere: the name must be all lowercase characters.

# tracker Google-Tracker
(config-tracker-Google-Tracker)# endpoint-dns-name
(config-tracker-Google-Tracker)# threshold 150
(config-tracker-Google-Tracker)# interval 30
(config-tracker-Google-Tracker)# commit
Aborted: 'system tracker Google-Tracker' : Invalid tracker name


# tracker google-tracker
(config-tracker-google-tracker)# endpoint-dns-name
(config-tracker-google-tracker)# threshold 150
(config-tracker-google-tracker)# interval 30
(config-tracker-google-tracker)# commit
Commit complete.

From the documentation:
Tracker Name tracker-name
Name of the tracker. tracker-name can be up to 128 alphanumeric characters. You can configure up to eight trackers. You can apply only one tracker to an interface.
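Putting the documented limit together with the undocumented lowercase requirement, here is a quick pre-check sketch. The hyphen is included because "google-tracker" committed successfully above; other punctuation is untested, so it is rejected here:

```python
import re

# Documented: up to 128 alphanumeric characters.
# Observed (undocumented): must be all lowercase.
# Assumed: hyphens are OK, based on the working "google-tracker" example.
TRACKER_NAME_RE = re.compile(r"^[a-z0-9-]{1,128}$")

def is_valid_tracker_name(name: str) -> bool:
    """Return True if vManage is likely to accept this tracker name."""
    return TRACKER_NAME_RE.fullmatch(name) is not None
```

So `is_valid_tracker_name("Google-Tracker")` flags the name that aborted the commit, while `is_valid_tracker_name("google-tracker")` passes.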




Cisco Viptela vManage stuck processes

I’ve been using the Viptela product for over a year now.  It is a really good product.  It actually does what the sales people say it will do.

Ran into an issue recently when applying template changes to devices, sometimes to over 100 devices and sometimes to as few as 5: a process gets stuck on vManage.

When a process is stuck you cannot make any changes to existing applied templates, or even bring a device online.  The only option I had until today was to call support and have them kill the process.  Depending on who takes the ticket, it could be a wait of 5 minutes or less, or a couple of days.

If you need to kill a stuck process on your vManage here is the process:

To see what process is running on vManage, go to the following URL:


You will see something similar to the following:
{"runningTasks":[{"userSessionUserName":"xxxxx","detailsURL":"/dataservice/device/action/status","tenantName":"Provider","processId":"push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c","name":"Push Feature Template Configuration","tenantId":"default","userSessionIP":"","action":"push_feature_template_configuration","startTime":1522439470351,"endTime":0,"status":"in_progress"}]}

Look for the following: "processId": "xxxxx". The information after "processId": that is within the quotes is the process that is running. From the above example the process is: push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c

Take the process and add it to the following URL:


After the equal sign, paste in your process that is stuck running. From the example above it is: push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c

So the complete URL in this instance is:

You will then get the following after the process has been terminated:
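If you have to do this often, the lookup-and-build steps can be scripted. A sketch, with the caveat that the clean-up endpoint path below is my assumption based on the "equal sign" hint; verify the exact URL against your vManage version before using it:

```python
import json

# Trimmed sample of the running-tasks response shown above.
response_text = '''{"runningTasks": [{"userSessionUserName": "xxxxx",
 "processId": "push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c",
 "name": "Push Feature Template Configuration",
 "status": "in_progress"}]}'''

def stuck_process_ids(text: str) -> list:
    """Extract the processId of every task still in progress."""
    tasks = json.loads(text).get("runningTasks", [])
    return [t["processId"] for t in tasks if t.get("status") == "in_progress"]

def clean_url(vmanage_host: str, process_id: str) -> str:
    # Hypothetical endpoint path: confirm it with Cisco docs/TAC for your release.
    return (f"https://{vmanage_host}/dataservice/device/action/status/clean"
            f"?processId={process_id}")
```

Feed `stuck_process_ids()` the JSON from the status URL, then paste the output of `clean_url()` into your browser (while logged in to vManage).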

DHCP issue in tenants

Ran into an issue where my common tenant EPGs had no issues getting DHCP addresses, but all other tenants would not get DHCP addresses.

The DHCP server is Windows 2012 R2, which supports option 82.

After working with TAC, it appears sub-option 5 needs to be configured for my ACI tenants to get addresses.

I’ll let you all know when I get it working.

vMotioned VMs dropping off the network

When a server is vMotioned to another blade chassis the server can connect to other devices within the EPG but not outside the EPG.

This was occurring for LINUX and Windows servers.

The quick and easy fix is to bounce the network interface on the LINUX servers.  On Windows servers this did not always fix the problem.

What is really happening is that the endpoint location is not being updated correctly in the COOP table on the spines.  And get this: it's a known bug with no fix at the moment.

So how do you fix it inside the fabric?

On your border leafs, run the following command on both of them as close to the same time as possible.

leaf1# bash

leaf1# clear system internal epm endpoint key vrf YOURTENANT:VRFNAME ip IPADDRESS

To verify that the vPC leaf is actually passing the traffic correctly, use the following steps:

Run the following ELAM on the two leafs that the device is connected to, to see if ARP packets are coming in and whether the status shows Triggered. You have to do it on both leafs at the same time because it's a vPC.

1. vsh_lc

2. debug platform internal ns elam asic 0

3. trigger reset

4. trigger init ingress in-select 3 out-select 0

5. set outer l2 dst_mac ffff.ffff.ffff src_mac YOUR DEVICE MAC ADDRESS HERE

6. start

7. status <-- to see if it triggered or stays as Armed // Armed means no traffic has met what was defined in step 5

8. report | egrep “ce_|ar_”

Watch out for DOCKER hosts

Had an issue with endpoint learning that was perplexing.  I traced the MAC address to a VM that was running DOCKER.

Interestingly enough, the IP address that I did the show endpoint for does not exist in the fabric.  I masked the IP addresses so they are not the actual IPs, but you'll see the results.

Leaf_105# show endpoint ip
O - peer-attached H - vtep a - locally-aged S - static
V - vpc-attached p - peer-aged L - local M - span
s - static-arp B - bounce
VLAN/ Encap MAC Address MAC Info/ Interface
Domain VLAN IP Address IP Info
105 vlan-1615 0050.56bf.30d7 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 10.300.112.19 LV po7
common:CM_Primary_PN vlan-1615 10.300.88.40 LV po7
common:CM_Primary_PN vlan-1615 10.300.88.33 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 10.300.156.71 LV po7
common:CM_Primary_PN vlan-1615 10.300.88.20 LV po7
common:CM_Primary_PN vlan-1615 10.300.88.35 LV po7
common:CM_Primary_PN vlan-1615 LV po7
common:CM_Primary_PN vlan-1615 10.400.120.116 LV po7
common:CM_Primary_PN vlan-1615 10.300.112.32 LV po7
common:CM_Primary_PN vlan-1615 10.400.120.42 LV po7
common:CM_Primary_PN vlan-1615 10.300.9.163.106

<80 more lines of the same stuff>

Solution was to check the “enforce subnet check for IP learning” check box in the bridge domain L3 configuration tab.


You can read up on DOCKER fun-ness

This does not occur in "traditional" networks because in ACI the endpoint learning is now in hardware, and it learns IPs in many different ways.

Another ACI bug

Love being the 1st to find these 🙂
The main issue is with the new code version 1.3(1g): binding vCenter to an EPG brings up the expected screen, but there is now a 2nd required field (Primary VLAN) that was not required previously.


Work around options for now:
1. Create the association as dynamic.
2. Include junk info, then modify it.
3. Use the REST API.

Bug ID CSCuz47137
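For workaround 3, the REST call might look something like the sketch below. The fvRsDomAtt class and DN format reflect my understanding of the ACI object model, so verify them against your APIC version:

```python
import json

def vmm_binding_payload(tenant: str, ap: str, epg: str, vmm_domain: str) -> tuple:
    """Sketch of a REST payload for binding a VMM domain to an EPG without
    supplying the Primary VLAN the GUI now insists on. Class names and DN
    format are assumptions from my reading of the object model."""
    dn = f"uni/tn-{tenant}/ap-{ap}/epg-{epg}"
    body = {"fvRsDomAtt": {"attributes": {
        "tDn": f"uni/vmmp-VMware/dom-{vmm_domain}",
        "resImedcy": "immediate"}}}
    return f"/api/node/mo/{dn}.json", json.dumps(body)
```

POST the JSON body to the returned path on your APIC (authentication omitted here).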

EPG learning disabled

If you are getting 1197 errors in your fabric, then the ACI fabric has disabled learning on 1 or more EPGs.

In my case it was caused by MAC flapping from VMware. With the DVS health check enabled (which it is by default), the DVS spams the fabric on each VLAN with the same MAC address. This causes the fabric to disable learning to protect itself.

The VMware KB on it is:

In my case the trace had the following characteristics:

1. All of the non-broadcast protocol 0x8922 packets from Src: Vmware_d0:c3:a0 to Dst: Vmware_d0:07:f8 came in encapsulated on vlan-1510.
2. The VMware broadcast 0x8922 packets were sent in untagged from Src: Vmware_d0:c3:a0 to Dst: Vmware_d0:07:f8.
3. There were random VMware MAC addresses trying to reach 2 specific VMware MAC addresses (00:50:56:d0:44:40 and, 8 bits later, 00:50:56:d0:44:48; 00:50:56:d0:83:f0 and, 8 bits later, 00:50:56:d0:83:f8) using protocol 0x8922, multicasting to [unassigned multicast]. Per the protocol decode they were "playing" Distributed Interactive Simulation (DIS), an IEEE standard for conducting real-time platform-level wargaming across multiple host computers, used worldwide, especially by military organizations but also by other agencies such as those involved in space exploration and medicine.
4. A bunch of 0x8922 packets were being broadcast from Source: Vmware_d0:27:e8 across vlans 49, 55, 57, 59, 61, 62, 98, 107, 131, 132, 133, and 138. This would cause MAC flapping across the vlans.
5. The same source MAC address broadcast without vlan tags.
6. There were a lot of VMs responding to the source MAC address in 1 using vlans 450, 451, 1209, 1212, 1213, 1223, 1230, 1402, and 1424.

I picked one to see if it looped: eth.addr == 00:50:56:d0:c3:a0 showed it was across the vlans. It also looked like a specific source IP address had been used for the ERSPAN session instead of letting the switch use its node ID as the last octet of the address.
The ERSPAN source can be either a specific IP or a subnet prefix. If a specific source IP is configured, all leaf switches in the vPC will use the same IP address as the source IP address in the ERSPAN packet headers.
If a subnet prefix is configured, leaf switches will try to use their own node ID, if possible, as the last octet of the address. This lets you differentiate which leaf switch sent the packet to the destination IP address.

The long and short of it: disable VMware health checks in vCenter for the DVS that is causing the problems.

Update: 24-May-16
VMware released a document about this specific issue after we pointed it out to them.

When you have VC tunnel mode connecting into Cisco ACI, there are some scenarios you need to pay attention to in order to have the right connectivity.

We conducted some testing in DCA-Lab and this is some information to help you understand the nature of the issue.

The nature of this problem is very similar to this VC advisory on working with a layer 2 load balancer/bridging device.

The same also applies to vCloud Director Network Isolation (vCDNI), which is MAC-in-MAC encapsulation.

How to do a packet capture on a leaf

When I say packet capture I literally mean 1 packet that matches specific criteria.  Why would you only want to capture 1 packet?  To show others (mainly server folks) that the fabric is sending the packet to the correct port.

Leaf_103# vsh_lc
module-1# debug platform internal ns elam a 0

module-1(NS-elam)# trigger init ingress in-select <<< Which header to look at
3 4 5 6 7

3 Outerl2-outerl3-outerl4
4 Innerl2-innerl3-innerl4
5 Outerl2-innerl2
6 Outerl3-innerl3
7 Outerl4-innerl4

module-1(NS-elam)# trigger init ingress in-select 3 out-select 0

module-1(NS-elam-insel3)# set outer
arp ipv4 ipv6 l2 l4

module-1(NS-elam-insel3)# set outer ipv4 src_ip ds
dscp dst_ip
module-1(NS-elam-insel3)# set outer ipv4 src_ip dst_ip

module-1(NS-elam-insel3)# start

module-1(NS-elam-insel3)# stat
Status: Triggered
module-1(NS-elam-insel3)# report

NOTE:  Output has been greatly condensed to show only the proof that the packet is coming from the right place to the right place

GBL_C++: [INFO] ip_da: 0000000000000000000000000A072732 <<< Destination IP address in HEX
GBL_C++: [INFO] ip_sa: 0000000000000000000000000A072615 <<< Source IP address in HEX
GBL_C++: [INFO] ip_v6_hbh: 0
GBL_C++: [INFO] ce_da: 0022BDF819FF <<< Destination MAC address
GBL_C++: [INFO] ce_sa: 000C29083F09 <<< Source MAC address

Another ACI bug initiated by me :)

This is a good one: I delete a network and the bridge domain, and the route still lives in the routing table.

Wipe the leafs that have the stale routes using the leaf-specific portion of the instructions found here:

Just to clarify, a full wipe of the fabric is not required, just a wipe of the leafs that contain the “stale” route.