Massive Cisco SD-WAN Bug CSCvu29389

Early this week, our SD-WAN “Powered by Cisco” started having strange issues that manifested as VRRP being stuck in the INIT/INIT state and random routers becoming unreachable over the normal data plane.

The routers still had control connections and were accessible by hopping through vManage, because the control plane was up. The command from vManage is: request execute ssh username@vedgeIP

My setup is with 2 routers at each remote site with TLOC extensions between them.

In my configuration, track-omp was configured under VRRP:
vrrp 2
priority 110
track-omp
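For context, here is a minimal sketch of where that sits on a vEdge. The VPN number, interface name, and addresses below are illustrative, not from my actual config:

vpn 1
 interface ge0/2
  ip address 10.10.10.2/24
  vrrp 2
   priority 110
   track-omp
   ipv4 10.10.10.1
  !
 !
!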

Even with OMP routes present, VRRP would not transition over to the backup router.
show omp routes
10   0.0.0.0/0   omp   -   -   -   -   256.257.258.259   biz-internet      ipsec   -
10   0.0.0.0/0   omp   -   -   -   -   256.257.258.260   public-internet   ipsec   -


Meanwhile, show bfd sessions returned no output.

Thinking it was related to OMP tracking being stuck in the INIT/INIT state, we removed the track-omp configuration from all our routers, but VRRP would just stay active on the master router. We still had random sites drop the next day.

You can get access back by issuing clear control connections, but that is very reactive. After another day of devices disconnecting randomly, we decided to reboot all of our routers. Rebooting the routers fixed the issue for now; we had no random drops the day after rebooting our entire environment.

As of this writing, bug CSCvu29389 has no fix or patch available. You can't even read about this bug publicly, so it must be internal only.

Weird dhcp-helper SVI configuration

Recently I was troubleshooting some DHCP issues on a VLAN. While looking at the configuration, I saw the following. (Can you spot the weirdness? IPs changed to protect the innocent.)

ip dhcp relay address 17.9.19.10
ip dhcp relay address 17.9.19.4
ip dhcp relay address 17.9.21.30
ip dhcp relay address 17.9.19.10

I didn't think it was possible to have the same IP listed as a helper more than once, but there it was. Anyone else ever seen something like this?

I fixed it by removing all the relay addresses and putting back only the ones I wanted, roughly as sketched below.
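This is roughly what the cleanup looked like. The SVI name is a placeholder, and the relay addresses are the sanitized ones from the capture above:

interface Vlan100
  no ip dhcp relay address 17.9.19.10
  no ip dhcp relay address 17.9.19.4
  no ip dhcp relay address 17.9.21.30
  ip dhcp relay address 17.9.19.10
  ip dhcp relay address 17.9.19.4
  ip dhcp relay address 17.9.21.30

If the duplicate entry somehow survives the first "no" command, just issue it again before re-adding the addresses you want.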

Cisco vEdge 1000 CLI software upgrade error

I ran into an issue where I pulled a vEdge 1000 router off the shelf, where it had sat for a couple of years, and of course it would not connect to vManage. The reason is that the old router was on 16.2 code while my vManage instance is on 18.3 code, so in this case the vEdge 1000 needs to get to at least version 17.2.

To get the software upgraded, I needed to use the USB port to get the upgrade file onto the vEdge 1000 router.

These are the steps I followed:

STEP 1: Enable the USB slot on the vEdge 1000.

vedge# conf t
Entering configuration mode terminal
vedge(config)# system
vedge(config-system)# usb-controller
vedge(config-system)# commit
The following warnings were generated:
  'system usb-controller': For this configuration to take effect, this command will cause an immediate device reboot
Proceed? [yes,no] yes

 

STEP 2: Verify that USB controller is enabled:
vedge# show running-config system usb-controller
system
 usb-controller

vedge# show hardware environment | tab

HW CLASS             HW ITEM                  HW DEV INDEX  STATUS  MEASUREMENT
--------------------------------------------------------------------------------
Temperature Sensors  DRAM                     0             OK      39 degrees C/102 degrees F
Temperature Sensors  Board                    0             OK      35 degrees C/95 degrees F
Temperature Sensors  Board                    1             OK      36 degrees C/97 degrees F
Temperature Sensors  Board                    2             OK      34 degrees C/93 degrees F
Temperature Sensors  Board                    3             OK      34 degrees C/93 degrees F
Temperature Sensors  CPU junction             0             OK      47 degrees C/117 degrees F
Fans                 Tray 0 fan               0             OK      Spinning at 5040 RPM
Fans                 Tray 0 fan               1             OK      Spinning at 4980 RPM
PEM                  Power supply             0             OK      Powered On: yes; Fault: no
PEM                  Power supply             1             Down    Powered On: no; Fault: no
PIM                  Interface module         0             OK      Present: yes; Powered On: yes; Fault: no
USB                  External USB controller  0             OK      2 USB Ports

STEP 3: Copy the vEdge MIPS image to the USB stick (formatted with a FAT filesystem). [NOTE: A 2 GB USB stick works; a 100 GB stick does not.] A sketch of doing the copy from a Linux workstation follows.
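This is just one way to prepare the stick, assuming a Linux workstation and that the stick shows up as /dev/sdX (a placeholder; confirm the real device with lsblk before formatting anything):

# format the stick with a FAT filesystem (destroys anything already on it)
sudo mkfs.vfat /dev/sdX1
# mount it, copy the image downloaded from Cisco, then unmount so the write completes
sudo mount /dev/sdX1 /mnt
sudo cp viptela-17.2.5-mips64.tar.gz /mnt/
sudo umount /mnt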

 

STEP 4: Insert the USB stick into the vEdge.

tail -100 /var/log/kern.log

If you tail /var/log/kern.log, you should see messages like the ones below, and the stick will be auto-mounted. If it is not, remove the USB stick, reboot the node, and insert the stick again once the device is back up.

kern.info: Jun 21 16:05:33 vedge kernel: usb-storage 3-1:1.0: USB Mass Storage device detected
kern.info: Jun 21 16:05:33 vedge kernel: scsi3 : usb-storage 3-1:1.0
kern.notice: Jun 21 16:05:34 vedge kernel: scsi 3:0:0:0: Direct-Access     Kingston DataTraveler 2.0 1.00 PQ: 0 ANSI: 2
kern.notice: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] 3913664 512-byte logical blocks: (2.00 GB/1.86 GiB)
kern.notice: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] Write Protect is off
kern.debug: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] Mode Sense: 03 00 00 00
kern.err: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] No Caching mode page found
kern.err: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
kern.err: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] No Caching mode page found
kern.err: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
kern.info: Jun 21 16:05:34 vedge kernel:  sdb: sdb1
kern.err: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] No Caching mode page found
kern.err: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] Assuming drive cache: write through
kern.notice: Jun 21 16:05:34 vedge kernel: sd 3:0:0:0: [sdb] Attached SCSI removable disk

vedge# vshell

vedge:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
rootfs          5.9G   35M  5.6G   1% /
none             64K     0   64K   0% /dev
/dev/sda1      1013M  116M  847M  12% /boot
/dev/loop0       64M   64M     0 100% /rootfs.ro
/dev/sda2       5.9G   35M  5.6G   1% /rootfs.rw
aufs            5.9G   35M  5.6G   1% /
tmpfs            64K     0   64K   0% /dev
shm             1.5G   24K  1.5G   1% /dev/shm
tmp             1.5G  488K  1.5G   1% /tmp
tmpfs           1.5G  120K  1.5G   1% /run
tmpfs           1.5G  120K  1.5G   1% /run/netns
/dev/sdb1       1.9G  100M  1.8G   6% /media/sdb1

Verify the code is on the USB stick and visible to the vEdge

vedge:~$ cd /media/sdb1

vedge:/media/sdb1$ dir

System\ Volume\ Information  viptela-17.2.5-mips64.tar.gz

STEP 5: Copy the image to /home/admin.

vedge:~$ cd /media/sdb1/

vedge:/media/sdb1$ ls

System Volume Information  viptela-17.2.5-mips64.tar.gz

vedge:/media/sdb1$ cp viptela-17.2.5-mips64.tar.gz /home/admin

vedge:/media/sdb1$ exit

STEP 6: Activate the new image

vedge# request software install /home/admin/viptela-17.2.5-mips64.tar.gz reboot
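After the reboot, it is worth confirming that the new image actually took before trying to bring the device into vManage. A quick sanity check (just a sketch; the exact output format can vary slightly by release):

vedge# show software

The newly installed version should be listed with ACTIVE set to true.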

 

What do you do if you get the following error?

vedge# request software install /home/admin/viptela-17.2.10-mips64.tar.gz
gzip: invalid magic
tar: Child returned status 1
tar: Error is not recoverable:

This is very frustrating because the error is extremely vague. The short answer is to re-download the image file and try again. But first, verify that the copy from the USB stick to the /home/admin directory actually completed. Then it will work (or at least it did for me).
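A quick way to confirm the copy (and the original download) is intact is to checksum the file from vshell and compare it against the MD5 published on the Cisco software download page. This assumes md5sum is available in the vEdge's vshell, which is a standard Linux shell:

vedge# vshell
vedge:~$ md5sum /home/admin/viptela-17.2.10-mips64.tar.gz

If the hash does not match the one on the download page, the file is incomplete or corrupt.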

Cisco vEdge software upgrade failures

Recently I've been upgrading vEdge 1000 routers to 18.4.302 code. My last batch was 22 successes and 36 failures.

Those failures required connecting to the console of each device, where I found either a message or just a blank screen, and pressing the "Enter" key. After that, they booted successfully into the new code. Very strange for sure.

This turns what used to be a fairly straightforward and smooth process into one that takes way too much time; in my case, two hours to do 58 devices. The good thing is that I had access to all of those sites' console ports via an out-of-band network.

vEdge software trials

Let’s just say that since Cisco purchased and “integrated” Viptela into their ecosystem, things have been dicey.

Although Cisco has come out with more frequent updates to the software, things have been less stable.

Two prime examples:
1. 18.2 code was released and VRRP broke for a lot of customers. Sure, the issue was limited to configurations where multiple subinterfaces used the same VRRP group number, but it worked before.

2. Since 17.2 code there has been a feature called tracker. I'm currently on 18.3.4 and it still doesn't work correctly. On every case I've opened, I've been told it is (or will be) fixed in the next version. The basic idea is that the tracker sends an HTTP probe out an Internet-facing interface. If the probe fails within the configured timers, the default route is removed and traffic can egress over a different interface.

The feature works well at removing the route, but it rarely puts the route back in. I even had a case with no tracker configured or applied to an interface, yet "show interface | tab" showed a tracker configuration applied. In IOS environments, the equivalent would be accomplished with IP SLA configurations.

Hopefully Cisco will get this feature right. This one issue has proved to be the source of most of my SD-WAN trouble tickets.

When NBD support is 8 days later

 

Not sure what is going on with Cisco parts availability these days, but Next Business Day replacement should not be 8 days later.

I have 2 theories:
1.  They are selling the Viptela devices faster than they can be produced.
2.  Trump tariffs are affecting parts availability.

From Cisco TAC:

Good Day,

We have received your Service Order XXXXXX which is related to Service Request XXXX. Due to part availability, part number(s) VEDGE-1000-AC-K9= are not able to be shipped at this time.

Our Logistics planner has advised the parts should be arriving in our warehouse no later than 10/08/2018.

With stock arriving in our NBD depot on the 10/08/2018​, our Logistics team will still be actively looking into numerous options to identify part(s) for your Service Order.

 

We sincerely apologize for any inconvenience this may cause you or your customer.

 

Please call the Logistics Department at the number listed below with any additional questions or concerns you may have.

Tracker command

Cisco SD-WAN (formerly Viptela) has a tracker command that can be very helpful.

https://sdwan-docs.cisco.com/Product_Documentation/Command_Reference/Configuration_Commands/tracker

The thing is, there is one important requirement for the tracker name that is not called out anywhere: the name must be all lowercase characters.

# tracker Google-Tracker
(config-tracker-Google-Tracker)# endpoint-dns-name google.com
(config-tracker-Google-Tracker)# threshold 150
(config-tracker-Google-Tracker)# interval 30
(config-tracker-Google-Tracker)# commit
Aborted: ‘system tracker Google-Tracker’ : Invalid tracker name

 

# tracker google-tracker
(config-tracker-google-tracker)# endpoint-dns-name google.com
(config-tracker-google-tracker)# threshold 150
(config-tracker-google-tracker)# interval 30
(config-tracker-google-tracker)# commit
Commit complete.

From the documentation:
Tracker Name tracker-name
Name of the tracker. tracker-name can be up to 128 alphanumeric characters. You can configure up to eight trackers. You can apply only one tracker to an interface.
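For completeness, once the tracker commits it gets attached to a transport interface. Here is a minimal sketch, assuming a VPN 0 interface named ge0/0 (the interface name is illustrative); remember that only one tracker can be applied per interface:

vpn 0
 interface ge0/0
  tracker google-tracker
 !
!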

 

 

Cisco Viptela vManage stuck processes

I’ve been using the Viptela product for over a year now.  It is a really good product.  It actually does what the sales people say it will do.

I ran into an issue recently where, while applying template changes to devices (sometimes over 100, sometimes as few as 5), a process would get stuck on vManage.

When a process is stuck, you cannot make any changes to existing applied templates or even bring a device online. The only option I had until today was to call support and have them kill the process. Depending on who takes the ticket, that could be a five-minute wait or a couple of days.

If you need to kill a stuck process on your vManage here is the process:

To see what processes are running on vManage, go to the following URL:

https://<vmanage-ip>/dataservice/device/action/status/tasks

You will see something similar to the following:
{"runningTasks":[{"userSessionUserName":"xxxxx","detailsURL":"/dataservice/device/action/status","tenantName":"Provider","processId":"push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c","name":"Push Feature Template Configuration","tenantId":"default","userSessionIP":"1.2.3.4","action":"push_feature_template_configuration","startTime":1522439470351,"endTime":0,"status":"in_progress"}]}

Look for the "processId" field. The value inside the quotes after "processId": is the process that is running. From the above example, the process is:
push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c

Take the process and add it to the following URL:

https://<vmanage-IP>/dataservice/device/action/status/tasks/clean?processId=

After the equal sign, paste in the ID of your stuck process. From the example above it is: push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c

So the complete URL in this instance is:
https://<vmanage-IP>/dataservice/device/action/status/tasks/clean?processId=push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c

You will then get the following after the process has been terminated:
{“Success”:true}
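If you prefer to do this from a shell instead of a browser, the same two calls can be made with curl. This is only a sketch: it assumes your vManage accepts form-based login at /j_security_check (older releases do; newer ones also require an XSRF token), and the IP and credentials are placeholders:

# log in and save the session cookie
curl -sk -c cookies.txt -d 'j_username=admin&j_password=admin' https://<vmanage-ip>/j_security_check

# list the running tasks and find the stuck processId
curl -sk -b cookies.txt https://<vmanage-ip>/dataservice/device/action/status/tasks

# clear the stuck task (processId taken from the example above)
curl -sk -b cookies.txt "https://<vmanage-ip>/dataservice/device/action/status/tasks/clean?processId=push_feature_template_configuration-3af60b4b-3947-4ab5-b7db-cdd9dc73c88c"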

DHCP issue in tenants

I ran into an issue where EPGs in my common tenant had no problem getting DHCP addresses, but EPGs in all other tenants would not get an address.

The DHCP server is Windows Server 2012 R2, which supports option 82.

After working with TAC, it appears that option 82 sub-option 5 needs to be configured for my ACI tenants to get addresses.

I’ll let you all know when I get it working.