Cisco Nexus 1000v Module in "Other" state

What you don't need while you are checking your morning emails and drinking your first cup of coffee of the day is to receive an email saying that the VSM for the Nexus 1k has rebooted.

By the time we logged on to the Nexus 1000v, it was back up. "show system redundancy status" showed both VSMs (supervisors) as up and in HA. "show version" and "show logs" definitely revealed a reload though - the uptime was less than an hour, and the logs only went back as far as the reload.

Well that's good then. It's back up, it's in HA. No service disruptions - the VEMs kept forwarding traffic, exactly as they should. Pressure off, let's just figure out why it happened and make sure it doesn't happen again. Cue one of the VMware guys shouting over - "Is there any reason why we can't vMotion any VMs? It says the distributed switch is unavailable!"

Further investigation needed then. And we found this:

```
NEXUS1kV# sh mod
Mod  Ports  Module-Type                       Model               Status
1    0      Virtual Supervisor Module         Nexus1000V          active *
2    0      Virtual Supervisor Module         Nexus1000V          ha-standby
3    332    Virtual Ethernet Module           NA                  other
4    332    Virtual Ethernet Module           NA                  other
5    332    Virtual Ethernet Module           NA                  other
6    332    Virtual Ethernet Module           NA                  other
7    332    Virtual Ethernet Module           NA                  other
8    332    Virtual Ethernet Module           NA                  other
9    332    Virtual Ethernet Module           NA                  other
10   332    Virtual Ethernet Module           NA                  other
11   332    Virtual Ethernet Module           NA                  other
12   332    Virtual Ethernet Module           NA                  other
13   332    Virtual Ethernet Module           NA                  other
14   332    Virtual Ethernet Module           NA                  other
16   332    Virtual Ethernet Module           NA                  other
17   332    Virtual Ethernet Module           NA                  other
18   332    Virtual Ethernet Module           NA                  other
19   332    Virtual Ethernet Module           NA                  other
```

Both supervisors look healthy, but the Status column shows "other" for
every VEM. At this point we took all of the logs, dumps, show techs,
etc. and raised a TAC case. While we were waiting for them to get
back to us, we kept digging.
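Spotting modules stuck in "other" is easy to script. The following is a minimal Python sketch, assuming the `show module` output has already been captured to a string by some other means (an SSH session, a saved file). The function names `parse_show_module` and `unhealthy_vems` are hypothetical, and the row pattern is based only on the output shown in this post, so other software versions may need a different regex.

```python
import re

# One row of `show module` output as it appears in this post:
# module number, port count, module type, model, status.
# Assumption: only the two module types seen in this post occur.
ROW = re.compile(
    r"^\s*(?P<mod>\d+)\s+(?P<ports>\d+)\s+"
    r"(?P<type>Virtual (?:Supervisor|Ethernet) Module)\s+"
    r"(?P<model>\S+)\s+(?P<status>\S+)"
)


def parse_show_module(text):
    """Return a list of dicts, one per module row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            row = m.groupdict()
            row["mod"] = int(row["mod"])
            row["ports"] = int(row["ports"])
            rows.append(row)
    return rows


def unhealthy_vems(rows):
    """Module numbers of VEMs whose status is anything other than 'ok'."""
    return [
        r["mod"]
        for r in rows
        if r["type"] == "Virtual Ethernet Module" and r["status"] != "ok"
    ]


# A few rows in the same shape as the output above (header lines are skipped).
SAMPLE = """\
Mod Ports Module-Type Model Status
1 0 Virtual Supervisor Module Nexus1000V active *
2 0 Virtual Supervisor Module Nexus1000V ha-standby
3 332 Virtual Ethernet Module NA other
5 332 Virtual Ethernet Module NA ok
"""

if __name__ == "__main__":
    print(unhealthy_vems(parse_show_module(SAMPLE)))
```

Run against the sample rows, this reports module 3 as the only VEM not in the "ok" state. Supervisor rows are ignored on purpose, since "active" and "ha-standby" are their normal states.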

A point to note about the show tech: there is also `show system cores`,
which is useful for TAC. This is obtained by configuring a TFTP server in
global config mode with `system cores tftp://10.92.96.187/1kvcores` and then
running `sh system cores` from exec. Also, I always redirect show tech to a
file where I can, so it's not hammering the terminal with output for no
reason, like this: `show tech-support details > ftp://nick@10.92.96.187/1kvtech`

`show svs connections` showed that the VSMs were correctly synced with
vCenter. `sh svs neighbors` showed all of the VEMs.

We stumbled across this blog post from Nonstop
Networks: <http://www.nonstopnetworks.net/blog/?p=322>

It stated that all you have to do is disable the Network Segmentation
Manager feature using "no feature network-segmentation-manager".
Network Segmentation Manager is used for vCloud Director, which is not
used in this environment, so it seemed pretty safe. However, this being
our primary VMware environment, it was far more than my job was worth to
accidentally bring down the VEMs. NFS storage also ran through these
VEMs, so the potential consequence of an outage was far worse than a
simple network outage - all VMs would effectively have their storage
yanked from underneath them.

We sent the blog post to TAC and they eventually labbed it and confirmed
it to be non-disruptive. As soon as we disabled the feature, everything
started to fall back into place:

```
NEXUS1kV(config)# 2014 Dec 17 16:40:30 inf-vs001a-nor %VEM_MGR-2-VEM_MGR_DETECTED: Host esx2006 detected as module 19
2014 Dec 17 16:40:30 inf-vs001a-nor %VDC_MGR-2-VDC_CRITICAL: vdc_mgr has hit a critical error: SPROM data is invalid. Please reprogram your SPROM!
2014 Dec 17 16:40:31 inf-vs001a-nor %VEM_MGR-2-MOD_ONLINE: Module 19 is online
2014 Dec 17 16:40:33 inf-vs001a-nor %VEM_MGR-2-VEM_MGR_DETECTED: Host ESX3102 detected as module 17
2014 Dec 17 16:40:33 inf-vs001a-nor %VDC_MGR-2-VDC_CRITICAL: vdc_mgr has hit a critical error: SPROM data is invalid. Please reprogram your SPROM!
2014 Dec 17 16:40:34 inf-vs001a-nor %VEM_MGR-2-MOD_ONLINE: Module 17 is online
2014 Dec 17 16:40:35 inf-vs001a-nor %VEM_MGR-2-VEM_MGR_DETECTED: Host ESX2001 detected as module 6
2014 Dec 17 16:40:35 inf-vs001a-nor %VDC_MGR-2-VDC_CRITICAL: vdc_mgr has hit a critical error: SPROM data is invalid. Please reprogram your SPROM!
2014 Dec 17 16:40:36 inf-vs001a-nor %VEM_MGR-2-MOD_ONLINE: Module 6 is online
2014 Dec 17 16:40:37 inf-vs001a-nor %VEM_MGR-2-VEM_MGR_DETECTED: Host ESX2002 detected as module 5
2014 Dec 17 16:40:37 inf-vs001a-nor %VDC_MGR-2-VDC_CRITICAL: vdc_mgr has hit a critical error: SPROM data is invalid. Please reprogram your SPROM!
2014 Dec 17 16:40:38 inf-vs001a-nor %VEM_MGR-2-MOD_ONLINE: Module 5 is online

inf-vs001a-nor(config)# show mod
Mod  Ports  Module-Type                       Model               Status
---  -----  --------------------------------  ------------------  ------------
1    0      Virtual Supervisor Module         Nexus1000V          ha-standby
2    0      Virtual Supervisor Module         Nexus1000V          active *
3    332    Virtual Ethernet Module           NA                  other
4    332    Virtual Ethernet Module           NA                  other
5    332    Virtual Ethernet Module           NA                  ok
6    332    Virtual Ethernet Module           NA                  ok
7    332    Virtual Ethernet Module           NA                  other
8    332    Virtual Ethernet Module           NA                  other
9    332    Virtual Ethernet Module           NA                  other
10   332    Virtual Ethernet Module           NA                  other
11   332    Virtual Ethernet Module           NA                  other
12   332    Virtual Ethernet Module           NA                  other
13   332    Virtual Ethernet Module           NA                  other
14   332    Virtual Ethernet Module           NA                  other
16   332    Virtual Ethernet Module           NA                  other
17   332    Virtual Ethernet Module           NA                  ok
18   332    Virtual Ethernet Module           NA                  other
19   332    Virtual Ethernet Module           NA                  ok
```
It took 5-10 minutes, but every module dropped back into place. vMotions started working and VMware was happy again. There was no service interruption at any point in the process.
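With modules coming back over several minutes, it can help to track which hosts have rejoined from the syslog messages rather than re-running `show mod` by hand. Below is a rough Python sketch that pairs each `VEM_MGR_DETECTED` message with the matching `MOD_ONLINE` message. The function name `recovered_modules` is hypothetical, and the patterns are assumptions based only on the log lines quoted in this post.

```python
import re

# Patterns taken from the syslog lines shown in this post.
DETECTED = re.compile(
    r"%VEM_MGR-2-VEM_MGR_DETECTED: Host (\S+) detected as module (\d+)"
)
ONLINE = re.compile(r"%VEM_MGR-2-MOD_ONLINE: Module (\d+) is online")


def recovered_modules(log_text):
    """Map module number -> host name for modules detected and then online."""
    hosts = {}      # module number -> host, from DETECTED messages
    recovered = {}  # modules that were also reported online afterwards
    for line in log_text.splitlines():
        m = DETECTED.search(line)
        if m:
            hosts[int(m.group(2))] = m.group(1)
            continue
        m = ONLINE.search(line)
        if m:
            mod = int(m.group(1))
            if mod in hosts:
                recovered[mod] = hosts[mod]
    return recovered


# Three lines copied from the log output above.
SAMPLE_LOG = """\
2014 Dec 17 16:40:30 inf-vs001a-nor %VEM_MGR-2-VEM_MGR_DETECTED: Host esx2006 detected as module 19
2014 Dec 17 16:40:31 inf-vs001a-nor %VEM_MGR-2-MOD_ONLINE: Module 19 is online
2014 Dec 17 16:40:33 inf-vs001a-nor %VEM_MGR-2-VEM_MGR_DETECTED: Host ESX3102 detected as module 17
"""

if __name__ == "__main__":
    print(recovered_modules(SAMPLE_LOG))
```

In the sample, module 19 (esx2006) counts as recovered, while module 17 has been detected but not yet reported online, so it is left out.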

TAC are still investigating the root cause, but this definitely looks like some bug!
