Nexus 7000 Software Bug - Flash RAID Errors - 7k Reboot and Failover

It's been a mad couple of weeks with Nexus 7000s. My client hit a software bug on their Nexus 7k, and it turned out to be a most impressive one. It causes the flash drives to be erroneously marked as faulty, which in turn causes them to be remounted read-only. The first symptom was that we could not save the running configuration with "copy run start". There were other symptoms in the logs, such as failures to write to some SNMP and AAA files.

The bug reference is CSCus22805 (Cisco account needed).

So my client's site has a pair of Nexus 7ks, connected as vPC peers. Each 7k has two N7K-SUP2 supervisor cards, and each supervisor has two flash drives configured in RAID 1 (mirrored). What could go wrong with this much redundancy? Well... a lot.

After we found this problem on one 7k, we started to investigate both of them. There is a hidden command, not in the context-sensitive help (at least on this code version), called "show system internal raid". At the bottom of its output is some logging, which records any problems. The real tell, though, is the line near the top that says "RAID data from CMOS", which shows two hex values. I have no idea what the first one is, but the second will be one of four codes:

  • 0xf0 == No failures reported
  • 0xe1 == Primary flash failed
  • 0xd2 == Alternate (or mirror) flash failed
  • 0xc3 == Both primary and alternate failed

Obviously what you want to see is 0xf0. The output of this command refers to the two flash drives on one sup, so if you run it on its own, it will only show you the status of the active sup. To see the status of each sup individually, prefix the command with the slot:

  • slot 5 show system internal raid
  • slot 6 show system internal raid
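
If you have a stack of sups to check, a quick script can save some squinting. Here is a minimal Python sketch that decodes the second CMOS byte from a saved copy of that output - the regex, the file name, and the exact line format are assumptions on my part, so adjust them to whatever your code version actually prints.

  # Minimal sketch: decode the status byte from "show system internal raid".
  # The "RAID data from CMOS" line format is assumed, not copied from a device.
  import re

  CMOS_STATUS = {
      0xF0: "No failures reported",
      0xE1: "Primary flash failed",
      0xD2: "Alternate (mirror) flash failed",
      0xC3: "Both primary and alternate failed",
  }

  def decode_raid_cmos(output):
      match = re.search(r"RAID data from CMOS.*?(0x[0-9a-fA-F]{2})\s+(0x[0-9a-fA-F]{2})", output)
      if not match:
          return "Couldn't find the CMOS line - check the output format for your version"
      status = int(match.group(2), 16)
      return CMOS_STATUS.get(status, "Unknown status byte " + hex(status))

  # Example: save the output of "slot 5 show system internal raid" to a file first.
  with open("slot5_raid.txt") as f:
      print(decode_raid_cmos(f.read()))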

On the "B" 7k, there was one sup showing 0xf0 - this one was reseated a few months ago due to some issues - it was the secondary sup. The primary sup showed 0xd2 - so the mirror flash had failed.

This was painless and easy to recover. We lined up a maintenance window with the business, just in case, and initiated "system switch-over" to fail over to the standby sup. This is seamless - although not so seamless for the management interface if the switchport at the other end of the standby sup's management cable is in the wrong VLAN! After a few minutes of head-scratching we fixed that and we were good - there had been no impact to production traffic. We then ran "out-of-service module 5" to shut the module down, physically removed and reinserted it, and it booted back up, reinitialised its RAID, and config-synced from the active sup. All was good.

The "A" 7k was more fun. There were dual flash failures (0xc3) on BOTH sups. That's bad. We worked with Cisco TAC to be told the only way to recover was to....reload the entire chassis. This wasn't particularly what we wanted to hear, but c'est la vie. Worse, the engineer then told us that because it is a dual failure, there is a small possibility of the 7k coming back up with no config.

We did some pretty thorough checks before we entertained the idea of a 7k failover - it doesn't get done often (read: ever). Here is a decent checklist that got us by:

  • Running-configs: a good place to start. Line by line, side by side, go through the running configs and check that everything looks right and that both match where they should.
  • GLBP: We use GLBP, some prefer to use HSRP. Whichever it is, take the full output of the show commands and check everything line by line. Make sure one switch knows it is active, the other knows it is standby.
  • vPC: Run "show vpc" and check the status and consistency checks of each vPC to make sure they will fail over correctly.
  • Port-Channels: Make sure all port-channels are configured correctly. Make sure no ports are marked "I" (there's a quick check sketched after this list). Make sure LACP is being used (if the switch DOES lose all its config and the bonding mode on the far-end device is set to "on", there is a risk that device would start sending traffic to an unconfigured switch as soon as it saw the interface come up).
  • Spanning-tree: Study the topology of the network. Work out how other switches will react. These were the root switches - for vPC-connected devices the root bridge is hidden, but the bridge ID isn't, so there will still be a small reconvergence. Pay particular attention to any legacy parts of the network that may be using old-style STP. What about where you are sitting? How will your own network connectivity be affected? We ended up patching my desk into the "B" 7k directly.
  • VLANs and SVIs: Sounds simple, but check that all VLANs that should exist on both switches actually do. Same with SVIs. This topology used some point-to-point VLANs for routing peers which only existed on one of the pair, and that was by design.
  • Firewalls: This network used active/standby Checkpoint firewalls, and each member of the firewall cluster was connected to a single 7k. We elected to do a manual cluster failover before reloading the 7k to minimise the impact.
  • Routing: For statics, check that all necessary static routes are on both 7ks. For dynamic protocols, check the peers - assuming point-to-point VLANs, check that the neighbouring devices can see both 7ks as full neighbours.
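
For the port-channel check in particular, a saved "show port-channel summary" can be grepped rather than eyeballed. A tiny sketch - NX-OS marks non-bundled member ports with an "(I)" flag, but the file name here is made up:

  # Flag any port-channel member marked (I), i.e. individual / not bundled,
  # in a saved copy of "show port-channel summary" from each 7k.
  with open("7k-A-port-channel-summary.txt") as f:
      for line in f:
          if "(I)" in line:
              print("Not bundled: " + line.strip())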

As an aside, Notepad++ with the compare plugin is your friend here.
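
If Notepad++ isn't to hand, the same line-by-line comparison can be scripted. A minimal Python sketch, assuming you have saved the running-config from each 7k to a text file (the file names are made up):

  # Print only the lines that differ between the two saved running-configs.
  # Hostnames, vPC roles and the point-to-point routing VLANs are expected to differ.
  import difflib

  with open("7k-A-running-config.txt") as a, open("7k-B-running-config.txt") as b:
      diff = difflib.unified_diff(a.read().splitlines(), b.read().splitlines(),
                                  fromfile="7k-A", tofile="7k-B", lineterm="")

  for line in diff:
      if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
          print(line)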

Storage deserves a special mention. One of our biggest worries was that my client's VMware platform sits on Cisco UCS infrastructure, using NFS storage presented by NetApps connected to the 7ks. A lot of checking was done around the vPCs to the NetApps and to UCS. VMware Tools changes a registry setting in Windows to increase the disk timeout to 60 seconds - i.e. the time you have before Windows considers its storage unreachable and blue screens. It does something similar for Linux, except the number is 180 seconds and it's a config change rather than a registry tweak. The details are spread across three VMware docs.
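
Those guest timeouts are easy to sanity-check before a window like this. A rough Python sketch to run inside a VM (not on the 7k) - the registry value and sysfs path are the standard locations for the disk timeout, but treat the expected numbers as assumptions about how your VMware Tools build has set things up:

  # Quick guest-side check of the disk timeouts mentioned above.
  import glob
  import platform

  if platform.system() == "Windows":
      import winreg
      key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                           r"SYSTEM\CurrentControlSet\Services\Disk")
      value, _ = winreg.QueryValueEx(key, "TimeOutValue")
      print("Windows disk TimeOutValue: %s seconds (expecting 60)" % value)
  else:
      for path in glob.glob("/sys/block/sd*/device/timeout"):
          with open(path) as f:
              print("%s: %s seconds (expecting 180)" % (path, f.read().strip()))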

We had actually already tried to recover this ourselves. The idea was that if we reseated the standby sup and let it recover, we could then fail over onto it, reseat the other, and we'd be good. That doesn't work. When the standby sup boots, it tries to config sync; when the sync fails because the active sup has disk issues, the active sup forces the standby sup to reload. So you end up with a supervisor that just keeps power-cycling over and over.

Our maintenance window went something like this:

  1. Fail over the Checkpoint firewall clusters, so that the active members were connected to 7k "B" (the one we weren't touching).
  2. Take full backup configurations of just about everything you can.
  3. Have a console connection available to both supervisors simultaneously.
  4. Run fping scripts to ping at least one thing in every section of the network, in parallel (a rough stand-in sketch follows this list).
  5. Enter the command "system standby manual-boot" on the 7k. This allows the second supervisor to fully boot. This was just so that we could do some checks on our power-cycling supervisor and make sure it was healthy. You probably don't need this step if you are the leap-of-faith kinda person.
  6. Console onto the standby sup, which has now booted, and verify the health of the RAID (look for 0xf0).
  7. On the active sup, type "reload". Confirm with Y. Panic, hope, pray, cross your fingers, do whatever. Watch the pings with horror and wish you were somewhere sunny.
  8. The chassis will take 10 minutes or so to come back. Just keep waiting until "show module" shows one sup as active and the other as standby.
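
For reference, the ping monitoring in step 4 doesn't need to be anything clever. Here is a rough stand-in for the fping scripts in Python - the target file name, one-second interval, and Linux ping flags are my own choices, not what we actually ran:

  # Ping every host listed in targets.txt once a second, in parallel, and
  # timestamp anything that stops answering.
  import subprocess
  import threading
  import time
  from datetime import datetime

  def watch(host):
      while True:
          up = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                              stdout=subprocess.DEVNULL).returncode == 0
          if not up:
              print("%s  %s DOWN" % (datetime.now().strftime("%H:%M:%S"), host))
          time.sleep(1)

  with open("targets.txt") as f:
      targets = [line.strip() for line in f if line.strip()]

  for host in targets:
      threading.Thread(target=watch, args=(host,), daemon=True).start()

  while True:      # keep the main thread alive until you Ctrl-C out
      time.sleep(60)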

What about outages?

When the 7k went down, it was quick. We lost a couple of legacy areas (mainly iLO stuff) for 40 seconds or so - those switches run old STP, so they had to do a full STP recalculation. vPCs were all but immediate - so much so that VMware didn't even notice a flicker in its storage connectivity, and the NetApp had nothing to say either. UCS saw some interfaces go down, but it handled it exactly as it should. The longest drops were the standby firewalls - but they were only connected to one 7k, so that was fully expected. Across our range of pings, some things were totally unaffected, others took a hit of a few seconds, and the legacy and singly-connected stuff was out for longer.

Interestingly, about 3-4 minutes after the 7k came back up fully, it started taking over its GLBP groups and so on. This was surprisingly disruptive and probably caused a longer drop than the reload itself. It was still only 6 seconds or so, and even then it wasn't all services, but it was a surprise. The good news, though, was that it came back up fully configured and ready to go, with all flash drives reporting healthy.

I've removed the IPs, but here is the output of one of the ping scripts we were running, for anyone interested in seeing the actual results. (It's 17 MB of raw pings of about 45 devices.)

So, frustrating as it was, the service impact was minimal. All flash drives now report healthy. If you got to the end of this, well done - it was long. Just take one thing away: go onto your 7ks and run "show system internal raid" on both sups. Check their health. One error is an easy fix; four errors is a headache.

EDIT - 12/08/15 - This wasn't actually the end - here is a continuation where we had to run a flash recovery tool to fully restore: Nexus 7000 Software Bug – Flash RAID Errors – Part 2
