Lessons Learned From A Recent 10 Hour Cutover
I recently had the displeasure of enduring one of the most harrowing cutovers of my career. We told the client two to three hours starting at 4 pm, and we didn’t finish until 2 in the morning.
So what was the cutover, and why did it take so long? It was a fairly simple network refresh, only a couple of switches in total, but our strategy for managing the network was entirely new to everyone involved.
You see, this is my first deployment of Cisco DNA Center using Plug and Play (PnP), and this particular project will involve the replacement of over 1,000 switches across more than a hundred sites. You can imagine why we might not want to manually upgrade and configure every switch we deploy, so we’ve decided to leverage DNA Center’s Plug and Play and Templates features to reduce the manual intervention needed.
PnP can be very powerful, but given the requirements of this network, we’ve decided on more of a hybrid approach: a minimal configuration bootstraps each switch just enough to contact DNA Center, which then downloads a golden software image and provisions the final configuration.
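For context, the heart of that bootstrap is a PnP profile that tells the switch where to find DNA Center. A minimal sketch is below; the profile name and address are hypothetical placeholders, not our production values, and the switch also needs basic IP reachability to that address (more on that in Problem Number 1):

    ! Minimal PnP discovery profile; "DNAC" and 198.51.100.10 are
    ! hypothetical values used here for illustration
    pnp profile DNAC
     transport https ipv4 198.51.100.10 port 443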
Although this workflow had been tested multiple times in the lab prior to go-live, reality hit like an 18-ton Mack truck barreling down the highway. We had problems with every step in the process, but we believe we’ve developed solutions to counteract or sidestep every problem we faced.
Problem Number 1
This first problem is a little silly in retrospect, but it highlights that even the best lab scenarios are often not aligned with reality. You see, we had this great idea that we would use the dedicated management port for bootstrapping switches to DNA Center. Unfortunately, that requires another switch near enough to plug that management port into. Great in a lab, where all of the switches are in the same rack; not so great in the real world, where switches are often in another closet or even another building. So much for that idea.
Fortunately, this was an easy problem to overcome, but it did require us to change a couple of things in our bootstrapping configuration.
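The change was to provide that reachability over a front-panel port rather than the dedicated out-of-band management port. A rough sketch of the revised bootstrap, with hypothetical VLAN, addressing, and interface values:

    ! Bootstrap reachability via a front-panel access port instead of
    ! the out-of-band GigabitEthernet0/0 management port
    vlan 100
    interface Vlan100
     ip address 10.1.100.10 255.255.255.0
     no shutdown
    interface GigabitEthernet1/0/48
     switchport mode access
     switchport access vlan 100
    ip default-gateway 10.1.100.1

Note that ip default-gateway only applies while IP routing is disabled, which is the case for a factory-default switch being bootstrapped.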
Problem Number 2
We’re deploying Catalyst 9500 switches with StackWise Virtual at the distribution layer of these sites, and this led us to our first major problem. We’d tested upgrading these switches using PnP several times, we’d tested bootstrapping them several times, and we’d tested configuring StackWise Virtual several times. The problem arose when we deviated from the specific order we had tested in the lab.
The night of the cutover, we first configured StackWise Virtual, rebooted the switches, and then ran the bootstrap configuration to have the switches talk to DNA Center. Unfortunately, something went wrong in the onboarding upgrade process, and it appears that only one of the two switches upgraded correctly. The situation was further exacerbated by the fact that Cisco Catalyst 9500 switches don’t have the best error handling when it comes to version mismatches in a StackWise Virtual pair. So instead of being able to identify the problem immediately, we were stuck guessing at the underlying issue. We even went as far as replacing the offending switch with a spare. Serendipitously, because I knew that spare was on a different version of code, I decided to bootstrap it separately from the other switch so that it could upgrade to the latest code before joining the stack. This worked flawlessly and has become our new process for avoiding this issue in the future.
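Put concretely, the new order of operations is: bootstrap each 9500 on its own, let PnP upgrade it to the golden image, confirm with show version that both chassis match, and only then configure StackWise Virtual and reload. A sketch of that final step, with a hypothetical domain number and link ports:

    ! Applied to each chassis individually, and only after both are
    ! confirmed to be running the same golden software image
    stackwise-virtual
     domain 1
    !
    interface range TwentyFiveGigE1/0/25 - 26
     stackwise-virtual link 1
    ! A reload of both chassis is required for StackWise Virtual to
    ! take effect; verify afterward with "show stackwise-virtual"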
Problem Number 3
This one is more of a bug we hit than a fault in our process. Our closet switch stacks are made up of Catalyst 9300 switches. During the PnP onboarding process, a couple of commands were deployed to this switch, specifically a file location under the archive sub-config. After rebooting to upgrade to the golden software image, the switch ran into trouble during the “bulk sync” of configuration to the standby switch. Unfortunately for us, the software then tries to correct this by reloading the standby switch. That doesn’t fix anything and only results in a reboot loop on the standby switch until it eventually fails into ROMMON.
Unfortunately, it took me far too long to have my console cable plugged into the correct port at the correct time to notice the little error message before the switch rebooted. I even replaced the standby switch with a spare before finally seeing the error message, Googling it, and finding bug CSCue10556. I might add that it was around hour 11 when I finally found this bug, and I was very, very tired, to say the least.
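For what it’s worth, the onboarding snippet at the center of this looked something like the sketch below; the file path is a placeholder, not our actual value. Our working assumption going forward is to keep this stanza out of the onboarding template entirely and provision it only after the golden-image upgrade completes:

    ! Approximation of the archive sub-config deployed during onboarding;
    ! "flash:config-archive" is a hypothetical path
    archive
     path flash:config-archive
    ! Per our read of the bug, removing the stanza before the upgrade
    ! reboot should avoid the bulk-sync failure
    no archive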
Summation
As bad as this cutover was, it’s nice to know that we walked away with such useful lessons to carry over into the next site refresh. In the words of Helmuth von Moltke, “No plan of operations reaches with any certainty beyond the first encounter with the enemy’s main force.” We proved that spectacularly.