This Bridge is the Root

Post-Mortem and Thoughts on a Recent Outage

I’ve held the opinion for a while that if you’re not occasionally causing outages, you’re not doing any work.

That’s not to say that causing an outage is a good thing or that you shouldn’t strive to avoid causing outages. What I mean is that it’s nearly impossible to avoid causing outages 100% of the time.

The good news is that each time you cause an extended outage on a network, you’re given the opportunity to learn how to avoid that particular scenario in the future.

I’d like to talk about a recent outage that I caused in a customer’s network, what caused the outage, and what I learned from it.

The Setup

To set up the scenario, I’m currently working on a project for a customer to deploy L3 switches at the internet edge of their network to participate in BGP with their provider and to enable redundant core network services. In other words, this customer had a non-redundant design where all traffic destined to the internet was routed through a single building.

As part of the project, we deployed an additional Cisco Nexus switch as a core, upgraded and enabled high availability on their Palo Alto firewall, and replaced their edge router with two Cisco Catalyst 9300 switches to participate in BGP with the provider and to turn up a second internet circuit with that provider.

At a high level, the work has been completed and we’re just tying up some loose ends at this point. One of those loose ends was that the previous core configuration had not used OSPF on the connection to the firewall. The change window we were in was to enable this OSPF adjacency so that the firewall would provide a default route to the network and enable dynamic failover. As soon as I enabled the adjacency, the customer lost internet connectivity, and I had to wait for someone with a hotspot to get online so that we could investigate and fix it.
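
For context, the change itself was small. A rough NX-OS-style sketch of what enabling the adjacency on the core looks like is below; the process tag, VLAN, area, and addresses are placeholders rather than the customer’s actual values:

    feature ospf
    !
    router ospf 1
      router-id 10.255.255.1
    !
    interface Vlan100
      ! hypothetical SVI facing the firewall
      ip address 198.51.100.2/24
      ip router ospf 1 area 0.0.0.0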

To explain the series of events that caused this outage, I need to take you back to when we were first setting up the 9300 edge switches. In order to bring the switches online so that we could access them remotely and have them reach out to the Cisco Smart Licensing portal, the customer configured an IP address on an interface and a static default gateway on each of the switches. This was all well and good for pre-production, but because these switches were intended to sit outside the firewall on this network, we took steps to move the configured IP address to the out-of-band management interface on each switch and to configure the default route under the “Mgmt-vrf” VRF.
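
On a Catalyst 9300 the out-of-band management port lives in the Mgmt-vrf by default, so the move looks roughly like the following IOS-XE-style sketch (the addresses here are documentation placeholders, not the customer’s):

    interface GigabitEthernet0/0
     ! dedicated out-of-band management port, placed in Mgmt-vrf by default
     ip address 192.0.2.10 255.255.255.0
    !
    ip route vrf Mgmt-vrf 0.0.0.0 0.0.0.0 192.0.2.1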

This is the point at which the mistake was made. I failed to remove the static default route from the default VRF on one of the switches.
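
The cleanup that was missed amounts to a single line in the default (global) VRF, removing the pre-production route; continuing the placeholder addressing above, it would have been something like:

    no ip route 0.0.0.0 0.0.0.0 192.0.2.1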

We moved these switches into production and had to troubleshoot a couple of things, but we got everything working, tested failover a few times, and were satisfied that everything was behaving as designed.

The design has the default route for outbound traffic being learned via BGP from the provider, which is then shared into OSPF toward the firewall using the command “default-information originate”. In the failure scenario where the primary internet connection goes down or loses its BGP adjacency for any reason, the switch withdraws the default route from OSPF and the secondary switch becomes active. We were happy with this functionality, unaware that there was a landmine in the configuration.
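
A rough IOS-XE-style sketch of that piece of the design on one of the edge switches might look like the following; the AS numbers and neighbor address are placeholders:

    router bgp 65000
     ! eBGP session to the provider; the default route is learned here
     neighbor 203.0.113.1 remote-as 64500
    !
    router ospf 1
     ! originate a default route into OSPF toward the firewall as long as
     ! a default route exists in this switch's routing table
     default-information originate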

The reason the misconfigured default route was not active and causing issues at this point was that the switch did not have a route to its next-hop address. Normally, when we create a static route, the next-hop is in a directly connected subnet; as long as the interface to that subnet is up, the route is active and installed in the routing table. The switch looks up the next-hop, finds the egress interface, and then sends an ARP request for the MAC address of the next hop. There is no requirement that a configured next-hop be in a directly connected subnet, though. The switch will simply perform a second, recursive route lookup to find the egress interface. If the switch can’t recursively resolve the next-hop, it just doesn’t install the route in the routing table.
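
As a concrete (placeholder) example of that behaviour, assume the stale route still pointed at the old pre-production gateway, which was no longer in a connected subnet:

    ip route 0.0.0.0 0.0.0.0 192.0.2.1
    ! 192.0.2.1 is not in a connected subnet, so this route stays out of the
    ! routing table until the switch learns some other route that covers it
    ! (for example 192.0.2.0/24 via OSPF). Once the next-hop resolves, the
    ! route is installed with administrative distance 1, which beats the
    ! eBGP-learned default at administrative distance 20.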

The Outage

When I enabled the OSPF adjacency on the core switch, it exchanged its LSDB with the firewall, which in turn exchanged it with the edge switches. At this point, the edge switch with the misconfigured default route was suddenly able to recursively resolve the outgoing interface for the static route’s next-hop, and it unexpectedly (to me) installed that route into the routing table. Because a static route has a lower administrative distance than an eBGP-learned one, the stale route became the active default and sent internet-bound traffic toward the wrong next-hop.

Because we were using the command “default-information originate” rather than redistributing the BGP-learned route into OSPF, the switch kept advertising a default route as long as any default existed in its routing table, including the stale static one. The default was therefore never withdrawn from the LSDB, and traffic could not fail over to the second switch and internet circuit.

I resisted the urge to immediately go and break the OSPF adjacency that triggered this outage, because I knew I needed an explanation for why it happened; the adjacency is important to the overall network design and we would need to make the change again. Luckily, I was able to find the offending static route within a few minutes of getting into the switches and remove it to restore access.

What I’m taking away from this

There are a couple of things I want to take away from this outage, the first being that cleaning up old config is really important. I should have noticed that the static route was there. I don’t remember seeing it during the first change window, but what I can do to avoid this in the future is to run a quick “show run” at the end of a change window and scan through the config. I know what should be there and what shouldn’t.
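
For a change like this one, even a couple of targeted checks at the end of the window would likely have surfaced the leftover route, for example:

    show running-config | include ip route
    show ip route static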

I know from experience that there’s a tendency to think that if something isn’t doing any harm by sitting in the config, there’s no need to worry about it. But this incident should serve as proof that with so many interconnected systems, small mistakes can cause the whole thing to grind to a stop.

The second point I’d like to take away from this is that it’s OK to make mistakes occasionally, as long as you’re not making the same mistakes over and over again. I tend to beat myself up after experiences like this. This isn’t meant to downplay the importance of striving not to make mistakes, but to internalize that anyone can make one. The consequences could be minor, or they could cause a widespread outage; that’s the luck of this sort of thing.