Troubleshooting DNA Center Plug and Play
Cisco built Plug and Play (PnP for short) into DNA Center as a way to easily onboard new devices without a configuration to be used either with the templating engine or with the SD-Access architecture. PnP can be quite useful but in my experience, it can be quite temperamental.
I wrote about a project recently that involves using PnP in DNA Center, and while I feel like we’ve been able to tame the beast in most scenarios, I was faced with the reality of calling TAC today because all of my tricks that I have been using failed me and I was unable to get any switches to progress through the onboarding workflow.
While Cisco switches are perfectly capable of using a discovery method like DHCP or DNS to bootstrap themselves to the PnP server, in this project, we’re using a manual bootstrap method of copying in a few lines of config because we lack a configured DHCP server in this portion of the network.
After applying our bootstrap config, the switch appeared immediately in the Plug and Play screen in DNA Center but it was sticking at 10% of the Onboarding workflow. Diving into the device history revealed it was “Securing Device” but it wasn’t progressing; a behavior I’ve never seen before.
I figured at this point I must have done something wrong and tried to reset the switches and try again.
If you’re not aware, Cisco has a set of commands that will essentially reset a switch or router with no previous config so that PnP can be retried. From Cisco’s documentation:
config terminal no pnp profile pnp-zero-touch no crypto pki certificate pool config-register 0x2102 end delete /force vlan.dat delete /force nvram:*.cer delete /force stby-nvram:*.cer write erase (answer no when asked to save) reload
Unfortunately, this didn’t do the trick for me and at this point, I was suspecting that the problem lied somewhere within DNA Center and not the switches or transit network. At this point, I had to engage TAC to help me out as unfortunately the DNA commands aren’t particularly well documented publicly and I also lacked experience in this area.
Pretty quickly though, my assigned engineer started letting on a couple of tricks that I thought were awesome. First off, the above block of commands can be replaced with one awesome macro on code versions 16.12.1 and newer:
pnpa service reset
CAUTION! Be aware, this little guy is like ‘write erase’ on steroids. It’s going to completely blow away any semblance of config and then it’s going to reload your network device without a confirmation. Definitely make sure you’re copying that into the right terminal window.
The next little trick that I picked up was a way to view all of the commands that DNA Center was running on the switch during the PnP process. This has been a little bit of a sticking point in my mind because this is kind of a black box and I it’s difficult to effectively troubleshoot issues without being able to see what’s going on.
Fortunately, my TAC engineer had an awesome little EEM script that would log every commands received by the switch to the CLI. This little code snippet should be must-have information for anyone working with Cisco DNA Center automation.
config t event manager applet catchall event cli pattern ".*" sync no skip no action 1 syslog msg "$_cli_msg" end
Remember to remove the commands above when you’re done troubleshooting as it’s going to be both annoying to you and anyone looking at the syslog of this switch in the future.
Ok, so after a couple hours of troubleshooting, the TAC engineers were able to uncover the issue. Ultimately, DNA Center had failed to automatically refresh several internal certificates for things like kubernetes and other internal processes and they all expired. Refreshing the certificates was the trick to clearing up my issue, however, I do still need to upgrade the cluster to a newer version that has a patch in place for this bug.
A large number of PKI errors were noticed in the maglev service logs and then certificate expiration dates were confirmed.
magctl service logs -r pki | lql sudo maglev-config certs info ! Refresh certs using: sudo maglev-config certs refresh
We retried the Plug and Play workflow and confirmed that we had a healthy enough DNA Center cluster to proceed with. There’s still some work for me to do to make sure this doesn’t reoccur in a year though.