Troubleshooting DNA Center Plug and Play

Cisco built Plug and Play (PnP for short) into DNA Center as an easy way to onboard new, unconfigured devices for use with either the templating engine or the SD-Access architecture. PnP can be quite useful, but in my experience it can also be quite temperamental.

I wrote recently about a project that involves using PnP in DNA Center, and while I feel like we’ve been able to tame the beast in most scenarios, today I was faced with the reality of calling TAC. All of the tricks I had been using failed me, and I was unable to get any switches to progress through the onboarding workflow.

While Cisco switches are perfectly capable of using a discovery method like DHCP or DNS to bootstrap themselves to the PnP server, in this project, we’re using a manual bootstrap method of copying in a few lines of config because we lack a configured DHCP server in this portion of the network.
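
For reference, a manual bootstrap usually boils down to a PnP profile that points the switch at the controller. This is a minimal sketch and not the exact config from this project: 192.0.2.10 is a placeholder for the DNA Center address, the profile name is an assumption, and the switch is assumed to already have an IP address and a route toward the controller.

config t
! 192.0.2.10 is a placeholder, substitute your DNA Center address
pnp profile pnp-zero-touch
 transport https ipv4 192.0.2.10 port 443
end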

After applying our bootstrap config, the switch appeared immediately in the Plug and Play screen in DNA Center, but it was sticking at 10% of the Onboarding workflow. Diving into the device history revealed it was “Securing Device” but not progressing, a behavior I’d never seen before.

I figured at this point I must have done something wrong, so I reset the switches and tried again.

If you’re not aware, Cisco has a set of commands that will essentially reset a switch or router to a state with no previous config so that PnP can be retried. From Cisco’s documentation:

config terminal
no pnp profile pnp-zero-touch
no crypto pki certificate pool
config-register 0x2102
end
delete /force vlan.dat
delete /force nvram:*.cer
delete /force stby-nvram:*.cer
write erase
reload  (answer no when asked to save)

Unfortunately, this didn’t do the trick for me, and I began to suspect that the problem lay somewhere within DNA Center and not the switches or transit network. At this point, I had to engage TAC to help me out, as the DNA Center commands aren’t particularly well documented publicly and I lacked experience in this area.

Pretty quickly though, my assigned engineer started letting me in on a couple of tricks that I thought were awesome. First off, the above block of commands can be replaced with one awesome macro on code versions 16.12.1 and newer:

pnpa service reset

CAUTION! Be aware, this little guy is like ‘write erase’ on steroids. It’s going to completely blow away any semblance of config and then it’s going to reload your network device without a confirmation. Definitely make sure you’re copying that into the right terminal window.


The next little trick that I picked up was a way to view all of the commands that DNA Center was running on the switch during the PnP process. This has been a bit of a sticking point in my mind because the process is kind of a black box, and it’s difficult to troubleshoot issues effectively without being able to see what’s going on.

Fortunately, my TAC engineer had an awesome little EEM script that logs every command received by the switch’s CLI to syslog. This little code snippet should be must-have information for anyone working with Cisco DNA Center automation.

config t
event manager applet catchall
event cli pattern ".*" sync no skip no
action 1 syslog msg "$_cli_msg"
end

Remember to remove the commands above when you’re done troubleshooting, as the extra log entries are going to annoy both you and anyone looking at the syslog of this switch in the future.
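
Cleanup should be a one-liner. As a minimal sketch, assuming the applet is still named catchall as in the snippet above:

config t
! Remove the catch-all EEM applet once troubleshooting is done
no event manager applet catchall
end

While the applet is active, its entries should be easy to pick out of the log with something like show logging | include catchall, since EEM tags each syslog message with the applet name.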

Ok, so after a couple hours of troubleshooting, the TAC engineers were able to uncover the issue. Ultimately, DNA Center had failed to automatically refresh several internal certificates for things like Kubernetes and other internal processes, and they had all expired. Refreshing the certificates was the trick to clearing up my issue; however, I still need to upgrade the cluster to a newer version that has a patch in place for this bug.

We noticed a large number of PKI errors in the maglev service logs and then confirmed the certificate expiration dates.

magctl service logs -r pki | lql
sudo maglev-config certs info

# Refresh certs using:
sudo maglev-config certs refresh

We retried the Plug and Play workflow and confirmed that the DNA Center cluster was healthy enough to proceed. There’s still some work for me to do to make sure this doesn’t recur in a year though.

Ryan Harris

I’m Ryan and I’m a Senior Network Engineer for BlueAlly (formerly NetCraftsmen) in North Carolina doing routing, switching, and security. I’m very interested in IPv6 adoption, SD-Access, and Network Optimization. Multi-vendor with the majority of my work being with Cisco and Palo Alto.
