How to go about upgrading a multi-node ISE deployment using the automated process.
Traditionally I have always upgraded Cisco ISE using the command line. The process is simple: you copy the upgrade bundle to the local drive, run a “prepare” command that extracts the files and moves them to the correct places for the install, and then press go. The output to the shell is verbose, there are automatic stops and checks in place, and overall I have not had too many issues with this process.
This year we had a seven-node ISE deployment to upgrade. It consisted of a PAN, a SAN and five policy service nodes (PSNs) dotted around the country. The linear “copy the file on and enter a command” approach was not going to cut it for this job.
There are some pros to the command-line route. Each ISE node is handled independently, you can decide exactly when the upgrade happens and when the ISE box reloads, and you can manually make sure that, once upgraded, the node is added back to the deployment when you are ready. You pay the price in time, however. Upgrading one node an evening to minimise downtime would take well over a week, and throughout the process your deployment would be running with reduced resilience.
Since ISE 2.1, Cisco has had a solution to this problem. Using the GUI, we can upgrade the deployment in a mostly automated fashion. We lose some control – we cannot specify the time at which a specific node should reboot, for example – but we reduce the upgrade time from days to hours.
There are a few tasks I like to do before starting an upgrade. I prefer to use a local repository rather than a remote one, as the GUI has a hidden timeout on file transfers. With upgrade bundles of around 10 GB in size, a slightly slow WAN link can easily cause an upgrade to fail at the first hurdle. I configure the local repository in the GUI and then log on to each node over SSH to create the directory.
I then spend some time manually copying the upgrade bundle from our FTP servers to the local repository using the command line. This is time-consuming, but I have found it to be much more reliable.
Next, I log into the GUI and take manual backups before disabling the scheduled ones. This means I can restore the database if I need to, and that a backup will not try to run while I am upgrading the nodes. I also take a copy of the certificate store; this is done via the CLI on the PAN.
Finally, and most importantly, I ensure I have remote console access via the CIMC. In the event of an upgrade failure, the main recovery method is via the console, which means either being on site or using the remote console.
Once preparation is complete, I use the GUI to kick off the upgrades. The first screen is a checklist that confirms we have done adequate preparation, and it serves as a useful final reminder.
Part two is the download phase, and this is an area where I had some trouble. I tried to use a remote repository as mentioned above but unfortunately, I ran into file transfer timeouts. This ends up costing a lot of time as the timeout is around 3 hours. I found it to be much more reliable to use the local repository, which is also a recommendation on the Cisco site.
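To put a number on the timeout problem, here is a rough back-of-the-envelope sketch. It takes the approximate figures from above (a bundle of around 10 GB and a timeout of around 3 hours), treats a gigabyte as 10^9 bytes, and ignores protocol overhead, so a real link needs some headroom on top. The function names are my own and purely illustrative.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Hours needed to move size_gb (decimal gigabytes) at a sustained link_mbps."""
    megabits = size_gb * 8 * 1000  # 1 GB = 8,000 megabits
    return megabits / link_mbps / 3600


def min_link_mbps(size_gb: float, timeout_hours: float) -> float:
    """Slowest sustained rate that still finishes inside the timeout."""
    return size_gb * 8 * 1000 / (timeout_hours * 3600)


BUNDLE_GB = 10     # approximate bundle size from the text
TIMEOUT_HOURS = 3  # approximate GUI transfer timeout from the text

print(f"Minimum sustained rate: {min_link_mbps(BUNDLE_GB, TIMEOUT_HOURS):.1f} Mbit/s")
for mbps in (5, 10, 50):
    print(f"{mbps} Mbit/s -> {transfer_hours(BUNDLE_GB, mbps):.1f} h")
```

On these assumptions, any link that cannot sustain roughly 7.4 Mbit/s for the whole transfer will hit the timeout, which is why a slightly slow WAN link is enough to cause trouble.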
The final page of the upgrade is where the power of this method really shines through. You are presented with a sequence table and your ISE nodes. The first section must contain the SAN. This node will upgrade and become the PAN of the new deployment. The next section is for the policy service nodes, and you can break this down into as few or as many subsections as you like. If you put all the PSNs in one big group, they will all upgrade in parallel. If you put each PSN in its own subsection, they will upgrade in series.
For this deployment, we had a PSN that served as a secondary node for most of our sites. I decided to upgrade this node first, by itself. I then put the other PSNs in a group to be upgraded together, saving hours of downtime.
Finally, the PAN goes in the last section. This will upgrade and become the SAN of the new deployment. There is also an option to stop the whole upgrade if a PSN should fail, which I decided not to use.
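The effect of the grouping on total run time can be sketched with a simple model: sections run one after another, and nodes inside a section run in parallel, so a section takes as long as its slowest node. The node names and per-node durations below are hypothetical placeholders, not measurements from our deployment, and the model ignores any overhead between sections.

```python
from typing import Dict, List

# Hypothetical per-node upgrade times in hours. Real values depend on
# hardware, database size and ISE version.
NODE_HOURS: Dict[str, float] = {
    "san": 2.5,
    "psn1": 1.5, "psn2": 1.5, "psn3": 1.5, "psn4": 1.5, "psn5": 1.5,
    "pan": 2.5,
}


def sequence_hours(sections: List[List[str]], node_hours: Dict[str, float]) -> float:
    """Sections run in series; nodes within a section run in parallel."""
    return sum(max(node_hours[node] for node in section) for section in sections)


# Grouping described in the text: SAN, one PSN alone, remaining PSNs together, PAN.
grouped = [["san"], ["psn1"], ["psn2", "psn3", "psn4", "psn5"], ["pan"]]
# Every PSN in its own subsection, so the whole upgrade runs in series.
serial = [["san"], ["psn1"], ["psn2"], ["psn3"], ["psn4"], ["psn5"], ["pan"]]

print("grouped:", sequence_hours(grouped, NODE_HOURS), "hours")
print("serial: ", sequence_hours(serial, NODE_HOURS), "hours")
```

With these made-up figures the grouped layout finishes several hours sooner than the fully serial one. The trade-off is that every PSN in a group is down at the same time.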
There are no time schedules or reboot schedules. Once you hit go, it will run until it either fails or finishes, bringing down the various nodes in the process. You get an approximate time to completion which is reasonably accurate, in our case over 10 hours. I therefore kicked it off on a Thursday evening, crossed my fingers and left it to its own devices!
This process is really the only way to upgrade a large deployment. The manual CLI process gives you much more control, but the planning and time investment required to make sure each node reboots at the correct time and re-joins the deployment when you are ready is too much. The CLI process does work for smaller deployments, and I would argue it is the better choice there because of the level of control it gives you.
I think the automatic upgrade process is good and a step in the right direction, but it does rely on some flexibility on the part of the sites you are disrupting. Unless you sit and watch the upgrade screen for 10 hours, you have no idea when each node is reloading, or even whether the upgrade worked! If Cisco added some options for stage gates based on time, I think it would improve the process and make it more manageable. If you could define at what time a unit should reload, you would be able to feed back more information to the site about downtime and also have a better idea of when to check in with the process.
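Until something like that exists, the best you can do is estimate the windows yourself and share them with the sites. The sketch below is only a planning aid under stated assumptions: the stage labels, durations and start time are hypothetical, it assumes the stages run back to back, and nothing in it talks to ISE or controls when a node actually reloads.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[str, datetime, datetime]


def stage_windows(start: datetime, stages: List[Tuple[str, float]]) -> List[Window]:
    """Return (label, begins, ends) per stage, assuming stages run back to back."""
    windows: List[Window] = []
    cursor = start
    for label, hours in stages:
        end = cursor + timedelta(hours=hours)
        windows.append((label, cursor, end))
        cursor = end
    return windows


# Hypothetical stage durations in hours, in upgrade order.
STAGES = [("SAN", 2.5), ("secondary PSN", 1.5), ("remaining PSNs", 1.5), ("PAN", 2.5)]
START = datetime(2024, 1, 4, 18, 0)  # placeholder Thursday evening start

for label, begins, ends in stage_windows(START, STAGES):
    print(f"{label:15} {begins:%a %H:%M} to {ends:%a %H:%M}")
```

These are estimates only. A stage that overruns pushes every later window back, so they are better communicated to sites as "not before" times than as firm appointments.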