HiveOS update to 6.1r2 failed

I tried to update my HiveManager appliance (1U, blue) from 6.1r1 to 6.1r2.
After the system rebooted, my entire Hive crashed. All APs were in green state and the HiveManager web interface was not available. I was able to ping the appliance, though.

I logged into the console via SSH and restored the HM back to 6.1r1, which went smoothly. All APs were back up too, but had lost their configuration. I was able to reconfigure them, so my hive is back to being as functional as it was before the update.

Now my question: what went wrong? We had no general network problems or power outages during the whole process.

Lenya

Posted 5 years ago

Brian Powers, Champ

I'd say Aerohive Support would be best equipped to answer the "why" here. But I ALWAYS perform a complete backup before performing any upgrades. I once upgraded a VM image and during the process the VM program crashed (VMware Fusion on a Mac). Upon restarting it, the VM image was completely hosed and wouldn't even load the OS far enough to get back into HM and restart the process. Luckily it was a test bed and no harm was done in the end, but it was all I needed to ensure I always back up prior to updates/downgrades.

There is, however, a CLI command that lets you see the progress of the update. Sometimes, when it seems like it should be done, it is still doing work behind the scenes, especially if you have a large deployment whose database it is trying to restore. I forget the CLI menu numbers to get to it and I don't have access to a HM to dig them up for you.

Curious, how do you know the whole "Hive crashed"? What defined that? The fact that the WebGUI wasn't available? In my experience, that is one of the last parts to be restored to working order (I say this because I've been able to SSH into the HM as you did and browse around the CLI for upwards of 20-30 minutes prior to the WebGUI being available).

Lenya

If I do an update/upgrade, I always do a full backup beforehand. That's routine, so I didn't think to mention it, my bad.

Thanks for the hint about the CLI command, I'll look for it in the HM help section.
I know full upgrades need some time, but after two hours had passed I was wondering what the deal was.

I say the Hive crashed because not only was the HM appliance unavailable until I pulled and replugged the power, but all APs rebooted too (before the hard reset) and were in green state. I have done every upgrade from version 3.x on, and it never happened before that the APs went down while only the HM appliance was being updated.

There's one AP down the hallway from where I'm sitting, so I can go and watch it easily. After a swift walk to other locations, and after the HM was finally back up and running, I could see they all had gone down and lost their configurations.
So yes, I think something went horribly wrong.

Brian Powers, Champ

Yeah, I'd say something went awry. Wish I could be more help in determining the problem. Glad you had backup!

The longest update I was part of was for a school that had 400+ APs on the old original 1U HM appliance, going from the previous major release to whichever major release was needed for the AP121/141s (the numbers elude me right now). That update probably took 3+ hours before it was completely done and usable...

But I've never run into long delays in the incremental updates (6.x -> 6.x+1, etc...)

Lenya

I was astonished how smoothly the major updates (3.x -> 6.1r1) all went back then, so imagine my shock when this one failed. We "only" have 82 AP120s at the moment, but plan to raise the numbers significantly (and get different types of APs), because we're expanding.
So I think going to a bigger appliance might be in order, but that's a different story.
I'm just happy I got the Hive back up :D

Brian Ambler

Lenya,

When the APs had a green LED, how did you ascertain that they had rebooted and lost their config after the HiveManager upgrade? While the HiveManager is down, the normal LED state on the APs (at least on the AP110/120/330/350) will be green as they still have a backhaul connection, but their CAPWAP connection is down.

While 6.1r2 is still a new release, I have had a number of customers already upgrade without issue. So though it is certainly possible that you ran into an issue while upgrading (uncommon, but not completely unheard of), it seems likely that your HiveManager was still in the process of restoring your database into the new 6.1r2 partition. For the record, the following will happen once "Update" is clicked (the order of steps 1 and 2 might be off, but this is my recollection):

1) The update package is uploaded to the HiveManager
2) An internal backup of the HiveManager is taken (full or config only depending on the option selected)
3) The HiveManager uses the upgrade package to upgrade the standby partition of the HiveManager to 6.1r2
4) The HiveManager copies the database backup taken at the beginning of the upgrade over to the newly installed 6.1r2 partition
5) The HiveManager reboots into the standby partition, which becomes the new active partition
6) The HiveManager restores the database backup into the newly installed 6.1r2 partition

*these steps are for 6.1r2, but you could substitute any version of code in place of 6.1r2
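
As a rough illustration only, the sequence above can be written down as ordered data, with steps 2 and 6 marked as the slow ones per this post. The names `Phase`, `UPGRADE_PHASES`, `slow_steps` and `phases_after_reboot` are made up for this sketch; it does not talk to a real HiveManager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    step: int
    description: str
    slow: bool = False  # assumption: flags the steps described here as longest-running


# Order as recollected in this post (steps 1 and 2 may be swapped in reality).
UPGRADE_PHASES = [
    Phase(1, "upload update package to the HiveManager"),
    Phase(2, "take internal backup (full or config only)", slow=True),
    Phase(3, "upgrade the standby partition with the package"),
    Phase(4, "copy the database backup to the new partition"),
    Phase(5, "reboot into the standby partition (becomes active)"),
    Phase(6, "restore the database backup into the new partition", slow=True),
]


def slow_steps(phases):
    """Return the step numbers flagged as long-running."""
    return [p.step for p in phases if p.slow]


def phases_after_reboot(phases, reboot_step=5):
    """Return the phases that run after the reboot, while the GUI is still down."""
    return [p for p in phases if p.step > reboot_step]
```

The point of the second helper is that the database restore happens after the reboot, which is why the appliance can look dead for a long time even though the upgrade is progressing.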

It is step 6 (along with step 2) that will take the longest to complete. Even if you have fewer than 100 APs, other factors (a large AVC database, historical reporting data, lots of old HiveOS images, etc.) can cause the HiveManager database to grow quite large. If you were to console into the HiveManager appliance (or pull up the VMware console if it were a virtual appliance) you would see "Starting Tomcat..." instead of a login prompt. To watch what the HiveManager is doing after it reboots during the upgrade (as Brian mentioned), SSH into the HiveManager and navigate to "3) Advanced Product Configuration > 1) Configure HiveManager > 16) Display HM Update/Restore Progress". This backend monitor shows what the HiveManager is doing while it appears to be unreachable after the reboot. It may take some time for the backend monitor to show progress, but eventually it will show you, line by line, which section of the database backup it is restoring, and then give you an all clear once the restore has finished. It is at this point (or shortly thereafter) that the HiveManager GUI will prompt you to log in.

Without knowing exactly what happened I can't give you a definite cause, but as an actual crash during or following an upgrade is quite rare, I would assume that the HiveManager was simply taking a long time to restore the database (which is much more common, especially with the introduction of AVC in 6.0r2 and up). I would be curious to see what happened if you tried the upgrade again, this time monitoring the restore process through the backend monitor from the HiveManager CLI, but this is up to you.
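
A minimal sketch of how one might tell "still restoring" apart from "really unreachable" from another machine, using only plain TCP connection attempts. The port numbers (22 for SSH, 443 for the web GUI) are the usual defaults and are an assumption here, as are the function names and the three state labels; this does not replace the backend monitor.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def classify(ssh_up, gui_up):
    """Map two reachability results to a rough state label (labels are hypothetical)."""
    if gui_up:
        return "gui-available"
    if ssh_up:
        # Consistent with a restore still running: check the backend monitor.
        return "ssh-only"
    return "unreachable"


def check_hivemanager(host, ssh_port=22, gui_port=443):
    """Probe both ports (assumed defaults) and classify the result."""
    return classify(port_open(host, ssh_port), port_open(host, gui_port))
```

An "ssh-only" result matches the situation described in this thread, where the CLI is usable well before the GUI comes back. An open port only shows that something is listening, not that the restore has finished.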

Hope this helps

Lenya

Brian,

I'm sure the APs did reboot, since the HM displayed an AP uptime matching the HM uptime once it was back (2 minutes). Since the other major HM OS updates (e.g. from 5.1x to 6.1r1) didn't cause the APs to do anything weird, I think this time something didn't work as intended, or at least as expected.
I didn't notice any AP downtimes/reboots or CAPWAP disconnects the last times I updated HM.

Of course the updates take a lot of time and there is a phase during which the HM is not reachable via the web interface, but I was not even able to ping it or SSH into it until I did a cold reset (after two hours of waiting). After this I was able to SSH into the HM.

I think I'll set up a new maintenance window and do a second try. I'll try to watch the process on the console and hope I was just unlucky the first time.

--Lenya

Jornt Weyts

Brian,

Our update on a HiveManager VM (64-bit) seems to be frozen at "Starting Tomcat...". If I use SSH to Display HM Update/Restore Progress, all I get is 'no update or restore operation.' repeating infinitely.
Is that what I'm supposed to see? It has been like this for over 20 minutes.

Jornt

Brian Ambler

Jornt,

If the HiveManager is sitting on "Starting Tomcat..." please be patient. It is not uncommon for the backend monitor to show no progress right away; sometimes it takes a while for the restore process to start. Do not reboot the HiveManager while it is sitting at this screen on the console: if the restore has started but for some reason is not being displayed, you could cause the database restore to fail and boot into a broken system. As I said earlier, it is not uncommon for a HiveManager with a large database to sit at "Starting Tomcat..." for 2+ hours.

I have requested an improved backup/restore process to address concerns such as these, but for now patience is a virtue.

Hope this helps

Loren

Crap I think mine just blew up.....argh How long do you have to wait???

As I'm typing this it came to life....whew!!!

I guess I better put a ticket in about my inability to complete a backup. Ha Ha.

Brian Ambler

Loren,

Please provide us with more specifics other than "I think mine just blew up" as that is not helpful. What makes you say that it "blew up"? If you truly feel that your HiveManager backup has failed in a significant way, calling Aerohive Support or your supporting partner is going to be your best bet.

But, since we're all here already, please SSH into the HiveManager using the above mentioned steps to check the progress of the upgrade through the backend monitor. How long you have to wait depends entirely on the size of your HiveManager database; it could take as little as 15 minutes or it could take hours.
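
For anyone who would rather script the waiting than stare at the console, here is a hedged sketch of a polling loop with a generous deadline. The function name `wait_until`, its parameters and the default interval are all assumptions for illustration; `probe` is any callable you supply that returns True once the HiveManager is usable again.

```python
import time


def wait_until(probe, max_wait_s, interval_s=60.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() every interval_s seconds until it returns True.

    Returns the elapsed seconds on success, or None once max_wait_s has
    passed without success. clock and sleep are injectable for testing.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= max_wait_s:
            return None
        sleep(interval_s)


# Hypothetical usage: allow four hours, since a restore can take
# anywhere from about 15 minutes to several hours.
# elapsed = wait_until(my_gui_check, max_wait_s=4 * 3600)
```

Returning None at the deadline only means "not back yet", not "failed". Per the advice in this thread, the next step would be checking the backend monitor over SSH rather than rebooting.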

Hope this helps

Loren



I am seeing this when I try to run any backup. Had this happen the last couple upgrades

Brian Ambler

Loren,

That usually indicates exactly what it says: that another administrator is running a backup. I take it this is not the case? If you started a backup/restore/upgrade operation and either the GUI timed out for some reason or you rebooted the HiveManager, the process will continue to run, either in the background or once the HiveManager comes back online after a reboot. Do you get this error message the very first time you try to run a backup? Or did a previous action take place, before you tried to run the backup, that spawned this message?

Thanks in advance