v17 Upgrade Disaster

Just sharing my odyssey from v16.5 MR-8 to v17 MR-2.  In short, it was a total disaster, but first, the backstory.

I upgraded to v17 GA several days after its release, having run it on my home XG through the beta and RC phases and being satisfied that it was sufficiently stable.  My production XG 210 ran for almost a week, then froze up completely.  After rebooting, I switched back to the v16.5 MR-8 image to wait for fixes to the lockup issues people were reporting, and I skipped v17 MR-1.

So fast forward to today: MR-2 was released, and since MR-1 seemed to have resolved the lockups people were reporting, I decided to give it another go. The first warning sign, which I should have paid more attention to: XG wanted to install the v17 GA firmware instead of MR-1 or MR-2. So I took a backup and blindly said, "OK, well, we'll do a stepped upgrade, GA first, then MR-2." I downloaded the firmware, installed it, and waited for it to come back up. And waited. And waited. After an hour it was clear nothing was happening, nor was anything going to happen. The LCD on the front of the XG 210 was blank. Hooked up the VGA cable, nothing. Nothing on the console. So after an hour and a half, I pulled the plug.

It booted up to v17 GA, but seemed to be missing most of the configuration. I couldn't get to the network interfaces at all; it said it was "initializing" and to try again later (30 minutes later it was still saying the same thing). Nothing worked. Finally I decided to just reset the configuration, but even that failed to run. So I decided to reboot and change the active firmware back to v16.5 MR-8. Whoa, that did not go well. I was greeted by several messages that partitions were being checked, followed by messages that whatever these checks were doing had failed. It seemed that the Reporting partition in particular was completely destroyed.

At this point, I decided I needed to re-image the device. No problem, I'll just get the ISO from MySophos. Well, upon login I was greeted with a 500 Server error telling me to try back in 15 minutes or call support. Almost 2 hours later, after sitting around twiddling my thumbs waiting for MySophos to work, I called support, who were nice but ultimately could not tell me why it was not working. While I was on hold waiting for somebody else, it began working again and I was able to get the ISO. Unfortunately, it appears you can only get the ISO for the most recent version. I wanted to go back to v16.5 MR-8, but that was apparently not an option.

So I grabbed v17 MR-2, made the bootable USB stick, and crossed my fingers that this would go right. To my surprise, it actually started up, formatted everything, and installed the firmware on the device without a hitch, and then, during setup, I was able to restore the backup I had taken (that's a nice touch). Upon reboot I once again had a functional XG, now at v17 MR-2. I just hope and pray at this point that I don't run into the XG lockup bug that bit me in the GA build, because I no longer have a v16.5 image to fall back on.

So some final thoughts:

  • It is very alarming to me that the update could literally destroy the entire system like this to the point that even the old firmware was wrecked beyond function.  There is something very wrong with an update process that can do this. 
  • It is way too hard to obtain the ISO/update files.  If you're going to hide them behind MySophos, it can't be down for hours at 8:00pm Eastern time.  Most people don't deploy firewalls or updates in the middle of business hours.  Epic fail.  I wasted essentially 3 hours tonight simply because I could not get to the stupid file I needed.
  • There is precious little in terms of onboard diagnostics to give you some direction on where to go.  Eventually I was able to infer, based on the messages at boot-up of v16.5 MR-8 and SF Loader's troubleshooter for the disk, that the disk was corrupt, although it never came right out and said "the disk is corrupt."  I felt like a detective trying to solve a crime with no witnesses and little to no evidence.  It's a scary thing to have to re-image the device, especially when you have no idea why the failure happened, whether re-imaging is even necessary, or whether it will fix the problem.

In any event, a very frustrating 6 hours.  The good news is that everything seems to be running well so far. 

  • Thanks for sharing your experience.

    I have personally asked Sophos several times to keep an FTP server for old images, as they do for UTM, but they do not understand this point of view, so in my case I keep the ISOs on a separate server, just in case.

    As for the upgrade, I do not know if you were just unlucky or what, but having the network down for 6 hours is a disaster. We expect the upgrade to go smoothly, but that does not happen every time. Anyway, it is always recommended to test the upgrade in a test environment before you go into production.

    Regards

  • In reply to lferrara:

    Luk, having just gone through what I went through, I agree it would be great if Sophos kept old images available for us, but I guess I will have to do as you do and just warehouse them myself, just in case.

    I am not able to keep a test environment that exactly duplicates production, but I do have an XG Home on which I test the configuration, as much as possible, before rolling it out to production, and I've never encountered anything like this.  I've never seen the upgrade process fail so thoroughly that it not only failed to upgrade but destroyed the existing firmware as well.  I can only hope this was a one-time fluke, but I guess time will tell.