Replacing SG550 Rev1 HA cluster with Rev2

Hello all,

We had our rev1 slave node die over the weekend, and Sophos is forcing us to replace both units with rev2 now. Does anyone know if the older SGFZTCHF2 (2-port SFP+) FlexiPort modules will work in rev2?

I'm trying to figure out how to minimize my downtime for the switchover by getting the new cluster up ahead of time, but to do that I need an extra module so that all I have to do is move the cables. I begged Sophos to send me a module too, since they are forcing me to do a complete replacement of both units. They said they would look around for one, but so far they haven't had any stock. That has me wondering whether the ones I do have would even work, or whether I should be purchasing a couple of different models.

Which brings me to my next question: does anyone have any experience with copying a config from a rev1 chassis to a rev2 chassis? Does anything change, like interface numbering? Sophos closed out my ticket once they issued my RMA. I complained in the survey that most vendors would have kept it open and helped me with the process, but I've heard nothing back yet. I'm hoping I can get this done smoothly; this cluster serves very sensitive, life-saving locations.

This thread was automatically locked due to age.

0 BAlfson over 1 year ago

Hey Jason and welcome to the UTM Community!

I think you should open a new ticket and immediately request escalation. If you can't do that, your reseller can.

I've not seen that anyone here has experienced a problem transferring a config from a Rev.1 to a Rev.2 SG. Note that the following requires that the replacement units are at the same UTM version or newer.

I'm surprised that you can't move the SGFZTCHF2 module from the dead node to a Rev.2, replace the dead Rev.1 node with the Rev.2 SG and simply let the Rev.1 Master sync to the Rev.2 Slave.

Then, to complete the replacement, power down the Rev.1 Master so that the Rev.2 becomes Master, move the other SGFZTCHF2 to the other Rev.2 SG, move that second Rev.2 into place alongside the Rev.2 Master and power it up.

This avoids losing logs and the reporting database.
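The precondition noted above (replacement units at the same UTM version or newer) can be checked before starting. What follows is only a minimal sketch in Python: the function names are hypothetical, the version strings are illustrative placeholders, and it assumes versions are written as numbers separated by dots or dashes. It does not query a UTM; you would read the versions off each unit yourself.

```python
import re


def parse_version(text):
    """Turn a version string such as '9.714-4' into a tuple of ints.

    Assumption: the version is a series of numbers separated by
    non-digit characters (dots, dashes).
    """
    parts = re.findall(r"\d+", text)
    if not parts:
        raise ValueError(f"no numeric components in {text!r}")
    return tuple(int(p) for p in parts)


def replacement_is_new_enough(current, replacement):
    """True when the replacement is at the same version as `current` or newer."""
    cur = parse_version(current)
    new = parse_version(replacement)
    # Pad the shorter tuple with zeros so '9.7' and '9.7.0' compare as equal.
    width = max(len(cur), len(new))
    cur += (0,) * (width - len(cur))
    new += (0,) * (width - len(new))
    return new >= cur


if __name__ == "__main__":
    # Illustrative values only, not taken from any real unit.
    print(replacement_is_new_enough("9.714-4", "9.715-1"))
```

If the check fails, the replacement would need updating before it joins the cluster.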

Cheers - Bob

Sophos UTM Community Moderator
Sophos Certified Architect - UTM
Sophos Certified Engineer - XG
Gold Solution Partner since 2005

MediaSoft, Inc. USA
0 Jason Brown3 over 1 year ago in reply to BAlfson

Thanks for the input. However, the new rev2 chassis showed up today and, to my surprise, the module slots are a completely different size, so there's no way to reuse the old modules in the new appliances. The rack rails don't transfer either; the new chassis are 1/4" wider than the old ones.

I've got a line of communication open with Sophos now, and I'm waiting to see what they do.
0 Raphael Alganes over 1 year ago in reply to Jason Brown3

Hello Jason,

Thanks for reaching out to Sophos Community and hope you are well.

I apologize that you have faced this inconvenience.

Also, would you be so kind as to share your case ID with us via DM or by replying to this thread, so we can also track the progress of this case on our end?

Many thanks for your time and patience, and thank you for choosing Sophos.

Cheers,

Raphael Alganes
Community Support Engineer | Sophos Technical Support
0 BAlfson over 1 year ago in reply to Jason Brown3

Ugh - glad none of my clients have bumped into this. If the old SFP+ modules won't work, I'd be disappointed in Sophos if they didn't give you new ones for the Rev.2 SGs. Let us know!

Cheers - Bob

0 Jason Brown3 over 1 year ago in reply to BAlfson

I've been in contact with my account executive and I think they are sending me some soon.

I do have a useful anecdote to share concerning the initial troubleshooting of this issue, for anybody who ever runs into the dreaded root partition filling up on the primary in an HA cluster, even with no queued updates, crash dumps, etc. I know there are lots of threads about it, most of which end up being Up2Date related.

This started out with the secondary having watchdog restarts for a few services and then rebooting for an "Unknown" reason on Saturday. It came back and brought HA back up, but shortly thereafter I started getting emails from the primary about its root filling up to 95%, up from 75% before the event, which is its normal high-water mark. I could see in the logs that communication between the primary and the secondary was having problems, as the secondary had put the database into read-only mode. Initially we (Sophos Support and I) thought the database was just busted and needed to be rebuilt.

The first clue was that the secondary was refusing SSH connections from within the shell on the primary, and its serial console was unresponsive (UTM still showed the slave as READY). I had to physically power cycle the secondary by pulling the power cables. On boot the console showed POST and it seemed to start normally. Support rebuilt the database, which completed successfully, then proceeded to troubleshoot the root partition on the primary but came up with no smoking gun.

In a later call that night, the slave again stopped accepting SSH, but this time instead of refusing the connection it said the password was incorrect, which made no sense. However, I still had console access on the laptop I had left plugged into it, and we found that the repctl process wasn't even running, which probably explains why the password wasn't working: it likely hadn't been replicated from the primary. I had to physically reboot it again. This time it would not fully boot; it would POST, show an error code, and instantly drop into the BIOS on its own, on every one of about six reboot attempts. Reflecting on it now, I think it suffered some kind of progressive storage failure. The RAID always said OK and never emailed any alerts, so I'm guessing it was something with the controller bus or the controller itself (I have no idea what's inside it; total guesswork).

With support still on the phone and having determined the slave was now unrecoverable, we eventually realized that the root partition on the primary had miraculously returned to 75%. So it seems the primary is trying to push something to the slave but getting rebuffed with "error read only filesystem", and that ephemeral data then eats up space on the root while it keeps retrying. Once the slave is no longer present, repctl on the primary disengages and the space is cleared.

The support folks seemed baffled by it, but if anyone has an unexplainable root partition issue, try taking the slave offline and see what happens. It could be an indicator of a more severe underlying issue.
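The before/after comparison described above can be written down as a small check. This is only a sketch in Python, not anything Sophos ships: the function names and the 10-point margin are assumptions, and the 75% baseline is simply the normal high-water mark from this cluster. The usage figure comes from `shutil.disk_usage`, which can differ slightly from what `df` reports.

```python
import shutil


def percent_used(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent, baseline=75.0, margin=10.0):
    """Label a reading 'elevated' when it sits well above the usual baseline.

    Assumption: `margin` is an arbitrary tolerance chosen for illustration.
    """
    return "elevated" if percent >= baseline + margin else "normal"


def slave_looks_implicated(with_slave, without_slave, baseline=75.0, margin=10.0):
    """True when usage was elevated with the slave attached and normal without it."""
    return (
        classify(with_slave, baseline, margin) == "elevated"
        and classify(without_slave, baseline, margin) == "normal"
    )


if __name__ == "__main__":
    reading = percent_used("/")
    print(f"root is {reading:.1f}% full: {classify(reading)}")
```

Taking one reading with the slave attached and another after taking it offline, then passing both to `slave_looks_implicated`, reproduces the 95% versus 75% comparison from this incident.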
0 BAlfson over 1 year ago in reply to Jason Brown3

Interesting, Jason.

Is Support telling you to disable HA, re-image the Slave, enable HA, attach the re-imaged device and then power it up? That would be my guess...

Cheers - Bob
PS Good on Sophos for replacing those modules.

0 Jason Brown3 over 1 year ago in reply to BAlfson

I don't know yet, I'm still waiting for all the gear before I do anything.