Can't log in after reboot

Hi,

A customer's Sophos HA Active/Passive cluster was misbehaving a few days ago - logging had stopped and some other things were going on. I rebooted the primary and then could no longer connect to anything. I didn't check beforehand if HA was functioning correctly :(

The next morning I power-cycled both devices and brought them back up, and while everything appears to be working (packets forwarding, APs working, etc.) I can't actually log in to check anything remotely. I can get to the login prompt for both the web console and SSH, but both the active and the passive devices are refusing my password.

This isn't the first time this has happened with this one client - a few things stop working, I reboot, and then I can no longer connect.

As far as I am aware there is nothing strange about their config.

My plan at this stage is to rebuild it all (maybe on 17.5.4... it's currently running 17.5.3) and restore from backup.

The hardware is an SG210 running XG. I think it might be rev 1 hardware but can't confirm at the moment. I've definitely had some issues with XG on rev 1 hardware at other clients, but nothing like this.

Any other suggestions? Any idea why the password might have been corrupted?

Thanks

James

  • You're not alone. I'm experiencing this on an HA pair of XG125s this morning. I had the client reboot the pair and they can get on the internet, but random services like DHCP are locking up, and after I type my password to log in over SSH it just hangs. The web interface is also stuck at the login screen and never accepts or rejects my password. I'm on 17.5.4 and have been for a few days. It seemed fine on 17.5.1, which it had been running happily for the past month.

    I know for a fact that HA was functioning correctly a few days ago, because it had to be for the firmware update.

  • I'm pretty sure I've found the root cause in this case. One of the nodes has a failed SSD. The disk test passes when run from SFLoader, but after a few hours of running I get this:

    Apr 14 13:11:06 (none) user.err kernel: [276396.518959] end_request: I/O error, dev sda, sector 162339046
    Apr 14 13:11:06 (none) user.info kernel: [276396.518967] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518968] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518969] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518970] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518970] Write(10): 2a 00 09 ad 18 f6 00 00 18 00
    Apr 14 13:11:06 (none) user.err kernel: [276396.518974] end_request: I/O error, dev sda, sector 162339062
    Apr 14 13:11:06 (none) user.info kernel: [276396.518978] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518979] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518980] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518981] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518981] Write(10): 2a 00 09 ad 19 16 00 00 18 00
    Apr 14 13:11:06 (none) user.err kernel: [276396.518984] end_request: I/O error, dev sda, sector 162339094
    Apr 14 13:11:06 (none) user.info kernel: [276396.518989] sd 0:0:0:0: [sda] Unhandled error code
    Apr 14 13:11:06 (none) user.info kernel: [276396.518990] sd 0:0:0:0: [sda]
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518990] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    Apr 14 13:11:06 (none) user.info kernel: [276396.518991] sd 0:0:0:0: [sda] CDB:
    Apr 14 13:11:06 (none) user.warn kernel: [276396.518992] Write(10): 2a 00 09 ad 19 46 00 00 18 00

    This suggests that the SSD is going offline and freezing the cluster.

    It's a bit disappointing that this brings down the whole A/P HA cluster, but at least I know what the problem is and that there's an easy solution.
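
    In case anyone else wants to check their logs for the same failure pattern, below is a rough sketch of how I'd pull those errors out of a kernel log. It's not Sophos tooling, just a few lines of Python that grep for the end_request lines; it assumes you can pipe dmesg into it, or copy a syslog export somewhere that has Python 3, since I wouldn't count on Python being available on the appliance itself.

    #!/usr/bin/env python3
    # Rough sketch (not Sophos-provided): scan dmesg/syslog output for
    # block-layer I/O errors like the ones quoted above.
    # Usage:  dmesg | python3 scan_io_errors.py
    #    or:  python3 scan_io_errors.py < exported_syslog.txt
    import re
    import sys

    # Matches lines such as: end_request: I/O error, dev sda, sector 162339046
    IO_ERROR = re.compile(r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")

    def failing_sectors(lines):
        """Yield (device, sector) for every block-layer I/O error found."""
        for line in lines:
            m = IO_ERROR.search(line)
            if m:
                yield m.group("dev"), int(m.group("sector"))

    if __name__ == "__main__":
        errors = list(failing_sectors(sys.stdin))
        print("%d I/O errors found" % len(errors))
        for dev, sector in errors[:20]:
            print("  /dev/%s sector %d" % (dev, sector))

    A handful of hits all pointing at the same device is a pretty strong hint that the disk, rather than the firmware, is the problem.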

    James

  • In reply to jamesharper:

    If only I were that lucky. My client's pair has locked up twice in three days with no failed SFLoader tests and no disk errors in syslog...

    I agree. It's called an HA cluster: if one node starts acting wonky for any reason, the other should take over. Period. Not take down the whole cluster...