EAP 3: Quick HA problems

Hi all,

I have two SG 210 rev. 3 devices.

I waited for EAP 3 to test the new HA version (Quick HA).

My primary device was upgraded from v17 to v18 EAP 1, then EAP 2, and now successfully to EAP 3.

My secondary device was installed from the EAP 3 ISO.

 

I set up both with Quick HA; everything went fine and showed "green". I then did a failover to the secondary device; all went well and the primary rebooted.

But it never came back up.

 

Looking in ctsyncd.log on the secondary device, all syncs are successful except for the dhcp6.lease file, as I do not use the IPv6 DHCP server.

 

Also, the LCD on the old primary is just blank, with a "-" sign in the top line.

I attached a screen and saw that it had gone into XG Failsafe.

 

The new master (the secondary) shows a dead heartbeat service, and the UI is VERY slow, although CPU is at just 6%.

 

I am now reinstalling both SG 210s from the EAP 3 ISO, restoring from backup (without HA), and will then try the Quick HA setup again.

I will keep you posted :-)

  • Another thing:

     

    PortC (E2) is DMZ on SG devices.

    It is not possible to change to PortD (E3), which is labeled HA on the front. Why?

     

    -----

    Best regards
    Martin

    Sophos XGS 2100 @ Home | Sophos v20 Technician

  • Follow-up: this time I rebooted the master, and after 6-10 minutes the HA was synced again. However, the new master now shows this:

    And the old master this:

     

    Of course they are down, as the slave is not supposed to handle these.

    Are there some forgotten checks that are not done after a master turns into a slave?

    -----

    Best regards
    Martin

    Sophos XGS 2100 @ Home | Sophos v20 Technician

  • Now I pressed "Switch to Auxiliary device".

    The master rebooted and the old master took over again, but it is now very slow; SSH commands take 5-10 seconds:

     

    SG210_WP03_SFOS 18.0.0 EAP3# tail -f ctsyncd.log
    [Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:52:40 2019] (pid=2507) [notice] committing all external caches
    [Fri Dec 20 08:52:40 2019] (pid=2507) [notice] Committed 56 new entries
    [Fri Dec 20 08:52:40 2019] (pid=2507) [notice] commit has taken 0.009839 seconds
    [Fri Dec 20 08:52:40 2019] (pid=2507) [notice] flushing external cache
    [Fri Dec 20 08:52:43 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:52:43 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:52:47 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:53:09 2019] (pid=2507) [ERROR] no dedicated links available!
    ^C
    SG210_WP03_SFOS 18.0.0 EAP3# tail -f msync.log
    Fri Dec 20 08:57:24 2019:870024:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:57:24 2019:870066:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:57:40 2019:468113:1452:BACK:MAST:INFO:vrrp.c:1111 no event set for event: MAST
    Fri Dec 20 08:57:40 2019:468131:1452:BACK:MAST:INFO:vrrp.c:1119 flags 2e event tracking stopped for last 5 minutes!!!(GTM:BACK)
    Fri Dec 20 08:57:44 2019:889281:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:57:44 2019:889300:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:58:04 2019:907287:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:58:04 2019:907308:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:58:24 2019:923398:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:58:24 2019:923419:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:58:40 2019:282294:1452:BACK:MAST:INFO:vrrp.c:1111 no event set for event: MAST
    Fri Dec 20 08:58:40 2019:282313:1452:BACK:MAST:INFO:vrrp.c:1119 flags 2e event tracking stopped for last 6 minutes!!!(GTM:BACK)
    Fri Dec 20 08:58:44 2019:941532:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:58:44 2019:941551:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:59:04 2019:957527:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:59:04 2019:957545:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.

     

    The old master is now dead, just as it happened at the start of this thread.

    I think the new HA is very unstable. With UTM it was a piece of cake to make this work #zeroconfrules

    Has anyone else tried it out? Switching back and forth?

    -----

    Best regards
    Martin

    Sophos XGS 2100 @ Home | Sophos v20 Technician
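
For anyone who wants to compare ctsyncd.log across several failover attempts, here is a minimal sketch (not a Sophos tool) that tallies the messages from the excerpt above and shows over what time window each one appeared. The line format is inferred only from that excerpt, and the names `LINE_RE`, `SAMPLE` and `summarize` are made up for illustration. Timestamp parsing assumes an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Line format inferred from the excerpt in this thread (an assumption):
# [Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] message text
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \(pid=(?P<pid>\d+)\) \[(?P<level>\w+)\] (?P<msg>.*)$"
)

# The ctsyncd.log lines exactly as posted above.
SAMPLE = """\
[Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!
[Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!
[Fri Dec 20 08:52:40 2019] (pid=2507) [notice] committing all external caches
[Fri Dec 20 08:52:40 2019] (pid=2507) [notice] Committed 56 new entries
[Fri Dec 20 08:52:40 2019] (pid=2507) [notice] commit has taken 0.009839 seconds
[Fri Dec 20 08:52:40 2019] (pid=2507) [notice] flushing external cache
[Fri Dec 20 08:52:43 2019] (pid=2507) [ERROR] no dedicated links available!
[Fri Dec 20 08:52:43 2019] (pid=2507) [ERROR] no dedicated links available!
[Fri Dec 20 08:52:47 2019] (pid=2507) [ERROR] no dedicated links available!
[Fri Dec 20 08:53:09 2019] (pid=2507) [ERROR] no dedicated links available!
"""


def summarize(text):
    """Count (level, message) pairs and record the first/last timestamp of each."""
    counts = Counter()
    first_seen, last_seen = {}, {}
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip shell prompts, blank lines, other formats
        stamp = datetime.strptime(match["ts"], "%a %b %d %H:%M:%S %Y")
        key = (match["level"], match["msg"])
        counts[key] += 1
        first_seen.setdefault(key, stamp)
        last_seen[key] = stamp
    return counts, first_seen, last_seen


if __name__ == "__main__":
    counts, first_seen, last_seen = summarize(SAMPLE)
    for (level, msg), n in counts.most_common():
        span = last_seen[(level, msg)] - first_seen[(level, msg)]
        print(f"{n:3d}x [{level}] {msg} (over {span})")
```

To use it on a real capture, read the log file into a string and pass it to `summarize` instead of `SAMPLE`. A growing count of "no dedicated links available!" across runs would be a useful detail to include in feedback.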

  • The broken slave shows this on screen:

     

    Booting '18_0_0_255'

    0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata: please update microcode to version: 0xb2 (or later)
    Password:

     

    But wasn't this a bug in EAP 2 that was supposed to be fixed in EAP 3?

     

    https://community.sophos.com/products/xg-firewall/sfos-eap/sfos-v18-early-access-program/f/feedback-and-issues/117071/firmware-bug-on-xg210

    -----

    Best regards
    Martin

    Sophos XGS 2100 @ Home | Sophos v20 Technician

  • Hi Martin,

    Thanks for your feedback. I will send you a PM to get more details.

  • Thanks, Martin, for your tests. I will take the EAP 3 training before testing the HA. I really hope that the DMZ zone is not needed anymore. A proper "HA" zone, or no zone at all, should exist for HA configuration.

    Also, I really hope HA configuration is like UTM. Zero touch!

  • Let us know if you hear anything on this. I rolled back to 17.5 because I was having random reboots with two 310 Rev 2 devices. Absolutely stable on 17.5.9.

  • Just a quick note on that: XG v18 HA is not fully zero touch like UTM.

    XG needs more background processes to actually pull off zero touch.

    You still need to register the appliance with mySophos to get the license and the model registered.

    Full zero touch would take this process into account and register the appliance for you when you put it into HA.

    There are a couple of challenges in performing this and other processes automatically.

    Quick HA mode simplifies the HA setup process for the administrator. Most likely you only need to start the auxiliary device, register it, skip through the wizard to the end, and enter the password of the HA node.

    No need to create new zones, IPs, etc.

