EAP 3: Quick HA problems

Hi all,

I have two SG 210 rev. 3 devices.

I waited for EAP 3 to test the new HA mode (Quick HA).

My primary device was upgraded from v17 to v18 EAP 1, then EAP 2, and now, successfully, EAP 3.

My secondary device was installed from the EAP 3 ISO.

 

I set up both with Quick HA. Everything went fine and showed "green". I then did a failover to the secondary device; all went well and the primary rebooted.

But it never came back up.

 

Looking in ctsyncd.log on the secondary device, all syncs are successful except for the dhcp6.lease file, as I do not use the IPv6 DHCP server.

Also, the LCD on the old primary is just blank, with a "-" sign in the top line.

I attached a screen, and it went into XG Failsafe.

The new master (the secondary) shows the Heartbeat service as Dead. The UI is also VERY slow, although CPU is at just 6%.

I am now reinstalling both SG 210s from the EAP 3 ISO, restoring from backup (without HA), and will try the Quick HA setup again.

I will keep you posted :-)

  • Another thing:

     

    PortC (E2) is DMZ on SG devices.

    It is not possible to change to PortD (E3), which is labeled HA on the front. Why?

     

  • In reply to twister5800:

    Follow-up: this time I rebooted the master, and after 6-10 minutes the HA was synced again. The new master now shows this:

    And the old master this:

     

    Of course they are down, as the slave is not supposed to handle these.

    Are there some checks that were forgotten and are not done after a master turns into a slave?

  • In reply to twister5800:

    Now I pressed "Switch to Auxiliary device".

    The master rebooted and the old master took over again, but it is now very slow; SSH commands take 5-10 seconds:

     

    SG210_WP03_SFOS 18.0.0 EAP3# tail -f ctsyncd.log
    [Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:52:40 2019] (pid=2507) [notice] committing all external caches
    [Fri Dec 20 08:52:40 2019] (pid=2507) [notice] Committed 56 new entries
    [Fri Dec 20 08:52:40 2019] (pid=2507) [notice] commit has taken 0.009839 seconds
    [Fri Dec 20 08:52:40 2019] (pid=2507) [notice] flushing external cache
    [Fri Dec 20 08:52:43 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:52:43 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:52:47 2019] (pid=2507) [ERROR] no dedicated links available!
    [Fri Dec 20 08:53:09 2019] (pid=2507) [ERROR] no dedicated links available!
    ^C
    SG210_WP03_SFOS 18.0.0 EAP3# tail -f msync.log
    Fri Dec 20 08:57:24 2019:870024:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:57:24 2019:870066:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:57:40 2019:468113:1452:BACK:MAST:INFO:vrrp.c:1111 no event set for event: MAST
    Fri Dec 20 08:57:40 2019:468131:1452:BACK:MAST:INFO:vrrp.c:1119 flags 2e event tracking stopped for last 5 minutes!!!(GTM:BACK)
    Fri Dec 20 08:57:44 2019:889281:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:57:44 2019:889300:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:58:04 2019:907287:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:58:04 2019:907308:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:58:24 2019:923398:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:58:24 2019:923419:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:58:40 2019:282294:1452:BACK:MAST:INFO:vrrp.c:1111 no event set for event: MAST
    Fri Dec 20 08:58:40 2019:282313:1452:BACK:MAST:INFO:vrrp.c:1119 flags 2e event tracking stopped for last 6 minutes!!!(GTM:BACK)
    Fri Dec 20 08:58:44 2019:941532:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:58:44 2019:941551:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
    Fri Dec 20 08:59:04 2019:957527:1418:BACK:MAST:DEBUG:worker.c:587 idle workers 10.
    Fri Dec 20 08:59:04 2019:957545:1418:BACK:MAST:DEBUG:worker.c:628 worker_num 14.
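
For anyone who wants to quantify how often ctsyncd reports the missing dedicated link, here is a minimal Python sketch that summarizes lines in the ctsyncd.log format shown above. It is only an illustration: `LINE_RE` and `summarize` are hypothetical names and not part of SFOS, the line format is assumed from the excerpt, and the timestamp parsing assumes an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line format, taken from the ctsyncd.log excerpt in this thread:
# [Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \(pid=(?P<pid>\d+)\) \[(?P<level>\w+)\] (?P<msg>.*)$"
)


def summarize(lines):
    """Count each (level, message) pair and track its first and last timestamp."""
    counts = Counter()
    first_seen = {}
    last_seen = {}
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip prompts, blank lines, and anything in another format
        ts = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
        key = (match.group("level"), match.group("msg"))
        counts[key] += 1
        first_seen.setdefault(key, ts)
        last_seen[key] = ts
    return counts, first_seen, last_seen


# A few lines copied from the excerpt above.
sample = [
    "[Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!",
    "[Fri Dec 20 08:46:17 2019] (pid=2507) [ERROR] no dedicated links available!",
    "[Fri Dec 20 08:52:40 2019] (pid=2507) [notice] committing all external caches",
    "[Fri Dec 20 08:53:09 2019] (pid=2507) [ERROR] no dedicated links available!",
]

counts, first_seen, last_seen = summarize(sample)
for (level, msg), n in counts.most_common():
    span = last_seen[(level, msg)] - first_seen[(level, msg)]
    print(f"{n:3d}x [{level}] {msg} (over {span})")
```

To run it against a real log, pass an open file object instead of `sample`, for example `summarize(open("ctsyncd.log"))`, after copying the file off the appliance.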

     

    The old master is now dead, just as happened at the start of this thread.

    I think the new HA is very unstable. With UTM it was a piece of cake to make this work. #zeroconfrules

    Has anyone else tried it out, switching back and forth?

  • In reply to twister5800:

    The broken slave shows this on screen:

     

    Booting '18_0_0_255'

    [    0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata: please update microcode to version: 0xb2 (or later)
    Password:

     

    But wasn't this a bug in EAP 2 that was supposed to be fixed in EAP 3?

     

    https://community.sophos.com/products/xg-firewall/sfos-eap/sfos-v18-early-access-program/f/feedback-and-issues/117071/firmware-bug-on-xg210

  • In reply to twister5800:

    Hi Martin,

    Thanks for your feedback. I will send you a PM to get more details.

  • In reply to twister5800:

    Thanks, Martin, for your tests. I will take the EAP 3 training before testing the HA. I really hope that the DMZ zone is not needed anymore. Either a proper "HA" zone or no zone at all should exist for HA configuration.

    Also, I really hope HA configuration is like UTM. Zero touch!

  • In reply to twister5800:

    Let us know if you hear anything on this. I rolled back to 17.5 because I was having random reboots with two 310 Rev 2 devices. Absolutely stable on 17.5.9.

  • In reply to lferrara:

    Just a quick note on that: XG v18 HA is not fully zero touch like UTM.

    XG needs more background processes to actually pull off zero touch.

    You still need to register the appliance with MySophos to get the license and the model registered.

    A fully zero-touch setup would take this process into account and register the appliance for you when you put it into HA.

    There are a couple of challenges in performing this and other processes automatically.

    Quick HA mode is meant to simplify the HA setup process for the administrator. Most likely you only need to start the auxiliary device, register it, run through the wizard to the end, and enter the password of the HA node.

    No need to create new zones, IPs, etc.

  • In reply to LuCar Toni:

    Thanks. It is still a beta, and in the EAP 3 training course this is not very clear. I am sure it will be addressed in the official delta course. Also, I hope that a proper KB article will be created on how to configure HA on v18+.

    Regards

  • In reply to lferrara:

    I just wanted to address the expectation.

    Full zero touch is being considered for later down the road.

    Basically, the v18 Quick HA mode will skip most of the steps written in this KBA: https://community.sophos.com/kb/en-us/123174

    It also brings some advantages, for example being able to change the HA settings without breaking the HA.

     

  • In reply to twister5800:

    I see the same behavior.
    I can't select the port.
    I've compared it on several appliances.
    The system automatically takes the first available port here.

     

     

     > Problem solved with EAP3 Refresh 1

  • In reply to bot:

    Sorry for the cross-posting. I was trying to get an answer to clear a flag on my issue, but I am now experiencing issues similar to those described here.

    So, here is a link to the other post to clear up the muddy water I created:

    https://community.sophos.com/products/xg-firewall/sfos-eap/sfos-v18-early-access-program/f/feedback-and-issues/117491/ha_pair-service-stopped