DHCP WAN not working after HA failover?

I have the most recent UTM software appliance running in HA mode.  The ISP is Verizon FIOS (DHCP address on WAN).  I've noticed that when an HA failover occurs, the host that is now master no longer has WAN connectivity: the dashboard shows an error indication for the WAN interface.  If I go into the Interfaces screen and click the 'renew' button, everything comes up just fine, but this is obviously less than optimal :)

I am using a 24-port EdgeSwitch, with ports 17, 18 and 19 in VLAN 2.  The two UTM appliances connect to 18 and 19, and the Verizon ONT (think cable modem) connects to 17.  At first I thought this was some kind of spanning tree issue, but all 3 ports are configured as edge ports, so they should start forwarding very quickly.

I did the most recent firmware update this AM, and it of course had to update and reboot both nodes.  Both times, the WAN interface in UTM showed as down, with an error indication.  When I ssh to the master node, I see this:

2019:07:17-07:21:13 gateway-1 dhclient: DHCPREQUEST for XXX on eth1 to 255.255.255.255 port 67
2019:07:17-07:29:08 gateway-2 dhclient: DHCPREQUEST for XXX on eth1 to 255.255.255.255 port 67
2019:07:17-08:41:01 gateway-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 7
2019:07:17-08:41:01 gateway-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 6
2019:07:17-08:41:01 gateway-1 dhclient: DHCPOFFER of XXX from QQQ
2019:07:17-08:41:01 gateway-1 dhclient: DHCPREQUEST for XXX on eth1 to 255.255.255.255 port 67
2019:07:17-08:41:01 gateway-1 dhclient: DHCPACK of XXX from QQQ

07:21:13 was when I told the UTM to perform the update.  You can see a DHCPREQUEST was sent twice (once by each node, at 07:21:13 and 07:29:08) with no answer, and then nothing further.  08:41:01 is when I noticed I was off the air and clicked the 'renew' button, at which point it did the full sequence of operations.  I freely admit I'm not that savvy with switching protocols, so I'm not sure what is going on here.  Any help would be appreciated.  As things are now, HA isn't really giving me any benefit: a failure will cause a failover, but I will still be off the air :(
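For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch that flags every DHCPREQUEST with no DHCPACK from the same host shortly afterwards. The timestamp layout is taken from the lines above; the function names and the 60-second window are assumptions made for illustration, not anything UTM provides.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2019:07:17-07:21:13 gateway-1 dhclient: DHCPREQUEST for XXX on eth1 ...
LINE_RE = re.compile(r"^(\S+) (\S+) dhclient: (DHCP[A-Z]+)")


def parse_line(line):
    """Return (timestamp, host, message_type), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y:%m:%d-%H:%M:%S")
    return stamp, m.group(2), m.group(3)


def unanswered_requests(lines, window=timedelta(seconds=60)):
    """Return (timestamp, host) for each DHCPREQUEST with no DHCPACK
    logged by the same host within `window` (an arbitrary choice)."""
    events = [e for e in map(parse_line, lines) if e is not None]
    missing = []
    for stamp, host, kind in events:
        if kind != "DHCPREQUEST":
            continue
        acked = any(
            k == "DHCPACK" and h == host and stamp <= t <= stamp + window
            for t, h, k in events
        )
        if not acked:
            missing.append((stamp, host))
    return missing


# The log excerpt from this post, unchanged.
LOG = """\
2019:07:17-07:21:13 gateway-1 dhclient: DHCPREQUEST for XXX on eth1 to 255.255.255.255 port 67
2019:07:17-07:29:08 gateway-2 dhclient: DHCPREQUEST for XXX on eth1 to 255.255.255.255 port 67
2019:07:17-08:41:01 gateway-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 7
2019:07:17-08:41:01 gateway-1 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 6
2019:07:17-08:41:01 gateway-1 dhclient: DHCPOFFER of XXX from QQQ
2019:07:17-08:41:01 gateway-1 dhclient: DHCPREQUEST for XXX on eth1 to 255.255.255.255 port 67
2019:07:17-08:41:01 gateway-1 dhclient: DHCPACK of XXX from QQQ
""".splitlines()

for stamp, host in unanswered_requests(LOG):
    print(f"{host}: DHCPREQUEST at {stamp:%H:%M:%S} got no DHCPACK")
```

On the excerpt above this reports the 07:21:13 request on gateway-1 and the 07:29:08 request on gateway-2, and treats the 08:41:01 request as answered.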

  • I have a random, dumb 8-port Netgear I can deploy as the WAN switch just to eliminate any EdgeSwitch oddities.  Will try that when I get home...

  • In reply to dswartz:

    If that works, try 7.7 in Rulz (last updated 2019-04-17).

    Cheers - Bob

  • In reply to BAlfson:

    Looking at this again, I find it hard to believe it is STP related.  The ports in question are set for edgeport mode, so it should go quickly.  Also, from the logs, there was only one dhcprequest sent, and it *should* have retried (and if it were STP related, one would think the packet would just get dropped, not give a link error indication.)  Later, when I have the chance to play around, I will reproduce this, and try to get more detailed information.

  • In reply to BAlfson:

    Interesting follow-up.  I had node 1 set as the preferred node, since it is a more powerful appliance with more RAM, etc.  Something tickled my memory, so I tried the repro a couple of times: power off the main appliance, it fails over with no issues; power the main appliance back up, it comes up, and the failback occurs.  *That* is the one that gets an error on the WAN interface (state ERROR).  Clicking the DHCP renew button 'fixes' it.  So I tried an experiment: I set preferred node to none.  Power off the main appliance, it fails over to the backup with no errors as expected; power the main back on, and it comes up as slave.  So far, no surprises.  I then clicked the reboot button for node 2; it went dead, and node 1 took over *without* any errors!  When node 2 came back up, it was the slave as expected.  It sounds like there is some wiggy interaction between the DHCP client interface (WAN in my case) and the preferred node setting in HA.  I don't at all mind having to manually fail back to node 1 if an issue occurs.  What I can't tolerate is the internet staying down until someone intervenes manually :(

  • In reply to BAlfson:

    I realized I should clarify a point: the error seems to be happening in middleware or somewhere thereabouts.  If I go to the shell and do 'ifconfig eth1', there are zero errors, collisions, etc, yet the GUI shows the interface as errored/down.  dhclient is not running at that point, which is presumably why doing a renew 'fixes' things...
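The two observations in this reply (link looks healthy, dhclient not running) can be checked together. Below is a rough Python sketch, assuming a Linux shell with Python available, which may not be the case on a UTM node. It reads the standard `/sys/class/net/<iface>/carrier` file and scans `/proc/<pid>/cmdline`; the function names are made up for this example, and `eth1` is simply the WAN interface from this thread.

```python
import os


def link_has_carrier(iface, sysfs_root="/sys/class/net"):
    """True if the kernel reports carrier (physical link up) on `iface`."""
    path = os.path.join(sysfs_root, iface, "carrier")
    try:
        with open(path) as fh:
            return fh.read().strip() == "1"
    except OSError:
        # Interface missing, or administratively down
        # (reading `carrier` fails in that case).
        return False


def dhclient_running(iface, proc_root="/proc"):
    """True if a process whose executable name contains 'dhclient'
    has `iface` among its command-line arguments."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return False
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited while we were scanning
        args = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if args and "dhclient" in os.path.basename(args[0]) and iface in args[1:]:
            return True
    return False


if __name__ == "__main__":
    iface = "eth1"  # WAN interface in this thread; adjust as needed
    print("carrier: ", link_has_carrier(iface))
    print("dhclient:", dhclient_running(iface))
```

If this prints carrier True and dhclient False right after a failback, that would match the symptom described here: the link is fine, but nothing is left running to ask for a lease.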

  • In reply to dswartz:

    I've seen this phenomenon several times.  That's why I recommended #7.7 above: it corrects an issue with some providers' switches where auto-negotiation of speed and duplex fails.

    Cheers - Bob

  • In reply to BAlfson:

    Okay, I'll try forcing full/1000 and retry.  This seems like it would have to be a bug in the Intel ethernet driver, no?  If auto-negotiation is not done yet, we should not be sending a DHCPREQUEST out that NIC.  Nor should the driver be accepting it?  And why does it only happen when failing back?  Hmmmm...

  • In reply to BAlfson:

    Okay, I set 100/full on both switch ports and on the UTM HW page.  Now it all seems to work, even with failback enabled.  I tried 1000/full, but the switch would not accept it (it had it as a menu option, but would apparently silently ignore it).  I have a 75/75 fiber connection, so 100/full is fine.  Seems like a bug somewhere, but at this point, I'm willing to cut bait and move on.  Thanks for the help :)

  • In reply to dswartz:

    My guess is that your ISP does MAC lockdown.  I think UTM moves the IP address but not the MAC.

  • In reply to DouglasFoster:

    No, the MAC moves also.  FWIW, Verizon does do MAC lockdown, but if the lease was not released before a new MAC tries for an address, it literally takes hours for the old lease to expire.  So merely clicking on the renew button in the Interfaces screen wouldn't accomplish anything.

  • In reply to dswartz:

    That is confusing.   As far as I know, the DHCP protocol allows for a lease to be renewed at any time.   The expectation is that the client will attempt to renew the lease at about 50% of the maximum lease period.   If the MAC is unchanged, the renewal should be a non-event.  Even more, if the DHCP lease information is replicated properly, and the MAC is unchanged, then the newly-active device would not need to ask for a lease.   I have no idea why the lease would need to expire before things start working.
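To put numbers on the "about 50%" figure: RFC 2131 gives default timers of T1 = 0.5 × lease (start renewing with the original server) and T2 = 0.875 × lease (start rebinding with any server), and a server may override both. A small Python sketch, with a 2-hour lease chosen purely as an example since the actual lease time is not stated in this thread:

```python
def dhcp_timers(lease_seconds):
    """Return (T1, T2) in seconds after the lease starts, using the
    RFC 2131 defaults: T1 = 0.5 * lease, T2 = 0.875 * lease."""
    t1 = 0.5 * lease_seconds
    t2 = 0.875 * lease_seconds
    return t1, t2


lease = 2 * 60 * 60  # hypothetical 2-hour lease, not a value from this thread
t1, t2 = dhcp_timers(lease)
print(f"renew (T1) at {t1:.0f}s, rebind (T2) at {t2:.0f}s, expiry at {lease}s")
```

With that lease, the client would normally try to renew after one hour and fall back to rebinding after 1 hour 45 minutes.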

    Overall, if this is for a business, it sounds like the best option is to switch to a static address.  If this is for home use, HA may not reduce enough downtime to make it worth this frustration, other than for its educational value.

    It might be informative to put a PC with Wireshark on the switch that implements the HA connection, and monitor traffic during the HA failover.   It would give you valuable information for pursuing a support case with Sophos.  You raise some interesting questions that would benefit from more data.

  • In reply to DouglasFoster:

    Yeah, this is truly mystifying.  This is a residential FIOS connection, so DHCP is the only option.  I am running the UTM CE, so I don't know if I can open support tickets or not.  Based on balfson's rulz 7.7 apparently fixing this, it seems almost like UTM is taking the slave down and bringing it back up, and trying to send the DHCPREQUEST before the link is ready, resulting in an error.  It isn't as simple as just dropping the DHCPREQUEST, as I would expect retries - when this failure mode occurs, the dhclient process is terminating and you are stuck.  This would seem to be confirmed by the fact that clicking the renew button 'fixes' the problem (as that would start the dhclient process.)  It's also odd that this only seems to happen when failing back to the preferred node.
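To make that hypothesis concrete, here is a Python sketch of the ordering one would expect: wait for the link to come up, and only then start the DHCP client. This is purely illustrative and is not how UTM is known to behave. `link_is_up` and `start_dhcp` are hypothetical callables supplied by the caller, and the timeout values are arbitrary.

```python
import time


def wait_for_link(link_is_up, timeout=30.0, interval=0.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll `link_is_up()` until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if link_is_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def bring_up_wan(link_is_up, start_dhcp, **wait_options):
    """Start DHCP only once the link is ready.

    Returns True if `start_dhcp()` was called, False if the link never
    came up within the timeout (in which case nothing is started)."""
    if not wait_for_link(link_is_up, **wait_options):
        return False
    start_dhcp()
    return True
```

The point of the sketch is the failure path: if the link check times out, the caller finds out and can retry, instead of the client exiting and leaving the interface stuck in an error state.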

  • I'm seeing the exact same issue on some of my customers' networks.

    After a failover, the new master is using the same IP and virtual MAC, but there is no connection (error in WebAdmin). After I renew the IP on the interface (same IP), the connection is immediately online again.

    In Switzerland there are many FTTH providers that deliver a fixed IP in DHCP mode (always the same IP). So we have to use DHCP on the WAN even though it is a "fixed" IP.

    I do not always see this problem, but I think this phenomenon has existed for many months, maybe years.

    I will try the suggestion to fix the interface to 1000/full, even though on the other side (HP ProCurve) I can only select 1000/Auto (I think that's because of some RFC in Gigabit Ethernet).

    I think this problem is not ISP related, because how would the ISP block or even notice the failover if the (virtual) MAC stays the same? I also do not think there is a DHCP release at failover, is there? So the ISP should not notice anything.

    I'm curious about other replies and experiences.

    - Michael

  • In reply to solae:

    Unfortunately it did not work; I had the same issue again even though I had configured the WAN interface for 1000/full. After briefly disconnecting the cable of the WAN interface (just 1-2 seconds), the connection was back.

    There is no DHCPREQUEST or anything similar in the log file, so the UTM seems to think the Internet connection is fine...

  • In reply to solae:

    Does it only happen on failback like in my case?