

Firewall DHCP Relay stops working until you delete and recreate a random DHCP Relay object

This issue has been annoying us for years, and it happened again today after one year of working correctly.

XG 430 with LAG and SFOS 19.5.3

The XG has several VLANs. On one VLAN, a Windows DHCP server is serving DHCP addresses.

On several other VLANs, also configured on the XG, there are DHCP forwarders pointing to the Windows DHCP server.

At some point the clients no longer receive DHCP offers and do not get IP addresses anymore.

This situation only ends with a firewall reboot or when you delete any DHCP relay object on the XG and recreate it.

Then the clients get IP addresses immediately.

Today it happened again. I had deleted a RED15 on the XG and powered on another RED15W. Both have DHCP servers.

I have had several cases open with GES since 2021, and it has cost a lot of time and frustration. They never found anything helpful. They want us to reproduce the issue, but this is impossible: we have no idea how to reproduce it. We can only start logging and switch the logs to debug after it has occurred.

Cases handling the issue were:

05521277 / direct to 2nd Level: XG DHCP server or DHCP relay failing after some time - clients not receiving DHCP offer

05158330 / 05128430 / XG DHCP server or DHCP relay failing after some time - clients not receiving DHCP offer

04704295 / XG DHCP server or DHCP relay failing after some time - clients not receiving DHCP offer

03953883 / DHCP Relay not working until deletion and recreation of a random DHCP Relay object 

You can see on the XG that it is not forwarding any BOOTREPLY. Forwarding only starts again once you have recreated the DHCP relay.

172.16.xxx.xxx is the IP address of the Windows DHCP server that the relay points to.

XG430_WP02_SFOS 19.5.3 MR-3-Build652 HA-Primary# tail -f networkd.log
udhcpc: sending discover
Forwarded BOOTREQUEST for 54:e1:ad:76:c0:f2 to 172.16.xxx.xxx
Forwarded BOOTREQUEST for ec:79:49:4e:99:57 to 172.16.xxx.xxx
Forwarded BOOTREQUEST for e8:80:88:54:61:5e to 172.16.xxx.xxx
udhcpc: sending discover
Forwarded BOOTREQUEST for e8:80:88:54:61:5e to 172.16.xxx.xxx
Forwarded BOOTREQUEST for 28:16:ad:3a:4c:83 to 172.16.xxx.xxx
Forwarded BOOTREQUEST for ec:79:49:4e:99:57 to 172.16.xxx.xxx
....
dhcp relay recreation
....
udhcpc: sending discover
Forwarded BOOTREQUEST for 60:5b:30:00:29:1f to 172.16.xxx.xxx
Forwarded BOOTREPLY for 60:5b:30:00:29:1f to 172.16.aaa.aaa
Forwarded BOOTREQUEST for 60:5b:30:00:29:1f to 172.16.xxx.xxx
Forwarded BOOTREPLY for 60:5b:30:00:29:1f to 172.16.aaa.aaa
Forwarded BOOTREQUEST for 60:5b:30:00:29:1f to 172.16.xxx.xxx
udhcpc: sending discover
Forwarded BOOTREQUEST for e4:46:b0:3a:04:0a to 172.16.xxx.xxx
Forwarded BOOTREPLY for e4:46:b0:3a:04:0a to 192.168.bbb.bbb
Forwarded BOOTREQUEST for e4:46:b0:3a:04:0a to 172.16.xxx.xxx
Forwarded BOOTREPLY for e4:46:b0:3a:04:0a to 192.168.bbb.bbb
udhcpc: sending discover
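
To spot this state faster the next time, the log pattern above can be checked automatically. This is only a minimal sketch, assuming the two line formats exactly as they appear in the networkd.log excerpt ("Forwarded BOOTREQUEST for <MAC> to <IP>" and "Forwarded BOOTREPLY for <MAC> to <IP>"). The function names are hypothetical and not part of SFOS; the script is meant to run on a copy of the log, not on the firewall itself.

```python
import re
from collections import Counter

# Line formats taken from the networkd.log excerpt in this post (assumption:
# other SFOS versions may log differently).
REQUEST_RE = re.compile(r"^Forwarded BOOTREQUEST for ([0-9a-f:]{17}) to (\S+)$", re.IGNORECASE)
REPLY_RE = re.compile(r"^Forwarded BOOTREPLY for ([0-9a-f:]{17}) to (\S+)$", re.IGNORECASE)


def summarize_relay_log(lines):
    """Count forwarded requests and forwarded replies per client MAC address."""
    requests = Counter()
    replies = Counter()
    for raw in lines:
        line = raw.strip()
        match = REQUEST_RE.match(line)
        if match:
            requests[match.group(1).lower()] += 1
            continue
        match = REPLY_RE.match(line)
        if match:
            replies[match.group(1).lower()] += 1
    return requests, replies


def unanswered_clients(lines):
    """Return the MACs that had requests forwarded but never got a reply forwarded back."""
    requests, replies = summarize_relay_log(lines)
    return sorted(mac for mac in requests if replies[mac] == 0)


if __name__ == "__main__":
    # Hypothetical usage: pass the path to a saved copy of networkd.log.
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            for mac in unanswered_clients(handle):
                print("no BOOTREPLY forwarded for", mac)
```

If every client in a recent stretch of the log shows up as unanswered, that matches the broken state shown before the relay recreation. A single unanswered client on its own could just as well be a scope or reservation problem on the Windows DHCP server.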



Added V19.5 MR3 TAG
[edited by: Erick Jan at 1:58 AM (GMT -8) on 10 Jan 2024]
[locked by: LuCar Toni at 9:03 AM (GMT -7) on 23 Jul 2024]
  • One fix for DHCP Relay came in with V20.0, with a catch.

    See: 

    https://support.sophos.com/support/s/article/KB-000045837?language=en_US

    As far as I know, a relay can stop if one of its interfaces is "invalid", for example a virtual interface such as a RED that is offline. 
    Could that be a way to reproduce it? 

    So if you have a timeframe in which the issue started, could you check the log viewer (System) for a change around that time? For example, a RED being disabled? 

    __________________________________________________________________________________________________________________

  • From the graph above and the Admin Audit log, I would say it started with the RED deletion.

    I'm in the process of creating a new support case.

    Case number: 07174270
  • Hey  , 

    Thanks for the service request number, I will get the SR expedited!

    Thanks & Regards,
    _______________________________________________________________

    Vivek Jagad | Team Lead, Technical Support, Global Customer Experience

  • So, could you actually reproduce this by deleting one RED? The key to success here is being able to reproduce the problem. 

    __________________________________________________________________________________________________________________

  • I have some spare RED15s I could delete. I wonder if I will junk them all? See my other recent post...

  • Some still work. So I deleted and recreated them, but the issue did not come back.

    The cluster was last rebooted 4 days ago. I think this was the first network change after that. Maybe it only happens in that combination: firewall reboot, RED deletion, DHCP issue? That is not so easy to test, as this is no playground.

    But I found a comment of mine in earlier case notes:

    9/20/2022 4:59 PM 05662019 / direct to 2nd Level: XG DHCP server or DHCP relay failing after some time - clients not receiving DHCP offer

    We cannot reproduce it. It just comes. Possibly it has to do with RED changes. Last time I deleted and recreated a RED. About 1 hour later we noticed the issue. You should have it in the logs.

    At that time I also tried deleting and recreating a RED again. The issue was not reproduced, but it seems somehow related.

  • So, my understanding from a past issue of this kind is: if there is a "huge" DHCP relay config and one interface goes offline, it can cause problems. After removing the faulty options (there were instances where customers added unsupported XFRM interfaces to the DHCP relay and broke it), it worked again for ages. 

    I'm not sure what causes your problem. The team will look into the timestamp on your appliance to check whether we can find an indicator. Still, in IT the hardest thing to debug is a problem that is not reproducible (as you probably know from the SATC issue). 

    __________________________________________________________________________________________________________________

  • I'd say it's not a huge DHCP config: 20 DHCP servers on the XG and a bunch of relays.

    I'd expect hundreds to be huge.

    Last weekend I rebooted both FW nodes, and tomorrow I will repeat a RED deletion to see whether that reproduces it. The RED I deleted initially is unfortunately broken, so it will be a different one. But both have (had) their own DHCP servers.

    In the meantime the support case is going to my satisfaction. I guess I have to thank   for that. They are looking at the backup from before the first RED deletion, and communication is really good at the moment.
