XG managed APX 740 APs randomly going offline and dropping all clients

Feature and severity: I have what appears to be a bug with SFOS v18 v3 and APX 740 wireless access points, which I consider moderately impacting.

Summary: I am unsure of the trigger; however, every now and then (seemingly at random, but multiple times per day) all 3 of my APX 740s “appear” to go offline, then come back a minute or two later.

Observed behavior: All 3 APs drop all clients and stop broadcasting their SSIDs, then come back after two or three minutes.  All the clients then need to re-home to the best AP.  I run 3 APX 740s using an XG210 as the controller and have 3 SSIDs (one for 2.4ghz only, one for 5ghz only, and a guest SSID on both 2.4 and 5ghz).  I thought it might be related to auto channel selection, so I manually set the channel on both the 5ghz and 2.4ghz radios on all APs (different channels, of course), but the problem persisted.  I don’t recall having this issue previously, but it may have been happening without my being aware of it: I’ve recently added a significant number of home automation devices, so it’s now very noticeable when it happens.

I tried Sophos Central wireless and it was worse.  I won’t go back to Central until several more releases come out.

Reproduce it:  This happens on its own many times a day; I’m not aware of anything that forces it.

Supporting logs: The log viewer, under “SYSTEM” shows (just a brief excerpt for brevity):

SYSTEM  2020-01-05 14:13:20  WirelessProtection  [MASTER] sending notification about offline AP P210018WVRKDY75  18006
SYSTEM  2020-01-05 13:00:20  WirelessProtection  Successfully sent config to AP [P210018V7XJJH9F].  18007
SYSTEM  2020-01-05 13:00:05  WirelessProtection  Successfully sent config to AP [P210018WVRKDY75].  18007
SYSTEM  2020-01-05 12:59:48  WirelessProtection  [MASTER] sending notification about offline AP P210018V7XJJH9F  18006
SYSTEM  2020-01-05 12:59:27  WirelessProtection  [MASTER] sending notification about offline AP P210018WVRKDY75  18006
SYSTEM  2020-01-05 12:57:29  WirelessProtection  Successfully sent config to AP [P210018V7XJJH9F].  18007
SYSTEM  2020-01-05 12:56:49  WirelessProtection  [MASTER] sending notification about offline AP P210018V7XJJH9F  18006
SYSTEM  2020-01-05 12:56:05  WirelessProtection  Successfully sent config to AP [P210018WVRKDY75].  18007
SYSTEM  2020-01-05 12:55:25  WirelessProtection  [MASTER] sending notification about offline AP P210018WVRKDY75  18006
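
One way to quantify these events is to pair each offline notice with the next config push to the same AP. The sketch below does that for the entries in the excerpt above. It assumes (this is not confirmed anywhere in the thread) that "Successfully sent config" marks the AP being reachable again, and the function and variable names are illustrative only.

```python
import re
from datetime import datetime

# Entries copied from the excerpt above as (timestamp, message) pairs.
EVENTS = [
    ("2020-01-05 14:13:20", "[MASTER] sending notification about offline AP P210018WVRKDY75"),
    ("2020-01-05 13:00:20", "Successfully sent config to AP [P210018V7XJJH9F]."),
    ("2020-01-05 13:00:05", "Successfully sent config to AP [P210018WVRKDY75]."),
    ("2020-01-05 12:59:48", "[MASTER] sending notification about offline AP P210018V7XJJH9F"),
    ("2020-01-05 12:59:27", "[MASTER] sending notification about offline AP P210018WVRKDY75"),
    ("2020-01-05 12:57:29", "Successfully sent config to AP [P210018V7XJJH9F]."),
    ("2020-01-05 12:56:49", "[MASTER] sending notification about offline AP P210018V7XJJH9F"),
    ("2020-01-05 12:56:05", "Successfully sent config to AP [P210018WVRKDY75]."),
    ("2020-01-05 12:55:25", "[MASTER] sending notification about offline AP P210018WVRKDY75"),
]

OFFLINE = re.compile(r"offline AP (\w+)")
CONFIG = re.compile(r"sent config to AP \[(\w+)\]")


def outages(events):
    """Pair each offline notice with the next config push for the same AP.

    Returns (closed, pending): closed is a list of (serial, start, seconds),
    pending maps serials to offline notices that have no later config push.
    Assumption: a config push means the AP is back, which is a guess.
    """
    pending = {}
    closed = []
    # ISO-style timestamps sort chronologically as plain strings.
    for stamp, message in sorted(events):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        offline = OFFLINE.search(message)
        config = CONFIG.search(message)
        if offline:
            # Keep the earliest unanswered offline notice per AP.
            pending.setdefault(offline.group(1), when)
        elif config:
            start = pending.pop(config.group(1), None)
            if start is not None:
                closed.append((config.group(1), start, (when - start).total_seconds()))
    return closed, pending


if __name__ == "__main__":
    closed, pending = outages(EVENTS)
    for serial, start, seconds in closed:
        print(f"{serial} offline at {start} for {seconds:.0f}s")
    for serial, start in pending.items():
        print(f"{serial} offline at {start}, no config push in excerpt")
```

On this excerpt it yields four outages of 40, 40, 38 and 32 seconds, plus one offline notice at 14:13:20 with no matching config push inside the excerpt.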

 

  • Hello Jamie,

    can you please specify which fixed channels you selected?

    Kind Regards,

    Suzzyx

  • Hi Suzzyx,

    for 2.4ghz (3 APs): 1, 6, 11

    for 5ghz (2 APs): 36, 44

    Side note: I could re-enable auto selection (I only ever used auto on 5ghz) and see if it happens.  Still no new AP offline entries in the log.

  • Yes, I can.  It’s a small test network with roughly 50 devices, most of them wireless.  I’m not using VLANs, although I have a test VLAN configured (tagged) on all 48 switch ports and a VLAN interface on the firewall that isn’t used (it is, however, configured on the only LAN interface I have).  The DHCP network for that VLAN is turned off as well.  Much of this was put in place to test Central wireless.  I’ll put a rough diagram together today and post it.

  • Didn’t get a chance to put a diagram together; today got unexpectedly busy.  I will tomorrow.  But it happened again today (that’s twice today now):

    SYSTEM  2020-01-12 13:48:22  Wireless Protection  Successfully sent config to AP [P210018WVRKDY75].  18007
    SYSTEM  2020-01-12 13:47:48  Wireless Protection  [MASTER] sending notification about offline AP P210018WVRKDY75  18006
  • Again

    SYSTEM  2020-01-13 00:52:16  Wireless Protection  Successfully sent config to AP [P210018WVRKDY75].  18007
    SYSTEM  2020-01-13 00:51:35  Wireless Protection  [MASTER] sending notification about offline AP P210018WVRKDY75  18006
  • The switch port the AP is connected to is bouncing when these entries occur in the firewall log.  No errors on the switch.  It appears the AP is actually reloading, but I’ll move switch ports later just in case and then replace the cable.  This happens to the other AP as well, though not nearly as frequently.  This is the log from my switch:

    13 Jan 2020 00:52:07%STP-W-PORTSTATUS: g17: STP status Forwarding
    13 Jan 2020 00:52:02%LINK-I-Up: g17
    13 Jan 2020 00:52:00%LINK-W-Down: g17
    13 Jan 2020 00:51:52%STP-W-PORTSTATUS: g17: STP status Forwarding
    13 Jan 2020 00:51:47%LINK-I-Up: g17
    13 Jan 2020 00:51:45%LINK-W-Down: g17
    13 Jan 2020 00:51:01%STP-W-PORTSTATUS: g17: STP status Forwarding
    13 Jan 2020 00:50:56%LINK-I-Up: g17
    13 Jan 2020 00:50:46%LINK-W-Down: g17
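
To see how long the port actually stays down, the switch log can be reduced to Down/Up pairs. This is a minimal sketch that assumes the line format shown in the log above; `link_flaps` and `SWITCH_LOG` are made-up names, and the `%b` month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Switch log lines copied from the post above (newest first, as posted).
SWITCH_LOG = """\
13 Jan 2020 00:52:07%STP-W-PORTSTATUS: g17: STP status Forwarding
13 Jan 2020 00:52:02%LINK-I-Up: g17
13 Jan 2020 00:52:00%LINK-W-Down: g17
13 Jan 2020 00:51:52%STP-W-PORTSTATUS: g17: STP status Forwarding
13 Jan 2020 00:51:47%LINK-I-Up: g17
13 Jan 2020 00:51:45%LINK-W-Down: g17
13 Jan 2020 00:51:01%STP-W-PORTSTATUS: g17: STP status Forwarding
13 Jan 2020 00:50:56%LINK-I-Up: g17
13 Jan 2020 00:50:46%LINK-W-Down: g17
"""

# Only link state changes are of interest; STP lines are skipped.
LINK_LINE = re.compile(
    r"^(\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2})%LINK-[IW]-(Up|Down): (\S+)$"
)


def link_flaps(log_text):
    """Return (port, down_time, seconds_down) for each Down/Up pair."""
    events = []
    for line in log_text.splitlines():
        match = LINK_LINE.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
            events.append((when, match.group(2), match.group(3)))
    events.sort(key=lambda event: event[0])  # oldest first

    down_since = {}
    flaps = []
    for when, state, port in events:
        if state == "Down":
            down_since[port] = when
        elif port in down_since:
            start = down_since.pop(port)
            flaps.append((port, start, (when - start).total_seconds()))
    return flaps


if __name__ == "__main__":
    for port, start, seconds in link_flaps(SWITCH_LOG):
        print(f"{port} down at {start} for {seconds:.0f}s")
```

For the excerpt above this gives three drops on g17 lasting 10, 2 and 2 seconds.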
  • I moved from g17 to g18 on the switch FYI 

  • Ian,

    I went back to your original suggestion regarding DHCP and noticed the lease time for the APs is 24 hours, which correlates with the times they go offline.  Would it be better to statically reserve these addresses or lengthen the lease period?  Or is there something unexpected going on here?  I would not expect the AP to lose its connection when it renews its IP lease.  Come to think of it, the WAN interface on the XG sometimes does the same thing.
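
For reference when comparing lease settings against outage times: by default a DHCP client first tries to renew at half the lease time (T1) and falls back to rebinding at 87.5% of it (T2), per RFC 2131, so a 24-hour lease produces activity at 12 and 21 hours after the lease was obtained, not only at expiry. The sketch below computes those points. The lease start time is a hypothetical value, not taken from the logs, and the AP or DHCP server may override the default timers.

```python
from datetime import datetime, timedelta


def dhcp_timers(lease_start, lease_seconds):
    """Default DHCP renewal points for a lease (RFC 2131 defaults).

    T1 (renew) defaults to 50% of the lease and T2 (rebind) to 87.5%.
    A server can send different values, so treat these as estimates.
    """
    lease = timedelta(seconds=lease_seconds)
    return {
        "T1 (renew)": lease_start + lease * 0.5,
        "T2 (rebind)": lease_start + lease * 0.875,
        "expiry": lease_start + lease,
    }


if __name__ == "__main__":
    # Hypothetical lease start; substitute the value from the DHCP lease table.
    start = datetime(2020, 1, 12, 1, 0, 0)
    for name, when in dhcp_timers(start, 24 * 3600).items():
        print(f"{name}: {when}")
```

Comparing these points with the offline notices in the log viewer is one way to check whether the outages really line up with lease activity.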

  • No symptoms or log messages since I statically reserved the IPs in DHCP.  Last event was 1/13.  Still too soon for me to be "comfy" though.

    Network drawing (basic):

  • Hi James,

    At one stage there was an issue with the DHCP server, so I extended all my lease times, and I assign my IPs statically.

    The WAN interface issue sounds like your ISP might be performing network maintenance. Do the changes occur at night?

    Ian

    I do not have any of the APX series APs only the previous models.

     
    V18.0.x - e3-1225v5 6gb ram on 4 port MB with 2 x APX120 - 20w. 
    If a post solves your question use the 'This helped me' link.
  • I found the culprit on the WAN side: it turns out there's a node problem in the neighborhood.  So you're right, it's carrier related, but it was one of those cases where they didn't know there was a problem in their infrastructure.

    I have not had an issue since I statically reserved the IPs for the APX APs, but I'd like to see a few more days go by before I feel "good", if that makes sense.  I find it odd that a static reservation solves this, because it's still a renewal, just one that ensures the same IP.  It seems to be masking an underlying issue.  I may do the same as you across the DHCP scope.

    Just my 2 cents right now.

  • Still solid uptime since the static reservations, but memory is now at 78%.  Concerning albeit unrelated to the AP issue.  Uptime is 10+ days.  Starts around 36% at boot.  I’ll watch it but worried there’s a memory leak at play.

  • It just happened again, to my disappointment.  There must be some logging on the access points that can be looked at.  I do not have power supplies for these access points (they don’t ship with them), but I would expect a PoE-related log entry on my switch.  Otherwise, I’d move them to the ports directly on the XG to rule out the switch.  I would imagine Sophos would want to get to the bottom of this, but it doesn’t feel like it.  I may consider rolling back to v17 to see if this follows, as I’m running out of options.  Keep in mind, no other devices are doing this.  At first it was all access points.  Now it’s only the access point that’s running the 5ghz SSID.  A while back I removed the 5ghz SSID from the other APs because devices weren’t re-homing properly, and sadly I’d prefer the outage.

    System logs from the XG (log viewer columns: Time, Log comp, Status, User name, Message, Message ID; Status and User name are empty):

    SYSTEM  2020-01-18 00:33:35  Wireless Protection  Successfully sent config to AP [P210018WVRKDY75].  18007
    SYSTEM  2020-01-18 00:33:01  Wireless Protection  [MASTER] sending notification about offline AP P210018WVRKDY75  18006

    Switch logs show only that port bouncing (and this is a new port and cable to the wall jack):

    18 Jan 2020 00:33:38%STP-W-PORTSTATUS: g18: STP status Forwarding
    18 Jan 2020 00:33:33%LINK-I-Up: g18
    18 Jan 2020 00:33:31%LINK-W-Down: g18
    18 Jan 2020 00:33:23%STP-W-PORTSTATUS: g18: STP status Forwarding
    18 Jan 2020 00:33:19%LINK-I-Up: g18
    18 Jan 2020 00:33:16%LINK-W-Down: g18
    18 Jan 2020 00:32:32%STP-W-PORTSTATUS: g18: STP status Forwarding
    18 Jan 2020 00:32:27%LINK-I-Up: g18
    18 Jan 2020 00:32:17%LINK-W-Down: g18

    No errors on the switch port (the 3rd column is the received-error count):

      g18 15828947 0 30240 22497987 0 0
  • Hi James,

    you are describing an overheating issue. Try turning the 2.4ghz SSID on the failing AP off.

    Ian

     
  • I’ll give it a shot.  I’ll do it today.  Bit strange that using both radios causes them to overheat (I understand your point though).  The APs are in a well cooled area (sitting on a furniture surface - that’s it).  Given that all the APs were doing this when they all had both SSIDs enabled  and now only the AP that has both radios enabled is doing this, you could be on to something.  That’s a design flaw to me.

    But, as I said, it’s worth a try to identify the source.  I’ll report back. Thx Ian.  

  • Done.  2 APs with only 2.4ghz and one AP with only 5ghz

  • Quick update:

    10 days of uptime and frankly better AP association by endpoints.  I’m going to call it identified and remediated after a month, but not resolved, as this is a product defect in my opinion.  This happened with the TX power turned down, too.  Honestly, the access point that is now dedicated to just 5ghz is running warm to the touch.  I can no longer run guest on all access points as 2.4 + 5ghz, and I need additional APs to fill in 5ghz coverage gaps.  Nevertheless, it would appear you were correct, Ian.