This discussion has been locked.

9.400 bricks connected APs!

Hello,

Just a heads-up about the soft-released 9.400.


I just installed it on the gateway, and all connected APs that were using VLAN tagging and bridging wireless networks to VLANs got bricked by the AP firmware update.

If the AP is not using VLAN tagging and the wireless networks just bridge to the AP LAN, it works. If VLAN tagging is used, then after the firmware update the AP never finishes booting and gets stuck in a never-ending loop while trying to get an IP address from DHCP (AP50). I also got confirmation from a customer whose AP30s were bricked. The AP50 and AP30 can be brought back to a working state with the flash tool. Unfortunately, the AP55C is bricked completely because there is no flash tool for it: once it gets the 9.400 firmware update and is deployed with VLAN tagging enabled, it basically dies. It never finishes booting, does not even get to the point of requesting an IP address from DHCP, and just reboots over and over again.

After rolling back to 9.355 (restored from tape) and reflashing the APs with the recovery tool (where possible), wireless works fine again.

Looks like some nasty bug that slipped through QA... again :(



  • Hello Zdenek,

    we just investigated the regression you mentioned, and there is indeed a regression, but it is a bit different than you think. First of all, the AP is probably not bricked (even if it looks that way). What is broken is the fallback mechanism, which your setup probably relies on by default because the connecting VLAN is untagged (more on that below). When you configure your AP with VLAN tagging, it tries to connect to the UTM over the specified VLAN. If it cannot do so, after some time it falls back to the default LAN behaviour, meaning it contacts the UTM without a VLAN tag. This fallback is currently broken in 9.400.

    If you use the VLAN tagging option while your switch is configured to deliver the specified VLAN as 'untagged', the AP will first try to connect to the UTM over the specified VLAN. Its packets leave the AP with the VLAN tag and the switch forwards them to the UTM, but when the UTM's answer reaches the switch, the VLAN tag is removed. The AP therefore cannot match the answer (without VLAN tag) to its request (with VLAN tag), and after some time it switches to the fallback and keeps working that way. The bridge-to-VLAN networks are not affected by this fallback, so they still work as expected; only the way the AP contacts the UTM changes.

    We are working on a fix for this regression. For now, if you want to get the APs that no longer show up running again, you could have the switch deliver the VLAN to them as 'tagged'; they should then come up and work again.

    Regards,
    Emanuel
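
The connection logic described in this reply can be sketched as a small model. This is only an illustration of the behaviour as explained above, not Sophos code: the function names, the `fallback_works` flag, and the result strings are hypothetical, and the switch is reduced to a single choice of whether replies leave the AP port tagged.

```python
from typing import Optional


def reply_tag(control_vlan: int, port_sends_tagged: bool) -> Optional[int]:
    """Tag on the UTM's reply as it leaves the AP-facing switch port.

    None means the switch stripped the tag (control VLAN delivered untagged).
    """
    return control_vlan if port_sends_tagged else None


def ap_connect(control_vlan: int, port_sends_tagged: bool, fallback_works: bool) -> str:
    """Model of the AP contacting the UTM: tagged attempt first, then fallback."""
    # Step 1: the AP sends its request tagged with the control VLAN and only
    # accepts a reply that carries the same tag.
    seen = reply_tag(control_vlan, port_sends_tagged)
    if seen == control_vlan:
        return "connected-tagged"

    # Step 2: the reply came back untagged, so the AP cannot match it.
    # After a timeout it should retry without a VLAN tag (the fallback).
    if fallback_works:
        return "connected-untagged-fallback"

    # Assumption: with the fallback broken (as reported for 9.400),
    # the AP never reaches the UTM.
    return "unreachable"


if __name__ == "__main__":
    for tagged in (True, False):
        for fallback in (True, False):
            print(tagged, fallback, ap_connect(500, tagged, fallback))
```

In this model the only failing combination is an untagged control VLAN together with a broken fallback, which matches the suggested workaround of delivering the VLAN as 'tagged'.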

  • Hello Emanuel,

    thank you for the response. The switch port the AP is connected to carries the AP control VLAN both tagged and untagged, and the bridged VLANs are tagged on it as well. So, just as in the past when it worked, the AP should be able to reach the UTM either tagged or untagged.

    With the AP50, I can see that the AP boots and gets stuck in an infinite loop while trying to get an IP address from the DHCP server (the AP sends a request and the server provides an offer, but the AP never acks it). When I deploy the AP50 without VLAN tagging, using just a test SSID that bridges the traffic to the AP LAN, all works fine: the AP connects to the UTM, flashes the firmware, reboots, and WiFi starts working. After I change the configuration to a VLAN-tagged setup, it gets stuck in the DHCP cycle.

    The AP55, however, no longer finishes booting once the VLAN tagging config has been pushed to it. It does not even attempt to send a DHCP request and just reboots over and over again at random intervals. Unfortunately, there is no flash tool for the new APs, so right now it is a brick.

    Right now I'm not able to do any more tests, as I had to roll 9.400 back to 9.355, where everything works. WiFi with VLANs is critical and I have to keep the business going.

    Cheers,

    Zdenek

  • Hello Zdenek,


    to make sure I understand: let's say you have VLAN tag 500 configured as the AP control VLAN, and the switch has VLAN 500 both tagged and untagged on the port the AP is plugged into. Now a packet on VLAN 500 that is destined for the AP arrives at the switch. What does/should the switch do? Send the packet to the AP with the VLAN tag or without it? My expectation would be that it sends it without the tag.

    So my understanding is that if the switch has VLAN 500 configured as both untagged and tagged, then both are allowed: when the AP sends a packet with VLAN tag 500, it gets forwarded to the UTM, and when the AP sends a packet without any VLAN tag, it also gets forwarded to the UTM. For the return path, however, a decision has to be made. So I would expect the fallback behaviour described previously (which is broken since 9.400).

    Anyway, the problem you describe with the AP55 sounds a bit different, since it does not even send the DHCP request on the VLAN; we will continue looking into that one. Was the AP55 already configured in 9.35 and ended up in this state after the update, or did you configure it directly in 9.4?


    Regards,

    Emanuel
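
The question above about a port that carries VLAN 500 "both tagged and untagged" can be made concrete with a toy port model. It is a deliberately simplified sketch, assuming a port with one PVID (native VLAN) that classifies untagged ingress frames and leaves the port untagged on egress, plus a set of VLANs that egress tagged. The class and method names are hypothetical, and real switches differ in how they treat frames tagged with the native VLAN.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SwitchPort:
    pvid: int                                            # VLAN for untagged ingress frames
    tagged_vlans: Set[int] = field(default_factory=set)  # VLANs that egress with a tag

    def ingress_vlan(self, tag: Optional[int]) -> Optional[int]:
        """VLAN a received frame is placed in; None means the frame is dropped."""
        if tag is None:
            return self.pvid
        # Assumption: frames tagged with the PVID are accepted as well.
        if tag == self.pvid or tag in self.tagged_vlans:
            return tag
        return None

    def egress_tag(self, vlan: int) -> Optional[int]:
        """Tag a frame carries when leaving the port; None means untagged."""
        if vlan == self.pvid:
            return None  # native VLAN leaves untagged in this model
        if vlan in self.tagged_vlans:
            return vlan
        raise ValueError(f"VLAN {vlan} is not a member of this port")


if __name__ == "__main__":
    # Control VLAN 500 as PVID: tagged and untagged requests are both accepted,
    # but the reply goes back to the AP untagged.
    native = SwitchPort(pvid=500, tagged_vlans={10, 20})
    print(native.ingress_vlan(500), native.ingress_vlan(None), native.egress_tag(500))

    # Control VLAN 500 as tagged-only member: the reply keeps its tag.
    tagged_only = SwitchPort(pvid=1, tagged_vlans={500})
    print(tagged_only.egress_tag(500))
```

Under these assumptions the request path accepts both forms, while the return path has exactly one answer per VLAN, which is the decision point raised in the post.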

  • Thanks guys!  I had what appeared to be the exact problem that Zdenek described after upgrading to 9.400 this morning.  My AP55C just never came back online from the UTM's perspective.  I could look in the DHCP logs and see:

    2016:04:04-13:30:41 utmname dhcpd: DHCPDISCOVER from 00:1a:8c:xx:xx:xx (AP55C-A1234567890123) via 172.16.xxx.xxx
    2016:04:04-13:30:41 utmname dhcpd: DHCPOFFER on 172.16.xxx.xxx to 00:1a:8c:xx:xx:xx (AP55C-A1234567890123) via eth4.xxx

    There were hundreds and hundreds of these lines.  I am using a Cisco switch.  My original configuration on the switch was:

    UTM switch port - Trunked with native VLAN set to my management VLAN number (123).

    AP switch port - Trunked with native VLAN set to my management VLAN number (123).

    Taking your advice, I removed the native VLAN configuration from both the UTM and AP switch ports and everything started working immediately.  Thanks and I hope this helps for anyone else who is having a similar problem.
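
For anyone wanting to check their own logs for the same symptom, here is a minimal sketch that scans DHCP server log lines for clients that keep receiving offers but are never acknowledged. It relies only on the `dhcpd: DHCPDISCOVER` / `dhcpd: DHCPOFFER` line shapes quoted above; the assumption that a successful lease shows up as a `DHCPACK` line with the same MAC address, the function name, and the `min_offers` threshold are mine. The MAC pattern also accepts `x` so that redacted samples like the ones in this post still parse.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List

# Message type directly after the "dhcpd:" program name.
MSG_RE = re.compile(r"dhcpd: (DHCP[A-Z]+)\b")
# Six colon-separated octets; "x" is allowed for redacted log samples.
MAC_RE = re.compile(r"\b[0-9a-fx]{2}(?::[0-9a-fx]{2}){5}\b", re.IGNORECASE)


def find_stuck_clients(lines: Iterable[str], min_offers: int = 3) -> List[str]:
    """Return MACs that got at least min_offers offers and no DHCPACK."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        msg = MSG_RE.search(line)
        if msg is None:
            continue
        # Look for the MAC only after the message type, so the timestamp
        # at the start of the line can never be mistaken for one.
        mac = MAC_RE.search(line, msg.end())
        if mac is None:
            continue
        counts[mac.group(0).lower()][msg.group(1)] += 1
    return sorted(
        mac
        for mac, seen in counts.items()
        if seen["DHCPOFFER"] >= min_offers and seen["DHCPACK"] == 0
    )


if __name__ == "__main__":
    import sys

    # Usage: python find_stuck.py < dhcpd.log
    for mac in find_stuck_clients(sys.stdin):
        print(mac)
```

The threshold is there so that a client caught mid-handshake at the end of the log is not reported; a loop like the one described here produces far more than three offers.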

  • Hi,

    So does that mean you're not doing any 'bridge to VLAN' wireless networks?

    Thanks,

    Barry

  • Hi Barry,

    Sorry, should have been a little more descriptive - I get used to talking to customers all day, sometimes! LOL!

    So, I do have three Wireless Networks on that AP, all three configured "Bridge to VLAN".  So, all of my wireless VLAN traffic between the UTM and AP was obviously being tagged, but my management traffic wasn't being tagged (native - in Cisco speak).  Essentially the only thing I really ended up having to change was to adjust my management VLAN traffic to be tagged between the UTM and AP as well.

    Chase

  • Sorry for the late reply. The AP55 was never connected to 9.3; it was connected directly to 9.400 (it was a fresh AP55 out of the box). It appeared in the pending list, was accepted, the firmware update was pushed to it, and it never came back up.

  • ChaseDavenport said:

    Hi Barry,

    Sorry, should have been a little more descriptive - I get used to talking to customers all day, sometimes! LOL!

    So, I do have three Wireless Networks on that AP, all three configured "Bridge to VLAN".  So, all of my wireless VLAN traffic between the UTM and AP was obviously being tagged, but my management traffic wasn't being tagged (native - in Cisco speak).  Essentially the only thing I really ended up having to change was to adjust my management VLAN traffic to be tagged between the UTM and AP as well.

    Chase

    So at the moment, on the switch port the AP is connected to, do you have the management VLAN both as native and tagged? Because that is the configuration we are using, and it stopped working with 9.400.

    Having the management VLAN on the AP's switch port both tagged and untagged (native/PVID) is critical for seamless deployment of APs. Otherwise I would need a special port to connect each AP to first, let it upgrade the firmware and download its config, and then unplug it and connect it to its final port. Right now this functionality is broken (as was officially confirmed).

    Zdenek

  • Hi Zdenek,

    Are you saying we SHOULD or SHOULD NOT be using PVID (and tagging) with 9.400?

    FWIW, my UTM is in VMWare ESXi so I don't think PVID is going to help, but I have tried both VLAN1 and VLAN13 on the PVID setting on my switch.

    I thought about plugging the AP30 directly into another NIC on the ESXi server, but then will bridge-to-VLAN still work?

    Thanks,

    Barry

  • BarryG said:

    Hi Zdenek,

    Are you saying we SHOULD or SHOULD NOT be using PVID (and tagging) with 9.400?

    FWIW, my UTM is in VMWare ESXi so I don't think PVID is going to help, but I have tried both VLAN1 and VLAN13 on the PVID setting on my switch.

    I thought about plugging the AP30 directly into another NIC on the ESXi server, but then will bridge-to-VLAN still work?

    Thanks,

    Barry

    From experience in our setup, when the management VLAN is both tagged and untagged on the port the AP is connected to (how VLANs are delivered to the UTM itself is not important), the connected AP ends up in the infinite DHCP request/offer/never ack loop (which has been confirmed by Sophos as a bug in AP firmware).

    As Chase mentioned in his post, it seems that if you remove the untagged management VLAN (native/PVID) from the port the AP is connected to and keep it there as tagged only (or at least that is how I understood what he did), the AP starts working (which I can't confirm at the moment as I'm not in the office this week).

    As for plugging the AP directly into an ESXi NIC: I don't think it makes any difference. Depending on your ESXi configuration, the bridge-to-VLAN functionality may also not work as expected afterwards, so I would generally recommend keeping the AP plugged into a physical switch.

  • On my Netgear GS108T switch, I cannot disable PVIDs, but I can set them to a different VLAN (for the AP30 and ESXi host). That doesn't seem to help.

    Is there an ETA for a fix from Sophos?

    Thanks,

    Barry