Sophos XGS is not compatible with VLAN ID 0 (Null VID) frames as defined in 802.1Q

TLDR - IEEE 802.1Q reserves VLAN ID 0 for a special purpose. Sophos XGS firewalls do not implement this special purpose correctly, preventing communication with some ISP Gateway modems. The request for proper implementation of VLAN ID 0 handling is being tracked under Feature Request SFSW-I-2426.

Background

This article discusses Priority Tagging and VLAN ID 0 (VID 0), aka the null VLAN. These can be confusing subjects, so let's first talk about where they come from.

IEEE created the P802.1p working group to implement QoS at layer 2. Their work on Priority Tagging was incorporated into the 802.1D specification, which was later folded into the 802.1Q specification that we all know and love. People talk about 802.1p like it's the IEEE specification for Priority Tagging, but it's not. There is no 802.1p spec. Priority Tagging is a part of the IEEE 802.1Q specification, and has been for a while.

Section 9.6, Table 9-2 of the 802.1Q spec describes how VID 0 (VLAN ID 0) should be handled. It's called the "null VLAN ID", because it means that no VLAN is in use. Let me explain... Priority Tagging information is stored in the 802.1Q frame header. Normally the 802.1Q header is only applied to frames that are part of a VLAN. But what if you want to implement Priority Tagging on the native/default network without a VLAN? For this situation, the 802.1Q header is used with VID 0. It tells the network device "I know you just received a frame with an 802.1Q header, but this frame isn't associated with any VLAN."
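
To make the header layout concrete, here is a minimal Python sketch of how the 802.1Q tag could be decoded. The function names parse_vlan_tag and is_priority_tagged are made up for illustration. The layout it assumes is the standard one: a TPID of 0x8100 right after the two MAC addresses, followed by a 16-bit TCI that holds a 3-bit priority, a 1-bit drop eligible indicator, and the 12-bit VLAN ID. It only looks at a single tag and ignores stacked (Q-in-Q) tags.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def parse_vlan_tag(frame: bytes):
    """Return (pcp, dei, vid) for an 802.1Q-tagged frame, or None if untagged."""
    if len(frame) < 18:
        # Too short to hold two MACs, a 4-byte tag, and an inner EtherType.
        return None
    # Bytes 0-5: destination MAC, 6-11: source MAC, 12-13: TPID, 14-15: TCI
    tpid, tci = struct.unpack_from("!HH", frame, 12)
    if tpid != TPID_8021Q:
        return None
    pcp = tci >> 13          # 3-bit priority code point
    dei = (tci >> 12) & 0x1  # 1-bit drop eligible indicator
    vid = tci & 0x0FFF       # 12-bit VLAN ID; 0 is the null VID
    return pcp, dei, vid


def is_priority_tagged(frame: bytes) -> bool:
    """True when the frame carries an 802.1Q tag whose VLAN ID is 0."""
    tag = parse_vlan_tag(frame)
    return tag is not None and tag[2] == 0
```

A frame for which is_priority_tagged returns True carries priority information only. By the reasoning above, it belongs to the native/untagged network rather than to any VLAN.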

For further reading on the null VID, see pg 40 of this Avaya Labs paper from 2002.

The Problem

A fiber ISP in our area uses VLAN ID 0 Priority Tagging. They are unable to turn off Priority Tagging. Judging from a few Google searches, this seems to be common practice with some ISPs. Unfortunately, the current Sophos XGS firewalls are unable to process VLAN ID 0 traffic correctly, and are therefore unable to communicate with the modem.

Devices that are able to treat VLAN ID 0 as native (untagged) network traffic don't have any problems. For example, if I plug a Windows 10 PC straight into the ISP modem, it can communicate fine with no special NIC configurations (aside from setting an authorized MAC address).

Working with Sophos Support, we dug deeper into what's going on. In the following logs, the Sophos WAN IP is x.x.183.6 (MAC c8:4f:86:xx:xx:18) and the ISP gateway modem is x.x.183.1 (MAC 10:70:fd:xx:xx:0c). We see the Sophos broadcast ARP requests for the modem's MAC over and over. The modem responds within milliseconds every time. This indicates the modem is receiving the ARP requests from the Sophos and is sending replies. But the Sophos ignores the modem's replies and continues to broadcast the same request. The Sophos' ARP table shows this by listing the record for x.x.183.1 as 'Incomplete'.

14:37:08.631766 Port3, OUT: c8:4f:86:xx:xx:18 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has x.x.183.1 tell x.x.183.6, length 28
14:37:08.652088 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 60: vlan 1, p 0, ethertype ARP, Reply x.x.183.1 is-at 10:70:fd:xx:xx:0c, length 42
14:37:09.657818 Port3, OUT: c8:4f:86:xx:xx:18 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has x.x.183.1 tell x.x.183.6, length 28
14:37:09.664086 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 60: vlan 1, p 0, ethertype ARP, Reply x.x.183.1 is-at 10:70:fd:xx:xx:0c, length 42
14:37:10.677829 Port3, OUT: c8:4f:86:xx:xx:18 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has x.x.183.1 tell x.x.183.6, length 28
14:37:10.701087 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 60: vlan 1, p 0, ethertype ARP, Reply x.x.183.1 is-at 10:70:fd:xx:xx:0c, length 42
14:37:12.634568 Port3, OUT: c8:4f:86:xx:xx:18 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has x.x.183.1 tell x.x.183.6, length 28
14:37:12.651102 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 60: vlan 1, p 0, ethertype ARP, Reply x.x.183.1 is-at 10:70:fd:xx:xx:0c, length 42
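
If you want to check a longer capture for the same pattern, a rough Python sketch like the one below can tally it. It assumes the capture text has the same shape as the lines above (a "vlan N, p N" field on tagged frames, and "Request who-has" / "Reply" for ARP). The function name summarize_arp is hypothetical.

```python
import re

# Patterns matched against capture lines formatted like the ones above.
VLAN_RE = re.compile(r"vlan (\d+), p (\d+)")
ARP_RE = re.compile(r"(Request who-has|Reply) (\S+)")


def summarize_arp(lines):
    """Count ARP requests and replies, and collect the VLAN IDs seen on replies."""
    summary = {"requests": 0, "replies": 0, "reply_vlans": set()}
    for line in lines:
        arp = ARP_RE.search(line)
        if arp is None:
            continue  # not an ARP line (for example, ICMP)
        if arp.group(1) == "Reply":
            summary["replies"] += 1
            vlan = VLAN_RE.search(line)
            # None marks a reply that arrived without an 802.1Q tag.
            summary["reply_vlans"].add(int(vlan.group(1)) if vlan else None)
        else:
            summary["requests"] += 1
    return summary
```

Fed the eight lines above, it should count four requests and four replies, with every reply logged as vlan 1. The modem is answering every request, and the replies all arrive tagged.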

And the logs are littered with instances of the Sophos ignoring ICMP echo requests from the internet (x.x.120.109), despite ping/ping6 being enabled on the WAN for troubleshooting:

14:37:07.684498 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 78: vlan 1, p 0, ethertype IPv4, x.x.120.109 > x.x.183.6: ICMP echo request, id 4, seq 62724, length 40
14:37:12.683622 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 78: vlan 1, p 0, ethertype IPv4, x.x.120.109 > x.x.183.6: ICMP echo request, id 4, seq 62726, length 40
14:37:17.677525 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 78: vlan 1, p 0, ethertype IPv4, x.x.120.109 > x.x.183.6: ICMP echo request, id 4, seq 62728, length 40
14:37:22.681029 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 78: vlan 1, p 0, ethertype IPv4, x.x.120.109 > x.x.183.6: ICMP echo request, id 4, seq 62730, length 40
14:37:27.682931 Port3, IN: 10:70:fd:xx:xx:0c > c8:4f:86:xx:xx:18, ethertype 802.1Q (0x8100), length 78: vlan 1, p 0, ethertype IPv4, x.x.120.109 > x.x.183.6: ICMP echo request, id 4, seq 62732, length 40

The keen-eyed will have noticed that every packet the Sophos receives from the modem is logged as VLAN 1, not VLAN 0. So why would the Sophos rewrite the VLAN ID from 0 to 1? This isn't entirely wrong, as null VID frames should be treated as native network traffic, and the Sophos' default/native VLAN is 1. So what's the problem? Well, associating VID 0 frames with VLAN 1 only gets us halfway there. This is supposed to be untagged native network traffic. The Sophos is clearly treating it as tagged native VLAN 1 traffic.

So why can't we just create a VLAN 1 sub-interface on the Sophos' WAN port? Wouldn't that fix it? Yes and no. It does enable the Sophos to recognize inbound traffic from the ISP Gateway modem, but it also breaks outgoing communication to the ISP Gateway modem. The ISP modem is expecting native network traffic or VID 0 traffic. It doesn't recognize tagged VLAN 1 frames, and will discard them.

Workarounds

The best workaround we've found is to connect a switch in between the ISP Gateway and the Sophos WAN port. You don't need anything fancy, just something that's VLAN aware. We've been using a USW Flex Mini, and it fixes the problem out of the box, no special VLAN setup required. Because it's a managed switch, we do end up configuring some VLANs (on the switching side, not in the Sophos) to segregate the WAN traffic from the switch's management IP port, which is then plugged into a DMZ port on the Sophos. So some special configurations are required to tidy things up for remote management.

If you're not looking to spend any money and don't want to make any physical changes, you can configure a WAN bridge. Some claim this works because bridges can set a default Port VLAN ID (PVID) of 0, instead of the system's native network (VLAN 1). But I tend to think it's because the bridge offloads more to the CPU, and that code base implements appropriate VID 0 handling. The downside of a WAN bridge is that it wastes a physical interface and comes with all the other bridge limitations (no Dynamic DNS, PPPoE, or IPsec VPNs).

  1. Create a new zone called "DMZ_NULL" with type DMZ.
    1. This will be the zone assigned to the unused bridge member interface, because we're forced to have at least 2 member interfaces, and only 1 interface can have the WAN zone.
  2. Create a new bridge.
    1. Hardware: br0
    2. Enable routing on this bridge pair: I left this unchecked, but checking it shouldn't hurt either
    3. Member Interface 1: Select the actual WAN interface, and set the zone to 'WAN'
    4. Member Interface 2: Select an unused interface, and set the zone to 'DMZ_NULL'
    5. IPv4: Set up as instructed by your ISP. I'm using DHCP in my setup.
  3. Save the bridge

NOTE: When I created the bridge, its PVID was automatically set to 0. I didn't have to log into the Advanced Shell and run "system vlan-tag interface br0 vlanid 0" as others have reported. You can verify the bridge's PVID using the "system vlan-tag show" command.

Setting the bar too high?

After almost 2 months of working with Sophos Support (Case 01796956 - closed), the decision was made that this issue would become a feature request (SFSW-I-2426). I was not given an opportunity to comment on the Feature Request, and became concerned upon learning it was titled "Priority tagged frame support - 802.1p VLAN id Null/0". I fear someone at Sophos will read this and think the request is asking for full Priority Tagging support - reading, mapping, and configuring port-specific priority information, requiring changes to both the backend and front-end web UI. That's a big ask, and could kill the chances of this Feature Request getting approved.

This is why I'd like to highlight that full support for Priority Tagging is not being requested and is not required to fix the issue identified above. What's needed is basic compatibility with (not support for) VLAN ID 0 Priority Tagging. So when the Sophos receives a VLAN ID 0 frame (with or without Priority Information), it treats it as untagged native network traffic. That's it. No other support for Priority Tagging is necessary.
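
To show how small the requested behavior is, here is a minimal Python sketch of it. This is only an illustration of the logic, not how SFOS is actually implemented, and the function name normalize_null_vid is made up. If a frame carries an 802.1Q tag with VID 0, the 4-byte tag is dropped and the rest of the frame is passed on as ordinary untagged traffic. Everything else is left alone.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def normalize_null_vid(frame: bytes) -> bytes:
    """Strip the 802.1Q tag from VID 0 frames; return all other frames unchanged."""
    if len(frame) < 18:
        return frame  # too short to contain a tag plus an inner EtherType
    tpid, tci = struct.unpack_from("!HH", frame, 12)
    if tpid == TPID_8021Q and (tci & 0x0FFF) == 0:
        # Keep both MAC addresses (bytes 0-11), skip the 4-byte tag,
        # and continue with the inner EtherType and payload.
        return frame[:12] + frame[16:]
    return frame
```

Note that the priority bits are simply discarded here. Nothing needs to read, map, or act on them for the frame to be handled as native network traffic.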

I've requested the Feature Request's title be changed, but I have no visibility into this process. It would be a shame to have this feature request de-prioritized for being attributed to Priority Tagging, especially when the fix might have nothing to do with Priority Tagging at all...

Digging Deeper

It's interesting that the Sophos is modifying the VID 0 traffic (translating it to a VLAN tagged VID 1 frame), albeit not in the way we want. I attempted to find out what's responsible for this activity by checking the Traffic Control (TC) filters and qdisc rules, as well as ebtables, but found nothing. I'm curious where this activity is defined, because it could be an easy fix. For example, if there's existing control logic that says "rewrite VID 0 to VID 1", modifying the policy to something like "strip 802.1Q header from all VID 0 frames" would resolve this issue.

TC supports similar policies, which could be used to remove the VLAN header from all ingress frames on a particular port that match VID 0. An example of this can be seen on Stack Exchange. TC rules normally run in the kernel, so this could impact throughput. Supposedly you can use TC clsact and write a small BPF program to pop (remove) the VLAN tags. Since some NICs support hardware offload of BPF programs, this method would remain performant.



Edited TAGs
[edited by: Erick Jan at 5:10 AM (GMT -7) on 2 Oct 2024]
  • I looked into this FR and this situation, and we tagged the FR as you requested.

    Unfortunately, as you can probably relate, the world of ISPs is very diverse. There are big and small ISPs, and most do their own setups.

    This has led to more of a "keep an ISP router in place" scenario in such deployments, with plain IPv4 handed from the ISP router to the firewall.

    Could you give us some more insight into how this ISP uses it, and how popular this ISP is in your region?

  • The ISP we're having the issue with is Omni Fiber. It seems some AT&T Fiber deployments also have this issue... Just do a Google search for: at&t vlan 0 priority tagging

    Omni Fiber operates mostly in Ohio and Pennsylvania. Earlier this year they received a $150 million cash injection to continue expansion of high speed fiber in the midwest USA. They've been laying a ton of fiber lines where we're at in northern Ohio. Several of our customers have already moved or are considering moving to them.

    We discussed configuring the ISP router as a static bridge, and I was insistent that the Sophos would need to receive a public WAN IP (not a NAT IP), but nothing ever materialized. We felt that if we needed to stick a device in between the Sophos and the Omni Gateway, we'd prefer it be a device that we manage, such as a VLAN-aware switch (as discussed in the Workarounds section).