Internationalized Domain Names (IDNs)?

I am looking for any information on what I need to know and need to do so that:

  • my email filtering is ready for Internationalized Domain Names appearing on incoming email addresses, server host names, and content.
  • my web protection is ready for Internationalized Domain Names in web addresses, whether used on host names, path names, or page names.

Open questions:

  • To what extent is UTM prepared to handle IDN-containing content?
  • How do I configure IDN content filters in UTM?
  • How will IDN content appear in log files?

I asked Sophos Support first. They provided a link to the minimal information in this one KB article and suggested that I follow up with Sales.

https://community.sophos.com/kb/en-us/117316

Since this is already a multi-national audience, I thought I might get a better or faster answer from this forum.



This thread was automatically locked due to age.
  • Here is what I am figuring out:

    This RFC describes how to tag and encode Unicode or other non-ASCII text so that it can be carried in email message headers, which are otherwise restricted to ASCII.

       https://tools.ietf.org/html/rfc2047

    The format is: encoded-word = "=?" charset "?" encoding "?" encoded-text "?=", such as "=?UTF-8?B?ABCD?="

    In my data stream, this has become pretty common in the Subject text of advertising messages, usually when the sender wants to include emojis and other symbols. The most common charset is "UTF-8". The valid encoding options are B (Base64) and Q (a variant of quoted-printable), as explained in the RFC, and I have seen both in common use. I have also detected encoded words in the Friendly Name (display name) portion of the message From header.
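To see what an encoded word actually says, it has to be decoded first. The following is a minimal sketch, assuming Python and its standard `email.header` module; the function name `decode_mime_header` and the sample subjects are made up for illustration and have nothing to do with how UTM handles this internally.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw_value: str) -> str:
    """Return the readable form of a header that may contain RFC 2047 encoded words."""
    # decode_header splits the value into (bytes_or_str, charset) pairs;
    # make_header stitches them back together into one Unicode string.
    return str(make_header(decode_header(raw_value)))


# Illustrative inputs: one B-encoded (Base64) and one Q-encoded subject.
print(decode_mime_header("=?UTF-8?B?SGVsbG8gV29ybGQ=?="))
print(decode_mime_header("=?UTF-8?Q?Caf=C3=A9_menu?="))
```

A header with no encoded words passes through unchanged, so the same function can be run over every Subject or From value before a content rule is applied.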

    This RFC defines the terminology and framework for representing Unicode in domain names (IDNA).

    https://tools.ietf.org/html/rfc5890

    Each Unicode label in the name is put through the Punycode algorithm, which is defined separately in RFC 3492, and the result is prefixed with "xn--". The hyphens in positions 3 and 4 mark the label as reserved for encoded names, and the first two characters identify the encoding scheme. At present, Punycode is the only scheme in use, so "xn" is the only valid prefix before "--".
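The round trip between the Unicode form and the "xn--" form can be sketched as follows. This assumes Python's built-in `idna` codec, which implements the older IDNA 2003 rules (RFC 3490) rather than the IDNA 2008 rules of RFC 5890, so a few edge cases differ. The helper names `to_ascii` and `to_unicode` are my own.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII ("xn--") form, label by label."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII domain name containing "xn--" labels back to Unicode."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("bücher.example"))
print(to_unicode("xn--bcher-kva.example"))
```

Labels that are already plain ASCII are left alone, so the conversion is safe to apply to every name before it is logged or compared against a filter list.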

    For filtering based on content, encoded text creates complications, so the system administrator must understand whether a search rule will be applied to the encoded text or to the decoded text. If you want to block all encoded subject text, this regex (\=\?.+\?.+\?), if applied to the raw text, should match the encoded-word label regardless of the character set. We have been using the simpler approach of looking for (UTF-8), but that misses every other character set. Either way, this is a crude way of detecting the presence of encoded data. To do content filtering on the decoded data, a different approach will be needed.
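A somewhat tighter pattern for the raw-text case, sketched in Python with the standard `re` module. It follows the encoded-word layout quoted earlier (charset, then B or Q, then the encoded text). The names `ENCODED_WORD` and `find_encoded_words` are hypothetical, and whether a given filter product accepts this exact regex syntax is an assumption to check.

```python
import re

# "=?" charset "?" encoding "?" encoded-text "?="
# The charset and the encoded text may not contain "?" or whitespace.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=")


def find_encoded_words(raw_header: str) -> list:
    """Return (charset, encoding) for every encoded word found in the raw header text."""
    return [(m.group(1), m.group(2).upper()) for m in ENCODED_WORD.finditer(raw_header)]


print(find_encoded_words("=?UTF-8?B?SGVsbG8=?= and =?iso-8859-1?q?caf=E9?="))
```

Reporting the charset alongside the match makes it possible to allow some character sets while flagging others, rather than treating all encoded text alike.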

    For filtering based on email addresses and domain names, Punycode means that the universe of possible TLDs will continue to explode. To block unwanted TLDs, you might want to build a list of TLDs that you want to allow, then block everything else. Alternatively, you could use a regex to look for (xn--) anywhere in the name, or only in the TLD with (\.xn--[^.]+$), where the trailing $ anchors the match to the last label.
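Both approaches, the allow list and the "xn--" check, can be sketched without a regex by splitting the name into labels. This is an illustration only: `ALLOWED_TLDS` is a made-up allow list, the function names are hypothetical, and the code expects the ASCII form of the name, so a name in Unicode form would need converting first.

```python
# Hypothetical allow list; a real one would reflect local policy.
ALLOWED_TLDS = {"com", "org", "net"}


def labels_of(domain: str) -> list:
    """Split a domain into lowercase labels, ignoring a trailing root dot."""
    return domain.rstrip(".").lower().split(".")


def has_idn_label(domain: str) -> bool:
    """True if any label in the name carries the "xn--" prefix."""
    return any(label.startswith("xn--") for label in labels_of(domain))


def tld_of(domain: str) -> str:
    """Return the last label of the name."""
    return labels_of(domain)[-1]


def is_allowed(domain: str) -> bool:
    """Allow-list check: only names whose TLD is explicitly listed pass."""
    return tld_of(domain) in ALLOWED_TLDS


print(has_idn_label("xn--bcher-kva.example"), is_allowed("example.xn--p1ai"))
```

The allow list blocks new internationalized TLDs by default, while `has_idn_label` is the broader check that also catches encoded labels below the TLD.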

    All of the regexes above were written on the fly as I typed this, so they are untested.
