This discussion has been locked.

Filtering non-English character emails

Hi All,

Recently we have been inundated with emails containing Russian and Chinese characters. Is there any way to send these emails to quarantine via UTM 9 anti-spam?

Thanks,



  • You should be able to do this with a regex.

    I don't normally post non-Sophos links, but I found this resource, which explains Unicode support in regex and the syntax differences between regex libraries.

    https://www.regular-expressions.info/refflavors.html

    I do not know which regex library UTM uses. Mr. Alfson or Mr. Jaydeep, can you answer that?

    Recently I tried this regex in one of my three spam filters. I think this particular syntax works in all regex libraries.

    [\p{S}\p{C}\P{Latin}]

    That is supposed to say:

    • Any symbol (heart, smiley, etc.)
    • Any control character
    • Any non-Latin script

    We ended up disabling the rule because it matched too much. In retrospect, I think the issue was that I was checking both Body and Subject. The body contains line feeds, so the control-character match is not appropriate for the Body check.
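
The line-feed problem can be reproduced outside UTM with a small Python sketch. This is only a stand-in for `[\p{S}\p{C}]`: Python's built-in `re` module has no `\p{...}` support, so the helper below (`symbol_or_control_hits`, a made-up name) checks the Unicode general category instead. The sample subject and body are invented, and none of this says how UTM's own engine behaves.

```python
import unicodedata

def symbol_or_control_hits(text):
    # Rough stand-in for [\p{S}\p{C}]: keep every character whose Unicode
    # general category starts with S (symbols) or C (control, format, ...).
    return [(ch, unicodedata.category(ch)) for ch in text
            if unicodedata.category(ch)[0] in "SC"]

# Invented sample data, for illustration only.
subject = "Quarterly report"
body = "Hi team,\nthe total is $40 + tax.\n"

print(symbol_or_control_hits(subject))  # []
print(symbol_or_control_hits(body))
# [('\n', 'Cc'), ('$', 'Sc'), ('+', 'Sm'), ('\n', 'Cc')]
```

The clean subject produces no hits, while the ordinary body is flagged by its line feeds alone. Note that `$` and `+` also count as symbols, so `\p{S}` covers more than hearts and smileys.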

    For your purposes, [\P{Latin}] should catch Russian and Chinese text (as well as Korean, Katakana, etc.).
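
One caveat worth testing before enabling such a rule: "not Latin" also covers spaces, digits and punctuation, because Unicode assigns those to the Common script rather than to Latin. The Python sketch below illustrates the difference between a blanket non-Latin test and a test for specific scripts. It is an approximation under stated assumptions: Python's built-in `re` has no `\p{...}` support, so script membership is guessed from Unicode character names, and `BLOCKED_NAME_PREFIXES`, `is_not_latin`, `is_blocked_script_letter` and `should_quarantine` are hypothetical names, not anything from UTM.

```python
import unicodedata

# Hypothetical block list: prefixes of Unicode character names for the
# scripts mentioned in this thread. Adjust as needed.
BLOCKED_NAME_PREFIXES = ("CYRILLIC", "CJK", "HANGUL", "KATAKANA", "HIRAGANA")

def is_not_latin(ch):
    # Rough approximation of \P{Latin}: any character whose Unicode name
    # does not start with "LATIN". Spaces and digits qualify too.
    return not unicodedata.name(ch, "").startswith("LATIN")

def is_blocked_script_letter(ch):
    # Narrower test: a letter whose name starts with a blocked prefix.
    return ch.isalpha() and unicodedata.name(ch, "").startswith(BLOCKED_NAME_PREFIXES)

def should_quarantine(subject):
    # Flag the subject if it contains at least one blocked-script letter.
    return any(is_blocked_script_letter(ch) for ch in subject)

# Invented sample subjects, for illustration only.
for subject in ("Invoice 42 attached", "Привет, как дела?", "你好"):
    broad = any(is_not_latin(ch) for ch in subject)
    print(repr(subject), "broad:", broad, "narrow:", should_quarantine(subject))
```

The plain English subject trips the broad test because of its spaces and digits, while the narrow test flags only the Russian and Chinese subjects. On an engine that supports Unicode script properties, the equivalent narrower pattern would be something like `[\p{Cyrillic}\p{Han}]`, but whether UTM accepts it depends on the library question raised above.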
