This discussion has been locked.

Custom content type rules (regex-based) unable to scan the "new" MS Office file formats (docx, xlsx, pptx, etc.)

Hello,

As described in the subject, we have run into the following issue:

Sophos DLP doesn't scan the content of docx, xlsx and pptx files, while the same documents saved in the old formats (doc, xls, ppt) are scanned properly.

DLP doesn't seem to have any other issues.

Has anyone else had this problem?



  • Hello Dominik Uliasz,

    DLP doesn't scan content of docx ...
    what makes you think so? More likely the rule simply fails to trigger. There's a thread from a few months ago in which I successfully tested at least xlsX. So I assume that DLP is scanning the file, but that the content isn't formatted in a way that triggers the rule/CCL; see my post in that thread mentioning failed detections in a PDF.

    You can easily check whether these types are scanned by using a simple rule (e.g. one that triggers on the presence of a single word).

    Christian

  • Hi Christian,

    thanks for the prompt reply.

    The rules I'm talking about contain CCLs with custom regular expressions (looking for specific information).

    I've put the same content into different file types (xls, csv, doc, mht, htm, etc.) and these regular expressions match properly.

    (I should add that I use one CCL per rule: if you put several CCLs into one rule they are combined with an "AND" operator, so all of them have to match and the rule almost never triggers.)

    So the regexes in the CCLs match properly, but not for files saved as XLSX, DOCX or PPTX (and, I believe, the other default MS Office file extensions introduced with Office 2007).

    I thought this might have something to do with the encoding, but changing it makes no difference.

    Any suggestions?

  • Hello Dominik Uliasz,

    the actual text content is in an XML file inside the archive (these formats are ZIP containers), so you can easily check the characters and strings. If your expressions rely on control characters (LF and the like), they might not "see" the formatting they expect.

    Christian
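
To illustrate Christian's last point, here is a minimal Python sketch of how one might inspect the XML inside a docx and see why a regex that expects a line feed can miss. It is only an illustration under assumptions: the sample archive, the `Customer ID` pattern and the helper names are all hypothetical, and nothing here reflects how Sophos DLP actually extracts text. The sample archive contains only `word/document.xml`, so it is a stand-in for a real docx, not a valid one.

```python
import io
import re
import zipfile

# Hypothetical, minimal stand-in for a .docx: a ZIP archive that holds
# word/document.xml. Note that the value "AB-123456" is split across two runs.
DOCUMENT_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Customer ID:</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>AB-</w:t></w:r><w:r><w:t>123456</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def build_sample_docx() -> bytes:
    """Pack the sample XML into an in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", DOCUMENT_XML)
    return buf.getvalue()


def read_document_xml(docx_bytes: bytes) -> str:
    """Return the main document part of a docx as a string."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.read("word/document.xml").decode("utf-8")


def xml_to_plain_text(xml: str) -> str:
    """Crude conversion: paragraph ends become LF, all other tags are dropped."""
    text = re.sub(r"</w:p>", "\n", xml)
    return re.sub(r"<[^>]+>", "", text)


# Hypothetical expression that expects a line feed between label and value.
PATTERN = re.compile(r"Customer ID:\n[A-Z]{2}-\d{6}")

docx_bytes = build_sample_docx()
raw_xml = read_document_xml(docx_bytes)
plain = xml_to_plain_text(raw_xml)

print("match in raw XML:   ", bool(PATTERN.search(raw_xml)))
print("match in plain text:", bool(PATTERN.search(plain)))
```

In the raw XML there is no LF between the two paragraphs and the value is broken up by run tags, so the expression only matches once the markup has been converted to plain text. For a real file, passing its bytes to `read_document_xml` shows what the text actually looks like inside the archive (for xlsx the strings usually live in `xl/sharedStrings.xml` instead).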