reg ex question

Question

This is an interesting one, but no doubt someone will explain that I'm just not understanding regexes :) 
 We are using a default regex format in the "Block these websites" section of filter actions, e.g. ^https?://([A-Za-z0-9.-]*\.)?youtube\.com/ 
 However, we've noticed that if clients browse to a URL youtube.com. they can get past this, unless we change the regex to ^https?://([A-Za-z0-9.-]*\.)?youtube\.com (removing the / at the end). 
 Can anyone explain what we did wrong? 
 And yeah, I now have several dozen filter actions to work through :(

Sheldon Botha · Answer

I'm by no means an expert, but here are my thoughts. 
 From what I've read, UTM is using Perl regex: 
 community.sophos.com/.../117316 
 There is a free online tester here I found: 
 http://retester.herokuapp.com/ 
 
 If I try your regex of ^https?://([A-Za-z0-9.-]*\.)?youtube\.com/ on the online tester, it looks like it wants that trailing "/" in the match (e.g. "https://www.youtube.com/"). 
 If I visit the YouTube homepage by entering www.youtube.com, there is no trailing "/" on the URL at that point, which is why it is not matching on the homepage of YouTube and users can get that far. 
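 I note that removing the "/" to make your regex "^https?://([A-Za-z0-9.-]*\.)?youtube\.com" allows it to match http or https://www.youtube.com. Since there is no "$" anchoring the end of the pattern, it should also still match http or https://www.youtube.com/ and anything longer, because the match only has to cover the start of the URL. 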
 The "?" in regex means there must be zero or one of whatever immediately precedes the "?" for the match to succeed. 
 If I did something like ^https?://([A-Za-z0-9.-]*\.)?youtube\.com/? on the tester, that allows it to match both http or https://www.youtube.com/ and http or https://www.youtube.com 
 You of course also have "*" in regex, which is zero or more of whatever precedes it, so ^https?://([A-Za-z0-9.-]*\.)?youtube\.com/* also appears to match whether the URL has the "/" or not. Since a URL would not normally have more than one "/" at that point, I'd use "?". 
 Your regex "should" match any YouTube video URL that has something after the trailing "/" at the end of .com, but for the homepage itself, it looks like that regex expects the "/" as part of the pattern, and since the homepage URL ends at .com, it's not matching. 
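 If you'd rather check this outside the online tester, here is a rough sketch in Python comparing the four variants above. Some assumptions on my part: Python's re module is not the Perl engine the UTM uses, but for these simple constructs I'd expect the same results; the pattern names and test URLs are made up for illustration; and the last URL (host with a trailing dot) is just one possible reading of "youtube.com." in the question.

```python
import re

# The four pattern variants discussed in this thread (names are illustrative).
PATTERNS = {
    "original": r"^https?://([A-Za-z0-9.-]*\.)?youtube\.com/",
    "no_slash": r"^https?://([A-Za-z0-9.-]*\.)?youtube\.com",
    "optional_slash": r"^https?://([A-Za-z0-9.-]*\.)?youtube\.com/?",
    "star_slash": r"^https?://([A-Za-z0-9.-]*\.)?youtube\.com/*",
}

# Hypothetical test URLs; the trailing-dot host is an assumption about
# what "youtube.com." in the question might mean.
URLS = [
    "https://www.youtube.com",
    "https://www.youtube.com/",
    "https://www.youtube.com/watch?v=abc",
    "https://www.youtube.com./",
]


def matching_patterns(url):
    """Return the names of the patterns that match the start of url."""
    # re.search with a leading ^ anchors the start only; nothing anchors the end.
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, url)]


for url in URLS:
    print(url, "->", matching_patterns(url))
```

 In this sketch the original pattern is the only one that misses the URL with no trailing "/" and the one with the trailing dot, while the other three match all four URLs.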
 
 I may get corrected, but I believe that's what is happening here. It's been a few years since I've really used regex extensively, so I may be a little rusty on my assumptions here :) 
 
 Thanks 
 
 Sheldon