
Enterprise Console and Relay, High Availability

Hi

We currently have a management server, with a relay server for external clients. However, we are building a second data centre and need to work out the best way of managing Sophos if the first site is taken offline.

It looks like the only solution is to virtualize the servers and replicate them to the second site. The questions are:

  1. Is this really the best way of managing this?
  2. Are we going to run into problems at the second site when we bring up the servers? As the IP addresses at the new site will be different, are we going to have issues with communication between the clients, the relay and the management server?

Thanks



This thread was automatically locked due to age.
  • I know this is a fairly old thread, but since I am currently building something very similar and information on this topic is very limited, I thought I might make a few (a bit of an understatement, sorry!) comments here to help anyone else currently thinking of this.

    As of SEC 5.4, SQL availability groups work with the SEC. I am running a multi-node, multi-subnet cluster with an AG without issue. The installer is a little painful because it mandates a computer name AND an instance name (technically you don't need an instance name with an AG if only one instance is in use), but if you specify both it works just fine. There is one exception, however: the SEC cannot easily be reconfigured to support multi-subnet failover in the database connector, so the DNS entry must be configured in WSFC to register only the active IP address. If you do this, your failover time is bounded by your DNS TTL, i.e. by how long it takes for the updated entry to be picked up. Having a synchronous copy of the three Sophos databases (SOPHOS54, SophosSecurity, SophosPatch52) then makes failover at the database layer possible.
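To make the "only register the active IP address" requirement concrete, here is a minimal Python sketch that checks how many IPv4 addresses a listener name currently resolves to. The listener name `sec-ag-listener.example.local` and both function names are made up for illustration; nothing here is a Sophos or WSFC API, it only uses the local resolver. If more than one address comes back, the listener is probably still registering all of its IPs, which is the situation described above that the SEC database connector does not handle well.

```python
import socket
from typing import Callable, Set


def resolve_ipv4(name: str) -> Set[str]:
    """Return the distinct IPv4 addresses the local resolver gives for name."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is (address, port).
    return {info[4][0] for info in infos}


def listener_registers_single_ip(
    name: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """True if the name resolves to exactly one address (the active node)."""
    return len(resolver(name)) == 1


if __name__ == "__main__":
    # Hypothetical AG listener name; replace with your own.
    listener = "sec-ag-listener.example.local"
    try:
        ok = listener_registers_single_ip(listener)
        print(listener, "single active IP" if ok else "multiple IPs registered")
    except socket.gaierror as exc:
        print("could not resolve", listener, "-", exc)
```

Running this before and after a test failover, from the SEC server itself, gives a rough idea of how long the old address lingers in DNS.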

    The SEC itself is painful to fail over, however, and you must NEVER run two management servers connected to the same message router environment at the same time (multiple certificate paths, multiple EM paths, etc. It creates a world of pain, which I have experienced thanks to "proactive" server engineers thinking that turning on disabled Sophos services was the right thing to do). The active management server needs the "ParentAddress" registry key made blank (it then assumes the EM role for the message router), whilst the DR server should be pointed at the active management server via the ParentAddress key if you intend to monitor the local agent and SUM on that server. Of course, the offline server needs the certificate manager, management server, management host and patch services all disabled. You will also need to edit table_router on the offline server to remove the "EM" and "CM" items from the top of the file, so that the server does not try to process directed messages itself (basically turning the offline EM into a message relay). On failover, the management components on the SEC server taking over then need to be enabled, and the EM and CM components will automatically register in the local message router.
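The active/passive rules above can be written down as a small consistency check. This is only a sketch of the logic in Python, not a tool that reads real servers: the `SecServer` class, the role labels and the server names are all invented for illustration, and you would have to populate them yourself from the service states and the ParentAddress registry value on each box.

```python
from dataclasses import dataclass
from typing import List

# Role labels taken from the post; these are NOT the Windows service names.
MANAGEMENT_ROLES = frozenset(
    {"certificate manager", "management server", "management host", "patch"}
)


@dataclass
class SecServer:
    name: str
    parent_address: str          # "" means this router assumes the EM role
    enabled_roles: frozenset = frozenset()


def is_active(server: SecServer) -> bool:
    """A server counts as active if any management role is enabled on it."""
    return bool(server.enabled_roles & MANAGEMENT_ROLES)


def check_pair(a: SecServer, b: SecServer) -> List[str]:
    """Return a list of rule violations for an active/passive SEC pair."""
    problems = []
    if is_active(a) and is_active(b):
        problems.append("both servers have management roles enabled")
    for s in (a, b):
        if is_active(s) and s.parent_address != "":
            problems.append(f"{s.name}: active server needs a blank ParentAddress")
        if not is_active(s) and s.parent_address == "":
            problems.append(f"{s.name}: passive server would assume the EM role")
    return problems


if __name__ == "__main__":
    primary = SecServer("sec-a", "", MANAGEMENT_ROLES)
    dr = SecServer("sec-b", "sec-a", frozenset())
    print(check_pair(primary, dr) or "pair looks consistent")
```

The point of modelling it this way is that the dangerous state (two servers with management roles enabled at once) becomes an explicit check rather than something you discover afterwards.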

    All of the above, however, assumes the DR SEC has been built correctly and has the correct certificates and encryption keys in place. If you get this wrong, you will find issues with one part or another. For example, the wrong CertAuthStore will result in endpoints being unable to communicate with the SEC. If you forget the managed application keys, you will trigger reinstalls of downstream relays on SUM servers and also have issues decoding passwords for update policies (this only affects new policy pushes to an endpoint; already deployed policies are unaffected, and re-saving the password in the policy will also correct it). There are some oddities if you aren't careful, so it is best to use the backuprestore scripts from Sophos to make sure the right registry keys are taken. Finally, you need to ensure the correct IOR strings are being passed back and forth, and that the "ParentAddress" entry at each level can be resolved to the upstream server (via DNS, hosts file or IP). There is no asymmetrical capability here: if you use DNS, the address must resolve on the server that hands out the DNS IOR (e.g. the message relay must be able to resolve the address it passes downstream), and the downstream relay must also be able to resolve that same address to the upstream server. The IPs can differ (as will occur in a DNS situation), but the name must match in both configurations.

    You can also add extra relays to the mrinit.conf (separated by commas) to allow failover between multiple relays for reliability (the endpoint will connect to the first one it can successfully reach). Depending on your environment, you can also consider whether you want the SUM to determine the relay. Let me explain: if you have an mrinit.conf placed in the "rms" subfolder of a CID that differs from the version or configuration installed on the endpoint, the update will trigger a reinstall of RMS (and thus restart the agent and re-establish communications upstream). DNS entries are resolved on each update attempt by AutoUpdate or SUM, whereas RMS only resolves DNS on service start. HA via a changed DNS entry is therefore of little use with RMS on its own (I typically use IP addresses in such a case, as the next SAU update can potentially trigger a failover).
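The "first relay it can successfully connect to" behaviour can be approximated with a short Python sketch, which is handy for testing a candidate relay list from an endpoint's network before you push it out. This is not how RMS itself is implemented; the function names are invented, the addresses are placeholders, and port 8192 is my assumption for the message router port in a default install.

```python
import socket
from typing import Callable, Optional


def can_connect(host: str, port: int = 8192, timeout: float = 3.0) -> bool:
    """Try a plain TCP connection; port 8192 is an assumed default."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    parent_list: str,
    probe: Callable[[str], bool] = can_connect,
) -> Optional[str]:
    """Walk a comma-separated relay list in order, return the first that answers."""
    for candidate in (part.strip() for part in parent_list.split(",")):
        if candidate and probe(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Placeholder addresses; use the same value you intend to put in mrinit.conf.
    relays = "10.0.1.20, 10.0.2.20"
    print("would connect to:", first_reachable(relays))
```

Because order matters, put the relay you want used in normal operation first and the DR relay after it.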

    In any case, this is not the easiest thing to achieve, and I really wish Sophos had an article or section of documentation on how to do all this more officially. But it is doable, and I currently use this in production environments with great success (full cutover of 20,000+ endpoints in 30 minutes, with 5 minutes of outage). Is it HA? No, but components such as the database can be close (or truly HA, if you have a subnet spanned across sites). RMS is partially HA through multiple entries, SUM/SAU can be configured for HA (via DNS, resolved on each update), and SEC is Active/Passive only (service, configuration and file changes are needed).

    The real question comes down to what the business agrees for RPO and RTO. You shouldn't lose data during an outage, because RMS does store-and-forward by default, and provided your database and backup routines are effective and follow best practice, I have yet to see a case where HA is truly necessary. An outage of the SEC doesn't stop SUM, nor does it result in the entire environment suddenly losing protection. Events are only delayed for the duration of the outage, so nothing gets lost.
