
Enterprise Console and Relay, High Availability

Hi

We currently have a management server with a relay server for external clients. However, we are building a second data centre and need to work out the best way of managing Sophos if the first site is taken offline.

It looks like the only solution is to virtualize the servers and replicate them to the second site. The questions are:-

  1. Is this really the best way of managing this?
  2. Are we going to run into problems at the second site when we bring up the servers? As the IP addresses at the new site will be different, will we have issues with communication between the clients, relay and management server?

Thanks

:50116


This thread was automatically locked due to age.
  • Hello CRIPete,

    I'm not aware of a (generally available) failover feature in the management server.

    Just some thoughts - I see several (might not be all) aspects which have to be considered:

    1. Where do the endpoints get their updates from (directly or indirectly)? Is a Secondary location defined, one managed by a SUM which updates directly from Sophos?
    2. Database - how is it kept up to date (policies, groups and group-membership at least)? You can't simply use a copy on the backup server.
    3. Network, naming, RMS considerations 

    As for 3., you can set up a kind of alternate server if it uses the same certificates. If you configure RMS to use just an alias FQDN, clients (and relays) should be able to reconnect to the backup.

    What is the scenario you have in mind? A complete disaster at the primary site? Or just a temporary outage? IMO the most important part is updating. While you sooner or later need to re-establish management it's not high-priority. Then there's the question of switching back.

    Christian

    :50132
  • Hi Christian

    In an ideal world we would want the same level of availability with Sophos that we have with the rest of our services, which is practically instant, depending on replication. However, I have spoken to support and this doesn't look to be feasible.

    In terms of the updates and moving the database and servers across, we are fine. It seems to be the RMS considerations where we come unstuck.

    :50138
  • Looking at the management, it looks like we could just remove the IP address from the mrinit.conf file. This should give us the opportunity to change the IP address of the server without affecting client access. I assume the clients will revert to the FQDN instead.

    Has anyone tried this?

    :50142
  • Hello CRIPete,

    with the last server migration we started using just alias FQDNs in mrinit.conf to be more flexible in the future and didn't see any issues. Please note that there is an MRParentAddress and a ParentRouterAddress in mrinit.conf. These are used by RMS to determine its router type by comparing the specified addresses with the respective local values. If there is a match in MRParentAddress it's server (i.e. the management server), else if there is a match in ParentRouterAddress it's message relay, otherwise (there is no match) it's Endpoint. Thus you have to make sure that server and relays correctly recognize one of the "addresses" as their own.
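
The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the decision order from the post, not how RMS is actually implemented; the function name `router_type`, the example host names, and the assumption that each setting holds a comma-separated list of addresses compared case-insensitively are all hypothetical.

```python
def router_type(mr_parent_address, parent_router_address, local_addresses):
    """Classify a machine following the order described in the post (illustrative only)."""
    local = {addr.strip().lower() for addr in local_addresses}

    def matches(value):
        # Assumption: a setting may list several addresses separated by commas.
        parts = (part.strip().lower() for part in value.split(","))
        return any(part in local for part in parts if part)

    if matches(mr_parent_address):
        return "server"    # a local address appears in MRParentAddress
    if matches(parent_router_address):
        return "relay"     # a local address appears in ParentRouterAddress
    return "endpoint"      # no match at all


# Hypothetical names: the machine answers to the alias used in MRParentAddress.
local = ["sec-alias.example.org", "10.0.0.5"]
print(router_type("sec-alias.example.org", "relay-alias.example.org", local))
```

The point of the sketch is the failure mode: if a server or relay does not recognize the alias as one of its own addresses, it falls through to the endpoint case.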

    Christian

    :50168
  • Hi Christian

    Thanks for getting back. It sounds like we can achieve what we need to do. Have you been in a situation yet where you have changed the IP and it is still working?

    Thanks

    Pete

    :50280
  • Hello Pete,

    well, you have to take my word for it - about five years ago we (not really me, I was on paternity leave) "lost" (don't ask me for details) the management server. Admins tried to restore it but in the end it didn't work out. I did have a backup of the really important stuff (viz. the certificate keys) and built a new server with a different name and IP, as they were still trying to get the old one running. We have no administrative rights over most endpoints (this is a university) so a reprotect (or even a reinit) is impossible. So we just made the old FQDN an alias for the new server and endpoints started to appear.

    As an aside, we kept the alias in place and even four years later two "new" endpoints with a policy pointing to the old server appeared - dunno where they had spent the years in between :smileyvery-happy:

    Christian

    :50334
  • I know this is a fairly old thread, but since I am currently building something very similar and information on this topic is very limited, I thought I might make a few (a bit of an understatement, sorry!) comments here to help anyone else currently thinking of this.

    As of SEC 5.4, SQL availability groups work with the SEC. I am running a multi-node, multi-subnet cluster with an AG, and this is no issue. The installer is a little painful as it mandates a computer name AND instance name (technically you don't need an instance name with an AG if you have only 1 instance in use), but if you specify this it will work just fine. There is one exception, however: the SEC cannot easily be reconfigured to support multi-subnet failover in the database connector, so the DNS entry must be configured in WSFC to register only the active IP address. If you do this, your failover time is as low as your DNS TTL allows before the entry is updated. Having a synchronous copy of the three Sophos databases (SOPHOS54, SophosSecurity, SophosPatch52) means failover at the database layer is then possible.

    The SEC is painful to fail over, however, and you must NEVER run two management servers in the same environment connected to the same message router environment at the same time (multiple certificate paths, multiple EM paths, etc. - it creates a world of pain I have experienced thanks to "proactive" server engineers thinking that turning on disabled Sophos services was the right thing to do). The active management server also needs the registry key for "ParentAddress" made blank (it then assumes the EM role for the message router), whilst the DR server should be directed to the active management server via the ParentAddress key if you intend to monitor the local agent and SUM on that server. Of course, the offline server needs the certificate manager, management server, management host and patch services all disabled. You will also need to edit table_router on the offline server to remove the "EM" and "CM" items from the top of the file, so as to prevent that server from trying to process directed messages itself (basically turning the offline EM into a message relay). After a failover, the management components on the migrated SEC server will then need to be enabled, and the EM and CM components will automatically register in the local message router.

    All the above, however, assumes the DR SEC has been built correctly and has the correct certificate and encryption keys in place. If you have done this wrong, you will find issues with one part or another. For example, the wrong CertAuthStore will result in endpoints being unable to communicate with the SEC. If you forget the managed application keys, you will trigger reinstalls of downstream relays on SUM servers and also have issues decoding passwords for update policies (this only affects new policy pushes to an endpoint; existing, already deployed policies are unaffected, and re-saving the password in the policy will also correct this). There are some oddities if you aren't careful, so it is best to use the backuprestore scripts from Sophos to make sure the right registry keys are taken. Finally, you need to ensure the correct IOR strings are being passed back and forth and that the "ParentAddress" entry at each level can be resolved to the upstream server (via DNS, hosts file or IP). There is no asymmetrical capability here: if using DNS, the address must resolve on the server handing out the IOR (e.g. the message relay must be able to resolve the address it passes downstream), and the downstream relay must be able to resolve that same address to the upstream server. The IPs can differ (as will occur in a DNS situation), but the name must match in both configurations.
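
The "must resolve on both sides" requirement can be checked with a small script run on each machine in the chain. The following is a minimal sketch using only Python's standard `socket` module; the function names, the injectable `resolver` parameter and the example host names are hypothetical, and it says nothing about certificates or IOR contents, only about name resolution.

```python
import socket


def resolve(name):
    """Return the set of IP addresses a name resolves to on this machine."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}


def check_parent_names(names, resolver=resolve):
    """Map each configured parent name to True if it resolves locally."""
    return {name: bool(resolver(name)) for name in names}


# Demo with a fake resolver so the example does not depend on real DNS.
fake_dns = {"sec-alias.example.org": {"10.0.0.5"}}
result = check_parent_names(
    ["sec-alias.example.org", "old-relay.example.org"],
    resolver=lambda name: fake_dns.get(name, set()),
)
print(result)
```

Running the same check (with the default resolver) on the upstream server and on the downstream relay should report True for the same name on both; the addresses returned may differ between sites.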

    You can also add extra relays to the mrinit.conf (separated by commas) to allow failover between multiple relays for reliability (it will connect to the first one it can reach successfully). Depending on your environment, you can also consider whether you want the SUM to determine the relay. Let me explain: if you have an mrinit.conf placed in the "rms" subfolder in a CID that differs from the version or configuration installed on the endpoint, the update will trigger a reinstall of RMS (and thus restart the agent and re-establish communications upstream). DNS entries are resolved upon each update attempt by AutoUpdate or SUM, whereas RMS will only resolve DNS upon service start. HA therefore simply fails to be useful with RMS when only a DNS entry is updated (I typically use IP addresses in such a case, as the next SAU update can potentially trigger a failover).
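
The "first one it can connect to" behaviour for a comma-separated relay list can be pictured roughly as below. This is a sketch of the idea only, not RMS code: the function names, the injectable `probe` parameter, the example addresses and the port number 8194 are assumptions to be adjusted for your own environment.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(relay_list, port, probe=tcp_reachable):
    """Return the first entry of a comma-separated list that answers, or None."""
    for host in (entry.strip() for entry in relay_list.split(",")):
        if host and probe(host, port):
            return host
    return None


# Demo with a fake probe: pretend the first relay (primary site) is down.
up = {"192.0.2.20"}
chosen = first_reachable(
    "192.0.2.10, 192.0.2.20",
    8194,  # assumed router port; not taken from the thread
    probe=lambda host, port: host in up,
)
print(chosen)
```

Order matters in such a list: the preferred relay goes first, and later entries are only tried when earlier ones cannot be reached.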

    In any case, not the easiest thing to achieve and I really wish Sophos had an article or section of documentation on how to do all this more officially - but it is doable and I do use this in production environments currently with great success (full cutover of 20,000 endpoints+ in 30 minutes with 5 minutes outage). Is it HA? No, but components such as database can be close (or truly HA, if you have a spanned subnet cross-site). RMS is partial HA through multi-entries, SUM/SAU can be configured HA (via DNS, resolved on each update), SEC is Active/Passive only (service/configuration/file changes needed).

    The real question comes down to what the business agrees for RPO and RTO. You shouldn't lose data in the case of an outage with RMS, thanks to the default store-and-forward, and provided your database and backup routines are effective and done per best practice, I have yet to see a case where HA is truly necessary. An outage of SEC doesn't stop SUM, nor does it result in the entire environment suddenly losing protection. Events will only be delayed for the duration of the outage, so nothing gets lost.