This discussion has been locked.

RE: SUM fail to connect on 8194

Hi QC,

 

I'd like to continue this discussion. First things first: the gateway for our Sophos server is currently the old Cisco firewall, and we want to migrate it to the new Palo Alto firewall.

When we tried to migrate to the new firewall, all of our thousands of clients could connect to port 80 or 8192 just fine, but could not connect to port 8194.

As we dug deeper, we found that the RMS server couldn't even telnet to localhost 8194, even though that is a local connection to the server itself.

In the Wireshark pcap, we saw that the RMS server answered every TCP SYN from the clients with a TCP RST, ACK.

We opened a case with Palo Alto support and they said there were no drops at the firewall at all. We even removed all inspection of the traffic to the Sophos server on the firewall.

But if we revert the migration and go back to the old firewall, port 8194 comes back online and everything returns to normal.

I attached some of the logs from the Sophos server in the normal condition and in the migration (failed) condition.

Please tell me if you need another log.

 

Thank you.

 

sophos log June 12th.zip



  • Hello Ridho Rizki Antoro,

    This issue is actually more than three months old. What did you do in the meantime, keep using the old firewall?

    To make sure I understand you correctly:

    • 10.126.47.5 is your management server
    • "all of thousands of clients could connect to port 80 or 8192 just fine": from the logs I gather that most of your endpoints use a message relay (and I assume the relays serve as additional update managers as well), so the endpoints (apart from the relays themselves, of course) connect to the relays rather than to the management server. Both the "before" and "after" logs also show traffic to and from the management server. So which problems did you actually observe?
    • "wireshark pcap": where did you run Wireshark? On the endpoints, or on the relays as well?

    As said, there's traffic on port 8194 in the logs and it seems to be normal. You did include a netstat -an for the "normal" situation; it shows slightly fewer than 1300 connections to 8194 from several subnets (you don't have a corresponding one for "migrasi", do you?). I guess you have many more than 1300 endpoints.

    "the RMS server"
    Any endpoint with RMS installed is an RMS server; I think you mean the management server. And "couldn't" is rather vague: do you mean the screen went blank with no reaction, and after some time you got the command prompt back? Or was some error reported by telnet, and if so, which one?
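
To tell those cases apart without reading telnet's behaviour off the screen, a small probe can classify a single connection attempt. This is only a minimal sketch, assuming Python 3 is available on the machine you test from; the function name `probe_tcp` and the 3-second timeout are illustrative choices, not anything Sophos provides.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Returns "open" if the handshake completes, "refused" if the SYN is
    answered with a RST, "timeout" if nothing comes back in time, or
    "error: ..." for any other socket error.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
    finally:
        sock.close()
```

Calling `probe_tcp("127.0.0.1", 8194)` and `probe_tcp("127.0.0.1", 8192)` on the management server would then show whether 8194 actively refuses the connection or simply does not answer, which is the detail "couldn't telnet" leaves open.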

    Thus I have no idea what the (failed) condition could be, could you please give some more details?

    Christian 

  • Hi Christian,

     

    Thank you for your time, and for answering my question so quickly.

    Sorry if I mixed up the Sophos terminology. I'm actually not a Sophos engineer; I'm a network engineer in charge of the migration process for this server.

    Yes, this issue has been happening since March. We have already tried to migrate more than 10 times using various methods, and every attempt failed.

    Every time we tried to migrate the servers and failed, we fell back to the old firewall.

     

    I talked with the Sophos user and he confirmed a few things that will hopefully give you a better understanding of our server setup.

    The Sophos installation in our data center is split across 5 servers. The management server is indeed 10.126.47.5; all SUM and push-policy traffic via port 8194 is handled by this server. As for the other servers, we have the database on 10.126.47.11 and disk encryption on 10.126.47.9, and I believe 10.126.47.10 and 10.126.47.8 act as backups for the management server.

    I also got confirmation that we have a lot of message relays deployed at the branch sites, so your assumption is correct.

     

    This is what I meant when I said we cannot telnet to 8194 while other ports are fine:

    But I believe our endpoints cannot telnet to 8194 on the Sophos server because the Sophos server itself cannot telnet to localhost 8194.

    This is our main problem: we still cannot find out why it happens after the migration but returns to normal after we fall back.

    We also tried bypassing the new firewall and creating a new gateway on the switch, but the problem persisted, so it only works normally behind the old firewall.

    We also have a Sophos server at the DRC (just one physical server, unlike the Sophos DC setup with 5 servers) that has already been migrated to the new firewall and works fine for the push-policy feature. But then again, there isn't any traffic at the DRC.

     

    In the migration condition, I captured Wireshark traces from the endpoint, the firewall, and 47.5.

    In the normal condition, I only captured traces from the endpoint and 47.5.

    This is what it looks like in normal condition:

    This is what it looks like after migration:

     

    I uploaded some of the pcap files too, if you are interested:

    https://1drv.ms/f/s!Aoqg_BYRzLylgfIhEZCegja3FfBzMA

     

    Our focus now is to find out what is causing the Sophos server to always reset the TCP SYNs from the endpoints.

    I suspect the culprit is port exhaustion on port 8194, but I still cannot explain why it works fine behind the old firewall.

    I really hope you can give us some insight into what is happening on the server, so we know how to migrate it to the new firewall.

    If you need specific details that aren't provided here, I can arrange another migration attempt and collect that data to help the analysis.

     

    Thank you very much.

  • Hello Ridho Rizki Antoro,

    "port exhaustion"
    is a suspect. Not your classic port exhaustion where you run out of ephemeral ports, though.

    RMS doesn't monitor whether an inbound connection (endpoint.12345→server.8194) still exists. It doesn't apply a timeout and it doesn't do any unsolicited SENDs downstream. Thus if a connection is not closed in an orderly way by the endpoint, it stays idle in the ESTABLISHED state. Over time there are more and more of these idle connections, and at a certain point SYNs on port 8194 (but not on other ports) are rejected.
    When an endpoint can't send an upstream message on an existing connection, it drops the old one and tries to establish a new one. Depending on the location of the error (e.g. connection to the access point lost) and on the networking components and their configuration, the other side (the management server) might not get notified of the disruption. ... I'm the inquisitive type: while writing this I simply unplugged the network cable. After plugging it in again, RMS established a new connection, but the old one is still ESTABLISHED on the management server.

    I noticed this on my server some time ago. The number of connected endpoints gradually decreased, and I didn't see any new endpoints for days, which is quite unusual. netstat showed gazillions of connections though, sometimes ten or more from the same endpoint. If you killed some of them, new connections were established until eventually the saturation point was reached again. I decided to regularly restart the Sophos Message Router service to get rid of the excess and dead connections.

    Should be fairly easy to verify on 47.5 if such excess connections are causing the RSTs. You should also observe that everything seems to work at first and the RSTs start only after some time. Does no harm to restart the Message Router service then and watch it repeat.
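
The check for excess connections can be sketched roughly as follows. This is only a sketch, assuming Python 3 and that the output of netstat -an on the management server has been saved as text in the usual Windows four-column TCP layout (protocol, local address, foreign address, state); the function names are made up for illustration.

```python
from collections import Counter


def established_per_remote(netstat_text: str, port: int = 8194) -> Counter:
    """Count ESTABLISHED TCP connections to local `port`, keyed by remote IP."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expect exactly: TCP <local addr:port> <remote addr:port> <state>
        if len(fields) != 4 or fields[0] != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        if state != "ESTABLISHED":
            continue
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        counts[remote.rsplit(":", 1)[0]] += 1
    return counts


def duplicates(counts: Counter, threshold: int = 2) -> dict:
    """Remote addresses that hold at least `threshold` connections."""
    return {ip: n for ip, n in counts.items() if n >= threshold}
```

A large total from `sum(counts.values())` together with many entries in `duplicates(counts)` would point to stale connections piling up; a modest total with few duplicates would argue against it.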

    I can't answer why it is different with the Cisco and the Palo Alto. I'm neither a firewall nor a network expert, probably no expert at all.
    It might be that you would always have run into this problem sooner rather than later, but the Cisco monitors the connections and takes them down if they are idle for too long (notably it has to send a FIN to both sides, something firewalls rarely do, but the Cisco might do it).

    Christian

  • Hi Christian,

     

    Sorry, I only just read your reply; I was confused because it's a new thread now.

     

    So you're saying that after we migrate the server to Palo Alto, we need to restart every RMS at the branch sites to clear the old connections on the management server, right?

    We restarted the management server every time we migrated it to Palo Alto, but we never restarted all of the RMS instances at the branch sites. That's because I only became aware of RMS after talking to you.

     

    One question: why do you say "Cisco monitors the connections and takes them down if they are idle for too long (notably it has to send a FIN to both sides - something firewalls rarely do but the Cisco might do it)"? I didn't find any FIN from the firewall to either side in the Wireshark capture. If you don't mind, can you explain why you made this statement?

     

    Thanks,

    Ridho

  • Hello Ridho,

    Sorry, an automated mail should have informed you of the split.

    "why you made this statement"
    Just a guess, as it would explain the different behaviour. Forget it for the moment.

    Let's walk through it step by step (and please confirm, comment on each):

    • you change to Palo Alto, naturally restarting the Management Server (SEC)
    • no obvious problems at first, endpoints connect, status messages come in
    • after some time issues are observed, how and where? In the Console (e.g. new endpoints don't appear) or somehow on the endpoints?
    • at this point even a loopback to 8194 on (SEC) fails
    • again on SEC, checking connections to port 8194 reveals not only an excessive number but also more than one from the same remote address for many remote addresses
    • restarting the Message Router Service resolves this problem
    • after some time issues are seen again

    You didn't say what serious problem was seen that apparently led several times to the decision to revert to the Cisco.

    Christian

     

  • Hi Christian,

     

    Ahh, okay then. I thought you got that from the server logs I gave you before.

     

    Let's walk through it step by step (and please confirm, comment on each)

    • you change to Palo Alto naturally restarting the Management Server (SEC)
      • Yes, that's correct.
    • no obvious problems at first, endpoints connect, status message come in
      • After we migrated to Palo Alto (or even bypassed all the firewalls), the management server (47.5) could no longer telnet to localhost 8194 within a split second. So the problem occurs immediately.
    • after some time issues are observed, how and where? In the Console (e.g. new endpoints don't appear) or somehow on the endpoints?
      • I believe that in the monitoring status all the endpoints show OK, but none of them can get the latest policy from the management server. (I need to confirm with my user whether this is correct, though.)
    • at this point even a loopback to 8194 on (SEC) fails
      • Yes.
    • again on SEC, checking connections to port 8194 reveals not only an excessive number but also more than one from the same remote address for many remote addresses
      • Yes, but all of them were getting a TCP RST from the SEC, because at this point it was already in the port exhaustion condition (I guess).
    • restarting the Message Router Service resolves this problem
      • What is the Message Router Service here? We did restart the Management Server (SEC) again and again, but it never solved the problem in the slightest. The one thing that resolved the problem was falling back to the Cisco FWSM: all the 8194 connections were restored immediately.

     

    From what I understand right now, the main issue is that we never touched the RMS at each branch to clear up the sessions there. My best guess is that the RMS instances didn't really create new TCP connections and just spuriously retransmitted on their old TCP connections to the SEC after the migration. The SEC server, on the other hand, expects new TCP connections from outside, because it now has a new gateway. This, combined with the spurious TCP retransmissions from all the RMS instances, creates a port exhaustion condition on port 8194 of the SEC server.

    The point I still don't get is why port 8194 on the SEC comes back to normal and the services run normally once we fall back to the Cisco FWSM firewall. Is it because the FWSM still keeps the state table entries of the old connections, even though I already cleared all the connections there, or is there another reason? I am not sure.

     

    I attached netstat before & after migration if it helps:

    netstat_outsophos_migrasi_palo.txt
    netstat_sophos_fallback_cisco_fwsm.txt

     

    Ridho

  • Hello Ridho,

    First of all, the service responsible for the communication is the Sophos Message Router (RouterNT.exe). You don't have to restart the whole server if you want to reset the port 8194 connections.
    When this service is stopped it sends a FIN on its connections, so there shouldn't be any old TCP connections to SEC. Furthermore, when the process is terminated, any sockets created by it are destroyed. While an attempt by a client to use such a connection likely results in a RST, this is neither caused by port exhaustion nor does it contribute to it.

    To the netstat output:
    Both look pretty normal; the list of LISTENING ports (there are more than 400 ephemeral ones starting with 49258) is the same. There are fewer than 300 ESTABLISHED connections to port 8194 with the Cisco but only 90 with the Palo Alto, so there is no indication of port exhaustion. Both logs show 0.0.0.0:8194 LISTENING. It seems the 300 or so connections are the normal state, but for some reason behind the Palo Alto something gets more or less stuck (the RST should be the result of the Router rejecting instead of accepting the connection after getting notified of the SYN, but it seems that it doesn't get out of the state which causes it to do so).
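
For anyone who wants to repeat this comparison on their own captures, the counting can be sketched as below. It is a rough sketch, assuming Python 3 and two saved netstat -an outputs in the usual Windows four-column TCP layout; `summarize_port` and `compare` are illustrative names.

```python
from collections import Counter


def summarize_port(netstat_text: str, port: int = 8194) -> Counter:
    """Tally TCP states for sockets whose local port is `port`."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expect exactly: TCP <local addr:port> <remote addr:port> <state>
        if len(fields) != 4 or fields[0] != "TCP":
            continue
        if fields[1].rsplit(":", 1)[-1] == str(port):
            states[fields[3]] += 1
    return states


def compare(before_text: str, after_text: str, port: int = 8194) -> dict:
    """Map each state seen in either capture to (count_before, count_after)."""
    before = summarize_port(before_text, port)
    after = summarize_port(after_text, port)
    return {s: (before[s], after[s]) for s in sorted(set(before) | set(after))}
```

Feeding it the text of the two attached files would give one line per state, for example the ESTABLISHED counts with the Cisco and with the Palo Alto side by side, and would show at a glance whether LISTENING is present in both.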
    As said, it'd be interesting to see what happens if at this point the Sophos Message Router service (and just this service) is restarted. There are (if I counted correctly) 90 connections to 8194, so it doesn't fail from the start. Maybe monitoring the traffic on 8194 from when the Router is (re-)started until it no longer accepts connections would give some insight.
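
Alongside a packet capture, a simple poller could record how long after the restart the port keeps accepting connections. This is only a sketch under assumptions: Python 3 on the management server, a 5-second polling interval chosen arbitrarily, and illustrative function names. Note that every successful probe itself opens and closes one connection to the port.

```python
import socket
import time
from typing import Callable, Optional


def accepts(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP handshake to host:port completes within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def seconds_until_rejecting(
    probe: Callable[[], bool],
    interval: float = 5.0,
    max_probes: int = 720,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it first fails.

    Returns the elapsed seconds at the first failed probe, or None if all
    `max_probes` attempts succeeded.
    """
    start = clock()
    for _ in range(max_probes):
        if not probe():
            return clock() - start
        sleep(interval)
    return None
```

Started right after the service restart as `seconds_until_rejecting(lambda: accepts("127.0.0.1", 8194))`, it would return roughly how many seconds passed before the first rejected connection, which can then be lined up with the capture.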

    Christian     

  • Hi Christian,

     

    I will propose this to my user and schedule the next attempt based on our discussion here.

    After that, I will definitely let you know how things go from there.

     

    Thank you.

     

    Ridho

  • Hello,

    In our enterprise we have exactly the same connection problem between Palo Alto and Sophos on port 8194.

    I'm wondering whether you managed to find a solution to this problem.

    Thank you

    Regards