This discussion has been locked.

Weekly Complete Outage

Hoping someone has experienced this and found a fix. We have a single customer with a recurring problem that follows no apparent pattern. All traffic stops passing through the UTM, the web interface becomes inaccessible, and their VoIP phones and everything else go down. Their ISP's router shows as up and remains remotely accessible. A physical reboot of the Sophos unit resolves the issue.

We've:

1. RMA'd the UTM
2. Updated to latest firmware

Sophos can't seem to figure out the issue. They think the new unit may have a memory problem, but they aren't sure and want us to go onsite to run a memtest. That seems doubtful, since the unit was just RMA'd and the replacement shows the same issue. Has anyone seen this before?



  • Similar story.  Might help.

    Slightly different circumstances, but we had a similar issue with a new install for a client - also doing Internet and VoIP.  Failures occurred anywhere from 5 to 45 days apart, at random times: weekdays and weekends, daytime and nighttime.  We found a "symptom" in the Sophos logs pretty quickly.  We could see the upstream DNS fail and the UTM would drop offline, no matter the DNS source, and soon enough the WAN side wouldn't work.  Similar to your situation, a restart of the WAN side, or a reboot of the UTM, would bring it back up immediately.  We thought about a script to restart automatically, but instead taught a couple of people how to fix it quickly.

    This install is in a 20-story high-rise building.  We had the ISP swap out their in-suite router.  In our case, we mostly build our own UTM boxes.  We went through 3 of them, including (intentionally) different hardware the third time.  Nothing helped.  Sophos couldn't find a problem either.  We did lots of logging and had many discussions with Sophos and the ISP.

    We became convinced that it was something on the ISP side - perhaps other electronics in the building?  With so many subscribers, our client was unlikely to be the only one suffering this problem.  And then, one day, we checked the logs and noticed that the client hadn't had the problem in quite some time.  After many months, it had magically gone away.  Part of the issue with this large, national ISP is that we were never sure whether the higher-level changes we asked for (we wanted all new circuits, etc.) were ever made.  So, to this day we're not exactly sure what was done to fix it, but obviously something was, and it wasn't on our side.

    Ultimately we re-installed the first UTM we built for this client.  It's been working fine ever since with no problems.

    I'd keep after your ISP.
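
The reply above mentions considering, but never writing, a script to restart the WAN side automatically. A minimal sketch of what such a watchdog might look like follows, assuming the DNS failure seen in the logs is a usable trigger. Everything here is illustrative: `dns_resolves`, `watch`, the hostname `example.com`, the failure threshold, and the check interval are all assumptions, and the `restart` callback is a hypothetical placeholder because the thread does not say how the WAN side was restarted.

```python
import socket
import time
from typing import Callable, Optional


def dns_resolves(hostname: str = "example.com") -> bool:
    """Return True if the system resolver can resolve `hostname`.

    The hostname is an arbitrary choice for illustration.
    """
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def watch(
    check: Callable[[], bool],
    restart: Callable[[], None],
    max_failures: int = 3,
    interval_s: float = 60.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `restart` after `max_failures` consecutive failed checks.

    `restart` is a hypothetical hook: whatever restarts the WAN side
    (or reboots the unit) in a given environment goes there.
    Runs forever unless `max_checks` is set. Returns the restart count.
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if check():
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0  # give the link time to recover
        sleep(interval_s)
    return restarts


# Example wiring (not executed here):
#   watch(dns_resolves, restart=my_restart_function)
```

Requiring several consecutive failures before acting avoids restarting on a single dropped lookup. Logging a timestamp inside the restart hook would also give a record of outages to show the ISP.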