
Weird UTM freezes randomly approximately once a day ...

I have experienced a strange lockup on my "new" UTM box, but I checked log files and they don't reveal anything, just a bunch of weird characters ...

2023:03:16-01:32:01 escape75 /usr/sbin/cron[25494]: (root) CMD (  nice -n19 /usr/local/bin/gen_inline_reporting_data.plx)
2023:03:16-01:35:01 escape75 /usr/sbin/cron[25649]: (root) CMD (   /usr/local/bin/reporter/system-reporter.pl)
�����������������������������������������������������������������������������������������������������������
2023:03:16-09:03:10 escape75 syslog-ng[4942]: syslog-ng starting up; version='3.4.7'
2023:03:16-09:03:12 escape75 ddclient[5361]: WARNING: cannot connect to checkip.dyndns.org:80 socket: IO::Socket::INET: Bad hostname 'checkip.dyndns.org'
2023:03:16-09:03:24 escape75 system: System was restarted
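For anyone wanting to pin down when the box actually stopped logging, a minimal Python sketch like the one below scans for large gaps between consecutive timestamps. It assumes the `YYYY:MM:DD-HH:MM:SS` prefix seen in the excerpt above; the function names and the 30-minute threshold are arbitrary choices for illustration, and lines without a parseable timestamp (such as the run of garbage characters) are simply skipped.

```python
from datetime import datetime, timedelta

# Timestamp prefix used by the UTM log lines quoted above.
TS_FORMAT = "%Y:%m:%d-%H:%M:%S"
TS_LENGTH = 19  # len("2023:03:16-01:32:01")

# Lines copied from the excerpt; the second-to-third entry is a stand-in
# for the run of unreadable characters in the real log.
LOG_LINES = [
    "2023:03:16-01:32:01 escape75 /usr/sbin/cron[25494]: (root) CMD (  nice -n19 /usr/local/bin/gen_inline_reporting_data.plx)",
    "2023:03:16-01:35:01 escape75 /usr/sbin/cron[25649]: (root) CMD (   /usr/local/bin/reporter/system-reporter.pl)",
    "\ufffd" * 40,
    "2023:03:16-09:03:10 escape75 syslog-ng[4942]: syslog-ng starting up; version='3.4.7'",
    "2023:03:16-09:03:12 escape75 ddclient[5361]: WARNING: cannot connect to checkip.dyndns.org:80",
    "2023:03:16-09:03:24 escape75 system: System was restarted",
]


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous, current, delta) for every gap longer than threshold."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # garbage or continuation line, ignore it
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


for start, end, delta in find_gaps(LOG_LINES):
    print(f"no log entries between {start} and {end} ({delta})")
```

On the excerpt above this reports a single gap of 7:28:09 between the last cron entry and the syslog-ng startup, which at least narrows down when the freeze happened.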



So, I've been running the software version of UTM (9.714) on my old unit (an XG115 r2) for a couple of years without any issues,
and recently I migrated my saved config over to a new unit (XG115 r3). A few hours after setting up the new unit (at night)
it froze up, and the LAN interfaces were not pingable, so I powered it down and rebooted. It's working again ...

Just wondering if there's something more I can look at to see what the issue was .. I have a hunch maybe it was DHCP related,
as my devices on the LAN were renewing the IP addresses and they were not in the table on the new unit, but it's a wild guess,
so if this doesn't happen again then maybe it's nothing to worry about.

I don't know if there would be an issue moving the config file (and license) from the old unit, but I wouldn't think so.

The new unit was installed the same way as the old unit, using the ssi-9.714-4.1.iso file and removing /etc/asg to use a software license.
The old unit hasn't experienced any weird issues in years, and the ethernet ports and devices are set up in an identical way; nothing changed.

Just looking for thoughts and ideas ...

Stats from top:

top - 11:32:20 up 2:31, 1 user, load average: 0.09, 0.29, 0.25
Tasks: 163 total, 1 running, 160 sleeping, 0 stopped, 2 zombie
Cpu(s): 0.6%us, 0.5%sy, 0.0%ni, 98.5%id, 0.1%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 3898468k total, 3558768k used, 339700k free, 111124k buffers
Swap: 4194300k total, 112k used, 4194188k free, 1352808k cached
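Side note on the memory line: in this older `top` layout the `used` figure includes buffers and page cache, so it overstates what processes actually hold. A rough sketch of the arithmetic in Python, with the two lines copied from the output above (the helper name `fields` is just for illustration):

```python
import re

# The two summary lines from the top output above.
TOP_MEM = "Mem: 3898468k total, 3558768k used, 339700k free, 111124k buffers"
TOP_SWAP = "Swap: 4194300k total, 112k used, 4194188k free, 1352808k cached"


def fields(line):
    """Map each label (total, used, ...) to its value in kilobytes."""
    return {name: int(value) for value, name in re.findall(r"(\d+)k (\w+)", line)}


mem = fields(TOP_MEM)
swap = fields(TOP_SWAP)

# "cached" is printed on the Swap line but refers to the RAM page cache.
used_by_processes_kb = mem["used"] - mem["buffers"] - swap["cached"]
free_or_reclaimable_kb = mem["free"] + mem["buffers"] + swap["cached"]

print(f"used by processes:   {used_by_processes_kb} k")
print(f"free or reclaimable: {free_or_reclaimable_kb} k")
```

That works out to roughly 2.0 GB held by processes and about 1.7 GB free or reclaimable, so memory pressure looks unlikely, at least at the time of this snapshot.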

Zombies:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18256 0.0 0.0 0 0 ? Z 11:30 0:00 [aua.bin] <defunct>
root 18595 0.6 0.0 0 0 ? Z 11:32 0:00 [confd.plx] <defunct>



[edited by: Escape75 at 4:26 AM (GMT -7) on 20 Mar 2023]
  • Just another update, I was able to find some weird entries for the last 3 hours before the crash in syslog ...

    192.168.5.1 Mar 22 12:46:18 daemon warning 2023:03:22-19:27:59 escape75 URID[7870] T=7870 ------ 2 - sxl2_internal_get_time: The clock was set back from 1679510202 to 1679509679\n

    192.168.5.1 Mar 22 13:22:48 daemon warning 2023:03:22-19:27:49 escape75 URID[7870] T=7870 ------ 2 - sxl2_internal_get_time: The clock was set back from 1679510204 to 1679509669\n

    That doesn't look very good ... going by the two epoch values in each message, that's a jump back of roughly nine minutes (523 and 535 seconds)!

    This in turn causes DHCP to do this:

    192.168.5.1 Mar 22 13:04:32 daemon debug 2023:03:22-19:27:52 escape75 dhcpd reuse_lease: lease age -337 (secs) under 25% threshold, reply with unaltered, existing lease for 192.168.5.5
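To put a number on those warnings, here is a small Python sketch that pulls the two epoch values out of each `sxl2_internal_get_time` message and subtracts them. It takes the values in the messages at face value; the regular expression and function name are my own, not anything from the UTM tooling.

```python
import re

# Message text copied from the two syslog warnings above.
MESSAGES = [
    "sxl2_internal_get_time: The clock was set back from 1679510202 to 1679509679",
    "sxl2_internal_get_time: The clock was set back from 1679510204 to 1679509669",
]

PATTERN = re.compile(r"set back from (\d+) to (\d+)")


def setback_seconds(message):
    """Return how many seconds the clock moved backwards, or None if no match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    before, after = (int(value) for value in match.groups())
    return before - after


jumps = [setback_seconds(message) for message in MESSAGES]
for seconds in jumps:
    print(f"clock went back {seconds} s ({seconds / 60:.1f} min)")
```

Both jumps come out at just under nine minutes, which is the same order of magnitude as the negative lease age that dhcpd complains about.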

  • One more update: the unit also freezes on the SFOS hardware version, with no access possible via the ports or serial.

    I have contacted the seller and got a refund. I ended up paying some duties, he paid a little more for shipping,
    but in the end it's a wash and I'm happy that we were able to resolve this situation as best as we could.

    I will maybe play with it some more. It seems like it's starting to lose track of time (an RTC issue),
    and I'm wondering if this could be a CMOS battery issue, even though it's not losing BIOS settings ...

    It's certainly strange, as it was pulled from a working environment, and now that I have excluded the SSD,
    as well as memory issues, there's not much that remains ...

    Thanks for everyone's help!

  • Doubt it's the battery. The battery only comes into play when the device is off; when it's on, the power supply powers everything. I do wonder, however, whether the power supply is somehow flaky and causing the issues. That would explain all the instability.

  • Yes I would also think that’s the case …

    I don’t think it’s the power supply because I am using the same one I’m also using for my XG115 r2.

    It’s the original 12v 3a power supply, and I tried both plugs in the r3, I believe they are for failover.

    Unless somehow the r3 requires more power than what’s labeled on the unit …

  • Well, I opened it up and cleaned the board with alcohol, then coated it with WD-40; now it's nice and shiny.
    Also replaced the CMOS battery which was reading 3.0v, so theoretically still good but not 3.3v.

    We'll see what happens ... if it still freezes up then I guess it's a hardware fault of some sort.

  • I assume the chassis is the heatsink?

    What's the WD-40 for?

  • Yes, the underside of the chassis has a big heatsink mounted to it, and the CPU connects to it with a gummy thermal pad that can be re-installed after being taken off.

    I imagine the heatsink is big because it's meant to be usable on other models that have other chips needing cooling, as it's almost the size of the motherboard.

    WD-40 is meant as a corrosion inhibitor, but in my case I used it to clean the board after the alcohol; apparently that's safe to do since it's non-conductive.

    The unit was never opened before, or if it was, it was opened by Sophos, as the security sticker was left intact. However, I noticed several places on the underside of the board that had weird residue, possibly thermal paste of some sort, so that's why I gave it a good cleaning.

    I guess I'll find out. If it stays up over the next few days I will be happy; it's worth trying :)

  • Just a quick update, it's been almost 24 hours and still up, too early to tell if it's good.

    I opened up my old XG115 r2 and tested the battery, which is older, and to my surprise it was 3.2v ...

    It's also interesting that the old unit has a 2450 vs a 2032, with nearly double the capacity!

  • Unfortunately it went down again, lasted about 24 hours ...

    Oh well, I guess it's a hardware fault after all, so there's nothing I can do :)

  • One more update ...

    I tried with another power supply (higher amperage) but the issue remains.

    However, I have been testing Windows 10 on the unit and also ran Intel CPU tests, which it passed!
    I am also now running pfSense on the unit, and hopefully I will get some more answers as to what could be happening.

  • Feels good to have gone through this exercise? :)

    Run some prime95 torture tests on it for a few hours. I'd be surprised if it doesn't lock up sooner rather than later.

    Has *sense been stable on it? I think their igb drivers are pickier than Linux's.