Weird UTM freezes randomly approximately once a day ...

I have experienced a strange lockup on my "new" UTM box, but I checked log files and they don't reveal anything, just a bunch of weird characters ...

2023:03:16-01:32:01 escape75 /usr/sbin/cron[25494]: (root) CMD (  nice -n19 /usr/local/bin/gen_inline_reporting_data.plx)
2023:03:16-01:35:01 escape75 /usr/sbin/cron[25649]: (root) CMD (   /usr/local/bin/reporter/system-reporter.pl)
�����������������������������������������������������������������������������������������������������������
2023:03:16-09:03:10 escape75 syslog-ng[4942]: syslog-ng starting up; version='3.4.7'
2023:03:16-09:03:12 escape75 ddclient[5361]: WARNING: cannot connect to checkip.dyndns.org:80 socket: IO::Socket::INET: Bad hostname 'checkip.dyndns.org'
2023:03:16-09:03:24 escape75 system: System was restarted
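
For anyone wanting to check their own logs for the same kind of silent gap, here is a minimal Python sketch. It assumes the timestamp layout seen in the excerpt above (`YYYY:MM:DD-HH:MM:SS` at the start of each line); the function name `find_gaps` and the 30-minute threshold are made up for illustration and are not part of UTM.

```python
from datetime import datetime, timedelta

# Timestamp layout seen in the log excerpt, e.g. 2023:03:16-01:35:01
TS_FORMAT = "%Y:%m:%d-%H:%M:%S"
TS_LENGTH = 19


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous, current) timestamp pairs further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # garbage or NUL-filled line, no usable timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


sample = [
    "2023:03:16-01:32:01 escape75 /usr/sbin/cron[25494]: (root) CMD (  nice -n19 /usr/local/bin/gen_inline_reporting_data.plx)",
    "2023:03:16-01:35:01 escape75 /usr/sbin/cron[25649]: (root) CMD (   /usr/local/bin/reporter/system-reporter.pl)",
    "\x00" * 40,  # stand-in for the line of unreadable characters
    "2023:03:16-09:03:10 escape75 syslog-ng[4942]: syslog-ng starting up; version='3.4.7'",
]

for before, after in find_gaps(sample):
    print(f"no log entries between {before} and {after} ({after - before})")
```

When reading a real log file, opening it with `open(path, errors="replace")` keeps the unreadable bytes from raising a decode error. The last timestamp before the gap only shows roughly when logging stopped; it doesn't say why.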



So, I've been running the software version of UTM (9.714) on my old unit (an XG115 r2) for a couple of years without any issues.
Recently I migrated my saved config over to a new unit (XG115 r3), and a few hours after setting up the new unit (at night)
it froze up and its interfaces (LAN) were not pingable, so I powered it down and rebooted. It's working again ...

Just wondering if there's something more I can look at to see what the issue was .. I have a hunch maybe it was DHCP related,
as my devices on the LAN were renewing the IP addresses and they were not in the table on the new unit, but it's a wild guess,
so if this doesn't happen again then maybe it's nothing to worry about.

I don't know if there would be an issue moving the config file (and license) from the old unit, but I wouldn't think so.

The new unit was installed the same way as the old unit, using the ssi-9.714-4.1.iso file and removing /etc/asg, with a software license.
The old unit hasn't experienced any weird issues in years, and the ethernet ports and devices are set up in an identical way; nothing changed.

Just looking for thoughts and ideas ...

Stats from top:

top - 11:32:20 up 2:31, 1 user, load average: 0.09, 0.29, 0.25
Tasks: 163 total, 1 running, 160 sleeping, 0 stopped, 2 zombie
Cpu(s): 0.6%us, 0.5%sy, 0.0%ni, 98.5%id, 0.1%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 3898468k total, 3558768k used, 339700k free, 111124k buffers
Swap: 4194300k total, 112k used, 4194188k free, 1352808k cached

Zombies:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18256 0.0 0.0 0 0 ? Z 11:30 0:00 [aua.bin] <defunct>
root 18595 0.6 0.0 0 0 ? Z 11:32 0:00 [confd.plx] <defunct>
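
As a side note, here is a small Python sketch for pulling the zombie rows out of pasted `ps` output like the above. The helper name `find_zombies` is hypothetical, and it assumes the header row is present and the columns are whitespace-separated as shown. Both zombies here started within a couple of minutes of the 11:32:20 snapshot, so they may simply be short-lived children that had not been reaped yet.

```python
def find_zombies(ps_output):
    """Return (pid, command) for every row whose STAT column starts with 'Z'."""
    rows = ps_output.strip().splitlines()
    header = rows[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = header.index("COMMAND")
    zombies = []
    for row in rows[1:]:
        # Split only up to the command column so a command with spaces stays whole
        fields = row.split(None, cmd_col)
        if len(fields) > cmd_col and fields[stat_col].startswith("Z"):
            zombies.append((int(fields[pid_col]), fields[cmd_col]))
    return zombies


sample_ps = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18256 0.0 0.0 0 0 ? Z 11:30 0:00 [aua.bin] <defunct>
root 18595 0.6 0.0 0 0 ? Z 11:32 0:00 [confd.plx] <defunct>
"""

for pid, command in find_zombies(sample_ps):
    print(pid, command)
```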



This thread was automatically locked due to age.
  • I think you've put in a good effort to try to get it going. What was the source of this device? If you can return it, I would. No point in wasting more time troubleshooting.

    I've bought used PC equipment on eBay before. If it works, great; if I can't get it going within an evening or three, it goes back.

  • Yes, it was eBay in fact.

    I previously purchased an XG115 r2 and it's still running great, but this one has had issues from the start ...

    I was just trying to figure out if I'm running into some weird bug, possibly with one of my LAN devices causing a crash.
    I know it would be strange, but I've seen strange things, and the seller claims it was running just fine before it was replaced.

    You know, when I see stuff like this it makes me wonder if it was another bad unit or:

     XG115 Rev 3 freezing sometimes on SFOS 18.5.2 MR-2-Build380 

  • T's gateway box doesn't offer true bridge mode. Instead, it's an almost 1:1 NAT of sorts (certain ports are blocked) with the public ip getting passed to one device on the lan side. There's 4 lan ports on the rear. The gateway still offers connectivity to the other 3 ethernet on a 192.168.1.x/24 subnet.

    The theory is T is better able to mine user data with spyware in the gateway than doing it further upstream at the central office or NOC layer. Not to mention there's a much smaller state table (8K or 16K entries).

    Finally, the box is drawing 10-15 watts for doing absolutely nothing.

    Here's a good write-up about T's auth process. I am using the "supplicant" method.

    github.com/.../opnatt

  • I see what you mean, it only supports IP Passthrough ...

    That reminds me of the time I had Shaw Cable and their modem didn't expose bridge mode,
    although it technically supported it, so one had to log in to the modem, open developer tools in a web browser
    and adjust the code to re-enable the disabled functionality.

  • Both the Small and Large FFT tests have passed, each running 2.5 hours ...

    The stick of RAM that was in the 'faulty' unit is also running fine in my r2 unit, I think I can pretty much exclude CPU, Memory and SSD ...

    Next I guess I'll test the interfaces ... 

  • Does a known good stick work in the faulty unit? 

    Passes memtests and all?

  • I say yes... here's why.

    I swapped the sticks of ram between the units, and the r2 (old) unit has been working fine, if the stick was bad it wouldn't.

    The tests I've done with memtest86+ (note the plus) failed on the r3 unit, with BOTH sticks of RAM, but the same test
    also crashes my desktop (i5-11400F) quite badly: it actually reboots it and then locks it up, and that PC is 100% stable!
    So I say memtest86+ with multicore enabled (F2 option) from the Ultimate Boot CD is NOT stable for testing.

    Now, I cannot test my r2 unit with that same memtest86+ because it only has a VGA output and I don't have a VGA display,
    and the included serial out does not correctly mirror the output from memtest86+ and many other such utilities.

    Also note that the r3 unit passes both the memtest86 single-core and multicore RAM tests, with both sticks,
    so that leads me to believe we can disregard my previous message about having found a memory test error ...

    The r3 unit also passed all prime95 tests, and has been running Windows from USB flawlessly,
    it looks 100% stable to me, except when running UTM or SFOS, which again doesn't make sense.

    I will focus on ethernet ports next ...

    I would be surprised if you can pass this test on your hardware:
    https://www.ultimatebootcd.com/download.html
    Pick memory, memtest86+ and hit F2 to force multicore :)
    I think that particular version is not stable with multicore.

  • Wow... <mind blown>!!!

    I always get the memtest binary from the memtest86+ website itself.

    Don't think I've used that Ultimate Boot CD recently.

  • Yeah I know!

    You have 2 seconds to hit that F2 otherwise it does a single core test ...

    I don't understand why pfSense and all other operating systems run just fine, except UTM/SFOS!

  • Next stop, iperf testing?

    Reload pf or opn, install iperf3, then connect a cable between the various interfaces and test away.
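
If you want to script the long runs rather than watch them, here is a rough Python sketch around the `iperf3` client. It assumes `iperf3 -s` is already running on the other end, uses the client flags `-c`, `-t` and `-J` (JSON output), and assumes the TCP report exposes `end.sum_received.bits_per_second`. The function names and the report fragment at the bottom are made up purely to show the shape; they are not measurements from this thread.

```python
import json
import subprocess


def iperf3_command(server, seconds=3600):
    """Build the client command line: one TCP run of the given length, JSON output."""
    return ["iperf3", "-c", server, "-t", str(seconds), "-J"]


def received_mbps(report):
    """Pull the receiver-side throughput out of a parsed iperf3 JSON report."""
    return report["end"]["sum_received"]["bits_per_second"] / 1e6


def run_test(server, seconds=3600):
    """Run iperf3 against the server and return the received throughput in Mbps."""
    result = subprocess.run(
        iperf3_command(server, seconds),
        capture_output=True, text=True, check=True,
    )
    return received_mbps(json.loads(result.stdout))


# Made-up report fragment, only to illustrate the assumed JSON layout
fake_report = {"end": {"sum_received": {"bits_per_second": 940_000_000.0}}}
print(f"{received_mbps(fake_report):.0f} Mbps")
```

Calling `run_test("192.0.2.1")` (a placeholder address) would do a one-hour run; running a plain `ping` alongside it gives the loaded latency.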

  • I finished testing each interface with iperf on pfSense, between the XG and my PC. No issues.

    Each port has been transferring at 944Mbps for exactly an hour each while being pinged,
    and first three ports did about 5ms loaded, while the fourth was at about 20ms loaded.

    Looks all normal, I believe the 4th port uses a shared interrupt so it's not as high priority.


    So the question remains, why does pfSense pass all tests, as well as other OSes, but not UTM/SFOS?!

    Could this relate to some specific BIOS setting or version? I don't have any other XG115 r3 units to compare ...

  • Why not test between the different ports themselves?  That would put more strain on the system and be more representative of firewall operation - 2 nics moving data between each other.

    Meaning connect port 1 to port 2, port 2 to port 3, port 3 to port 4, and then variations thereof (1-3, 1-4, etc.)

    If it all still works, then I'm out of ideas. Something in SG/XG is causing it to fault.
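
To make sure no combination gets skipped, the full list of cable pairings can be generated with a few lines of Python. This is only a checklist sketch, assuming the four ports are simply numbered 1 to 4.

```python
from itertools import combinations

ports = [1, 2, 3, 4]

# Every unordered pair of distinct ports
pairs = list(combinations(ports, 2))

for a, b in pairs:
    print(f"cable port {a} <-> port {b}")
```

With four ports that comes to six pairings to run iperf across.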

  • You might be right, but then again quite often when it was freezing there was no traffic (or very little) on the LAN.
    And I would think pfSense would freeze the same way if it were the hardware, so it's looking almost like some sort of software issue.

    In fact my next test was going to be to remove pfSense, load UTM and just let it sit with no ports hooked up,
    and I suspect it might freeze as well, but I guess time will tell.

    In the meantime it would be nice to get my hands on another XG115 r3, but do I risk it :)

  • That would be quite the snafu if pfsense/opnsense works fine on the hardware, but their own utm/xg craps out. There has to be something the sophos software is setting that's tripping the crash/panic.

    Curious to see results of your next test.

  • Yes I'll be testing it further and I'll let you know ...

    For now, I've changed some BIOS settings and re-installed UTM and I have a 21 hour uptime so far ...

    *I would think 'reset to defaults' would be the correct settings for that unit though*

  • What settings did you change?

    Sure, applying defaults is a good idea for baseline. That is, apply defaults, save, reboot, go back into bios, make changes as needed.

  • Yes, well defaults were not working so I figured I should change some and see ...

    I can't check right now as I don't want to reboot the box, but one was in regards to virtualization
    and there were some that had to do with UEFI support for network and other devices at boot.

    UTM doesn't use UEFI, so I changed them, but if it does fix it, I will screenshot all the changes to compare.

  • After an uptime of over 2 days, I have reset the BIOS to defaults and I'm waiting on a crash.

    I will then change them one by one to figure out which option causes the issue, if any. Hope I'm not losing it :)

    Changed options are highlighted:

  • Update: 20 hours and the box is still up. I'm getting confused, but will continue testing to get a longer uptime ...

    Now I'm beginning to wonder if the stick of RAM is possibly bad after all, but if it is, then why does it work in the r2 unit?

    I clearly need to do more testing, possibly get a VGA display and perform a 24 hour memory test on both units :)

  • If you've got a PC that accepts that RAM, test it there. That will rule out any SG hardware gremlins.

  • Not really, it's a laptop type SODIMM stick, but once I confirm it works with whatever BIOS options,
    I will swap the sticks of RAM again and see if the issue comes back. It could be just that stick
    in that box, due to timings, not sure. I've seen weird things and this might be another one of those :)

  • Very weird, after 1 day and 10 hours the unit is still up. Now, after putting the original RAM back in, it works ...

    I wonder if something was just loose and re-inserting the RAM fixed it. I guess time will tell.
    I had never bothered to remove the RAM or the SSD before, as they had that factory security glue on them,
    connecting the sides of the RAM and SSD to the sockets they were each plugged in to.

    Oh well, I'm continuing to test ...