This discussion has been locked.

Weird UTM freezes randomly approximately once a day ...

I have experienced a strange lockup on my "new" UTM box, but I checked log files and they don't reveal anything, just a bunch of weird characters ...

2023:03:16-01:32:01 escape75 /usr/sbin/cron[25494]: (root) CMD (  nice -n19 /usr/local/bin/gen_inline_reporting_data.plx)
2023:03:16-01:35:01 escape75 /usr/sbin/cron[25649]: (root) CMD (   /usr/local/bin/reporter/system-reporter.pl)
�����������������������������������������������������������������������������������������������������������
2023:03:16-09:03:10 escape75 syslog-ng[4942]: syslog-ng starting up; version='3.4.7'
2023:03:16-09:03:12 escape75 ddclient[5361]: WARNING: cannot connect to checkip.dyndns.org:80 socket: IO::Socket::INET: Bad hostname 'checkip.dyndns.org'
2023:03:16-09:03:24 escape75 system: System was restarted
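
Since the log itself says nothing about the cause, one thing it can still give is the window in which the box went silent. Below is a minimal sketch, assuming the timestamp format shown in the excerpt above; the function name `find_gaps` and the 30-minute threshold are made up for illustration. Lines that don't start with a timestamp (such as the run of garbage characters) are skipped.

```python
from datetime import datetime, timedelta
import re

# Timestamp format as it appears in the excerpt, e.g. 2023:03:16-01:35:01
STAMP = re.compile(r"^(\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}) ")


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (last_seen, next_seen) pairs where the log was silent longer than threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip garbage lines and anything without a leading timestamp
        current = datetime.strptime(match.group(1), "%Y:%m:%d-%H:%M:%S")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Sample lines taken from the excerpt above (the garbage line is a stand-in).
sample = [
    "2023:03:16-01:32:01 escape75 /usr/sbin/cron[25494]: (root) CMD (  nice -n19 /usr/local/bin/gen_inline_reporting_data.plx)",
    "2023:03:16-01:35:01 escape75 /usr/sbin/cron[25649]: (root) CMD (   /usr/local/bin/reporter/system-reporter.pl)",
    "\ufffd" * 16,
    "2023:03:16-09:03:10 escape75 syslog-ng[4942]: syslog-ng starting up; version='3.4.7'",
]

for last_seen, next_seen in find_gaps(sample):
    print(f"silent from {last_seen} to {next_seen} ({next_seen - last_seen})")
```

For a real log file, reading it with `open(path, errors="replace")` should keep the garbage bytes from raising a decode error. The last timestamp before the gap only bounds the freeze; cron fires every few minutes here, so the lockup likely happened shortly after it.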



So, I've been running the software version of UTM (9.714) on my old unit (an XG115 r2) for a couple of years without any issues,
and recently I migrated my saved config over to a new unit (XG115 r3). A few hours after setting up the new unit (at night)
it froze up, and the interfaces were not pingable (LAN), so I powered it down and rebooted. It's working again ...

Just wondering if there's something more I can look at to see what the issue was .. I have a hunch maybe it was DHCP related,
as my devices on the LAN were renewing the IP addresses and they were not in the table on the new unit, but it's a wild guess,
so if this doesn't happen again then maybe it's nothing to worry about.

I don't know if there would be an issue moving the config file (and license) from the old unit, but I wouldn't think so.

The new unit was installed the same way as the old unit, using the ssi-9.714-4.1.iso file and removing the /etc/asg with a software license,
and the old unit hasn't experienced any weird issues in years, and the ethernet ports and devices are set up in an identical way, nothing changed.

Just looking for thoughts and ideas ...

Stats from top:

top - 11:32:20 up 2:31, 1 user, load average: 0.09, 0.29, 0.25
Tasks: 163 total, 1 running, 160 sleeping, 0 stopped, 2 zombie
Cpu(s): 0.6%us, 0.5%sy, 0.0%ni, 98.5%id, 0.1%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 3898468k total, 3558768k used, 339700k free, 111124k buffers
Swap: 4194300k total, 112k used, 4194188k free, 1352808k cached

Zombies:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18256 0.0 0.0 0 0 ? Z 11:30 0:00 [aua.bin] <defunct>
root 18595 0.6 0.0 0 0 ? Z 11:32 0:00 [confd.plx] <defunct>
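
Both zombies started within a couple of minutes of the `top` snapshot, so they may simply be short-lived children that had not been reaped yet. A rough way to check is to capture the process list twice and see whether the same PIDs are still defunct. The sketch below only parses `ps`-style text like the listing above; `find_zombies` is a hypothetical helper, not anything shipped with UTM.

```python
def find_zombies(ps_output):
    """Return (pid, command) for every row whose STAT column starts with 'Z'."""
    rows = ps_output.strip().splitlines()
    header = rows[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = header.index("COMMAND")

    zombies = []
    for row in rows[1:]:
        # Split at most cmd_col times so the command keeps its spaces.
        fields = row.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # malformed or truncated row
        if fields[stat_col].startswith("Z"):
            zombies.append((int(fields[pid_col]), fields[cmd_col]))
    return zombies


# The listing from the post, used as sample input.
sample = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18256 0.0 0.0 0 0 ? Z 11:30 0:00 [aua.bin] <defunct>
root 18595 0.6 0.0 0 0 ? Z 11:32 0:00 [confd.plx] <defunct>
"""

print(find_zombies(sample))
```

Taking the PIDs from two captures a few minutes apart and intersecting them as sets would show which zombies persist; ones that disappear between captures are probably nothing to worry about.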



  • I think you've put in a good effort to try to get it going. What was the source of this device? If you can return it, I would. No point in wasting more time troubleshooting.

    I've bought used PC equipment on ebay before. If it works, great; if I can't get it going within an evening or 3, it goes back.

  • Yes, it was ebay in fact.

    I have purchased a previous XG115 r2 and it's still running great, but this one has issues from the start ...

    I was just trying to figure out if I'm running into some weird bug, possibly with one of my LAN devices causing a crash,
    I know it would be strange but I've seen strange things, and the seller claims it was running just fine before it was replaced.

    You know, when I see stuff like this it makes me wonder if it was another bad unit or:

     XG115 Rev 3 freezing sometimes on SFOS 18.5.2 MR-2-Build380 

  • Assuming there are no issues after a few hours of prime95, perhaps the NICs are faulty?

    As this is a multiple interface device, you could just run one ethernet cable between 2 ports then set up iperf3 to generate traffic.

    The server and client will need to be bound to the separate interfaces using -B.  IIRC the format is -B {ip of interface}.

    Let's say you assign 192.168.1.1 to interface 1, 192.168.1.2 to interface 2

    -B 192.168.1.1 will bind to interface 1, -B 192.168.1.2 will bind to interface 2.

    Run that for a period of time to see if it causes some sort of failure.
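
The port-to-port test described above could be scripted roughly like this. It is only a sketch that builds the two command lines: the helper name `iperf3_loop_commands` and the one-hour duration are assumptions, while `-s`, `-c`, `-B` and `-t` are standard iperf3 options. The addresses are the example ones from the post.

```python
import shlex


def iperf3_loop_commands(server_ip, client_ip, seconds=3600):
    """Build iperf3 server and client command lines bound to two local addresses."""
    # Server listens only on the first interface's address.
    server = ["iperf3", "-s", "-B", server_ip]
    # Client connects to the server address while sending from the second interface's address.
    client = ["iperf3", "-c", server_ip, "-B", client_ip, "-t", str(seconds)]
    return server, client


server_cmd, client_cmd = iperf3_loop_commands("192.168.1.1", "192.168.1.2")
print(shlex.join(server_cmd))
print(shlex.join(client_cmd))
```

The server command goes in one shell and the client in another. One caveat: on Linux, traffic between two addresses owned by the same host may be handled internally instead of crossing the cable, so it is worth watching the interface packet counters during the run to confirm the NICs are actually being exercised.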

  • Some good points, I will keep working on it ...

    My uptime is now 2 Days 17 Hours.

    I would think that if the issue were related to a faulty NIC, it would also show up in pfSense, hmm!

    I wish I had access to another XG115 r3, could it be different BIOS settings, or something ...

  • Uptime over 3 days and no issues with pfSense, but I did a memtest using ubcd539.iso and it froze!

    The only issue is that I did the multicore test, which I've read can cause false positives, but this thing actually froze ...

    It was stuck on the above screen, only + was blinking but the test time was not incrementing; I left it for a few minutes just to see.

    So I'm thinking maybe it's bad RAM after all ?!?

    (Of course I wish I had another XG115 r3 to test the RAM using multiple CPUs to see if it also behaves this way)

  • Sort of makes sense.  If the bad memory is never accessed then there's no issue. I bet *sense has a smaller memory footprint than utm/xg.  Same in windows. Prime95 has a test which uses up most of the memory.

    Why wouldn't you run the mem test with all cores going? It should work.

    Choose the red option to fill the ram during test.  Choose the blue option to stress the cpu/cache only.

    I use both when validating a new build.

    Bought some teamgroup ram a while back from amz.  Two kits, 2x16gb each, 3600mhz.  Running both together at 3200mhz, no problem.  Running both at 3600mhz I'd get weird stability issues. Eventually tracked it down to one of the kits needing 1.40V instead of the xmp-set 1.35V ddr voltage.

    Didn't care much for this and exchanged the bad (questionable) kit. IMO memory should run at rated xmp settings without having to bump up ram voltage (mem controller might need a boost, but that's a different matter).  With the new kit, all 4 slots populated, 3600mhz 1.35v ran without any further issues.

  • Gets even weirder, memtest86 actually passes, but memtest86+ freezes always at the same place, 2:38 minutes, hmm!

    I actually didn't realize I had already run Mersenne Prime Test v28.x using blend, which says lots of memory tested, and it passed.

  • I now swapped the memory sticks between my XG115 r2 and r3, and the memtest86+ stopped again at the same time, weird!

    I will run r2 with the stick from r3 and I'll see if it screws up ...

  • Try one stick in one slot then the other.

  • There's only one slot on the r2 and r3 ...

  • Isn't that a pickle!

    If it's not the memory module, then it's the memory slot or memory controller, neither of which you can replace.

  • Still, I'd be curious about the results of the prime95 test, both the cpu torture and full ram. I bet one of those will crash the system.

  • I will set it up to run from a bootable windows usb :)

  • Btw, I feel your pain in testing this.

    For the last 2 weeks been playing with *sense products.  I have att fiber which requires a special flavor of BS to work without their "required" gateway box.

    In a nutshell the service uses eapol on vlan0 to authenticate and enable data flow.  This works great on utm (after installing wpa_supplicant), but *sense is freebsd based which has broken vlan0 implementations.  To get it working requires using something called netgraph to handle the vlan0 traffic at a kernel level.

    Anyway, there's multiple ways of implementing this, but some are claiming success without netgraph - this is welcome as fewer system resources are needed when transferring to internet at line speeds.

    Bottom line, I fully understand your frustration!

  • Yes, that also sounds like a painful situation, but at least you're sharing the pain with all other pfsense users :)

    What's wrong with using their "required" box and somehow turning it into a bridged unit?

  • T's gateway box doesn't offer true bridge mode. Instead, it's an almost 1:1 NAT of sorts (certain ports are blocked) with the public ip getting passed to one device on the lan side. There's 4 lan ports on the rear. The gateway still offers connectivity to the other 3 ethernet ports on a 192.168.1.x/24 subnet.

    The theory is T is better able to mine user data with spyware in the gateway than doing it further upstream at the central office or NOC layer. Not to mention there's a much smaller state table (8K or 16K entries).

    Finally, the box is drawing 10-15 watts for doing absolutely nothing.

    Here's a good write up about T's auth process.  I am using the "supplicant" method.

    github.com/.../opnatt

  • I see what you mean, it only supports IP Passthrough ...

    That reminds me of the time I had Shaw Cable and their modem didn't expose bridge mode,
    although it supported it technically, so one had to go into developer tools in a web browser
    and adjust the code once logged in to the modem to re-enable the disabled functionality.