We currently have a problem that occurs routinely between 06:20 and 07:00 in the morning. What we have found so far is that the SG310 appliance starts losing packets as load increases. That in itself would be understandable, but we cannot find the source of the load: there is no clear indication that an unusually large number of packets is actually going through the firewall.
We can see very high software interrupt load on both network cards, completely saturating the corresponding CPU cores. During the day we handle about 90,000 connections without any problems, and the conntrack count (conntrackd -s) during the incident is lower than our normal daytime peaks. Traffic on the firewall (iftop) and on our switches looks unremarkable.
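For reference, this is roughly how we sample the conntrack state (standard sysctl/procfs paths plus conntrack-tools; exact counter names can differ between kernel versions):

# current vs. maximum number of tracked connections
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# per-CPU conntrack statistics; high "insert_failed", "drop" or
# "search_restart" values would hint at hash table contention
conntrack -S

# the same counters in raw hex form, one line per CPU
cat /proc/net/stat/nf_conntrack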
Perf shows:
+ 25.46% ksoftirqd/0 [kernel.kallsyms] [k] nf_conntrack_tuple_taken
+ 23.68% kworker/1:1 [kernel.kallsyms] [k] nf_conntrack_tuple_taken
+ 10.17% ksoftirqd/1 [kernel.kallsyms] [k] nf_conntrack_tuple_taken
+ 9.69% conntrackd [kernel.kallsyms] [k] nf_conntrack_tuple_taken
+ 2.26% confd.plx [kernel.kallsyms] [k] nf_conntrack_tuple_taken
+ 2.14% swapper [kernel.kallsyms] [k] nf_conntrack_tuple_taken
+ 2.01% mdw.plx [kernel.kallsyms] [k] nf_conntrack_tuple_taken
+ 1.62% confd.plx libperl.so [.] Perl_hv_common
+ 1.31% ksoftirqd/0 [kernel.kallsyms] [k] hash_conntrack_raw
+ 1.15% kworker/1:1 [kernel.kallsyms] [k] hash_conntrack_raw
+ 0.69% ksoftirqd/0 [kernel.kallsyms] [k] nf_ct_tuple_equal
+ 0.67% kworker/1:1 [kernel.kallsyms] [k] nf_ct_tuple_equal
+ 0.62% mdw.plx libperl.so [.] Perl_hv_common
+ 0.57% ksoftirqd/0 [kernel.kallsyms] [k] nf_ct_invert_tuple
+ 0.56% ksoftirqd/1 [kernel.kallsyms] [k] hash_conntrack_raw
+ 0.51% kworker/1:1 [kernel.kallsyms] [k] nf_ct_invert_tuple
+ 0.50% postgres [kernel.kallsyms] [k] nf_conntrack_tuple_taken
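(For completeness: the profile above was captured with plain perf, roughly like this, sampling all CPUs with call graphs for 30 seconds:)

perf record -a -g -- sleep 30
perf report --sort comm,dso,symbol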
We have disabled all reporting and reduced retention.
There is nothing strange in the logs. mdw and the service monitor go crazy during the incident, but only because the appliance then has general network problems (reaching the internet and/or the internal network).
The packet filter log does not show any irregular traffic hitting the appliance.
When we fail over, the second unit (which had no load before) immediately starts spiking as well, so this rules out a hardware defect.
So it is either an internal cron job that makes the load explode via some bug in conntrack handling (it happens before the admbs maintenance!), or there are packets (or empty/dangling connections) hitting the firewall whose source we cannot find. It is as if the cause were completely hidden from us; to rule out hidden traffic we plan a wire-level capture during the next spike, sketched below.
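Along these lines ("eth0" stands in for the actual external interface name):

# capture 30 seconds of TCP SYNs on the external interface
timeout 30 tcpdump -ni eth0 -w /tmp/syns.pcap 'tcp[tcpflags] & tcp-syn != 0'

# count SYNs per source; field 3 of the default tcpdump output is "src.port"
tcpdump -nr /tmp/syns.pcap | awk '{print $3}' | sort | uniq -c | sort -rn | head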
1. Any ideas for debugging?
2. Where would we see whether this is a "new connection" issue? One thing we plan to try is sketched below.
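Our current idea is to record conntrack events during a spike and count new connections per source, something like this (conntrack-tools; the awk loop pulls the first "src=" field, i.e. the original direction, out of each event line):

# record 60 seconds of NEW conntrack events
timeout 60 conntrack -E -e NEW > /tmp/new-events.txt

# count new connections per original source address
awk '{for (i = 1; i <= NF; i++) if ($i ~ /^src=/) { print $i; break }}' /tmp/new-events.txt | sort | uniq -c | sort -rn | head -20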
And when the problem hits again, is there a safe way to temporarily deactivate conntrack? A sketch of what we have in mind follows.
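We were thinking of exempting all traffic from tracking via the raw table, roughly as below, but we assume this would break stateful rules, NAT and the conntrackd failover, so please correct us if there is a supported way:

# bypass connection tracking entirely (the CT target; older kernels use
# -j NOTRACK instead). Warning: stateful filtering and NAT will break!
iptables -t raw -I PREROUTING -j CT --notrack
iptables -t raw -I OUTPUT -j CT --notrack

# revert
iptables -t raw -D PREROUTING -j CT --notrack
iptables -t raw -D OUTPUT -j CT --notrack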