
Connectivity issues during DSL reconnects despite uplink balancing

Our UTM 9 has six configured DSL (PPPoE) connections, so we're using uplink balancing. We scheduled the reconnects in pairs (DSL 1 & 4 at 3:00, DSL 2 & 5 at 4:00, and DSL 3 & 6 at 5:00). The problem is that we see brief connectivity outages at exactly these times.

My assumption: the reconnect doesn't clear the route cache, which still holds routes over the DSL connection being reconnected, so some destinations only become reachable again once that connection is back up.

The configuration option "binding timeout" got me thinking in this direction (net.ipv4.route.gc_timeout, http://vincent.bernat.im/en/blog/2011-ipv4-route-cache-linux.html).

Any hints and opinions are very welcome!



This thread was automatically locked due to age.
  • Hi, Florian, and welcome to the UTM Community!
    By "binding timeout," do you mean the 'Persistence timeout' in Uplink Balancing? That should not be related to the issue you're having. Then again, you didn't say what connectivity issue you're seeing...
    Cheers - Bob
     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  • Hi Bob,

    Thanks for your answer! I'll try to clarify. Switching the WebAdmin language from German to English doesn't work for me, so I'll try to translate the labels correctly.

    By "binding timeout" I mean the configuration option under "Interfaces" -> "Uplink Balancing" -> "Edit planner" (active interfaces). The setting is named "Bindung Zeitüberschreitung" in German, the form element has the ID "FORM_FORM_POPUP_active_ELEMENT_persistence_time", and the selected value is "1 hour".

    The concrete issue is that services behind the UTM cannot reach at least some hosts on the internet at the exact times of the DSL reconnects (3:00, 4:00, and 5:00). Some error messages from our ticketing system, OTRS:

    Message: POP3: Can't connect to vwp0075.webpack.hosteurope.de
    Message: POP3S: Can't connect to pop.googlemail.com

    Yep, I know, quite generic. I also ran a continuous ping to pop.googlemail.com every 5 seconds overnight, and it fails at those three times as well.

    Sure, I used DNS names instead of IP addresses, but I don't think that's the point.

    Thanks for your help!

    Florian
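
A ping test like the one described above could be scripted roughly as follows. This is only a sketch, not the script Florian actually ran: the function names `ping_once`, `monitor`, and `find_outages` are made up for illustration, and the `ping` flags assume the Linux iputils version (`-c` for count, `-W` for the reply timeout in seconds).

```python
import subprocess
import time
from datetime import datetime


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request via the system `ping` (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host: str, interval_s: float = 5.0, count: int = 12):
    """Probe `host` `count` times, sleeping `interval_s` seconds between probes."""
    samples = []
    for _ in range(count):
        samples.append((datetime.now(), ping_once(host)))
        time.sleep(interval_s)
    return samples


def find_outages(samples):
    """Group consecutive failed samples into (first_failure, last_failure) pairs.

    `samples` is an iterable of (timestamp, ok) tuples in chronological order.
    """
    outages = []
    start = last = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts      # first failed probe of a new outage
            last = ts
        elif start is not None:
            outages.append((start, last))
            start = last = None
    if start is not None:       # the log ended while the outage was still ongoing
        outages.append((start, last))
    return outages
```

Calling `find_outages(monitor("pop.googlemail.com", interval_s=5, count=...))` over the reconnect window would list when replies stopped and resumed. With a 5-second probe interval the measured outage length is only accurate to within a few seconds.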

    I enjoy reading German, Florian, but most of the time I can't think creatively in German any more. :(  Now we're both on the same page! :)

    Does the ping fail for the entire hour until the connection is passed back to the preferred interface, or just for a minute after the reconnect starts?

    Cheers - Bob

     
  • Hey Bob,

    the ping fails for just 5-15 seconds. The PPPoE log for this period shows that reconnecting the two DSL connections took ~10 seconds.

    Since these two durations match, could it be that the cached routes using these gateways are not cleared? Or perhaps the route cache is cleared too slowly or asynchronously?

    Bye,

    Flo
  • OK, I think I see now...

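    In Uplink Balancing and Uplink Monitoring, the default interval between pings is 15 seconds, and the system allows up to five seconds to receive a response. This means the failover time ranges from five to 20 seconds: about five if the link drops just before a ping is sent, and up to 20 if it drops just after a ping was answered.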

    If you want to change those values in WebAdmin, you have to define your own monitoring hosts. I use the two Google DNS servers when I want to play with those values.

    Cheers - Bob

     
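
The five-to-20-second range in the reply above follows from simple arithmetic on the two monitoring values. The sketch below spells it out under a simplified model: one ping per interval, and the uplink is declared down as soon as a single ping goes unanswered for the full timeout. The function name `failover_window` is hypothetical, and the UTM's actual monitoring logic may differ in detail.

```python
def failover_window(check_interval_s: float, timeout_s: float) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds until a dead uplink is detected.

    Simplified model (an assumption, not the UTM's documented behaviour):
    one probe every `check_interval_s` seconds; the link counts as down once
    a single probe has gone unanswered for `timeout_s` seconds.
    """
    if check_interval_s <= 0 or timeout_s <= 0:
        raise ValueError("interval and timeout must be positive")
    best = timeout_s                        # link drops just before a probe is sent
    worst = check_interval_s + timeout_s    # link drops just after a probe was answered
    return best, worst
```

With the default values, `failover_window(15, 5)` gives `(5, 20)`. Shortening either value narrows the window, at the cost of more monitoring traffic and a higher chance of false failovers on a briefly congested link.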
  • Ok, I lowered the two values from 15/5 to 5/3 and restarted my ping script. I'll check the results after the weekend.

    You explained the source of the problem quite well, but I'm wondering why the UTM needs these checks for a reconnect it initiates itself. When it performs the reconnect, it could invalidate the cache right away instead of relying on uplink monitoring.

    Don't you also smell a bug?
  • That does sound like a good idea for a Feature Suggestion - the reconnect could trigger a failover instead of having a five-to-20 second wait.

    Cheers - Bob

     
  • Thanks for your help, Bob!

    I filed a feature request (feature.astaro.com/.../11700897-dsl-reconnect-invalidates-route-cache), and the clarification helps me too.

    Bye,

    Flo