
Site-to-Site VPN / Amazon VPC - Temporary Delays & Errors when adding configurations

Hello all,

 

We are using a Sophos UTM as the primary VPN device in a Transit VPC setup with just a few (3-5) small VPCs. We've been using this for a while, and we've found that once the various AWS VGW Site-to-Site VPNs are connected, the setup is completely stable and throughput is not an issue for us.

 

However, whenever we add or replace an AWS Site-to-Site VPN configuration (a spoke VPC), things don't work very well, and the issue gets worse as more VGWs are added. Beyond 2 VGWs, in the best case, the new VPC configuration fails to connect for somewhere between 3 and 10 minutes. Usually, however, all AWS VPC connections need to be stopped and started, or the whole UTM needs to be rebooted, before things will connect. Worse, the more AWS VGWs that are connected, the longer the UTM is inaccessible on reboot! With 5 VGWs connected, it takes 12-15 minutes after booting before the UTM is even pingable. With only 1-2 VGWs, the UTM is accessible within 0-2 minutes.

Whenever we add or remove a VPC configuration, the system logs are always littered with messages about accessing an uninitialized value in the BGP configuration tooling. Below is a sample of what we see:

 


Use of uninitialized value in concatenation (.) or string at /</var/mdw/mdw.plx>ASG/dynamic_routing.pm line 706.
2020:04:22-01:47:18 netsec-test middleware[3724]:
2020:04:22-01:47:18 netsec-test middleware[3724]: 1. main::_warn:182() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: 2. ASG::dynamic_routing::reconfigure_bgp:706() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 3. ASG::dynamic_routing::setAll:370() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 4. core::Config::load:362() /</var/mdw/mdw.plx>core/Config.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 5. main::top-level:224() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: |=========================================================================
2020:04:22-01:47:18 netsec-test middleware[3724]: E Use of uninitialized value in concatenation (.) or string at /</var/mdw/mdw.plx>ASG/dynamic_routing.pm line 713.
2020:04:22-01:47:18 netsec-test middleware[3724]:
2020:04:22-01:47:18 netsec-test middleware[3724]: 1. main::_warn:182() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: 2. ASG::dynamic_routing::reconfigure_bgp:713() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 3. ASG::dynamic_routing::setAll:370() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 4. core::Config::load:362() /</var/mdw/mdw.plx>core/Config.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 5. main::top-level:224() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: |=========================================================================
2020:04:22-01:47:18 netsec-test middleware[3724]: E Use of uninitialized value in concatenation (.) or string at /</var/mdw/mdw.plx>ASG/dynamic_routing.pm line 733.
2020:04:22-01:47:18 netsec-test middleware[3724]:
2020:04:22-01:47:18 netsec-test middleware[3724]: 1. main::_warn:182() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: 2. ASG::dynamic_routing::reconfigure_bgp:733() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 3. ASG::dynamic_routing::setAll:370() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 4. core::Config::load:362() /</var/mdw/mdw.plx>core/Config.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 5. main::top-level:224() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: |=========================================================================
2020:04:22-01:47:18 netsec-test middleware[3724]: E Use of uninitialized value in concatenation (.) or string at /</var/mdw/mdw.plx>ASG/dynamic_routing.pm line 734.
2020:04:22-01:47:18 netsec-test middleware[3724]:
2020:04:22-01:47:18 netsec-test middleware[3724]: 1. main::_warn:182() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: 2. ASG::dynamic_routing::reconfigure_bgp:734() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 3. ASG::dynamic_routing::setAll:370() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 4. core::Config::load:362() /</var/mdw/mdw.plx>core/Config.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 5. main::top-level:224() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: |=========================================================================
2020:04:22-01:47:18 netsec-test middleware[3724]: E Use of uninitialized value in concatenation (.) or string at /</var/mdw/mdw.plx>ASG/dynamic_routing.pm line 750.
2020:04:22-01:47:18 netsec-test middleware[3724]:
2020:04:22-01:47:18 netsec-test middleware[3724]: 1. main::_warn:182() mdw.pl
2020:04:22-01:47:18 netsec-test middleware[3724]: 2. ASG::dynamic_routing::reconfigure_bgp:750() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 3. ASG::dynamic_routing::setAll:370() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 4. core::Config::load:362() /</var/mdw/mdw.plx>core/Config.pm
2020:04:22-01:47:18 netsec-test middleware[3724]: 5. main::top-level:224() mdw.pl
2020:04:22-01:47:19 netsec-test middleware[3724]: |=========================================================================
2020:04:22-01:47:19 netsec-test middleware[3724]: E Use of uninitialized value in concatenation (.) or string at /</var/mdw/mdw.plx>ASG/dynamic_routing.pm line 751.
2020:04:22-01:47:19 netsec-test middleware[3724]:
2020:04:22-01:47:19 netsec-test middleware[3724]: 1. main::_warn:182() mdw.pl
2020:04:22-01:47:19 netsec-test middleware[3724]: 2. ASG::dynamic_routing::reconfigure_bgp:751() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:19 netsec-test middleware[3724]: 3. ASG::dynamic_routing::setAll:370() /</var/mdw/mdw.plx>ASG/dynamic_routing.pm
2020:04:22-01:47:19 netsec-test middleware[3724]: 4. core::Config::load:362() /</var/mdw/mdw.plx>core/Config.pm
2020:04:22-01:47:19 netsec-test middleware[3724]: 5. main::top-level:224() mdw.pl
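For anyone who wants to check whether their own middleware log shows the same pattern, here is a minimal Python sketch that counts these warnings per source line. The regular expression is an assumption based only on the sample lines above, not on any documented Sophos log format, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the sample log lines above (an assumption, not a
# documented format): captures the source path and the reported line number.
WARNING_RE = re.compile(
    r"Use of uninitialized value in concatenation \(\.\) or string at "
    r"(?P<path>\S+) line (?P<line>\d+)\."
)


def summarize_warnings(log_lines):
    """Count uninitialized-value warnings per (source path, line number)."""
    counts = Counter()
    for entry in log_lines:
        match = WARNING_RE.search(entry)
        if match:
            key = (match.group("path"), int(match.group("line")))
            counts[key] += 1
    return counts


# Example: two entries copied from the sample above, plus one stack-trace
# line that should be ignored because it is not a warning header.
sample = [
    "Use of uninitialized value in concatenation (.) or string at "
    "/</var/mdw/mdw.plx>ASG/dynamic_routing.pm line 706.",
    "2020:04:22-01:47:18 netsec-test middleware[3724]: 1. main::_warn:182() mdw.pl",
    "2020:04:22-01:47:18 netsec-test middleware[3724]: E Use of uninitialized "
    "value in concatenation (.) or string at "
    "/</var/mdw/mdw.plx>ASG/dynamic_routing.pm line 713.",
]

for (path, line_no), count in sorted(summarize_warnings(sample).items()):
    print(f"{path}:{line_no} seen {count} time(s)")
```

Running it over a full log from before and after a VGW change would show whether the warnings come from the same handful of lines in reconfigure_bgp each time, which may be useful detail to attach to a support case.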

 
After the VPN connection and BGP session are established, these messages stop and everything works just fine. However, needing to restart the UTM on every change is pretty rough, and being down for 10-20 minutes for each reboot isn't great.
 
Has anybody else dealt with this? Any ideas on how to avoid it?
 
 


This thread was automatically locked due to age.
  • /bump - anybody able to share their experience here?

  • Hi and welcome to the UTM Community!

    "Usually, however, all AWS VPC connections need to be stopped and started, or the whole UTM needs to be rebooted before things will connect again"

    I think you found the workaround, but, since you're in the USA, I think you should get Sophos Support involved.  If they know of a less-disruptive workaround, please let us know.  In any case, Support should let the developers know about this issue.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA
  •  - thanks for the reply! We'll reach out to Sophos Support and update this thread with the response.

     

    Update 2020-05-14: No response from Sophos yet.

     

    Update 2020-05-15: Received a response from Sophos Support. They did not try to validate or confirm the issue. Instead, they just asked whether we were using physical UTMs or AWS virtualized UTMs (not sure how that is relevant). We are using UTMs running on AWS instances, and they have become focused on that. They seem to be completely unclear on the difference between the Sophos Marketplace AMI and the Sophos-provided CloudFormation templates, and we're getting nonsense questions as a result.

     

    Update 2020-05-17: Escalated to our reseller for a response.

     

    Update 2020-05-22: After we provided Sophos Support with 5 clear steps to reproduce the issue, they declined to investigate on their own. Instead, they asked that we reproduce the issue for them again and provide logs from the system, a tcpdump capture, and an explanation of the specific steps we took. We provided all of this, but also made it clear that they could confirm this on their own: it just requires a test UTM and connecting a few VGWs to it.

     

    Update 2020-06-03: Sophos Support informed us they have escalated the issue to their "Global Escalation Specialists", who will contact us if they have any questions. We are now past 20 days without Sophos Support attempting to reproduce the issue.

     

    Update 2020-06-16: Sophos Support claimed they had reproduced the issue and had a patch they wanted us to install on our production systems. Not wanting to try unreleased patches on live systems, we set up a testing environment and system for them to test the patch against. The patch did not work, and we learned that they had not actually tried to reproduce the issue on their side. They had only used our logs to come up with a potential fix.

      

    Update 2020-06-25: After some back and forth, Sophos developers finally tried to reproduce the issue. Once they did, they were able to identify the problem and provided us with a patch that worked in the test environment. The patched UTM no longer has any delays or errors when using multiple VPC VGW connections. We now just need them to release the patch.

     

    Once the patch is released, this problem will be gone, and I think the UTM will then be the easiest-to-use Transit VPC appliance out there.

    Unfortunately, it took over a month of exchanges and pushback before Sophos development attempted to reproduce the issue. Within 5 days of reproducing it on their own, they had a fix. I wish it didn't take such an enormous amount of effort to get Sophos to look into problems.

  • If you're the reseller, ask for escalation.  If not, forward the last response from Support to your reseller and ask them to tickle Support.

    Cheers - Bob

     
    Sophos UTM Community Moderator
    Sophos Certified Architect - UTM
    Sophos Certified Engineer - XG
    Gold Solution Partner since 2005
    MediaSoft, Inc. USA