Redundant VPN Tunnels via Different ISPs

Intro

My friends will tell you that I’m obsessed with redundancy, both in life and in I.T.

At home I have two main internet connections, via Altice Optimum (“cable”) and Verizon FiOS. They’re both relatively high bandwidth, and are connected to my two core routers that operate in an active/passive configuration. Basically this:

                                                                O------O
                                        +--------+             /        \
                                        |        |------------/          O
+------------------+--------------------|  Core  |           /          /
|  Optimum Router  |                    | Router |----------O          /
+------------------+\     ______________|   01   |           \        O
                     \   /              |        |------------O        \
                      \ /               +--------+           /          \
                       X          Keepalived |              /  Various   O
                      / \          Heartbeat |             O  Networks  /
                     /   \              +--------+          \          O
+------------------+/     \_____________|        |-----------\          \
|   FiOS Router    |                    |  Core  |            \          O
+------------------+--------------------| Router |-------------O        /
                                        |   02   |            /        /
                                        |        |-----------O        /
                                        +--------+            \      /
                                                               O----O		   

Hmmm.. I can’t tell if that thing on the right looks like a cloud or a turd. Probably the latter. I’ll skip the ASCII “art” next time.

But is that really enough? Ever since “hurricane” Sandy I’ve been worried about losing both FiOS and Optimum simultaneously. It’s never happened due to a coincidence of network failures on both providers, but it’s a different story if a tree takes out the lines.

Enter Sprint. Many years ago, I configured a Netgear 6100D from Sprint to act as an emergency failover (and backdoor) so some things would stay up and running in the event of a failure. But lately I started thinking about the scenario of a core router failure.

Now, I should point out that, aside from misconfiguration oopsies on my end, I’ve never had a complete failure of both core routers.

Nonetheless, wouldn’t it be better to have yet another router, sorta separate from the other two, in case they go down for whatever reason? And wouldn’t it be yet better if that new router wasn’t reliant on the Optimum and FiOS lines? And wouldn’t it be even superer betterer if the new router also had two independent internet connections?

Yes.

This isn’t as costly as it sounds, btw. My routers are just commodity hardware (right now they all happen to be Dell T110 II chassis with a bunch of NICs giving 12 ports per router).

The Sprint connection costs ~$15/mo (after taxes and fees) for 1GB per month (more than enough for the veritable trickle of pings that run through it on a regular basis).

And it was cheap enough for me to add a second cell connection via T-Mobile’s network, because I have Google Fi (aka Project Fi) which provides free “data only” SIMs that operate on TMo. (Note that a full Fi phone will choose the best connection amongst TMo, Sprint, and Something Cellular.) The “data only” SIM shares its allowance with my regular Fi user account, so the cost there is negligible. I did, however, purchase a Netgear LB1121 which is a very simple 4G LTE to Ethernet “adapter” (to call it a router would do disservice to actual routers).

Network Diagram or Whatever

To be fair, I think the ASCII diagram was better.

The one thing that might be perplexing about this diagram is the External Backup VPN01 machine in the lower-right.

Perhaps needless to say, the Sprint and TMo connections won’t have static IPs. To make matters worse, they’ll only have one IP each. I did previously use dynamic DNS with the Sprint device, but the Netgear 6100D is a HUGE pile of shit.*

*The biggest embarrassment for the 6100D is that it comes with a telnet interface exposed. Which you can’t turn off. Which has no password. Which lets you view AND EDIT the config files for the entire device. Oh, and did I mention that a config file includes the admin password? IN PLAIN TEXT? Disgusting.

Besides, dynamic DNS would still only afford me one non-redundant IP per connection, and cellular network IPs can change very frequently.

Hence I spooled up an Amazon EC2 instance and installed OpenVPN on it. The backup router at my house connects to it via two independent tunnels, such that if one internet connection/VPN tunnel goes down, traffic will still flow on the other one.

Network Interface Naming

It took me a shockingly long time to figure out that this was a good idea, but I change the udev rules on my systems to rename the network ports to something logical. Usually it’s the name of the network to which the port is connected. So, for example:

File: /etc/udev/rules.d/70-persistent-net.rules

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1f:29:5a:c5:d7", ATTR{type}=="1", KERNEL=="eth*", NAME="ethdev"

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1f:29:5a:c5:d6", ATTR{type}=="1", KERNEL=="eth*", NAME="ethgst"

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="90:e2:ba:69:bf:91", ATTR{type}=="1", KERNEL=="eth*", NAME="ethmgt"

That’s a snippet from one of my core routers. (Note that I’m using CentOS/RedHat; the location and format of that file may differ on other distributions.) The interface names are the NAME values at the end of each rule, and correlate this way:

ethdev = Development network
ethgst = Guest network
ethmgt = Management network
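
If you have a lot of ports to rename, a tiny helper can stamp out rules in the same format as the ones above. This is just a sketch: the function name udev_net_rule is made up for illustration, and the rule layout is simply copied from my file, so it may need adjusting on other distributions. It also refuses names longer than 15 characters, which is the Linux limit for interface names.

```shell
# Print a udev rule that renames the NIC with the given MAC address.
# Usage: udev_net_rule MAC NEW_NAME
# (udev_net_rule is a hypothetical helper, not part of udev.)
udev_net_rule() {
    # sysfs reports MAC addresses in lowercase, so normalise the input.
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    name=$2
    # Linux interface names are limited to 15 characters.
    if [ "${#name}" -gt 15 ]; then
        echo "interface name too long: $name" >&2
        return 1
    fi
    printf 'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="%s", ATTR{type}=="1", KERNEL=="eth*", NAME="%s"\n' "$mac" "$name"
}
```

Something like udev_net_rule 00:1F:29:5A:C5:D7 ethdev would then print the first rule shown above, ready to be appended to the rules file.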

Of course, if you rename the interfaces here you’ll have to rename them anywhere else. grep -R eth0 /etc/* 2> /dev/null should find every existing use of eth0 if, for example, that were the name of the interface before the change. Particularly look at your network configuration scripts (/etc/sysconfig/network-scripts/ifcfg-* in my case) and your firewall rules which may or may not specify interface names.
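
Here’s a slightly more careful take on that grep, as a sketch. The function name find_iface_refs is my own invention; the only real difference from the one-liner is that it matches whole words (so eth0 doesn’t also hit eth01) and prints just the file names.

```shell
# List files under a directory that still mention an old interface name.
# Usage: find_iface_refs OLD_NAME [DIR]   (DIR defaults to /etc)
# (find_iface_refs is a hypothetical helper name.)
find_iface_refs() {
    old_name=$1
    search_dir=${2:-/etc}
    # -r: recurse, -l: print file names only, -w: match whole words only.
    # Errors from unreadable files are discarded.
    grep -rlw -- "$old_name" "$search_dir" 2>/dev/null | sort
}
```

Run it as root if you want it to see files that a regular user can’t read.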

Strictly speaking, it’s not necessary to start the interface name with “eth“, but I stick with that to distinguish, for example, hardline ethernet interfaces from VPN tunnel or WLAN interfaces.

And likewise I also name the VPN tunnels, usually based upon what’s on the opposite end of the tunnel. But in the case of this article, I named them based upon the ISP via which the traffic transits.

OpenVPN Server Configuration Files

I’m using OpenVPN 2.4.7. If you’re using a different version, the options presented here may differ. But this should be acceptable for many a version.

Per my poorly constructed diagram above, I want to connect a router at my house (rtr-backup01) to an Amazon EC2 instance in “the cloud” (ext-backup-vpn01).

The EC2 host is a nano instance, incidentally. One CPU core, 1GB RAM, and 8GB disk space. That’s actually more than what’s required for this purpose, so don’t go overboard in a similar circumstance.

There will be two VPN tunnels connecting those two hosts, which will be redundant to each other. One tunnel will be connected via Sprint, and the other via T-Mobile.

Here’s the tmobile server config:

(I show the sprint configs all together down below so you can see the differences, though they’re broadly similar.)

port 1199
proto tcp
dev tuntmobile
ca ext-backup-vpn01/ca.crt
cert ext-backup-vpn01/ext-backup-vpn01.crt
key ext-backup-vpn01/ext-backup-vpn01.key
dh ext-backup-vpn01/dh2048.pem
server 10.208.3.0 255.255.255.0
push "route 172.31.41.125 255.255.255.255"
push "route 172.31.41.126 255.255.255.255"
push "route 10.71.246.0 255.255.255.0"
client-connect ext-backup-vpn01/ccd/client-connect-tmobile.bsh
client-disconnect ext-backup-vpn01/ccd/client-disconnect-tmobile.bsh
route-metric 10
client-config-dir ext-backup-vpn01/ccd
topology p2p
cipher AES-128-CBC
comp-lzo
tcp-nodelay
persist-key
#persist-tun
keepalive 5 30
status /var/log/openvpn/ext-backup-vpn01-tmobile.status
log /var/log/openvpn/ext-backup-vpn01-tmobile.log
verb 3
mute 20

The port, protocol, and dev fields are pretty standard and self explanatory.

Same goes for the ca, cert, key and dh fields. I won’t get into the generation of certificates (etc.) here, but there are plenty of good tutorials on the subject.

server must be different between the two tunnels, otherwise it’ll lead to confusion when trying to route traffic. This essentially defines the network that will be used within the VPN tunnel, between the server and client. (In this case there’s only ever going to be one client, but all clients would be allocated an address in this space.)

The push commands tell the clients which networks are accessible via the tunnel, on the server side. In this example, the two addresses beginning with 172.31.41 are the private network addresses of the EC2 instance, as assigned by Amazon. The network 10.71.246.0 is used by a different VPN instance, allowing me to connect to ext-backup-vpn01 from anywhere.

These are the two most important configuration items, at least as far as making these redundant tunnels function properly:

client-connect and client-disconnect specify shell scripts that are run when the client connects and then disconnects, respectively. In my case, the purpose of those scripts is to establish routes to the networks behind each client when they connect, and to tear down those routes when they disconnect. I’ll post the full code for those below.

route-metric is essentially ignored, as the two scripts mentioned above set the routes and their metrics. Usually this setting would be used to establish the metric for routes created by OpenVPN, e.g. with the route configuration option. I left it in the config as a reminder: The tmobile routes have a metric of 10 whereas the sprint routes have a metric of 20.

client-config-dir points to a directory that contains various configuration options specific to each client. I’ll also show that below.

topology p2p specifies that it’s a point-to-point configuration. (Not valid when using Windows.) Here’s a more robust discussion of that option.

cipher, comp-lzo, and persist-key are pretty standard options. See the OpenVPN reference manual for more info on these and all other options.

persist-tun may be essential for other use cases, as it causes the tunnel interface (i.e. tuntmobile) to remain even when there’s no connectivity between server and client. You may have some scripts or programs that rely on finding your tunnel’s interface, or it may be referenced elsewhere. For example, I’m not sure what would happen if you referenced a transient network interface in your iptables config. In my case, I want the tunnel interface to be torn down when the tunnel isn’t established.
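
One side effect of leaving persist-tun off is that, at least on the client side (where up-delay is also set), the mere existence of the tunnel interface is a rough hint as to whether the tunnel is established. A sketch of a check along those lines follows. The function name tunnel_status and the SYS_NET_DIR override are assumptions of mine (the override only exists so the function can be tried against a scratch directory), and “present” is not a guarantee that traffic is actually flowing.

```shell
# Report whether each named tunnel interface currently exists.
# Usage: tunnel_status tuntmobile tunsprint
# SYS_NET_DIR is a hypothetical override; by default Linux lists
# interfaces under /sys/class/net.
tunnel_status() {
    sys_net_dir=${SYS_NET_DIR:-/sys/class/net}
    for dev in "$@"; do
        if [ -e "$sys_net_dir/$dev" ]; then
            echo "$dev present"
        else
            echo "$dev absent"
        fi
    done
}
```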

Another important option: keepalive [interval] [timeout]. The interval parameter is how often (in seconds) each side “pings” the other when the tunnel is otherwise idle. The timeout parameter is the amount of time without hearing anything from the peer that would elapse before OpenVPN decides the tunnel is actually down. Importantly, when the server decides the tunnel is down, the client-disconnect script is run.

You may need to fine-tune keepalive to suit your needs, but remember that the timeout governs how long the primary tunnel will be down before its routes disappear, thereby allowing the secondary tunnel to take over traffic. Also note that, per the OpenVPN reference manual, a server-mode instance doubles the timeout on its own side. So keepalive 5 30 means the client gives up after 30 seconds, but the server waits 60.

Due to the routing metric of the tmobile tunnel being lower (10) than that of the sprint tunnel (20), tmobile is the primary tunnel. So when that connection silently goes down, it can take up to a minute or so (rather than 30 seconds) for sprint to take over.
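
If you want to see which tunnel is currently carrying traffic for a given network, you can look for the lowest-metric route to it. The following is a sketch that reads ip route show output on stdin; the function name active_dev is made up, and it assumes the usual “PREFIX via GATEWAY dev DEVICE metric N” layout.

```shell
# Print the device of the lowest-metric route to a destination prefix.
# Usage: ip route show | active_dev 10.201.0.0/16
# (active_dev is a hypothetical helper name.)
active_dev() {
    awk -v dest="$1" '
        $1 == dest {
            dev = ""; metric = 0    # a route with no "metric" keyword has metric 0
            for (i = 2; i < NF; i++) {
                if ($i == "dev")    dev = $(i + 1)
                if ($i == "metric") metric = $(i + 1) + 0
            }
            if (!found || metric < best_metric) {
                found = 1; best_dev = dev; best_metric = metric
            }
        }
        END { if (found) print best_dev }
    '
}
```

With both tunnels up this should print tuntmobile for any of my home networks, and tunsprint once the tmobile routes have been torn down.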

status, log, verb, and mute all relate to logging (and status, natch), and can be set as desired.

Client [Dis]connect Scripts

Incidentally, these scripts don’t need to live in the client-config-dir (named ccd), but that’s where I felt like putting them.

Note that they do need to be readable and executable by the OpenVPN process. So if, for example, openvpn runs in the user:group context of openvpn:openvpn, then you’ll want to chown openvpn:openvpn * and chmod ug+rx * for your scripts (where * would only reference the applicable scripts).

Also, your OpenVPN process must have the ability to create routes in the kernel routing table (though you can use tables other than the main/default table). It can be useful, when troubleshooting, to run the OpenVPN process as root:root. Once everything is working, you can manipulate the user/group context.

Here’s what I have in the script referenced by client-connect ext-backup-vpn01/ccd/client-connect-tmobile.bsh (one also exists for sprint, and is shown much farther down on this page):

#!/bin/bash
while read ROUTE; do
ip route add $ROUTE via $ifconfig_local metric 10 >> /var/log/openvpn/client-connect-tmobile.log 2>&1
done < /etc/openvpn/ext-backup-vpn01/ccd/client-connection-routes
exit 0

And here’s client-disconnect ext-backup-vpn01/ccd/client-disconnect-tmobile.bsh (one also exists for sprint):

#!/bin/bash
while read ROUTE; do
ip route del $ROUTE via $ifconfig_local metric 10 >> /var/log/openvpn/client-disconnect-tmobile.log 2>&1
done < /etc/openvpn/ext-backup-vpn01/ccd/client-connection-routes
exit 0

Both of those files reference the file /etc/openvpn/ext-backup-vpn01/ccd/client-connection-routes, which in my case contains:

10.201.0.0/16
10.253.0.0/16
10.1.1.0/24
10.1.2.0/24
192.168.0.0/21
192.168.10.0/24
10.250.0.0/16
10.101.0.0/16
10.121.0.0/16
192.168.90.0/24
192.168.81.0/24

Each of the networks above is accessible on the client end of the tunnels.

The scripts iterate through each line of /etc/openvpn/ext-backup-vpn01/ccd/client-connection-routes, calling ip route add or ip route del to either establish or remove the routes when the client-connect or client-disconnect scripts are called.

The only difference between the client-connect and client-disconnect scripts above is that one contains add and the other contains del.

The only difference between the tmobile version of the scripts shown above and the sprint versions is the metric. (And, as you can see, the name of the log file.. which is not required, but may help with debugging.)

The astute viewers amongst you will say “WTF? That could all be done with one script!”

Kinda.

Because I’m running two separate instances of OpenVPN servers, each one needs both a connect and disconnect script. (That’s 4 total.) Those scripts could then call a single script which would do all the route manipulation. I dunno, what I have is pretty functional, but yes, it could be a bit more streamlined.
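
For the curious, the shared piece might look something like this. It’s a sketch, not what I’m actually running: the function name apply_routes is invented, and IP_CMD is a hypothetical override that lets you do a dry run with echo instead of really touching the routing table. It relies on $ifconfig_local being set by OpenVPN, same as the scripts above.

```shell
# apply_routes ACTION METRIC ROUTES_FILE
#   ACTION is "add" or "del"; METRIC is 10 (tmobile) or 20 (sprint) here.
# IP_CMD is a hypothetical override for dry runs (e.g. IP_CMD=echo).
apply_routes() {
    action=$1
    metric=$2
    routes_file=$3
    case $action in
        add|del) ;;
        *) echo "usage: apply_routes add|del METRIC ROUTES_FILE" >&2; return 2 ;;
    esac
    while read -r route; do
        [ -n "$route" ] || continue    # skip blank lines
        ${IP_CMD:-ip} route "$action" "$route" via "$ifconfig_local" metric "$metric"
    done < "$routes_file"
    # Like the original scripts, report success even if a route command failed.
    return 0
}
```

Each of the four scripts would then shrink to a single call, e.g. apply_routes add 10 /etc/openvpn/ext-backup-vpn01/ccd/client-connection-routes for the tmobile connect script.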

Note that OpenVPN sets a whole bunch of environment variables in the context of each script when calling it. See the OpenVPN reference manual for a full list. (The document doesn’t appear to have anchor tags, but search the page for “bytes_received”. That’s the first variable in the list.)

So you could have all sorts of caveats (if/then) and other functionality within those scripts. If you had multiple clients connecting to the same server instance, those variables would tell you who that client is, and as such you could take different actions for different clients. It’s actually a pretty robust arrangement.
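
As a small illustration of that idea, OpenVPN exports the client certificate’s name as $common_name to the connect and disconnect scripts, so a script could pick the metric based on who is connecting. This is only a sketch (the function name metric_for_client is made up), using the two client names from this article.

```shell
# Pick a route metric from the client certificate's common name.
# Usage inside a client-connect script: metric_for_client "$common_name"
# (metric_for_client is a hypothetical helper name.)
metric_for_client() {
    case $1 in
        client-tmobile01) echo 10 ;;
        client-sprint01)  echo 20 ;;
        *)                return 1 ;;   # unknown client: let the caller decide
    esac
}
```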

The only environment variable I’m using is $ifconfig_local, which is the IP address of the server on its end of the VPN tunnel. So, in the examples above, 10.208.3.0 255.255.255.0 is the VPN’s network (defined by the server option in the config file), and so 10.208.3.1 is the server’s IP. Thusly, $ifconfig_local is 10.208.3.1.

The last bit of the configs are the client config directory files.

Here’s the contents of ext-backup-vpn01/ccd/client-tmobile01.

BTW, that directory is defined in the main OpenVPN config file by the parameter client-config-dir, and the file name (client-tmobile01) is the X509 name of the client certificate (defined when you created the certificate).

ifconfig-push 10.208.3.100 10.208.3.1
iroute 10.201.0.0 255.255.0.0
iroute 10.253.0.0 255.255.0.0
iroute 10.1.1.0 255.255.255.0
iroute 10.1.2.0 255.255.255.0
iroute 192.168.0.0 255.255.248.0
iroute 192.168.10.0 255.255.255.0
iroute 10.250.0.0 255.255.0.0
iroute 10.101.0.0 255.255.0.0
iroute 10.121.0.0 255.255.0.0
iroute 192.168.90.0 255.255.255.0
iroute 192.168.81.0 255.255.255.0

There is something important to note here: iroute does NOT create routes in the kernel routing table. That’s what the scripts above do.

iroute tells OpenVPN itself that it is capable of transiting traffic to that network. Hence every single one of those iroute commands correlates to a network in the file /etc/openvpn/ext-backup-vpn01/ccd/client-connection-routes, above. The routes need to be enumerated in both places.

(I only thought of this just now, but to avoid maintaining two different lists the client-[dis]connect scripts could iterate through the client config file and create a route in the kernel routing table for each of the iroute lines.)
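
A rough sketch of how that might work: read the iroute lines out of the client config file and convert each network/netmask pair into the CIDR form that ip route expects. Both function names are invented for illustration, and the netmask conversion assumes ordinary contiguous masks. An iroute with no netmask is treated as a single host (255.255.255.255), which is OpenVPN’s documented default.

```shell
# Convert a dotted netmask such as 255.255.248.0 into a prefix length (21).
# Assumes a contiguous mask; returns 1 on an octet it does not recognise.
mask_to_prefix() {
    bits=0
    old_ifs=$IFS; IFS=.
    set -- $1            # split the mask into its four octets
    IFS=$old_ifs
    for octet in "$@"; do
        case $octet in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)) ;;
            252) bits=$((bits + 6)) ;;
            248) bits=$((bits + 5)) ;;
            240) bits=$((bits + 4)) ;;
            224) bits=$((bits + 3)) ;;
            192) bits=$((bits + 2)) ;;
            128) bits=$((bits + 1)) ;;
            0)   ;;
            *)   return 1 ;;
        esac
    done
    echo "$bits"
}

# Print NETWORK/PREFIX for every iroute line in a client config file.
# Usage: iroutes_to_cidr ext-backup-vpn01/ccd/client-tmobile01
iroutes_to_cidr() {
    while read -r keyword net mask rest; do
        [ "$keyword" = iroute ] || continue
        mask=${mask:-255.255.255.255}    # iroute without a netmask is a host route
        echo "$net/$(mask_to_prefix "$mask")"
    done < "$1"
}
```

The connect and disconnect scripts could then loop over the output of iroutes_to_cidr instead of reading client-connection-routes.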

OpenVPN Client Config Files

Here’s the OpenVPN conf file for the tmobile client:

client
dev tuntmobile
proto tcp
port 1199
local 10.222.3.5
remote 50.60.70.80
route-metric 10
resolv-retry infinite
persist-key
dh client-tmobile01/dh2048.pem
ca client-tmobile01/ca.crt
cert client-tmobile01/client-tmobile01.crt
key client-tmobile01/client-tmobile01.key
topology p2p
up-delay
cipher AES-128-CBC
comp-lzo
verb 3
status /var/log/openvpn/client-tmobile01.status
log /var/log/openvpn/client-tmobile01.log

There’s nothing too crazy on the client side, but there are a few things to discuss:

local 10.222.3.5 is the address of the ethernet interface which connects to the T-Mobile cell modem / “router” (the Netgear LB1121).

I’ve changed remote to a nonsense address to protect the innocent, but it’s the public (Elastic) IP of my EC2 instance on which the tmobile OpenVPN server runs.

up-delay is probably best defined by the OpenVPN reference manual:

Delay TUN/TAP open and possible --up script execution until after TCP/UDP connection establishment with peer. In --proto udp mode, this option normally requires the use of --ping to allow connection initiation to be sensed in the absence of tunnel data, since UDP is a “connectionless” protocol.

On Windows, this option will delay the TAP-Win32 media state transitioning to “connected” until connection establishment, i.e. the receipt of the first authenticated packet from the peer.

Needless to say, the client configuration for the sprint connection is nearly identical, and is shown below.

The Sprint-Related Files

Just for completeness, here are the full readouts of the files on the sprint server.

Compared with the tmobile files, the differences are the port, the device name, the VPN subnet, the script and log file names, the route metric, and the keepalive interval.

port 1198
proto tcp
dev tunsprint
ca ext-backup-vpn01/ca.crt
cert ext-backup-vpn01/ext-backup-vpn01.crt
key ext-backup-vpn01/ext-backup-vpn01.key
dh ext-backup-vpn01/dh2048.pem
server 10.208.2.0 255.255.255.0
push "route 10.71.246.0 255.255.255.0"
push "route 172.31.41.125 255.255.255.255"
push "route 172.31.41.126 255.255.255.255"
client-connect ext-backup-vpn01/ccd/client-connect-sprint.bsh
client-disconnect ext-backup-vpn01/ccd/client-disconnect-sprint.bsh
route-metric 20
client-config-dir ext-backup-vpn01/ccd
topology p2p
cipher AES-128-CBC
comp-lzo
tcp-nodelay
persist-key
#persist-tun
keepalive 10 30
status /var/log/openvpn/ext-backup-vpn01-sprint.status
log /var/log/openvpn/ext-backup-vpn01-sprint.log
verb 3
mute 20

Note that I used the same server certification authority, certificate, and key file for both servers. It’s perhaps not best practice, but honestly what does it matter… if someone compromises one tunnel’s encryption, then they compromise both. But they’re redundant connections serving the same purpose, so the risk is minimal. You may, of course, use completely different certificates for both.

client-connect-sprint.bsh:

#!/bin/bash
while read ROUTE; do
ip route add $ROUTE via $ifconfig_local metric 20 >> /var/log/openvpn/client-connect-sprint.log 2>&1
done < /etc/openvpn/ext-backup-vpn01/ccd/client-connection-routes
exit 0

client-disconnect-sprint.bsh:

#!/bin/bash
while read ROUTE; do
ip route del $ROUTE via $ifconfig_local metric 20 >> /var/log/openvpn/client-disconnect-sprint.log 2>&1
done < /etc/openvpn/ext-backup-vpn01/ccd/client-connection-routes
exit 0

ext-backup-vpn01/ccd/client-sprint01:

ifconfig-push 10.208.2.100 10.208.2.1
iroute 10.201.0.0 255.255.0.0
iroute 10.253.0.0 255.255.0.0
iroute 10.1.1.0 255.255.255.0
iroute 10.1.2.0 255.255.255.0
iroute 192.168.0.0 255.255.248.0
iroute 192.168.10.0 255.255.255.0
iroute 10.250.0.0 255.255.0.0
iroute 10.101.0.0 255.255.0.0
iroute 10.121.0.0 255.255.0.0
iroute 192.168.90.0 255.255.255.0
iroute 192.168.81.0 255.255.255.0

Here’s the configuration file on the sprint client.

client
dev tunsprint
proto tcp
port 1198
local 10.222.2.5
remote 10.20.30.40
route-metric 20
resolv-retry infinite
persist-key
#persist-tun
dh client-sprint01/dh2048.pem
ca client-sprint01/ca.crt
cert client-sprint01/client-sprint01.crt
key client-sprint01/client-sprint01.key

cipher AES-128-CBC
topology p2p
up-delay
comp-lzo
verb 3
status /var/log/openvpn/client-sprint01.status
log /var/log/openvpn/client-sprint01.log

In Conclusion

With both tunnels providing routes to my home infrastructure via Amazon’s network and my EC2 instance, I have the ability to have unlimited static, public IPs for the Sprint and T-Mobile connections.

Using iptables’ DNAT manipulation, I can reverse NAT those public IPs to any internal IP addresses I desire.
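
To give a flavor of what such a rule looks like, here’s a sketch that only prints the iptables command rather than running it. The function name dnat_rule is made up, and the inside address in the example (10.1.1.10) is a hypothetical host on one of my home networks. On EC2 the rule matches the instance’s private address, since that’s what the public (Elastic) IP is mapped to. IP forwarding also has to be enabled, and the FORWARD chain has to permit the traffic.

```shell
# Print (not run) a DNAT rule forwarding one TCP port from an EC2 private
# address to a host behind the tunnels.
# Usage: dnat_rule EC2_PRIVATE_ADDR PORT INSIDE_ADDR
# (dnat_rule is a hypothetical helper name.)
dnat_rule() {
    ec2_addr=$1
    port=$2
    inside_addr=$3
    echo "iptables -t nat -A PREROUTING -d $ec2_addr -p tcp --dport $port -j DNAT --to-destination $inside_addr:$port"
}
```

For example, dnat_rule 172.31.41.125 443 10.1.1.10 prints a rule that would send HTTPS traffic arriving at that address to the (hypothetical) internal host.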

Moreover, I have a separate VPN server running on the EC2 instance which will allow me to connect to it, and therefore my entire infrastructure, using an OpenVPN client on one of my laptops, tablets, or phones. That’s particularly useful when I’m traveling and my network goes dark. Up until now, if my cable and FiOS connections went down or my core routers went down, I’d have no visibility as to what happened. This was also true if I became subject to a DDoS attack.

Finally, by having the backup router and backup internet connections, I can route outgoing mail through them as a redundant path. That means that my Zabbix servers (for system monitoring) and other scripts can communicate issues to me even during a widespread outage.

Overkill?

Definitely.

Fun?

Definitely.

Though your mileage may vary ;)

Gratuitous Pics

Backup Router (Dell PowerEdge T110 II)

This is the backup router which maintains the VPN tunnels via Sprint and T-Mobile to Amazon’s network (and hence my EC2 instance). In addition to the connections for those two cellular ISPs, it also connects to my FiOS line for direct VPN access. The other connections are for various in-house networks.

Netgear / T-Mobile LB1121 WWAN to LAN Router

This is the Netgear LB1121 which provides connectivity to T-Mobile’s network. It’s not exactly feature rich, but it serves the purpose of providing an ethernet port routed to T-Mobile. It does have PoE, though, which is pretty awesome. Here I’m just using the internal antennae, and as you can see I get mediocre service in the basement. (I may put this upstairs eventually… hmmm.)

Netgear / Sprint 6100D WWAN to LAN Router

This is the Netgear 6100D, providing connectivity via Sprint’s network. Even though the software of this device is terrible, it’s pretty good hardware-wise. It even has PoE! (But only on the WAN port for some bizarre reason. That’s why it has 2 ethernet cables running to it; One is just for power.) There’s also a coax cable attached to it, connecting to a directional antenna in my attic! I did a whole video about that install, which you can check out if you’re bored. :)

About Scott

I'm a computer guy with a new house and a love of DIY projects. I like ranting, and long drives on your lawn. I don't post everything I do, but when I do, I post it here. Maybe.