Fix for: Keepalived router enters fault state on link down

TL;DR: This is the configuration option you want: dont_track_primary

At work and at home I have pairs of redundant “core” routers in an active-passive (or master-backup, if you prefer) configuration. They consist of commodity hardware, a few 4-port gigabit NICs, and CentOS. All of these machines had been running flawlessly for anywhere from two to six years, depending on when each was put into service or upgraded.

That is, until yesterday, when my primary router at home had an SSD failure that stopped it in its tracks. The backup router took over, and in less than a second traffic was being routed. All of my point-to-point VPNs reconnected within about 20 seconds. In other words, it worked exactly as it should.

Until I turned off power to the broken router. Then everything stopped.

I had made a minor change to my router pair a few months ago, and didn’t think anything of it. Instead of running VRRP traffic through the switch, I had dedicated a NIC port on each machine and connected them directly using a crossover cable. I had only tested the change by bringing the primary router down gracefully; I never pulled the plug.

When the plug was pulled on the broken router, the now-master saw the link go down on the VRRP port and keepalived went into the FAULT state. It gave up its VIPs and basically stopped keeping anything alive.

That behavior can make sense in certain scenarios. For example, if only the NIC port used for VRRP went down on the master router, I wouldn’t want the backup to take the VIPs (and certain routes, etc.) while the master still held them. Likewise, if I had VRRP going through one switch and production traffic going through another, I wouldn’t want a failure on the less important switch to cause VIP conflicts.

In my case, I find it much (much, much, much) more likely that a link going down means one of the machines has died completely. In my experience power supplies and HDDs (or SSDs) are far more likely to fail than a NIC or NIC port. That’s not to say the latter is impossible, but I have to plan for the most likely worst-case scenario.

All that being said, there is one setting for your keepalived.conf to obviate this issue: dont_track_primary

That’s it. It doesn’t have options or qualifiers. From the man page:

# Ignore VRRP interface faults (default unset)
dont_track_primary

From the keepalived changelog:

VRRP : Chris Caputo added "dont_track_primary"
vrrp_instance keyword which tells keepalived to ignore VRRP
interface faults. Can be useful on setup where two routers
are connected directly to each other on the interface used
for VRRP. Without this feature the link down caused
by one router crashing would also inspire the other router to lose
(or not gain) MASTER state, since it was also tracking link status.

Perfect, right?
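
In practice the keyword goes inside the vrrp_instance block, and presumably on both routers, since either one can end up as the survivor. Here is a minimal sketch of such a pair. The interface names (eth0, eth1), the instance name, the router ID, the priorities and the VIP are all placeholders made up for illustration, not values taken from a real setup:

```
# Router A (hypothetical example)
vrrp_instance VI_EXAMPLE {
    state MASTER
    # Dedicated crossover link that carries the VRRP adverts
    interface eth1
    virtual_router_id 51
    priority 100
    advert_int 1
    # Ignore link-down on eth1 instead of entering FAULT
    dont_track_primary
    virtual_ipaddress {
        192.0.2.1/24 dev eth0
    }
}

# Router B (hypothetical example): same except for state and priority
vrrp_instance VI_EXAMPLE {
    state BACKUP
    interface eth1
    virtual_router_id 51
    priority 50
    advert_int 1
    dont_track_primary
    virtual_ipaddress {
        192.0.2.1/24 dev eth0
    }
}
```

With this in place, a dead peer just looks like missing adverts on eth1, so the survivor should promote itself (or stay MASTER) rather than fault.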

Here’s my keepalived configuration, sanitized and edited for brevity:

global_defs {
   notification_email {
     me@mydomain.corn
   }
   notification_email_from rtr-core02@int.meagain.net
   smtp_server 10.80.1.41
   smtp_connect_timeout 30
   router_id RTR-CORE-A
}
vrrp_instance VI_0 {
    state BACKUP
    interface p4p1
    smtp_alert
    virtual_router_id 50
    priority 50
    advert_int 1
    dont_track_primary
    notify_master /etc/keepalived/promotemaster
    notify_backup /etc/keepalived/promotebackup
    authentication {
        auth_type PASS
        auth_pass sanitizedpassword
    }
    virtual_ipaddress {
        192.168.1.1/24 brd 192.168.1.255 dev p3p1 label p3p1:100
        192.168.1.2/24 brd 192.168.1.255 dev p3p1 label p3p1:101
        10.1.1.1/24 brd 10.1.1.255 dev p3p2 label p3p2:100
        10.1.1.2/24 brd 10.1.1.255 dev p3p2 label p3p2:101
        # Many VIPs omitted here for brevity
    }
    virtual_routes {
        158.209.0.99/32 via 78.123.265.1 dev p1p1 table main
        0.0.0.0/0 via 91.59.24.131 dev p1p2 table 50
        193.266.0.0/16 via 91.59.24.131 dev p1p2 table main
        # Many routes omitted here for brevity.  IPs are sanitized/randomized,
        # so some (such as 78.123.265.1) are not valid addresses.
    }
}
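
One caveat worth sketching: dont_track_primary only covers the interface named on the interface line. If you still want keepalived to fault when a production-facing NIC loses link (the scenario where giving up the VIPs does make sense), the track_interface block is the keyword for that. The fragment below is an untested assumption about how it could be combined with the instance above; it is not part of the configuration shown there:

```
vrrp_instance VI_0 {
    interface p4p1
    # Ignore faults on the dedicated VRRP link (p4p1)...
    dont_track_primary
    # ...but still go to FAULT if either production NIC goes down.
    track_interface {
        p3p1
        p3p2
    }
    # Remaining settings as in the full configuration
}
```

Whether you want this depends on the same judgment call as before: which failure is more likely in your environment, and which one you would rather survive.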

I’m hoping that I put enough keywords in this article so that you found it easily. The whole point of this post is to counter the drought of discussion on this topic.

About Scott

I'm a computer guy with a new house and a love of DIY projects. I like ranting, and long drives on your lawn. I don't post everything I do, but when I do, I post it here. Maybe.

3 Comments

  1. “I’m hoping that I put enough keywords in this article so that you found it easily. The whole point of this post is to counter the drought of discussion on this topic.”

    Thanks for making Google a better place ;-)

  2. Thanks for this, solved a problem that was driving me mad :)

  3. Hi Scott,

    I am currently working on a redundant firewall setup with keepalived which seems to be similar to yours. Therefore I’d be grateful if you could answer a few questions:

    * If you add a new virtual_address or virtual_route, do you need to do a failover each time? If yes, does this work reliably for you? I had decided against doing the IP/routing configuration within keepalived for this reason.
    * How many virtual_addresses and virtual_routes do you manage with keepalived?
    * What does advert_int do?

    Best Regards

    Janno
