Redundant email servers with soft-fail (450) vs. hard-fail (550)

I manage a fairly large number of incoming mail exchangers, which are numerous both to handle large message volumes and to provide redundancy.

In most cases, these mail servers are Postfix with MySQL providing virtual alias maps, transport maps, relay domains, and virtual alias domains. Unfortunately, the Postfix+MySQL combination isn’t completely reliable. On very rare occasions a Postfix instance may fail to communicate with the MySQL server, for any number of reasons.

From the perspective of the sender’s MX, this usually results in a 5xx status code such as 550 or 554 (often given as “Relay access denied”). This is a hard-fail, in that it tells the upstream MX that the recipient it’s trying to reach is permanently unavailable. The upstream MX then gives up on delivery and sends a bounce notification to the sender. This behavior is highly undesirable for the purposes of redundancy, because the message will never be retried at another one of my MX servers (which is probably working just fine)!

The solution is to tell Postfix to soft-fail in the case of any undeliverable mail. For example, I have the following settings in my main.cf:

access_map_reject_code = 450
maps_rbl_reject_code = 450
reject_code = 450
relay_domains_reject_code = 450
unknown_local_recipient_reject_code = 450
unknown_relay_recipient_reject_code = 450
unknown_virtual_alias_reject_code = 450
unknown_virtual_mailbox_reject_code = 450

A 4xx code is generally interpreted by mail exchangers as a temporary condition, encouraging them to retry delivery at a later time. Most mail exchangers are smart enough to then try other MX records in DNS for delivery.
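To make the distinction concrete, here is a minimal Python sketch of how a sending server might sort reply codes into "keep trying" and "give up". The function name and the labels it returns are my own invention for illustration; real MTAs apply more nuance than this.

```python
def classify_reply(code: int) -> str:
    """Map an SMTP reply code to the sender's expected reaction (illustrative only)."""
    if 200 <= code < 300:
        return "delivered"  # 2xx: the receiving server accepted the message
    if 400 <= code < 500:
        return "retry"      # 4xx: soft-fail, keep the message queued and try again
    if 500 <= code < 600:
        return "bounce"     # 5xx: hard-fail, return the message to the sender
    return "other"          # e.g. 3xx intermediate replies during the SMTP dialogue


print(classify_reply(450))  # retry
print(classify_reply(550))  # bounce
```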

However, there are downsides to a soft failure (4xx code).

First, you’ll see a whole lot of retries for messages that have truly invalid recipients. In fact, with 99.999% MX uptime, the vast majority of retries will be for spam messages, mis-typed addresses, spammer bounces, etc. This will result in a lot of extra traffic and a lot of extra load on mail servers. Of course, it will also result in lots of deliveries of valid messages in the case of an outage on one MX!

Second, in the case of a mis-typed address (or any other case where the mail is really not deliverable) it could take over a week for the sender to get a bounce back (undeliverable notification). In the meantime they would probably assume that their message went through, and that the recipient was just ignoring them!

So what’s the solution?

First, let’s look at what a properly-configured (read: non-spammer) mail exchanger will do when trying to deliver a message:

  1. Obtain from DNS the MX records for the destination domain.
  2. Attempt delivery to the MX with the lowest number in its priority field.
  3. If delivery doesn’t succeed (generally because the server could not be contacted or the server gave a soft failure) then:
    • Go back to step #2 using the next-highest priority number, or cycle back to the lowest number if there is no higher number.

That’s a simplification, but the idea is that delivery will be tried to each successive MX server in a loop until delivery succeeds or a hard-fail status is received.
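The loop above can be sketched in Python. This is a toy simulation under my own assumptions, not how any particular MTA is implemented: `attempt_delivery`, `try_host`, and the hostnames are hypothetical, and a real sender waits between rounds rather than looping immediately.

```python
from typing import Callable, List, Optional, Tuple

MX = Tuple[int, str]  # (priority number from DNS, hostname)


def attempt_delivery(
    mx_records: List[MX],
    try_host: Callable[[str], Optional[int]],
    max_rounds: int = 3,
) -> str:
    """Walk the MX list from lowest to highest priority number, repeating
    the whole list up to max_rounds times. try_host returns the SMTP reply
    code, or None if the server could not be contacted."""
    for _ in range(max_rounds):
        for _, host in sorted(mx_records):
            code = try_host(host)
            if code is None:
                continue  # unreachable: move on to the next MX
            if 200 <= code < 300:
                return f"delivered via {host}"
            if 500 <= code < 600:
                return f"bounced by {host} ({code})"
            # any other code (typically 4xx) is treated as a soft-fail:
            # move on to the next MX
    return "still queued"


records = [(10, "mx1.example.com"), (20, "mx2.example.com"), (30, "mx3.example.com")]

# mx1 soft-fails, mx2 accepts the message.
replies = {"mx1.example.com": 450, "mx2.example.com": 250, "mx3.example.com": 550}
print(attempt_delivery(records, replies.get))  # delivered via mx2.example.com

# mx1 soft-fails, mx2 is unreachable, mx3 hard-fails.
replies = {"mx1.example.com": 450, "mx3.example.com": 550}
print(attempt_delivery(records, replies.get))  # bounced by mx3.example.com (550)
```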

The best solution that I’ve found is to set up the last MX server (the one with the highest priority number in DNS) to hard-fail with 5xx status codes, while the others are set to soft-fail with 4xx status codes.

Spammers will generally attempt delivery to any arbitrary MX server and will not obey the priority order in DNS. So there will be some unnecessary retries. But this will cut down on traffic from accidental delivery attempts (e.g. mis-typed addresses), as well as from bounce notifications from valid MX servers (e.g. if a spammer is using one of your users’ email addresses to send mail).

Valid senders with invalid recipients will receive a timely bounce back, though it may take a few minutes depending upon the timeouts in the upstream MX.

Most importantly, delivery of valid and desirable messages will be retried across MX servers until a functional one is found.

Actually, my setup is a bit different: I host incoming mail services for many domains, and want to keep traffic flowing equally across all of my mail exchangers. As such, each domain is assigned three of my MX servers in random order. That means that there is no MX server that will always have the highest priority number in DNS, and so all of those servers are set up to soft-fail.

I think you can see where I’m going: I created another fall-back server which is assigned to users’ domains as a fourth MX record with the highest priority number. This server is set up identically to all the other MX servers, except that it will reply with 5xx status codes in cases where the other servers would give 4xx codes. For almost all valid (non-spam) email, this server will be a last resort.
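As a rough illustration of that layout, the following Python sketch generates the four MX records for a domain. Everything specific here is an assumption made up for the example: the hostnames, the pool of five soft-fail servers, and the priority numbers 10 through 40. It is not my actual provisioning code.

```python
import random
from typing import List, Optional

# Hypothetical server names, for illustration only.
SOFT_FAIL_POOL = [
    "mx1.example.net",
    "mx2.example.net",
    "mx3.example.net",
    "mx4.example.net",
    "mx5.example.net",
]
HARD_FAIL_HOST = "mx-last.example.net"


def mx_records_for(domain: str, rng: Optional[random.Random] = None) -> List[str]:
    """Return zone-file style MX lines: three soft-fail servers in random
    order, then the hard-fail server with the highest priority number."""
    if rng is None:
        rng = random.Random()
    chosen = rng.sample(SOFT_FAIL_POOL, 3)  # three distinct servers, random order
    records = [
        f"{domain}. IN MX {10 * (i + 1)} {host}."
        for i, host in enumerate(chosen)
    ]
    records.append(f"{domain}. IN MX 40 {HARD_FAIL_HOST}.")
    return records


for line in mx_records_for("example.org"):
    print(line)
```

Because the three soft-fail servers are drawn at random per domain, load spreads across the pool, while the hard-fail server is always the one tried last by senders that respect the priority order.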

I say “almost all” valid email because there is some MX software out there that will try DNS MX records in an arbitrary order. However, this behavior is not that common, and the chances that delivery would first be attempted to the hard-fail server while that server was itself having trouble are slim: it would affect a very low number of messages, and so is an acceptable risk.

If you’ve been working with mail servers and SMTP for any length of time, then you know that mail delivery is never as simple as it seems. There are trade-offs and “gotchas” galore when implementing and administering a mail infrastructure. To my mind, the #1 goal of any admin should be ensuring successful mail delivery. Every trade-off should be in furtherance of that result.

It’s like the old saying: Better that 10 guilty persons escape than one innocent suffer.

Better that 10 spam messages get delivered than one important message bounce.

About Scott

I'm a computer guy with a new house and a love of DIY projects. I like ranting, and long drives on your lawn. I don't post everything I do, but when I do, I post it here. Maybe.