Thursday, June 3, 2010

Temporary and permanent errors

Typically, you'll only see two types of errors: "temporary" and "permanent". Standard codes can be found in RFC 2821 section 4.2.3 (since superseded by RFC 5321), but many servers reply with alternate codes as well.

Temporary errors are expressed through 4.x.x error codes. These might be due to the recipient server being busy, greylisting the sender, or closing the connection prematurely. The sending server can also generate 4.x.x errors itself if it can't resolve the recipient's DNS, if the recipient's server is offline, if the recipient's server responds but does not fully accept the email, or for any other issue where the recipient server neither accepts the email (with a 2.x.x status) nor outright rejects it (with a 5.x.x status). Sometimes firewalls intercept the packets and mangle or drop them. This can also cause a 4.x.x error: either the recipient's server never receives the packets it needs to accept the message and eventually times out the connection, or the sending server never learns that the recipient's server accepted the message and eventually times out the connection itself.

While temporary errors occur due to transmission issues or issues that are typically expected to resolve themselves, permanent errors (5.x.x) mean that the server was connected to and delivered a response, and that response was a refusal to accept the message... EVER.

Typical reasons for this are that the mailbox is full or nonexistent, the message was blocked as spam or a virus, the mail server has a serious error, or the message violates a policy such as non-local delivery for the domain, too many hops per message, or others. This error means that the server is online, listening for connections, and accepted your connection, but then actively refused to accept the message. It's the difference between sending someone you don't want to talk to to voicemail versus picking up the phone and telling them to stop calling. There is no confusion.
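As a rough illustration of the temporary/permanent split, here is a minimal Python sketch that sorts a server reply line by its leading digit. The function name `classify_reply` and the sample reply texts are made up for this example; real servers word their replies differently, and some use codes loosely.

```python
import re

# A three-digit SMTP reply code, optionally followed by an enhanced
# status code such as "4.7.1" (class.subject.detail).
_REPLY = re.compile(r"^\s*(\d{3})(?:[ -](\d\.\d{1,3}\.\d{1,3}))?")

def classify_reply(line):
    """Label a reply line as accepted, temporary, permanent, other or unknown."""
    match = _REPLY.match(line)
    if not match:
        return "unknown"  # no reply code at all, e.g. a local timeout message
    basic, enhanced = match.groups()
    # Use the enhanced code's class digit when present, else the basic code's.
    first_digit = (enhanced or basic)[0]
    return {"2": "accepted", "4": "temporary", "5": "permanent"}.get(first_digit, "other")

# Hypothetical reply lines, for illustration only.
for reply in ("250 2.0.0 OK",
              "451 4.7.1 Greylisted, please try again later",
              "550 5.1.1 User unknown"):
    print(classify_reply(reply), "<-", reply)
```

Only the first digit matters for deciding whether the sending server should retry (4) or give up and bounce (5); the remaining digits narrow down the reason.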

A sending server can also generate a permanent error as a result of timing out the message after repeated temporary errors, but this will typically be hours or days later, whereas a recipient server's rejection is usually almost instant. In that case there will also be repeated 4.x.x errors in the logs to reference.
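The sketch below models that behaviour in Python: a queue that retries on temporary errors and gives up once the message has been queued too long. The function name, the 30-minute retry interval, and the 5-day queue lifetime are illustrative assumptions; actual values depend on how the sending MTA is configured.

```python
from datetime import timedelta

def delivery_outcome(results,
                     retry_interval=timedelta(minutes=30),
                     max_queue_time=timedelta(days=5)):
    """Walk through per-attempt results ("accepted", "temporary", "permanent")
    and return (final state, attempts made, time spent in the queue)."""
    elapsed = timedelta(0)
    attempt = 0
    for attempt, result in enumerate(results, start=1):
        if result == "accepted":
            return ("delivered", attempt, elapsed)
        if result == "permanent":
            # The recipient server answered and refused: bounce right away.
            return ("rejected by recipient", attempt, elapsed)
        # Temporary error: the message waits another retry_interval in the queue.
        elapsed += retry_interval
        if elapsed >= max_queue_time:
            # The sending server gives up and generates the bounce itself.
            return ("expired in sender's queue", attempt, elapsed)
    return ("still queued", attempt, elapsed)

print(delivery_outcome(["temporary", "temporary", "accepted"]))
print(delivery_outcome(["permanent"]))
print(delivery_outcome(["temporary"] * 300))
```

The last call never gets an answer other than a temporary error, so the bounce comes from the sending server days later, with a long trail of 4.x.x attempts behind it.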

Understanding the difference between these errors is also important for deciding where to troubleshoot first. A permanent error almost always comes from the recipient server, whereas a temporary error could be caused by one of many things.

Sender's side:

DNS issues (queries)
ISP/routing issues
Firewall issues (packet filtering)
Addressing issues

Recipient's side:

DNS issues (serving)
ISP/routing issues
Firewall issues
Greylisting
Rate Limiting
Hardware issues
NAT issues
and many more....


As always, the bounce itself will have a lot of good information. The "How to read a bounceback" post should be the first place to check if an NDR/DSN has been generated. Otherwise, your troubleshooting tools in conjunction with the log data should provide plenty of information to determine the source of the issue.


TEA

Reading message headers

The routing section of the email headers is reasonably straightforward if you keep in mind that it is written backwards: the newest entry is at the top.

1. Look for lines in the headers that say "Received: from".
2. Look for the last "Received: from" line and read upwards.

For instance, let’s assume we’re tracing an email with the following headers:

Received: from unknown (HELO BorderMTA.MyDomain.com) (9.10.11.12) by mail.MyDomain.com with SMTP; 13 Feb 2009 23:23:29 +0200
Received: from outbound.sendingdomain.com (outbound.sendingdomain.com [5.6.7.8]) by BorderMTA.MyDomain.com (8.14.1/8.14.1) with ESMTP id n1DLNSS7000915 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Fri, 13 Feb 2009 16:23:28 -0500
Received: from mail.sendingdomain.com (mail.sendingdomain.com [1.2.3.4]) by outbound.sendingdomain.com (8.14.1/8.14.1) with ESMTP id n1DLNQgX009122 for ; Fri, 13 Feb 2009 16:23:26 -0500
Received: from 10.0.0.100 ([10.0.0.100]) by mail.sendingdomain.com ([10.0.0.1]) with Microsoft Exchange Server HTTP-DAV ; Fri, 13 Feb 2009 21:23:21 +0000

When read from bottom to top, the headers say the message went:

1. From 10.0.0.100 to mail.sendingdomain.com in the first (bottommost) line: “Received: from 10.0.0.100 ([10.0.0.100]) by mail.sendingdomain.com ([10.0.0.1])”
2. From mail.sendingdomain.com to outbound.sendingdomain.com in the second line: “Received: from mail.sendingdomain.com (mail.sendingdomain.com [1.2.3.4]) by outbound.sendingdomain.com”
3. From outbound.sendingdomain.com to BorderMTA.MyDomain.com in the third line: “Received: from outbound.sendingdomain.com (outbound.sendingdomain.com [5.6.7.8]) by BorderMTA.MyDomain.com”
4. And finally it's delivered from BorderMTA.MyDomain.com to mail.MyDomain.com in the last (topmost) line: “Received: from unknown (HELO BorderMTA.MyDomain.com) (9.10.11.12) by mail.MyDomain.com”

You can tell that the sender has an outbound email scanner (outbound.sendingdomain.com) after their mail server and that you have a border MTA before your main mail server.
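If you'd rather not read the hops by eye, the same bottom-to-top walk can be sketched in a few lines of Python. This is only a rough sketch: the regular expression is written for the four example headers above, `trace_route` is a made-up name, and real "Received:" lines vary a lot between mail servers. Where the "from" name is just "unknown", it falls back to the HELO name, which is whatever the connecting server claimed to be.

```python
import re

# The four example "Received:" lines, top to bottom as they appear in the message.
HEADERS = [
    "Received: from unknown (HELO BorderMTA.MyDomain.com) (9.10.11.12) "
    "by mail.MyDomain.com with SMTP; 13 Feb 2009 23:23:29 +0200",
    "Received: from outbound.sendingdomain.com (outbound.sendingdomain.com [5.6.7.8]) "
    "by BorderMTA.MyDomain.com (8.14.1/8.14.1) with ESMTP id n1DLNSS7000915 "
    "(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; "
    "Fri, 13 Feb 2009 16:23:28 -0500",
    "Received: from mail.sendingdomain.com (mail.sendingdomain.com [1.2.3.4]) "
    "by outbound.sendingdomain.com (8.14.1/8.14.1) with ESMTP id n1DLNQgX009122 for ; "
    "Fri, 13 Feb 2009 16:23:26 -0500",
    "Received: from 10.0.0.100 ([10.0.0.100]) by mail.sendingdomain.com ([10.0.0.1]) "
    "with Microsoft Exchange Server HTTP-DAV ; Fri, 13 Feb 2009 21:23:21 +0000",
]

# Captures the "from" name, an optional "(HELO name)", and the "by" name.
_HOP = re.compile(r"^Received: from (\S+)(?: \(HELO ([^\s)]+)\))?.*?\bby (\S+)")

def trace_route(received_lines):
    """Return (sender, receiver) pairs in the order the message travelled."""
    hops = []
    for line in reversed(received_lines):  # oldest hop is at the bottom
        match = _HOP.match(line)
        if match:
            from_name, helo_name, by_name = match.groups()
            hops.append((helo_name or from_name, by_name))
    return hops

route = trace_route(HEADERS)
for number, (sender, receiver) in enumerate(route, start=1):
    print(f"{number}. {sender} -> {receiver}")
```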

Now that we know what we're reading, it's trivial to troubleshoot routing issues or delays that weren't permanent errors and weren't long enough to cause a DSN, but were delays nonetheless. Let's take the above headers and play with the timestamps a bit:

Received: from unknown (HELO BorderMTA.MyDomain.com) (9.10.11.12) by mail.MyDomain.com with SMTP; 14 Feb 2009 01:53:47 +0200
Received: from outbound.sendingdomain.com (outbound.sendingdomain.com [5.6.7.8]) by BorderMTA.MyDomain.com (8.14.1/8.14.1) with ESMTP id n1DLNSS7000915 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Fri, 13 Feb 2009 18:53:13 -0500
Received: from mail.sendingdomain.com (mail.sendingdomain.com [1.2.3.4]) by outbound.sendingdomain.com (8.14.1/8.14.1) with ESMTP id n1DLNQgX009122 for ; Fri, 13 Feb 2009 16:23:26 -0500
Received: from 10.0.0.100 ([10.0.0.100]) by mail.sendingdomain.com ([10.0.0.1]) with Microsoft Exchange Server HTTP-DAV ; Fri, 13 Feb 2009 21:23:21 +0000

Can you find the delay? I included several time zone changes, so don't get duped by them ;-)

If you said it's from outbound.sendingdomain.com to BorderMTA.MyDomain.com you're correct!

It took about 2.5 hours for outbound.sendingdomain.com to successfully deliver to BorderMTA.MyDomain.com. Now that we've got the location of the delay we know where to start looking for the issue. Although the headers simply show where the delay occurred, not the actual cause of the delay, you can save a lot of time just quickly looking at them and isolating the possible trouble areas.
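The time zone arithmetic is easy to get wrong by hand, so here is a small Python sketch that does it for you using the standard library's `email.utils.parsedate_to_datetime`. The server names and timestamps are taken from the example, with the topmost timestamp taken as 01:53:47 +0200 so that it falls just after the hop before it; `hop_delays` is a made-up helper name.

```python
from email.utils import parsedate_to_datetime

# (receiving server, timestamp after the last ";"), top to bottom as in the headers.
# Assumption: the topmost timestamp is taken as 01:53:47 +0200.
RECEIVED = [
    ("mail.MyDomain.com", "14 Feb 2009 01:53:47 +0200"),
    ("BorderMTA.MyDomain.com", "Fri, 13 Feb 2009 18:53:13 -0500"),
    ("outbound.sendingdomain.com", "Fri, 13 Feb 2009 16:23:26 -0500"),
    ("mail.sendingdomain.com", "Fri, 13 Feb 2009 21:23:21 +0000"),
]

def hop_delays(received):
    """Return (sending server, receiving server, delay) for each hop, oldest first."""
    # Reverse so the oldest entry comes first, and parse each date with its offset.
    stamps = [(server, parsedate_to_datetime(date))
              for server, date in reversed(received)]
    delays = []
    for (prev_server, prev_time), (server, time) in zip(stamps, stamps[1:]):
        # Subtracting offset-aware datetimes accounts for the time zones.
        delays.append((prev_server, server, time - prev_time))
    return delays

delays = hop_delays(RECEIVED)
for sender, receiver, delay in delays:
    print(f"{sender} -> {receiver}: {delay}")

slowest = max(delays, key=lambda hop: hop[2])
print("Largest delay:", slowest[0], "->", slowest[1])
```

A negative delay in this kind of output would usually point to a server clock that is off rather than to a message travelling back in time.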

You might have noticed that each intermediary server is listed twice in the headers. When a server receives the message, its own line says “by some.server.com”, and when it passes the message on, the next server's line (the one above) says “Received: from some.server.com”. Often this is enough to tell where the delay occurred. If a server accepted the message and didn’t deliver it for many minutes, hours, or days, it frequently indicates that the recipient’s mail server was unavailable, deferring connections, or “greylisting” the sending server. In that case, the logs you’ll want to read are on the recipient’s mail server (or possibly their firewall, especially if it’s a connection issue or if the firewall has an MTA built in). If that server does not have pertinent log files, contact the sending server’s administrator to see if their logs show any additional information.

Sometimes the sending server may have issues sending an email that are outside of the recipient mail server’s control. A few examples would be retrieving the DNS information of the domain or recipient server, connectivity or routing problems due to ISP issues, or any number of other issues. In these situations you’ll have to contact the sending server's administrator to determine the issue. Because the message never left the sending server, there will not be any logs on the recipient mail server. This is often the best place to start when troubleshooting home-built or custom email applications such as bulk mailing software/vendors, website forms, etc.

In general, if there's a delay you can start by looking at the recipient server logs. These will be the most verbose if the issue is in fact there. If the recipient server is a "border MTA" or another server that is responsible for all of your incoming email, and you're not experiencing issues across the board, you'll probably want to check the sending server first. When the delay shows up at the border MTA but other mail is flowing normally, it's likely (unless it's a rate limit, spam filter, or similar issue) that the problem is on the sender's side. Either way, the issue must be on one of these two servers, so there's no reason to troubleshoot elsewhere unless you see additional delays in the headers (which is unlikely).

In summary, the headers go in order from bottom to top. They should all have timestamps, although they may be in different time zones, or the server clocks could be off. If you have a delay reported to you, isolate the servers that were responsible for the transmission at that moment. Determine what services are running on those servers (anti-spam, rate limits, etc.) and which server is more likely to relate to the issue at hand: if the issue affects all of your email, it might be your border MTA; if it's only this sender, then it's likely on their end. Then read the logs on that server. Once you have the error in hand from the logs, Google it. You're not the first admin to run into this issue, and most often someone has posted it to a forum and gotten an answer.



TEA