Tuesday, December 18, 2012

TTL in DNS part 3

When we last talked in part 2, we were discussing what TTL is not (it is not how long the name server takes to update) and what it is (it is how long a computer keeps a record in its own local cache before asking for it again).

We ended with an object lesson where www.TheEmailAuthority.com has a 1 hour TTL on its A record, and I change the A record and shut down the server at my old IP promptly at 3 PM.

User1 was just reading my site at 2:30, so their own local cache holds the old A record until 3:30. If they go to load another page anytime between 3:00 and 3:30, they won't be able to load it.

However, User2 was lucky enough to request my A record at 3:10, so they've got twenty full minutes of Email Authority goodness while User1 is still just reading an error screen.
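
If it helps to see the timeline spelled out, here is a minimal sketch of the two caches. The function and the "old"/"new" labels are mine, the date is arbitrary, and it assumes every client honors the TTL exactly.

```python
from datetime import datetime, timedelta

TTL = timedelta(hours=1)                     # TTL on the A record
CHANGE_TIME = datetime(2012, 12, 18, 15, 0)  # record changed at 3:00 PM (date is arbitrary)

def record_seen(cached_at, now):
    """Return "old" or "new": which A record a client uses at time `now`,
    given that it last looked the record up at `cached_at`."""
    if now < cached_at + TTL:
        lookup_time = cached_at   # cache still valid, reuse the stored answer
    else:
        lookup_time = now         # cache expired, ask the name server again
    return "new" if lookup_time >= CHANGE_TIME else "old"

def at(hour, minute):
    """Helper: a time of day on the (arbitrary) example date."""
    return datetime(2012, 12, 18, hour, minute)

user1_cached = at(14, 30)  # User1 last looked up the record at 2:30
user2_cached = at(15, 10)  # User2 looked it up at 3:10

print(record_seen(user1_cached, at(15, 15)))  # old
print(record_seen(user2_cached, at(15, 15)))  # new
print(record_seen(user1_cached, at(15, 30)))  # new
```

At 3:15 User1 is still stuck on the old record while User2 already has the new one. User1 only catches up once their cache expires at 3:30.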

We've now seen how two users, each with their own cache and their own expiry time, respond to a DNS change. This simple example only used one record type, and only one "hop" between the client and the server. Let's take this to the next level and try a more complex, email-related change.

Domain1.com is changing their email server, and their online ordering system sends them an email every time someone places an order. Because of this, they cannot afford any downtime. We'll presume that they're open 24/7/365, so late-night and weekend migrations are out of the question. Take these three examples into consideration, presuming the following zone file:

domain1.com. 86400 IN NS ns1.domain1s-ISP.com. 
domain1.com. 86400 IN NS ns2.domain1s-ISP.com. 

domain1.com. 14400 IN MX 10 mail.domain1.com.

mail.domain1.com. 3600  IN A  10.11.12.13

Example 1: Change the MX record

Domain1.com's email administrator sets up mail2.domain1.com and changes the MX record to point to it. He sends a test email from his Gmail account, and it routes correctly. However, the online ordering emails are not coming through to mail2.domain1.com.

This is because the online ordering server still has the old MX record cached, and keeps sending the order emails to the old server until it updates its cache sometime in the next 4 hours. It could be less than that depending on when the online ordering server last cached the record, but there is still an outage and the admin loses his job.
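
To put a number on "sometime in the next 4 hours", here is a rough sketch. The function name and the example times are made up for illustration. It assumes the sender honors the 14400 second TTL from the zone file, and that `cached_at` is the sender's most recent lookup before the change.

```python
from datetime import datetime, timedelta

MX_TTL = timedelta(seconds=14400)  # 4 hours, from the zone file

def stale_time_after_change(cached_at, changed_at, ttl=MX_TTL):
    """How long after the change a sender keeps using its cached (old) MX record."""
    if cached_at >= changed_at:
        return timedelta(0)  # looked up after the change: already has the new record
    remaining = cached_at + ttl - changed_at
    return max(remaining, timedelta(0))

changed_at = datetime(2012, 12, 18, 12, 0)  # hypothetical change time (noon)

# Sender cached the MX record 1 hour before the change
print(stale_time_after_change(datetime(2012, 12, 18, 11, 0), changed_at))  # 3:00:00

# Sender's last lookup was 5 hours before the change, so its cache had already expired
print(stale_time_after_change(datetime(2012, 12, 18, 7, 0), changed_at))   # 0:00:00
```

The worst case is a sender that cached the record just before the change, which leaves it on the old server for very nearly the full 4 hours.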

Moral of this story: MX record changes aren't good enough.

Example 2: Set the TTL to a low value and change the IP of the A record

The next admin understands TTL. On the morning of the migration he logs in to his DNS interface and lowers the A record's TTL to 1 minute. That evening he changes the A record's IP address to point to the new server.

This way the MX record still points to the same hostname (mail.domain1.com), so any cached MX records stay valid, and the short TTL means that the connecting mail servers *should* pick up the new IP within a minute.
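
The reason for lowering the TTL hours ahead of time is that a cache which grabbed the record just before the TTL was lowered still holds it for the old TTL. A quick sketch of that arithmetic follows. The function name and the 9 AM and 6 PM times are my own placeholders, and it assumes caches honor the TTL they were given.

```python
from datetime import datetime, timedelta

OLD_A_TTL = timedelta(seconds=3600)  # A record TTL from the zone file

def earliest_safe_change(ttl_lowered_at, old_ttl=OLD_A_TTL):
    """Earliest time by which every TTL-honoring cache has picked up the short TTL."""
    return ttl_lowered_at + old_ttl

lowered_at = datetime(2012, 12, 18, 9, 0)   # hypothetical: TTL lowered at 9:00 AM
migration = datetime(2012, 12, 18, 18, 0)   # hypothetical: IP changed at 6:00 PM

print(earliest_safe_change(lowered_at))               # 2012-12-18 10:00:00
print(migration >= earliest_safe_change(lowered_at))  # True
```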

Unfortunately, the owner's wife uses an AOL address, and AOL caches records for its own pre-set period (24+ hours) no matter what the TTL says. Her email wasn't coming through, so the new admin got fired too. He at least got a severance package...


Example 3: Set the original server to relay mail to the new server

The final admin understands that things don't always go according to plan, and leaves the old server online for several days after the migration. He sets the old server to act as a mail relay, though, so instead of delivering locally to any of the users (and thereby interrupting mail flow) it passes everything along to the new server. He waits until no legitimate email has been seen in its logs for a day or so, and then takes the server offline.

This admin also could have used a mail relay service such as a hosted anti-spam service to act as the relay for extra points :-)
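
The "wait until the logs go quiet" rule from Example 3 can be sketched as a simple check. Everything here is hypothetical: the function name, the one day threshold, and the idea that you have already pulled the timestamps of legitimate messages out of the old server's logs. It also assumes you have been watching the logs for at least the whole quiet period.

```python
from datetime import datetime, timedelta

QUIET_PERIOD = timedelta(days=1)  # "a day or so"

def safe_to_decommission(legit_mail_times, now, quiet_period=QUIET_PERIOD):
    """True if no legitimate mail has hit the old server within the quiet period."""
    if not legit_mail_times:
        return True  # nothing legitimate seen at all while we were watching
    return now - max(legit_mail_times) >= quiet_period

now = datetime(2012, 12, 21, 9, 0)          # hypothetical "today"
seen = [
    datetime(2012, 12, 18, 19, 5),          # stragglers right after the migration
    datetime(2012, 12, 19, 23, 40),         # last legitimate message relayed
]

print(safe_to_decommission(seen, now))                                   # True
print(safe_to_decommission(seen + [datetime(2012, 12, 21, 8, 0)], now))  # False
```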


Well, that was a much longer look into TTL than anyone would ever want. Hopefully you enjoyed it, and can now impress your friends (or put them to sleep) with your new-found TTL skills.



-TEA