Message from mark.duling@biola.edu

Hi Keith,

Well, at the least I'd say OSPF or anycast isn't something we'd do soon, because we're still building out redundancy in our core infrastructure, and we're not even at the ideal point for DNS that we planned out a couple of years ago.  We do have redundant DNS, of course (Infoblox, actually).

I guess I was partly amazed that BIND's forwarder logic is opaque and couldn't quite believe it, and partly wondering whether there was an easy way to improve the situation incrementally for the rare circumstance I mentioned.  I still can hardly believe BIND has internal operating logic that is apparently entirely invisible unless you run a sniffer on the DNS server.

But beyond the near term, we're always open to anything that would help us build a more resilient core, Internet, and DNS infrastructure, and I do want to learn more about what anycast would take.  Such comprehensive improvements will have to wait until some other things fall into place, though.

Mark


Comments




On Thu, Mar 28, 2013 at 09:11:33PM -0700, Mark Duling wrote:
> Well at the least I'd say OSPF or anycast isn't something we'd do soon, because we're still building out redundancy in our core infrastructure and such, and we're not even at our own planned ideal point with DNS as we'd planned it out a couple of years ago. We do have redundant DNS of course (actually Infoblox).

Putting on the flame suit because this is going to go over like a lead balloon...

The whole idea of secondary and tertiary servers is that you assign the closest or most robust server as primary to a specific group, with servers *at disparate geographic locations* as secondary and tertiary.  When the primary becomes unavailable, operating systems should go to their secondary and then tertiary servers.  "Unavailable" may be a timeout due to the server being down, it might be a dropped link, or it may just be high latency and responses not making it back in time - all of which are signals to modern operating systems to move on down their list.

Some operating systems go a step further by sending a query to ALL the DNS servers they have available and using the first reply.  It makes for a little burst of chit-chat early on, but it works - and they usually "latch on" to the fastest of the group until it becomes unavailable or their network connection is reset, so that chatter really is just early on.

Technologies like ANYCAST are great for passing out ONE IP instead of three or four but, really, I see it as adding unnecessary complexity to a relatively elegant system - if you want to have disparate servers, deploy disparate servers and pass them out in the order you want specific groups to use them.  The management is a LOT simpler than deploying ANYCAST (but not nearly as "cool", so I get that).  Unless you're Google, or offering DNS services to hundreds of thousands of users across the globe, in which case ANYCAST (or similar) deployment is simple compared to the other challenges you're facing.

kmw
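The failover behavior Kevin describes lives in the client's resolver configuration.  As a minimal sketch, a Unix /etc/resolv.conf along these lines (all addresses are hypothetical placeholders; timeout/attempts are glibc resolver options that speed up moving down the list):

    # /etc/resolv.conf -- all addresses hypothetical
    # primary: closest/most robust resolver for this group
    nameserver 192.0.2.53
    # secondary and tertiary at disparate locations/paths
    nameserver 198.51.100.53
    nameserver 203.0.113.53
    # fail over sooner than the default 5-second timeout
    options timeout:2 attempts:2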
Message from mark.duling@biola.edu

Hi Dennis,

I misspoke a bit about the trouble we had recently.  The link was up and BGP saw it as up.  The problem wasn't that data couldn't flow to the next hop, but that it couldn't flow to its destination beyond that.  You're right that in this circumstance manual intervention is going to be necessary.  So I muddied the matter with too much detail on the cause of the failure, distracting from my main concern about DNS resiliency.

But it clarifies the question about the DNS problem I saw in a multihomed arrangement like ours, if it can only be caused by a rare network event that requires manual intervention to fix.  I guess the real question comes down to this: should two externally isolated networks, designed to protect the campus from failure of any single path, depend on a single DNS server (or a redundant pair, in our case)?  I think the answer is no, and as I mentioned, a proper secondary on a different network was our planned next step for DNS improvement anyway.  The redundant pair we have now protects us from server hardware failure, but not from path failure, since the pair is a single logical entity.

So I guess if we add a separately addressable secondary DNS member that uses the other ISP path, that would also isolate our DNS setup from a single path failure, through the standard DNS failover behavior in IP stacks that Kevin described.

As for whether distributing outgoing DNS requests to outside DNS servers via the router, in combination with BIND 9's forwarder selection algorithm as I suggested, would work: I think it would be self-defeating, since it would do nothing for incoming DNS requests to our server in the "circuit up but downstream path failure" case I described.  Many requests coming to us, depending on their path, simply wouldn't arrive if they took the problem path.  Shutting down the interface is also what forced BGP to expire the inbound path.

So I think, with the help of the comments offered here, I've worked through the problem and figured out the long-term resolution, unless I'm missing something.  To get the maximum resiliency from DNS in our multihomed environment, we just need to move forward with our planned DNS server expansion, and that should do it.  Let me know if I've got anything wrong or missed anything, but thanks to everyone for their help.

Mark


Message from mark.duling@biola.edu

Hi Kevin,

Your comments were very helpful.  It seems like it should be standard procedure for OS manufacturers to describe their DNS failover algorithms precisely, and to document when they change, instead of the guesswork it is now.  But yes, after thinking it through, I do suspect the simplicity of the traditional DNS failover behavior in IP stacks won't be sufficient for a site like ours.

Mark


Message from jhealy@logn.net

I'm pretty sure that in our case campus anycast really isn't warranted.  We serve out two IPs for recursive DNS.  The first is behind a hardware load balancer, so any one of several servers can answer the query.  The load balancer does health checks, so malfunctioning servers get yanked from the pool within a few seconds.

The second IP runs as a VIP on a pair of servers.  If one server fails, we bring the VIP up on the other guy (manually right now... yech).  CARP or Pacemaker/Corosync would be the proper way to approach things; it just wasn't as high on the wish list as getting off our old secondary.  Better add it to the to-do list.
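For what it's worth, a minimal keepalived sketch of the automated VIP failover John is describing (keepalived is a VRRP implementation in the same spirit as CARP or Pacemaker/Corosync; the interface name and address are hypothetical):

    # /etc/keepalived/keepalived.conf -- hypothetical sketch
    vrrp_instance DNS_VIP {
        state MASTER            # BACKUP on the standby box
        interface eth0
        virtual_router_id 53
        priority 150            # set lower (e.g. 100) on the standby
        advert_int 1
        virtual_ipaddress {
            192.0.2.54/24       # the resolver VIP clients point at
        }
    }

If the MASTER stops sending VRRP advertisements, the standby promotes itself and brings up the VIP, replacing the manual step.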

If something super-drastic happens, we have warm spares (VMs) ready to go for both IPs.

John




Message from mark.duling@biola.edu

One thing I'm not clear on is the DNS client behavior when a forwarder is unreachable.  In other words, if a client's local DNS server has no reachable forwarders, does the local server return a lookup failure to the client, so the client knows to try the next DNS server in its list (gracefully or not, according to the client's logic)?  Or does the local server, after the forwarder times out, return a response that is indistinguishable from a "not found" answer?

Hi Mark,

At least with BIND, there are two ways to do it.  "Fall-back" forwarding (I don't know if that's the right term for it), which I believe is the default, means the recursive server will try to answer the query itself if the forwarders aren't reachable.  Forward-only behavior (the "forward only;" option) means the recursive server will only try its forwarders for an answer.  If they aren't reachable, the server should return a SERVFAIL answer, and the client will move to the next nameserver in its list.  If the server returned NXDOMAIN, that would be a bad thing indeed.
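A minimal named.conf sketch of the two modes (forwarder addresses are hypothetical; in BIND the fall-back behavior John describes is spelled "forward first" and is the default):

    options {
        recursion yes;
        forwarders { 192.0.2.1; 198.51.100.1; };  // hypothetical upstreams
        forward first;   // default: recurse on our own if forwarders fail
        // forward only; // alternative: answer SERVFAIL instead of recursing
    };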

As always, you can test this out by setting up a test recursive server, then pointing it to a non-existent forwarder.
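For instance, against a hypothetical test instance on localhost, the status field in dig's reply header shows what a client would see; with "forward only;" and a dead forwarder it should read SERVFAIL rather than NXDOMAIN:

    dig @127.0.0.1 www.example.com A +time=10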

John


Message from mark.duling@biola.edu

The behavior BIND uses with multiple forwarders is to pick the one with the shortest round-trip time (I'm not sure how often it reevaluates the selection).  It looks to me like there isn't a way to find out which forwarder BIND is actually using at a given time via an API or CLI command of some sort.  Can anyone else confirm whether that is true?
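Short of such a command, the selection can at least be observed on the wire from the server itself, e.g. (interface name and forwarder addresses hypothetical):

    # watch which forwarder named is actually sending queries to
    tcpdump -ni eth0 'udp port 53 and (dst host 192.0.2.1 or dst host 198.51.100.1)'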

Our campus is multihomed and does load sharing: BGP with AS-path prepending inbound, and policy-based routing with IP tracking outbound, so outbound traffic switches to the other circuit when a circuit goes down.  I think others on the list do that as well.
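For readers who haven't built this, a hypothetical Cisco-style sketch of the tracking piece (all addresses invented): the default route via ISP A stays installed only while its next hop answers probes, with a floating static via ISP B underneath.  This is also exactly what breaks in the failure described next, since the probe target itself stays reachable:

    ! hypothetical sketch of next-hop tracking with IP SLA
    ip sla 1
     icmp-echo 203.0.113.1          ! ISP A next hop
    ip sla schedule 1 life forever start-time now
    track 1 ip sla 1 reachability
    ip route 0.0.0.0 0.0.0.0 203.0.113.1 track 1
    ip route 0.0.0.0 0.0.0.0 198.51.100.1 250   ! floating backup via ISP B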

This setup works very well, but occasionally a problem develops with an ISP where the next hop is pingable but the circuit can't carry data, and that defeats the IP tracking until we shut down the interface on the border router.  For example, this happened recently when a fiber was trenched up somewhere in our metropolitan area.  It's a rare event and we know what to do when it happens, but while it's happening, and during troubleshooting, DNS can't do recursive queries and is effectively down for everyone if the DNS server sits on the circuit that went down.  A circuit being down and DNS server trouble can be hard to tell apart.  So during the time it takes to get onsite and figure out what's going on, Internet access is effectively unavailable even though we're multihomed.

What would be wrong with having ACLs on the router that route DNS queries for half of the forwarders list out one ISP path and the rest out the other?  That way, if the DNS server can't do recursive queries and walks its list of forwarders, it would find some forwarders that can resolve external addresses, since some use a different path.
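Concretely, something like this policy-based routing sketch on the border router (all addresses hypothetical: 10.0.0.5 is the resolver, 192.0.2.1 a forwarder assigned to ISP B, 198.51.100.1 ISP B's next hop):

    ! steer queries to half the forwarder list out ISP B
    access-list 110 permit udp host 10.0.0.5 host 192.0.2.1 eq domain
    route-map DNS-SPLIT permit 10
     match ip address 110
     set ip next-hop 198.51.100.1
    !
    interface GigabitEthernet0/1
     ip policy route-map DNS-SPLIT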

As I said, it's a rare event, but the thought occurred to me that this could be a way to keep DNS isolated from it when it does happen.  Or is there a flaw in the routing logic?  Has anyone else experienced this issue, and if so, have you found a way to mitigate it?  Thanks

Sounds like several different problems.  I assume moving to OSPF is not on the table.
Is using anycast for your DNS grid feasible?
We moved to anycasting the DNS addresses, but we were forced to map the old addresses to the new ones as a crutch until the distributed server admins updated their statically addressed machines.

Keith Noah
University Information Technology Services
University of Wisconsin-Milwaukee
Network Operations Center
Cell: 414-810-6789
Office: 414-229-4972

From: "Mark Duling" <mark.duling@BIOLA.EDU>
To: NETMAN@LISTSERV.EDUCAUSE.EDU
Sent: Thursday, March 28, 2013 6:51:32 PM
Subject: [NETMAN] DNS, multiple forwarders, and multiple paths

The behavior that BIND uses with multiple forwarders is that it uses the one with the shortest round-trip time (not sure how often it reevaluates the selection).  It looks to me like there isn't a way to find out what forwarder BIND is actually using at a given time via an api or CLI command of some sort.  Can anyone else confirm whether that is true or not?

Our campus is multihomed and does load-sharing using BGP with AS-PATH prepending inbound, and policy-based routing outbound with ip tracking so policy routing outbound will switch to the other circuit when a circuit goes down.  I think others on the list do that as well.

This setup works very well, but rarely a problem will develop with an ISP where the next hop is pingable but the circuit can't carry data and this defeats ip tracking until we shut the interface down on the border router.  For example a fiber was trenched up somewhere in our metropolitan area recently and this happened.  Though it is a rare event and we know what to do when it happens, when it happens and during troubleshooting the problem dns can't do recursive queries and is effectively down for everyone if the circuit the dns server sits on is the one that went down.  A circuit down and dns server issues can be hard to distinguish.  So during the time it takes to get onsite and figure out what is going on internet access is effectively not available even though we're multihomed.

What would be wrong with having ACLs on the router that route dns queries for half of the forwarders list to one ISP path and the rest to the other ISP.  That way if the dns server can't do recursive queries and goes through the list of forwarders, it would find some forwarding servers that can resolve external addresses since some use a different path.

As I said, it is a rare event, but the thought occurred to me that this could be a way to isolate dns when it does.  Or is there a flaw in this logic using routing?  Has anyone else experienced this issue, and if so have you found a way to mitigate this issue?  Thanks
********** Participation and subscription information for this EDUCAUSE Constituent Group discussion list can be found at http://www.educause.edu/groups/.

********** Participation and subscription information for this EDUCAUSE Constituent Group discussion list can be found at http://www.educause.edu/groups/.

Message from mark.duling@biola.edu

Hi Keith,

Well at the least I'd say OSPF or anycast isn't something we'd do soon, because we're still building out redundancy in our core infrastructure and such, and we're not even at our own planned ideal point with DNS as we'd planned it out a couple of years ago.  We do have redundant DNS of course (actually Infoblox).

I guess I was partly in amazement that BIND has an opaque forwarder logic and couldn't quite believe it, and also partly wondering if there were an easy way to improve the situation incrementally for the rare circumstance I mentioned.  I still can hardly believe BIND has an internal operating logic that is apparently entirely opaque unless you use a sniffer on the dns server.

But in the less than near future we're always open to using anything that would help us build a more resilient core, internet, and dns infrastructure, and I do want to learn more about what it would take for anycast.  But such comprehensive improvements would have to wait until some other things fall into place.

Mark


Infoblox supports anycast; however, I've never used it, so I haven't thought much about how best to integrate it.  Is it possible to host the anycast networks just outside your Internet routers, on each of the ISPs' point-to-point links?  If an anycast network becomes unavailable due to a routing failure, its route won't be redistributed, so it's no longer a viable destination for clients.

You might have a discussion with Infoblox; they should be able to tell you whether it's feasible.

Best,
Matt




Message from mark.duling@biola.edu

Oops.  I typed the converse of what I meant to say.  I meant to say that I suspect the traditional DNS failover mechanism in the IP stack (and registrar) WILL be sufficient for a site like ours, with the addition of a proper secondary.




If the first server a client queries is down, the client has to wait for that query to time out before it tries the next server on its list.  This is what makes anycast so attractive: the clients only need to know about one server address, and as long as any server is working, the network will route traffic to it.  Clients can be configured with a second server as a last resort (for example 8.8.8.8, Google's public resolver), but unless all of the anycast servers are down, they should never need it.

We're in the process of bringing up anycast DNS on our campus; the plan is to have one resolver directly connected to each of our four core routers - *not* located in the data centers, but co-located with the routers themselves, which are in our old telephone switch locations.  
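As a sketch of the routing side (assuming, hypothetically, an FRR/Quagga daemon on each resolver and 192.0.2.53 as the shared address): each box holds the anycast address on its loopback and advertises it into the IGP, so when a resolver disappears its route is withdrawn and clients are routed to the next-nearest instance:

    # put the shared anycast address on the loopback
    ip addr add 192.0.2.53/32 dev lo
    # advertise it into OSPF area 0
    vtysh -c 'configure terminal' -c 'router ospf' \
          -c 'network 192.0.2.53/32 area 0'

In a real deployment you would also want a health check that withdraws the route when the resolver process itself is sick, not just when the box is down.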

We use anycast for several other services on campus now (RADIUS, for example), and it's saved our bacon more than once.  I highly recommend that people take a good look at it; it's not that complex, and it makes both client configuration and server maintenance much easier.
