On 8/17/2011 2:43 PM, Alan McKinnon wrote:
>
> I'm just itching to type up the long list of horror stories I've
> stored from people doing their own DNS thinking it was real easy.
>
> But there's this little thing called an NDA and it says I can't :-(
|
heh, I think I can dredge one up for you that no one will care about
these days.
|
This was at a large ISP in '99 known for their free Internet. BIND 8
was fresh on the scene and somehow Network Engineering was in charge of
DNS rather than Systems. My intern and I came up with a plan to have
ns00.int as the internal master and make the rest of the name servers
slave off of it. All ns00 did was supply the production name servers
with zones.
|
ns00 --> ns01(vip) --> ns01-[01-03]
    \--> ns02(vip) --> ns02-[01-03]
    \--> ns03(vip) --> ns03-[01-03]
|
Three virtual IPs and three name servers behind each vip.
|
This way we could have tools deal with updating zones on ns00 on the
internal network and not have to push to a number of name servers. This
worked well for a few months and we generally forgot about it. Almost a
month after a reorganization in the local datacenter, DNS went down.
Well, not down down, but our zones weren't resolving. After a hectic
hour of freaking out, troubleshooting random things, and bouncing from
machine to machine by IP address because none of DNS worked, we realized
our mistake. The expire time of the zone itself (the SOA expire value,
not really a TTL) was set to three weeks. In the move, BIND had silently
died on ns00, which we didn't monitor because it was inside the corp
network. The slaves dutifully stayed up and working till they hit the
expire time of the zones and demanded to speak to the master again.
Restarting BIND on the prod servers did nothing other than clear out the
already expired zone data.
Once we restarted BIND on ns00 (and made it part of the runlevel so it
starts at boot), the prod servers checked in and all was well.
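
The three-week limit in the story behaves like the expire field of a
zone's SOA record: a slave that cannot reach its master keeps answering
until that many seconds have passed since its last successful refresh,
then stops serving the zone. A rough sketch of the arithmetic, assuming
a three-week expire value and a made-up date for when the master died
(neither the date nor the function names come from the original setup):

```python
from datetime import datetime, timedelta

# Assumption matching the story: an SOA expire of three weeks, in seconds.
THREE_WEEKS = 21 * 24 * 60 * 60  # 1814400


def serving_deadline(last_contact: datetime, expire_seconds: int) -> datetime:
    """Time at which a slave gives up on a zone it can no longer refresh."""
    return last_contact + timedelta(seconds=expire_seconds)


def is_expired(last_contact: datetime, expire_seconds: int, now: datetime) -> bool:
    """True once the expire interval has passed with no contact with the master."""
    return now >= serving_deadline(last_contact, expire_seconds)


# Made-up date for illustration: the slaves last reached the master on 1 March.
master_died = datetime(1999, 3, 1, 12, 0)
print(serving_deadline(master_died, THREE_WEEKS))  # 1999-03-22 12:00:00
```

The gap between the failure and the outage is why this kind of problem
is hard to connect to its cause: nothing looks wrong for three weeks.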
|
The lessons:
Monitor *all* of your DNS infrastructure, internal masters included.
DNS can break even with a large distributed system, and it is never pretty.
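
As a rough idea of what monitoring every server could look like, here is
a minimal sketch that asks each name server, internal master included,
for a zone's SOA record and flags any that do not answer or whose serial
differs from the master's. It assumes `dig` is installed and shells out
to it; the function names are hypothetical and this is not what we ran
at the time. A serial mismatch can be normal for a short while right
after an update, so treat it as a warning rather than an outage.

```python
import subprocess


def parse_soa(short_answer: str):
    """Parse one line of `dig +short SOA` output into (serial, expire)."""
    fields = short_answer.split()
    if len(fields) != 7:
        return None  # empty or unexpected answer: treat as a failed check
    try:
        # Field order: mname rname serial refresh retry expire minimum
        return int(fields[2]), int(fields[5])
    except ValueError:
        return None


def query_soa(server: str, zone: str):
    """Ask one server for the zone's SOA; None means no usable answer."""
    try:
        result = subprocess.run(
            ["dig", "+short", "+time=2", "+tries=1", "SOA", zone, "@" + server],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = result.stdout.strip().splitlines()
    return parse_soa(lines[0]) if lines else None


def find_problems(answers: dict, master: str) -> list:
    """List problems given {server: (serial, expire) or None}."""
    problems = [f"{server}: no SOA answer"
                for server, soa in answers.items() if soa is None]
    master_soa = answers.get(master)
    if master_soa is not None:
        for server, soa in answers.items():
            if soa is not None and soa[0] != master_soa[0]:
                problems.append(
                    f"{server}: serial {soa[0]} != master {master_soa[0]}")
    return problems
```

To use it, build the `answers` dict by calling `query_soa(server, zone)`
for every server, then alert on anything `find_problems` returns. The
check has to run from somewhere that can reach the internal master, or
it repeats the blind spot that bit us.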
|
kashani |