Disaster Recovery
The fewer unique bytes you have on any host's hard drive, the better -- always think about how you could quickly recreate that hard drive if it failed, even if only the least skilled person in your group were available to do it. The test we use when designing infrastructures is "Can I grab a random machine that's never been backed up and throw it out the tenth-floor window
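One way to make "unique bytes" concrete is to compare a host's files against a known baseline and add up whatever differs. The sketch below is only an illustration: the idea of a baseline manifest built from a reference install, and the function names `file_digest`, `build_manifest`, and `unique_bytes`, are assumptions made for this example and are not tools described here.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def unique_bytes(host_root, baseline_manifest):
    """Return (total_bytes, paths) for files missing from, or differing from, the baseline.

    baseline_manifest is a hypothetical {relative_path: digest} mapping,
    for example one produced by build_manifest() on a reference install.
    """
    unique = []
    total = 0
    for rel, digest in sorted(build_manifest(host_root).items()):
        if baseline_manifest.get(rel) != digest:
            unique.append(rel)
            total += os.path.getsize(os.path.join(host_root, rel))
    return total, unique
```

Under these assumptions, a host that could safely go out the window would report a total at or near zero, and any paths it does list are the ones worth moving onto a server or into the infrastructure code.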
Likewise, if the entire infrastructure, our enterprise cluster, were to fail due to a power outage or a terrorist bomb (this was New York, 1997), then we should expect replacement of the whole infrastructure to be no more time-consuming than replacement of a conventionally-managed UNIX host. We originally started with two independent infrastructures -- developers, whom we used as beta testers for infrastructure code, and traders, who were in a separate production floor infrastructure, in another building, on a different power grid and PBX switch. This gave us the unexpected side benefit of having two nearly duplicate infrastructures, and we were able to use the development infrastructure very successfully as the disaster-recovery site for the trading floor. In tests, by co-opting the development infrastructure, we were able to recover the entire production floor -- including servers -- in under two hours. This gave us full recovery of applications, business data, and even the contents of traders' home directories and their desktop color settings.
If you recall, in our model the DNS domain name was the name of the enterprise cluster. You may also recall that we normally used meaningful CNAMEs for server hosts -- gold.mydom.com,
© Copyright 1994-2007 Steve Traugott, Joel Huddleston, Joyce Cao Traugott
In partnership with TerraLuna, LLC and CD International