August 31, 2007
Vol. 29 Issue 35
Page 9 in print issue
Downtime, What’s That?
Tips For Improving Server Uptime
In April 2007, Co-op Atlantic (a wholesaler offering products and services to 135 co-ops and 237,000 co-op members) was offline for 18 hours—its database server failed, and then its backup server failed. The next month, the entire June issue of Business 2.0 magazine was accidentally deleted with no backup because of a failed backup server. The following month, customers of the largest Web and application hosting company in Sydney, Australia, WebCentral, went without email service for a week due to server failure.
How is this possible? According to Bruce Taylor, chief strategist at the Uptime Institute (www.uptimeinstitute.com), maintaining the integrity of server farms and data centers is now (or should be) a matter of corporate governance and risk management. And developing comprehensive business continuity and disaster recovery plans is essential to a company's survival.
"Protecting systems is accomplished by eliminating single points of failure in servers, storage systems, and networks," says Taylor. "Redundant hardware and networks, along with alternate sources of electrical power and network connections, make the data center resilient. IT operating procedures reinforce the data center's robustness."
Taylor suggests that IT managers responsible for the physical facilities infrastructure must begin to look at their data centers from a whole-system perspective and should firmly integrate IT infrastructure planning with site infrastructure subsystem (primarily power and cooling) planning.
Master Your Code
"Push all of your infrastructure with your code," says Rick Bentley, CEO of Connexed Technologies (www.connexed.com). "Connexed, like many other companies, is a SaaS [software as a service] provider. This means that we not only have to manage our own hosted infrastructure, we also have to write and test our own code. Like many companies, we have multiple environments (development, QA, stage, and production). If we were to just push code from one environment to the next, we would be testing the code but not the infrastructure."
"For example," continues Bentley, "after you buy that big, new, and expensive storage array, you connect it to your development environment. Once everything is working, you connect it to your QA environment (with the next code build). Then you connect it to stage and finally to production. The first time you connect it to your development environment, you will probably make some mistakes."
"Then," says Bentley, "you might have to reboot the development boxes more than once (causing downtime each time). By the time you've deployed the hardware for the fourth time, when you push it to production, things should go much more smoothly than they did the first time, minimizing downtime."
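Bentley's promote-in-order discipline can be sketched as a small promotion pipeline that refuses to skip a stage. The environment names and the `deploy()` callback below are illustrative assumptions, not Connexed's actual tooling:

```python
# Sketch of "push all of your infrastructure with your code":
# every release (and every hardware change bundled with it) must
# walk the same four environments, in order, before production.
# Environment names and deploy() are hypothetical placeholders.
ENVIRONMENTS = ["development", "qa", "stage", "production"]

def promote(release, deployed, deploy):
    """Deploy `release` to the next environment in sequence.

    `deployed` lists the environments this release has already
    reached; `deploy(release, env)` performs the actual push.
    Refusing to skip a stage means the hardware is exercised
    four times before it ever touches production traffic.
    """
    for env in ENVIRONMENTS:
        if env not in deployed:
            deploy(release, env)   # push code *and* infrastructure changes
            deployed.append(env)
            return env
    raise RuntimeError(f"{release} is already in production")
```

Each call moves the release one environment forward, so the mistakes Bentley describes get made (and fixed) in development and QA rather than in production.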
Consider An Open-Source OS
According to Matt Olander, CTO at iXsystems (www.ixsystems.com), one simple method for increasing server uptime is to choose an OS that is robust, reliable, and stable. The open-source FreeBSD (www.freebsd.org) OS, with its roots in BSD Unix, has a history of solid development and is world-renowned for its performance and stability. "Many large installation system administrators consider FreeBSD to be one of the Internet's best-kept secrets," says Olander.
"Organizations such as Yahoo!, Juniper Networks, Network Appliance, IronPort, Isilon, The Weather Channel, and NASA all rely on FreeBSD to deliver enterprise-class products and services, protect their networks, and serve millions of Web pages a day. FreeBSD increases uptime by focusing on development practices that produce well-tested code that is reliable, stable, and secure. FreeBSD is the perfect choice for the enterprise data center, as well as for small to medium businesses," continues Olander.
"Agreed," says Bentley. "We all have fixed budgets. Are you running Oracle on Solaris? How much would you save if you moved to PostgreSQL or MySQL on Linux? If you have enough CPUs that you're paying license fees on, you could probably hire another full-time DBA to improve your architecture so things are more reliable to begin with."
Bentley notes: "Why spend money on proprietary monitoring software when there are open-source alternatives such as Nagios (www.nagios.org) or Cacti (cacti.net)? With the money you save using open source, you can spend more on hardware. Money equals uptime. Spend it wisely."
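To give a sense of how little setup basic Nagios monitoring requires, a minimal host-and-service definition might look like the following. The host name and address are placeholders; the `linux-server` and `generic-service` templates come from the sample configuration Nagios ships with:

```
# Hypothetical Nagios object definitions -- db01 and its address
# are placeholders, not a real host.
define host {
    use        linux-server      ; template from the sample config
    host_name  db01
    address    192.0.2.10        ; TEST-NET-1 documentation address
}

define service {
    use                  generic-service
    host_name            db01
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}
```

With definitions like these, Nagios pings the host on a schedule and raises warning and critical alerts at the round-trip-time and packet-loss thresholds given to `check_ping`.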
Eliminate Single Points Of Failure
"In critical computing environments, redundancy is a must," says Taylor. Many supposedly redundant systems still contain single points of failure. For example, how useful are a server's dual power supplies if they're both connected to a single (failure-prone) PDU? How useful are your redundant PDUs if they both draw power from the same UPS? If there's a small fire or accident in your facility, will you find that your primary and backup wires run side by side in the same conduit?
According to Taylor, backup systems fail because of either poor practices or single points of failure. For data centers that demand maximum reliability (to the Institute's Tier III or Tier IV fault-tolerance specifications), there must be two independent power paths all the way from the grid to the back of the server.
Don't Forget Security
"Downtime is not always caused by hardware failure," says iXsystems' Olander. "Security must be a primary concern in server uptime. Add redundant firewalls to protect the network from malicious attacks that can cause interrupted service."
"The open-source and free CARP (Common Address Redundancy Protocol) manages failover at the intersection of Layers 2 and 3 in the OSI model (the link and IP layers)," continues Olander. "CARP allows a backup host to assume the identity of the primary host. Combined with pf (packet filter), the free and open-source firewall solution in the FreeBSD and OpenBSD (www.openbsd.org) operating systems, it provides an excellent way to build scalable, redundant firewalls that help keep the servers in a network safe and secure."
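As a rough sketch of the setup Olander describes, two firewalls can share a virtual IP via CARP along these lines. The addresses, interface name, and password are placeholders, and the exact `ifconfig` syntax varies slightly between FreeBSD and OpenBSD releases:

```
# Primary firewall: the lowest advskew wins the master election.
ifconfig carp0 create
ifconfig carp0 vhid 1 advskew 0 pass examplepass 192.0.2.1/24

# Backup firewall: higher advskew; it claims the shared address
# only if the master stops sending CARP advertisements.
ifconfig carp0 create
ifconfig carp0 vhid 1 advskew 100 pass examplepass 192.0.2.1/24

# pf.conf on both hosts must allow the CARP advertisements through:
pass on em0 proto carp keep state
```

Clients point at the shared 192.0.2.1 address, so when the primary firewall dies, the backup assumes that address and traffic keeps flowing without any client-side reconfiguration.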
Many free resources are available on the Internet to assist administrators with deploying scalable FreeBSD or OpenBSD servers running pf firewalls with CARP for failover. "An open-source redundant firewall built on commodity server hardware can help protect against interrupted service from malicious attacks and can easily be scaled as traffic grows," concludes Olander.
by Julie Sartain
Biggest Immediate Payback
"Think cluster," says Rick Bentley, CEO of Connexed Technologies (www.connexed.com). "Remember when RAID stood for Redundant Array of Inexpensive Disks? Many cheap drives in a RAID are more reliable than one expensive drive. Think of your servers the same way. Rather than buying expensive servers with dual hot-swappable power supplies, hot-swappable fan banks, and so on for $3,000, $5,000, or more, why not buy several boxes for $1,000 each and put them in a cluster?"
"For the same amount of money," notes Bentley, "would you rather have three times the computing power you need (which is great if you get a big load peak) and occasionally have to swap out the motherboard on one of your 1U pizza boxes, or buy fewer, more expensive servers, essentially placing more eggs in fewer baskets?"
"Think about your database," continues Bentley. "Is it on one big, monolithic box? What happens if/when that box fails? Do you have another big, expensive box waiting on standby? How fast does that standby kick in? How do you know it will work—do you have the guts to occasionally pull the plug? Two boxes set up in automated failover means, basically, you're running an experiment on failover time the first time your primary box fails."
"If you set up five boxes in a DB cluster," concludes Bentley, "you will feel confident enough to pull the plug on any one of them, any time you want, because now you have a fully tested system."
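Bentley's pull-the-plug confidence ultimately rests on clients that fail over automatically. A minimal client-side sketch of that idea, with the node list purely a placeholder rather than any particular product's API, is to try each cluster member in turn:

```python
import socket

def connect_with_failover(nodes, timeout=2.0):
    """Return a socket to the first cluster node that answers.

    `nodes` is a list of (host, port) pairs in preference order --
    placeholder values here, not a specific database's interface.
    If one box has had its plug pulled, the connection attempt fails
    fast and the next node is tried; we raise only if every node
    in the cluster is down.
    """
    last_error = None
    for host, port in nodes:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_error = err          # node unreachable; try the next one
    raise ConnectionError("all %d nodes failed" % len(nodes)) from last_error
```

Because the client walks the list on every connection, pulling the plug on any single box simply shifts traffic to its neighbors, which is exactly the experiment Bentley wants you to be able to run at will.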