
General Information
September 10, 2010
Vol.32 Issue 19 Page(s) 42 in print issue
Minimize The Chances For Human Error In The Data Center
To Err Is Human; To Work Sharper, Divine
Key Points
• Reducing human error begins with limiting the number of human beings who access the data center in the first place.
• Careful design and placement of key controls can prevent mishaps by eliminating or circumventing their contributing factors.
• High stress and mundane routine can both cause errors by knocking people out of a calm, focused mindset.

Studies repeatedly show that the top cause of data loss and downtime is human error. This will never change. Human beings aren’t machine-tooled for precision, and the greater the pressure and the longer we work without stopping, the more susceptible we are to unpredictability and failure. Still, data center goof-ups can be made fewer and farther between with mindful attention to policy, design, precautions, and communication.
Restricted Access
Reducing human error begins with limiting the number of people who access the data center in the first place. In almost any enterprise, only a small fraction of the staff ever needs to be there. The rest can be kept out by equipping the DC doors with keycard or punch-code locks or by using badge-reliant security, even though the goal is rarely to stop someone with malice. “Damage can be done by people with good intentions who think they know more than they really do,” says Mic Jones, senior systems administrator for a national public sector client. “Don’t hand anybody more access than they need to do their job.” Mike Oliveri, technology coordinator at a K-12 Illinois school district, agrees. “People leaning on equipment or hitting the wrong buttons, or a careless custodian slinging a mop too close to a server rack, can be disastrous.” If the DC isn’t the exclusive turf of your enterprise, then access should be restricted at the component level. “Most servers come with locks on their chassis, or some server racks come with locking doors. In a shared data center or [colocation center], all cables, wiring, and equipment should be locked up and accessible only by the given customer,” Oliveri says.
Access By Non-IT Staff
On the other hand, most data centers will still be visited by non-IT staff: regularly by janitorial staff and on occasion by electricians and other technicians. These workers must be briefed on the environment they’re about to enter. “Signs are not enough. There should be meetings or training so janitors know what they can or can’t touch,” Oliveri says. “I know a guy who had a lot of server errors and reboots in the logs on some of his equipment. He stayed late one night to figure it out and happened to see a custodian come in and unplug the UPS on a server rack so he could plug in a floor buffer.” Electricians, carpenters, and similar workers should review schematics and other diagrams of the areas they’ll be working in, so they don’t, for example, cut into a vital line. If they’re covering equipment to protect it, they should be apprised of how much airflow components require.
Design & Environment
Much like childproofing a home can reduce accidents, attention to facility design and placement can prevent mishaps by eliminating or circumventing their contributing factors. For instance, controls for temperature and humidity should be inaccessible to everyone who has no business setting them. Also, any button that could be disastrous if accidentally activated, such as an emergency power off, shouldn’t be a push-in type, but pull-out. Even though a lot of DC work is done remotely, physical access is sometimes necessary. “On the occasion when you need to go into the server room or to a rack . . . It needs to be labeled,” Jones says. Clearly visible labels that confirm you’re where you need to be are a simple way of heading off big blunders. Physical safeguards can also play a role. In a busy environment, vulnerable racks may benefit from railings or bumpers that protect them from being rammed by chairs, carts, etc.
The Perils Of Routine
Paradoxically, routine operations can open the door the widest to error. With mundane tasks, we’re often prone to going on autopilot or forgetting when we last did them. “Not following procedure in routine or repetitive jobs can be common,” says Oliveri. “For example, forgetting to swap backup tapes in a drive or vault or not checking the tapes for errors. Perhaps even swapping in the wrong tapes.” This can even extend to monitoring systems designed to alert you to problems. “Sometimes a lack of numbers is harder to catch than the wrong numbers,” Jones says. “Everybody’s trained to look for the wrong output, but when the output stops, or something changes, it doesn’t show up in a way that they’re trained to expect.” Any kind of reminder system that helps direct someone’s focus to the task at hand is bound to help.
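Jones’s point about missing output can be made concrete with a small staleness check that flags any source that has gone quiet rather than waiting for it to report a bad value. The following is only a minimal sketch: the source names, timestamps, single silence threshold, and the `find_silent_sources` helper are all hypothetical and not drawn from any particular monitoring product.

```python
from datetime import datetime, timedelta


def find_silent_sources(last_seen, now, max_silence):
    """Return the names of sources whose latest report is older than max_silence."""
    return sorted(
        name
        for name, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )


# Hypothetical example data: when each source last reported anything at all.
now = datetime(2010, 9, 10, 12, 0)
last_seen = {
    "backup-job": now - timedelta(hours=30),
    "ups-monitor": now - timedelta(minutes=2),
    "temp-sensor": now - timedelta(minutes=50),
}

# Assumed policy: any source silent for more than 15 minutes deserves a look.
silent = find_silent_sources(last_seen, now, timedelta(minutes=15))
for name in silent:
    print(f"No output from {name} since {last_seen[name]:%Y-%m-%d %H:%M}")
```

In practice each source would likely need its own threshold, since a nightly backup and a temperature sensor report on very different schedules.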
Stress Reduction
Mistakes are likely to compound when people are feeling pressure to make something happen yesterday. Oliveri recommends insulating IT staff from fellow employees or customers whenever the heat of a service outage is on. Fielding complaints only slows down the resolution. That goes for inside the data center, too. Mid-crisis is no time for heavy-handed discipline, yet for some supervisors that’s still the first impulse. “I’ve seen a lot of energy expended on punitive blame being placed on individuals before the problem is resolved,” says Jones. It really pays to prioritize soft skills here: simply focus on the resolution and on whatever can be done to prevent the problem from recurring.
Learn From The Past
It’s been noted that IT rarely publishes or speaks publicly about specific mishaps, possibly out of fear that this could undermine confidence in a company for its present and potential clients. This stance is in stark contrast to the aviation industry, to name one example, which uses investigative transparency in its efforts to make air travel ever safer. Even if you don’t go public about your own company’s incidents, it can still be of great benefit to chronicle them in-house, for instance, using them as case studies in training documentation or as part of a searchable knowledge base. The old saying still holds true: Those who don’t learn from the past are doomed to repeat it.

by Brian Hodge
TOP TIPS
Although people aren’t machine-precise, IT work often calls for that level of accuracy. “A typo can mess up your whole morning,” says Mic Jones, senior systems administrator for a national public sector client. Here are a few ways to hit the bull’s-eye.
• For all but the simplest tasks, prepare a step-by-step script or checklist.
• When working from a checklist, work as a team, with one person giving the next step and providing oversight while the other person focuses on one task at a time. Even conferring over the phone is better than flying solo.
• When typing keyboard commands, don’t keep your finger poised over the ENTER key. Take your hands away and double-check that the command is typo-free, that you’re in the right SSH window, in the right directory, etc.
• Physical or mental fatigue can wreak havoc during prolonged or repetitive tasks. Take a break whenever you sense you need one, or at the first sign of even a minor slip-up.
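The “right SSH window, right directory” double-check can also be partly automated as a pre-flight step in a maintenance script. The sketch below is purely illustrative: `preflight_problems`, the host names, and the paths are all made up for the example, and the check supplements rather than replaces a second pair of eyes.

```python
import os
import socket


def preflight_problems(command, expected_host, expected_dir,
                       actual_host=None, actual_dir=None):
    """Return a list of reasons not to run the command yet (empty means go ahead)."""
    # Default to the machine and directory the script is actually running in.
    actual_host = socket.gethostname() if actual_host is None else actual_host
    actual_dir = os.getcwd() if actual_dir is None else actual_dir

    problems = []
    if not command.strip():
        problems.append("command is empty")
    if actual_host != expected_host:
        problems.append(f"host is {actual_host!r}, expected {expected_host!r}")
    if os.path.normpath(actual_dir) != os.path.normpath(expected_dir):
        problems.append(f"directory is {actual_dir!r}, expected {expected_dir!r}")
    return problems


# Hypothetical scenario: a cleanup meant for staging, typed into a production window.
problems = preflight_problems(
    "rm -rf ./old-logs",
    expected_host="staging-01",
    expected_dir="/var/log/app",
    actual_host="prod-01",
    actual_dir="/var/log/app",
)
for problem in problems:
    print("STOP:", problem)
```

Passing the actual host and directory explicitly, as in this example, keeps the check easy to test. Leaving them out makes the function inspect the live session instead.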