If you’ve been following this blog for more than a few weeks, you know I manage a data center. If you do too, then you probably hear a lot in the media about how hard it is to keep things in a data center running. If you read the articles, you would think the power goes out 10 times a day, the AC never works and servers randomly explode.
But in reality running a data center isn’t that hard. I hate to say it but if you have a data center that was built and designed to be a data center, as opposed to an empty closet that has servers jammed in it, it’s pretty easy.
We have redundant power. A lot of people say they do, but we really have separate feeds from the street coming into separate transformers, switch gear, transfer switches, generators and UPSs. All of that is separate and redundant. All of our servers, except one, have two or more power supplies, with one plugged into power from one feed and the other on the redundant feed. The one without redundant power is not critical and also wasn’t bought by IT. Trust me, I would have spent the extra $74 for the second power supply, even if I had to open my own wallet to pay for it.
We regularly run our generators and do routine maintenance on our UPS batteries. Once a month we even measure the power on all the main circuits to make sure they are within specifications. (To be safe, you should only run a circuit at 80% of its rating, so a 100-amp breaker should not carry more than 80 amps.)
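If you want to automate that monthly check, here is a minimal Python sketch of the 80% rule. The circuit names and readings are made up for illustration, and the function names are my own, not part of any tool we use.

```python
# Rule of thumb from the text: keep a circuit at or below 80% of its rating.
SAFE_PERCENT = 80


def max_safe_amps(breaker_rating_amps):
    """Return the highest load we want to see on a breaker of this rating."""
    return breaker_rating_amps * SAFE_PERCENT / 100


def overloaded_circuits(readings):
    """Return the names of circuits whose measured load exceeds the safe limit.

    readings maps a circuit name to (breaker rating in amps, measured amps).
    """
    return [
        name
        for name, (rating, measured) in readings.items()
        if measured > max_safe_amps(rating)
    ]


# Hypothetical monthly readings, not real data.
readings = {
    "panel-A-1": (100, 72.5),
    "panel-A-2": (100, 84.0),
    "panel-B-1": (60, 45.0),
}

print(max_safe_amps(100))
print(overloaded_circuits(readings))
```

With these sample numbers, only the circuit reading 84 amps on a 100-amp breaker gets flagged.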
We’ve tested our power redundancy, and found some problems the first time, but now can take half of the power away without a glitch. So power doesn’t keep me up.
When we built our data center we figured out the maximum cooling load we would need and designed two (again separate) cooling systems to cover it. One of our 30 ton units runs from a pair of chillers on the roof that actually can provide 350 tons each of cooling. (We run our labs and some other units from this).
That system is pretty solid since we have the two chillers and three pumps, but just in case we also have a 30-ton DX unit which is completely separate from the chilled water and gets power from a different feed. This unit isn’t as efficient, so we normally keep it as a backup, but once a month the facilities guys switch over to it and run it to make sure it works. Testing before you need something is one of the best ways to not get bitten.
So neither cooling nor power causes me any issues in the data center.
OK, so we build networks. I’d not only be embarrassed if our network wasn’t redundant, I’d probably get my butt kicked. We do, on average, 3-4 upgrades a week to our network. We are always testing the latest solutions and products, so we upgrade a lot more than most people would. If we had to do these upgrades during off hours, we would never get home to see our families. People don’t like that, so we made sure the network was redundant enough for us to do upgrades during the day.
Now, when we have one of our core routers down for maintenance, we do take a risk, since during that time we have no redundancy. Really, though, with redundant power in a room that is always cool enough, I can’t remember the last time we actually had a hardware failure. We might have had a power supply or a GBIC fail a few years ago, but since it’s redundant, something like that is hardly memorable.
We have redundant WAN connections too, from different providers. I know all the big telcos claim they can provide redundancy for me, but, frankly, I don’t really trust them. I’ve been burned when one of the big providers had a major core router go down and all of my connections, primary and secondary, went down. They were redundant feeds from the street to different routers, but I still went down. So we do use different providers. Everyone resells everyone else, so you still need to know where your connection really goes.
The WAN engineer we have on staff can tell me which poles our different circuits go to, through which conduit and then which central office they go to. He knows where the peering between the different providers happens. I don’t know how the heck he knows this, but he does. Because of this I know that a single backhoe can’t take me down. I guess a fleet of backhoes could, but how likely is there to be a gang of marauding backhoes trying to get me? I know one provider going down will cause me performance issues, but I’ll still be running.
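For those of us who can’t keep all of that in our heads, the same check can be written down. This is a rough Python sketch, assuming you have recorded the physical elements (poles, conduits, central offices) each circuit passes through. The circuit names and element labels below are invented for illustration.

```python
from itertools import combinations


def shared_elements(paths):
    """Find physical elements that two circuits have in common.

    paths maps a circuit name to the set of elements it passes through.
    Returns a dict of (circuit_a, circuit_b) -> shared elements, listing
    only the pairs that overlap. An empty result means no single element
    is shared between any two circuits.
    """
    overlaps = {}
    for a, b in combinations(sorted(paths), 2):
        common = paths[a] & paths[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Hypothetical circuit records, not our real topology.
paths = {
    "provider-1": {"pole-12", "conduit-north", "co-downtown"},
    "provider-2": {"pole-47", "conduit-south", "co-uptown"},
    "provider-3": {"pole-47", "conduit-east", "co-harbor"},
}

for pair, common in shared_elements(paths).items():
    print(pair, "share", sorted(common))
```

In this made-up data the second and third circuits hang on the same pole, which is exactly the kind of single point of failure one backhoe (or one truck) could find.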
So, nope, networking isn’t a problem either. No, it’s really the process that gets us, or used to get us, in trouble. The real problems with data centers are people issues. The times I got in trouble were when someone brought a new server online and skipped steps. It happens. Even if you think it won’t happen because you have a policy, it will. There never used to be a way to technically enforce it.
A few years ago we implemented network authentication in our data center. If you think about it, servers that aren’t on the network rarely cause issues, especially since we have 20% extra power, just in case. At first it was a little scary, but since then we never miss steps. Well, let me take that back. When we miss steps, the server doesn’t get to be on the network, so we remember the step pretty quickly. It definitely makes life easier. We liked it so much that Enterasys now sells this. Ask about Data Center Manager.
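To make the idea concrete, here is a toy Python sketch of gating network access on a completed checklist. It is only an illustration of the concept, not how Data Center Manager or our authentication actually works, and the step names and server names are hypothetical.

```python
# Hypothetical provisioning steps; a real list would come from your own process.
REQUIRED_STEPS = frozenset({
    "asset_tagged",
    "dual_power_verified",
    "patched",
    "monitoring_enabled",
})


def missing_steps(completed):
    """Return the required steps that have not been completed, sorted by name."""
    return sorted(REQUIRED_STEPS - set(completed))


def admit_to_network(server, completed):
    """Decide whether a server may join the network.

    Returns (allowed, message). A server is admitted only when every
    required step has been completed.
    """
    missing = missing_steps(completed)
    if missing:
        return False, f"{server} denied: missing {', '.join(missing)}"
    return True, f"{server} admitted"


print(admit_to_network("db01", REQUIRED_STEPS))
print(admit_to_network("web07", {"asset_tagged", "patched"}))
```

The point of the design is that the denial message names the skipped steps, so whoever brought the server online finds out right away what they forgot.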
Now nothing keeps me up at night, well except the big bowl of coffee ice cream I just had. That may have been a bad idea at 10:30 PM…
P.S. While it’s a small part, I do get “reviewed” on the number of comments I get. So not trying to bribe anyone, or stack the deck against my fellow ETS bloggers, but if you would comment (and link back to this post) that’d be really cool…
Too subtle? 🙂