Bad into Good:
If there is a better way to finally get rid of orphaned servers and VM sprawl leftovers than a data center wide power outage, I haven't seen it. ...if...IF you have the flexibility or buy-in to decide not to just universally power everything back on. Assuming you, like most companies I have seen, do not do a great job of life cycle management, and you are given the opportunity to power servers back on one by one as, and only as, they are requested by your customers, you will be amazed at the number of servers you get to leave powered off.

Bad into Worse:
The tendency when something bad happens is to reflect on and regret what could and should have been done. That reflection is a proper and needed response, so that we can learn from our mistakes and oversights. The danger is in the desire to "fix" these past indiscretions with a knee-jerk, immediate reaction. The reason this is dangerous is that the "could have"/"should have" actions you think you are fixing exist in the first place because of a failure to plan. Implementing all of the things that should have been done before an incident immediately after an incident only invites a new and more deadly disaster. As much as you want to jump right in and fix things so that a similar incident doesn't happen again, this is the time to properly plan changes and set yourself up for future success.

Focusing on the Good:
Now that you have made some good from bad and left a number of servers off, ultimately decommissioning them after a respectable amount of time, it is time to take that on-demand lesson and expand its practice. With all the talk about cloud and on demand, there are plenty of products and ideologies centered around making server provisioning, life cycle, and deprovisioning a self-serve process.

Think about where that entirely self-serve method goes wrong. If David Developer gets an email asking if his server is still needed, the vast majority of the time he will say yes. This is ultimately the same issue as asking for one-up approval: most leaders are going to approve requests from their team that they don't understand or that don't directly affect them.
Consider now one of the most time-consuming aspects of manually deploying applications: troubleshooting the install. If you were deploying a web application to a farm of servers and had to do each one manually, the chances of something getting missed go up with each additional server you deploy to. The solution, of course, is a tool to automate the deployment. Once you have that deployment tool, the act of deploying software becomes trivial. You are now in a place where both the provisioning of servers and the deployment of software are trivial, and you can implement practices that incorporate self-serve and automation.
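To make that concrete, here is a minimal sketch of what a farm-wide deployment script could look like. The hostnames, package name, paths, and restart command are all hypothetical placeholders, and it assumes SSH key access to each server; it illustrates the idea of doing the exact same steps on every server, not any particular deployment tool.

```python
#!/usr/bin/env python3
"""Minimal sketch: push one build to every server in the farm the same way."""
import subprocess

# Hypothetical farm and artifact names -- replace with your own inventory.
SERVERS = ["web01.example.com", "web02.example.com", "web03.example.com"]
PACKAGE = "webapp-1.4.2.tar.gz"

def deploy(host: str) -> None:
    """Copy the build to one host and restart the app service, failing loudly."""
    subprocess.run(["scp", PACKAGE, f"{host}:/tmp/{PACKAGE}"], check=True)
    subprocess.run(
        ["ssh", host,
         f"sudo tar -xzf /tmp/{PACKAGE} -C /opt/webapp && sudo systemctl restart webapp"],
        check=True,
    )

if __name__ == "__main__":
    failures = []
    for host in SERVERS:
        try:
            deploy(host)
            print(f"{host}: deployed")
        except subprocess.CalledProcessError:
            failures.append(host)
            print(f"{host}: FAILED")
    # Every server gets the exact same steps; the only variable is the host name.
    if failures:
        raise SystemExit(f"Deployment failed on: {', '.join(failures)}")
```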
Holy grail time. David Developer needs a farm of servers to develop and test on. He puts in the request and the servers are automatically provisioned. He then automatically, through a tool or script, configures the roles and features he needs. Then his code gets deployed to the entire farm of servers at the push of a button. So far this is all the same, but this is where it changes. Why does D.D. need to get an email after 90 days to see if he still needs his servers? Why don't they automatically get retired and the resources returned to the pool after the predetermined life of the servers? Once D.D.'s servers disappear, he puts in a new request and the same thing happens all over again. Besides the benefit of never having a server you don't need, consider the patching implications. With a 90-day life for your servers, every rebuild starts from a current image, so you essentially have a quarterly patch management process out of the box without ever having to install a single patch.
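As a rough illustration of the retirement half of that loop, here is a sketch of a nightly job that retires any server past its lease. The inventory format, the 90-day lease, and the decommission call are all hypothetical stand-ins for whatever provisioning system you actually use.

```python
#!/usr/bin/env python3
"""Sketch: retire any server whose lease has expired and return its resources."""
from datetime import date, timedelta

LEASE = timedelta(days=90)  # the predetermined life of a server

# Hypothetical inventory: in practice this comes from your provisioning system.
INVENTORY = [
    {"name": "dd-web01", "provisioned": date(2013, 1, 15)},
    {"name": "dd-web02", "provisioned": date(2013, 4, 2)},
]

def decommission(name: str) -> None:
    """Placeholder for the real teardown (delete the VM, release IPs, update records)."""
    print(f"retiring {name} and returning its resources to the pool")

if __name__ == "__main__":
    today = date.today()
    for server in INVENTORY:
        expires = server["provisioned"] + LEASE
        if today >= expires:
            decommission(server["name"])
        else:
            print(f"{server['name']} has {(expires - today).days} days left on its lease")
```

When the lease runs out, D.D. simply puts in a new request and the provision-configure-deploy cycle runs again from scratch.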
If even some of this good can come from bad, then I say bring on the power outage! Thoughts?
- Z