Disciplined Agile

IT Operations Strategies for DevOps

There are several technical IT operations strategies that support Disciplined DevOps:

  1. Solution monitoring. As the name suggests, this is the operational practice of monitoring running solutions and applications once they are in production. Technology infrastructure platforms such as operating systems, application servers, and communication services often provide monitoring capabilities that can be leveraged by monitoring tools. However, for monitoring application-specific functionality, such as what user interface (UI) features are being used by given types of users, instrumentation that is compliant with your organization’s monitoring infrastructure will need to be built into the applications. Development teams need to be aware of this operational requirement or, better yet, have access to a toolkit that makes it straightforward to provide such instrumentation.
  2. Standard platforms. Software development practices, such as continuous deployment and initial architecture envisioning, are enabled by consistency within your operational infrastructure. It is much easier to deploy to a handful of standard hardware configurations than it is to a myriad of unique ones. It is easier to deploy when there are consistent versions of infrastructure software (e.g. operating systems, databases, middleware, and so on) deployed across your environment. For example, all instances of your Oracle DB are 11.2.1.3, you don’t have 11.2.0.0, 12.0.1.0, and 11.2.1.3 installed in various places. Furthermore, it is much easier to make architecture decisions when there is consistency of infrastructure software packages in the first place. For example you standardize on Linuz for your server operating system, you don’t also have Windows, z/OS and others also in production (and if you do you’re actively retiring them).
  3. Deployment testing. After a solution, or an update to a component of your operational infrastructure, has been deployed you should run a quick set of tests to verify that the deployment was successful. Were the right versions of the files installed where they need to be? And were they deployed to all appropriate servers? Were database transformations applied successfully? Did the appropriate announcements, if any, get sent out? Did the overall deployment process run within the desired time frame?
  4. Automated deployment. Deployments should be automated, not manual. This increases the consistency of your deployments and supports the practice of continuous deployment. Part of your automation effort should be to support both self-recovery and self-testing as native aspects of your deployment strategy.
  5. Operations intelligence. This is the application of data warehouse (DW)/business intelligence (BI) solutions to provide insight into your operations and support efforts. Your operations team may have individual dashboards for each solution, they may combine information being generated by individual solutions into an integrated dashboard, and better yet share that information for an IT portfolio management view.

There is a collection of operations strategies focused on operational disaster mitigation that IT departments may choose to adopt:

  1. Disaster planning. Disciplined organizations will plan for operational disasters. Potential disasters include servers going down, network connectivity going down, power outages, failed solution deployments, failed infrastructure deployments, natural disasters such as fires and floods, terrorist attacks, and many more. This planning will include identification of potential problems, identification of strategies to address those problems, and putting mechanisms in place to hopefully mitigate the disasters. Potential strategies to address these disasters include building solutions that self-test and self-recover, building redundancies into your operational infrastructure, having disaster procedures in place, and practicing those procedures in simulated disasters.

  2. Scheduled disaster simulation. It is one thing to have disaster mitigations plans in place, it is another to know whether they actually work. Disciplined organizations will run through disaster scenarios to verify how well their mitigation strategies work in practice. For example, to test whether your power outage emergency plan works you would purposely simulate a power outage at one of your data centers and then work through your recovery plan. Like fire drills, these simulations should be done on a regular basis so that staff members build up the “body memory” required to act swiftly and appropriately in an emergency. The advantage of a scheduled disaster simulation is that you knowingly run it at a time where you will have minimal impact on your stakeholders. A disadvantage, at least when people are informed of the simulation ahead of time, is that people are mentally prepared for the simulation and aren’t caught unaware and thereby you don’t simulate the real level of stress that people would be under during an actual emergency.

  3. Random disaster simulation. Very disciplined organizations will implement a service within their operational environment that causes problems such as server or service outages at random times. An example of this is the Chaos Monkey functionality in Amazon’s Web Services (AWS) offering, functionality that is being implemented within many organizations now. The Chaos Monkey injects random problems into production to verify that the IT operations organization is capable of overcoming them. This is done to verify that your solutions really are able to automatically recover from problems and failing that at least operators are alerted to the problem.

As you would expect, truly disciplined organizations have adopted all of these strategies.