This article describes a disciplined approach to DevOps. It begins by defining DevOps, no small task given the continued debate within the DevOps community, and by describing what a disciplined approach to DevOps entails. We go on to show how DevOps strategies are explicitly addressed by the Disciplined Agile (DA) toolkit, thereby taking some of the mystery out of how DevOps works in practice. Finally, the article ends with some parting thoughts regarding how to transition to a DevOps mindset.
1. Defining DevOps
For the purposes of this article, we propose the following definition:
DevOps is the streamlining of the activities surrounding IT solution development (dev) and IT operations (ops).
In organizations that have not yet adopted a DevOps mindset we say that there is a “DevOps gap,” as we depict in Figure 1 below, between solution delivery teams and IT operations. This gap results in lengthy solution deployments and hence higher costs to deploy; a long mean time between deployments (MTBD) which is often measured in terms of months; reduced market competitiveness; and reduced ability to govern your IT efforts due to lack of real-time intelligence.
Figure 1. The DevOps gap
When organizations address these challenges by removing the barriers that inhibit effective collaboration, they close the DevOps gap. A common depiction of this strategy, sometimes called the “DevOps loop,” is shown in Figure 2. Organizations do this by adopting a mindset that promotes collaborative, learning-centric ways of working that are supported by agile practices, and very often by investing significantly in automation. On the development side they tend to adopt a continuous delivery (CD) approach, and their operations efforts become streamlined and more automated.
Figure 2. A closed DevOps gap
1.1 Defining Disciplined DevOps
What does it mean to take a disciplined approach to DevOps? It requires discipline to do the things that are good for you that you may not normally choose to do. Given this, we propose the following definition:
Disciplined DevOps is the streamlining of IT solution development and IT operations activities, along with supporting enterprise-IT activities such as Security and Data Management, to provide more effective outcomes to an organization.
We see several key forces in the current marketplace that make it difficult to settle on a common definition:
- Specialized IT practitioners. Many IT professionals still tend to specialize — someone will choose to focus on being a programmer, an operations engineer, an enterprise architect, a database administrator (DBA) and so on. As a result they tend to see the world through the lens of their specialty. Programmers will focus on the software development aspects of DevOps, operations engineers the operations aspects of DevOps, enterprise architects on the long-term planning and modelling aspects, and DBAs on the data management aspects. Few people are looking at the overall “big picture.”
- Agilists are focused on continuous delivery. Right now agile and lean developers are investing a lot of effort to figure out continuous delivery practices so as to streamline the regular deployment of value into production. Advanced teams are releasing daily, if not several times a day, due to adoption of practices such as automated regression testing, continuous integration (CI), and continuous delivery (CD). As a result most of the DevOps discussion in these communities focuses on these topics, sometimes straying into other practices such as canary testing, feature toggles, and production monitoring frameworks. These are clearly important techniques, but they still do not cover the full potential range of DevOps. These practices and more are described later.
- Operations professionals are often frustrated. Many operations groups are overwhelmed already with the rate of updates being foisted upon them by development teams. This is often exacerbated by the inconsistent use of technologies: the impact of the lack of enterprise awareness within undisciplined development teams is largely felt by the operations group, which needs to support the plethora of technology platforms used by the full range of development teams. Worse yet, the internal operations processes are often based on heavy implementations of ITIL or ITSM and have yet to be streamlined so that operations engineers are in a better position to collaborate with development teams.
- Tool vendors have limited offerings. As a result, the DevOps messaging from tool vendors tends to focus on just the aspects of DevOps supported by their tools, narrowing the discussion to what they have on offer. Yes, tools are important, but they are only part of the DevOps picture. Even if there were a vendor with a full range of tools, and if they actually interoperated smoothly (yes ALM vendors, we’re referring to you), you would still need to understand how to use those tools effectively. To paraphrase an old saying: a fool with a DevOps tool is still a fool.
- Service vendors have limited offerings. Similar to the issues surrounding tool vendors, service vendors are also making great claims about their deep expertise in DevOps. Upon examination you will often find, like the tool vendors, their definition of DevOps will focus on whatever they can currently support.
- Tool vendors treat DevOps as a marketing buzzword. To be blunt, many vendors have taken their existing products, and started marketing them as DevOps products (regardless of how well those products actually support DevOps practices). Granted, these products may have been very good at supporting traditional ways of working, but when it comes to supporting DevOps they prove to be rather clunky even though they may have added a few new features.
- The DevOps=Cloud vision. There is a lot of rhetoric, particularly coming from Cloud vendors, about how cloud-based tooling and deployment environments are critical to success in DevOps. Yes, having a cloud-based infrastructure clearly enables many DevOps practices and given the choice we prefer to work in an environment which leverages cloud-based technologies whenever appropriate. But, that doesn’t mean that the cloud is a prerequisite for doing DevOps.
The point is that there are several contributing factors to the lack of agreement within our industry as to what DevOps means in practice. The implication is that when someone is giving you advice about DevOps, you need to understand the scope of what they’re actually discussing. Another way to understand what DevOps is and how it may apply to your organization is to explore the various DevOps strategies and practices available to you.
2. The Workflow of Disciplined DevOps
Figure 2 is a good start as an initial definition of DevOps, but as an industry we are nowhere near agreement as to what DevOps really is, and there are several complementary visions. Let’s work through these visions one at a time so as to build toward a coherent vision for Disciplined DevOps that addresses the challenges faced by modern organizations. These visions are:
- The BizDevOps Vision
- The DevSecOps Vision
- The DevDataOps Vision
- Making Release Management and Support Explicit
- Disciplined DevOps
First and foremost, a key improvement over the basic DevOps vision is to explicitly bring your customers into the picture. This is something that is commonly referred to as “BizDevOps” (or BusDevOps) and its workflow is depicted in Figure 3. There are two important differences in this diagram:
- We’re clear that DevOps isn’t just for teams following an agile or continuous delivery lifecycle but is potentially applicable to any team following a lifecycle that supports incremental delivery. Having said that, a continuous delivery approach to development is certainly preferred.
- The workflow includes Business Operations, which are the activities of delivering products and services to your organization’s customers. There is little value in having a responsive IT organization if the rest of your enterprise isn’t able to take advantage of it. BizDevOps seeks to streamline the entire value stream, not just the IT portion of it.
Now let’s go in a different direction. Another common improvement over the basic DevOps vision is something called DevSecOps, the workflow for which is overviewed in Figure 4. The goal of DevSecOps is to ensure data security by improving the awareness and understanding of security issues, by adopting proactive security practices, and by incrementally identifying and addressing the most urgent security gaps [DevSecOps]. Security strategies that support DevOps include collaborative security engineers, exploit testing, real-time security monitoring, and building “rugged software” that has built-in security controls. In some ways security and DevOps have competing goals: Security wants to keep everything safe, whereas DevOps wants to enable quick, responsive changes to the marketplace. Ensuring safety will slow things down, yet being responsive increases the chance of inadvertently introducing security holes. Because DevOps will never be a replacement for formal security practices, it is important to find a viable middle ground as provided by DevSecOps.
Figure 4. The workflow of DevSecOps
Similarly, Data Management is often missing from the DevOps picture. To our knowledge no one has coined the term “DevDataOps” so we’ll do so for the sake of expediency. The goal of DevDataOps is to ensure a fair balance between the competing needs of data management, which wants to provide timely and accurate information to your organization, and DevOps, which wants to be responsive to the marketplace. Where DevSecOps is a safety vs. flexibility tradeoff, DevDataOps is an accuracy vs. flexibility tradeoff. The DevDataOps workflow is depicted in Figure 5. Supporting data management activities include the definition, support, and evolution of data and information standards and guidelines; the creation, support, evolution, and operation of data sources of record within your organization; and the creation, support, evolution, and operation of data warehouse (DW)/business intelligence (BI) solutions.
Figure 5. The workflow of DevDataOps
Let’s consider a fourth viewpoint to extending DevOps. Some people choose to include Release Management and Support (help desk) activities in with the IT Operations efforts. Although this is a perfectly fine decision, our experience is that in doing so you risk downplaying these important activities. Furthermore, in large organizations Release Management can become a critical activity. When you have a handful of delivery teams coordinating their releases, it is straightforward. But when there are dozens or even hundreds of solution delivery teams working in parallel it can be very challenging to ensure that their releases go smoothly. In Figure 6 we explicitly call out Release Management, IT Operations, and Support to make the workflow clear.
Figure 6. The workflow of DevOps with explicit Release Management and Support
We’re now in a position to understand what a Disciplined DevOps strategy actually entails. Figure 7 depicts the workflow of Disciplined DevOps, which is a combination of the workflows of Figures 3 through 6. Our point is that the BizDevOps, DevSecOps, and DevDataOps strategies all have their merits as does the strategy of making Release Management, Support, and IT Operations explicit.
Figure 7. The workflow of Disciplined DevOps
3. DevOps Strategies Throughout Disciplined Agile
This section overviews a collection of DevOps-friendly strategies. We’ve organized them into the following categories:
- General Strategies
- Teaming Strategies
- Development Strategies
- Operations Strategies
- Support (Help Desk) Strategies
- Release Management Strategies
- Data Management Strategies
- Security Strategies
- Enterprise Architecture Strategies
Table 1 below lists the strategies covered in this article.
Table 1. Potential DevOps Strategies
There are several “general” strategies that support Disciplined DevOps:
- Collaborative work. A fundamental philosophy of DevOps is that developers, operations staff, and support people must work closely together on a regular basis. An implication is that they must see one another as important stakeholders and actively seek to work together. A common practice within the agile community is “onsite customer,” adopted from Extreme Programming (XP), which motivates agile developers to work closely with the business. Disciplined agilists take this one step further with the practice of active stakeholder participation, which says that developers should work closely with all of their stakeholders, including operations and support staff, not just business stakeholders. This is a two-way street: Operations and support staff must also be willing to work closely with developers.
- Automated dashboards. The practice of using automated dashboards is called IT intelligence, effectively the application of business intelligence (BI) strategies for IT. There are two aspects to this, development intelligence and operational intelligence. Development intelligence requires the use of development tools that are instrumented to generate metrics; for example, your configuration management (CM) tools already record who checked in what and when they did it. Continuous integration tools could similarly record when a build occurred, how many tests ran, how long the tests ran, whether the build was successful, how many tests were successful, and so on. This sort of raw data can then be analyzed and displayed in automated dashboards. Operational intelligence is an aspect of solution monitoring, discussed later in this article. With automated dashboards, an organization’s overall metrics overhead can be dramatically reduced (although not completely eliminated because not everything can be automated). Automated dashboards provide real-time insight to an organization’s governance teams.
- Integrated configuration management. Developers and operations people often have different views about configuration management (CM): developers see CM as a collection of strategies to manage the assets that they create when building or configuring a solution, whereas operations people see CM as a collection of strategies for managing the assets that make up the IT production infrastructure of your organization. With an integrated approach to CM we combine both views. This can be a major change for some developers because they’re often used to thinking about CM only in terms of the solution they are currently working on. In a DevOps environment, developers need to be enterprise aware and look at the bigger picture. How will their solution work with and take advantage of other assets in production? Will other assets leverage the solution being developed? The implication is that development teams will need to understand, and manage, the full range of dependencies for their product. Integrated configuration management enables operations staff to understand the potential impact of a new release, thereby making it easier to decide when to allow the new release to occur.
- Integrated change management. From an IT perspective, change management is the act of ensuring successful and meaningful evolution of the IT infrastructure to better support the overall organization. This is tricky enough at a project-team level because many technologies, and even versions of similar technologies, will be used in the development of a single solution. Because DevOps brings the enterprise-level issues associated with operations into the mix, an integrated change management strategy can be far more complex, due to the need to consider a large number of solutions running and interacting in production simultaneously. With integrated change management, development teams must work closely with operations teams to understand the implications of any technology changes at an organization level. This approach depends on the earlier practices of active stakeholder participation, integrated configuration management, and automated testing.
- Training, education, and mentoring. As you would expect, people will need help to learn and adopt your DevOps strategies.
- Continuous improvement. Disciplined agile teams strive to learn from their experiences as well as from others so that they can continuously improve the way that they work together, including how they approach DevOps.
- One team. An important aspect of the DevOps mindset is shifting away from a “them and us mindset” to an “us mindset.” We all work together as a single, streamlined team. An extreme form of this is the “you build it, you run it” philosophy where there are no separate development, operations, or data administration teams but instead product teams who are responsible for the entire lifecycle of a product.
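To make the automated dashboards strategy above a little more concrete, here is a minimal sketch of rolling raw continuous integration records up into the kind of per-team summary a dashboard might display. The `BuildRecord` fields, the `summarize_builds` function, and the sample data are all hypothetical and chosen for illustration; real CI tools expose their own data formats.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class BuildRecord:
    """One raw record as a CI tool might log it (hypothetical shape)."""
    team: str
    succeeded: bool
    tests_run: int
    tests_passed: int
    test_seconds: float


def summarize_builds(records):
    """Roll raw build records up into per-team dashboard metrics."""
    by_team = {}
    for record in records:
        by_team.setdefault(record.team, []).append(record)

    summary = {}
    for team, builds in by_team.items():
        total_tests = sum(b.tests_run for b in builds)
        total_passed = sum(b.tests_passed for b in builds)
        summary[team] = {
            "builds": len(builds),
            # Fraction of builds that completed successfully.
            "build_success_rate": sum(b.succeeded for b in builds) / len(builds),
            # Fraction of all executed tests that passed.
            "test_pass_rate": total_passed / total_tests if total_tests else 0.0,
            "avg_test_seconds": mean(b.test_seconds for b in builds),
        }
    return summary


# Illustrative data only; team names and numbers are made up.
SAMPLE_RECORDS = [
    BuildRecord("payments", True, 200, 200, 90.0),
    BuildRecord("payments", False, 200, 190, 110.0),
    BuildRecord("search", True, 50, 50, 30.0),
]

if __name__ == "__main__":
    for team, metrics in summarize_builds(SAMPLE_RECORDS).items():
        print(team, metrics)
```

Rolling the same summaries up across teams would give the portfolio-level view that governance teams tend to want, although which metrics matter is a decision for each organization.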
There are several teaming strategies that you can choose to adopt when it comes to getting development professionals and operations professionals to work together. Starting with the least effective and working our way to the most effective, they are:
- Production hand-off. When a development team releases a solution into production the operations team takes on the responsibility for running and supporting the solution. At this point the development team is often disbanded or moves on to another effort. A sustainment team of one or more developers may be formed to perform maintenance updates as needed over time, or the responsibility to do this work is given to an existing sustainment team. The advantage of this approach is that your organization no longer has to fund the full development team moving forward. However, you risk losing the knowledge and expertise of the team that is required to maintain and evolve the solution over time. This can be particularly problematic when there are high-severity defects to be fixed.
- Warranty period. With this strategy the development team commits to fixing critical defects for a pre-defined period of time after the solution is released into production. For example, a development team may be required to fix any severity 1 or severity 2 defects free of charge for the first thirty days following a production release. Warranty periods are often combined with the production hand-off strategy to reduce the risks associated with it. Warranty periods are also common when development teams are funded via a fixed-price funding model or in outsourcing situations because the stakeholders typically want to ensure that they received the level of quality that they paid for.
- Production support. In enterprise environments most application development teams are working on new releases of a solution that already exists in production. Not only will they be working on the new release, they will also have the responsibility of addressing serious production problems that are escalated to them. The development team will often be referred to as “level three support” for the application because they will be the third (and last) team to be involved with fixing critical production problems. The primary advantage is that production emergencies associated with a specific solution are often resolved by the most qualified people, the actual developers of that solution. Another advantage is that it gives developers an appreciation of the kinds of things that occur in production, providing them with learning opportunities to improve the way that they design solutions in the first place. A potentially significant disadvantage is that the need to fix production emergencies will distract the development team from working on new functionality.
- Developer-led operations. This strategy turns up the dial on production support by having the development team be responsible for operating and supporting their own solution. This is often referred to as “you build it, you run it.” This strategy has the benefits that it focuses the team on ensuring that their solution is easy to operate and support and it ensures that the most qualified people are the ones evolving the solution. However, this strategy can result in Scrum teams producing silo solutions running on disparate platforms. Luckily, DAD teams are enterprise aware and include someone in the role of architecture owner who will guide the team in avoiding this very sort of architecture mistake. Another common strategy is to include someone with strong operations experience in your team. A developer-led operations strategy also runs the risk of varying levels of support quality, as some teams will be better than others at this. Once again, teams that are enterprise aware will be following common guidelines and will reach out to other teams for help in improving their approach.
Of the four approaches listed above, the only one that is clearly a DevOps strategy is developer-led operations. The production support strategy is definitely a step in the right direction and is often seen as sufficient in many enterprises. If this is the case in your organization we recommend that you experiment with the developer-led operations strategy on a few teams to see how well it works for you. We suspect that you’ll be pleasantly surprised.
There are several common development practices that support Disciplined DevOps:
- Canary tests. A canary test is a small experiment where new functionality is deployed to a subset of end users so you can determine whether that functionality is of interest to them. This in turn provides insight to the development team as to the true potential value of the functionality (if any). For example, an e-commerce company might believe that a new feature where people can buy two related items at a discount will help to increase sales. At the same time they fear this could decrease overall revenue. So they decide to run a canary test where 5% of their customers are provided this functionality for a two-week period. Sales and revenue are tracked and compared against customers not given access to this new functionality. If a new feature successfully passes a canary test it is then made available to a wider range of end users (you may choose to run several rounds of canary tests before finally deploying the functionality to all users). You can think of canary testing as an extreme form of pilot testing.
- Split tests. A split test, also known as an A/B test, is an experiment where two or more options are run in parallel so that their effectiveness can be compared. For example, a bank may identify three different screen design strategies to transfer funds between two accounts via an automated teller machine (ATM). Instead of holding endless meetings, focus groups, or modelling sessions, the bank decides to implement all three strategies and put them into production in parallel. When I use an ATM I’m always presented with strategy A, when you log in you always get strategy B, and so on. Because the ATM solution is instrumented to track important usage metrics the bank is able to determine which of the three strategies is most effective. After the split test is completed the winning strategy is made available to all users of ATMs.
- Automated regression testing. Agile software developers are said to be “quality infected” because of their focus on writing quality code and their desire to test as often and early as possible. As a result, automated regression testing is a common practice adopted by agile teams, which is sometimes extended to test-first approaches such as test-driven development (TDD) and behavior-driven development (BDD). The regression test suite(s) may address function testing, performance testing, system integration testing (SIT), acceptance testing, and many more categories of tests. Because agile teams commonly run their automated test suites many times a day, and because they fix any problems they find right away, they enjoy higher levels of quality than teams that don’t. Because some tests can take a long time to run, in particular load/stress tests and performance tests, a team will often choose to have several test suites running at different cadences (i.e. some tests run at every code check-in, some tests run at scheduled times each day, some once every evening, some over the weekend, and so on). This greater focus on quality is good news for operations staff who insist that a solution must be of sufficient quality before approving its release into production.
- Continuous integration (CI). Continuous integration (CI) is the discipline of building and validating a project automatically whenever a file is checked into your configuration management (CM) system. As you see in the following diagram, validation can occur via several strategies such as automated regression testing and even static or dynamic code and schema analysis. CI enables developers to develop a high-quality working solution safely in small, regular steps by providing immediate feedback on code defects.
- Continuous deployment (CD). Continuous deployment extends the practice of continuous integration. With continuous deployment, when your integration is successful in one sandbox your changes are automatically promoted to the next sandbox. The CI strategy running in that environment automatically integrates your solution there because of the updated source files. As you can see in the following diagram this automatic promotion continues until the point where any changes must be verified by a person, typically at the transition point between development and operations. Having said that, advanced teams are now automatically deploying into production as well. Continuous deployment enables development teams to reduce the time between a new feature being identified and being deployed into production. It enables the business to be more responsive. However, when development teams aren’t sufficiently disciplined continuous deployment can increase operational risk by increasing the potential for defects to be introduced into production. Successful continuous deployment in an enterprise environment requires an effective continuous integration strategy in place in all sandboxes.
- Development intelligence. This is the application of data warehouse (DW)/business intelligence (BI) solutions to provide insight into how delivery teams are working. The automated team dashboards provided by many development platforms are a simple form of development intelligence. A more sophisticated (and useful) strategy is to combine information from various development tools and display it in an integrated dashboard for the team, and more sophisticated yet is to roll up/combine information from different delivery teams into a portfolio management dashboard. Development intelligence is a subset of IT Intelligence.
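Both canary tests and split tests, described above, need a stable way to decide which end users see which functionality, so that the same person always gets the same experience. The article does not prescribe a mechanism; one common approach, assumed here purely for illustration, is to hash the user identifier into a bucket. The function names, user IDs, and experiment names below are hypothetical.

```python
import hashlib


def bucket_for(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_canary(user_id: str, experiment: str, percent: int) -> bool:
    """True when the user falls inside the canary cohort (e.g. percent=5)."""
    return bucket_for(user_id, experiment) < percent


def split_variant(user_id: str, experiment: str, variants: list) -> str:
    """Pick one variant for a split test; a given user always gets the same one."""
    return variants[bucket_for(user_id, experiment, buckets=len(variants))]


if __name__ == "__main__":
    # Made-up user IDs; roughly 5% of them should land in the canary cohort.
    users = [f"user-{n}" for n in range(10_000)]
    canary = [u for u in users if in_canary(u, "bundle-discount", percent=5)]
    print(f"{len(canary)} of {len(users)} users are in the canary cohort")
    print(split_variant("user-42", "atm-transfer-screen", ["A", "B", "C"]))
```

Including the experiment name in the hash means a user who lands in one canary cohort is not automatically in every other one, which helps keep experiments from contaminating each other.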
There are also several common operations-friendly features that developers with a Disciplined DevOps mindset will choose to build into their solutions:
- Feature access control. To support experimentation strategies such as canary tests and split tests it must be possible to limit end user access to certain features. This strategy must be easy to configure and deploy. A common approach is to have XML-based configuration files that are read into memory and that contain the meta-data required to drive an access control framework.
- Monitoring instrumentation. Developers with a Disciplined DevOps mindset will build instrumentation functionality — logging and better yet real-time alerts — into their solutions. The purpose is to enable monitoring, in (near) real-time, of their systems when they are operating in production. This is important to the people responsible for keeping the solution running, to people supporting the solution, to people responsible for debugging and fixing any problems, and to your operational intelligence efforts. Monitoring instrumentation enables canary tests and split tests in that it provides the data required to determine the effectiveness of the feature or strategy under test.
- Feature toggles. A feature toggle is effectively a software switch that allows you to turn features on (and off) when appropriate. A common strategy is to turn on a collection of related functionality that provides a value stream, often described by an epic or use case, all at once when end users are ready to accept it. Feature toggles are also used to turn off individual features when it’s discovered that the feature isn’t performing well (perhaps the new functionality isn’t found to be useful by end users, perhaps it results in lower sales, and so on). Another benefit of feature toggles is that they enable you to test and deploy functionality into production on an incremental basis.
- Self-testing. One strategy to make a solution more robust, and thus easier to operate, is to make it self-testing. The basic idea is that each component of a solution includes basic tests to validate that it can properly run while in production. For example, an application server may run basic tests at startup such as verifying the version of the operating system or of frameworks that it relies on. While the server is running it might regularly check to see if other components that it relies on, such as data sources and middleware services, are available. When a problem is detected it should at a minimum be logged, better yet an alert should be posted if intervention by a person is required, and even better yet the solution should try to recover from the problem.
- Self-recovery. When a system runs into a problem it should do its best to automatically recover and continue on as before. For example, if the system detects that a data source is no longer available it should try to restart that data service. If that fails, it should record change transactions where possible and then process them once the data service becomes available again. A good example of this is an ATM. When ATMs lose their connection to a bank’s financial processing system they will continue on for a period of time independently, albeit with limited functionality. They will allow people to withdraw money from their accounts, perhaps putting a limit on the amount withdrawn to limit potential problems with overdrawn accounts. People will still be able to deposit money but will not be able to get a current balance or see a statement of recent transactions. Self-recovery functionality provides a better experience to end users and reduces the operational burden on your organization.
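The self-testing and self-recovery behavior described above (check that a data source is available, try to restart it, and otherwise record transactions for later processing) can be sketched as follows. This is a minimal illustration under stated assumptions: `FlakyDataSource` is an in-memory stand-in for a real data service, and both class names and their methods are hypothetical rather than part of any particular framework.

```python
from collections import deque


class FlakyDataSource:
    """Stand-in for a real data service (hypothetical); can be switched off."""

    def __init__(self):
        self.available = True
        self.restartable = True
        self.committed = []

    def ping(self) -> bool:
        return self.available

    def restart(self) -> bool:
        if self.restartable:
            self.available = True
        return self.available

    def write(self, transaction):
        if not self.available:
            raise ConnectionError("data source unavailable")
        self.committed.append(transaction)


class SelfRecoveringWriter:
    """Self-tests its data source before each write and queues work during outages."""

    def __init__(self, source):
        self.source = source
        self.pending = deque()
        self.log = []

    def _source_ready(self) -> bool:
        if self.source.ping():
            return True
        self.log.append("data source down, attempting restart")
        return self.source.restart()

    def submit(self, transaction) -> bool:
        """Write now if possible; otherwise record the transaction for later."""
        if not self._source_ready():
            self.log.append(f"queued {transaction!r}")
            self.pending.append(transaction)
            return False
        self.flush()
        self.source.write(transaction)
        return True

    def flush(self):
        """Replay queued transactions in their original order."""
        while self.pending and self.source.ping():
            self.source.write(self.pending[0])
            self.pending.popleft()  # drop only after a successful write


if __name__ == "__main__":
    source = FlakyDataSource()
    writer = SelfRecoveringWriter(source)
    writer.submit("deposit 100")
    source.available = False
    source.restartable = False  # simulate an outage that a restart cannot fix
    writer.submit("withdraw 40")
    source.restartable = True   # the outage ends
    writer.submit("deposit 10")
    print(source.committed)
    print(writer.log)
```

A production version would also need to persist the queue and raise an alert when a restart fails; those concerns are omitted here to keep the sketch short.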
Now that we have overviewed a collection of development practices and implementation features, let’s explore strategies that streamline your operations efforts.
There are several technical operations strategies that support a Disciplined DevOps mindset:
- Solution monitoring. As the name suggests, this is the operational practice of monitoring running solutions and applications once they are in production. Technology infrastructure platforms such as operating systems, application servers, and communication services often provide monitoring capabilities that can be leveraged by monitoring tools (such as Microsoft Management Console, IBM Tivoli Monitoring, and jManage). However, for monitoring application-specific functionality, such as what user interface (UI) features are being used by given types of users, instrumentation that is compliant with your organization’s monitoring infrastructure will need to be built into the applications. Development teams need to be aware of this operational requirement or, better yet, have access to a toolkit that makes it straightforward to provide such instrumentation.
- Standard platforms. Software development practices, such as continuous deployment and initial architecture envisioning, are enabled by consistency within your operational infrastructure. It is much easier to deploy to a handful of standard hardware configurations than it is to a myriad of unique ones. It is easier to deploy when there are consistent versions of infrastructure software (e.g. operating systems, databases, middleware, and so on) deployed across your environment. For example, all instances of your Oracle DB run the same version, rather than several different versions being installed in various places. Furthermore, it is much easier to make architecture decisions when there is consistency of infrastructure software packages in the first place. For example, you standardize on Linux for your server operating system; you don’t also have Windows, z/OS, and others in production (and if you do, you’re actively retiring them).
- Deployment testing. After a solution, or an update to a component of your operational infrastructure, has been deployed you should run a quick set of tests to verify that the deployment was successful. Were the right versions of the files installed where they need to be? And were they deployed to all appropriate servers? Were database transformations applied successfully? Did the appropriate announcements, if any, get sent out? Did the overall deployment process run within the desired time frame?
- Automated deployment. Deployments should be automated, not manual. This increases the consistency of your deployments and supports the practice of continuous deployment. Part of your automation effort should be to support both self-recovery and self-testing as native aspects of your deployment strategy.
- Operations intelligence. This is the application of data warehouse (DW)/business intelligence (BI) solutions to provide insight into your operations and support efforts. Your operations team may have individual dashboards for each solution, they may combine information being generated by individual solutions into an integrated dashboard, and better yet share that information for an IT portfolio management view. Operations intelligence is a subset of IT Intelligence.
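The questions listed under deployment testing can be turned into an automated check that runs right after each deployment. The following is a minimal sketch, not a definitive implementation: `ServerState`, `verify_deployment`, the file name, and the version numbers are all hypothetical, and a real check would query your servers and database rather than in-memory objects.

```python
from dataclasses import dataclass, field


@dataclass
class ServerState:
    """Hypothetical snapshot of what a deployment left on one server."""
    name: str
    file_versions: dict[str, str] = field(default_factory=dict)
    schema_version: int = 0


def verify_deployment(servers, expected_files, expected_schema,
                      duration_s, max_duration_s):
    """Return a list of failure messages; an empty list means success."""
    failures = []
    for server in servers:
        # Were the right versions of the files installed on this server?
        for path, version in expected_files.items():
            actual = server.file_versions.get(path)
            if actual != version:
                failures.append(
                    f"{server.name}: {path} is {actual!r}, expected {version!r}")
        # Were the database transformations applied?
        if server.schema_version != expected_schema:
            failures.append(
                f"{server.name}: schema {server.schema_version}, "
                f"expected {expected_schema}")
    # Did the deployment run within the desired time frame?
    if duration_s > max_duration_s:
        failures.append(
            f"deployment took {duration_s}s, limit is {max_duration_s}s")
    return failures


# Illustrative data only: app-2 was left on an older file version.
servers = [
    ServerState("app-1", {"app.jar": "2.4.0"}, schema_version=17),
    ServerState("app-2", {"app.jar": "2.3.9"}, schema_version=17),
]
failures = verify_deployment(servers, {"app.jar": "2.4.0"}, 17,
                             duration_s=95, max_duration_s=120)
for message in failures:
    print(message)
```

Returning every failure, rather than stopping at the first, gives operators the full picture of a partial deployment in one pass.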
There is a collection of operations strategies focused on operational disaster mitigation that IT departments may choose to adopt:
- Disaster planning. Disciplined organizations will plan for operational disasters. Potential disasters include servers going down, network connectivity going down, power outages, failed solution deployments, failed infrastructure deployments, natural disasters such as fires and floods, terrorist attacks, and many more. This planning will include identification of potential problems, identification of strategies to address those problems, and putting mechanisms in place to hopefully mitigate the disasters. Potential strategies to address these disasters include building solutions that self-test and self-recover, building redundancies into your operational infrastructure, having disaster procedures in place, and practicing those procedures in simulated disasters.
- Scheduled disaster simulation. It is one thing to have disaster mitigation plans in place; it is another to know whether they actually work. Disciplined organizations will run through disaster scenarios to verify how well their mitigation strategies work in practice. For example, to test whether your power outage emergency plan works you would purposely simulate a power outage at one of your data centers and then work through your recovery plan. Like fire drills, these simulations should be done on a regular basis so that staff members build up the “body memory” required to act swiftly and appropriately in an emergency. The advantage of a scheduled disaster simulation is that you knowingly run it at a time when it will have minimal impact on your stakeholders. A disadvantage, at least when people are informed of the simulation ahead of time, is that people are mentally prepared for it rather than caught unaware, so you don’t simulate the real level of stress that they would be under during an actual emergency.
- Random disaster simulation. Very disciplined organizations will implement a service within their operational environment that causes problems such as server or service outages at random times. An example of this is Chaos Monkey, a tool developed by Netflix to run against its Amazon Web Services (AWS) infrastructure; similar functionality is being implemented within many organizations now. Chaos Monkey injects random problems into production to verify that the IT operations organization is capable of overcoming them. This is done to verify that your solutions really are able to recover automatically from problems and, failing that, that operators are at least alerted to the problem.
As you would expect, truly disciplined organizations have adopted all of these strategies.
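The random disaster simulation strategy can be illustrated with a small sketch. This is not how Chaos Monkey itself is implemented; it is a toy model, under the assumption that each service either recovers on its own or stays down. The `Service` class and `run_chaos_round` function are hypothetical names introduced for illustration.

```python
import random


class Service:
    """Hypothetical in-memory stand-in for a running service."""

    def __init__(self, name, self_recovering):
        self.name = name
        self.self_recovering = self_recovering
        self.healthy = True

    def kill(self):
        self.healthy = False

    def attempt_recovery(self):
        # A self-recovering service restarts itself; others stay down.
        if self.self_recovering:
            self.healthy = True
        return self.healthy


def run_chaos_round(services, rng):
    """Take down one randomly chosen service and report what happened."""
    victim = rng.choice(services)
    victim.kill()
    recovered = victim.attempt_recovery()
    # If the service cannot recover, operators must at least be alerted.
    alert = None if recovered else f"ALERT: {victim.name} did not recover"
    return victim.name, recovered, alert


services = [
    Service("catalog", self_recovering=True),
    Service("billing", self_recovering=False),
]
name, recovered, alert = run_chaos_round(services, random.Random())
print(name, "recovered" if recovered else alert)
```

Passing the random number generator in as a parameter lets a test seed it, while production use keeps the failures unpredictable.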
There are several solution support (help desk) strategies, which can be combined, that you may choose to adopt. These options are:
- Online information. A very common “self serve” support strategy is to develop and maintain online assets such as frequently asked questions (FAQ) pages, training videos, and user manuals, to name a few. This enables end users to potentially support themselves, although it suffers from the TAGRI (They Ain’t Gonna Read It) syndrome.
- Online discussion forums. Many organizations choose to implement internal discussion forums so that their end users can help each other in learning how to use their systems. This is effectively a collaborative self-serve support option for end users. The primary advantage is that your “power users”, or in some cases members of the development team, will come to the aid of other users who are struggling with an issue. A potential disadvantage is that you risk your discussion forum becoming a complaints forum if problems aren’t addressed in a timely manner.
- Asynchronous support. With asynchronous support strategies an end user will put in a request for help and then sometime later somebody gets back to them with help (hopefully). Common ways to implement asynchronous support include implementing a standard support email or a support request page/screen. It is common in many organizations to put a service level agreement (SLA) in place putting limits on how long people will need to wait for help.
- Synchronous support. With synchronous support strategies end users are put in contact with support people (who may even be one of the application developers) in real-time. This is often done via online chat software, video conferencing, or telephone calls. The key advantage of synchronous support is responsiveness. However, synchronous support can be expensive to operate and potentially frustrating for end users, particularly when the support desk function is outsourced to people following scripts.
- Support alerts. With this strategy your solution itself detects serious problems affecting end users, such as a data source or a service/component being unavailable. When such an event occurs, and the solution isn’t able to swiftly recover, the end user is informed of the problem and presented with a “Would you like help?” option. If yes, they are put in direct contact with an appropriate support person who then helps them in real-time. This is part of your solution’s self-recovery process.
- Developer-led support. This strategy has development teams performing the support services for their own solutions and was described previously in DevOps Strategies: Development.
Furthermore, anyone doing solution support, even if it is the development team itself, is likely to need an environment in which they can reproduce problems that end users experience. There are several options available to you:
- Production. In some cases your production environment is sufficient, although many regulatory regimes, particularly life-critical and financial-critical ones, will not allow this.
- Pre-production test sandbox. Some support teams will find that they can use their pre-production test environment to try to simulate production problems. The advantage is that you don’t put production at risk when trying to reproduce problems. The disadvantage is that the test environment will differ from production, and as a result you may not be able to simulate all reported problems.
- Support sandbox. Some organizations choose to have a specific environment set up to enable support staff to simulate production problems. This strategy has the same trade-offs as using a pre-production test sandbox plus the additional cost and maintenance associated with yet another environment.
There are four general release scheduling strategies that potentially support DevOps. These strategies, from least effective to most effective, are:
- Release windows (slow cadence). A release window is a period of time during which one or more teams may release into production. A release slot is a subset of that release window (and may be the entire window) during which a team may deploy their solution into production. For example, your organization may have a policy that production releases occur between 1am and 5am on Saturday mornings (the release window) and that up to four releases may occur during that window (the release slots). In lean terms, a release slot is effectively a Kanban card that allows a team to deploy. Release windows tend to align with periods where system usage is lower, although in the modern world of global 24/7 operations these periods have all but disappeared. With a slow cadence approach to this strategy the release windows occur far apart, as seldom as once a week or even longer. The advantages of this approach are that it provides a consistent release cadence to business stakeholders and it provides consistent release date targets for delivery teams. The primary disadvantage with slow cadence release windows is that they become bottlenecks for release management processes that need to support multiple teams. There are only so many release slots available during each window and this number can be easily exceeded, forcing teams to aim for a future release window. This problem is exacerbated when teams start to move to a continuous delivery strategy.
- Release train. The idea with a release train is that every team involved with that “train” has the same release cadence: for example, this train releases once a quarter, or once a month, or even once a week. This strategy is commonly used in large programs, or teams of teams, where the individual teams are each working on part of a larger whole. Having the common drumbeat of a release train provides a consistent release schedule for stakeholders and serves as a rallying point for development teams. The train metaphor works quite well in practice. If your team misses the release date, if you miss the train, then the train goes on without you and you need to wait for space on the next one. Dependencies are also respected. For example, if several components need to ship together they must all go on the same train (similar to a family taking a trip together). The primary disadvantage is that development teams are constrained to a common release schedule, making it difficult to support lean or continuous delivery strategies. A potential disadvantage is that release trains may also suffer from the bottleneck problems of slow cadence release windows.
- Release windows (quick cadence). To support continuous deployment, particularly across many delivery teams, you will need a large number of release slots. The implication is that you will also likely need more release windows more often. The advantage of quick cadence release windows is that they are less likely to suffer from the bottleneck challenges associated with slow cadence release windows and release trains.
- Continuous release availability. With this approach delivery teams are allowed to release their solutions into production whenever they need to. In many ways this is simply an extension of the release window strategy to be 24/7. This is the only strategy that truly supports continuous delivery. To make it work a host of DevOps practices are required, such as fully automated deployment, fully automated regression testing, feature toggles, self-recovering components, and many others.
Our experience is that most enterprises today employ a slow cadence release window approach, although they are starting to evolve toward the quick cadence version of this strategy. This is usually motivated by the adoption of agile techniques by solution delivery teams and more often than not by continuous delivery practices. We also see large programs take a release train approach, a strategy pioneered in the 1990s by large software companies such as Microsoft and Rational that sold software suites comprised of many products that needed to be shipped together. In recent years the OpenUP and SAFe frameworks have popularized the release train strategy. The strategy of continuous release availability is commonly used in advanced DevOps organizations such as Etsy and Amazon.
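The bottleneck effect of release windows can be made concrete with a small scheduling sketch. It assumes, purely for illustration, that slots are handed out first come, first served; the function name `schedule_releases`, the team names, and the window labels are hypothetical.

```python
from collections import deque


def schedule_releases(requests, windows, slots_per_window):
    """Assign release requests to windows in first-come, first-served order.

    requests: team names in the order they asked for a slot.
    windows: window labels in chronological order.
    Returns (schedule, unscheduled), where schedule maps each window
    to the teams that hold a release slot in it.
    """
    queue = deque(requests)
    schedule = {}
    for window in windows:
        taken = min(slots_per_window, len(queue))
        schedule[window] = [queue.popleft() for _ in range(taken)]
    return schedule, list(queue)


requests = ["billing", "search", "checkout", "mobile", "reports"]
schedule, waiting = schedule_releases(requests, ["week 1", "week 2"],
                                      slots_per_window=2)
print(schedule)  # {'week 1': ['billing', 'search'], 'week 2': ['checkout', 'mobile']}
print(waiting)   # ['reports']
```

With two slots per window the fifth team has to aim for a later window. Adding windows or slots, which is what a quick cadence amounts to, empties the waiting list.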
In addition to the release scheduling strategies listed above, there are several technical release management strategies that support DevOps:
- Integrated deployment planning. From the point of view of development teams, deployment planning has always required interaction with an organization’s operations and release management staff; in some cases, via liaison specialists within operations often called release engineers. When you adopt a Disciplined DevOps mindset, you quickly realize the need to take a cross-team approach to deployment planning due to the need for operations staff to work with all of your development teams. This isn’t news to operations staff, but it can be a surprise to development teams accustomed to working in their own, siloed environments (luckily this strategy is built into DAD’s Accelerate Value Delivery process goal). If your team is not doing this already, you will need to start vying for release slots in the overall organizational deployment schedule.
- Standard development and testing environments based on production. Development teams know that the greater the consistency between their development, testing, and production environments the easier it is to test and deploy. In multi-team environments the implication is that this will result in de facto standardization of many aspects of your environments. Developers may choose different development tools, but aspects of the infrastructure such as operating systems, application servers, middleware, databases and so on will become consistent over time to streamline the overall release process.
- Release service streams. A key tenet of DAD is that every team is unique, and an implication of that is that some teams will need more help than others. Teams will produce different levels of quality, they will have different amounts of automation, they will have different release cadences, and so on. As a result your release management strategy needs to be flexible enough to address these different situations. One way to do so is to offer different service streams, or service levels as it were, to solution delivery teams. For example, you may have a basic release management service stream where release management engineers actively help delivery teams to deploy their solutions into production and even help them to start automating some of their processes. At the other end of the spectrum you may have a continuous delivery service stream for delivery teams that have fully automated their testing and deployment processes and that can be trusted to successfully deploy on their own. And of course you could have several other service streams between those two extremes. The advantage of this approach is that it is very flexible, albeit at the cost of slightly greater scheduling complexity.
- Release blackout periods. Some organizations have periods of time where they choose not to release new functionality into production unless it is absolutely critical. These blackout periods typically occur during periods of high business transaction volume. For example, many North American retail companies will have blackout periods between mid-November and early January for the holiday sales seasons. Many organizations will have blackout periods near the end of their fiscal years to enable them to focus on the process required to close out the year.
- Shared release practices. Although this is really a process improvement issue, it’s worthwhile to point out that whoever is involved with release management should actively strive to share effective practices between teams. Sharing learnings across teams is an important aspect of enterprise awareness.
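A release blackout policy is straightforward to enforce in an automated deployment pipeline. The sketch below assumes blackout periods are kept as inclusive date ranges and that critical releases are exempt, as described above. The function names and the specific dates are hypothetical.

```python
from datetime import date


def in_blackout(day, blackouts):
    """True if the day falls inside any (start, end) blackout range, inclusive."""
    return any(start <= day <= end for start, end in blackouts)


def may_release(day, blackouts, critical=False):
    """Critical releases are allowed even during a blackout period."""
    return critical or not in_blackout(day, blackouts)


# Illustrative holiday-season blackout for a retailer.
blackouts = [(date(2024, 11, 15), date(2025, 1, 5))]

print(may_release(date(2024, 10, 1), blackouts))                 # True
print(may_release(date(2024, 12, 1), blackouts))                 # False
print(may_release(date(2024, 12, 1), blackouts, critical=True))  # True
```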
In the Disciplined Agile Delivery (DAD) toolkit Data Management is a Run (operational) activity that focuses on the execution of data-oriented architectures, policies, and processes. Note that the long-term planning efforts around data-oriented aspects of your organization are part of your Enterprise Architecture efforts. Similarly, development of the data-oriented aspects of your organizational ecosystem is addressed by solution delivery teams.
Because data management is an important aspect of your Run endeavors it will be affected by your Disciplined DevOps strategy. Our experience is there are several data management strategies that support DevOps:
- Data and information guidelines. A straightforward way to promote greater consistency in the development and application of data and information sources is to have common guidance that teams will adopt and then follow. This guidance, including both standard policies and guidelines, will need to be defined, supported, and evolved over time in a collaborative and open manner.
- Quality data sources. Your production data sources, including files, databases, and data feeds, should be high quality assets that are easy to work with. When it comes to data sources of record it is particularly important for them to be of high quality so that they are easy to work with and evolve. Unfortunately this is often little more than fanciful thinking in many organizations. With a Disciplined DevOps mindset teams realize that they should be very careful about increasing the technical debt within their data sources, and more importantly invest in the effort to pay down any technical debt that they find.
- IT intelligence. IT intelligence is the creation, support, evolution, and operation of data warehouse (DW)/business intelligence (BI) solutions that support the management and governance of your IT efforts. From a Disciplined DevOps perspective there are two important aspects of IT intelligence: development intelligence, which provides insight into how delivery teams are working, and operational intelligence, which provides insight into what occurs in production. The automated team dashboards provided by many development platforms are a simple form of development intelligence. A more sophisticated (and useful) strategy is to combine information from various development tools and display it in an integrated dashboard for the team, and more sophisticated yet is to roll up information from different delivery teams into a portfolio management dashboard. Similarly, your operations team may have individual dashboards for each solution, they may combine information being generated by individual solutions into an integrated dashboard, and better yet they may share that information for an IT portfolio management view.
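Rolling team-level information up into a portfolio view can be as simple as aggregating a few shared measures. The following sketch assumes each team reports a deployment count and a failed-deployment count. The metric names, team names, and numbers are made up for illustration, and a real DW/BI solution would draw on far richer data.

```python
def portfolio_view(team_metrics):
    """Combine per-team deployment counts into one portfolio summary."""
    deployments = sum(m["deployments"] for m in team_metrics.values())
    failed = sum(m["failed"] for m in team_metrics.values())
    # Guard against division by zero when no team has deployed yet.
    failure_rate = failed / deployments if deployments else 0.0
    return {"deployments": deployments, "failed": failed,
            "failure_rate": failure_rate}


team_metrics = {
    "payments": {"deployments": 40, "failed": 2},
    "search": {"deployments": 10, "failed": 3},
}
summary = portfolio_view(team_metrics)
print(summary)  # {'deployments': 50, 'failed': 5, 'failure_rate': 0.1}
```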
There are several Security strategies that support Disciplined DevOps you may wish to adopt:
- Build “rugged software.” Rugged software is a recent movement in the IT industry that recognizes the need for robustness, quality and security. An implication of this is that software-based solutions should have appropriate security control features built in, including but not limited to access control, monitoring, validated input, and sanitized data transfers.
- Automated separation of duties (SoD). The need for regulatory compliance, particularly around security, is very common. Standards such as Payment Card Industry Data Security Standard (PCI DSS) or ISO 27001 typically require separation of duties (SoD). Although much is made of the issue that the person who develops something should be different from the person who deploys it, a continuous deployment (CD) strategy where things are deployed by your tools (and appropriate logging occurs) can still pass a compliance audit. In fact, this level of automation tends to provide better SoD control than what you find when people are involved with manually running scripts.
- Collaborative security engineers. As with other enterprise IT staff — such as enterprise architects, reuse engineers, or data managers — security engineers can and should collaborate closely with the teams they support. They should actively strive to transfer their skills and knowledge whenever they can so as to enable teams to be as self-sufficient as possible.
- Exploit testing. Also known as penetration testing, the goal is to simulate common ways that attackers can exploit potential security gaps. It is common to have such testing tools as part of your continuous integration (CI) strategy.
- Real-time security monitoring. Your operational systems should be monitored in real-time for potential attacks/exploits. This is an important aspect of your operational intelligence.
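When deployments are performed by tools and logged, separation of duties can be checked mechanically against the log. The sketch below is an assumption-laden illustration, not a statement of what PCI DSS or ISO 27001 require: the `Change` record, its fields, and the two rules checked are hypothetical stand-ins for whatever your auditors expect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical audit-log record for one deployed change."""
    change_id: str
    author: str
    approver: str
    deployed_by: str  # typically the pipeline's service account


def sod_violations(changes):
    """Return (change_id, reason) pairs for changes that break SoD."""
    violations = []
    for change in changes:
        if change.approver == change.author:
            violations.append((change.change_id, "author approved own change"))
        if change.deployed_by == change.author:
            violations.append((change.change_id, "author deployed own change"))
    return violations


changes = [
    Change("CHG-1", author="alice", approver="bob", deployed_by="deploy-bot"),
    Change("CHG-2", author="carol", approver="carol", deployed_by="deploy-bot"),
    Change("CHG-3", author="dave", approver="erin", deployed_by="dave"),
]
for change_id, reason in sod_violations(changes):
    print(change_id, reason)
```

Running such a check as part of the pipeline, rather than at audit time, surfaces violations while they are still cheap to correct.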
The Disciplined Agile Delivery (DAD) toolkit explicitly includes architecture-related activities, the role of Architecture Owner, and promotes the philosophy of enterprise awareness. Our experience is that agile enterprise architecture proves to be a key enabler for organizations in the process of adopting a Disciplined DevOps mindset.
In addition to general DevOps strategies, there are several enterprise architecture activities that support DevOps:
- Reuse mindset. An important thing that your enterprise architecture efforts will accomplish is the promotion of a reuse mindset within IT, and throughout your organization in general. Delivery teams with a reuse mindset strive to leverage existing data sources, services, components, frameworks, templates, and many other assets. This reuse mindset is enabled through education, coaching and mentoring by your enterprise architects (who are ideally active members of IT delivery teams in the role of Architecture Owner). It is also enabled by technical roadmaps that indicate the technologies that IT delivery teams should, and shouldn’t, be working with. And of course, having high-quality assets that are easy to discover, to understand, and to apply in the course of providing real value to your stakeholders enables reuse.
- Technical-debt mindset. Your enterprise architecture effort should promote strategies that motivate delivery teams to pay down technical debt when they find it and, more importantly, to do what they can to avoid it in the first place. Many technical debt strategies are embedded right in DAD, but without a technical-debt mindset this often comes to naught. Enterprise architects, often acting as Architecture Owners on delivery teams, should coach and mentor developers around the issues associated with technical debt. Similarly, they should help to educate the senior managers and stakeholders with whom they collaborate about technical debt. It requires investment to avoid and remove technical debt, and IT investment decisions are typically in the hands of these people.
- Development guidelines. An important aspect of enterprise architecture is the development of guidelines for addressing common concerns across IT delivery teams. Your organization may develop security guidelines, connectivity guidelines, coding standards, and many others. By following common development guidelines your IT delivery teams produce more consistent solutions which in turn makes them easier to operate and support once in production, thereby supporting your DevOps strategy. A potential drawback of common development guidelines is that developers may feel constrained by them. To counteract this problem the guidelines should be developed and evolved in a collaborative manner with the delivery teams, not imposed from above.
- Technical roadmaps. Your enterprise architecture efforts include the definition, support, and evolution of technical roadmaps that guide the efforts of the rest of the organization (business roadmaps, also important, are the purview of Product Management). This in turn supports the creation of a common and consistent technical infrastructure within your production environments, enabling common DevOps practices such as continuous deployment, automated end-to-end regression testing, and operational monitoring that we discussed earlier.
An important aspect of your technical roadmap is to capture both the existing IT infrastructure and the future vision for that infrastructure. Your IT infrastructure potentially includes your network, software services, servers, middleware, and data sources to name a few elements. As you can see in the following diagram, when developing your technical infrastructure vision there are two issues to consider:
- Ownership. Does your organization own and operate its own infrastructure or does it outsource some or all of it to external experts? Outsourcing options range from traditional strategies, such as having another organization (such as HP or IBM) run your data centers, to using cloud technologies hosted by external organizations (such as Amazon or Google). The advantage of owning your own infrastructure is the greater level of control that it provides you, something that is critical when you must guarantee the security and integrity of your IT solutions. Outsourcing potentially offers greater flexibility in managing your IT infrastructure and cost savings from economies of scale. However, outsourcing requires more sophisticated governance and, in the case of traditional strategies, is a potential bottleneck when the outsourcer cannot respond in a timely manner to your requests.
- Virtualization. Are the elements of your IT infrastructure built to meet the needs of specific solutions or are they softwarized to provide malleability and ease of evolution? With softwarization, also known as software-defined infrastructure (SDI), the elements of your IT infrastructure are fully virtualized. Softwarization includes IT infrastructure models such as a software defined data center (SDDC), software defined storage (SDS), and software defined network (SDN). Softwarization is typically implemented using cloud-based technologies on either side of your firewall. Greater virtualization increases the flexibility and programmability of your IT infrastructure, thereby enabling you to optimize resources. However, the greater flexibility of virtualization can increase the complexity of your testing efforts and make operational incident simulation more difficult.
4. Why Disciplined DevOps?
In this section we briefly explore why both you and your organization should consider adopting a Disciplined DevOps mindset. For yourself as an individual there are several interesting benefits. First, you become more productive as an IT professional, increasing your chance at promotion and making you more attractive in the marketplace. Second, you are in a position where you can focus on interesting, value-added work, which should lead to greater job satisfaction for you. Third, much of the dysfunctional politics exhibited in traditional IT organizations is effectively squeezed out as you move to a Disciplined DevOps mindset, making your work environment a more enjoyable place to be.
There are many reasons why your organization should consider adopting a Disciplined DevOps mindset. Table 2 below summarizes the potential benefits and how they are achieved.
Table 2. Organizational effects of adopting a Disciplined DevOps Mindset
|Benefit|How it is achieved|
|Decreased time to market||
|Decreased cost to deploy||
|Improved mean time between deployments||
|Improved market competitiveness||
5. Adopting Disciplined DevOps
We have found the following strategies useful for organizations adopting a Disciplined DevOps approach:
- Invest in your people. In our experience 80 to 90% of your overall effort will be in helping people to learn new skills and ways of thinking and to rethink if not abandon many of the “best practices” of yesteryear. This requires training, education, and coaching over a long period of time — most people will require many months, and sometimes years, to make the transition to this new mindset.
- Create a safe learning environment. Teams must be free to experiment, to try new strategies to discover what works for them in the situation that they face. Very often this will work out well, with a few stumbles along the way, but every so often the experiment will show that the strategy just isn’t right for this team. That should still be considered a successful experiment, learning what doesn’t work is just as valuable as learning what does, and the team should not be worried about recrimination for “failed” experiments.
- Look at the “whole system.” Disciplined DevOps is about more than just continuous delivery (although that is a great start) and in most enterprises it’s about more than just streamlining development and operations. With a Disciplined DevOps mindset we strive to improve flow through the entire ecosystem, including development, operations, support, enterprise architecture, data management, release management, and most importantly the business itself.
- Improve locally, transform globally. Each team, including your solution delivery teams, your enterprise architecture team, your operations staff, and many others must strive to improve and streamline the way that they work. These local improvement efforts must be supported by a “global” transformation effort that focuses on improving DevOps across your entire ecosystem. Every team will affect other teams, motivating them to make improvements which in turn affects how they work with others. Your IT department is a complex adaptive system where people and teams learn and improve over time in a dynamic, evolutionary manner. If you just focus on local improvements your DevOps effort is likely to devolve into chaos. If you just focus on a company-wide, global transformation it is likely to get bogged down in bureaucracy. An “improve locally, transform globally” approach is a viable middle ground that benefits from these two extremes while avoiding the disadvantages.
- Have a communication plan (and work it). Adopting a Disciplined DevOps strategy within your organization typically requires a lot of (often small) changes. Although it may be clear to you why this is important, it isn’t always clear to everyone else. People need to understand why you’re making these changes, what’s in it for them, what the overall change strategy is, how far along the plan you currently are, what changes are coming soon, and so on. Your communication plan may include regular newsletters, posters overviewing key concepts, brown bag lunches where people share their experiences, electronic discussion forums, management presentations, and many more. The keys to success are to have a constant drum beat of information, to be as open and honest as possible about what is actually occurring, to provide opportunities for everyone to learn, and to motivate everyone to share their learnings.
- Think long-term. Disciplined DevOps is a journey, not a destination. It takes time for people to adopt a new mindset, months and often years before it is truly ingrained in their way of thinking. This paradigm shift does not occur by management fiat, nor does it occur as the result of a day or two of training (although training is important), nor does it occur because you’ve invested in new tooling. Organizations that successfully make this paradigm shift do so by investing in their people, their process, and their tooling over the long term.
6. Parting Thoughts
After describing these critical strategies that support Disciplined DevOps, we’d like to conclude with what we feel to be critical success factors:
- Build a collaborative and respectful culture across your entire IT organization. Our experience is that people, and the way that they work together, are the primary determinants of success when it comes to adopting a Disciplined DevOps strategy. Unfortunately, it is considerably more difficult to bring about cultural change in an organization than it is to adopt a handful of new practices.
- Focus on people, but don’t forget process and tooling. DevOps is primarily a mindset, but as you’ve seen in this article there is a large number of potential practices/strategies (yes, that process stuff) that you need to consider adopting. In turn these practices/strategies are supported by tooling, either existing tooling that you have in place (albeit now used in a different manner) or new tooling that you will need to adopt.
- Choice is good. This article has made it clear that there are many options available to you, each of which has its advantages and disadvantages. No single approach is perfect, and no single approach works in all situations. You not only need to have choices, it’s incredibly good to have choices. Remember the principle Choice is Good!