There are several common development practices that support Disciplined DevOps:
- Canary tests. A canary test is a small experiment where new functionality is deployed to a subset of end users so that you can determine whether that functionality is of interest to them. This in turn gives the development team insight into the true potential value of the functionality (if any). For example, an e-commerce company might believe that a new feature enabling people to buy two related items at a discount will help to increase sales. At the same time, they fear it could decrease overall revenue. So they decide to run a canary test where 5% of their customers are provided this functionality for a two-week period. Sales and revenue are tracked and compared against customers not given access to the new functionality. If a new feature successfully passes a canary test it is then made available to a wider range of end users (you may choose to run several rounds of canary tests before finally deploying the functionality to all users). You can think of canary testing as an extreme form of pilot testing.
- Split tests. A split test, also known as an A/B test, is an experiment where two or more options are run in parallel so that their effectiveness can be compared. For example, a bank may identify three different screen design strategies for transferring funds between two accounts via an automated teller machine (ATM). Instead of holding endless meetings, focus groups, or modelling sessions, the bank decides to implement all three strategies and put them into production in parallel. When I use an ATM I’m always presented with strategy A; when you log in you always get strategy B; and so on. Because the ATM solution is instrumented to track important usage metrics, the bank is able to determine which of the three strategies is most effective. After the split test is completed the winning strategy is made available to all users of ATMs.
- Automated regression testing. Agile software developers are said to be “quality infected” because of their focus on writing quality code and their desire to test as often and as early as possible. As a result, automated regression testing is a common practice adopted by agile teams, one that is sometimes extended into test-first approaches such as test-driven development (TDD) and behavior-driven development (BDD). The regression test suite(s) may address functional testing, performance testing, system integration testing (SIT), acceptance testing, and many more categories of tests. Because agile teams commonly run their automated test suites many times a day, and because they fix any problems they find right away, they enjoy higher levels of quality than teams that don’t. Because some tests, in particular load/stress tests and performance tests, can take a long time to run, a team will often choose to have several test suites running at different cadences (i.e. some tests run at every code check-in, some at scheduled times each day, some once every evening, some over the weekend, and so on). This greater focus on quality is good news for operations staff who insist that a solution must be of sufficient quality before approving its release into production.
- Continuous integration (CI). Continuous integration (CI) is the discipline of building and validating a project automatically whenever a file is checked into your configuration management (CM) system. As you see in Figure 1, validation can occur via several strategies such as automated regression testing and even static or dynamic code and schema analysis. CI enables developers to build a high-quality working solution safely in small, regular steps by providing immediate feedback on code defects.
- Continuous deployment (CD). Continuous deployment extends the practice of continuous integration. With continuous deployment, when your integration is successful in one sandbox your changes are automatically promoted to the next sandbox, and the CI strategy running in that environment automatically integrates your solution there because of the updated source files. As you can see in Figure 2, this automatic promotion continues until the point where any changes must be verified by a person, typically at the transition point between development and operations. Having said that, advanced teams are now automatically deploying into production as well. Continuous deployment enables development teams to reduce the time between a new feature being identified and being deployed into production, which in turn enables the business to be more responsive. However, when development teams aren’t sufficiently disciplined, continuous deployment can increase operational risk by increasing the potential for defects to be introduced into production. Successful continuous deployment in an enterprise environment requires an effective continuous integration strategy to be in place in all sandboxes.
- Development intelligence. This is the application of data warehouse (DW)/business intelligence (BI) solutions to provide insight into how delivery teams are working. The automated team dashboards provided by many development platforms are a simple form of development intelligence. A more sophisticated (and useful) strategy is to combine information from various development tools and display it in an integrated dashboard for the team; more sophisticated yet is to roll up and combine information from different delivery teams into a portfolio management dashboard. Development intelligence is a subset of IT Intelligence.
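To make the canary-test and split-test mechanics above concrete, here is a minimal sketch in Python of deterministic user bucketing. The function names, salt strings, and thresholds are illustrative assumptions; hashing (rather than random assignment) is what guarantees a given user always sees the same variant, as in the ATM example.

```python
import hashlib

def bucket(user_id: str, salt: str, buckets: int) -> int:
    """Deterministically map a user to one of `buckets` groups.

    The salt keeps assignments for different experiments independent."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def in_canary(user_id: str, percent: int) -> bool:
    """True for roughly `percent`% of users, chosen deterministically."""
    return bucket(user_id, "canary-discount-bundle", 100) < percent

def split_variant(user_id: str) -> str:
    """Assign one of three hypothetical ATM screen strategies: A, B, or C."""
    return "ABC"[bucket(user_id, "atm-transfer-screens", 3)]
```

With this in place, the e-commerce canary above becomes `if in_canary(customer_id, 5): show_bundle_discount()`, and widening the rollout is a one-number change.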
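The idea of running regression test suites at different cadences can be sketched as a simple registry. The `Suite` type, cadence labels, and runner below are hypothetical, assuming a CI server invokes the appropriate cadence on each check-in or on a schedule:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Suite:
    name: str
    cadence: str  # e.g. "checkin", "nightly", "weekend" (labels are illustrative)
    tests: List[Callable[[], bool]] = field(default_factory=list)

def run_for_cadence(suites: List[Suite], cadence: str) -> List[str]:
    """Run every suite registered at the given cadence; return the names
    of suites with at least one failing test, so the pipeline can block
    promotion on a non-empty result."""
    failures = []
    for suite in (s for s in suites if s.cadence == cadence):
        if not all(test() for test in suite.tests):
            failures.append(suite.name)
    return failures
```

Fast unit suites would register under the check-in cadence, while long-running load/stress suites register under nightly or weekend cadences, matching the scheduling strategy described above.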
There are also several common operations-friendly features that developers with a Disciplined DevOps mindset will choose to build into their solutions:
- Feature access control. To support experimentation strategies such as canary tests and split tests, it must be possible to limit end user access to certain features. This strategy must be easy to configure and deploy; a common approach is to have XML-based configuration files, read into memory, that contain the metadata required to drive an access control framework.
- Monitoring instrumentation. Developers with a Disciplined DevOps mindset will build instrumentation functionality, such as logging and (better yet) real-time alerts, into their solutions. The purpose is to enable monitoring, in (near) real-time, of their systems when they are operating in production. This is important to the people responsible for keeping the solution running, to people supporting the solution, to people responsible for debugging and fixing any problems, and to your operational intelligence efforts. Monitoring instrumentation enables canary tests and split tests in that it provides the data required to determine the effectiveness of the feature or strategy under test.
- Feature toggles. A feature toggle is effectively a software switch that allows you to turn features on (and off) when appropriate. A common strategy is to turn on a collection of related functionality that provides a value stream, often described by an epic or use case, all at once when end users are ready to accept it. Feature toggles are also used to turn off individual features when it’s discovered that the feature isn’t performing well (perhaps the new functionality isn’t found to be useful by end users, perhaps it results in lower sales, …). Another benefit of feature toggles is that they enable you to test and deploy functionality into production on an incremental basis.
- Self-testing. One strategy to make a solution more robust, and thus easier to operate, is to make it self-testing. The basic idea is that each component of a solution includes basic tests to validate that it can properly run while in production. For example, an application server may run basic tests at startup such as verifying the version of the operating system or of frameworks that it relies on. While the server is running it might regularly check to see if other components that it relies on, such as data sources and middleware services, are available. When a problem is detected it should minimally be logged; better yet, an alert should be posted if intervention by a person is required; and better still, the solution should try to recover from the problem.
- Self-recovery. When a system runs into a problem it should do its best to automatically recover and continue on as before. For example, if the system detects that a data source is no longer available it should try to restart that data service. If that fails, it should record change transactions where possible and then process them when the data service becomes available again. A good example of this is an ATM. When ATMs lose their connection to a bank’s financial processing system they will continue on for a period of time independently, albeit with limited functionality. They will allow people to withdraw money from their accounts, perhaps putting a limit on the amount withdrawn to limit potential problems with overdrawn accounts. People will still be able to deposit money but will not be able to get a current balance or see a statement of recent transactions. Self-recovery functionality provides a better experience to end users and reduces the operational burden on your organization.
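A minimal sketch of the XML-driven feature access control and toggle mechanism described above, using only Python's standard library. The file format, attribute names, and rollout-percentage scheme are illustrative assumptions, not a standard:

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration file contents, read into memory at startup.
CONFIG = """
<features>
  <feature name="bundle-discount" enabled="true" rollout="5"/>
  <feature name="new-transfer-screen" enabled="false" rollout="100"/>
</features>
"""

def load_toggles(xml_text: str) -> dict:
    """Parse feature metadata into an in-memory dict driving access control."""
    root = ET.fromstring(xml_text)
    return {
        f.get("name"): {
            "enabled": f.get("enabled") == "true",
            "rollout": int(f.get("rollout")),
        }
        for f in root.findall("feature")
    }

def feature_on(toggles: dict, name: str, user_percentile: int) -> bool:
    """A feature is visible only when it is toggled on and the user falls
    inside the rollout slice (supporting canary-style partial rollouts)."""
    f = toggles.get(name)
    return bool(f) and f["enabled"] and user_percentile < f["rollout"]
```

Turning a poorly performing feature off then becomes a configuration change (flip `enabled` to `false` and redeploy the file) rather than a code change.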
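Monitoring instrumentation of the kind described above can be as simple as structured log records plus in-memory counters. The event names and `variant` tag below are hypothetical; the point is that the same records that feed (near) real-time monitoring also supply the per-variant data a split-test analysis needs:

```python
import json
import logging
import time
from collections import Counter

logger = logging.getLogger("atm")   # hypothetical component logger
metrics = Counter()                 # per-event/per-variant tallies

def record_event(event: str, **fields) -> dict:
    """Emit a structured (JSON) log record that a monitoring agent could
    tail for alerting, and bump an in-memory counter keyed by event and
    experiment variant so variants can be compared later."""
    payload = {"ts": time.time(), "event": event, **fields}
    logger.info(json.dumps(payload))
    metrics[(event, fields.get("variant"))] += 1
    return payload
```

In a real system the counters would be scraped or pushed to a metrics store rather than held in process memory; this sketch only shows the shape of the instrumentation call.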
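The self-testing and self-recovery ideas above can be sketched together: a component health-checks a dependency, queues transactions locally while the dependency is down (store-and-forward, much like the ATM example), and replays them on recovery. The class and method names are illustrative assumptions:

```python
import queue

class DataService:
    """Stand-in dependency that can be up or down."""
    def __init__(self):
        self.up = True

    def send(self, txn: dict) -> bool:
        if not self.up:
            raise ConnectionError("data service unavailable")
        return True

class Component:
    def __init__(self, service: DataService):
        self.service = service
        self.pending = queue.Queue()  # locally recorded transactions

    def self_test(self) -> bool:
        """Basic runtime check: is the dependency reachable?"""
        try:
            self.service.send({"probe": True})
            return True
        except ConnectionError:
            return False

    def submit(self, txn: dict) -> None:
        """Degrade gracefully: record the transaction if the send fails."""
        try:
            self.service.send(txn)
        except ConnectionError:
            self.pending.put(txn)

    def replay(self) -> None:
        """Flush queued transactions once the dependency is back."""
        while not self.pending.empty() and self.self_test():
            self.service.send(self.pending.get())
```

A production version would also log and alert on each detected outage, per the self-testing guidance above, and persist the queue so recorded transactions survive a restart.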
Now that we have overviewed a collection of development practices and implementation features, let’s explore strategies that streamline your operations efforts.