Introduction
For years, the benefits of Project Management have been demonstrated in technology project delivery, but its benefits are now also being realized in IT support organizations that execute the Service Delivery and Service Support best practices described by the Information Technology Infrastructure Library (ITIL). So what is ITIL? Why should I care? What is the connection between ITIL and project management? What value will it bring to my organization when we are already busy doing “real work”?
Information Technology Infrastructure Library (ITIL) Framework
The Information Technology Infrastructure Library, more commonly referred to as ITIL, was developed in the 1980s. It is intended to be an industry best practice guideline used to address Service Management issues. Recognizing that not all organizations are the same, the best practice guideline approach allows each organization to take the goals and activities and apply them to the organizational context in which it operates. “It provides a framework in which to place existing methods, activities in a structured context, providing a strategic context that improves tactical decision-making and has an aligning influence on the tasks of Service Management” (ITIL, 2001).
The ITIL decomposes Service Management into two key areas: Service Support and Service Delivery. Service Support is focused on the ongoing operational functions, while Service Delivery is more concerned with the long-term planning of improvements to the IT environment. Each of these areas is organized by the processes shown in Exhibit 1.
Exhibit 1: Service Management Processes (Macfarland et al., 2001)
One of the primary strengths of the ITIL is its focus on business requirements. Each of the process areas reinforces the concept that the service delivered should meet the needs of the business that it supports. The ITIL also emphasizes the relationships between the processes to make them mutually supporting. These two factors have contributed to the ITIL becoming the de facto worldwide standard in Service Management.
Rather than giving a high-level overview of each of these processes, this paper focuses on two key processes, Incident Management and Availability Management. We will explore these in some detail and address their relationships with Project Management.
Incident Management
ITIL Incident Management Overview
“The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimize the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. ‘Normal service operation’ is defined here as service operation within Service Level Agreement (SLA) limits.” (ITIL, 2001)
An incident can be any event, from an application that is not available, to a hardware failure, to a service request such as a forgotten password. The key point is that the objective is to minimize the impact on the business. To that end, the sole focus of Incident Management is to restore service to the user as quickly as possible. At this point the user is not particularly interested in why service is not available. They just want it fixed; root causes and what we are going to do about them can be discussed later. The key inputs, activities and outputs of the Incident Management process are shown in Exhibit 2.
Exhibit 2: Inputs, Activities, and Outputs of Incident Management
Project Management in Action During Incident Management
An incident has a definite beginning and end, and it delivers a unique type of service: a problem is solved and technology service is restored to the customer. Look again at the activities listed in Exhibit 2 and note the key words ownership, monitoring, tracking, communication, and closure. Let's see…if we have a “temporary endeavor undertaken to create a unique product or service” (PMBOK, 2000, p.4) and we need someone to take ownership, monitor, track, communicate and close down, what is the best approach?
Approaching the management of a major incident as a project with a project manager is very effective in achieving the goal of restoring service to the customer as quickly as possible. There may be some purists who will quibble over the nuances between projects and operations, but the fact remains that project management discipline in the major Incident Management context absolutely adds value. Many incidents, the so-called ‘run of the mill’ incidents, are resolved in the operational context, but when an incident is major, more management is required.
A major incident has three phases: initiation/planning, incident resolution, and closure. What counts as a major incident needs to be defined by the business drivers of your organization. ‘Priority’ ranking and elapsed time from the start of an incident are quantifiable ways of drawing the line between normal incidents and major incidents. The two facets that are combined to determine priority are the number of users affected and the impact on the business.
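As an illustration only, the line between normal and major incidents might be drawn as in the sketch below. The thresholds, scores, and names (`USER_THRESHOLDS`, `IMPACT_SCORES`, `MAJOR_PRIORITY`, `MAJOR_ELAPSED_MINUTES`) are hypothetical assumptions and not ITIL values; each organization would set its own from its business drivers.

```python
from dataclasses import dataclass

# Hypothetical thresholds; each organization would define its own.
USER_THRESHOLDS = [(1000, 3), (100, 2), (0, 1)]  # (minimum users affected, score)
IMPACT_SCORES = {"low": 1, "medium": 2, "high": 3}
MAJOR_PRIORITY = 5            # combined score at or above which an incident is major
MAJOR_ELAPSED_MINUTES = 120   # an incident open this long is treated as major


@dataclass
class Incident:
    users_affected: int
    business_impact: str      # "low", "medium", or "high"
    elapsed_minutes: int


def priority(incident: Incident) -> int:
    """Combine number of users and business impact into one score (2 to 6)."""
    user_score = next(score for minimum, score in USER_THRESHOLDS
                      if incident.users_affected >= minimum)
    return user_score + IMPACT_SCORES[incident.business_impact]


def is_major(incident: Incident) -> bool:
    """Major if the priority is high enough or the incident has been open too long."""
    return (priority(incident) >= MAJOR_PRIORITY
            or incident.elapsed_minutes >= MAJOR_ELAPSED_MINUTES)
```

Under these assumed thresholds, a low-impact incident affecting five users starts as a normal incident but is treated as major once it has been open for two hours.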
The initiation/planning phase of a major incident is key to resolving the incident quickly. Much of the discipline is embedded in standardized processes, but it is the responsibility of the project manager to assemble the team and move the incident forward into resolution. During the initiation phase, the project manager is responsible for:
- Assigning Roles and Responsibilities.
- Establishing the priority of the problem based on current information.
- Determining the Scope of impact.
- Determining appropriate Stakeholders.
- Determining communication requirements.
- Establishing the time limits for diagnosis and reporting of proposed technical solutions.
- Conducting initial communication to impacted stakeholders as to the scope of impact and recovery plan.
During the resolution phase, the project manager directs the technical team by identifying the recovery actions to be taken, which team members are assigned to each task, the scheduled start and end times, the actual start and end times, and the current status. These items are important for the technical resolution of the incident, but comprehensive and timely communication is of equal importance.
The key to effective communication during a major incident is comprehensive planning. A communications plan specifically lists each communication that must take place, with a description of the communication, its frequency, the person responsible and the media used. The two techniques that are the workhorses of incident communication are periodic conference calls with appropriate stakeholders (status meetings) and Situation Reports (SITREPs). The status calls are conducted separately from the ongoing technical communications aimed at resolving the problem. Status calls are a two-way dialogue following a standard agenda. The project manager provides a status of the problem and summary information about the technical resolution, and the impacted stakeholders provide feedback on the impact of the incident on their business. The standard agenda for a status call is:
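The tracking items listed above can be sketched as a simple record. This is a minimal illustration, not a prescribed tool; the class name `RecoveryAction`, its field names, and the status labels are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RecoveryAction:
    description: str
    assigned_to: str
    scheduled_start: datetime
    scheduled_end: datetime
    actual_start: Optional[datetime] = None
    actual_end: Optional[datetime] = None

    def status(self, now: datetime) -> str:
        """Derive the current status from the scheduled and actual times."""
        if self.actual_end is not None:
            return "complete"
        if now > self.scheduled_end:
            return "overdue"
        if self.actual_start is not None:
            return "in progress"
        return "not started"
```

Deriving the status from the recorded times, rather than storing it separately, keeps the status consistent with the schedule the project manager is reporting against.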
- Current status of the incident.
- Business Impact including business deadlines.
- High level technical overview.
- Estimated service restoration time.
- Line of business action steps including workarounds in place.
- Next update time.
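The communications plan described above lends itself to a small data structure. The sketch below is illustrative only: the `Communication` class, the two sample plan entries, and their frequencies are hypothetical, chosen to show how a plan could tell the project manager which communications are due.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Communication:
    description: str
    frequency_minutes: int
    responsibility: str
    media: str


# Hypothetical plan entries, for illustration only.
PLAN = [
    Communication("Stakeholder status call", 60, "Project manager", "Conference call"),
    Communication("Situation Report (SITREP)", 30, "Project manager", "Email"),
]


def due(plan: List[Communication], minutes_since_start: int) -> List[str]:
    """Return the descriptions of communications due at this point in the incident."""
    if minutes_since_start <= 0:
        return []
    return [c.description for c in plan
            if minutes_since_start % c.frequency_minutes == 0]
```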
The Situation Report (SITREP) provides a standardized communication format, which enables the Project Manager to disseminate a written status of the incident efficiently:
- Operational Period
- Situation
  - Background
  - Current technology situation and state of recovery
- Impacts
  - Technology impact
  - Business lines impacted
  - Functional business impacts
- Planning
  - Recovery solutions being reviewed or implemented
  - Operational Objectives/Contingencies
  - Action steps within the next xx minutes
- Coordinating Information
  - Next status conference call time
  - Conference call information number
  - Roles assigned
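Because the SITREP is a fixed format, it can be generated from a template. The sketch below is one possible rendering; the grouping of items under the Situation, Impacts, Planning, and Coordinating Information headings is our assumption about the intended structure, and the function name `render_sitrep` and the "TBD" placeholder are hypothetical.

```python
# Assumed grouping of the SITREP items into sections.
SITREP_SECTIONS = {
    "Operational Period": [],
    "Situation": [
        "Background",
        "Current technology situation and state of recovery",
    ],
    "Impacts": [
        "Technology impact",
        "Business lines impacted",
        "Functional business impacts",
    ],
    "Planning": [
        "Recovery solutions being reviewed or implemented",
        "Operational Objectives/Contingencies",
        "Action steps",
    ],
    "Coordinating Information": [
        "Next status conference call time",
        "Conference call information number",
        "Roles assigned",
    ],
}


def render_sitrep(values: dict) -> str:
    """Render a SITREP as text, marking any item not yet known as TBD."""
    lines = []
    for section, fields in SITREP_SECTIONS.items():
        if not fields:
            # A section with no sub-items carries its value on the same line.
            lines.append(f"{section}: {values.get(section, 'TBD')}")
            continue
        lines.append(f"{section}:")
        for field in fields:
            lines.append(f"  {field}: {values.get(field, 'TBD')}")
    return "\n".join(lines)
```

Filling unknown items with a visible placeholder means every report has the same shape, so stakeholders always know where to look for a given piece of information.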
After the technology issues are resolved, incident closure ensures that the customer verifies that the technology problem is resolved, that the incident results are properly documented, and that lessons learned are captured. A formal lessons learned session is conducted to facilitate a professional discussion of the incident, focused on performance standards, that enables participants to discover for themselves what happened, why it happened, and how to sustain strengths and improve on weaknesses. It is a tool for creating a feedback loop of lessons learned that continually improves future performance.
Key Incident Management Implementation Lessons Learned
- “I have a problem”. Incident identification and reporting is an obvious prerequisite to Incident Management.
- Communication, communication, communication. Stakeholders who are not informed assume the worst. Their business is being impacted so they have a very keen interest in knowing what you are going to do about it. Knowing that you understand their pain and have a plan to address it goes a long way.
- Many battles are won and lost in the first 30 minutes of a major incident. The project manager must quickly get the team organized and move it forward into resolution as quickly as possible.
- Parallel paths to technical resolution. Have more than one possible solution in the planning stages concurrently. The project manager should also be conducting constant risk identification and response planning. A standard question should be “what if that does not fix the problem?”
- Measure incidents against established Service Level Agreement targets.
Availability Management
ITIL Availability Management Overview
“The goal of the Availability Management process is to optimize the capability of the IT infrastructure, services and supporting organization to deliver a cost effective and sustained level of Availability that enables the business to satisfy its business objectives” (ITIL, 2001). According to the ITIL there are three guiding principles, which emphasize the importance of the business and the needs of its users (ITIL, 2001):
- Availability is at the core of business and user satisfaction.
- Recognizing that when things go wrong, it is still possible to achieve business and user satisfaction.
- Improving availability can only begin after understanding how the IT services support the business.
Availability Management is wide in scope: it addresses not only the design and implementation but also the measurement and management of the availability of the IT infrastructure. Accurate and timely reporting of availability is imperative to effective Availability Management. “Measurements need to be meaningful and add value if availability measurement and reporting are to ultimately deliver benefit to the IT and business organization. This will be strongly influenced in the combination of ‘what you measure’ and ‘how you report it’” (Macfarland et al., 2001). The key inputs, activities and outputs of the Availability Management process are shown in Exhibit 3.
Exhibit 3: Inputs, Activities, and Outputs of Availability Management
Project Management in Action During Availability Management
During the high-level design phase of the software development lifecycle, a Reliability Assessment is conducted on the design. The purpose of the Reliability Assessment process is to assess the gap between an application's stated requirements for availability and its ability to meet those requirements based on the design and implementation. The Reliability Assessment process is conducted as a sub-project of the larger systems development effort.
The first step in the Reliability Assessment process is to understand the requirements of the business. In this context, operational requirements and critical business functions are discussed. Operational requirements address the non-functional availability requirements for the system under development. Sample operational requirements include basics such as the hours of operation, number of peak and concurrent users, types of users, usage profiles, volumetrics, and tolerance for failure (e.g. frequency and duration of outages). It is incumbent on the technology team to educate the business on the importance of these decisions. Often the business's first response is ‘we want a system that is available 24 x 7, never breaks, and we have a very limited budget’. Starting this dialogue with the business is the exact purpose of the Reliability Assessment. All the constraints of the project must be considered, including cost, quality, time, scope and customer satisfaction, and it is the responsibility of the project manager to proactively manage stakeholder expectations.
Once the availability expectations and requirements of the business are understood, the proposed design of the system is evaluated to determine whether it will in fact meet these requirements. The system is decomposed into its components, each of which is evaluated for stability. Components consist of devices and software that, when combined, provide a level of service to a business function or application. Once the system's components are understood, they can be mathematically combined to establish an estimate of how the system is actually going to perform in the production environment. The key measures that are estimated are application availability, mean time to resolve, maximum time to resolve and frequency of failure. These estimates are compared against the previously discovered expectations and gaps are identified. The gaps that are identified are treated as stability risks.
A stability risk is an element in the design of the application, or of the components it uses, that can negatively impact the stability, availability and/or recoverability of the application. Mitigation steps should be implemented, where possible, to reduce the identified risks. Mitigation options vary in how much they reduce risk. In some cases the project manager or line of business may choose to accept the risk rather than implement mitigation steps because of the cost of the mitigation options. In the end it is a business decision that governs the mitigation or acceptance of each stability risk.
Addressing stability risks is not inherently different from addressing any other risk managed by the project manager. The high-level steps when assessing stability risks are:
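One simple way to combine component estimates is sketched below. The paper does not specify the calculation, so the model here is an assumption: components are arranged in series (the application needs all of them) and fail independently, so availabilities multiply, failure frequencies add, the mean time to resolve is weighted by failure frequency, and the maximum time to resolve is that of the worst component. The names `Component` and `estimate_system` are hypothetical.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List


@dataclass
class Component:
    name: str
    availability: float          # fraction of scheduled time the component is up
    failures_per_year: float     # expected frequency of failure
    mean_hours_to_resolve: float
    max_hours_to_resolve: float


def estimate_system(components: List[Component]) -> Dict[str, float]:
    """Combine component estimates, assuming a series design with independent failures."""
    total_failures = sum(c.failures_per_year for c in components)
    weighted_hours = sum(c.failures_per_year * c.mean_hours_to_resolve
                         for c in components)
    return {
        "availability": prod(c.availability for c in components),
        "failures_per_year": total_failures,
        "mean_hours_to_resolve": (weighted_hours / total_failures
                                  if total_failures else 0.0),
        "max_hours_to_resolve": max(c.max_hours_to_resolve for c in components),
    }
```

Under this model, two components with availabilities of 99.9% and 99.5% give an estimated application availability of about 99.4%, which would be a gap against a 99.5% requirement. Redundant (parallel) components would need a different calculation.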
- Determine other sources of stability risks.
- Use historical data or documents from past projects to assist in identifying additional risks.
- Ensure that the mitigation steps will address the risk associated with the Stability Risk item.
- Determine the Probability of Stability Risks.
- Determine the Impact of Stability Risks.
The impact of stability risks is measured against the objectives for mean time to recover, maximum time to recover and frequency of failure. Variance thresholds are identified to determine the impact; for example, a high impact is present if the maximum time to recover is unacceptable to the customer. Probability and impact are combined to score each stability risk (red, yellow, green) for management reporting.
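A minimal sketch of such a scoring scheme follows. The three-level scale, the multiplication of probability by impact, and the cut-off values are assumptions for illustration; an organization would calibrate these to its own variance thresholds.

```python
# Hypothetical three-level scale for both probability and impact.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def score_risk(probability: str, impact: str) -> str:
    """Combine probability and impact into a red, yellow, or green rating."""
    score = LEVELS[probability] * LEVELS[impact]
    if score >= 6:
        return "red"
    if score >= 3:
        return "yellow"
    return "green"
```

With these cut-offs, a high-probability, medium-impact risk is reported as red, while a low-probability, high-impact risk is reported as yellow.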
Mitigation options are discussed with the business and may include:
- Modify the design to eliminate or simplify components.
- Perform load testing to evaluate ability to meet concurrency requirements.
- Develop comprehensive proactive monitoring to identify problems.
- Develop recovery documentation to reduce recovery time.
- Increase capacity on hardware.
- Implement additional fault tolerance measures, e.g. additional redundancy.
Key Availability Management Implementation Lessons Learned
- Availability must be addressed early in the software development lifecycle.
- All requirements need to be gathered from the business including operational requirements in addition to the functional requirements.
- The system's capability in terms of availability must be assessed to establish a reasonable target against which to measure system performance.
- Actual availability must be measured and compared to the established targets.
- Comprehensively address stability risks in conjunction with other project risks using the same risk management rigor.
- Availability must be designed into the system.
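Measuring actual availability against a target, as recommended above, can be sketched as follows. The formula (scheduled time minus downtime, divided by scheduled time) is a common convention and an assumption here, since the paper does not define the calculation; the function names are hypothetical.

```python
from typing import List


def measured_availability(scheduled_minutes: float,
                          outage_minutes: List[float]) -> float:
    """Fraction of scheduled service time during which the service was available."""
    downtime = sum(outage_minutes)
    return (scheduled_minutes - downtime) / scheduled_minutes


def meets_target(actual: float, target: float) -> bool:
    """Compare measured availability with the established target."""
    return actual >= target
```

For a service scheduled around the clock over a 30-day month (43,200 minutes), outages of 30 and 13.2 minutes give a measured availability of about 99.9%, which meets a 99.5% target but misses a 99.95% target.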