Schedule risk analysis simplified
The critical path method (CPM) of scheduling a project is a key tool for project management. A schedule “network” represents the project strategy. Activities, where the work is accomplished, are linked by relationships (e.g., finish-to-start, start-to-start, finish-to-finish) showing how the work is planned. Strings of linked predecessor and successor activities constitute “paths” through the network. When two or more paths are to be done simultaneously, they are described as parallel paths.
Some of the most important points in the project are where several parallel paths converge. At these merge or join points, the paths must all be completed before a milestone is recorded for payment, an inspection can be done, subassemblies can be integrated for testing, or the project can be recorded as complete.
CPM computes the shortest project completion duration and/or completion date from the longest path through the network. The “longest pole in the tent” is called the “critical path.” Any delay on the critical path will delay the project. Conversely, CPM shows that paths that are not critical can be delayed or lengthened, if they have enough “float” or scheduling flexibility, without necessarily delaying the project.
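The forward and backward passes behind these statements can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular scheduling package: the network, its activity names (P1, P2, Q1, Q2, END), and its durations are hypothetical, and only finish-to-start links are handled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def cpm(durations, predecessors):
    """Return (project duration, total float per activity) for finish-to-start links."""
    graph = {act: list(predecessors.get(act, [])) for act in durations}
    order = list(TopologicalSorter(graph).static_order())

    # Forward pass: an activity starts when its latest predecessor finishes.
    early_finish = {}
    for act in order:
        early_start = max((early_finish[p] for p in graph[act]), default=0)
        early_finish[act] = early_start + durations[act]
    project_duration = max(early_finish.values())

    # Backward pass: an activity must finish before its earliest successor's late start.
    successors = {act: [] for act in order}
    for act, preds in graph.items():
        for p in preds:
            successors[p].append(act)
    late_start = {}
    for act in reversed(order):
        late_finish = min((late_start[s] for s in successors[act]),
                          default=project_duration)
        late_start[act] = late_finish - durations[act]

    # Total float = late start minus early start; zero float marks the critical path.
    total_float = {act: late_start[act] - (early_finish[act] - durations[act])
                   for act in order}
    return project_duration, total_float

# Hypothetical network: two parallel paths merging at a finish milestone.
durations = {"P1": 10, "P2": 20, "Q1": 12, "Q2": 8, "END": 0}
predecessors = {"P2": ["P1"], "Q2": ["Q1"], "END": ["P2", "Q2"]}

project_duration, total_float = cpm(durations, predecessors)
print(project_duration, total_float)
```

Under these assumed durations, path P (30 days) is the longest pole in the tent, and Q1 and Q2 each carry 10 days of float.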
On the one hand, the critical path method of scheduling is traditional and well-accepted. It is essential for developing the logic of the project work and for managing the day-to-day project activities.
On the other hand, the CPM schedule is only accurate if every activity is started on its scheduled date and takes just as long as its duration estimate indicates. Project managers understand that projects do not ever go entirely according to plan, which is one reason for frequent status reviews. Since real projects do not work this way, CPM is just the beginning of project schedule management.
Project managers need to understand some key reservations about the standard CPM, and how to use a schedule risk analysis to provide information crucial to a project's success, before they embark on their project:
- The project duration calculated by CPM is accurate only if everything goes according to plan. This is rare in real projects.
- In many cases the completion dates CPM produces are unrealistically optimistic and highly likely to be overrun, even if the schedule logic and duration estimates are accurately implemented.
- The CPM completion date is not even the most likely project completion date, in almost all cases.
- The path identified as the critical path using traditional CPM techniques may not be the path most likely to delay the project, which is the path that most needs management attention.
Schedule Risk Analysis: The Rest of the Story
In his October 1995 PM Network Software Forum column (and a follow-on column in April 1996), Harvey Levine challenged “the mathematically disinclined” to determine the degree of schedule overrun risk without getting confounded by mathematical or statistical terminology. To help achieve that goal, three case studies are presented in this article. Using simple schedules of two activities or four activities on two parallel paths, these case studies show both the pitfalls of relying on CPM and the benefits of a schedule risk analysis.
These case studies show that it may not be feasible to complete even simple projects by their CPM-determined date, given risk characteristics that are similar to those often found in real, everyday projects. CPM does not identify these overrun problems very well, or at all. In fact, CPM reports a single completion date, as though that date were certain. If CPM does not work well in such simple projects, and the problems are multiplied many times for the complex projects facing project managers every day, the benefits of conducting a risk analysis on every project become obvious. (For an analytical treatment of this subject, with more implications than are included here, see my March 1995 article in the Project Management Journal, “Project Schedule Risk Assessment.”)
Popular, accessible commercial software is used for these case studies to show that schedule risk analysis methods are available to anyone. Graphical presentations of the results illustrate the steps required to conduct a quantified risk assessment and the results without using mathematics or statistics.
Case 1: Three Steps to a Successful Analysis
A successful risk analysis has three steps: (1) create the CPM schedule for the project, (2) estimate the uncertainty in the activity durations, and (3) perform a risk analysis of the schedule, usually with a Monte Carlo simulation method available in several software packages.
Step 1: CPM Schedule—The Foundation of a Risk Analysis. CPM analysis of the project schedule is the key building block of a quantified risk assessment. Case 1 presents a very simple project and a typical schedule risk analysis. It illustrates how the CPM completion date can easily be overrun. It shows how a risk analysis can illuminate the issues in CPM and point to their resolution. For the first step, a project with two activities and a finish milestone is shown in Figure 1.
Suppose the durations are set at 50 working days for A101 and 80 working days for A102. Suppose further that this project is scheduled to start on June 10, 1996. CPM shows that this simple project will take 130 working days (50 + 80 = 130) and complete on December 11, 1996.
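The date arithmetic can be checked with a short working-day count. This is a sketch under an assumption: the article does not list its non-working days, so the holiday set below (Independence Day, Labor Day, and Thanksgiving 1996) is a guess at a calendar that is consistent with the stated finish date. The helper names are illustrative.

```python
from datetime import date, timedelta

# Assumed non-working days; the article does not state its project calendar.
ASSUMED_HOLIDAYS = {date(1996, 7, 4), date(1996, 9, 2), date(1996, 11, 28)}

def is_working_day(day):
    """Monday to Friday, excluding the assumed holidays."""
    return day.weekday() < 5 and day not in ASSUMED_HOLIDAYS

def finish_date(start, working_days):
    """Date of the last working day, counting the start date as day 1 if it is a working day."""
    day = start
    remaining = working_days
    while True:
        if is_working_day(day):
            remaining -= 1
            if remaining == 0:
                return day
        day += timedelta(days=1)

cpm_duration = 50 + 80  # A101 followed by A102, finish-to-start
finish = finish_date(date(1996, 6, 10), cpm_duration)
print(cpm_duration, finish)
```

A different holiday calendar would move the finish date by a day or two, which is why the calendar is flagged as an assumption.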
There are several considerations in developing a successful risk analysis network. For a risk analysis to be successful, the CPM network should be developed at a level of detail that shows the important project structure, i.e., the parallel paths and key merge points where the risk is often increased. Watch for three features to determine the correct level of a schedule:
- First, the network should not be developed at too high (summarized) a level; for instance, at a level where most of the activities summarize important underlying detail. Summary activities are often characterized by start-to-start or finish-to-finish relationships.
- Second, the schedule should show clearly the parallel paths that could cause the project to be late if not coordinated. Case 2 below shows the “merge bias” that occurs when parallel paths converge.
- Third, the network should not be developed in such great detail that it requires too much information, since that would be unworkable and overly burdensome.
When building the risk analysis network, the temptation is to schedule only those paths assessed as the longest poles in the tent. This is a dangerous practice that should be avoided. It is best to show more paths rather than fewer, since the shorter poles might actually have more risk than the long poles identified by CPM. Case 3 below shows that a shorter path with more risk can actually be the “highest risk path,” i.e., the path most likely to actually delay the project. If shorter paths are not included in the network to begin with, their risk might not be explored at all.
If there are scarce resources that make scheduling some activities in parallel infeasible, they ought to be identified and included in the schedule risk analysis. If constraints are included in the CPM network, however, these must be deleted in the risk analysis. These points are discussed in more detail later.
Step 2: Determine the Activity Duration Ranges. The activity durations that are used to calculate the critical path are often thought of as the “best guess” or “most likely” amount of time needed to complete the work given the planned resources. Experienced project managers know that the work might take more or less time than the estimate they have assumed for the CPM calculation. These times make up the low and high ranges for a risk analysis.
Duration ranges for each activity are the low (optimistic) and high (pessimistic) durations that the work on the activity might take under different possible extreme scenarios. High ranges, for instance, can be determined by examining the various things that could go wrong, such as technical problems, site conditions, supplier delays, and permitting issues—factors which are often called “risk drivers.”
Risk drivers are identified and explored in interviews with the managers or team leaders of the project's activities because they are the most knowledgeable people on risk in the work under their management and control. They are most familiar with the risk issues in each activity and can best assess their impact on the possible duration of each activity. Experienced guidance is often needed to develop the duration ranges.
Note: This view of the Case 2 network and risk ranges was developed in Microsoft Project v. 4.0 and Risk+ v. 1.5, an MS Project add-in from Program Management Solutions.
The risk interview starts with a description of what could go wrong with the work in an activity. It then turns to the likelihood of that scenario and how long the activity might take if everything goes wrong. This duration is the high-range duration for that activity. A serious and honest airing of the issues involved in doing an activity will soon uncover possible, though perhaps unlikely, pessimistic scenarios.1
Conversely, exploring with the managers what could go right should yield an optimistic scenario and the low estimate of duration for each activity. Sometimes, with aggressive or “success-oriented” schedule estimation, the CPM duration for many activities is already the shortest possible duration conceivable.
Often the CPM duration is viewed as the most likely duration for most of the activities in the project. This may not be true, however. The risk analyst should, with the project manager, seek to describe the most likely scenario and determine if the CPM duration, or some other duration, will accommodate it.
The ranges of longer (Max Dur) and shorter (Min Dur) durations for the two Case 1 activities are shown in Figure 2. The table in that figure shows that there is a greater risk of overrun than opportunity for underrun in these two activities.2 The bar chart shows the CPM completion date of December 11, 1996.
There may be problems in collecting data about project schedule risk. For instance, project managers may be reluctant to commit their activity managers' time to activities to develop the information needed for a risk analysis, since the activity managers are the very people who are busiest managing the project itself. Alternatively, some project managers avoid examining risk at all because it poses difficult issues, highlights bad news, exposes embarrassing problems or raises issues that must be disclosed to the owner or customer.
Even if a project manager wants to explore risk honestly, it is often challenging to develop realistic ranges, particularly for unlikely but possible extreme events. But, gathering duration range data has the major benefit of making everyone aware of the problems facing the project. A risk analysis usually reveals important problems that had not been appreciated or communicated fully before the gathering of the duration range data.
Projects have many activities, sometimes thousands. How should the ranges be developed for such a large project? In this case, “risk banding” is often used. The project manager organizes the activities into groups with common risk characteristics. For each of these “risk types” the low and high duration ranges would be expressed in percentages, e.g., minus 15 percent, plus 25 percent, from the CPM duration. With large networks there is little alternative but to use CPM durations as the most likely in the activity distributions.
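Risk banding amounts to a lookup table applied to the CPM durations. The sketch below assumes three band names and two of the three percentage pairs for illustration; only the minus 15 percent, plus 25 percent pair comes from the text, and the activity names T1 to T3 are hypothetical.

```python
# Percentage adjustments to the CPM duration: (low, high).
RISK_BANDS = {
    "well_understood": (-5, 10),   # assumed for illustration
    "typical": (-15, 25),          # the example percentages given in the text
    "novel": (-10, 60),            # assumed for illustration
}

def three_point_estimate(cpm_duration, band):
    """Return (low, most likely, high), using the CPM duration as the most likely value."""
    low_pct, high_pct = RISK_BANDS[band]
    low = cpm_duration * (100 + low_pct) / 100
    high = cpm_duration * (100 + high_pct) / 100
    return low, cpm_duration, high

# Hypothetical activities: CPM duration in working days and assigned risk band.
schedule = {"T1": (20, "well_understood"), "T2": (80, "typical"), "T3": (40, "novel")}

ranges = {name: three_point_estimate(duration, band)
          for name, (duration, band) in schedule.items()}
for name, estimate in ranges.items():
    print(name, estimate)
```

An 80-day activity in the "typical" band, for example, gets a range of 68 to 100 working days around its CPM duration.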
Once the ranges, often called “three-point estimates,” are determined, the project manager must adopt a probability distribution shape for each risky activity. A probability distribution takes the three possible durations (low, most likely, and high) and expresses the relative likelihood of alternative outcomes within that overall range. That is, there are some possible durations that are more likely than others.
Triangular distributions are often used in risk analysis because they are easy to specify (just needing three points and a straight-edge) and to use in analysis. Also, the project managers may not know enough to specify any other distribution type, although other distribution shapes are often available in the software.
Suppose that the analyst in Case 1 chooses a simple triangular distribution for each activity, using the data from Figure 2. The distributions are shown in Figure 3. They make it clear that both CPM durations are optimistic, and that the A101 duration of 50 working days is quite optimistic.3
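How optimistic the A101 estimate is can be put in numbers with two standard properties of a triangular distribution: its average is (low + most likely + high) / 3, and the chance of finishing at or below the most likely value is (most likely - low) / (high - low). The sketch below uses the A101 figures quoted in the article's note on averages (40, 50, and 100 working days); the function names are illustrative.

```python
def triangular_mean(low, mode, high):
    """Average of a triangular distribution."""
    return (low + mode + high) / 3

def chance_at_or_below_mode(low, mode, high):
    """Probability that a triangular draw is no longer than the most likely value."""
    return (mode - low) / (high - low)

# A101: optimistic, most likely (the CPM duration), and pessimistic working days.
low, mode, high = 40, 50, 100

mean_duration = triangular_mean(low, mode, high)
chance_of_cpm = chance_at_or_below_mode(low, mode, high)
print(f"average: {mean_duration:.1f} days, chance of 50 days or less: {chance_of_cpm:.0%}")
```

For A101 this gives an average of about 63.3 working days and roughly a one-in-six chance of finishing within the 50-day CPM duration.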
Step 3: Simulate the Project Schedule. Once the activities' duration ranges and distributions have been determined, the schedule risk analysis can determine how risky the entire project schedule is. The most common method of determining schedule overrun risk is to simulate the project by solving (or iterating) it hundreds or thousands of times on the computer.
The simulation results for Case 1, for instance, will answer such important project management questions as:
- Is the completion date estimate of December 11, 1996, reasonable?
- Is December 11 even the “most likely” completion date of this simple project?
- How many days are needed for a contingency to reduce the overrun risk exposure to an acceptable level?
Monte Carlo simulation, the method most often used, has several steps:
- Each iteration begins by selecting at random a duration for each risky activity from its range and distribution, like those in Figure 3.
- The total project and key milestone completion dates for that iteration are calculated using CPM for that particular configuration of durations. Those are only possible dates for completion of the project and its milestones, and they may not be representative of all possible solutions.
- To determine the entire pattern of possible completion dates for the project and its important milestones, the risk analyst iterates the project many times. At the end of each iteration, the completion dates for the total project and for any important milestone are collected and stored. The program also records which activities were on the critical path for that iteration.
- At the end of the entire simulation, project completion and important milestone dates computed from all iterations are collected and arrayed in graphs and tables showing the probability distribution, or relative frequency, of all possible dates.
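The steps above can be sketched for Case 1 with Python's standard library. This is an illustration, not a reproduction of the Risk+ results: the A101 range (40, 50, 100) is the one quoted in the article's note on averages, while the A102 range is an assumption because the Figure 2 table is not reproduced in the text. Durations are in working days.

```python
import random

# Three-point estimates in working days: (low, most likely, high).
ACTIVITIES = {
    "A101": (40, 50, 100),  # from the article's note on average durations
    "A102": (70, 80, 120),  # assumed for this sketch
}

def simulate_case1(iterations=2500, seed=1):
    """Return one total project duration per iteration."""
    rng = random.Random(seed)
    totals = []
    for _ in range(iterations):
        # Draw one duration per activity from its triangular distribution.
        sampled = {name: rng.triangular(low, high, mode)
                   for name, (low, mode, high) in ACTIVITIES.items()}
        # The activities are in series, so the CPM result for this iteration is a sum.
        totals.append(sampled["A101"] + sampled["A102"])
    return totals

totals = simulate_case1()
print(f"iterations: {len(totals)}")
print(f"average duration: {sum(totals) / len(totals):.1f} working days")
print(f"shortest: {min(totals):.1f}, longest: {max(totals):.1f}")
```

Note that `random.triangular` takes its arguments in the order low, high, mode. Sorting or binning `totals` gives the relative-frequency and cumulative views that the software presents as charts.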
Suppose that the risk analyst examining Case 1 determines that 2,500 iterations will be sufficient for the accuracy needed.4 The result of that simulation is a cumulative likelihood distribution that represents the likelihood of the project completing on or before each possible date. This distribution is shown in Figure 4.
Figure 4 includes a likelihood distribution (bell curve) and a cumulative distribution (S-curve) shown in both pictures and a table for the total project completion date. From the chart we can see:
- The CPM completion date of December 11 is only 10–15 percent likely to be adequate for this simple project. Placing confidence in completion by December 11 is very likely to get the contractor and customer in trouble.
- A look at the bell-shaped distribution reveals that the most likely completion date is close to December 24, not December 11. The commonsense notion that adding “most likely” durations along a critical path will result in the most likely project completion date is simply wrong, in most cases.
- The average completion date is January 7, 1997. If this simple project were done 100 times, it would on average finish about three weeks later than the CPM date, allowing for the holidays.
- Suppose a conservative schedule is required, one that has an 80 percent likelihood of success. The results show that January 24, 1997, has such a success likelihood. Hence, a six-week contingency is needed to reduce the risk of overrun to an acceptable level for this conservative company.
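Readings like these come straight from the sorted simulation results. The sketch below shows one way to compute the likelihood of meeting a target and the duration at a chosen confidence level. It works in working days rather than calendar dates, and its sample data use an assumed range for A102, so its numbers will not match Figure 4.

```python
import random

def likelihood_of_meeting(samples, target):
    """Share of iterations that finished at or before the target duration."""
    return sum(1 for s in samples if s <= target) / len(samples)

def duration_at_confidence(samples, confidence_pct):
    """Smallest simulated duration met by at least confidence_pct percent of iterations."""
    ordered = sorted(samples)
    rank = max(1, (confidence_pct * len(ordered) + 99) // 100)  # ceiling division
    return ordered[rank - 1]

# Illustrative Case 1 samples: A101 (40, 50, 100) plus an assumed A102 (70, 80, 120).
rng = random.Random(1)
samples = [rng.triangular(40, 100, 50) + rng.triangular(70, 120, 80)
           for _ in range(2500)]

cpm_duration = 130
p_cpm = likelihood_of_meeting(samples, cpm_duration)
p80 = duration_at_confidence(samples, 80)
contingency = p80 - cpm_duration

print(f"likelihood of meeting {cpm_duration} days: {p_cpm:.1%}")
print(f"80 percent confidence duration: {p80:.1f} days")
print(f"contingency needed: {contingency:.1f} working days")
```

The contingency is simply the gap between the duration at the required confidence level and the CPM duration.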
Note: The simulation results in the table summarize the cumulative distribution, as shown in the chart. The display is from Risk+ v. 1.5.
CPM clearly establishes December 11, 1996, as the project end-date. Just as clearly, the risk analysis establishes that an end-date of December 11 is highly optimistic. Any owner/customer or contractor that agrees to that date is already in trouble on this project. Without a risk analysis, the existence or degree of trouble is unknown.
Monte Carlo is a well-established method to represent risk. Several commercial scheduling programs have a schedule risk analysis module. For some others, third-party software companies may have provided such capability. Not all network programs have risk analysis packages, however. The risk analyst using a network program that lacks the risk package will have to load the schedule into a network program that does.
Two issues common to scheduling should be resolved before the simulation is begun. These issues are the handling of constraint dates and limited resource requirements.
The first issue is the treatment of constraint dates, such as “not later than” or “must finish on” dates. Constraints are often implemented in the CPM schedule to represent key dates contained in a contract or some other requirement. Constraints are used to highlight key contract dates because these dates must be met; otherwise the project is in trouble.
Constraint dates have no place in a risk analysis, however. One main goal of a risk analysis is to determine whether the important contract dates, those established in the network with constraint dates, are in jeopardy. If constraint dates were implemented in the simulation, each iteration would be forced to meet those dates. Simulation with constraints operating during each iteration cannot possibly investigate the feasibility of meeting the dates because success is enforced. Constraints must be taken off the schedule before doing a risk analysis because they invalidate the analysis.
The second issue is the treatment of limited resources. If several activities using the same resource are potentially scheduled at the same time, they may require more of that resource than is available. Resource leveling in CPM pushes one or another resource-using activity out in time so that a resource's usage does not exceed its availability in any time period.
Because each solution in a risk analysis must at least be feasible, it should not violate any resource limitations that exist. Each iteration must be resource-leveled if some resource(s) is (are) limited. The risk analysis software package should be able to level resources as it is iterating. This increases run time substantially.
Case 2: Risk with Parallel Paths—The “Merge Bias”
The three steps of a schedule risk analysis can be used in more complicated cases than Case 1. For instance, most projects have activities planned simultaneously along parallel paths. At the end of the project, and often at important internal milestones, these paths converge.
At path convergence (or merge) points, projects can be delayed if the probability distributions of the converging paths' durations overlap because a delay on any one of the paths will delay the work. Examples include (a) several types of construction work that must be completed before an inspection can be conducted, or (b) several components that must be finished before integration and testing can be done.
Of course, some merge points have more than two converging paths, and the opportunity for delay at such points is thus magnified. Clearly, path merge points can never be good news for a risk analysis. The increase in project risk at merge points, called the merge bias, is the subject of Case 2.
In Case 2, a simple parallel path project is assumed. Suppose B101 is exactly like A101 and B102 is exactly like A102 in all respects; CPM durations and low and high ranges, as illustrated in Figure 5. (As in Case 1, this analysis was conducted using Risk+, an add-in for Microsoft Project.)
The CPM result for this schedule is exactly the same as Case 1; 130 working days to a completion date scheduled at December 11, 1996. Both Path A and Path B are identified as critical paths by CPM.
A project with parallel paths (like Case 2) is almost always more likely to be overrun than the simple single-path schedule (such as that in Case 1). The cause of this pessimistic result is the merge bias, which has been known for nearly 30 years. K.R. MacCrimmon and C.A. Ryavec discussed this in “An Analytical Study of the PERT Assumptions,” published as Appendix C in Network-Based Management Systems (PERT/CPM) by Russell Archibald and Richard Villoria (1967, Wiley and Sons).
The distribution of possible completion dates from a Monte Carlo simulation of Case 2, shown in Figure 6, reflects the impact of the merge bias:
- The CPM date, which is still December 11, 1996, is now not even 5 percent likely to occur.
- The most likely completion date is about January 7, 1997, not December 24, 1996, as in Case 1.
- The average completion date is January 17, not January 7, 1997, as in Case 1. This project has an average or “expected” overrun of five calendar weeks from the CPM estimate.
- The conservative company requiring an 80 percent likelihood of success now requires a completion date of February 4, 1997, not January 24, as in Case 1.
The CPM schedule did not reveal these problems at all. Even with parallel critical paths, the CPM analysis forecasted the same completion date of December 11, 1996, in both cases. The risk analysis showed that the risk of overrun and necessary schedule contingency were materially higher in Case 2 than in Case 1.
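The merge bias follows from a simple rule: at a merge point the project finishes when the later of the converging paths finishes. The sketch below compares a single path with two identical, independent paths. The A102/B102 range is assumed, as the figure tables are not reproduced in the text, so the numbers are illustrative only.

```python
import random

def sample_path(rng):
    """One simulated path duration: (40, 50, 100) plus an assumed (70, 80, 120)."""
    return rng.triangular(40, 100, 50) + rng.triangular(70, 120, 80)

def share_meeting(samples, target=130):
    """Share of iterations finishing within the CPM duration."""
    return sum(1 for s in samples if s <= target) / len(samples)

rng = random.Random(1)
single, merged = [], []
for _ in range(2500):
    path_a = sample_path(rng)
    path_b = sample_path(rng)
    single.append(path_a)               # Case 1: one path
    merged.append(max(path_a, path_b))  # Case 2: the later path sets the finish

mean_single = sum(single) / len(single)
mean_merged = sum(merged) / len(merged)
print(f"single path: average {mean_single:.1f} days, "
      f"{share_meeting(single):.1%} within 130 days")
print(f"two paths:   average {mean_merged:.1f} days, "
      f"{share_meeting(merged):.1%} within 130 days")
```

If the two paths are independent and each has a 10 to 15 percent chance of finishing by the CPM date, the chance that both do is the product, roughly 1 to 2 percent, which is in line with the "not even 5 percent" reported for Case 2.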
This article has shown that two simple projects, one with two activities and another with four activities on two parallel paths, can be in trouble. It follows that real projects, which are far more complex, are even more likely to be infeasible because they have more parallel paths and merge points. A risk analysis identifies and quantifies the difficulties faced by the project manager.
Case 3: Risk Management and the “Highest Risk Path”
Suppose the project manager who received these risk analysis results took steps to mitigate the risk forecast in Case 2. Several steps might be taken.
- First, the project manager might reduce the risk in activities A101 and A102, perhaps through different strategies, e.g., buy rather than make a tricky component or utilize a different supplier thought to be more reliable.
- Second, the project manager could shorten the duration of activity B101 by five days, to 45 working days, perhaps by adding a second shift to part of the work. For Case 3, it is assumed that the risk ranges around both activities' durations in Path B are the same as in Case 2, perhaps reflecting geological conditions or the technological challenges which remain.
Suppose that the new risk ranges resulting from these steps are shown in Figure 7.
Experienced schedulers will recognize that shortening activity B101 makes Path B a “near-critical” path that has a CPM duration of 125 days and a float of five working days. Path A is now the only critical path with a total CPM duration of 130 working days. The completion date is still scheduled for December 11, 1996.
But, is Path A the most likely to delay the project? Is it the highest-risk path? A risk analysis is now required not only to estimate the possible durations of this project after risk management but also to identify the highest-risk path.
As mentioned above, the simulation software identifies which activities are on the critical path for each iteration. At the end of the simulation, the percentage of the iterations in which the activities were critical indicates the relative likelihood that delays in their completion will delay the project, their “relative criticality.”
In Case 3, Path A may be the critical path using CPM durations, but that is not the end of the story. In the software display shown in Figure 7, relative criticality for each activity is indicated by the numbers to the left of the bars in the bar chart. These results indicate that Path B is the highest-risk path, the one with the greatest likelihood to delay the project. It is critical in 69 percent of the iterations, even though its CPM duration is five days shorter than that of Path A. The risk management steps taken on the CPM critical Path A succeeded; Path A is forecast to delay the project only 31 percent of the time. In this way a risk analysis helps identify the highest-risk path for risk management.
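Relative criticality is a tally: for each iteration, note which path turned out longest, then report each path's share of the iterations. The sketch below does this for two parallel paths. All of the ranges are assumptions, since the Figure 7 table is not reproduced in the text; they are chosen so that Path A has a 130-day CPM duration with narrow ranges and Path B a 125-day CPM duration with wide ranges, and the resulting shares will not match the 69 and 31 percent reported.

```python
import random

# (low, most likely, high) in working days; every range here is assumed.
PATHS = {
    "A": [(45, 50, 60), (75, 80, 90)],   # CPM 130 days, risk reduced by mitigation
    "B": [(35, 45, 95), (70, 80, 120)],  # CPM 125 days, wide ranges retained
}

def relative_criticality(iterations=2500, seed=1):
    """Share of iterations in which each path is the longest (critical) path."""
    rng = random.Random(seed)
    critical_counts = {name: 0 for name in PATHS}
    for _ in range(iterations):
        lengths = {
            name: sum(rng.triangular(low, high, mode) for low, mode, high in acts)
            for name, acts in PATHS.items()
        }
        longest = max(lengths, key=lengths.get)
        critical_counts[longest] += 1
    return {name: count / iterations for name, count in critical_counts.items()}

criticality = relative_criticality()
for name, share in criticality.items():
    print(f"Path {name}: critical in {share:.0%} of iterations")
```

Under these assumed ranges the shorter Path B is critical in most iterations, which is the same qualitative result the article reports.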
Case 3 simulation results are presented in Figure 8, which shows that the risk management actions have reduced the risk of the total project. The average completion date is now January 3, 1997, not January 17 as in Case 2 before risk management.
Will the project complete on time after these risk management actions? The CPM date of December 11, 1996, is still less than 5 percent likely and further risk management steps are needed to reduce the risk to an acceptable level. Those steps might most profitably be applied to Path B, the highest-risk path, at this time. Certainly, the project manager should look closely at the progress in B101 and B102, for those are high-risk activities.
Risk assessment of a project requires three steps: create the CPM schedule; gather risk information such as optimistic and pessimistic durations and probability distributions; and simulate the network using a Monte Carlo approach. The greatest amount of effort and judgment goes into developing the three-point activity duration estimates to use in a schedule risk analysis.
The simple case examples highlight the benefits of risk assessment and the pitfalls of relying on a CPM analysis alone. Some of these benefits are finding out how likely the CPM completion date is; determining the contingency needed to reduce the overrun risk to an acceptable level; identifying the highest-risk path for project risk management; and evaluating the effect of risk management actions.
The risk analysis has the potential of providing key information for project managers in advance so that risk mitigation plans can be developed and implemented now. Experience with risk analysis shows also that developing the data and reviewing the results enables the participants to understand and to manage their project better.
There is risk in every project. Ignoring risk doesn't make it go away. A three-step risk analysis should be conducted for every important project.
1. A 1960s government manual on PERT recommended that the high-range pessimistic estimate should include failure and making a fresh start if the likelihood of such an event was at least 1 percent. See U.S. Department of Defense, “PERT Fundamentals: POTC Textbook,” Bolling AFB, PERT Orientation and Training Center (date unknown), p. 3.28.
2. Asymmetrical distributions are quite common. The CPM durations are often optimistic scenarios that do not work out in practice.
3. One measure of how optimistic these durations are is a comparison of the CPM duration to the average that would occur if the activities were to be done many times. The average durations that correspond to the given ranges and triangular distributions are computed by (low + most likely + high) / 3. For A101, the average duration is calculated as (40 + 50 + 100) / 3 = 190 / 3 = 63.3 working days. An overrun of 13.3 days from the CPM duration of 50 days is almost three weeks.
4. More iterations will provide more accuracy and smoother output graphs. More complex models will require more iterations to produce accuracy. If different scenarios are embedded in the model (e.g., 20 percent chance of failing a test and starting over), more iterations will be needed. Often iterations will take computer time, especially with complex projects and resource leveling, so there is a practical side to how many iterations to use. Try a number of iterations on the specific schedule, using different “seed values” to start the Monte Carlo, until the results do not change materially as the number is increased. ■
David Hulett, Ph.D., consults on project cost and schedule risk analysis and project scheduling through his firm in Santa Monica, Calif., and with Humphreys & Associates, Inc., of Mission Viejo, Calif.
PM Network • July 1996