A recent tweet from Market Urbanism about bus bunching early along the route of the B35 got me thinking about the ways we measure transit performance. Given the bias towards big capital projects in the US, it’s not surprising that our service performance metrics can be a little underpowered. I couldn’t find a service policy for NYCMTA on their website, but I did scare up LACMTA’s 2011 Transit Service Policy for here in Los Angeles and the MBTA’s 2010 Service Delivery Policy for Boston. If anyone knows where to find a similar standard for NYCMTA, I’d be happy to update this post to include it.
Note: for this post, I’m talking only about measuring the operational quality of the transit services we have chosen to provide, not the quality derived from things like span of service, frequency of service, and coverage of service, and not measures of efficiency that are also sometimes conflated with quality.
LACMTA’s policy on service quality is remarkably brief (see page 33 of the pdf). Quality is measured by on-time performance (OTP), with a target threshold of 80%, and the volume of customer complaints, relative to an established baseline that references complaints on the poorest performing routes in 2008. This shows that the tools for measuring transit performance in Los Angeles have not yet caught up to our increasing dedication to expanding the system, as manifested by Measure R.
OTP is the easiest thing to measure, but unfortunately, for many transit services, it’s the least relevant to passengers. On-time departures and arrivals are important for long-headway services, like commuter rail and low-frequency bus, where passengers time their arrivals at the stop to a published schedule. If you use a service that comes every hour or half-hour, like, say, Metro Local 158, you don’t just roll up to the stop whenever and wait for a bus. In this case, the bus being late translates directly into delays for you.
A Better Way to Measure Long-Headway Service
For long-headway transit, problems during the peak period have more of an impact on the perception of service quality than problems late at night, because more people are riding during peak periods. A late trip during the peak period delays more people than a late off-peak trip. Therefore, for long-headway service, we should look at the passenger-weighted OTP (PWOTP):
$$\mathrm{PWOTP} = \frac{\sum_i n_i \, \mathrm{OTP}_i}{\sum_i n_i}$$
Where $\mathrm{OTP}_i$ is the on-time performance for trip $i$ (1 if the trip is on time, 0 if it is late), and $n_i$ is the number of passengers on trip $i$. For example, consider a commuter rail service with 5 peak period trains carrying 1,500 people each, and 15 off-peak trains carrying 100 people each. Under conventional OTP, if any one of the 20 trains is delayed, OTP will be 95%. With PWOTP, a delay to a peak period train results in performance of 83% (7,500 of 9,000 passengers on time). A delay to an off-peak train results in performance of 99% (8,900 of 9,000).
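To make the arithmetic concrete, here is a minimal Python sketch of the commuter rail example. The function names `otp` and `pwotp` and the (passengers, on-time) pair layout are my own illustrative choices, not anything from an agency's policy.

```python
def otp(trips):
    """Conventional OTP: share of trips that ran on time."""
    return sum(float(on_time) for _, on_time in trips) / len(trips)

def pwotp(trips):
    """Passenger-weighted OTP: share of passengers on on-time trips.
    `trips` is a list of (passengers, on_time) pairs."""
    total = sum(n for n, _ in trips)
    return sum(n * float(on_time) for n, on_time in trips) / total

# 5 peak trains with 1,500 riders each, 15 off-peak trains with 100 each
peak = [(1500, True)] * 5
off_peak = [(100, True)] * 15

late_peak = [(1500, False)] + peak[1:] + off_peak      # one peak train late
late_off_peak = peak + [(100, False)] + off_peak[1:]   # one off-peak train late

print(f"{otp(late_peak):.1%} {pwotp(late_peak):.1%}")          # 95.0% 83.3%
print(f"{otp(late_off_peak):.1%} {pwotp(late_off_peak):.1%}")  # 95.0% 98.9%
```

Conventional OTP can’t tell the two days apart; the passenger-weighted version can.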
Now, you could have 4 off-peak trains be delayed and still meet a 95% PWOTP threshold, and that doesn’t seem like great service either. So I think the way to go for long-headway services is to say both OTP and PWOTP need to meet a policy threshold. The current policy, which is just OTP, lets operators meet their standards by running a bunch of on-time trips late at night to make up for things being fouled up during rush hour.
What’s Important For Short-Headway Service?
If we’re talking about short-headway service, OTP of individual transit vehicles doesn’t really matter. What matters is headway regularity and travel-time reliability.
Headway regularity is important on a short-headway service because passengers don’t time their arrivals at the stop to the schedules for individual trips. The Blue Line runs every 6 minutes during rush hour, so if you need to ride, you just go to the station, knowing that it will never be very long until a train comes – as long as the headways are hewing to the schedule. So for the Blue Line during rush hour, a much better performance metric is something that relates to headways. Note that if headways have become irregular, not many people are going to be on the trips with a short headway, but a lot of people are going to be stuck waiting for the trips with a long headway. Therefore, the long-headway trip is more important to perceptions of service quality.
Travel-time reliability is pretty self-explanatory and serves as a substitute for OTP for short-headway services. If a specific trip departs five minutes late and arrives five minutes late, that’s irrelevant from a passenger’s point of view if the headway regularity is good.
Now back to Market Urbanism’s tweet. Note that if your service performance metric is OTP, your dispatchers might be incentivized to pursue operational strategies that make the service worse for your passengers.
Let’s consider a simple example. A bus route operates on 10 minute headways. The buses operating trips A, B, and C are approaching one end of the route, where they will turn for trips A’, B’, and C’. Due to disturbances along the route, trips A and C are 9 minutes behind schedule while trip B is on time, which means that B is only 1 minute behind A and C is 19 minutes behind B. The best thing to do for passengers would be to hold the bus operating trip B at the end of the route for 9 minutes, and start trip B’ 9 minutes late, because this would restore 10 minute headways for B’ and C’.
However, if the only performance metric is OTP, this strategy will make the apparent quality of service go down, because now B’ is late as well as A’ and C’. This encourages dispatchers to boost OTP by sending out B’ at the scheduled time, even though it will make things worse for passengers. Note that this strategy is also detrimental to travel-time reliability, because the long headway in front of C’ will ensure that it faces a higher than normal passenger load, further throwing that vehicle off schedule.
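Here is a small Python sketch of the headway arithmetic in this example. It assumes, as the example implies, that A’ and C’ leave 9 minutes late because their buses arrive late and turn immediately; the `headways` helper is just an illustrative name.

```python
def headways(lateness, scheduled_headway=10.0):
    """Actual headways between consecutive departures, given each trip's
    lateness in minutes (the first entry is the leading trip)."""
    return [scheduled_headway + later - earlier
            for earlier, later in zip(lateness, lateness[1:])]

# Lateness of A', B', C'. A' and C' are 9 minutes late either way;
# the only choice is how long to hold B'.
print(headways([9, 0, 9]))  # B' sent out on time  -> [1.0, 19.0]
print(headways([9, 9, 9]))  # B' held 9 minutes    -> [10.0, 10.0]
```

Sending B’ out on schedule protects its OTP but leaves a 19 minute gap in front of C’.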
How Should We Measure Short-Headway Service?
The MBTA’s policy is a step ahead of LACMTA regarding short-headway service, because it uses headway-related metrics for all rapid transit services, and for bus services that operate at a headway of 10 minutes or less. It also uses trip-time metrics for these services. The MBTA’s policy (see pages 10-11 of the pdf) is for trips to operate within 1.5 times the scheduled headway, and within specified ranges relative to scheduled travel time.
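As a rough sketch of how the headway half of that standard could be checked, the Python below counts the share of trips whose headway is within 1.5 times the scheduled headway. The function name is hypothetical, and treating a headway of exactly 1.5 times the schedule as passing is my assumption (it matches how the 15 minute headway is scored in the examples later in this post).

```python
def share_within_limit(headways, scheduled_headway, factor=1.5):
    """Share of trips whose actual headway is no more than
    `factor` times the scheduled headway."""
    limit = factor * scheduled_headway
    return sum(1 for h in headways if h <= limit) / len(headways)

# 10 minute scheduled headway, so the limit is 15 minutes
print(share_within_limit([1, 19], 10))   # 0.5
print(share_within_limit([5, 15], 10))   # 1.0
```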
That’s much better than OTP, but it’s not sensitive to the magnitude of headway variability. I can think of a few other things we ought to measure to get a really good picture of service quality. For short-headway service, we should look at the passenger-weighted average wait time (PWAWT), passenger waits exceeding threshold (PWET), passenger-weighted excess wait time (PWEWT), or standard deviation of headway.
Passenger-Weighted Average Wait Time (PWAWT)
PWAWT is just a weighted average of how long passengers wait. An unweighted average would just be equal to half the scheduled headway, regardless of headway variability. The weighted average accounts for the fact that more people wait for the longer headway trip. PWAWT will always be at least half the scheduled headway, and it is strictly greater whenever headways are irregular.
$$\mathrm{PWAWT} = \frac{\sum_i n_i \cdot 0.5\,h_i}{\sum_i n_i}$$
Where $n_i$ is the number of passengers on trip $i$, and $h_i$ is the headway on trip $i$. Note that for the short-headway services, we are assuming uniform passenger arrivals during each interval between trips, which allows us to assume the average wait time for each trip is $0.5h_i$. If we assume that passenger arrivals are uniform throughout the entire period in question, then the number of passengers is just a linear function of the headway, and we don’t even need to know how many passengers are on each trip: the formula reduces to $\sum_i h_i^2 / (2\sum_i h_i)$. It should go without saying that the scheduled headway must be constant throughout the period in question if we are using PWAWT.
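A minimal Python sketch of PWAWT follows. The function name and the optional `passengers` argument are illustrative; when no passenger counts are supplied, it falls back on the uniform-arrival assumption and weights each trip by its headway.

```python
def pwawt(headways, passengers=None):
    """Passenger-weighted average wait time, in the units of `headways`.
    Each trip's riders wait half its headway on average."""
    if passengers is None:
        # Uniform arrivals over the whole period: load proportional to headway
        passengers = headways
    total = sum(passengers)
    return sum(n * 0.5 * h for n, h in zip(passengers, headways)) / total

print(pwawt([10, 10]))  # 5.0  (perfectly regular 10 minute service)
print(pwawt([5, 15]))   # 6.25
print(pwawt([1, 19]))   # 9.05
```

All three cases average a 10 minute headway, but the typical rider’s wait nearly doubles as the buses bunch.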
Passenger Waits Exceeding Threshold (PWET)
PWET, the percentage of passengers whose wait exceeds a threshold, could be used if we wanted to look at a period without a constant headway, like the entire day. The threshold could be absolute, e.g. must wait longer than headway plus 2 minutes, or relative, e.g. must wait longer than 1.25 headways. For the example below, I’m going to set the threshold at headway plus 1 minute, because you start to get annoyed about waiting pretty quickly when your wait goes beyond one headway.
$$\mathrm{PWET} = \frac{\sum_i \begin{cases} 0 & \text{if } h_i \le t_i \\ n_i \, \dfrac{h_i - t_i}{h_i} & \text{if } h_i > t_i \end{cases}}{\sum_i n_i}$$
This one’s a little more complicated, so a quick explanation: $t_i$ is the wait threshold for trip $i$, and the denominator is just the total number of passengers. The numerator is an if statement that tells us to add nothing if the headway for trip $i$ is no greater than the threshold, since no passengers for that trip experienced a wait that was too long. If the headway is greater than the threshold, we add the number of passengers who waited for too long, which is the fraction $(h_i - t_i)/h_i$ of that trip’s passengers, assuming uniform passenger arrivals during that headway period. Note that if we are looking at a period with variable headways, we probably can’t assume that passenger arrivals are uniform for the entire period, so we need to know the number of passengers for each trip.
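The if statement translates almost directly into code. This is a sketch under the same uniform-arrival assumption; the function name and the per-trip `thresholds` list are my own choices, and passenger counts default to being proportional to headway when they aren’t supplied.

```python
def pwet(headways, thresholds, passengers=None):
    """Share of passengers whose wait exceeds their trip's threshold."""
    if passengers is None:
        passengers = headways  # uniform arrivals over the whole period
    waited_too_long = 0.0
    for n, h, t in zip(passengers, headways, thresholds):
        if h > t:
            # Riders who arrived in the first (h - t) minutes of the gap
            waited_too_long += n * (h - t) / h
    return waited_too_long / sum(passengers)

# 10 minute scheduled headway, threshold of headway plus 1 minute
print(pwet([1, 19], [11, 11]))   # 0.4
print(pwet([5, 15], [11, 11]))   # 0.2
print(pwet([10, 10], [11, 11]))  # 0.0
```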
The weakness of PWET would be that it treats all delays beyond the threshold the same, when the magnitude is obviously important. Passengers are more annoyed if they have to wait an extra 5 minutes versus an extra 1 minute. PWAWT and PWET together might give a good picture.
Passenger-Weighted Excess Wait Time (PWEWT)
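PWEWT would allow for a weighted-average metric that emphasizes the importance of very long headways without requiring headways to be constant throughout the period of analysis. It would be a weighted average of only the excess wait time, spread over all passengers, and could be defined either with an absolute threshold or a relative threshold. Relative to an absolute threshold of $a$ minutes beyond the scheduled headway, where $h_{s,i}$ is the scheduled headway for trip $i$:
$$\mathrm{PWEWT} = \frac{\sum_i \begin{cases} 0 & \text{if } h_i \le h_{s,i} + a \\ n_i \, \dfrac{\left(h_i - (h_{s,i} + a)\right)^2}{2 h_i} & \text{if } h_i > h_{s,i} + a \end{cases}}{\sum_i n_i}$$
The nonzero term is the number of passengers who waited too long, $n_i (h_i - (h_{s,i} + a))/h_i$, times their average excess wait, $(h_i - (h_{s,i} + a))/2$. For PWEWT with a relative threshold of $b$ headways, just replace every $h_{s,i} + a$ in the previous formulation with $b\,h_{s,i}$.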
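Here is a sketch of PWEWT in Python, again assuming uniform arrivals within each headway. The function and parameter names are illustrative. The threshold is passed in per trip, so the absolute and relative versions differ only in how the `thresholds` list is built.

```python
def pwewt(headways, thresholds, passengers=None):
    """Average excess wait per passenger (over ALL passengers), where
    excess wait is time waited beyond the trip's threshold."""
    if passengers is None:
        passengers = headways  # uniform arrivals over the whole period
    excess = 0.0
    for n, h, t in zip(passengers, headways, thresholds):
        if h > t:
            # A share (h - t)/h of riders wait too long, and their
            # excess wait averages (h - t)/2.
            excess += n * (h - t) ** 2 / (2 * h)
    return excess / sum(passengers)

scheduled = [10, 10]
absolute = [hs + 1 for hs in scheduled]     # threshold of h + 1
relative = [1.25 * hs for hs in scheduled]  # threshold of 1.25 h

print(pwewt([1, 19], absolute))  # 1.6
print(pwewt([5, 15], absolute))  # 0.4
print(pwewt([5, 15], relative))
```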
Standard Deviation of Headway
An alternative to PWAWT, PWET, and PWEWT would be to use the standard deviation of headway*. For example, if the policy guideline for standard deviation of headway was set at 25% of scheduled headway, that would result in a service that met an MBTA-type policy with a 95% threshold, and exhibited less variability than is possible under that policy alone. Standard deviation could only be used for periods with constant scheduled headways.
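A quick Python sketch of the standard deviation check, using the 25% guideline from the example above. The function names are hypothetical. The last function illustrates the footnoted point for PWAWT: with uniform arrivals and the population standard deviation, PWAWT equals half of (mean headway plus variance divided by mean headway).

```python
from statistics import mean, pstdev

def sd_share(headways, scheduled_headway):
    """Standard deviation of headway as a share of scheduled headway."""
    return pstdev(headways) / scheduled_headway

def meets_guideline(headways, scheduled_headway, max_share=0.25):
    return sd_share(headways, scheduled_headway) <= max_share

def pwawt_from_sd(headways):
    """PWAWT rebuilt from the mean and standard deviation of headway."""
    m = mean(headways)
    return 0.5 * (m + pstdev(headways) ** 2 / m)

print(sd_share([9, 11], 10), meets_guideline([9, 11], 10))  # 0.1 True
print(sd_share([5, 15], 10), meets_guideline([5, 15], 10))  # 0.5 False
print(round(pwawt_from_sd([1, 19]), 2))                     # 9.05
```

Note that alternating 5 and 15 minute headways, which would pass a 1.5 times headway test, fails a 25% standard deviation guideline.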
Note that any of these standards would encourage the dispatchers to pursue operational strategies beneficial to passengers. In the past, it might have been difficult to calculate these statistics and figure out the best real-time operational strategies, but with technology like modern AFC and AVI, it shouldn’t be hard.
Enough Theory, Show Me Some Examples
Continuing from the previous example, let’s assume we dispatch B’ on time and C’ is 9 minutes late. For these two trips, OTP is 50%. By the MBTA’s 1.5 times headway standard, 50% of trips meet the policy. The PWAWT is 9.05 minutes. Therefore, if buses are bunched in groups of two along the route, passengers must expect to wait almost an entire published headway for service. The PWET is 40%. Assuming a threshold of h + 1, the PWEWT is 1.6 minutes.
Now let’s assume that the OTP threshold is 5 minutes, and the dispatcher decides to try to help passengers out without hurting OTP stats, so he holds B’ for 4 minutes. Now, B’ departs with a 5 minute headway and C’ with a 15 minute headway. OTP is 50%, but now 100% of trips meet the MBTA’s policy. The PWAWT is 6.25 minutes, a major improvement. The PWET is 20% and the PWEWT is 0.4 minutes.
Finally, if the dispatcher holds B’ for 9 minutes, then both B’ and C’ depart with 10 minute headways. OTP is 0%, but the PWAWT goes down to 5.00 minutes. PWET is 0% and PWEWT is 0 minutes. Note that most of the improvement comes on the front end of the hold, so even holding a bus for a few minutes can do a lot in terms of headway regularity. This is an important insight because it may be desirable to not hold B’ for the full 9 minutes, in order to save some operational flexibility for later in the dispatch period. A bus that is early can always be held more, but it is very difficult for a late bus to catch up to schedule.
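For anyone who wants to check the numbers, the three hold scenarios can be reproduced with a short Python sketch. It assumes uniform passenger arrivals (so each trip’s load is proportional to its headway), a 10 minute scheduled headway, a 5 minute OTP threshold, and a wait threshold of h + 1; the names are illustrative.

```python
SCHEDULED = 10.0               # scheduled headway, minutes
OTP_LIMIT = 5.0                # up to 5 minutes late still counts as on time
WAIT_LIMIT = SCHEDULED + 1.0   # PWET/PWEWT threshold of h + 1

def metrics(hold_b, late_a=9.0, late_c=9.0):
    """Metrics for trips B' and C' when B' is held `hold_b` minutes."""
    lateness = [hold_b, late_c]
    hw = [SCHEDULED + hold_b - late_a,   # gap from A' to B'
          SCHEDULED + late_c - hold_b]   # gap from B' to C'
    riders = sum(hw)                     # load proportional to headway
    otp = sum(l <= OTP_LIMIT for l in lateness) / len(lateness)
    within = sum(h <= 1.5 * SCHEDULED for h in hw) / len(hw)
    pwawt = sum(h * h / 2 for h in hw) / riders
    over = [max(h - WAIT_LIMIT, 0.0) for h in hw]
    pwet = sum(over) / riders
    pwewt = sum(x * x / 2 for x in over) / riders
    return otp, within, pwawt, pwet, pwewt

for hold in (0, 4, 9):
    otp, within, pwawt, pwet, pwewt = metrics(hold)
    print(f"hold {hold}: OTP {otp:.0%}, within 1.5h {within:.0%}, "
          f"PWAWT {pwawt:.2f}, PWET {pwet:.0%}, PWEWT {pwewt:.1f}")
```

The output matches the figures quoted above for the on-time dispatch, the 4 minute hold, and the 9 minute hold.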
Some thoughts: the MBTA standard isn’t a bad proxy, but it’s still imprecise. A service that swung between 5 minute and 15 minute headways would satisfy the MBTA’s policy, and generate PWAWT of 6.25 minutes and PWEWT of 0.4 minutes. The metrics don’t sound that bad, but this doesn’t seem like a great service. That suggests that if we are going to use PWAWT and PWEWT, the standard needs to be pretty tight. PWET makes an important contribution here, because PWET of 20% definitely sounds bad.
I’ve also prepared a more detailed example that looks at these metrics under a somewhat random distribution of buses (as random as my mind can make it on the fly), a moderate bunching scenario, and a severe bunching scenario. The premise is a 10-minute headway service, with OTP threshold of 5 minutes and PWET/PWEWT threshold of 11 minutes. Moderate bunching assumes alternating 5 and 15 minute headways. Severe bunching assumes alternating headways of 1 and 19 minutes. Buses are held for a maximum of 4 minutes under partial holds, and for as long as needed to balance headways under full holds. The results are in the table below. (Contact me if you’d like the source spreadsheet.)

Conclusions
OTP metrics are appropriate for long-headway services, but they should be passenger-weighted. They are inappropriate for short-headway services, which should be measured by metrics like the MBTA’s headway variability standard, PWAWT, PWET, and PWEWT. Agencies should set standards and then define dispatcher procedures that will improve these metrics. As was seen in both the brief example and the detailed example, even when bus bunching is bad, short holds can have a significant impact on improving passenger experience.
Of course, we haven’t broached the subject of what the headway and OTP thresholds should be, but that’s a topic for another time.
*In fact, PWAWT, PWET, and PWEWT can be expressed as a function of the standard deviation.