Felipe Massa should have won today’s Singapore Grand Prix. He was the fastest driver, in the fastest car, using tyres which worked better on his car than on the competition’s. Instead, he finished next-to-last, out of the points, and saw Hamilton extend his lead in the Championship. And the cause of his failure was a classic piece of over-engineering.
Those familiar with Formula One may find this a bit obvious. But others, especially some of the software engineers with whom I work, may find it instructive.
In the course of a Formula One race, drivers will routinely stop to add fuel and change tyres. Unlike teams in US motor racing, Formula One teams can have as many people working on the car simultaneously as they like. As a result, a pit stop is an elaborately choreographed procedure, with mechanics changing tyres, adding fuel, cleaning debris out of the radiators, adjusting wing settings, and even polishing the driver’s visor. With so much going on, the driver is in no position to see when everyone’s finished. And there’s another consideration: the pit lane is narrow, and there may be other cars overtaking; the driver can’t drive away until it’s safe to do so.
The traditional solution has been “the lollypop man”, a mechanic holding a sign on a stick, right in front of the driver. Using the sign, this mechanic signals to the driver when to engage gear, and when it’s safe to leave.
Recently, some clever person in the Ferrari team thought, “This pole thingy is a bit inefficient. It takes quite a while (about a second – an eternity in racing!) to lift it, and sometimes the mechanic will hesitate. And having a person standing next to the car holding the pole just makes things more crowded. Wouldn’t it be better to replace the pole with a remotely-controlled light signal?”
And so they did. And when Massa made his first pit stop, the mechanic controlling the light flipped it to “green” while the fuel hose was still attached. Massa took off, dragging the fuel hose behind him, and knocking over a mechanic (who was rushed to hospital). By the time everything had been sorted out, Massa was dead last.
This was not the first time that the light signal system had failed, and the TV commentators were unsure whether the system had been enhanced with an electronic interlock, so that the light would be kept at red until the fuel hose had been removed. Obviously, if there were such an interlock it must have failed.
It seems to me that this is an interesting systems design problem, with a number of useful lessons. The intended function of the system is pretty straightforward, and the costs and benefits (including faster starting) are clear. Simplified, the system is intended to work as follows:
- Signal the driver to stop.
- Wait until the pit stop service has been completed.
- When it is safe to do so, signal the driver to go.
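For the software engineers among you, the three steps above boil down to something like the following. This is only a minimal sketch: the names `Signal`, `pit_signal`, `service_complete` and `lane_clear` are my own inventions for illustration, and I have no idea how Ferrari's actual system is implemented.

```python
from enum import Enum, auto


class Signal(Enum):
    STOP = auto()
    GO = auto()


def pit_signal(service_complete: bool, lane_clear: bool) -> Signal:
    """Signal GO only when the service is finished AND the pit lane is safe.

    Both inputs are hypothetical: in the real pit stop they are judgements
    made by the mechanic, not values read from a sensor.
    """
    if service_complete and lane_clear:
        return Signal.GO
    # Any doubt about either condition keeps the driver stationary.
    return Signal.STOP


# The driver is held while work continues, then released when both hold.
for done, clear in [(False, False), (True, False), (False, True), (True, True)]:
    print(done, clear, pit_signal(done, clear).name)
```

Note that the sketch says nothing about where the two inputs come from, or about what happens when one of them is wrong.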
But this describes only the correct behaviour of the system. We need to go beyond this, and think about the ways in which the system can fail, the probabilities of each failure, and the consequences. Broadly speaking, there are two types of failure that may arise:
- The mechanic evaluates the situation incorrectly, for example misjudging the position of an obstruction or the speed of another car in the pit lane.
- The physical mechanism fails to reflect the mechanic’s intent.
Let’s assume that the mechanic is equally competent whichever mechanism is used: he (or she?) is just as likely to make an error of judgement with the light as with the pole. This seems plausible, although if the light system did include some kind of interlock, it is possible that the mechanic might tend to rely upon that rather than making an independent assessment. (“I can’t see if the fuel hose is all the way out, but everything else is clear, and the interlock will catch it if I’m wrong, so… CLICK!” Not consciously, perhaps…)
But what about the mechanism? To be specific, what is the probability that the mechanic will inadvertently press the “start” button, and how does this compare with the probability that he might inadvertently lift the pole? The answer seems pretty clear: the button is far more prone to accidental activation. Anyone who has played a video game, or typed, or performed any other kind of task involving hand-eye coordination knows how easy it is to “jump the gun”. And the light system has other undesirable failure characteristics. If the mechanic realizes that he’s made a mistake, he has to do something (press another button?) on the light control, which takes at least 500 ms (based on what we saw at Singapore). The “lollypop” is relatively fail-safe; if the mechanic stops lifting it, gravity will do the rest.
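The difference in failure behaviour can be illustrated in code. The sketch below contrasts a latched control, which stays green until someone takes a second deliberate action, with a hold-to-go control, which reverts to red as soon as the operator lets go (the software analogue of gravity lowering the pole). Both classes are hypothetical; I am assuming the Ferrari light behaved roughly like the latched version, based only on what we saw at Singapore.

```python
class LatchedLight:
    """Hypothetical latched control: one press turns the light green,
    and it stays green until a separate corrective action is taken."""

    def __init__(self) -> None:
        self.green = False

    def press_go(self) -> None:
        self.green = True

    def press_stop(self) -> None:
        self.green = False


class HoldToGo:
    """Hypothetical hold-to-go control: the light is green only while
    the operator is actively holding the control."""

    def __init__(self) -> None:
        self._held = False

    def hold(self) -> None:
        self._held = True

    def release(self) -> None:
        self._held = False

    @property
    def green(self) -> bool:
        return self._held


# An accidental press, after which the mechanic does nothing further.
latched = LatchedLight()
latched.press_go()
print("latched light still green:", latched.green)

# An accidental grab, after which the mechanic simply lets go.
hold_to_go = HoldToGo()
hold_to_go.hold()
hold_to_go.release()
print("hold-to-go light still green:", hold_to_go.green)
```

With the latched design, recovering from a mistake requires a positive action; with the hold-to-go design, merely ceasing to act is enough. Neither design prevents the mechanic from misjudging the situation in the first place.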
So why do I describe this as “over-engineering”? For me, the term refers to additional engineering work which reduces the net value (benefits less costs) of the system. The light system was intended to provide a benefit of perhaps 2 seconds per driver per race, which was presumably expected to translate into points in the Championship and Constructors’ rankings. So far this season the mechanism has cost Ferrari at least 10 points, probably more. The actual benefit has been negligible. In addition, mechanics have been injured. And it’s plausible that this could have been predicted with an analysis of the potential failure modes, coupled with some simple behavioural stress testing.
One final thought: it doesn’t seem like a coincidence that these issues emerged after Michael Schumacher and Ross Brawn had left Ferrari. Both of them were passionate about what in my world is termed “operational excellence”. The Ferrari engineers are still among the very best in the world, but the operational quality is slipping.