## Fast progress requires strong gradients

#### October 2016

One of the things I did early on in Transcriptic’s history was invite successful operators over for lunch to talk with us. One question I’d always ask, if no one else brought it up, was “what should the balance be between firefighting and building for the future?” That tension had driven my team crazy basically since the beginning of the company.

Without exception the answer was “oh you firefight all the way down,” with specific color ranging from “we were literally rm -rfing files off production servers to keep Gmail from running out of disk space,” to “you realize we’re talking about Twitter, right?”

My high-level observation was that many of the most successful startups spent a huge amount of their time on the verge of completely falling over (but rarely did), while many companies with great, beautiful technology ran out of money and died. The mental model this built for me was that iteration speed was a defining characteristic of successful companies, and that contact with the customer, and growth, was worth many times its weight in internal stability as long as customers were appropriately shielded.

Another counterintuitive effect I’ve seen over the years is that, in some cases, giving extremely talented engineers long-range build-the-right-thing projects, which is what I’d thought they wanted, ended up burning them out and creating a weird malaise.

“Firefighting is good” never seemed like a great thesis, though, and so while I had what seemed like useful heuristics, the theory felt very incomplete.

I had dinner a while ago with a friend at Google Brain, who told me that a side effect of getting into deep learning was that he suddenly found himself looking for feedback signals (“gradients”) everywhere. If you’re a runner, the improvement in your mile time is a gradient; if you’re a web marketer, it might be the change in click-through rate.

To make a wild simplification, one of the big ideas behind deep learning is that even though a model might be huge and complex, you can figure out the direction of lower error (the gradient) at each layer, and this information can be easily propagated through many successive layers using the chain rule. Thus you can efficiently train these big models with lots of parameters using gradient descent.
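For readers who want to see the mechanics, here is a minimal sketch of that idea, assuming a toy two-layer scalar model `y = w2 * tanh(w1 * x)` with a squared-error loss. The model, the function names, and the learning rate are all illustrative assumptions, not anything a real framework prescribes. The point is only that the error signal at the output gets multiplied backward through each layer by the chain rule.

```python
import math

def forward(w1, w2, x):
    """Two 'layers': a tanh unit followed by a linear unit."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y

def loss(w1, w2, x, target):
    """Squared error between the model output and the target."""
    _, y = forward(w1, w2, x)
    return (y - target) ** 2

def gradients(w1, w2, x, target):
    """Chain rule, applied from the output back toward the input."""
    h, y = forward(w1, w2, x)
    dloss_dy = 2.0 * (y - target)              # error signal at the output
    dloss_dw2 = dloss_dy * h                   # gradient for the last layer
    dloss_dh = dloss_dy * w2                   # signal passed back one layer
    dloss_dw1 = dloss_dh * (1.0 - h * h) * x   # through tanh: d tanh(u)/du = 1 - tanh(u)^2
    return dloss_dw1, dloss_dw2

def train(w1, w2, x, target, lr=0.1, steps=200):
    """Plain gradient descent: step each weight against its gradient."""
    for _ in range(steps):
        g1, g2 = gradients(w1, w2, x, target)
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2

if __name__ == "__main__":
    before = loss(0.5, 0.5, 1.0, 0.8)
    w1, w2 = train(0.5, 0.5, 1.0, 0.8)
    print("loss before:", before, "loss after:", loss(w1, w2, 1.0, 0.8))
```

Real networks do the same bookkeeping with matrices and automatic differentiation, but the structure is the same: one backward pass yields a direction of lower error for every parameter.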

This means that a big factor in how fast your model learns at any point in time is the steepness of the gradients, which depends in large part on your loss function. Things are of course more complicated than this, but the idea works well enough for the metaphor.
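A rough illustration of the steepness point, under heavy assumptions: take a one-parameter loss `steepness * w**2`, fix the learning rate, and count how many gradient descent steps it takes to get close to the minimum. The function name `steps_to_converge` and every constant here are hypothetical choices made for the sketch.

```python
def steps_to_converge(steepness, w=1.0, lr=0.1, tol=1e-3, max_steps=100_000):
    """Count gradient descent steps until |w| < tol on loss = steepness * w**2."""
    for step in range(max_steps):
        if abs(w) < tol:
            return step
        grad = 2.0 * steepness * w   # derivative of steepness * w**2
        w -= lr * grad
    return max_steps                 # gave up: the gradient was too shallow

if __name__ == "__main__":
    print("steep loss:  ", steps_to_converge(steepness=1.0))
    print("shallow loss:", steps_to_converge(steepness=0.01))
```

With the same starting point and the same learning rate, the steep loss gets there in tens of steps while the shallow one needs thousands. Nothing about the learner changed; only the strength of the signal did.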

Things that many people frequently but unsuccessfully attempt, such as losing weight or learning a new language, are often marked by shallow gradients. We work around this by imposing external structure or commitment mechanisms such as social pressure or school. Since directly measuring weight loss has such a weak gradient (the signal is very noisy and changes very slowly), most people really need a better signal. Commonly this is calorie tracking, but that’s also pretty noisy, and it’s expensive. Maybe a faster, cheaper signal that still correlates with the goal, even if it’s further removed from it, like blood glucose, would be better.

This is in large part what is meant by “gamification”: rapid feedback around clear success metrics can enable compelling goal seeking behavior.

Rational actors should understand the importance of driving safely, but this has a very poorly behaved loss function, and so we have the concept of speeding tickets. Speeding tickets have the effect of creating a very clear synthetic gradient that is easier to reason about than the idea of just driving safely to avoid accidents and injuries.

Firefighting is a fairly common and natural strong gradient for startups. It’s not a great one for obvious reasons but it’s easy to access and is generally pretty accurate. If things are breaking under load from your customers it’s very obvious what to do to make the product better in a way that is perceived by the market, and it helps you focus less on attractive ideas that customers care less about.

Premature optimization being evil also fits into this framework. What makes optimization premature or not? The presence of feedback signals to tell you what needs improving and how you’re doing against that.

In many cases there aren’t obvious fast error signals to follow. I’ve come to believe this is one of the big reasons that The Big Rewrite often fails, while incrementally building something better on top of the mess you have tends to work. I now see constructing strong feedback loops, which might involve creating synthetic gradients when necessary, as one of the major responsibilities of management. A “synthetic” gradient is much less desirable than a natural one because it depends more on your ability to plan and be generally smart, and humans overall tend to be not great at planning and not very smart. But I’d still take a bad error signal that we can correct later over no error signal any day, for reasons of both progress and morale. There are lots of issues I’ve run into over the years that had vague or assorted superficial explanations where I now believe the true root cause was a lack of strong gradients.

There’s a saying that “you make what you measure” and it’s interesting the extent to which this is really true, almost for free. Like my friend at Google, I now find myself seeing gradients everywhere and this has been an actionable framework for reasoning about why some things are easy and other things are hard for organizations to do.