## Defensive code for a high reliability environment

### Make it as easy as possible for your code to be correct.

This is the overarching theme that I’ll keep coming back to throughout this article. The truth is that humans aren’t very good at writing correct and reliable programs at any meaningful level of complexity. All software has bugs and it required extensive debugging with extreme care to excise them. When writing code for high reliability, you are your own worst enemy. You must build systems around yourself – social and technical – to protect the code from your own human fallibility.

### Create frequent, limited scope commits with full code review.

Before I get to specific technical approaches, it’s important to remember that the best checks are social. The quality of code review generally varies inversely with the volume of code to be reviewed, so it’s imperative to be commit patches of limited scope frequently for each family of edits you make.

GitHub is a great tool for code reviews. Reference the person you want to check your code in your commit message (for example, “added new option parsing @maxhodak”) and they’ll get an email notifying them of the commit. Inline code comments are great places to have discussions about the code. Once all issues have been fixed (as seen through Compare View), receiving a “looks good to me” or “lgtm” comment on the whole commit means you’re ready to move on.

Though you should commit often, don’t commit too often. The bar should be that all of the relevant code for a review is in a single commit. The main motivation for this is that GitHub doesn’t allow leaving comments in Compare View, so reviews of multiple commits are much more complicated than reviews of single commits. Working on branches and merging is a good way to handle this for larger features.

Code review is ordinarily collaborative, but you should also practice adversarial review when working in a high reliability setting. In adversarial code review, you should write your own tests, but then nominate someone else to go through your tests and add their own. This goes above and beyond simple diff-based code review: the goal is to step through the whole of the logic, not just the diff, line-by-line looking for potential issues and write tests for them. Review is done when all of these tests pass. It’s important that the test review and expansion is done by someone different that the original author of the code. Both the original developer an the opposing developer must sign the QA acceptance form prior to deployment.

### Have flawless whitespace control and totally consistent syntax.

Bad whitespace and syntax are my brown M&Ms. It represents a lack of precision and care in your work; at worst, it’s indicative of fuzzy thinking, and at best it’s just sloppy. Be sure to show invisible characters in your editor. Newlines should directly follow the last visible character on that line. Indentation of empty lines should be consistent, whether it’s zero or nonzero. Whitespace and syntax are basics – if you can’t get those right, I’m not going to bother spending time going through the logic.

Good programmers get used to spending thousands of hours reading code, and proper syntax is much easier to read – and read correctly – than bad syntax. In addition to being sloppy, it obscures the logic and is harder to visually parse. There’s no excuse not to get it right. There is probably a widely-accepted style guide for your language. If so, you should use that.

Syntax is a social contract: there’s no “right” answer for most decisions. That’s why it’s important to pick a style guide and follow it. The arguments to be had over syntax are stupid and it doesn’t really matter what you choose as long as it’s consistent; deferring to the community is an easy way to avoid these questions and simultaneously make it easier for new programmers to come in and visually parse the code base easily.

Another underestimated reason to get whitespace right and not screw around with tweaking it later is that seriously interferes with code reviews.

### Always return a strongly typed, affirmative result.

Functions should always return an explicit result (see Explicit Is Better Than Implicit). In most languages, unit/void return types are usually incorrect. Avoid null cases that specify an (N+1)th return type by omission.

The last example above could still be improved: it lacks comprehensive error handling. Pattern matching conditions should be added for receiving various classes of errors that might be encountered and handling them.

### Avoid exceptions.

This is a corollary of “return an affirmative value from every function”. Exceptions are glorified GOTO statements. In the worst case, they can crash a thread and put your program into a bad state. Instead, you should handle exceptional cases in your logic. Following from the code example above:

In this case, we use an Either monad to represent a request that might fail. Why can’t we just use try/catch in the calling code? Why are these two examples any different? In Java, the compiler checks for proper exception handling. The answer goes back to Rule #1: make it as easy as possible for your code to be correct. Most languages don’t validate exception handling, and in Java it’s too easy to simply add throws FooException to your method signature and propagate the error to somewhere else where you’ve lost clarity on your logic.

In the “bad” example, there are two systems: the function’s return value and the exception’s GOTO logic. In the “good” example, there is one system: the function’s return value. The return type of the function includes the knowledge that the request might fail, making it easier for you to be right.

A simpler example may make this clearer. The following code will happily compile, despite being very incorrect:

By changing the return type to Option[Int], it now becomes a compiler error, since you can’t add an integer to an option:

The exception should be replaced by None, of course, but I left it there to make a point. Even if you absent-mindedly leave the exception there at first, it’s still marginally harder to be wrong.

### Avoid mutable state whenever possible.

Avoiding mutable state is important to two reasons: first, it dramatically simplifies concurrency, and second it helps avoid several classes of bugs that derive from “action at a distance” errors, even in a single thread.

A functional style forces good habits, reduces opportunities for mistakes, and is generally cleaner than an imperative style. Compare:

Avoiding mutable state also makes it harder to get “out of sync” with an external state of the world:

Mutable state usually isn’t necessary. Sometimes it’s the right answer, but you should have a high level of suspicion about it. It’s usually a hint that you’re thinking about a problem the wrong way.

### Explicit is better than implicit.

This applies both to naming and to logic. Take the case of receiving from a stream. You want to read messages until some condition is true and you have all of the information that you need. In this example, stream.receive will “tick” at least once every four seconds, emitting an Empty if nothing has been received in that duration. This has the potential be create an infinite loop if a timeout isn’t built around these Empty packets.

In this case, we could say that the timeout logic is implicit. If the programmer isn’t aware of the full semantics of this type of stream, it’s very easy to create an infinite loop by omission of the case Empty logic. This might be a very easy mistake to make if in tests the fixture (or real hardware) always returns data. However, a whole class of bugs will be missed in this context. What happens when the implementation of query and a bug is introduced leading to periodic dropping of first packets? You’ll never get another response, and be stuck receiving a lazy stream of Emptys indefinitely.

Another way to do this is add a required parameter to query, affecting the stream returned:

In this example, we’ve introduced a parameter governing when to automatically close the stream if no new data has been received in that duration. If this parameter is omitted, it’s a compiler error. We’ve effectively eliminated a whole class of bugs by making the timeout logic explicit. In the corner case where you do want a lazy stream of Empty responses for a while, you can specify a large timeout. But, it’s easier to be right.

Another example of explicit-versus-implicit is in typing function parameters. Compare:

or:

Which is less conducive to an accidental logic error by some contributor six months from now?

### Make it difficult to express something invalid.

The last example is only “better” and not “good” because TrayState as defined above has a TrayUnknown member object. Therefore, it’s possible to express something invalid. You’ll need to pattern match on toState to prevent bad things from happening. You could have ValidTrayState or something similar:

You’d have to think carefully about this, though, or you’ll risk getting a huge proliferation of messy case classes, objects, and traits that just end up confusing things.

In an ideal world, it should be impossible to express something invalid.

### Avoid ad-hoc arrangements; make it repeatable and learnable.

The REPL has a place in trying things out interactively, but you should never into a situation where you’re repeatedly loading a REPL and pasting in a whole block of code. Instead, make an object that builds the objects you’re interest in and import that. In three months when you need to revisit that area of the codebase, you’ll have forgotten the series of incantations that led you to copy-pasting a bunch of crap onto the REPL.

If you have a neat series of objects that returns the objects of interest – better yet, add these as methods onto the relevant existing parts of the code being debugged – it’s much easier to come back to the correct context later on and for new people to learn the codebase. It removes a source of potential errors in the copy-pasting, too: it’s an easy mistake to change something minor, paste it back into the REPL, and forget to update the code.

### You should never see the same error twice.

Every time you encounter a game-breaking error, go deep and handle it in a way you believe to be thorough. You should never be in a situation where when something crashes you can say, “yeah that happens when X, Y, and Z are true… just foo or bar to avoid it.” If you see the same error a second time, it should be an honest surprise. For it to be an “honest surprise” you need to stop and think seriously about what caused the error and of the logic that surrounds it. A quick fix, recompilation, and green test is probably not a comprehensive solution.