Max Hodak Writings

Defensive code for a high reliability environment

June 2013

Make it as easy as possible for your code to be correct.

This is the overarching theme that I’ll keep coming back to throughout this article. The truth is that humans aren’t very good at writing correct and reliable programs at any meaningful level of complexity. All software has bugs and it required extensive debugging with extreme care to excise them. When writing code for high reliability, you are your own worst enemy. You must build systems around yourself – social and technical – to protect the code from your own human fallibility.

Create frequent, limited scope commits with full code review.

Before I get to specific technical approaches, it’s important to remember that the best checks are social. The quality of code review generally varies inversely with the volume of code to be reviewed, so it’s imperative to be commit patches of limited scope frequently for each family of edits you make.

GitHub is a great tool for code reviews. Reference the person you want to check your code in your commit message (for example, “added new option parsing @maxhodak”) and they’ll get an email notifying them of the commit. Inline code comments are great places to have discussions about the code. Once all issues have been fixed (as seen through Compare View), receiving a “looks good to me” or “lgtm” comment on the whole commit means you’re ready to move on.

Though you should commit often, don’t commit too often. The bar should be that all of the relevant code for a review is in a single commit. The main motivation for this is that GitHub doesn’t allow leaving comments in Compare View, so reviews of multiple commits are much more complicated than reviews of single commits. Working on branches and merging is a good way to handle this for larger features.

Code review is ordinarily collaborative, but you should also practice adversarial review when working in a high reliability setting. In adversarial code review, you should write your own tests, but then nominate someone else to go through your tests and add their own. This goes above and beyond simple diff-based code review: the goal is to step through the whole of the logic, not just the diff, line-by-line looking for potential issues and write tests for them. Review is done when all of these tests pass. It’s important that the test review and expansion is done by someone different that the original author of the code. Both the original developer an the opposing developer must sign the QA acceptance form prior to deployment.

Have flawless whitespace control and totally consistent syntax.

Bad whitespace and syntax are my brown M&Ms. It represents a lack of precision and care in your work; at worst, it’s indicative of fuzzy thinking, and at best it’s just sloppy. Be sure to show invisible characters in your editor. Newlines should directly follow the last visible character on that line. Indentation of empty lines should be consistent, whether it’s zero or nonzero. Whitespace and syntax are basics – if you can’t get those right, I’m not going to bother spending time going through the logic.

Good programmers get used to spending thousands of hours reading code, and proper syntax is much easier to read – and read correctly – than bad syntax. In addition to being sloppy, it obscures the logic and is harder to visually parse. There’s no excuse not to get it right. There is probably a widely-accepted style guide for your language. If so, you should use that.

Syntax is a social contract: there’s no “right” answer for most decisions. That’s why it’s important to pick a style guide and follow it. The arguments to be had over syntax are stupid and it doesn’t really matter what you choose as long as it’s consistent; deferring to the community is an easy way to avoid these questions and simultaneously make it easier for new programmers to come in and visually parse the code base easily.

Another underestimated reason to get whitespace right and not screw around with tweaking it later is that seriously interferes with code reviews.

Always return a strongly typed, affirmative result.

Functions should always return an explicit result (see Explicit Is Better Than Implicit). In most languages, unit/void return types are usually incorrect. Avoid null cases that specify an (N+1)th return type by omission.

// VERY BAD
def trayOpen: Unit = query("TRAY OPEN")

// BAD
def trayOpen: Boolean = query("TRAY OPEN") match {
  case Response("OK") => true
  case other => false
}

// GOOD
sealed trait TrayState
case object TrayOpen extends TrayState
case object TrayClosed extends TrayState
case object TrayUnknown extends TrayState

def trayState: TrayState = query("TRAY?") match {
  case Response("OPEN") => TrayOpen
  case Response("CLOSED") => TrayClosed
  case other = TrayUnknown
}

def trayOpen: TrayState = trayState match {
  case TrayOpen => TrayOpen
  case TrayUnknown => TrayUnknown
  case TrayClosed => query("TRAY OPEN") match {
    case Response("OK") => trayState
    case other => TrayUnknown
  }
}

The last example above could still be improved: it lacks comprehensive error handling. Pattern matching conditions should be added for receiving various classes of errors that might be encountered and handling them.

Avoid exceptions.

This is a corollary of “return an affirmative value from every function”. Exceptions are glorified GOTO statements. In the worst case, they can crash a thread and put your program into a bad state. Instead, you should handle exceptional cases in your logic. Following from the code example above:

// BAD
def takeReading: Data = {
  trayOpen match {
    case TrayOpen => insertSample match {
      case true => // ok
      case false => throw new Exception("Couldn't insert sample!")
    }
    case other => throw new Exception("Tray didn't open!")
  }
  query("READ foo, bar, quux") match {
    case Response(d) => d
    case Error(code, message) => code match {
      case 1 => throw new Exception(
        "Got exception during read: %d %s".format(code, message))
      // etc
    }
  }
}

// GOOD
def takeReading: Either[Data, Error] = {
  trayOpen match {
    case TrayOpen => insertSample match {
      case true => // ok
      case false => return Right(Error(2, "Couldn't insert sample"))
    }
    case other => return Right(Error(1, "Tray didn't open"))
  }

  // take reading
  query("READ foo, bar, quux") match {
    case Response(d) => Left(d)
    case e @ Error(code, message) => Right(e)
  }
}

In this case, we use an Either monad to represent a request that might fail. Why can’t we just use try/catch in the calling code? Why are these two examples any different? In Java, the compiler checks for proper exception handling. The answer goes back to Rule #1: make it as easy as possible for your code to be correct. Most languages don’t validate exception handling, and in Java it’s too easy to simply add throws FooException to your method signature and propagate the error to somewhere else where you’ve lost clarity on your logic.

In the “bad” example, there are two systems: the function’s return value and the exception’s GOTO logic. In the “good” example, there is one system: the function’s return value. The return type of the function includes the knowledge that the request might fail, making it easier for you to be right.

A simpler example may make this clearer. The following code will happily compile, despite being very incorrect:

def foo: Int = throw new Exception("I am an exception")

def bar = {
  var i = 0
  val quux = foo
  i = i + quxx
}

By changing the return type to Option[Int], it now becomes a compiler error, since you can’t add an integer to an option:

def foo: Option[Int] = throw new Exception("I am an exception")

def bar = {
  var i = 0
  val quux = foo
  i = i + quxx
}

The exception should be replaced by None, of course, but I left it there to make a point. Even if you absent-mindedly leave the exception there at first, it’s still marginally harder to be wrong.

Avoid mutable state whenever possible.

Avoiding mutable state is important to two reasons: first, it dramatically simplifies concurrency, and second it helps avoid several classes of bugs that derive from “action at a distance” errors, even in a single thread.

A functional style forces good habits, reduces opportunities for mistakes, and is generally cleaner than an imperative style. Compare:

// BAD
var i = 0
mylist.foreach { item =>
  i = i + item
}

// GOOD
val i = mylist.foldLeft(0) { (total, current) =>
  total + current
}

Avoiding mutable state also makes it harder to get “out of sync” with an external state of the world:

sealed trait TrayState
case object TrayOpen extends TrayState
case object TrayClosed extends TrayState

// BAD
var trayState = TrayOpen
def trayOpen = trayState match {
  case TrayOpen => TrayOpen
  case TrayClosed => {
    // do logic
    trayState = TrayClosed
  }
}
def trayClose = trayState match {
  case TrayClosed => TrayClosed
  case TrayOpen => {
    // do logic
    trayState = TrayOpen
  }
}

// GOOD
def trayState: TrayState = query("TRAY?") match {
  case Response("OPEN") => TrayOpen
  case Response("CLOSED") => TrayClosed
}
def trayOpen: TrayState = trayState match {
  case TrayOpen => TrayOpen
  case TrayClosed => // do logic
}
def trayClose: TrayState = trayState match {
  case TrayClosed => TrayClosed
  case TrayOpen => // do logic
}

Mutable state usually isn’t necessary. Sometimes it’s the right answer, but you should have a high level of suspicion about it. It’s usually a hint that you’re thinking about a problem the wrong way.

Explicit is better than implicit.

This applies both to naming and to logic. Take the case of receiving from a stream. You want to read messages until some condition is true and you have all of the information that you need. In this example, stream.receive will “tick” at least once every four seconds, emitting an Empty if nothing has been received in that duration. This has the potential be create an infinite loop if a timeout isn’t built around these Empty packets.

// BAD
def handleStream: Array[Data] = {
  val stream = query("READ foo, bar, quux")
  val buffer = ListBuffer[Data]
  var i = 0
  while(stream.ready) {
    stream.receive match {
      case Response(data) => {
        buffer.append(data)
      }
      case OK => stream.close
      case Empty => if(i > 4) {
        stream.close
      } else {
        i = i + 1
      }
    }
  }
  buffer.toArray
}

In this case, we could say that the timeout logic is implicit. If the programmer isn’t aware of the full semantics of this type of stream, it’s very easy to create an infinite loop by omission of the case Empty logic. This might be a very easy mistake to make if in tests the fixture (or real hardware) always returns data. However, a whole class of bugs will be missed in this context. What happens when the implementation of query and a bug is introduced leading to periodic dropping of first packets? You’ll never get another response, and be stuck receiving a lazy stream of Emptys indefinitely.

Another way to do this is add a required parameter to query, affecting the stream returned:

// GOOD
def handleStream: Array[Data] = {
  val stream = query("READ foo, bar, quux", 10.seconds)
  val buffer = ListBuffer[Data]
  while(stream.ready) {
    stream.receive match {
      case Response(data) => buffer.append(data)
      case OK => stream.close
    }
  }
  buffer.toArray
}

In this example, we’ve introduced a parameter governing when to automatically close the stream if no new data has been received in that duration. If this parameter is omitted, it’s a compiler error. We’ve effectively eliminated a whole class of bugs by making the timeout logic explicit. In the corner case where you do want a lazy stream of Empty responses for a while, you can specify a large timeout. But, it’s easier to be right.

Another example of explicit-versus-implicit is in typing function parameters. Compare:

// BAD
def moveTray(fast: Boolean): Boolean

// GOOD
sealed trait MovementSpeed
case object Fast extends MovementSpeed
case object Slow extends MovementSpeed
def moveTray(speed: MovementSpeed): TrayState

// moveTray(Fast) or moveTray(Slow)

or:

// BAD
// open = true, open the tray; open = false, close the tray
def moveTray(open: Boolean): Boolean

// BETTER
def moveTray(toState: TrayState): TrayState

Which is less conducive to an accidental logic error by some contributor six months from now?

Make it difficult to express something invalid.

The last example is only “better” and not “good” because TrayState as defined above has a TrayUnknown member object. Therefore, it’s possible to express something invalid. You’ll need to pattern match on toState to prevent bad things from happening. You could have ValidTrayState or something similar:

sealed trait TrayState
sealed trait ValidTrayState
case object TrayOpen extends TrayState with ValidTrayState
case object TrayClosed extends TrayState with ValidTrayState
case object TrayUnknown extends TrayState

def moveTray(toState: ValidTrayState): TrayState

You’d have to think carefully about this, though, or you’ll risk getting a huge proliferation of messy case classes, objects, and traits that just end up confusing things.

In an ideal world, it should be impossible to express something invalid.

Avoid ad-hoc arrangements; make it repeatable and learnable.

The REPL has a place in trying things out interactively, but you should never into a situation where you’re repeatedly loading a REPL and pasting in a whole block of code. Instead, make an object that builds the objects you’re interest in and import that. In three months when you need to revisit that area of the codebase, you’ll have forgotten the series of incantations that led you to copy-pasting a bunch of crap onto the REPL.

If you have a neat series of objects that returns the objects of interest – better yet, add these as methods onto the relevant existing parts of the code being debugged – it’s much easier to come back to the correct context later on and for new people to learn the codebase. It removes a source of potential errors in the copy-pasting, too: it’s an easy mistake to change something minor, paste it back into the REPL, and forget to update the code.

You should never see the same error twice.

Every time you encounter a game-breaking error, go deep and handle it in a way you believe to be thorough. You should never be in a situation where when something crashes you can say, “yeah that happens when X, Y, and Z are true… just foo or bar to avoid it.” If you see the same error a second time, it should be an honest surprise. For it to be an “honest surprise” you need to stop and think seriously about what caused the error and of the logic that surrounds it. A quick fix, recompilation, and green test is probably not a comprehensive solution.

Use comments sparingly.

Assume that the person reading your code is a better developer than you are. Avoid comments that describe the code.

// Increments by 1
i = i + 1

// What I want to do:
// Execute the query
// Parse the results
val data = query("foo bar")
parse(data)
// magic
i = 0x5f3759df - ( i >> 1 );
y = _ ( float _ ) &i;
y = y _ ( threehalfs - ( x2 _ y \* y ) );

Good comments add information that isn’t already in the code.

// the time the device is expected to be busy in microseconds
val time = "BY#28000".substring(3).toInt

// a unit vector pointing along the plate from well 1 toward well 12
val orientation: (Int, Int)
// http://en.wikipedia.org/wiki/Fast_inverse_square_root
i = 0x5f3759df - ( i >> 1 );
y = _ ( float _ ) &i;
y = y _ ( threehalfs - ( x2 _ y \* y ) );

As a rule of thumb, well-written code should be self-explanatory. Good programmers are used to reading code and they don’t need an English translation to help them out. In fact, grey or blue noise scattered everywhere significantly distracts from reading the code.

A major exception to these rules is when the comments are necessary for generating documentation. Many systems, like rdoc, javadoc, scaladoc, and so on, can parse comment blocks to generate very useful HTML documentation. Those are often ok, especially for code that’s used by more than a handful of people, despite the noise it adds to the file. Remember that these block comments are liabilities: they rapidly become out of date and require active maintenance to stay relevant.