On Forests and Trees

When an English speaker is drowning in details that make the big picture hard to see, she might complain, “I can’t see the forest for the trees.”

image credit: Miguel Virkkunen Carvalho (Flickr)

It’s an odd expression, partly ironic and partly humorous. When I hear it, I sometimes think of my sister, who, after moving from Indiana to Utah, complained that the mountains were getting in the way of her view. (Her tongue was firmly in her cheek… :-)

The expression also describes an important problem of software engineering–one that a lot of engineers don’t understand well enough. It’s a problem with generalization.

Why generalization matters

Generalization is the process of finding emergent patterns in our creations or our processes, and taking time to codify the patterns so as many specifics as possible become unremarkable. It’s among the human mind’s most powerful techniques for coping with complexity, and it’s a hallmark of vigorous thinkers in any technical discipline. I like what Hegel said:

An idea is always a generalization, and generalization is a property of thinking. To generalize means to think.

Many techniques in software engineering are rooted in this mode of careful pattern-oriented thinking. Interfaces, inheritance, and class instances group features and data underlying many concrete variations. Subroutines generalize the flow of logic in an algorithm. Refactoring often collapses differences that are less useful than originally guessed. Templates and generics and design patterns provide cookie-cutter outlines into which details can be plugged. Modules and components and libraries let us mix by formula. Fuzz testing tries to generalize about acceptable inputs to functions.

Where code is wisely generalized, maintenance goes down, testability goes up, and it’s easy to learn a correct mental model. The inverse is also true: bad choices about generalization usually hide the forest behind the trees, which causes pernicious tech debt.

Drowning in details

With obvious benefits and technology built to help, you’d think software engineers would be wizards of generalization in what they write, test, and manage.

Unfortunately, I find this skill surprisingly rare in techies. It exists, to be sure–but it’s disheartening how often simple and high-value generalization gets neglected.

Example 1: I spent a couple of months this summer refactoring a large, old, mission-critical codebase that contained both an embedded web server and an embedded web client. The code had multiple routines to parse incoming requests and responses. These were big, complex routines, poorly tested and full of unhandled corner cases–and there was no relationship between request parsing and response parsing, even though HTTP requests and responses have identical structure after the first line. Big generalization miss!
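To make the missed pattern concrete, here is a minimal sketch in Python. It is not the code from that project; the names Message, parse_message, parse_request, and parse_response are hypothetical, and it assumes simplified text messages (no folded or repeated headers, always three tokens on the first line). The point is only that one routine can own everything after the first line, and the request and response parsers differ in a couple of lines each.

```python
from typing import Dict, List, NamedTuple, Tuple


class Message(NamedTuple):
    start_line: List[str]    # the three tokens of the first line
    headers: Dict[str, str]  # header names are lower-cased
    body: str


def parse_message(raw: str) -> Message:
    """Parse the parts that requests and responses have in common."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Split at most twice: the third token may contain spaces
    # (for example the reason phrase "Not Found").
    start_line = lines[0].split(" ", 2)
    if len(start_line) != 3:
        raise ValueError("malformed start line: %r" % lines[0])
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("malformed header line: %r" % line)
        headers[name.strip().lower()] = value.strip()
    return Message(start_line, headers, body)


def parse_request(raw: str) -> Tuple[str, str, Message]:
    """Return (method, target, message)."""
    msg = parse_message(raw)
    method, target, _version = msg.start_line
    return method, target, msg


def parse_response(raw: str) -> Tuple[int, str, Message]:
    """Return (status code, reason phrase, message)."""
    msg = parse_message(raw)
    _version, status, reason = msg.start_line
    return int(status), reason, msg
```

With this shape, a corner case fixed in parse_message is fixed for both directions at once, and there is one routine to test instead of several.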

The code also had numerous functions to help build requests and responses by accumulating pieces of data in a buffer. Most of these functions were similar in the way they managed header values like Content-Length and Content-Type, but they used buffers of different sizes, wrote to them in different ways, and handled and reported errors inconsistently. Useless divergence… In one case, a function was nearly 1000 lines long, and had scores of repetitive statement clusters that initialized a pointer to a string constant, then looped over the chars in the constant, appending each one. Why the author never thought to use strncpy() is a mystery to me; I shrank the function to 1/3 of its original size with that one change. (Aside: 3 of the repetitive statement clusters turned out to increment the pointer differently from the others. I had to write a test before I figured this out; that detail was totally obscured and would never have been caught by a casual maintainer. It hadn’t been commented as a weirdness, either.)
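For illustration only, here is roughly what a single generalized builder could look like, sketched in Python rather than the C of the original. The function name build_message and its parameters are hypothetical. The assumption is that every message is a start line, some headers, a blank line, and a body, so one routine can own the buffer, the Content-Length bookkeeping, and the error handling.

```python
from typing import Dict, Optional


def build_message(start_line: str, headers: Dict[str, str],
                  body: bytes = b"",
                  content_type: Optional[str] = None) -> bytes:
    """Serialize a request or response; only the start line differs."""
    out = bytearray()

    def append_line(text: str) -> None:
        # The single place where text is written into the buffer.
        out.extend(text.encode("ascii"))
        out.extend(b"\r\n")

    append_line(start_line)
    merged = dict(headers)
    if content_type is not None:
        merged["Content-Type"] = content_type
    # Content-Length is always derived from the body, never passed in.
    merged["Content-Length"] = str(len(body))
    for name, value in merged.items():
        if any(ch in name + value for ch in "\r\n"):
            raise ValueError("line break in header %r" % name)
        append_line("%s: %s" % (name, value))
    out.extend(b"\r\n")  # blank line separates headers from body
    out.extend(body)
    return bytes(out)
```

Because every line goes through append_line, an oddity like the three clusters that advanced the pointer differently would have to show up as an explicit, visible exception instead of hiding among look-alike loops.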

Example 2: A few months ago, I noticed that a volume on one of our production servers was nearly full. No alarms had gone off about it–I just stumbled on the problem–and that concerned me. I did some research and determined that the cause was a misbehaving app that wrote 200k new files per day into log folders. Since I couldn’t modify the app, I wrote a script to clean up the bogus files. Then I discovered that the script needed root privileges, which I lacked–so I sent an email to the IT team reporting the problem and offering the script. I also emailed the owners of the offending app, suggesting that we fix the root cause. And I asked that we set up a monitor to alert us if the problem recurred. Nothing happened except for a manual cleanup. A while later the volume was in crisis again–and once again, we reacted days later than we should have, with surprise at the cause and head-scratching about how to fix it. We hadn’t generalized from one problem to a systemic weakness very well.
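The monitor I asked for did not need to be elaborate. As a rough sketch (not what we eventually deployed; the function name volume_alert, the 90% threshold, and the usage_fn parameter are all assumptions made for illustration), a check like this could run from a scheduler and feed whatever alerting channel is available:

```python
import shutil
from typing import Callable, Optional


def volume_alert(path: str, max_used_fraction: float = 0.9,
                 usage_fn: Callable = shutil.disk_usage) -> Optional[str]:
    """Return a warning string if the volume holding path is too full.

    usage_fn is injectable so the check can be tested without
    actually filling a disk.
    """
    usage = usage_fn(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= max_used_fraction:
        return "volume at %s is %.0f%% full" % (path, used_fraction * 100)
    return None


if __name__ == "__main__":
    # Print a warning for the current volume, if there is one.
    message = volume_alert(".")
    if message:
        print(message)
```

The generalization is the step from "this volume filled up once" to "any volume can fill up, so check all of them on a schedule." The cleanup script treats one tree; the monitor watches the forest.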

What generalization looks like

The easiest way to tell that code’s been wisely generalized is to ask yourself this question: “Can I see the forest for the trees?” If a quick glance at any level of detail (a class, a function, a module, a project definition) gives you a broad, useful picture of what’s inside–with opportunities to drill deeper as needed, but without overwhelming noise–then a careful generalizer has done their job. Same deal if lateral and hierarchical and temporal relationships are obvious. It’s not an accident that I’m describing “good code” here–the kind we all like to work in…

Generalization is partly why small files and small functions are your friends. It also explains why boilerplate comments are worse than useless, and bears on why encapsulation and loose coupling are so crucial.

Why we don’t generalize

I’m not saying that generalization is easy, though.

One reason we don’t generalize is that we are being crushed by tech debt. We feel like we can’t afford it. This is a very real problem, but it is solvable–or at least improvable.

Another reason we don’t generalize is that we’re addicted to details. I have heard performance zealots say that they couldn’t break up massive C/C++ functions because they couldn’t trust the compiler to inline like it was supposed to. This is utter nonsense. Setting aside the (largely valid) argument that the compiler is usually smarter about performance optimizations than the programmer, you can always use a macro, for Pete’s sake. I’ve heard similar mindsets in laments about inheritance and vtables, the inefficiency of regexes, the inconvenience of private member variables, and lots of other features in every programming language I know. In each case, the technical points on which the rationalization rests may be narrowly valid–and maybe they matter in a very specific context–but there are almost always ways to generalize better or more cleanly than we like to claim. We should hang on to only as many details as we have to.

A third reason we don’t generalize is that we don’t think hard enough, or we’re not smart enough to notice a pattern. This happens to me a lot; I find that I can’t generalize in code that I haven’t invested in deeply. It’s too easy to make mistakes.

A fourth reason we don’t generalize is that our tools and languages discourage us. Java, for example, is ridiculously detail-heavy in its management of data types: to declare a variable, you usually have to declare its type and name, and then set it equal to a new object of exactly that same type, named all over again. Do an egrep through a Java codebase sometime, looking for typename identifier = new typename. It’s silly. You can have just as much type safety, without the mind-numbing repetition, as ML proved, and C++11 discovered with the introduction of the auto keyword.

There are lots of other examples. Aspect-oriented programming attempts to formalize generalizations that permeate or cross-cut a whole codebase; to the extent that AOP is awkward, we are generalizing against the grain of our tools. Poor flexibility in interface evolution is endemic in nearly every programming framework; for no good reason, it prevents us from generalizing about semantically compatible software. Programming languages declare functions, parameters, and local variables in a way that makes it laborious to extract a block of logic into a subroutine (though IDEs with refactoring features have mitigated this problem somewhat). The “step routines” feature of the intent programming language I’m writing is an attempt to address this problem; perhaps I’ll blog about that soon.

Call to action

Pragmatism always matters, of course; it may not be worth our time to generalize in every case. :-)

image credit: xkcd.com

Nonetheless, the best tech folks that I know are much better at this skill than the middle of the bell curve, and I don’t think that’s an accident. I’d like to see us, as an industry, do a better job of turning implicit patterns in our everyday engineering work into method, structure, and reusable building blocks.

Where do you have code, or processes, that are calling out for this sort of attention?