The Tyranny of the Timed Coding Test

In the infamous “whiteboard” coding test, developers hoping to be hired are expected to write code in real-time, standing up in front of a whiteboard. This kind of test has become a standard way to test technical skills. Some kind of test of technical skills is indeed necessary, because one of the key challenges in recruiting is how to screen or filter out candidates with weak technical skills in order to avoid wasting time on additional interviews, not to mention hiring people who turn out not to be able to code. Hence the whiteboard coding test.

Such tests are often also administered on-line, using technology as simple as a screen share, or some kind of specialized collaborative environment designed for just this purpose. Or, candidates are asked to take tests in some app by themselves, removing the “while we’re watching” aspect.

In more and more cases, these tests are timed. They become races against the clock. If time runs out, you fail. These are the infamous code-to-the-death time trials.

Now third-party screening services are popping up that promise to not only save employers all the trouble of the technical pre-screening, but also do it much, much better. Nearly all adopt the timed coding test approach. The lucky people who pass these tests may join a group of elite ninjas deployed to development shops eager for top talent; or be recommended to employers, who may fast-track them to final interviews.

But what do these timed coding tests test? Are the skills being tested central to the tasks the employees are to be asked to do? Are they even relevant? Are they even a positive?

Not in the least. The ability to create working code under time pressure is roughly equivalent to a monkey learning how to juggle, or a bear learning how to ride a bike. The notion that success on such artificial timed tests is even vaguely connected to the ability to perform in a development job is not just unproven, it’s not even plausible.

The first erroneous assumption is that development work is mainly about coding. This is the myth that software development is just “coding”. Politicians and the media might make this mistake, but software companies, and testing companies, should know better. As we all know, coding makes up 25% of development work at best; the rest is listening, gathering information, coming up with solutions, experimenting, making tradeoffs, debugging, responding to code reviews, optimizing, throwing code away, and maintenance. Testing which is fixated on coding ability simply ignores these other important skills.

Even if one agrees that coding ability in the narrow sense is important (which it is, but not primarily so), the testing culture propagated by the testing companies and the whiteboard tradition focuses exclusively on speed. These tests are like 100-yard sprints. They’re actually worse than sprints, because there’s a drop-dead cut-off, a deadline after which you simply fail. It’s as if all the runners were shot in the head if they had not reached the finish line in 12 seconds. Woe be to you if you’re a medium or long-distance runner. The tests are testing the mutant ability to create working code in the absolute minimum amount of time. There is no notion of code quality, or readability, or maintainability; it is purely a matter of whether or not the code works. Creating code which just barely works in the absolute minimum amount of time is a skill which, if I were hiring, I would select against, not for.

The sites running these tests actually acknowledge this. One such site, for example, cautions that although good coding usually involves things like good variable naming, you should not worry about that when doing the timed tests, since typing longer identifiers could eat up valuable seconds, so just go with a, b, and c. Another one recommends that if you have trouble with the timed tests you could spend your time just practicing taking more and more timed tests on a site like Codility until, like a well-trained bear, you can ride the bike a little better.

Even if we limit our attention to “coding”, as opposed to all the other more important skills that developers must have, all the people involved in these timed coding tests have a fundamental misunderstanding of the nature of coding: they think it’s typing characters into a terminal. They don’t even understand the basic notion that there is an internal, mental design process. For good programmers, this part of the process can take up 50% or more of all the time spent getting to shipping code. Even the simplest problem has meaningful design issues that need to be worked out before putting pen to paper.

This misguided notion of “coding” as nothing more than typing a series of magic incantations into a computer — the faster the better — pervades the industry. It also has permeated popular opinion and lies behind simplistic social visions of “teaching everyone to code”. This is the “coding as laying bricks” view — where the only relevant metric is the length of the walls the bricklayer lays and how fast he lays them. Nothing could be further from the truth. But we digress.

The tic-tac-toe example

Let’s consider writing a tic-tac-toe game. We will say, hypothetically, that the testing site in question allocates us 30 minutes to finish writing this as a console-based game. The hapless testee is instructed to write code (as a “class” in the OOP sense — more on that later) which can represent the board, make a move, render the board, determine if the board is full or the game is won, and manage the sequence of man-machine moves.

We need to decide on the basic data representation for the board. Since JavaScript does not have true two-dimensional arrays, we might consider using one-dimensional, nine-element arrays. This might make searching for an empty cell a bit easier. On the other hand, this would mean that we would have to convert between row/column coordinates and cell numbers between zero and eight. Alternatively, we could structure the board as a three-element array where each element is another three-element array — a nested array, in other words.
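To make the conversion cost concrete, here is a minimal sketch of the two representations side by side. The names SIZE, toIndex, and toRowCol are my own illustrative helpers, not anything the hypothetical test prescribes.

```javascript
// Illustrative helpers (hypothetical names) for a 3x3 board.
const SIZE = 3;

// Convert (row, col) coordinates to a cell number between 0 and 8, and back.
const toIndex = (row, col) => row * SIZE + col;
const toRowCol = index => [Math.floor(index / SIZE), index % SIZE];

// Flat representation: one nine-element array, addressed by cell number.
const flat = Array(SIZE * SIZE).fill("-");
flat[toIndex(1, 2)] = "X";

// Nested representation: three rows of three cells, addressed by row and column.
const nested = Array.from({ length: SIZE }, () => Array(SIZE).fill("-"));
nested[1][2] = "X";
```

With the flat array every move pays for a call to toIndex; with the nested array the coordinates are used directly, but any scan over all cells needs two levels of iteration.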

Then we have the issue of how to represent the content of each cell. There are three possible values: empty (represented as “-” when rendering), X (human), and O (machine). Perhaps we could just store that as a string in each cell. On the other hand, if we are interested in finding empty cells, for example, it might be slightly better to have two boolean-valued structures, one for whether or not the cell is occupied, the other applicable only when the cell is occupied, indicating whether it is occupied by X or O. These boolean values might be slightly easier to set and test. For example, with the nine-element approach, using booleans, we could search for an empty cell by simply saying

this.occupied.indexOf(false)

In contrast, with the three-by-three approach, where each cell holds one of three values, we would need to do something like

[[0,0],[0,1],[0,2],[1,0],[1,1],[1,2],[2,0],[2,1],[2,2]].find(([row, col]) => this.board[row][col] === "-")

where again “-” represents empty cells. Yes, I know there might be sleeker ways to do that.

On the other hand, this three-by-three approach makes rendering the board much easier:

this.board.map(row => row.join("|")).join("\n")

which would be a bit more complicated if we used the two nine-element boolean arrays or some variation of that.
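For comparison, here is roughly what rendering might look like with the two boolean arrays. This is only a sketch: the array name isX and the helpers symbolAt and renderFlags are assumptions of mine, and I am using standalone functions rather than the class the test would demand.

```javascript
// occupied[i] says whether cell i is taken; isX[i] only matters when it is.
// symbolAt and renderFlags are hypothetical helper names.
const symbolAt = (occupied, isX, i) =>
  occupied[i] ? (isX[i] ? "X" : "O") : "-";

const renderFlags = (occupied, isX) =>
  [0, 1, 2]
    .map(row =>
      [0, 1, 2].map(col => symbolAt(occupied, isX, row * 3 + col)).join("|")
    )
    .join("\n");

// Example position: X in the top-left corner, O in the center.
const occupied = [true, false, false, false, true, false, false, false, false];
const isX = [true, false, false, false, false, false, false, false, false];

console.log(renderFlags(occupied, isX));
```

Finding an empty cell stays a one-liner, but rendering now needs an index calculation and a two-step lookup for every cell, which is exactly the kind of trade-off at issue.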

Our point here is not that one of these representations is better. Rather, it’s that it’s something we need to think about. The decision we make will affect the quality of our eventual solution. But it will also drive the productivity with which we can finish up the solution — remember this is a timed test — and, crucially, different solutions might well have different implications in terms of quality and productivity. In other words, we might be able to do a crappier solution faster, or a better solution more slowly. The nature of the timed test demands that we choose the crappy quick solution.

The format of the test assumes that given a few quick minutes of thinking time the candidate can analyze these trade-offs and decide on the best representation, taking into account the prioritization of quality and speed. However, in the real world, and in this case, that might not be true. It is common, at least for me, that a problem reveals its manifold aspects in ever-increasing detail only as I dig into the actual coding. As the nature of the problem, and the implications of various design choices, become clearer, I iterate on the design and the solution based on that design. Unfortunately, the drop-dead nature of the timed test essentially excludes any iterative optimization or iterative design approaches. It tests for the specific capability of making good-enough one-time design choices rapidly. To repeat myself, however useful this skill might be, it’s by no means the only important skill in a developer, and may not even be important at all. In other words, success on this test in this format means, basically, nothing.

Specific instructions given to the testee often make explicit the underlying notion that coders are just machines. For example, some tests I have heard of state the requirement not only that you must use an OOP-like “class”, with instance variables and methods (no functional styles here, please), but also that the code must proceed step-by-step, starting with the data representation, then and only then proceeding to implement the methods for rendering the board, detecting if it is full, playing a move, and so on, all these in a particular order. In other words, the candidate is not just discouraged from stepping back and looking at the big picture and the overall design; she is prohibited from doing so by the rules of the test.

The amicable pairs coding test

This is another hypothetical test which I may or may not have ever encountered in real life. Amicable numbers are numbers with the property that the sum of their factors (excluding the number itself) is a different number, the sum of whose factors is in turn equal to the original number. Got it? The two numbers are referred to as an amicable pair. Let us imagine a timed coding test where you are asked to find the number of amicable pairs whose first member is less than some given input. Since (220, 284) is an amicable pair, the correct response to an input of 300 would be “1” (we do not count (284, 220) since we would have already counted that in the reverse order).
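If the definition is hard to digest, it may help to check the (220, 284) pair by hand. The following is a deliberately plain sketch; the names properDivisorsOf and divisorSum are mine, chosen only for this illustration.

```javascript
// Collect every divisor of n that is smaller than n itself.
const properDivisorsOf = n => {
  const found = [];
  for (let i = 1; i < n; i++) {
    if (n % i === 0) found.push(i);
  }
  return found;
};

// Add up the proper divisors.
const divisorSum = n => properDivisorsOf(n).reduce((a, b) => a + b, 0);

console.log(`${properDivisorsOf(220).join(" + ")} = ${divisorSum(220)}`);
// 1 + 2 + 4 + 5 + 10 + 11 + 20 + 22 + 44 + 55 + 110 = 284
console.log(`${properDivisorsOf(284).join(" + ")} = ${divisorSum(284)}`);
// 1 + 2 + 4 + 71 + 142 = 220
```

Each number's divisors sum to the other, which is all that “amicable” means.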

Now that I’ve had the time to think this through, I can write the solution quite easily — even trivially — as

// Given one number, pair it with its aliquot sum (a candidate amicable pair).
const makePair = n => [n, aliquot(n)];
// Given a candidate pair, check that it really is an amicable pair.
const isPair = ([n, m]) => m > n && aliquot(m) === n;
// List the amicable pairs whose first member is less than n.
function amicablePairs(n) {
  return range(1, n).map(makePair).filter(isPair);
}

where “aliquot sum” is the term used for the sum of a number’s proper divisors. (“Proper divisor” refers to any divisor including one, but not the number itself. A number is prime if and only if its aliquot sum is 1.)

I’ve included comments in the above for readability, although in an actual timed test I would not have enough time. In English, the above code says: find the aliquot sum (sum of the proper divisors) of some number; make sure it’s greater than the number being examined (if it’s smaller it will be found when the smaller number is examined, and if it’s equal it is a “perfect” number which is not considered “amicable”); then see if the sum of divisors of that number is equal to the original. The count of amicable pairs, which is what the test requires, can be found trivially by taking the length of the array of amicable pairs.

All that is left now is to write count, range, divisors, sum, and aliquot:

const add = (a, b) => a + b;
const sum = a => a.reduce(add, 0);
// range(n, m) yields n, n + 1, ..., m - 1, so range(1, n) covers every candidate proper divisor.
const divisors = n => range(1, n).filter(i => n % i === 0);
const range = (n, m) => Array.from(Array(m - n), (_, i) => i + n);
const aliquot = n => sum(divisors(n));
const count = (a, f) => a.filter(f).length;
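Put together, the pieces can be run as a single script. What follows is a sketch of one way to assemble them, not a claim about what the test expected. Note two choices of mine: divisors scans range(1, n), that is 1 through n - 1, so that an input of 1 gives an empty list, and countAmicablePairs is a hypothetical wrapper that puts the count helper to work.

```javascript
const add = (a, b) => a + b;
const sum = a => a.reduce(add, 0);
// range(n, m) yields n, n + 1, ..., m - 1.
const range = (n, m) => Array.from(Array(m - n), (_, i) => i + n);
// Proper divisors of n: every i from 1 to n - 1 that divides n evenly.
const divisors = n => range(1, n).filter(i => n % i === 0);
const aliquot = n => sum(divisors(n));
const count = (a, f) => a.filter(f).length;

const makePair = n => [n, aliquot(n)];
const isPair = ([n, m]) => m > n && aliquot(m) === n;

// The pairs themselves, and (hypothetical wrapper) just how many there are.
const amicablePairs = n => range(1, n).map(makePair).filter(isPair);
const countAmicablePairs = n => count(range(1, n).map(makePair), isPair);

console.log(amicablePairs(300));      // logs the single pair [220, 284]
console.log(countAmicablePairs(300)); // 1
```

This is pure brute force, quadratic in the input, which is fine for small limits and says nothing about how one would approach larger ones.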

How elegant and brief — a mere dozen lines. Certainly not anything that would take more than the allotted fifteen minutes of time, right? Well, in my case actually, it did. The solution did not reveal itself to me in this form until about 24 hours after the problem was posed to me, after a couple of sessions of cogitation. During the hypothetical test, which did not actually happen, I started off writing a bunch of loops and if statements and debugging them and rewriting them and then ran out of time right around the point where I had just gotten the first amicable number calculated. In my hypothetical test, this effort was deemed a failure. The verdict was in — I was an utter failure as a rockstar coder.

(By the way, in practice I could have probably used a library like underscore to do sum and range, but I’m not sure that would have been allowed. Should it be? For that matter, I could have just used the is-amicable npm package. Would that also have been OK? After all, a skill we increasingly need as coders is to find and integrate pre-written packages. Where in these timed torture tests is that particular skill probed? Or, since the problem statement specified that the solution needed to work only up to 30,000, would it have been a valid solution to simply hard-wire all those amicable pairs, of which there are just eight?)
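For the record, the hard-wired version might look something like the sketch below. The eight pairs are the known amicable pairs whose smaller member is under 30,000; the names KNOWN_PAIRS and countKnownPairs are hypothetical, and refusing inputs above 30,000 is my own choice rather than anything the problem statement said.

```javascript
// The eight amicable pairs whose smaller member is below 30,000.
const KNOWN_PAIRS = [
  [220, 284], [1184, 1210], [2620, 2924], [5020, 5564],
  [6232, 6368], [10744, 10856], [12285, 14595], [17296, 18416],
];

// Count the pairs whose first member is less than the given limit.
const countKnownPairs = limit => {
  if (limit > 30000) {
    throw new RangeError("the table only covers inputs up to 30,000");
  }
  return KNOWN_PAIRS.filter(([first]) => first < limit).length;
};

console.log(countKnownPairs(300));   // 1
console.log(countKnownPairs(30000)); // 8
```

Whether a grader would accept a lookup table is, of course, exactly the question.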

Of course, being able to find the right high-level solution for this problem in thirty seconds or three minutes does seem like a useful enough skill. I wish I had it. If I cared enough, I would drill on problems like this and might be able to develop that skill. Such an ability probably is correlated with skills involved in solving more complicated, real-world problems. But it hardly seems like a mandatory skill, even for someone applying to an elite development position.

But actually we don’t even know. It’s possible, albeit counter-intuitive, that the skill of being able to leap to the structure of a solution in seconds is actually a negative for a developer. She may lock herself in to a suboptimal solution. She may not consider alternative solutions. She may not evaluate solutions against all the relevant criteria. Finally, the solution might be obvious to her, but hard for other people to understand.

In other words, the timed coding exercise is designed to test for a skill without any evidence whatsoever that the skill is actually associated with superior performance as a developer day in and day out. More likely it is only peripherally associated with good performance, and there is at least a theoretical possibility that it is negatively associated with real-world performance.

In the case of the amicable number example, the timed style of test actively discourages exactly the type of curiosity I would hope to find in a programmer: the desire to learn more about the domain. For instance, are there numeric properties of amicable numbers that would avoid the brute force approach and permit some kind of closed-form or deterministic solution?

Is it just me?

But then I stopped and wondered. Is this just me? Am I just making excuses for not being able to do better on these timed coding ordeals?

To reassure myself, I reached out to a trusted acquaintance who has been coding for several decades. This man, who in past lives wrote compression and encryption algorithms in assembly language in his sleep, is now working on machine vision algorithms for new blockbuster products in a senior role at the R&D labs of one of the largest software companies in the world.

He told me, “Bob, I’ve never done timed coding tests, and don’t intend to start now, but I’m quite sure I could never pass them. My approach to coding is all about iteration and refinement, taking as much time as I need to get to the optimal solutions, and even starting over again from scratch when I need to.”

So what’s the alternative?

Screening and filtering new tech talent, as important as it is, is a Rubik’s cube for which there is no silver bullet, to mix metaphors. When I myself hire, I spend more time talking with the candidate in detail about her past projects, down to the internals, while drilling down mercilessly to make sure she really played the role she claims. Although it is the case that some applicants may know a lot of “trivia” about some programming language yet still not be able to code, in-depth language questions can still be useful; hopefully better questions than the one a friend of mine was asked at a recent interview on ES6 about the difference between var and let. If you are going to go with on-line tests, I prefer tests of programming knowledge such as those at Pluralsight, although some of these can be quite stupid too, such as asking what [] == 0 evaluates to in JavaScript. In all these cases, of course, the hiring manager has to keep in mind that the real skill to be looking for is the ability to learn new languages, frameworks, and technologies rapidly, to quickly build out one’s knowledge of existing systems, and to find and evaluate the best new technologies for some problem, something very hard if not impossible to test for.

The gold standard for gauging skill in writing apps is to, well, look at the apps the candidate has written. If the candidate has published open-source apps or libraries, that’s an obvious place to start. However, we cannot realistically expect that every potential employee is a prolific open-source contributor. (We must also be careful, because an interviewee’s GitHub repos can be works in progress or abandoned.)

The best app to look at is one they’ve written for the specific opportunity. This requires, obviously, that the candidate be willing to spend the requisite period of time — a couple of days to a couple of weeks — to write the app, something that will probably only be the case if she’s really interested in the job. It’s easy enough to check that the code was actually written by the applicant by means of asking her to explain chosen parts of the code. This approach lets us not only test programming productivity, but every other relevant aspect of coding skill, including program design, code style, code organization, and unit tests.

Conclusion

Let’s relegate the timed coding tests to their rightful place — a kind of brainteaser, or sudoku problem, or crossword puzzle — something for coders to play with when they have nothing else to do. Whatever we do, let’s not elevate them to some central role in the hiring process, especially not one where they are used to permanently exclude candidates who “fail”.

Technologist/author/translator mainly writing about computing