In comparison to my last two entries this one will be much shorter (and less formal).
I've been working with computers a lot recently, and after more than a decade I'm still not sure if I like them. Embedded real-time systems are probably my favorite because there is so little code between you and the hardware, and the resource limitations make things a bit more interesting. Conventional systems with an x86 architecture, kernel, and user space quickly become needlessly complex for the task at hand, and it gets even worse if I have to network them. Add the mold that is Windows and I catch myself dreaming back to the days of digging trenches at -40°C.
My remedy is Nix.
There are several good summaries of what Nix is [1], and the one I just deleted while writing this was not one of them, but in short:
I can describe how I want my computer to behave in a text file and version control it with git.
This greatly reduces the number of ways my computer does something wrong (is incorrectly configured).
The learning curve is not insignificant but it makes working with computers much less of a chore.
Nix is its own programming language so it is very flexible. In a project I have defined a few computers and some services that they expose. I test this network by virtualizing all involved systems and executing a test script [2]. This allows me to integration-test implementations that span several systems, which is very handy. I can improve upon an existing system (say, a computation cluster) without bringing it offline, and the upgrade can then be done transactionally (in one atomic step).
Nix provides a lot of assurances while running this script, but it is still bound by the hardware it is executing on. Running a bunch of virtualization processes is just as chaotic as it sounds: some unrelated work on the system will change the schedule of the test you are running and you run the risk of getting differing results. The risk increases if you depend on timing, that is, if you are subject to race conditions. If you have written a shoddy test it may fail once in a blue moon because you don't have control over the schedule.
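To make the "once in a blue moon" failure concrete, here is a minimal Python sketch. It is a toy simulation, not the NixOS test driver: the function names, the 0.95 second wait, and the uniform jitter are all made up for illustration. One test sleeps a fixed time and assumes the server is up; the other waits for the server to report ready.

```python
import random


def shoddy_test(server_ready_at: float, fixed_wait: float = 0.95) -> bool:
    """Sleep a fixed time, then assume the server is up (hypothetical test)."""
    return server_ready_at <= fixed_wait


def polling_test(server_ready_at: float, timeout: float = 10.0) -> bool:
    """Wait until the server is ready, giving up only after a long timeout."""
    return server_ready_at <= timeout


def count_failures(test, runs: int = 1000, seed: int = 0) -> int:
    """Run `test` against `runs` simulated schedules and count the failures."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    failures = 0
    for _ in range(runs):
        # Assumption: scheduling noise delays server start-up by 0 to 1 second.
        server_ready_at = rng.uniform(0.0, 1.0)
        if not test(server_ready_at):
            failures += 1
    return failures


if __name__ == "__main__":
    print("shoddy failures: ", count_failures(shoddy_test))
    print("polling failures:", count_failures(polling_test))
```

The fixed-wait test passes most of the time and fails only on the schedules where the server happens to start late, which is exactly the kind of failure that is hard to reproduce on demand.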
So while extending a test script I stumbled upon such a race condition and thought: "Why don't I just run my test a bunch until it fails? That way I can identify the root cause!"
In this medium of absent audio, please imagine a buzzer noise that signals "wrong".
Nix does not support this.
Or more accurately, if you need this with Nix, you're doing something wrong.
When you successfully build something with Nix it is cached, so if I want to re-run my test script, Nix will just hand me the success that is stored in the cache.
There is an aptly named --rebuild option, but this is meant as a verification tool:
if the checksum of the result changes Nix will signal an error and halt.
In my case the checksum does indeed change:
the virtualization logs contain granular timestamps that are non-stable due to the similarly unstable scheduling.
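The caching and rebuild behavior described above can be modeled in a few lines of Python. This is a toy model under loud assumptions, not how Nix is implemented: `ToyStore`, `OutputMismatch`, `noisy_test`, and `quiet_test` are hypothetical names, and a counter stands in for the timestamps in the virtualization logs.

```python
import hashlib
import itertools


class OutputMismatch(Exception):
    """Raised when a rebuild yields different bytes than the cached result."""


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyStore:
    """Toy stand-in for a build cache keyed on the build recipe."""

    def __init__(self):
        self._results = {}  # recipe digest -> cached output bytes

    def build(self, recipe: str, builder, rebuild: bool = False) -> bytes:
        key = digest(recipe.encode())
        cached = self._results.get(key)
        if cached is not None and not rebuild:
            return cached  # cache hit: the builder is never run again
        output = builder()
        if cached is not None and digest(output) != digest(cached):
            # Rebuild is a verification step: differing output is an error.
            raise OutputMismatch(recipe)
        self._results[key] = output
        return output


_clock = itertools.count()  # stand-in for wall-clock time in the logs


def noisy_test() -> bytes:
    """A 'test' whose log embeds a changing timestamp."""
    return f"[t={next(_clock)}] test passed".encode()


def quiet_test() -> bytes:
    """A 'test' whose output is identical on every run."""
    return b"test passed"
```

Calling `build` twice on the same recipe returns the stored result without running the builder again. Passing `rebuild=True` does run it, but with `noisy_test` the changing timestamp alters the checksum and the call raises instead of quietly giving a fresh test run.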
Welp, I was stuck. I turned to people who knew more than me in the NixOS Matrix chat and realized something truly shocking. After more than a decade in this field I thought it couldn't happen again, but it had: I had fallen into the XY problem trap! I had dug so deep that I hit rock and now wondered why my shovel didn't work. A smart(er) NixOS user made me stop digging: "tests should be considered consistent until proven otherwise."
Some rumination and a page appended to an engineering notebook brought some stuff to light:
- When writing tests you want them to be dead simple: it should be trivial to determine whether the condition is fulfilled.
- In small teams there aren't enough resources to systematically look for software bugs: the problem space is of unknown size and it is difficult to jump into it (that is, to trigger bugs).
- Instead, handle potential bugs reactively: when one occurs, save the logs, create an issue, and dedicate resources towards fixing it later, when able.
- But don't increase your technical debt [3]: be proactive when fixing the bug and make sure the applied changes are a net gain for the code down the line.
Or as a flashy one-liner: assume the null hypothesis (there is no bug) is true until it is rejected (the hidden race condition is triggered).
Footnotes
1 The project website, this short informal video, or the Nix author's PhD thesis (the Introduction suffices).
2 It is known as the NixOS test driver. This is a good resource for it.
3 Quick fixes done now that compound into a horrible code base down the line that sucks to work with.