30 Nov 2025
Planet Gentoo
One jobserver to rule them all
A common problem with running Gentoo builds is concurrency. Many packages include extensive build steps that are either fully serial, or cannot fully utilize the available CPU threads throughout. This problem becomes less pronounced when building multiple packages in parallel, but then we risk overscheduling for packages that do take advantage of parallel builds.
Fortunately, there are a few tools at our disposal that can improve the situation. Most recently, they were joined by two experimental system-wide jobservers: guildmaster and steve. In this post, I'd like to provide the background on them, and discuss the problems they are facing.
The job multiplication problem
You can use the MAKEOPTS variable to specify a number of parallel jobs to run:
MAKEOPTS="-j12"
This is used not only by GNU make, but it is also recognized by a plethora of eclasses and ebuilds, and converted into appropriate options for various builders, test runners and other tools that can benefit from concurrency. So far, that's good news; whenever we can, we're going to run 12 jobs and utilize all the CPU threads.
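For instance, an ebuild that invokes a parallel-capable tool directly might translate MAKEOPTS into a job count roughly like this (a sketch: makeopts_jobs comes from multiprocessing.eclass, while the test runner invocation is made up):

inherit multiprocessing

src_test() {
	# derive the job count from MAKEOPTS="-j12"; the runner below is hypothetical
	local jobs=$(makeopts_jobs)
	./run-tests --jobs "${jobs}" || die "tests failed"
}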
The problems start when we're running multiple builds in parallel. This could be either due to running emerge --jobs, or simply needing to start another emerge process. The latter happens to me quite often, as I am testing multiple packages simultaneously.
For example, if we end up building four packages simultaneously, and all of them support -j, we may end up spawning 48 jobs. The issue isn't just saturating the CPU; imagine you're running 48 memory-hungry C++ compilers simultaneously!
Load-average scheduling to the rescue
One possible workaround is to use the --load-average option, e.g.:
MAKEOPTS="-j12 -l13"
This causes tools supporting the option not to start new jobs if the current load exceeds 13, which roughly approximates 13 processes running simultaneously. However, the option isn't universally supported, and the exact behavior differs from tool to tool. For example, CTest doesn't start any jobs when the load is exceeded, effectively stopping test execution, whereas GNU make and Ninja throttle themselves down to one job.
Of course, this is a rough approximation. While GNU make attempts to establish the current load from /proc/loadavg, most tools just use the one-minute average from getloadavg(), suffering from some lag. It is entirely possible to end up with interspersed periods of overscheduling while the load is still ramping up, followed by periods of underscheduling before it decreases again. Still, it is better than nothing, and it can be especially useful for running a build as background load alongside other tasks: the build utilizes idle CPU threads, and backs down when other builds need them.
The nested Makefile problem and GNU Make jobserver
Nested Makefiles are processed by calling make recursively, and therefore face a similar problem: if you run multiple make processes in parallel, and they run multiple jobs simultaneously, you end up overscheduling. To avoid this, GNU make introduces a jobserver. It ensures that the specified job number is respected across multiple make invocations.
At the time of writing, GNU make supports three kinds of the jobserver protocol:
- The legacy Unix pipe-based protocol that relied on passing file descriptors to child processes.
- The modern Unix protocol using a named pipe.
- The Windows protocol using a shared semaphore.
All these variants follow roughly the same design principles, and are peer-to-peer protocols for using shared state rather than true servers in the network sense. The jobserver's role is mostly limited to initializing the state and seeding it with an appropriate number of job tokens. Afterwards, clients are responsible for acquiring a token whenever they are about to start a job, and returning it once the job finishes. The availability of job tokens therefore limits the total number of processes started.
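To make the token dance concrete, here is a minimal shell sketch of the named-pipe variant (the path, the token byte and the job command are arbitrary; real clients are pointed at the pipe by their parent process):

# "Server" side: create the pipe and seed it with 11 tokens; together with
# the implicit slot of the invoking process this corresponds to -j12.
mkfifo /tmp/jobserver
exec 3<>/tmp/jobserver        # open read-write so seeding does not block
printf '%.0s+' {1..11} >&3

# "Client" side: acquire a token before starting a job, and return it when
# the job finishes -- even if it failed.
read -r -n1 token <&3         # blocks until a token becomes available
compile_one_file              # hypothetical unit of work
printf '%s' "${token}" >&3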
The flexibility of the modern protocols has permitted more tools to support them. Notably, the Ninja build system recently started supporting the protocol, enabling proper parallelism in complex build systems that combine Makefiles and Ninja. The jobserver protocol is also supported by Cargo and various Rust tools, as well as by GCC and LLVM, where it can be used to limit the number of parallel LTO jobs.
A system-wide jobserver
With a growing number of tools becoming capable of parallel processing, and at the same time gaining support for the GNU make jobserver protocol, it becomes an interesting solution to the overscheduling problem. If we could run one jobserver shared across all build processes, we could control the total number of jobs running simultaneously, and therefore have all the simultaneously running builds dynamically adjust to one another!
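Mechanically, the named-pipe style of GNU make 4.4 makes this feasible, since any process can join a jobserver advertised by path. Conceptually, builds supervised by a system-wide jobserver would run with something like the following in their environment (the path is made up, and how guildmaster and steve actually wire this up is up to their implementations):

# Hypothetical setup: every build is started with a pre-existing,
# externally managed pipe advertised via MAKEFLAGS, so all
# jobserver-aware tools (make, ninja, cargo, ...) share one token pool.
export MAKEFLAGS="--jobserver-auth=fifo:/run/jobserver"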
In fact, this is not a new idea. A bug requesting jobserver integration was filed for Portage back in 2019. The NixOS jobserver effort dates back to at least 2021, though it has not been merged yet. Guildmaster and steve joined the effort very recently.
There are two primary problems with using a system-wide jobserver: token release reliability, and the "implicit slot" problem.
The token release problem
The first problem is more important. As noted before, the jobserver protocol relies entirely on clients releasing the job tokens they acquired, and the documentation explicitly emphasizes that they must be returned even in error conditions. Unfortunately, this is not always possible: if the client gets killed, it cannot run any cleanup code and therefore return the tokens! For scoped jobservers like GNU make's this usually isn't that much of a problem, since make normally terminates upon a child being killed. However, a system jobserver could easily be left with no job tokens in the queue this way!
This problem cannot really be solved within the strict bounds of the jobserver protocol. After all, it is just a named pipe, and there are limits to how much you can monitor what's happening to the pipe buffer. Fortunately, there is a way around that: you can implement a proper server for the jobserver protocol using FUSE, and provide it in place of the named pipe. The good news is that most tools don't actually check the file type, and those that do can easily be patched.
The current draft of the NixOS jobserver provides a regular file with special behavior via FUSE, whereas guildmaster and steve both provide a character device via the kernel's CUSE API. The NixOS jobserver and guildmaster both return unreleased tokens once the process closes the jobserver file, whereas steve returns them once the process that acquired them exits. This way, they can guarantee that a process that either can't release its tokens (e.g. because it's been killed), or one that doesn't because of an implementation issue (e.g. Cargo), doesn't end up effectively blocking other builds. It also means we can provide live information on which processes are holding the tokens, or even implement additional features such as limiting token provision based on the system load, or setting per-process limits.
The implicit slot problem
The second problem is related to the implicit assumption that a jobserver is inherited from a parent GNU make process that already acquired a token to spawn the subprocess. Since the make subprocess doesn't really do any work itself, it can "use" the token to spawn another job instead. Therefore, every GNU make process running under a jobserver has one implicit slot that runs jobs without consuming any tokens. If the jobserver is running externally and no job tokens were acquired while running the top make process, it ends up running an extra process without a job token: so steve -j12 permits 12 jobs, plus one extra job for every package being built.
Fortunately, the solution is rather simple: one needs to implement token acquisition at Portage level. Portage acquires a new token prior to starting a build job, and releases it once the job finishes. In fact, this solves two problems: it accounts for the implicit slot in builders implementing the jobserver protocol, and it limits the total number of jobs run for parallel builds.
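A minimal sketch of that idea, written as a shell wrapper rather than actual Portage code (the fifo path and the ebuild file are made up; real integration lives in Portage's scheduler):

# Sketch only: hold one job token for the entire duration of a package
# build, paying for the implicit GNU make slot.
acquire_token() { read -r -n1 TOKEN </run/jobserver; }
release_token() { printf '%s' "${TOKEN}" >/run/jobserver; }

acquire_token
# Return the token even if the build fails; note that a SIGKILL would
# still leak it, which is exactly what the server-side tracking addresses.
trap release_token EXIT
ebuild foo-1.0.ebuild merge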
However, this is a double-edged sword. On one hand, it limits the risk of overscheduling when running parallel build jobs. On the other, it means that a new emerge job may not be able to start immediately, but instead wait for other jobs to free up job tokens first, negatively affecting interactivity.
A semi-related issue is that acquiring a single token doesn't properly account for processes that are parallel themselves but do not implement the jobserver protocol, such as pytest-xdist runs. It may be possible to handle these better by acquiring multiple tokens prior to running them (or possibly while running them), but in the former case one needs to be careful to acquire them atomically, so as not to end up with a deadlock: two processes each acquiring part of the tokens they require, and waiting forever for more.
The implicit slot problem also causes issues in other clients. For example, nasm-rs writes an extra token to the jobserver pipe to avoid special-casing the implicit slot. However, this violates the protocol and breaks clients with per-process tokens. Steve carries a special workaround for that package.
Summary
A growing number of tools are capable of some degree of concurrency: from builders traditionally able to start multiple parallel jobs, to multithreaded compilers. While they provide some degree of control over how many jobs to start, avoiding overscheduling while running multiple builds in parallel is non-trivial. Some builders can use the load average to partially mitigate the issue, but that's far from a perfect solution.
Jobservers are our best bet right now. Originally designed to handle job scheduling for recursive GNU make invocations, they are being extended to control other parallel processes throughout the build, and can be further extended to control the job numbers across different builds, and even across different build containers.
While NixOS seems to have dropped the ball, Gentoo is now finally actively pursuing global jobserver support. Guildmaster and steve both prove that the server-side implementation is possible, and integration is just around the corner. At this point, it's not clear whether jobserver-enabled systems are going to become the default in the future, but it's certainly an interesting experiment to carry out.
12 Oct 2025
Planet Gentoo
How we incidentally uncovered a 7-year old bug in gentoo-ci
"Gentoo CI" is the service providing periodic linting for the Gentoo repository. It is a part of the Repository mirror and CI project that I've started in 2015. Of course, it all started as a temporary third-party solution, but it persisted, was integrated into Gentoo Infrastructure and grew organically into quite a monstrosity.
It's imperfect in many ways. In particular, it has only some degree of error recovery and when things go wrong beyond that, it requires a manual fix. Often the "fix" is to stop mirroring a problematic repository. Over time, I've started having serious doubts about the project, and proposed sunsetting most of it.
Lately, things have been getting worse. What started as a minor change in the behavior of Git triggered a whole cascade of failures, leading to me finally announcing a deadline for sunsetting the mirroring of third-party repositories, and starting to rip non-critical bits out of it. Interestingly enough, this whole process led me to finally discover the root cause of most of these failures - a bug that had existed since a very early version of the code, but happened to be hidden by the hacky error recovery code. Here's the story of it.
Repository mirror and CI is basically a bunch of shell scripts with Python helpers run via a cronjob (repo-mirror-ci code). The scripts are responsible for syncing the whole lot of public Gentoo repositories, generating caches for them, publishing them onto our mirror repositories, and finally running pkgcheck on the Gentoo repository. Most of the "unexpected" error handling is set -e -x, with dumb logging to a file, and mailing on a cronjob failure. Some common errors are handled gracefully though - sync errors, pkgcheck failures and so on.
The whole cascade started when Git was upgraded on the server. The upgrade involved a change in behavior where git checkout -- ${branch} stopped working; you could only specify files after the --. The fix was trivial enough.
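For illustration, the change presumably boiled down to something along these lines (the variable name is made up):

# Before: newer Git treats everything after "--" strictly as pathspecs,
# so this no longer switches branches.
git checkout -- "${branch}"

# After: name the branch before the separator.
git checkout "${branch}" --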
However, once the issue was fixed, I started periodically seeing sync failures from the Gentoo repository. The scripts had a very dumb way of handling sync failures: if syncing failed, they removed the local copy entirely and tried again. This generally made sense - say, if upstream renamed the main branch, git pull would fail but a fresh clone would be a cheap fix. However, the Gentoo repository is quite big, and when it got removed due to a sync failure, cloning it afresh from the Gentoo infrastructure failed.
So when it failed, I did a quick hack - I cloned the repository manually from GitHub, replaced the remote and put it in place. Problem solved. Except a while later, the same issue surfaced. This time I kept an additional local clone, so I wouldn't have to fetch it from the server, and added it again. But then it got removed once more, and this was really getting tedious.
What I assumed at that point was that the repository was failing to sync due to some temporary problems, either network- or Infrastructure-related. If that were the case, it really made no sense to remove it and clone afresh. On top of that, since we are sunsetting support for third-party repositories anyway, there is no need for automatic recovery from issues such as branch name changes. So I removed that logic, to have sync fail immediately, without removing the local copy.
Now, this had important consequences. Previously, any failed sync would result in the repository being removed and cloned again, leaving no trace of the original error. On top of that, logic stopping the script early when the Gentoo repository failed meant that the actual error wasn't even saved, leaving me only with the subsequent clone failures.
When the sync failed again (and of course it did), I was able to actually investigate what was wrong. What actually happened is that the repository wasn't on a branch - the checkout was detached at some commit. Initially, I assumed this was some fluke, perhaps also related to the Git upgrade. I switched manually to master, and that fixed it. Then it broke again. And again.
So far, I had mostly been dealing with the failures asynchronously - I wasn't around at the time of the initial failure, and only started working on it after a few failed runs. However, the issue finally resurfaced so fast that I was able to connect the dots: the problem had likely happened immediately after gentoo-ci hit an issue and bisected it! So I started suspecting that there was another issue in the scripts, perhaps another case of a missed --, but I couldn't find anything relevant.
Finally, I started looking at the post-bisect code. What we were doing was calling git rev-parse HEAD prior to the bisect, and then using that result in git checkout. This obviously meant that after every bisect, we ended up with a detached HEAD, i.e. precisely the issue I was seeing. So why didn't I notice this before?
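The offending pattern, and one obvious way to fix it, would look roughly like this (the actual repo-mirror-ci change isn't quoted here):

# Buggy: "git rev-parse HEAD" yields a commit hash, so restoring it after
# the bisect always leaves the checkout detached.
orig=$(git rev-parse HEAD)
# ... run the bisect ...
git checkout "${orig}"

# One possible fix: record the branch name instead of the commit.
orig=$(git rev-parse --abbrev-ref HEAD)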
Of course, because of the sync error handling. Once a bisect broke the repository, the next sync failed and the repository got cloned again, and we never noticed anything was wrong. We only started noticing once cloning started failing. So after a few days of confusion and false leads, I finally fixed a bug that had been present in production code for over 7 years, and had caused the Gentoo repository to be cloned over and over again whenever a bad commit happened.
26 Jul 2025
Planet Gentoo
EPYTEST_PLUGINS and other goodies now in Gentoo
If you are following the gentoo-dev mailing list, you may have noticed that there's been a fair number of patches sent for the Python eclasses recently. Most of them have been centered on pytest support. Long story short, I came up with what I believed to be a reasonably good design, and decided it's time to stop manually repeating all the good practices in every ebuild separately.
In this post, I am going to briefly summarize all the recently added options. As always, they are all also documented in the Gentoo Python Guide.
The unceasing fight against plugin autoloading
The pytest test loader defaults to automatically loading all the plugins installed to the system. While this is usually quite convenient, especially when you're testing in a virtual environment, it can get quite messy when you're testing against system packages and end up with lots of different plugins installed. The results can range from slowing tests down to completely breaking the test suite.
Our initial attempts to contain the situation were based on maintaining a list of known-bad plugins and explicitly disabling their autoloading. The list of disabled plugins has gotten quite long by now. It includes both plugins that were known to frequently break tests, and those that frequently resulted in automagic dependencies.
While the opt-out approach allowed us to resolve the worst issues, it only worked when we knew about a particular issue. So naturally we'd miss some rarer issue, and learn about it only when arch testing workflows failed or users reported it. And of course, we would still be loading lots of unnecessary plugins, at the cost of performance.
So, we started disabling autoloading entirely, using the PYTEST_DISABLE_PLUGIN_AUTOLOAD environment variable. At first we only used it when we needed to; over time, however, we started using it almost everywhere - after all, we don't want the test suites to suddenly start failing because of a new pytest plugin installed.
For a long time, I have been hesitant to disable autoloading by default. My main concern was that it's easy to miss a missing plugin. Say, if you ended up failing to load pytest-asyncio or a similar plugin, all the asynchronous tests would simply be skipped (verbosely, but it's still easy to miss among the flood of warnings). However, eventually we started treating this warning as an error (and then pytest started doing the same upstream), and I have decided that going opt-in is worth the risk. After all, we were already disabling it all over the place anyway.
EPYTEST_PLUGINS
Disabling plugin autoloading is only the first part of the solution. Once you disabled autoloading, you need to load the plugins explicitly - it's not sufficient anymore to add them as test dependencies, you also need to add a bunch of -p switches. And then, you need to keep maintaining both dependencies and pytest switches in sync. So you'd end up with bits like:
BDEPEND="
	test? (
		dev-python/flaky[${PYTHON_USEDEP}]
		dev-python/pytest-asyncio[${PYTHON_USEDEP}]
		dev-python/pytest-timeout[${PYTHON_USEDEP}]
	)
"

distutils_enable_tests pytest

python_test() {
	local -x PYTEST_DISABLE_PLUGIN_AUTOLOAD=1
	epytest -p asyncio -p flaky -p timeout
}
Not very efficient, right? The idea then is to replace all that with a single EPYTEST_PLUGINS variable:
EPYTEST_PLUGINS=( flaky pytest-{asyncio,timeout} )
distutils_enable_tests pytest
And that's it! EPYTEST_PLUGINS takes a bunch of Gentoo package names (without category - almost all of them reside in dev-python/, and we can special-case the few that do not), distutils_enable_tests adds the dependencies and epytest (in the default python_test() implementation) disables autoloading and passes the necessary flags.
Now, what's really cool is that the function will automatically determine the correct argument values! This can be especially important if entry point names change between package versions - and upstreams generally don't consider this an issue, since autoloading isn't affected.
Going towards no autoloading by default
Okay, that gives us a nice way of specifying which plugins to load. However, weren't we talking of disabling autoloading by default?
Well, yes - and the intent is that it's going to be disabled by default in EAPI 9. However, until then there's a simple solution we encourage everyone to use: set an empty EPYTEST_PLUGINS. So:
EPYTEST_PLUGINS=()
distutils_enable_tests pytest
…and that's it. When it's set to an empty list, autoloading is disabled. When it's unset, it is enabled for backwards compatibility. And the next pkgcheck release is going to suggest it:
dev-python/a2wsgi EPyTestPluginsSuggestion: version 1.10.10: EPYTEST_PLUGINS can be used to control pytest plugins loaded
EPYTEST_PLUGIN* to deal with special cases
While the basic feature is neat, it is not a silver bullet. The approach used is insufficient for some packages, most notably pytest plugins that run pytest subprocesses without the appropriate -p options, and expect plugins to be autoloaded there. However, after some more fiddling we arrived at three helpful features:
- EPYTEST_PLUGIN_LOAD_VIA_ENV that switches explicit plugin loading from -p arguments to PYTEST_PLUGINS environment variable. This greatly increases the chance that subprocesses will load the specified plugins as well, though it is more likely to cause issues such as plugins being loaded twice (and therefore is not the default). And as a nicety, the eclass takes care of finding out the correct values, again.
- EPYTEST_PLUGIN_AUTOLOAD to reenable autoloading, effectively making EPYTEST_PLUGINS responsible only for adding dependencies. It's really intended to be used as a last resort, and mostly for future EAPIs when autoloading will be disabled by default.
- Additionally, EPYTEST_PLUGINS can accept the name of the package itself (i.e. ${PN}) - in which case it will not add a dependency, but load the just-built plugin.
How useful is that? Compare:
BDEPEND="
	test? (
		dev-python/pytest-datadir[${PYTHON_USEDEP}]
	)
"

distutils_enable_tests pytest

python_test() {
	local -x PYTEST_DISABLE_PLUGIN_AUTOLOAD=1
	local -x PYTEST_PLUGINS=pytest_datadir.plugin,pytest_regressions.plugin
	epytest
}
…and:
EPYTEST_PLUGINS=( "${PN}" pytest-datadir )
EPYTEST_PLUGIN_LOAD_VIA_ENV=1
distutils_enable_tests pytest
Old and new bits: common plugins
The eclass already had some bits related to enabling common plugins. Given that EPYTEST_PLUGINS only takes care of loading plugins, but not passing specific arguments to them, they are still meaningful. Furthermore, we've added EPYTEST_RERUNS.
The current list is:
- EPYTEST_RERUNS=... that takes a number of reruns and uses pytest-rerunfailures to retry failing tests the specified number of times.
- EPYTEST_TIMEOUT=... that takes a number of seconds and uses pytest-timeout to force a timeout if a single test does not complete within the specified time.
- EPYTEST_XDIST=1 that enables parallel testing using pytest-xdist, if the user allows multiple test jobs. The number of test jobs can be controlled (by the user) by setting EPYTEST_JOBS with a fallback to inferring from MAKEOPTS (setting to 1 disables the plugin entirely).
The variables automatically add the needed plugin, so they do not need to be repeated in EPYTEST_PLUGINS.
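For illustration, a test setup combining these knobs with explicit plugin loading might look like this (the plugin choice and the timeout value are made up):

# Hypothetical ebuild snippet: load one plugin explicitly, cap each test
# at 30 minutes, and allow parallel runs via pytest-xdist.
EPYTEST_PLUGINS=( flaky )
EPYTEST_TIMEOUT=1800
EPYTEST_XDIST=1

distutils_enable_tests pytest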
JUnit XML output and gpy-junit2deselect
As an extra treat, we ask pytest to generate JUnit-style XML output for each test run, which can be used for machine processing of test results. gpyutils now supplies a gpy-junit2deselect tool that can parse this XML and output a handy EPYTEST_DESELECT for the failing tests:
$ gpy-junit2deselect /tmp/portage/dev-python/aiohttp-3.12.14/temp/pytest-xml/python3.13-QFr.xml
EPYTEST_DESELECT=(
	tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_nonzero_passed
	tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_passed_to_create_connection
	tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_zero_not_passed
)
While it doesn't replace due diligence, it can help you update long lists of deselects. As a bonus, it automatically collapses deselects to test functions, classes and files when all matching tests fail.
hypothesis-gentoo to deal with health check nightmare
Hypothesis is a popular Python fuzz testing library. Unfortunately, it has one feature that, while useful upstream, is pretty annoying to downstream testers: health checks.
The idea behind health checks is to make sure that fuzz testing remains efficient. For example, Hypothesis is going to fail if the routine used to generate examples is too slow. And as you can guess, "too slow" is more likely to happen on a busy Gentoo system than on dedicated upstream CI. Not to mention some upstreams plain ignore health check failures if they happen rarely.
Given how often this broke for us, we requested an option to disable Hypothesis health checks a long time ago. Unfortunately, upstream's answer can be summarized as: "it's up to packages using Hypothesis to provide such an option, and you should not be running fuzz testing downstream anyway". Easy to say.
Well, obviously we are not going to pester every single package using Hypothesis into adding a profile with health checks disabled. We did report health check failures sometimes, and sometimes got no response at all. And skipping these tests is not really an option, given that often there are no other tests for a given function - and even if there are, it's just going to be a maintenance nightmare.
I've finally figured out that we can create a Hypothesis plugin - now hypothesis-gentoo - that provides a dedicated "gentoo" profile with all health checks disabled, and then we can simply use this profile in epytest. And how do we know that Hypothesis is used? Of course we look at EPYTEST_PLUGINS! All pieces fall into place. It's not 100% foolproof, but health check problems aren't that common either.
Summary
I have to say that I really like what we achieved here. Over the years, we learned a lot about pytest, and used that knowledge to improve testing in Gentoo. And after repeating the same patterns for years, we have finally replaced them with eclass functions that can largely work out of the box. This is a major step forward.