12 Oct 2025
Planet Gentoo
How we incidentally uncovered a 7-year old bug in gentoo-ci
"Gentoo CI" is the service providing periodic linting for the Gentoo repository. It is a part of the Repository mirror and CI project that I've started in 2015. Of course, it all started as a temporary third-party solution, but it persisted, was integrated into Gentoo Infrastructure and grew organically into quite a monstrosity.
It's imperfect in many ways. In particular, it has only a limited degree of error recovery, and when things go wrong beyond that, a manual fix is needed. Often the "fix" is to stop mirroring a problematic repository. Over time, I've started having serious doubts about the project, and proposed sunsetting most of it.
Lately, things have been getting worse. What started as a minor change in behavior of Git triggered a whole cascade of failures, leading to me finally announcing the deadline for sunsetting the mirroring of third-party repositories, and starting to rip non-critical bits out of it. Interestingly enough, this whole process led me to finally discover the root cause of most of these failures - a bug that has existed since the very early versions of the code, but happened to be hidden by the hacky error recovery code. Here's the story of it.
Repository mirror and CI is basically a bunch of shell scripts with Python helpers, run via a cronjob (repo-mirror-ci code). The scripts are responsible for syncing the whole lot of public Gentoo repositories, generating caches for them, publishing them onto our mirror repositories, and finally running pkgcheck on the Gentoo repository. Most of the "unexpected" error handling is set -e -x, with dumb logging to a file, and mailing on cronjob failure. Some common errors are handled gracefully though - sync errors, pkgcheck failures and so on.
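To give a rough idea of the structure, here is a simplified sketch - not the actual repo-mirror-ci code; the function names and the log path are made up for illustration:

#!/usr/bin/env bash
# simplified sketch of the cronjob script pattern described above
set -e -x                                 # abort on any unexpected error, trace commands
exec &>> "${HOME}/repo-mirror-ci.log"     # "dumb logging" into a single file

sync_repositories     # sync all public Gentoo repositories
regenerate_caches     # generate metadata caches for them
publish_mirrors       # push the results to the mirror repositories
run_pkgcheck          # lint the Gentoo repository; a failure here (or anywhere
                      # above) exits non-zero, which triggers the failure mail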
The whole cascade started when Git was upgraded on the server. The upgrade involved a change in behavior where git checkout -- ${branch} stopped working; you could only specify files after the --. The fix was trivial enough.
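For illustration (these are not the literal script lines), the change boils down to:

git checkout -- "${branch}"    # old invocation: newer Git treats everything after
                               # "--" as a pathspec and rejects a branch name there
git checkout "${branch}"       # the trivial fix: pass the ref before (or without) "--"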
However, once the issue was fixed, I started periodically seeing sync failures from the Gentoo repository. The scripts had a very dumb way of handling sync failures: if syncing failed, they removed the local copy entirely and tried again. This generally made sense - say, if upstream renamed the main branch, git pull would fail but a fresh clone would be a cheap fix. However, the Gentoo repository is quite big, and when it got removed due to a sync failure, cloning it afresh from the Gentoo Infrastructure would fail.
So when it failed, I did a quick hack - I cloned the repository manually from GitHub, replaced the remote and put it in place. Problem solved. Except a while later, the same issue surfaced. This time I kept an additional local clone, so I wouldn't have to fetch it from the server, and put it in place again. But then, it got removed once more, and this was really getting tedious.
What I assumed then was that the repository was failing to sync due to some temporary problem, either network- or Infrastructure-related. If that were the case, it really made no sense to remove it and clone afresh. On top of that, since we were sunsetting support for third-party repositories anyway, there was no need for automatic recovery from issues such as branch name changes. So I removed that logic, to have sync fail immediately, without removing the local copy.
Now, this had important consequences. Previously, any failed sync would result in the repository being removed and cloned again, leaving no trace of the original error. On top of that, logic that stopped the script early when the Gentoo repository failed meant that the actual error wasn't even saved, leaving me only with the subsequent clone failures.
When the sync failed again (and of course it did), I was able to actually investigate what was wrong. What actually happened was that the repository wasn't on a branch - the checkout was detached at some commit. Initially, I assumed this was some fluke, perhaps also related to the Git upgrade. I switched manually to master, and that fixed it. Then it broke again. And again.
So far, I had mostly been dealing with the failures asynchronously - I wasn't around at the time of the initial failure, and only started working on it after a few failed runs. However, the issue finally resurfaced fast enough that I was able to connect the dots: the problem seemed to happen immediately after gentoo-ci hit an issue and bisected it! So I started suspecting that there was another issue in the scripts, perhaps another case of a missing --, but I couldn't find anything relevant.
Finally, I started looking at the post-bisect code. What we were doing was calling git rev-parse HEAD prior to the bisect, and then using that result in git checkout. This obviously meant that after every bisect, we ended up with a detached HEAD, i.e. precisely the issue I was seeing. So why didn't I notice this before?
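Paraphrasing the logic (not the literal script code), the bug and the fix look roughly like this:

# buggy pattern: remember a commit hash rather than the branch
orig=$(git rev-parse HEAD)              # e.g. 1a2b3c4... - the branch information is lost
# ... run git bisect to find the offending commit ...
git checkout "${orig}"                  # leaves the checkout detached at that commit

# recording the branch name instead restores a proper branch checkout:
orig=$(git symbolic-ref --short HEAD)   # e.g. "master"
# ... bisect ...
git checkout "${orig}"                  # back on the branch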
Of course, because of the sync error handling. Once bisect broke the repository, the next sync failed, the repository got cloned again, and we never noticed anything was wrong. We only started noticing once cloning started failing. So after a few days of confusion and false leads, I finally fixed a bug that had been present in production code for over 7 years, and that caused the Gentoo repository to be cloned over and over again whenever any bad commit happened.
12 Oct 2025 9:14am GMT
26 Jul 2025
Planet Gentoo
EPYTEST_PLUGINS and other goodies now in Gentoo
If you are following the gentoo-dev mailing list, you may have noticed that there's been a fair number of patches sent for the Python eclasses recently. Most of them have been centered on pytest support. Long story short, I came up with what I believed to be a reasonably good design, and decided it's time to stop manually repeating all the good practices in every ebuild separately.
In this post, I am going to shortly summarize all the recently added options. As always, they are all also documented in the Gentoo Python Guide.
The unceasing fight against plugin autoloading
The pytest test loader defaults to automatically loading all the plugins installed on the system. While this is usually quite convenient, especially when you're testing in a virtual environment, it can get quite messy when you're testing against system packages and end up with lots of different plugins installed. The results can range from slowing tests down to completely breaking the test suite.
Our initial attempts to contain the situation were based on maintaining a list of known-bad plugins and explicitly disabling their autoloading. The list of disabled plugins has gotten quite long by now. It includes both plugins that were known to frequently break tests, and those that frequently resulted in automagic dependencies.
While the opt-out approach allowed us to resolve the worst issues, it only worked when we knew about a particular problem. So naturally we'd miss the rarer ones, and learn about them only when arch testing workflows failed or users reported issues. And of course, we would still be loading lots of unnecessary plugins at the cost of performance.
So, we started disabling autoloading entirely, using the PYTEST_DISABLE_PLUGIN_AUTOLOAD environment variable. At first, we only used it when we needed to; however, over time we started using it almost everywhere - after all, we don't want test suites to suddenly start failing because a new pytest plugin got installed.
For a long time, I have been hesitant to disable autoloading by default. My main concern was that it's easy to miss a missing plugin. Say, if you ended up failing to load pytest-asyncio or a similar plugin, all the asynchronous tests would simply be skipped (verbosely, but it's still easy to miss among the flood of warnings). However, eventually we started treating this warning as an error (and then pytest started doing the same upstream), and I have decided that going opt-in is worth the risk. After all, we were already disabling it all over the place anyway.
EPYTEST_PLUGINS
Disabling plugin autoloading is only the first part of the solution. Once you disable autoloading, you need to load the plugins explicitly - it's no longer sufficient to add them as test dependencies; you also need to add a bunch of -p switches. And then, you need to keep both the dependencies and the pytest switches in sync. So you'd end up with bits like:
BDEPEND="
test? (
dev-python/flaky[${PYTHON_USEDEP}]
dev-python/pytest-asyncio[${PYTHON_USEDEP}]
dev-python/pytest-timeout[${PYTHON_USEDEP}]
)
"
distutils_enable_tests pytest
python_test() {
local -x PYTEST_DISABLE_PLUGIN_AUTOLOAD=1
epytest -p asyncio -p flaky -p timeout
}
Not very efficient, right? The idea then is to replace all that with a single EPYTEST_PLUGINS variable:
EPYTEST_PLUGINS=( flaky pytest-{asyncio,timeout} )
distutils_enable_tests pytest
And that's it! EPYTEST_PLUGINS takes a bunch of Gentoo package names (without category - almost all of them reside in dev-python/, and we can special-case the few that do not), distutils_enable_tests adds the dependencies and epytest (in the default python_test() implementation) disables autoloading and passes the necessary flags.
Now, what's really cool is that the function will automatically determine the correct argument values! This can be especially important if entry point names change between package versions - and upstreams generally don't consider this an issue, since autoloading isn't affected.
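To make that concrete: with the EPYTEST_PLUGINS value above, the default python_test() implementation effectively boils down to something like this (a simplified illustration, not the literal eclass code):

python_test() {
    local -x PYTEST_DISABLE_PLUGIN_AUTOLOAD=1
    # the entry point names (asyncio, flaky, timeout) are resolved automatically
    # from the package names given in EPYTEST_PLUGINS
    epytest -p asyncio -p flaky -p timeout
}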
Going towards no autoloading by default
Okay, that gives us a nice way of specifying which plugins to load. However, weren't we talking of disabling autoloading by default?
Well, yes - and the intent is that it's going to be disabled by default in EAPI 9. However, until then there's a simple solution we encourage everyone to use: set an empty EPYTEST_PLUGINS. So:
EPYTEST_PLUGINS=()
distutils_enable_tests pytest
…and that's it. When it's set to an empty list, autoloading is disabled. When it's unset, it is enabled for backwards compatibility. And the next pkgcheck release is going to suggest it:
dev-python/a2wsgi
  EPyTestPluginsSuggestion: version 1.10.10: EPYTEST_PLUGINS can be used to control pytest plugins loaded
EPYTEST_PLUGIN* to deal with special cases
While the basic feature is neat, it is not a silver bullet. The approach used is insufficient for some packages, most notably pytest plugins that run pytest subprocesses without the appropriate -p options and expect plugins to be autoloaded there. However, after some more fiddling, we arrived at three helpful features:
- EPYTEST_PLUGIN_LOAD_VIA_ENV that switches explicit plugin loading from -p arguments to PYTEST_PLUGINS environment variable. This greatly increases the chance that subprocesses will load the specified plugins as well, though it is more likely to cause issues such as plugins being loaded twice (and therefore is not the default). And as a nicety, the eclass takes care of finding out the correct values, again.
- EPYTEST_PLUGIN_AUTOLOAD to reenable autoloading, effectively making EPYTEST_PLUGINS responsible only for adding dependencies. It's really intended to be used as a last resort, and mostly for future EAPIs when autoloading will be disabled by default.
- Additionally, EPYTEST_PLUGINS can accept the name of the package itself (i.e. ${PN}) - in which case it will not add a dependency, but load the just-built plugin.
How useful is that? Compare:
BDEPEND="
test? (
dev-python/pytest-datadir[${PYTHON_USEDEP}]
)
"
distutils_enable_tests pytest
python_test() {
local -x PYTEST_DISABLE_PLUGIN_AUTOLOAD=1
local -x PYTEST_PLUGINS=pytest_datadir.plugin,pytest_regressions.plugin
epytest
}
…and:
EPYTEST_PLUGINS=( "${PN}" pytest-datadir )
EPYTEST_PLUGIN_LOAD_VIA_ENV=1
distutils_enable_tests pytest
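Similarly, if a test suite genuinely cannot work without autoloading, EPYTEST_PLUGINS can still be used to track the dependencies while EPYTEST_PLUGIN_AUTOLOAD keeps autoloading enabled. A minimal sketch (the plugin choice is purely illustrative):

# dependencies still come from EPYTEST_PLUGINS, but autoloading stays enabled
EPYTEST_PLUGINS=( pytest-datadir )
EPYTEST_PLUGIN_AUTOLOAD=1

distutils_enable_tests pytest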
Old and new bits: common plugins
The eclass already had some bits related to enabling common plugins. Given that EPYTEST_PLUGINS only takes care of loading plugins, but not passing specific arguments to them, they are still meaningful. Furthermore, we've added EPYTEST_RERUNS.
The current list is:
- EPYTEST_RERUNS=... that takes a number of reruns and uses pytest-rerunfailures to retry failing tests the specified number of times.
- EPYTEST_TIMEOUT=... that takes a number of seconds and uses pytest-timeout to force a timeout if a single test does not complete within the specified time.
- EPYTEST_XDIST=1 that enables parallel testing using pytest-xdist, if the user allows multiple test jobs. The number of test jobs can be controlled (by the user) by setting EPYTEST_JOBS with a fallback to inferring from MAKEOPTS (setting to 1 disables the plugin entirely).
The variables automatically add the needed plugin, so they do not need to be repeated in EPYTEST_PLUGINS.
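For example, a package needing retries, a per-test timeout and parallel testing could look roughly like this (an illustrative combination, not taken from a real ebuild):

EPYTEST_PLUGINS=( pytest-asyncio )
EPYTEST_RERUNS=3        # retry failing tests up to 3 times (pytest-rerunfailures)
EPYTEST_TIMEOUT=1800    # abort any single test running longer than 1800 s (pytest-timeout)
EPYTEST_XDIST=1         # run tests in parallel if the user allows it (pytest-xdist)

distutils_enable_tests pytest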
JUnit XML output and gpy-junit2deselect
As an extra treat, we ask pytest to generate a JUnit-style XML output for each test run that can be used for machine processing of test results. gpyutils now supplies a gpy-junit2deselect tool that can parse this XML and output a handy EPYTEST_DESELECT for the failing tests:
$ gpy-junit2deselect /tmp/portage/dev-python/aiohttp-3.12.14/temp/pytest-xml/python3.13-QFr.xml
EPYTEST_DESELECT=(
    tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_nonzero_passed
    tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_passed_to_create_connection
    tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_zero_not_passed
)
While it doesn't replace due diligence, it can help you update long lists of deselects. As a bonus, it automatically collapses deselects to test functions, classes and files when all matching tests fail.
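The resulting array can be pasted straight into the ebuild, typically at global scope (a minimal sketch reusing one of the deselects from the output above):

# tests known to fail in the Gentoo test environment
EPYTEST_DESELECT=(
    tests/test_connector.py::test_tcp_connector_ssl_shutdown_timeout_nonzero_passed
)

distutils_enable_tests pytest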
hypothesis-gentoo to deal with health check nightmare
Hypothesis is a popular Python fuzz testing library. Unfortunately, it has one feature that, while useful upstream, is pretty annoying to downstream testers: health checks.
The idea behind health checks is to make sure that fuzz testing remains efficient. For example, Hypothesis is going to fail if the routine used to generate examples is too slow. And as you can guess, "too slow" is more likely to happen on a busy Gentoo system than on dedicated upstream CI. Not to mention some upstreams plain ignore health check failures if they happen rarely.
Given how often this broke for us, we requested an option to disable Hypothesis health checks long ago. Unfortunately, upstream's answer can be summarized as: "it's up to packages using Hypothesis to provide such an option, and you should not be running fuzz testing downstream anyway". Easy to say.
Well, obviously we are not going to pursue every single package using Hypothesis to add a profile with health checks disabled. We did report health check failures sometimes, and sometimes got no response at all. And skipping these tests is not really an option, given that often there are no other tests for a given function, and even if there are - it's just going to be a maintenance nightmare.
I've finally figured out that we can create a Hypothesis plugin - now hypothesis-gentoo - that provides a dedicated "gentoo" profile with all health checks disabled, and then we can simply use this profile in epytest. And how do we know that Hypothesis is used? Of course we look at EPYTEST_PLUGINS! All pieces fall into place. It's not 100% foolproof, but health check problems aren't that common either.
Summary
I have to say that I really like what we achieved here. Over the years, we learned a lot about pytest, and used that knowledge to improve testing in Gentoo. And after repeating the same patterns for years, we have finally replaced them with eclass functions that can largely work out of the box. This is a major step forward.
26 Jul 2025 1:29pm GMT
30 Apr 2025
Planet Gentoo
Urgent - OSU Open Source Lab needs your help
Oregon State University's Open Source Lab (OSL) has been a major supporter of Gentoo Linux and many other software projects for years. It is currently hosting several of our infrastructure servers as well as development machines for exotic architectures, and is critical for Gentoo operation.
Due to drops in sponsor contributions, OSL has been operating at a loss for a while, with the OSU College of Engineering picking up the rest of the bill. Now that university funding has been cut, this is not possible anymore, and unless US$250,000 can be provided within the next two weeks, OSL will have to shut down. The details can be found in a blog post by Lance Albertson, the director of OSL.
Please, if you value and use Gentoo Linux or any of the other projects that OSL has been supporting, and if you - or the company you work for - are in a position to make funds available, contact the address in the blog post. Obviously, long-term corporate sponsorships would serve best here - for what it's worth, OSL developers have ended up at almost every big US tech corporation by now. Right now, though, probably everything helps.
30 Apr 2025 5:00am GMT