09 Apr 2026
planet.freedesktop.org
Natalie Vock: Fixing AMDGPU’s VRAM management for low-end GPUs
It may sound unbelievable to some, but not everyone has a datacenter beast with 128GB of VRAM shoved in their desktop PCs. Around the world people tell the tale of a particularly fierce group of Linux gamers: Those who dare attempt to play games with only 8 gigabytes of VRAM, or even less. Truly, it takes exceedingly strong resilience and determination to face the stutters and slowdowns bound to occur when the system starts running low on free VRAM. Carnage erupts inside the kernel driver as every application fights for as much GPU memory as it can hold on to. Any game caught up in this battle for resources will surely not leave unscathed.
That is, until now. Because I fixed it.
Q: I don't care about long-winded rants about Linux graphics drivers! Where do I get moar perf?
A: You need some kernel patches as well as additional utilities to make use of the kernel capabilities properly.
The simplest option is to use CachyOS (with KDE as your desktop). Their kernel includes the patches you need from version 7.0rc7-2 and up, and the userspace utilities are available in the package repositories. All you need to do is use CachyOS's 7.0rc7-2 kernel, install the packages called dmemcg-booster and plasma-foreground-booster, and you should be good to go.
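For reference, a sketch of the install step on CachyOS (the kernel package name linux-cachyos is an assumption on my part - use whichever CachyOS kernel package carries 7.0rc7-2 or newer):

```shell
# Install the patched kernel and the two userspace helpers from the
# CachyOS repos. linux-cachyos is assumed to be the package that
# provides the patched kernel; the helper package names are as above.
sudo pacman -S linux-cachyos dmemcg-booster plasma-foreground-booster
```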
Q: I use another Arch-based distro! What now?
A: The dmemcg-booster and plasma-foreground-booster utilities are available in the AUR as well (plasma-foreground-booster carries the package name plasma-foreground-booster-dmemcg), so you can install them from there.
For the kernel side, you can either use the CachyOS kernel package on a non-CachyOS system by retrieving the package from their repository, or you can compile your own kernel. Installing linux-dmemcg from the AUR will compile the development branch I used to develop this. Being a development branch, this carries the risk of some stuff being broken, so install at your own risk!
If you want to apply the kernel patches yourself, you need these six .patch files:
Patch 1
Patch 2
Patch 3
Patch 4
Patch 5
Patch 6
I'm not sure how easily they apply on specific kernel versions, but feel free to leave a comment if you run into issues and I'll try to help out.
Q: I don't use an Arch-based distro (or the instructions don't apply to me for some other reason)! What now?
A: Maybe wait a bit. Eventually I'd expect this to trickle down into more distros. If I notice this work being packaged by other distros or being installable by other means, I will update this blogpost.
Q: I don't use KDE! What now?
A: For games where you care about VRAM usage, you can use newer versions of gamescope, which will also try to make use of these kernel capabilities - running your games through it should be sufficient. You will still need the dmemcg-booster utility in any case.
Q: I don't use systemd! What now?
A: All the user-space utilities hard-depend on systemd. Without systemd, you'd need to write your own utilities that make use of my kernel patches. Something needs to manage cgroups in your system, and that something needs to enable the right cgroup controllers and set the right limits (see also the long-winded explanation about how this works).
Q: I do care about long-winded rants about Linux graphics drivers! How does this work?
A: Let's first look at what problems we actually run into when we have games running on GPUs with little VRAM.
On a standard desktop system, the game won't be the only application using the GPU. If it's anything like my system, there's always at least one browser window with way too many tabs open, plus an assortment of apps (many of which are actually web apps running in their own browsers under the hood). All of this eats up quite a bit of VRAM, as well.
To properly stress-test kernel memory management when working on this issue, I would go ahead and open up nearly every app with an integrated browser engine that I had installed. Viewed in amdgpu_top, the result of that looks something like this:
[Image: amdgpu_top output with the background apps running]
Ouch, there goes 1/4 of VRAM. Now, let's try and launch Cyberpunk 2077 on top of that:
[Image: amdgpu_top output after launching Cyberpunk 2077]
As expected, the game uses a lot of VRAM (I cranked the settings really high). However, a lot of memory allocations also end up in a memory region referred to as "GTT". This is memory that is accessible by the GPU, but physically located in system RAM. From the GPU's point of view, system RAM memory has to be accessed over the PCI bus. Accessing memory over the PCI bus is typically really, really slow. On my system, instead of the 256GB/s bandwidth VRAM could provide, we're suddenly stuck with a meager 16GB/s at absolute maximum, paired with significantly worse latency.
Some amount of memory landing in GTT is normal - many games will intentionally allocate memory in GTT because it is advantageous for some use cases, and Cyberpunk 2077 intentionally allocates a fixed amount of around 650MB there. The rest is something else entirely: the game requested those allocations in VRAM, but somehow, they ended up in GTT instead!
In kernel land, this process is referred to as eviction. The system as a whole tried to use more VRAM than is physically available, so something had to give. Instead of telling the app that memory allocation failed (which would mean a near-certain application crash), the kernel decides to kick some memory out of VRAM to make everything fit. This degrades performance, but at least it allows every app to continue running. Nice! If only it would evict literally anything other than the game, which is the very thing that suffers the worst from having its memory evicted. Why in the world would it decide on that????
A brief history of kernel eviction policies
Memory eviction and behavior under VRAM pressure are by no means new issues. Over time, different approaches have tackled different parts of the problem, and each introduced new issues of its own.
In the beginning, things worked rather simplistically: If applications wanted VRAM allocations, the user-mode driver would go to the kernel-mode driver and request VRAM memory. Save for some exceptional cases, that request would be granted, and that memory would be kept in VRAM. If another application requested VRAM allocations, and the memory was kicked out, the kernel driver would move the memory back into VRAM the next time work was submitted to the GPU using that memory.
This worked quite horribly. Generally, two competing applications can be expected to roughly take turns executing GPU work - first one application submits work, then the other, then the first again, and so on. With that approach, memory would keep being moved back and forth after every single submission. One application gets kicked out and immediately moved back in, kicking the other out (which moves memory back in the next step). All this moving ended up with worse performance than if the memory had never been moved in the first place.
The first bandaid solution was to rate-limit memory movement inside the kernel driver. Once the kernel driver moved enough memory within a specific time frame to trigger a limit, no more memory would be moved for some more time. This indeed reduced moves, but didn't do anything to fix the underlying issue of repeated cyclic memory movements. Worse yet, repeatedly running into this ratelimit would introduce annoying jitters and stutters as the kernel driver rapidly alternated between moving memory and doing nothing.
Eventually, to combat the still-existing overhead of repeatedly moving memory, user-mode drivers changed their allocation strategy. Instead of specifying VRAM as the only acceptable domain to place the allocation in, every VRAM allocation request would specify both "VRAM" and "GTT" as possible memory domains. The kernel would interpret this as VRAM being preferred, but if there was no space, GTT was an acceptable fallback and the kernel wouldn't try to kick out other VRAM memory to make space.
This change entirely stopped the issue with memory repeatedly moving in and out of VRAM. However, if you squint your eyes a bit, you can see the kernel conceptually performing an eviction here, too. If there is no space in VRAM, the newly allocated memory is immediately evicted. This "eviction" is incredibly cheap to perform, since you don't actually need to move any memory, but the result is all the same: Memory that would ideally be in VRAM ends up in GTT.
This case is what we run into in Cyberpunk 2077 above. At some point, VRAM is full, and new allocations done by the game go straight to GTT. Clearly, that is the wrong decision to make here. But being more aggressive wouldn't really work either - that was the approach before, and it was even worse. So what is the right decision here?
Making the right decision is impossible
There is no single right decision to make here. Being aggressive is wrong, and not being aggressive is wrong, too. To be more specific, they're wrong in different cases. It makes complete sense for a game to be aggressive, but it makes no sense for random background apps to be equally aggressive. Random background apps should not be aggressive at all, but if the game backs off equally quickly, that doesn't help much either.
The real problem is that to the kernel driver, all memory looks the same. The kernel doesn't know if it's dealing with a highly-important object from a game or a static image from a random web app running in the background - all it sees is a list of buffers. As long as all buffers look the same, it is impossible to have the same approach work well for every one of all the wildly different situations a driver may encounter.
Enter cgroups
cgroups are cool. They're super great at organizing random batches of processes into single organizational pieces. If you make a "compile job" cgroup and put the make process in it, all compiler processes it spawns will be part of that cgroup too. Don't want a big build hogging up all your RAM? Set a limit with the cgroup memory controller. Want to have some CPU time for other things? Just set a CPU limit with the cgroup cpu controller. It's great. You can have cgroup hierarchies too, and represent almost any kind of complex resource distribution you want.
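The compile-job example can be sketched directly against the cgroup filesystem (assuming cgroup v2 mounted at /sys/fs/cgroup, the memory and cpu controllers enabled in the parent's cgroup.subtree_control, and root privileges):

```shell
# Create a "compile job" cgroup and put limits on it.
mkdir /sys/fs/cgroup/compilejob
# memory controller: cap RAM usage of the whole job at 8G.
echo "8G" > /sys/fs/cgroup/compilejob/memory.max
# cpu controller: 50ms of CPU time per 100ms period, i.e. half a CPU.
echo "50000 100000" > /sys/fs/cgroup/compilejob/cpu.max
# Move this shell into the cgroup; make and every compiler process it
# spawns will automatically be part of it, too.
echo $$ > /sys/fs/cgroup/compilejob/cgroup.procs
make -j"$(nproc)"
```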
Luckily, systemd agrees that cgroups are cool. Every systemd unit is actually represented with its own cgroup, as well. And, as it happens, desktop environments will represent each desktop app as a systemd unit.
How convenient! Complex resource distribution sounds exactly like the problems we're having in GPU driver land. If only someone wrote a cgroup controller operating on memory allocations from arbitrary devices such as GPUs…
cgroups are a very clean solution for figuring out how relatively important GPU memory allocations are. Some time after Maarten Lankhorst from Intel initially wrote a cgroup controller managing GPU memory (initially only made for limiting how much VRAM one cgroup is allowed to consume), he pointed me to this work as a possible solution for the VRAM issues I was investigating. Eventually, this resulted in the dmem cgroup controller, written by Maarten, Maxime Ripard from Red Hat, and me.
With the dmem cgroup controller, the kernel now learns about "memory protection". Memory being "protected" merely means that the kernel will go to significant lengths to avoid evicting that memory. For example, it may try to find memory from a different cgroup that is not protected and evict that instead. cgroups are all about resource partitioning, so for a cgroup, you can assign a "protection limit" - that is, if a cgroup's memory usage is below that limit, its memory is protected. As soon as it exceeds the limit, the memory ceases to be protected and can more easily be evicted. This roughly corresponds to the "more aggressive" and "less aggressive" behaviors we used to have, but now we can have some applications (=cgroups) that are more aggressive and some that are less aggressive. Precisely what we wanted!
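To illustrate (a sketch based on the upstream dmem controller interface; the region name and the cgroup path below are made-up examples - the real region names on a given system can be read from dmem.capacity in the root cgroup):

```shell
# Give the game's cgroup a protection limit on the VRAM region.
# As long as the cgroup stays below that limit, its VRAM is protected
# from eviction; anything beyond it is fair game again.
GAME_CG=/sys/fs/cgroup/user.slice/app-game.scope   # hypothetical path
# 6442450944 bytes = 6G of protected VRAM for this cgroup.
echo "drm/0000:03:00.0/vram0 6442450944" > "$GAME_CG/dmem.min"
```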
A note about my kernel patches
The dmem cgroup controller has been upstream for a while now, but for memory protection to work properly in gaming scenarios and such, you will likely still need my kernel patches.
Remember how Cyberpunk 2077 ends up with its memory in GTT because the kernel driver sees that VRAM is exhausted and puts new memory in GTT right away? I argued this is conceptually equivalent to an eviction, but under the hood, this and real evictions that move existing memory from VRAM to GTT work very differently. Among other things, protection by dmem cgroups did not apply to these "evictions" - this is what my kernel patches fix. Without them, the kernel is still not aggressive enough even if there is protection, and allocations will still end up in GTT.
User-space configuration
Maybe the best thing about cgroups for VRAM management is that the prioritization is completely dynamic and configurable by userspace. Window managers can now determine whichever app is in the foreground and give that app the highest priority via its cgroup, completely without having to teach the GPU driver what a "window" or "foreground" is. For desktops, this is an important heuristic, but it's totally not the kernel's business to know the concept of a foreground app.
I personally use KDE Plasma as my desktop environment, so I went looking for how such a thing could be integrated into Plasma. Lo and behold, it was already done! Plasma people already developed the ForegroundBooster utility that listens to which app is currently in the foreground, and tries to give it higher prioritization (in this case: wrt. CPU time) than other apps. This prioritization was also done via cgroups, so adding VRAM prioritization in my fork was pretty much a walk in the park.
Except for one thing - the ForegroundBooster utility doesn't manage cgroups and cgroup properties directly. systemd is responsible for managing cgroups, so ForegroundBooster just communicates with systemd to set the cgroup properties. That's not too bad though, let's just implement support for the dmem cgroup controller in systemd, right?
Well, this is what I thought, too. But as I alluded to before, fixing VRAM management for gaming purposes is by far not the only possible purpose of dmem cgroups. There are quite a few other use cases that people are eyeing dmem cgroups for, and if I were to implement a systemd interface while only considering the gaming scenario, the other use cases run the risk of having to deal with a systemd interface that wasn't designed with that use case in mind at all. So for now, a common systemd implementation seems mostly off-limits until the dust has settled some more.
What do we do if we can't tell systemd to do the thing we want? That's right, we do it anyway, but behind systemd's back. (Sorry, systemd.)
This is what the final piece of the puzzle, dmemcg-booster, does (safely and 🚀blazingly fast🚀). After systemd constructs the cgroup hierarchy, dmemcg-booster goes over those cgroups and additionally enables the dmem controller on them, in order to activate the kernel functionality that ultimately allows for GPU memory protection on those cgroups. While at it, it also sets some settings in the cgroup hierarchy that allow the memory protection to kick in properly.
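Conceptually, the enabling part looks roughly like this (a sketch, not the actual implementation):

```shell
# A controller is only available inside a child cgroup if it is listed
# in the parent's cgroup.subtree_control, so "+dmem" has to be written
# at every level from the root down to the cgroups of the desktop apps.
echo "+dmem" > /sys/fs/cgroup/cgroup.subtree_control
echo "+dmem" > /sys/fs/cgroup/user.slice/cgroup.subtree_control
# ...and so on, down the hierarchy that systemd built.
```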
Of course, this is a rather ugly stopgap. Once systemd gains proper support, you'd express all this with drop-in unit configurations, which is a much prettier approach. The dmemcg-booster utility is exclusively there to bridge the gap until that proper support happens.
Conclusion
With all the puzzle pieces finally in place, let's repeat our test from before, launch a bunch of heavy apps, and then play Cyberpunk 2077 on top of that. How does it look now?
[Image: amdgpu_top output with the game and background apps running, after the fix]
GTT memory usage is now down to 650MB, i.e. only the memory that the game explicitly allocated in system RAM itself. Not a single piece of memory got spilled!
Prioritization via cgroups now allows the game to use pretty much every last byte of VRAM for actual gaming purposes. It's a bit hard to compare precise numbers on how the game performs, because the VRAM shortage slowly develops over time as you run around in the game, but the improvement should be obvious when comparing how games feel when you play them for a while. Instead of performance slowly degrading over time, games should perform much more stably - as long as the game itself doesn't use more VRAM than you actually have. Generally, it seems like even modern games stay within a memory budget of ~8GB or a bit less, so if you have a GPU with 8GB of VRAM, you should be good to go with today's games.
More FAQ
Which GPUs does this work with? Is it only AMD GPUs?
Whether or not your GPU can benefit from it depends on the kernel driver - more specifically, whether it sets up the dmem cgroup controller.
amdgpu and xe both have support for the dmem cgroup controller already. In theory, Intel GPUs running the xe kernel driver should benefit as well, although I'm not sure anyone has tested this yet.
For nouveau, I have sent a patch for dmem cgroup support to the mailing lists. This patch is also included in my development branch, so if you use my AUR package it should work. In other cases, you will need to wait for the patch to be picked up by your distribution, or apply it yourself.
The proprietary NVIDIA kernel modules do not support dmem cgroups yet, so this won't work there.
Do iGPUs/APU systems benefit from this too?
I don't actually know :)
The main problem (system RAM being slower than dedicated VRAM) does not exist on integrated GPUs, because they use system RAM for everything - so effects will most likely be more limited than on dGPUs. Maybe it still has some benefit? It probably requires careful testing to find out.
09 Apr 2026 12:00am GMT
01 Apr 2026
Dave Airlie (blogspot): drm subsystem contributor numbers
I'm doing a podcast recording this week, so I wanted to run some numbers so I could have some facts rather than feels. It turns out my feels were off by a factor of 3 or so.
If asked, I've always said the contributor count to the drm subsystem is probably in the 100 or so developers per release cycle.
Did the simplest:
git log --format='%aN' v6.14..v6.15 drivers/gpu/drm/ include/uapi/drm/ include/drm/ | sort -u | wc -l
Iterated over a few kernel releases
v6.15 326
v6.16 322
v6.17 300
v6.18 334
v6.19 332
v7.0-rc6 346
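The iteration is easy to script (a sketch; meant to be run inside a kernel checkout with the tags available):

```shell
# count_authors RANGE - number of unique commit authors touching drm
# paths in the given tag range, e.g.: count_authors v6.14..v6.15
count_authors() {
  git log --format='%aN' "$1" -- drivers/gpu/drm/ include/uapi/drm/ include/drm/ \
    | sort -u | wc -l
}
# Then iterate over the release ranges:
# for r in v6.14..v6.15 v6.15..v6.16 v6.16..v6.17; do
#   echo "${r#*..} $(count_authors "$r")"
# done
```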
The numbers for the complete kernel over the same ranges are usually around 2000, which means the drm subsystem accounts for roughly 15-16% of kernel contributors.
I'm a bit spun out, that's quite a lot of people. I think I'll blame Sima for it. This also explains why I'm a bit out of touch with the process problems other maintainers have, and when I say stuff like a lot of workflows don't scale, this is what I mean.
01 Apr 2026 8:59pm GMT
27 Mar 2026
Sebastian Wick: Three Little Rust Crates
I published three Rust crates:
- name-to-handle-at: Safe, low-level Rust bindings for the Linux name_to_handle_at and open_by_handle_at system calls
- pidfd-util: Safe Rust wrapper for Linux process file descriptors (pidfd)
- listen-fds: A Rust library for handling systemd socket activation
They might seem like rather arbitrary, unconnected things - but there is a connection!
systemd socket activation passes file descriptors and a bit of metadata as environment variables to the activated process. If the activated process exec's another program, the file descriptors get passed along because they are not CLOEXEC. If that process then picks them up, things could go very wrong. So, the activated process is supposed to mark the file descriptors CLOEXEC, and unset the socket activation environment variables. If a process doesn't do this for whatever reason however, the same problems can arise. So there is another mechanism to help prevent it: another bit of metadata contains the PID of the target. Processes can check it against their own PID to figure out if they were the target of the activation, without having to depend on all other processes doing the right thing.
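To make the dance concrete, here is a small shell sketch (the helper names are my own invention, not part of any of the crates; the LISTEN_PID semantics are from systemd's socket activation protocol, where the passed fds start at fd 3):

```shell
# Are we the process systemd intended to receive the fds?
is_activation_target() {
  [ "${LISTEN_PID:-}" = "$$" ]
}
# Whether we consumed the fds or not, scrub the metadata from the
# environment so that exec'd children can't pick it up by accident.
# (The fds themselves should additionally be marked CLOEXEC.)
scrub_listen_env() {
  unset LISTEN_PID LISTEN_FDS LISTEN_FDNAMES
}
```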
PIDs however are racy because they wrap around pretty fast, and that's why nowadays we have pidfds. They are file descriptors which act as a stable handle to a process and avoid the ID wrap-around issue. Socket activation with systemd nowadays also passes a pidfd ID. A pidfd ID however is not the same as a pidfd file descriptor! It is the 64 bit inode of the pidfd file descriptor on the pidfd filesystem. This has the advantage that systemd doesn't have to install another file descriptor in the target process which might not get closed. It can just put the pidfd ID number into the $LISTEN_PIDFDID environment variable.
Getting the inode of a file descriptor doesn't sound hard. fstat(2) fills out struct stat which has the st_ino field. The problem is that it has a type of ino_t, which is 32 bits on some systems so we might end up with a process identifier which wraps around pretty fast again.
We can however use the name_to_handle_at syscall on the pidfd to get a struct file_handle with a f_handle field. The man page helpfully says that "the caller should treat the file_handle structure as an opaque data type". We're going to ignore that, though, because at least on the pidfd filesystem, the first 64 bits are the 64 bit inode. With systemd already depending on this and the kernel rule of "don't break user-space", this is now API, no matter what the man page tells you.
So there you have it. It's all connected.
Obviously both pidfds and name_to_handle have more exciting uses, many of which serve my broader goal: making Varlink services a first-class citizen. More about that another time.
27 Mar 2026 12:15am GMT
26 Mar 2026
Lennart Poettering: Mastodon Stories for systemd v260
On March 17 we released systemd v260 into the wild.
In the weeks leading up to that release (and since then) I have posted a series of serieses of posts to Mastodon about key new features in this release, under the #systemd260 hash tag. In case you aren't using Mastodon, but would like to read up, here's a list of all 21 posts:
- Post #1: NvPCR Measurements for Activated DDIs
- Post #2: Varlink Transport Plugins
- Post #3: Well-Known Varlink Services
- Post #4: .mstack Overlay Mount Stacks
- Post #5: RefreshOnReload= in Service Units
- Post #6: FANCY_NAME= in /etc/os-release
- Post #7: BindNetworkInterface= in Service Units
- Post #8: importctl pull-oci for Acquiring OCI Containers
- Post #9: systemd-report and Metrics API
- Post #10: udev's tpm2_id built-in and the TPM2 Quirks Database
- Post #11: Devicetree/CHID Database
- Post #12: Varlink IPC for systemd-networkd
- Post #13: systemd-vmspawn knows --ephemeral now
- Post #14: systemd-logind's xaccess Concept
- Post #15: Unprivileged Portable Services
- Post #16: Image Policy Improvements
- Post #17: LUKS Volume Key Fixation
- Post #18: Journal Varlink Access
- Post #19: Nested UID Range Delegation
- Post #20: PrivateUsers=managed
- Post #21: bootctl install as Varlink API
I intend to do a similar series of serieses of posts for the next systemd release (v261), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.
My series for v261 will begin in a few weeks most likely, under the #systemd261 hash tag.
In case you are interested, here is the corresponding blog story for systemd v259, here for v258, here for v257, and here for v256.
26 Mar 2026 11:00pm GMT
Erik Faye-Lund: Open Source and AI
More and more frequently, I get asked about my stance on AI in the context of programming. This is my attempt to summarize my stance for those who wonder.
This is a blog post that I don't want to write, but some recent developments have more or less forced my hand here. I would have preferred to keep pretending that I'm neutral in the issue, and just hoping that the problem goes away. But that doesn't seem to be happening.
I'm probably not the most qualified person to write about this, I'm sure you can find better informed articles out there. These are my personal opinions, and not those of my employer, their customers or any other of my affiliates. Take them with a pinch of salt, and feel free to disagree.
On a general note, I'm very reluctant to tell people how they should behave. But in this case, I've decided to do just that. I hope it's clear why from the context.
Similarly, I would caution every reader to be skeptical of anyone who claims to know what the future holds, me included. People often predict the future that benefits themselves the most. There's a few times I make some predictions in this post. Those are just predictions, and I might very well be wrong.
A final warning; this is a long post, so set aside some time. I've tried to limit the scope somewhat to mostly cover topics concerning open source development, but I sometimes end up discussing wider issues. This is simply because I don't feel like I can ignore these.
But yeah, let's start close to home here…
Licensing issues - The "plagiarism machine"
Currently, the legal status of AI generated code is still far from clear. Is it derivative work of all of the training data or not? It sometimes can be, but it depends on a lot of factors.
This is a critical issue to open source; I can't submit code somewhere that I don't know the origin of and that is potentially license-incompatible with the upstream project. This isn't just theoretical: Copilot has been known to output Quake 3 source code with the wrong license. It doesn't matter if the people with the most to lose from AI output counting as derivative works keep insisting not to worry.
The US Supreme Court recently made it clear that AI-generated code isn't even copyrightable. A human needs to write the code for copyright to be granted. But by using AI tools, we're blurring the line, making it hard or even impossible to tell what is written by a human and what isn't.
I also doubt that a single or a handful of lawsuits is going to be enough to settle this. We're working in a global ecosystem, and there's potentially hundreds of jurisdictions that might have to rule, and just as many subtleties to take into account before we have a good understanding around this. It's going to take a long time to find out.
But even if this wasn't an issue, does that really mean we should use AI? Open source software is inherently political, especially when it comes to licensing. I tend to find something not being illegal to be a terribly low standard to have. It should IMO also be the right thing to do. This brings us to the other issues…
The cost of AI scraping
As open source developers, it's important for our infrastructure to be publicly available to everyone. In recent times, AI scrapers have started taking advantage of this, and are now aggressively scraping all content on open source support infrastructure so they can train their models. These scrapers often ignore robots.txt directives, and sometimes even use randomized, residential IP addresses, making it hard or impossible to effectively block them.
All of this has a major financial impact on the open source projects. It's not unusual to see over 90% of traffic provably coming from AI scrapers.
As a result, the open source community has had to introduce barriers, like Anubis, which slows down initial page-loads. Since Anubis is based on proof-of-work, it means people with slow computers can no longer reach our infrastructure. And I can't reliably browse our GitLab instance from my phone on the bus to work.
While the latter is a minor annoyance, the former is a real problem for inclusivity.
And because our infrastructure is so heavily affected by this, it feels deeply problematic to me if we use (and pay for) the tools that are built on this behavior. That would be rewarding the behavior. We should vote with our wallets here, and in this case this means to not pay them.
Maintainability issues
Another issue is that code needs to be understood and maintained in the long term. For this to work well, we need to be able to reach out to the people that wrote the code and get input on what led to a decision. Obviously, that's not always possible, but with AI this is almost never possible. The context is lost, and so is all the insight. Asking an AI again about the same code might lead to completely different reasoning, and miss crucial details.
The project I'm mostly working on, Mesa 3D, is also arguably critical infrastructure for a lot of computer systems around the world. We need to lean towards being conservative rather than experimental when building these kinds of systems.
The junior problem
Another related issue is that AI technology tends to be used to take over more "junior" tasks, but the result of this is likely to be that we end up hiring and mentoring fewer junior developers. This will lead us to having fewer competent senior developers in the future.
Interacting with an AI isn't going to gradually make the AI learn and become more senior, unlike with a junior developer. AIs learn from training, not queries. Mentoring junior developers builds trust, which makes the interactions worthwhile also for the senior. In my experience, interactions with AIs are little other than frustration that never improves. And because working with the AI doesn't build any meaningful trust, the AIs will always need guard-rails to prevent disasters.
A future where we develop software with few to no human developers (junior or senior) sounds scary to me, but that's where this path leads.
Environmental issues
Building and running these huge data-centers is extremely resource heavy. Some of these resources are resources we all have to share on this planet, like water, electricity and rare earth minerals. This is taking a toll on our planet and everything living on it.
I feel like this point got a lot less attention recently than it used to, but it hasn't really been solved. Instead, the AI giants have just doubled down on wanting to consume all the resources they feel they need, without regard for the planet or the people living on it. They are far from truthful about the scale of the damage, and try to prevent us from knowing just how bad it is.
The truth is that AI-type solutions are almost always among the most resource-intensive solutions possible to any given problem. And right now we're being told that we should use them for all problems. This is a recipe for disaster, nothing less.
And it seems like there's nothing being done on this front. The big AI companies are just slowly boiling the ocean, hoping that we don't notice or that we forget. I haven't forgotten.
Economical issues
A secondary effect of building all these data-centers is that demand for a bunch of resources goes up, and so does the price. This affects everyone.
We're not just seeing electricity and water being more scarce, we're also seeing memory and storage prices spiking hard as well. Forget about buying a new GPU, and just generally wait a couple of years with buying a new computer, or really any new gadgets.
How can this possibly not lead to a recession if things are allowed to continue?
And then we have the blatantly circular economy that the big AI players are running to try to convince the market this is actually profitable. In reality, very little actual money is changing hands; they're mostly just making promises to buy tech from each other in the future… Which brings us to the big one…
The bubble
Yeah, so it seems very likely we're currently in a bubble. We've been for a while, and this bubble is going to pop. The question is when and how.
Don't get me wrong; not all bubbles pop and erase everything with it. The dot-com bubble took years to pop, and we still have computers and the internet and all that jazz.
But we're currently overspending on infrastructure, and the companies selling that are currently raking in, and they are trying hard to make us all dependent on their technology.
For the last few years, the AI industry has slurped up most of the traditional technology investment capital available. The investors seem less and less interested in investing more money into the AI industry, and want return on their investments instead. So they have started turning to things like pension funds. If they get away with this, everyone is going to pay for this, regardless of their involvement in AI.
We've already been seeing the idea of "too big to fail" being thrown out there, mirroring what happened in the subprime mortgage crisis. We should, as a society, refrain from letting them do this. These problems are caused by the AI industry, not by us consumers. We shouldn't be the ones to bail them out when the time comes.
OpenAI's CFO has already suggested that the U.S. government should provide a $1.4 trillion "safety net" for AI investments, and while Sam Altman has since walked that back after public outcry, it shows that these companies are already thinking along those lines.
Keeping the brain active
On a more personal note, it's kinda undeniable: I'm getting older, and part of getting old means that I need to spend more time actively thinking about things to keep up. Letting an AI take the wheel, even just for the boring bits, doesn't help me; it only makes this worse. Keeping the brain sharp requires work, not assistance.
In fact, I often feel like I learn something useful even when I do mundane tasks. Asking an LLM to write up a Python script for me robs me of the learning that comes from doing it myself.
Add to that the data suggesting that we actually get less productive by using AI (while thinking we're more productive), and this all becomes very unappealing to me. My brain is my most important tool, and I'm not going to risk it because tech CEOs are yelling at the world that they need to use AI to prevent a recession.
Conclusion
You might have noticed that I don't really address the technical abilities of current AI technologies in this post. The reason is that I don't feel like I need to; it's kinda irrelevant.
I think the moral arguments against using AI for open source development are just too large to ignore. In fact, just the licensing and environmental issues alone would probably have been enough for me to draw a hard line in the sand here:
Using AI for open source projects is in my opinion immoral, and I will not be using it. I do not condone others using AI for anything in the open source ecosystem either. Using it is simply detrimental to our values and directly harms our community.
If you're currently playing around with AI out of curiosity for open source projects, I would like to ask you to reconsider. If you're working in a company that's encouraging AI usage, I would like to ask you to speak up against it. If you are involved in policy decisions for open source projects, I would like to encourage you to try your best to discourage AI adoption within those projects.
Our entire ecosystem is on the line here. Not just the open source ecosystem, but the entire, global ecosystem. And I feel there's not enough voices speaking up about it.
Make your voice heard! Allow yourself to be angry; there's enough nonsense going out there! We need to stop this madness.
26 Mar 2026 4:30pm GMT
23 Mar 2026
planet.freedesktop.org
Christian Schaller: Using AI to create some hardware tools and bring back the past
As I have talked about in a couple of blog posts now, I've been working a lot with AI recently as part of my day-to-day job at Red Hat, but I've also been spending a lot of evenings and weekends on this (sorry kids, pappa has switched to 1950s mode for now). One of the things I have spent time on is trying to figure out what the limitations of AI models are and what kind of use they can have for open source developers.
One thing to mention before I start talking about some of my concrete efforts: I have more and more come to the conclusion that AI is an incredible tool to hypercharge someone in their work, but I feel it tends to fall short for fully autonomous systems. In my experiments AI can do things many, many times faster than you ordinarily could, speaking specifically in the context of coding here, which is what is most relevant for those of us in the open source community.
One annoyance I have had for years as a Linux user is getting new hardware with features that are not easily available to me on Linux. So I have tried using AI to create such applications for some of my hardware, which includes an Elgato light and a Dell UltraSharp webcam.
I found with AI (and this is based on using Google Gemini, Claude Sonnet and Opus, and OpenAI Codex) that they all required me to direct and steer them continuously. If I let the AI just work on its own, more often than not it would end up going in circles, diverging from the route it was supposed to take, or taking shortcuts that made the wanted output useless. On the other hand, if I kept on top of the AI, intervened, and pointed it in the right direction, it could put things together for me in very short time spans.
My projects are also mostly what I would describe as end leaf nodes, the kind of projects that are already one-person projects in the community for the most part. There are extra considerations when contributing to bigger efforts, and a point I have seen made by others in the community too is that you need to own the patches you submit, meaning that even if an AI helped you write the patch, you still need to ensure that what you submit is in a state where it can be helpful and is mergeable. I know that some people feel that means you need to be capable of reviewing the proposed patch and ensuring it's clean and nice before submitting it, and I agree that if you expect your patch to get merged, that has to be the case. On the other hand, I don't think AI patches are useless even if you are not able to validate them beyond 'does it fix my issue'.
My friend and PipeWire maintainer Wim Taymans and I were talking a few years ago about what I described at the time as the problem of 'bad quality patches', long before AI-generated code was a thing. Wim's response, which I have often thought about since, was "a bad patch is often a great bug report". That holds true for AI-generated patches too. If someone makes a patch using AI, a patch they don't have the ability to review themselves, but they test it and it fixes their problem, it might function as a clearer bug report than just a written description from the user submitting the report. Of course, they should be clear in their bug report that they don't have the skills to review the patch themselves, but that they hope it can be useful as a tool for pinpointing what isn't working in the current codebase.
Anyway, let me talk about the projects I made. They are all found on my personal website, Linuxrising.org, a website that I also used AI to update after not having touched the site in years.
Elgato Light GNOME Shell extension
The first project I worked on is a GNOME Shell extension for controlling my Elgato Key Wifi Lamp. The Elgato lamp is basically meant for podcasters and people doing a lot of video calls to be able to easily configure light in their room to make a good recording. The lamp announces itself over mDNS, and thus can be controlled via Avahi. For Windows and Mac the vendor provides software to control their lamp, but unfortunately not for Linux.
There had been GNOME Shell extensions for controlling the lamp in the past, but they had not been kept up to date and their feature set was quite limited. Anyway, I grabbed one of these old extensions and told Claude to update it for the latest version of GNOME. It took a few iterations of testing, but we eventually got there, and I had a simple GNOME Shell extension that could turn the lamp off and on and adjust hue and brightness. This was a quite straightforward process because I had code that had been working at some point; it just needed some adjustments to work with the current generation of GNOME Shell.
Once I had the basic version done, I decided to take it a bit further and try to recreate the configuration dialog that the Windows application offers for the full feature set, which took quite a bit of back and forth with Claude. I found that if I ask Claude to re-implement from a screenshot, it recreates the layout of the user interface first, meaning that if the screenshot has 10 buttons, you get a GUI with 10 buttons. You then have to iterate both on the UI design, for example telling Claude that I want a dark UI style to match GNOME Shell, and on each bit of functionality in the UI. Most of the buttons in the UI didn't really do anything from the start, but when you go back and ask Claude to add specific functionality per button, it is usually able to do so.
Elgato Light Settings Application
So this was probably a fairly easy thing for the AI, because all the functionality of the lamp could be queried over Avahi; there were no 'secret' USB registers to be set or things like that.
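For anyone curious what controlling such a network light looks like on the wire: the Elgato Key Light family is generally driven through a small HTTP+JSON API. A minimal Rust sketch of building the request body follows; note that the endpoint, port, and field names are my assumptions from publicly documented behavior, not taken from the extension itself.

```rust
// Hypothetical sketch of the JSON body for an Elgato-style HTTP light
// API. The endpoint (/elgato/lights on port 9123) and field names are
// assumptions based on publicly documented behavior.
fn light_payload(on: bool, brightness: u8, temperature: u16) -> String {
    format!(
        "{{\"numberOfLights\":1,\"lights\":[{{\"on\":{},\"brightness\":{},\"temperature\":{}}}]}}",
        if on { 1 } else { 0 },
        brightness.min(100),          // brightness is a 0-100 percentage
        temperature.clamp(143, 344)   // color temperature in mireds
    )
}

fn main() {
    let body = light_payload(true, 50, 200);
    // A client would discover the lamp via mDNS/Avahi and then
    // PUT this body to http://<lamp-ip>:9123/elgato/lights.
    println!("{}", body);
}
```

The extension itself presumably wraps this in GNOME Shell's networking APIs; the sketch only shows the shape of the data.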
Since the application was meant to be part of the GNOME Shell extension, I didn't want it to have any dependency requirements that the Shell extension itself didn't have, so I asked Claude to write this application in JavaScript, and I have to say, so far I haven't seen any major differences in the AI's ability to generate different languages. The application now reproduces most of the functionality of the Windows application. Looking back, I think it probably took me a couple of days in total to put this tool together.
Dell UltraSharp 4K settings application for Linux
The second application on the list is a controller application for my Dell UltraSharp Webcam 4K UHD (WB7022). This is a high-end webcam that I have been using for a while, comparable to something like the Logitech BRIO 4K webcam. It has mostly worked since I got it with the generic UVC driver, and I have been using it for my Google Meet calls and similar, but since there was no native Linux control application, I could not easily access a lot of the camera's features. To address this I downloaded the Windows application installer, installed it under Windows, and took a bunch of screenshots showcasing all the features of the application. I then fed the screenshots into Claude and told it I wanted a GTK+ version of this application for Linux. I originally wanted to have Claude write it in Rust, but after hitting some issues in the PipeWire Rust bindings I decided to just use C instead.
It took me probably 3-4 days of intermittent work to get this application working, and Claude turned out to be really good at digging into Windows binaries and finding things like USB property values. Claude was also able to analyze the screenshots and figure out the features the application needed to have. Writing the application involved a lot of trial and error, but one way I was able to automate it was by building a screenshot option into the application, allowing it to programmatically take screenshots of itself. That allowed me to tell Claude to try fixing something and then check the screenshot to see if it worked, without me having to interact with the prompt. Also, to get the user interface looking nicer, once I had all the functionality in, I asked Claude to tweak the user interface to follow the GNOME Human Interface Guidelines, which greatly improved the quality of the UI.
At this point my application should have almost all the features of the Windows application. Since it is using PipeWire underneath, it is also tightly integrated with the PipeWire media graph, allowing you to see it connect and work with your applications in PipeWire patchbay applications like Helvum. The remaining features are software features of Dell's application, like background removal and so on, but I think that if I decide to implement those, it should be as a standalone PipeWire tool that can be used with any camera, not tied to this specific one.
The application shows Red Hat's offices around the world and includes links to the latest Red Hat news.
The next application on my list is called Red Hat Planet. It is mostly a fun toy, but I made it partly to revisit the Xtraceroute modernisation I blogged about earlier. As I mentioned in that blog, Xtraceroute, while cute, isn't really very useful IMHO, since on the modern internet your packets rarely jump around the world. Anyway, as people pointed out after I posted about the port, it wasn't an actual Vulkan application; it was a GTK+ application using the GTK+ Vulkan backend, and the globe animation itself was all software rendered.
I decided that if I was going to revisit the Vulkan problem, I wanted a different application idea than traceroute. The idea I had was once again a 3D rendered globe, but this one reading the coordinates of Red Hat's global offices from a file and rendering them on the globe, and alongside that providing clickable links to recent Red Hat news items. So once again maybe not the world's most useful application, but I thought it was a cute idea, and hopefully it would allow me to create it using actual Vulkan rendering this time.
Creating this turned out to be quite the challenge (although it seems to have gotten easier since I started this effort), with Claude Opus 4.6 being more capable at writing Vulkan code than Claude Sonnet, Google Gemini, or OpenAI Codex were when I started trying to create this application.
When I started this project I had to keep extremely close tabs on the AI and what it was doing in order to force it to keep working on this as a Vulkan application, as it kept wanting to simplify with software rendering or OpenGL, and sometimes would start down that route without even asking me. That hasn't happened more recently, so maybe that was a problem of the AI of five months ago.
I also discovered as part of this that rendering Vulkan inside a GTK4 application is far from trivial, and would ideally need the GTK4 developers to create such a widget to get rendering timings and similar correct. It is one of the few times I have had Claude outright say that writing a widget like that was beyond its capabilities (I haven't tried again, so I don't know if I would get the same response today). So I moved the application to SDL3 first, which worked in the sense that I got a spinning globe with red dots on it, but it came with its own issues, since SDL is not a UI toolkit as such. So while I got the globe rendered and working, the AI struggled badly with the news area when using SDL.
So I ended up trying to port the application to Qt, which again turned out to be non-trivial in terms of how much time it took, with trial and error, to get it right. In my mind I had a working globe using Vulkan; how hard could it be to move it from SDL3 to Qt? But there were a million rendering issues. In fact, I ended up using the Qt Vulkan rendering example as a starting point and then 'porting' the globe over bit by bit, testing each step, to finally get a working version. The current version is a Vulkan+Qt app and it basically works, although it seems the planet is not spinning correctly on AMD systems at the moment, while it seems to work well on Intel and NVIDIA systems.
WmDock fullscreen with config application.
This project came out of a chat with Matthias Clasen over lunch, where I mused about whether Claude would be able to bring the old Window Maker dockapps to GNOME and Wayland. Turns out the answer is yes, although the method of doing so changed as I worked on it.
My initial thought was for Claude to create a shim that the old dockapps could be compiled against, without any changes. That worked, but then I had a ton of dockapps showing up in things like the alt+tab menu. It also required me to restart my GNOME Shell session all the time as I was testing the extension housing the dockapps. In the end I decided that since a lot of the old dockapps don't work with modern Linux versions anyway, and thus would need to be actively ported, I should accept shipping the dockapps with the tool and port them to modern Linux technologies. This worked well and is what I currently have in the repo. I think the wildest port was porting the old dockapp webcam app from V4L1 to PipeWire, although updating the sound controller from ESD to PulseAudio was also a generational jump.
XMMS brought back to life
So the last effort was reviving the old XMMS media player. I had tried asking Claude to do this for months and it kept failing, but with Opus 4.6 it plowed through and had something working in a couple of hours, with no input from me beyond kicking it off. This was a big lift, moving it from GTK2 and esound to GTK4, GStreamer, and PipeWire. One thing I realized is that a challenge with bringing an old app back is that, since keeping the themeable UI is a big part of this specific application, adding new features is a little kludgy. Anyway, I did set it up to be able to use network speakers through PipeWire, and you can also import your Spotify playlists and play those, although you need to run the Spotify application in the background to be able to play the sound on your local device.
Monkey Bubble

Monkey Bubble was a game created in the heyday of GNOME 2, and while I always thought it was a well-made little game, it had never been updated to newer technologies. So I asked Claude to port it to GTK4 and use GStreamer for audio. This port was fairly straightforward, with Claude having few problems with it. I also asked Claude to add high scores using the libmanette library and network game discovery with Avahi. So some nice little improvements.
All the applications are available either as Flatpaks or Fedora RPMs through the GitLab project page, so I hope people enjoy these applications and tools. And enjoy the blasts from the past as much as I did.
Worries about Artificial Intelligence
When I speak to people both inside Red Hat and outside in the community, I often come across negativity or even sometimes anger towards Artificial Intelligence in the coding space. And to be clear, I too worry about where things could be heading and how it will affect my livelihood, so I am not unsympathetic to those worries at all. I probably worry about these things at least a few times a day. At the same time, I don't think we can hide from or avoid this change; it is happening with or without us. We have to adapt to a world where this tool exists, just like our ancestors adapted to jobs changing due to industrialization and science before us. So do I worry about the future? Yes, I do. Do I worry about how I might personally be affected by this? Yes, I do. Do I worry about how society might change for the worse due to this? Yes, I do. But I also remind myself that I don't know the future, that people have found ways to move forward before, and that society has survived and thrived. What I can control is staying on top of these changes myself and taking advantage of them where I can, and that is my recommendation to the wider open source community too: leverage them to move open source forward, while at the same time putting our weight on the scale towards the best practices and policies around Artificial Intelligence.
The next test, and where AI might have hit a limit for me
So all these previous efforts taught me a lot of tricks and helped me understand how I can work with an AI agent like Claude, but especially after the success with the webcam I decided to up the stakes and see if I could use Claude to help me create a driver for my Plustek OpticFilm 8200i scanner. I have zero background in any kind of driver development, and probably less than zero in the field of scanner drivers specifically. So I ended up going down a long row of dead ends on this journey, and to this day I have not been able to get a single scan out of the scanner that even remotely resembles the images I am trying to scan.
My idea was to have Claude analyse the Windows and Mac drivers and build me a SANE driver based on that, which turned out to be horribly naive and led nowhere. One thing I realized is that I would need to capture USB traffic to help Claude contextualize some of the findings it had from looking at the Windows and Mac drivers. I started out with Wireshark and feeding Claude the Wireshark capture logs. Claude quite soon concluded that the Wireshark logs weren't good enough and that I needed lower-level traffic capture. Buying a USB packet analyzer isn't cheap, so I had the idea that I could use one of the ARM development boards floating around the house as a USB relay, allowing me to perfectly capture the USB traffic. With some work I did manage to get my LibreComputer Solitude AML-S905D3-CC ARM board going and set it in device mode, and I had a usb-relay daemon running on the board. After a lot of back and forth, and even at one point trying to ask Claude to implement a missing feature in the USB kernel stack, I realized this would never work, and I ended up ordering a Beagle USB 480 hardware analyzer.
At about the same time I came across the chipset documentation for the Genesys Logic GL845 chip in the scanner. I assumed that between my new USB analyzer and the chipset docs this would be easy going from here on, but so far no. I even had Claude decompile the Windows driver using Ghidra and then try to extract the needed information from the decompiled code.
I bought a network-controlled electric outlet so that Claude can cycle the power of the scanner on its own.
So the problem here is that with zero scanner driver knowledge I don't even know what I should be looking for, or where I should point Claude, so I kept trying to brute force it by trial and error. I managed to make SANE detect the scanner, and I managed to get motor and lamp control going, but that is about it. I can hear the scanner motor running when I ask for a scan, but I don't know if it moves correctly. I can see light turning on and off inside the scanner, but I once again don't know if it is happening at the correct times and for the correct durations. And Claude of course has no way of knowing either, relying on me to tell it if something seems to have improved compared to how it was.
I have now used Claude to create two tools for Claude to use: one using a camera to detect what is happening with the light inside the scanner, and the other recording audio to compare the sound this driver makes with the sound of a working scan done with the macOS application. I don't know if this will take me to the promised land eventually, but so far I consider my scanner driver attempt a giant failure. At the same time, I do believe that if someone actually skilled in scanner driver development were doing this, they could have guided Claude to do the right things and probably would have had a working driver by now.
So I don't know if I hit the kind of thing that will always be hard for an AI to do, as it has to interact with things existing in the real world, or if newer versions of Claude, Gemini or Codex will suddenly get past a threshold and make this seem easy, but this is where things are at for me at the moment.
23 Mar 2026 4:07pm GMT
18 Mar 2026
Alberto Ruiz: Booting with Rust: Chapter 3
In Chapter 1 I gave the context for this project and in Chapter 2 I showed the bare minimum: an ELF that Open Firmware loads, a firmware service call, and an infinite loop.
That was July 2024. Since then, the project has gone from that infinite loop to a bootloader that actually boots Linux kernels. This post covers the journey.
The filesystem problem
The Boot Loader Specification expects BLS snippets in a FAT filesystem under loaders/entries/. So the bootloader needs to parse partition tables, mount FAT, traverse directories, and read files. All #![no_std], all big-endian PowerPC.
I tried writing my own minimal FAT32 implementation, then integrating simple-fatfs and fatfs. None worked well in a freestanding big-endian environment.
Hadris
The breakthrough was hadris, a no_std Rust crate supporting FAT12/16/32 and ISO9660. It needed some work to get going on PowerPC though. I submitted fixes upstream for:
- thiserror pulling in std: default features were not disabled, preventing no_std builds.
- Endianness bug: the FAT table code read cluster entries as native-endian u32. On x86 that's invisible; on big-endian PowerPC it produced garbage cluster chains.
- Performance: every cluster lookup hit the firmware's block I/O separately. I implemented a 4MiB readahead cache for the FAT table, made the window size parametric at build time, and improved read_to_vec() to coalesce contiguous fragments into a single I/O. This made kernel loading practical.
All patches were merged upstream.
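The endianness item is worth a concrete illustration: FAT stores each FAT32 entry on disk as a little-endian u32 with the top four bits reserved, so a correct reader must decode explicitly rather than reinterpreting bytes as a native integer. A small sketch of the correct decode (my own illustration, not the hadris code):

```rust
// FAT32 stores each FAT entry as a little-endian u32, with the top
// four bits reserved. Reading it as a native-endian u32 happens to
// work on x86 but yields garbage cluster chains on big-endian PowerPC;
// u32::from_le_bytes makes the byte order explicit and portable.
fn fat32_entry(fat: &[u8], cluster: u32) -> u32 {
    let off = cluster as usize * 4;
    let raw = u32::from_le_bytes([fat[off], fat[off + 1], fat[off + 2], fat[off + 3]]);
    raw & 0x0FFF_FFFF // mask the reserved top nibble
}

fn main() {
    // FAT region: entry 0 unused here, entry 1 = 0x0FFFFFF8 (end of
    // chain), stored on disk little-endian as F8 FF FF 0F.
    let fat: [u8; 8] = [0x00, 0x00, 0x00, 0x00, 0xF8, 0xFF, 0xFF, 0x0F];
    assert_eq!(fat32_entry(&fat, 1), 0x0FFF_FFF8);
}
```

The same decode produces identical results on both x86 and big-endian PowerPC, which is the whole point of the fix.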
Disk I/O
Hadris expects Read + Seek traits. I wrote a PROMDisk adapter that forwards to OF's read and seek client calls, and a Partition wrapper that restricts I/O to a byte range. The filesystem code has no idea it's talking to Open Firmware.
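The shape of such a wrapper can be sketched with std::io traits. This is my own illustration of the pattern, using std and a hypothetical Partition type for brevity; the real code is no_std and forwards to Open Firmware client calls instead of a Cursor:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

// Sketch of a partition wrapper restricting I/O to a byte range.
// The real bootloader forwards to OF's read/seek client calls; std::io
// is used here only to keep the example self-contained and runnable.
struct Partition<D> {
    inner: D,
    start: u64, // first byte of the partition on the disk
    len: u64,   // partition length in bytes
    pos: u64,   // current position relative to the partition start
}

impl<D: Read + Seek> Partition<D> {
    fn new(inner: D, start: u64, len: u64) -> Self {
        Partition { inner, start, len, pos: 0 }
    }
}

impl<D: Read + Seek> Read for Partition<D> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        // Clip the read so it never crosses the partition boundary.
        let remaining = self.len.saturating_sub(self.pos);
        let want = (buf.len() as u64).min(remaining) as usize;
        if want == 0 {
            return Ok(0);
        }
        self.inner.seek(SeekFrom::Start(self.start + self.pos))?;
        let n = self.inner.read(&mut buf[..want])?;
        self.pos += n as u64;
        Ok(n)
    }
}

impl<D: Read + Seek> Seek for Partition<D> {
    fn seek(&mut self, from: SeekFrom) -> Result<u64> {
        // Positions are relative to the partition, not the disk.
        self.pos = match from {
            SeekFrom::Start(o) => o,
            SeekFrom::Current(d) => (self.pos as i64 + d) as u64,
            SeekFrom::End(d) => (self.len as i64 + d) as u64,
        };
        Ok(self.pos)
    }
}

fn main() {
    use std::io::Cursor;
    // "Disk" with a 4-byte partition starting at offset 2.
    let disk = Cursor::new(vec![0, 0, b'F', b'A', b'T', b'!', 9, 9]);
    let mut part = Partition::new(disk, 2, 4);
    let mut buf = [0u8; 8];
    let n = part.read(&mut buf).unwrap();
    assert_eq!(&buf[..n], b"FAT!"); // reads are clipped to the partition
}
```

The filesystem layer only ever sees partition-relative offsets, which is exactly the "has no idea it's talking to Open Firmware" property described above.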
Partition tables: GPT, MBR, and CHRP
PowerVM with modern disks uses GPT (via the gpt-parser crate): a PReP partition for the bootloader and an ESP for kernels and BLS entries.
Installation media uses MBR. I wrote a small mbr-parser subcrate using explicit-endian types so little-endian LBA fields decode correctly on big-endian hosts. It recognizes FAT32, FAT16, EFI ESP, and CHRP (type 0x96) partitions.
The CHRP type is what CD/DVD boot uses on PowerPC. For ISO9660 I integrated hadris-iso with the same Read + Seek pattern.
Boot strategy? Try GPT first, fall back to MBR, then try raw ISO9660 on the whole device (CD-ROM). This covers disk, USB, and optical media.
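That fallback chain is simple enough to state as code. A toy sketch with illustrative names only; the real probing inspects on-disk signatures rather than booleans:

```rust
// Hypothetical sketch of the fallback order described above:
// try GPT, then MBR, then raw ISO9660 on the whole device.
#[derive(Debug, PartialEq)]
enum BootSource {
    Gpt,
    Mbr,
    Iso9660,
}

fn pick_boot_source(has_gpt: bool, has_mbr: bool, is_iso: bool) -> Option<BootSource> {
    if has_gpt {
        Some(BootSource::Gpt)
    } else if has_mbr {
        Some(BootSource::Mbr)
    } else if is_iso {
        Some(BootSource::Iso9660)
    } else {
        None
    }
}

fn main() {
    // A CD-ROM: no partition table at all, raw ISO9660 wins.
    assert_eq!(pick_boot_source(false, false, true), Some(BootSource::Iso9660));
    // A PowerVM disk: GPT wins even if a protective MBR is present.
    assert_eq!(pick_boot_source(true, true, false), Some(BootSource::Gpt));
}
```

The ordering matters because GPT disks carry a protective MBR; probing MBR first would misidentify them.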
The firmware allocator wall
This cost me a lot of time.
Open Firmware provides claim and release for memory allocation. My initial approach was to implement Rust's GlobalAlloc by calling claim for every allocation. This worked fine until I started doing real work: parsing partitions, mounting filesystems, building vectors, sorting strings. The allocation count went through the roof and the firmware started crashing.
It turns out SLOF has a limited number of tracked allocations. Once you exhaust that internal table, claim either fails or silently corrupts state. There is no documented limit; you discover it when things break.
The fix was to claim a single large region at startup (1/4 of physical RAM, clamped to 16-512 MB) and implement a free-list allocator on top of it with block splitting and coalescing. Getting this right was painful: the allocator handles arbitrary alignment, coalesces adjacent free blocks, and does all this without itself allocating. Early versions had coalescing bugs that caused crashes which were extremely hard to debug - no debugger, no backtrace, just writing strings to the OF console on a 32-bit big-endian target.
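The free-list idea can be modeled compactly. Below is a toy, safe-Rust model of the strategy (first-fit allocation, block splitting, and coalescing of adjacent blocks on free); the real allocator works on raw memory with alignment handling and without allocating for its own bookkeeping, which this sketch deliberately omits:

```rust
// Toy model of the free-list strategy (not the bootloader's unsafe,
// no_std allocator): one large region claimed up front, tracked as a
// sorted list of free (offset, len) ranges.
struct FreeList {
    free: Vec<(usize, usize)>, // (offset, len), sorted by offset
}

impl FreeList {
    fn new(region_len: usize) -> Self {
        FreeList { free: vec![(0, region_len)] }
    }

    // First-fit allocation with block splitting.
    fn alloc(&mut self, len: usize) -> Option<usize> {
        let i = self.free.iter().position(|&(_, l)| l >= len)?;
        let (off, l) = self.free[i];
        if l == len {
            self.free.remove(i); // exact fit: consume the whole block
        } else {
            self.free[i] = (off + len, l - len); // split: keep the tail
        }
        Some(off)
    }

    // Return a block, coalescing with adjacent free neighbors.
    fn release(&mut self, off: usize, len: usize) {
        let i = self
            .free
            .iter()
            .position(|&(o, _)| o > off)
            .unwrap_or(self.free.len());
        self.free.insert(i, (off, len));
        // Merge with the successor, then with the predecessor.
        if i + 1 < self.free.len() && off + len == self.free[i + 1].0 {
            self.free[i].1 += self.free[i + 1].1;
            self.free.remove(i + 1);
        }
        if i > 0 && self.free[i - 1].0 + self.free[i - 1].1 == off {
            self.free[i - 1].1 += self.free[i].1;
            self.free.remove(i);
        }
    }
}

fn main() {
    let mut fl = FreeList::new(1024);
    let a = fl.alloc(100).unwrap();
    let b = fl.alloc(100).unwrap();
    fl.release(a, 100);
    fl.release(b, 100); // coalesces back into one 1024-byte block
    assert_eq!(fl.free, vec![(0, 1024)]);
}
```

Coalescing is what makes this viable long-term: without it, repeated alloc/free cycles fragment the region until no large claim can succeed, which matches the hard-to-debug crashes described above.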
And the kernel boots!
March 7, 2026. The commit message says it all: "And the kernel boots!"
The sequence:
- BLS discovery: walk loaders/entries/*.conf, parse into BLSEntry structs, filter by architecture (ppc64le), sort by version using rpmvercmp.
- ELF loading: parse the kernel ELF, iterate PT_LOAD segments, claim a contiguous region, copy segments to their virtual address offsets, zero BSS.
- Initrd: claim memory, load the initramfs.
- Bootargs: set /chosen/bootargs via setprop.
- Jump: inline assembly trampoline (r3 = initrd address, r4 = initrd size, r5 = OF client interface), branch to kernel:
core::arch::asm!(
"mr 7, 3", // save of_client
"mr 0, 4", // r0 = kernel_entry
"mr 3, 5", // r3 = initrd_addr
"mr 4, 6", // r4 = initrd_size
"mr 5, 7", // r5 = of_client
"mtctr 0",
"bctr",
in("r3") of_client,
in("r4") kernel_entry,
in("r5") initrd_addr as usize,
in("r6") initrd_size as usize,
options(nostack, noreturn)
)
One gotcha: do NOT close stdout/stdin before jumping. On some firmware, closing them corrupts /chosen and the kernel hits a machine check. We also skip calling exit or release - the kernel gets its memory map from the device tree and avoids claimed regions naturally.
The boot menu
I implemented a GRUB-style interactive menu:
- Countdown: boots the default after 5 seconds unless interrupted.
- Arrow/PgUp/PgDn/Home/End navigation.
- ESC: type an entry number directly.
- e: edit the kernel command line with cursor navigation and word jumping (Ctrl+arrows).
This runs on the OF console with ANSI escape sequences. Terminal size comes from OF's Forth interpret service (#columns / #lines), with serial forced to 80×24 because SLOF reports nonsensical values.
Secure boot (initial, untested)
IBM POWER has its own secure boot: the ibm,secure-boot device tree property (0=disabled, 1=audit, 2=enforce, 3=enforce+OS). The Linux kernel uses an appended signature format - PKCS#7 signed data appended to the kernel file, same format GRUB2 uses on IEEE 1275.
I wrote an appended-sig crate that parses the appended signature layout, extracts an RSA key from a DER X.509 certificate (compiled in via include_bytes!), and verifies the signature (SHA-256/SHA-512) using the RustCrypto crates, all no_std.
The unit tests pass, including an end-to-end sign-and-verify test. But I have not tested this on real firmware yet. It needs a PowerVM LPAR with secure boot enforced and properly signed kernels, which QEMU/SLOF cannot emulate. High on my list.
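For readers unfamiliar with the appended-signature layout: the signed file ends with a fixed magic string, preceded by a small descriptor whose last field is the big-endian length of the PKCS#7 blob. The sketch below is my own illustration of parsing that trailer (field layout per the kernel's module_signature format), not the appended-sig crate itself:

```rust
// Illustration of locating an appended signature trailer, following
// the Linux module-signing layout (include/linux/module_signature.h):
// [payload][PKCS#7 blob][12-byte descriptor][magic string].
const MAGIC: &[u8] = b"~Module signature appended~\n";

/// Returns (payload_len, sig_len) if the buffer carries an appended
/// signature trailer, or None if the magic is absent or lengths are bad.
fn split_appended_sig(image: &[u8]) -> Option<(usize, usize)> {
    if image.len() < MAGIC.len() + 12 || !image.ends_with(MAGIC) {
        return None;
    }
    let desc_end = image.len() - MAGIC.len();
    let desc = &image[desc_end - 12..desc_end];
    // The last descriptor field is sig_len, stored big-endian.
    let sig_len = u32::from_be_bytes([desc[8], desc[9], desc[10], desc[11]]) as usize;
    let payload_len = desc_end.checked_sub(12 + sig_len)?;
    Some((payload_len, sig_len))
}

fn main() {
    // Build a fake signed image: 12-byte payload, 9-byte "signature".
    let mut image = b"kernel-bytes".to_vec();
    image.extend_from_slice(b"PKCS7....");
    let mut desc = [0u8; 12];
    desc[8..12].copy_from_slice(&9u32.to_be_bytes());
    image.extend_from_slice(&desc);
    image.extend_from_slice(MAGIC);
    assert_eq!(split_appended_sig(&image), Some((12, 9)));
}
```

The actual verification then hashes the payload portion and checks the PKCS#7 signature against the compiled-in certificate, as described above.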
The ieee1275-rs crate
The crate has grown well beyond Chapter 2. It now provides: claim/release, the custom heap allocator, device tree access (finddevice, getprop, instance-to-package), block I/O, console I/O with read_stdin, a Forth interpret interface, milliseconds for timing, and a GlobalAlloc implementation so Vec and String just work.
Published on crates.io at github.com/rust-osdev/ieee1275-rs.
What's next
I would like to test the Secure Boot feature in an end-to-end setup, but I have not gotten around to requesting access to a PowerVM LPAR. Beyond that I want to refine the menu. Another idea would be to support the equivalent of a Unified Kernel Image using ELF. Who knows! If anybody finds this interesting, let me know!
The source is at the powerpc-bootloader repository. Contributions welcome, especially from anyone with POWER hardware access.
18 Mar 2026 4:52am GMT
10 Mar 2026
Sebastian Wick: Redefining Content Updates in Wayland
The Wayland core protocol has described surface state updates the same way since the beginning: requests modify pending state, commits either apply that state immediately or cache it into the parent for synchronized subsurfaces. Compositors implemented this model faithfully. Then things changed.
Buffer Readiness and Compositor Deviation
The problem emerged from GPU work timing. When a client commits a surface with a buffer, that buffer might still have GPU rendering in progress. If the compositor applies the commit immediately, it could display incomplete content, causing glitches. If the compositor instead submits its own GPU work with a dependency on the unfinished client work, it risks missing the deadline for the next display refresh cycle and, even worse, stalling in some edge cases.
To get predictable timing, the compositor needs to defer applying commits until the GPU work finishes. This requires tracking readiness constraints on committed state.
Mutter was the first compositor to address this by implementing constraints and dependency tracking of content updates internally. Instead of immediately applying or caching commits, Mutter queued the changes in what we now call content updates, and only applied them when ready. Critically, this was an internal implementation detail. From the client's perspective, the protocol semantics remained unchanged. Mutter had deviated from the implementation model implied by the specification while maintaining the observable behavior.
New Protocols on Unstable Foundations
When we wanted better frame timing control and a proper FIFO presentation mode on Wayland, we suddenly required explicit queuing of content updates to describe the behavior of the protocols. You can't implement FIFO and scheduling of content updates without a queue, so both the fifo and commit-timing protocols were designed around the assumption that compositors maintain per-surface queues of content updates.
These protocols were implemented in compositors on top of their internal queue-based architectures, and added to wayland-protocols. But the core protocol specification was never updated. It still described the old "apply or cache into parent state" model, which has no notion of content updates or per-surface queues.
We now had a situation where the core protocol described one model, extension protocols assumed a different model, and compositors implemented something that sort of bridged both.
Implementation and Theory
That situation is not ideal: If the internal implementation follows the design the core protocol implies, you can't deal properly with pending client GPU work, and you can't properly implement the latest timing protocols. To understand and implement the per-surface queue model, you would have to read a whole bunch of discussions, and most likely an implementation such as the one in mutter. The implementations in compositors also evolved organically, making them more complex than they have to be. To make matters worse, we also lacked a shared vocabulary for discussing the behavior.
The obvious solution is to specify a general model of per-surface content update queues in the core protocol. Easier said than done, though. Coming up with a model that is sufficient to describe the new behavior, while also being compatible with the old behavior when no constraints defer a content update's application, was harder than I expected.
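As a rough illustration of that model, here is a toy simulation of a per-surface content-update queue. All names here are invented for illustration; real compositors implement this in C with far more detail.

```python
class Surface:
    """Toy model of the per-surface content-update queue: requests modify
    pending state, a commit turns pending state into a queued content
    update, and updates apply in order once their readiness constraints
    (e.g. "GPU work on the attached buffer finished") are satisfied."""

    def __init__(self):
        self.pending = {}   # state modified by requests before commit
        self.queue = []     # committed-but-not-yet-applied content updates
        self.current = {}   # applied state, what actually gets displayed

    def commit(self, constraints=()):
        # A commit snapshots the pending state into a content update.
        self.queue.append({"state": dict(self.pending),
                           "constraints": list(constraints)})
        self.pending.clear()
        self.flush()

    def flush(self):
        # Apply queued updates in order, stopping at the first update
        # whose constraints are not yet met (updates must stay ordered).
        while self.queue and all(c() for c in self.queue[0]["constraints"]):
            self.current.update(self.queue.pop(0)["state"])
```

With no readiness constraints, a commit applies immediately, which reproduces the old core-protocol behavior; with constraints attached, the update stays queued until they are satisfied, and later updates wait behind it.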
Together with Julian Orth, I managed to change the Wayland core protocol, and I wrote documentation about the system.
Recently Pekka Paalanen and Julian Orth reviewed the work, which allowed it to land. The updated and improved Wayland book should get deployed soon, as well.
The end result is that if you ever have to write a Wayland compositor, one of the trickier parts to get right should now be almost trivial. Implement the rules as specified, and things should just work. Edge cases are handled by the general rules rather than requiring special knowledge.
10 Mar 2026 10:56pm GMT
Harry Wentland: Plane Color Pipeline, CSC, 3D LUT, and KWin
A wild blog appears…
The Plane Color Pipeline API and KWin
A couple months ago the DRM/KMS Plane Color Pipeline API was merged after more than 2 years of work and deep discussions. Many people worked on it and it's nice to see it upstream. KWin and other compositors implemented support for it. I'll mainly focus on kwin here because that's what I use regularly and what I am most familiar with. I will also focus on AMD HW because that's what I'm working on.
On AMD HW with a kernel that includes the new Color Pipeline API, KWin enables HW composition for surfaces that update more than 20 times per second on a single enabled display. A few other things need to match as well. In particular, this means that running mpv with the default backend (--vo=gpu) will use HW composition for mpv's video surface alongside the rest of the desktop. The easiest way to observe this is with UMR, by running it in --gui mode and looking at the KMS tab.

In my examples UMR also gets HW composed, so this shows 4 planes:
- 1920x1200 desktop plane - XR24
- 720p video plane - AB48
- 720p UMR plane - XR24
- 256x256 cursor plane - AR24
The mpv framebuffer shows up as an AB48 buffer, not NV12. This is because the --vo=gpu backend in mpv performs any required color-space conversion, scaling, and tone-mapping, and then offers up a 16-bpc buffer to the Wayland compositor, which kwin passes to the DRM/KMS driver.
NV12/P010 Scanout
We can tell mpv to pass the raw YUV buffer (NV12 or P010) to kwin by using the --vo=dmabuf-wayland backend. This tells mpv to simply decode the video stream but leave the buffer alone. It then passes the buffer information to kwin via the Wayland color-management and color-representation protocol extensions.
When we do this we don't see a HW-composed plane in umr. KWin color-space converts, scales, tone-maps, and composes the plane via OpenGL. It can't offload it to display HW because the DRM color pipeline API doesn't yet support color-space conversion (CSC). The drm_plane does have COLOR_RANGE and COLOR_ENCODING properties to specify CSC, but they are deprecated with the color pipeline API.
So I went and implemented a CSC drm_colorop, added IGT tests and added support for it in kwin.
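For a sense of what such a CSC stage computes, here is a sketch of an 8-bit limited-range BT.709 YCbCr-to-RGB conversion, with the matrix coefficients derived from the standard's Kr/Kb constants. This is illustrative only; the HW implements it as a fixed-function matrix, not per-pixel Python.

```python
def bt709_limited_to_rgb(y, cb, cr):
    """Convert one 8-bit limited-range BT.709 YCbCr pixel to full-range RGB."""
    kr, kb = 0.2126, 0.0722            # BT.709 luma coefficients
    kg = 1.0 - kr - kb
    y_ = (y - 16) * (255.0 / 219.0)    # expand limited-range luma [16, 235]
    pb = (cb - 128) * (255.0 / 224.0)  # center and expand chroma [16, 240]
    pr = (cr - 128) * (255.0 / 224.0)
    r = y_ + 2.0 * (1.0 - kr) * pr
    b = y_ + 2.0 * (1.0 - kb) * pb
    g = y_ - (2.0 * kb * (1.0 - kb) * pb + 2.0 * kr * (1.0 - kr) * pr) / kg
    clamp = lambda v: max(0, min(255, round(v)))
    return clamp(r), clamp(g), clamp(b)
```

Limited-range black (16, 128, 128) maps to (0, 0, 0) and limited-range white (235, 128, 128) to (255, 255, 255).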
With this new CSC colorop we now see an NV12 buffer for our SDR video (I'm using a 1080p60 Big Buck Bunny clip).

Banding and 3DLUTs
Unfortunately we see some banding during the HW composed Big Buck Bunny playback:

This is the SW composed version, showing no problems:

I haven't yet debugged the banding. It seems to happen with one of the 1D LUTs.
But the AMD HW also has a 3D LUT and kwin lets us sample its entire internal color pipeline, so we can simply sample it at our 3DLUT coordinates and program it to HW. This allows us to represent any complex color pipeline with a single 3D LUT operation. The result is this.

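Conceptually, baking an arbitrary color pipeline into a 3D LUT is just sampling the pipeline at every grid coordinate, then interpolating between those samples at runtime. A rough sketch, using trilinear interpolation for simplicity (the names are mine, and the AMD HW interpolates tetrahedrally, as noted below):

```python
def sample_pipeline_to_lut(pipeline, size=17):
    """Bake an arbitrary color pipeline (a function RGB -> RGB, all channels
    in [0, 1]) into a size^3 lookup table by sampling every grid point."""
    n = size - 1
    return [[[pipeline(r / n, g / n, b / n) for b in range(size)]
             for g in range(size)]
            for r in range(size)]

def apply_lut_trilinear(lut, r, g, b):
    """Look up a color by interpolating between the 8 nearest LUT entries."""
    n = len(lut) - 1
    def split(v):
        x = min(max(v, 0.0), 1.0) * n
        i = min(int(x), n - 1)   # grid cell index
        return i, x - i          # index plus fractional position in the cell
    ri, rf = split(r); gi, gf = split(g); bi, bf = split(b)
    out = [0.0, 0.0, 0.0]
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                w = ((rf if dr else 1 - rf) *
                     (gf if dg else 1 - gf) *
                     (bf if db else 1 - bf))
                corner = lut[ri + dr][gi + dg][bi + db]
                for c in range(3):
                    out[c] += w * corner[c]
    return tuple(out)
```

However complicated the pipeline function is (CSC, tone-mapping, 1D LUTs chained together), the runtime cost collapses to this one interpolated lookup.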
HDR Video
In order to compose HDR content kwin creates a tone-mapper. By packing the entire color pipeline into a 3D LUT we don't need to worry about it and get support for HW composition of HDR content for free.

Note: AMD's 3D LUT uses 17 entries per dimension and interpolates tetrahedrally. This will give good results when applied in non-linear luminance space. KWin blends in non-linear space, and the input buffer is non-linear, so it works well here. While this gives good results, you might still observe minor differences, especially in brighter areas of the image. This can be observed when toggling between HW and SW composition in certain scenes.
Scaling
Because AMD's DCN HW uses a multi-tap scaler filter while kwin's SW composition uses GL_LINEAR, there are differences in scaling. It's apparent when doing 4-to-1 downscaling, such as when scaling this 4k (HDR) video down to 720p.
SW composed:

HW composed:

The former has stronger aliasing. The latter looks softer and more natural, in my opinion. The difference becomes much less pronounced when the image is downscaled less, e.g., from 4k to 1080p.
At this point there is no good API to help align the two. GL seems to only have GL_NEAREST and GL_LINEAR when not dealing with mipmaps. DRM/KMS provides "Default" and "Nearest Neighbor" via the SCALING_FILTER property. While this would let us use a nearest-neighbor filter on both paths, that's undesirable since it would make the image worse in both cases.
Seeing is believing
When I started this work I asked myself: How do I see whether a surface is a candidate for offloading? How do I see which actual surface is being offloaded?
To solve this I worked with Claude to create a plugin that marks surfaces and their offload status:

It's quite useful to see immediately which surfaces are candidates, and why they might fail to offload.
I've also added a new tab to dynamically toggle HW composition on and off. The screenshot shows toggles for the 3D LUT and tone-mapping; while we can add those as well, they don't always take effect as expected, so I left them out of the branch that's linked below. But the ability to toggle HW composition is quite powerful when debugging HW composition issues.

kwin branch on top of csc-3dlut branch
UMR DCN tab
DCN HW programming can be logged via the amdgpu_dm_dtn_log debugfs entry. But that log is quite extensive. It can be useful to show it in UMR with auto-update functionality, to see programmed settings immediately.

The code still needs a fair bit of work as this was the first time I used Claude for something extensive like this. I plan to post it eventually.
A Brief Note on LLMs
I used Claude Sonnet extensively for this work (it basically wrote all the code), so I thought it prudent to leave a couple thoughts on it.
LLMs are large language models, not actual artificial intelligence. They're language models, designed to work well with language; they're large, able to hold much more context than humans. Use them in ways that play to those strengths. I've found value in understanding complex code-bases, and in creating code that fits within those code-bases.
Don't stop owning your code. Even if it's produced by an LLM, take pride and ownership. That means: review what you get from an LLM. Be active in steering it. Don't throw trash at maintainers. Your name and reputation are on the line.
Next Steps
I'll be working with relevant communities and maintainers to attempt to upstream these things.
The CSC colorop is probably in good shape.
The KWin code requires feedback from maintainers. I expect it will need more work.
This also needs more testing. At times I see the 3D LUT fail to apply. I'm not sure whether this is a problem with my kwin code or amdgpu.
I don't see offload candidate surfaces from many applications where I'd expect to see them. This needs further analysis. For one, I'm unsure what happens with games. The other thing is YouTube in Firefox, which fails to present the video as an offload surface. Some other videos work fine in Firefox, in particular local video playback.
sneaky edit: and power measurements, of course, since that's the entire reason for this.
10 Mar 2026 12:00am GMT
21 Feb 2026
Simon Ser: Status update, February 2026
Hi all!
Lars has contributed an implementation-independent test suite for the scfg configuration file format. This is quite nice for implementors: they get a base test suite for free. I've added support for it in libscfg, the C implementation.
I've spent some time working on the go-proxyproto library. While adding support for PP2_SUBTYPE_SSL_CLIENT_CERT (a PROXY protocol addition I introduced last month to carry the TLS client certificate), I've fixed large PROXY protocol headers being rejected (TLS certificates can be a few kilobytes), fixed some issues in the test suite, and improved the HTTP/2 helper. I've merged support for PP2_SUBTYPE_SSL_CLIENT_CERT in tlstunnel, soju and kimchi.
Speaking about soju, delthas and taiite have finished up soju.im/client-cert, a new IRC extension to manage TLS client certificates. Clients can register, unregister, and list TLS client certificates, which can be used for authentication for the logged-in user. We aim to stop storing plaintext passwords, instead generating a fresh TLS certificate when logging in for the first time and storing its key. Nobody has started working on a Goguma patch yet, but that would be nice!
Goguma now has a brand new shiny website! Many thanks to Jean THOMAS for building it from the ground up. delthas has added a /invite command, I've added support for removing reactions (via the new unreact message tag), and I've experimented with a Web build of the app (just for fun, with WebSockets connections instead of TCP).
kanshi v1.9 has been released. This new version is the first to leverage vali for Varlink support. The new ...output directive can match any number of outputs, and the new mode preferred output directive can be used to select the mode marked as preferred by the kernel.
I've resumed work on oembed-proxy, a small server which generates oEmbed previews for arbitrary URLs. It's quite simple: send an HTTP request with a URL, it replies with a JSON payload with metadata such as page title, image size, and so on. I plan to use it for IRC clients, to show link previews without leaking the client's IP address and to make them work on Web clients. I've added support for Open Graph, the most widely used scheme to attach structured data to Web pages. I ended up linking with ffmpeg because I figured I would need to eventually generate thumbnails for images and videos. I played a bit with CGo to integrate Go's streaming io.Reader with ffmpeg's C API. I had to jump through a few hoops, but it works!
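As a sketch of the Open Graph part: extracting og: properties from a page mostly boils down to collecting <meta property="og:..."> tags. A minimal stdlib-only example, not oembed-proxy's actual code (which is Go):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags, the most
    widely used scheme for attaching structured preview data to pages."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "")
        if prop.startswith("og:") and "content" in a:
            # Store e.g. og:title under the key "title".
            self.og[prop[3:]] = a["content"]

page = """<html><head>
<meta property="og:title" content="Big Buck Bunny">
<meta property="og:image" content="https://example.com/poster.jpg">
</head><body></body></html>"""
p = OpenGraphParser()
p.feed(page)
```

A real preview service additionally needs charset handling, size limits on the fetched body, and fallbacks to `<title>` when no Open Graph data is present.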
Hiroaki Yamamoto has contributed wlroots support for ext-workspace-v1, and Félix Poisot has upgraded color-management-v1 to minor version 2. Félix also uncovered some holes in our explicit synchronization implementation - we're in the process of fixing these up now. I've started the wlroots release candidate cycle, and I just published RC3 today.
I've spent quite some time improving go-kdfs, a Go library for the Khronos Data Format Specification. KDFS defines a standard file format to describe how pixels are laid out in memory and how their contents should be interpreted. I've added a bunch of new pixel formats, JSON output for the CLI, unit tests against dfdutils, and a lot of other smaller improvements. I've written a wlroots patch to remove a bunch of manually written pixel format tables and replace them with auto-generated tables from go-kdfs. I've also added sample positions to pixfmtdb, a Web frontend for go-kdfs (see for instance the Y samples on the DRM_FORMAT_NV12 page). Next up, I'd like to add missing features to the kdfs compat command so that wlroots can get rid of all of its tables (better endianness support, and flags to specify/strip some information such as the alpha channel, color primaries or transfer function).
I'm quite happy with all of the good stuff we've managed to get over the fence this month! See you in March!
21 Feb 2026 10:00pm GMT
20 Feb 2026
Christian Gmeiner: GLES3 on etnaviv: Fixing the Hard Parts
This is the start of a series about getting OpenGL ES 3.0 conformance on Vivante GC7000 hardware using the open-source etnaviv driver in Mesa. Thanks to Igalia for giving me the opportunity to spend some time on these topics.
Where We Are
etnaviv has supported GLES2 on Vivante GPUs for a long time. GLES3 support has been progressing steadily, but the remaining dEQP failures are the stubborn ones - the cases where the hardware doesn't quite do what the spec says, and the driver has to get creative.
20 Feb 2026 12:00am GMT
13 Feb 2026
Dave Airlie (blogspot): drm subsystem AI patch review
This topic came up at kernel maintainers summit and some other groups have been playing around with it, particularly the BPF folks, and Chris Mason's work on kernel review prompts[1] for regressions. Red Hat have asked engineers to investigate some workflow enhancements with AI tooling, so I decided to let the vibecoding off the leash.
My main goal:
- Provide AI led patch review for drm patches
- Don't pollute the mailing list with them at least initially.
This led me to wanting to use lei/b4 tools, and public-inbox. If I could push the patches with message-ids and the review reply to a public-inbox I could just publish that and point people at it, and they could consume it using lei into their favorite mbox or browse it on the web.
I got claude to run with this idea, and it produced a project [2] that I've been refining for a couple of days.
I started out trying to use Chris' prompts, but screwed that up a bit due to sandboxing; then I started iterating on them and diverged.
The prompts are very much directed at regression testing and single-patch review: the patches get applied one-by-one to the tree, and the top patch gets exhaustive regression testing. I realised I probably can't afford this, and it's also not exactly what I want.
I wanted a review of the overall series, but also a deeper per-patch review. I didn't really want to have to apply the patches to a tree, as it's often difficult to figure out the base tree for drm patches. I did want to give claude access to a drm-next tree so it could try to apply the patches; if that worked it might enhance the review, and if not it would fall back to just using the tree as a reference.
Some holes claude fell into: when run in batch mode, claude has limits on the turns it can take (opening patch files, opening kernel files for reference, etc.), and giving it a large context can sometimes not leave it enough space to finish reviews on large patch series. It tried to inline patches into the prompt before I pointed out that would be bad, and it tried to follow the review instructions and open a lot of drm files, which ran out of turns. In the end I asked it to summarise the review prompts with some drm-specific bits, and produce a working prompt. I'm sure there is plenty of tuning left to do with it.
Anyways I'm having my local claude run the poll loop every so often and processing new patches from the list. The results end up in the public-inbox[3], thanks to Benjamin Tissoires for setting up the git to public-inbox webhook.
I'd like for patch submitters to use this for some initial feedback, but it's also something you should feel free to ignore. That said, if we find regressions in the reviews and they've been ignored, I'll start suggesting it more strongly. I don't expect reviewers to review it unless they want to. It was also suggested that I could fold review replies, as they happen, into another review; this might have some value, but I haven't written it yet. If there are replies at the time of the initial review of a patch it will parse them, but it won't do so later.
[1] https://github.com/masoncl/review-prompts
[2] https://gitlab.freedesktop.org/airlied/patch-reviewer
[3] https://lore.gitlab.freedesktop.org/drm-ai-reviews/
13 Feb 2026 6:56am GMT
Christian Gmeiner: My first Vulkan extension
After years of working on etnaviv - a Gallium/OpenGL driver for Vivante GPUs - I've been wanting to get into Vulkan. As part of my work at Igalia, the goal was to bring VK_EXT_blend_operation_advanced to lavapipe. But rather than going straight there, I started with Honeykrisp - the Vulkan driver for Apple Silicon - as a first target: a real hardware driver to validate the implementation against before wiring it up in a software renderer. My first Vulkan extension, and my first real contribution to Honeykrisp.
13 Feb 2026 12:00am GMT
10 Feb 2026
Adam Jackson: now you're footgunning with gas!
If you haven't heard of gastown it is my sincere pleasure to be the one to fix that. If you're like me and you have way more ideas than time or ability to type them, gastown is an absolute game changer. I haven't felt this jazzed about programming in decades, like, I'm using antiquated slang like "jazzed" without embarrassment.
Well. Not embarrassment about that.
I awoke to a polite note from a colleague saying I had apparently pushed a bunch of random branches to the upstream Mesa repo and are we sure I wasn't hacked. No, I wasn't, those branches were definitely things I was working on, but I had been working on them locally, nothing should have even been pushed to my personal gitlab repo. I was using gastown to do that work so obviously we start looking there...
Gastown is built on beads and beads is built on git. Every atom of work within a gastown has a bead, which means updates to those beads are absolutely critical: if they don't happen, the town chugs to a halt. Gastown is also built on claude, or whatever, but claude is what I was using. Not to pick on claude, here, just using it as a generic brand name, claude is too polite to be a robust tool, sometimes. It'll lose some critical bit of context and want to stop and ask for directions. For a coding assistant that's a great way to be; for a code factory it's less awesome. It is slightly comic to read how much of gastown is just different ways of exhorting a lilypond of claudes to please do their work, please.
In gastown you wrap an upstream project in a rig, and I'm an old so I have all my git clones of everything already in ~/git, so I had set up my rigs to use those as the local storage so I didn't have to wait for things to clone again. The town mayor or one of his underlings dispatches work to ephemeral claude instances by writing the orders down in a bead, the work happens in the worker instance's git worktree instance of the rig. I would use gastown for the automagic git worktree management alone, forget the rest of the automation.
But here's where the projectile weapon starts pointing towards the positive gradient of the ol' G field. One way to try to help when claude loses context is to put important orientation information into CLAUDE.md, and helpfully, the /init command will build that for you by inspecting the current project. So I ran that at the top level of my gastown so it, gastown, wouldn't have to keep parsing "gt --help" output just to rediscover how to update a bead, hopefully. In doing so, claude lifted that directive about bead updates being seriously no-kidding mandatory up to the top level.
From that point on, claude instances would interpret that directive to apply to the project in the rig! So now, work isn't done until the work-in-progress branch is pushed. Until it is pushed to origin. Which, too bad that you cloned it from https, I'm going to discover the ssh URL to your personal mesa repo from the local repo's git config and change the URL in the rig to use ssh, because I was told to resolve the push failure or else.
So, cautionary tale, right? Maybe determinism in your tools still has value. Maybe plaintext English isn't the best idea for a configuration language. Maybe agent prompts need to be extra careful about context. Maybe careful sandbox construction would mitigate that kind of escape. Maybe an open source agent would be more trustworthy in terms of configurable tool usage since you would actually be able to see and control the boundary instead of just trusting that there is a boundary at all.
10 Feb 2026 5:06pm GMT
04 Feb 2026
Dave Airlie (blogspot): nouveau: a tale of two bugs
Just to keep up some blogging content, I'll recount where I spent (or wasted) time over the last couple of weeks.
I was working on two nouveau kernel bugs in parallel (in between whatever else I was doing).
Bug 1: Two or three weeks ago, Lyude identified that the RTX6000 Ada GPU wasn't resuming from suspend. I plugged mine in and indeed it wasn't. It turned out this has been broken since we moved to 570 firmware. We started digging down various holes on what changed, and sent NVIDIA debug traces to decode for us. NVIDIA identified that suspend was actually failing but the result wasn't getting propagated up. At least the opengpu driver was working properly.
I started writing patches for all the various differences between nouveau and opengpu in terms of what we send to the firmware, but none of them were making a difference.
I took a tangent, and decided to try and drop the latest 570.207 firmware into place instead of 570.144. NVIDIA have made attempts to keep the firmware in one stream more ABI stable. 570.207 failed to suspend, but for a different reason.
It turns out GSP RPC messages have two levels of sequence numbering: one on the command queue, and one on the RPC. We weren't filling in the RPC one, and somewhere in the later 570s someone found a reason to care. It also turned out that whenever we boot on 570 firmware we get a bunch of async messages from GSP with the word ASSERT in them and no additional info. At least some of those messages were due to our missing sequence numbers, and fixing that stopped them.
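A toy model of the two-level numbering: each command-queue element carries its own sequence number, while one RPC (which may span several queue elements) carries a single RPC-level number. All names here are invented for illustration; the real GSP firmware interface looks nothing like this.

```python
class GspRpcQueue:
    """Toy model of two-level sequence numbering on GSP RPC messages."""

    def __init__(self):
        self.cmdq_seq = 0  # per-queue-element counter
        self.rpc_seq = 0   # per-RPC counter (the one that was left unfilled)

    def send(self, function, nchunks=1):
        """Emit the queue elements for one RPC, split into nchunks pieces."""
        rpc = self.rpc_seq
        self.rpc_seq += 1
        msgs = []
        for _ in range(nchunks):
            # Every chunk gets a fresh command-queue number, but all chunks
            # of the same RPC share the same RPC-level number.
            msgs.append({"function": function,
                         "cmdq_seq": self.cmdq_seq,
                         "rpc_seq": rpc})
            self.cmdq_seq += 1
        return msgs
```

Leaving rpc_seq at zero everywhere is indistinguishable from correct behavior until the firmware starts checking it, which is apparently what happened in the later 570 releases.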
And then? It still didn't suspend/resume. I dug into memory allocations and framebuffer suspend/resume allocations, until Milos on Discord asked whether I had confirmed the INTERNAL_FBSR_INIT packet was the same. Indeed it wasn't. There is a flag, bEnteringGCOff, which you set if you are entering the graphics-off suspend state. However, for normal suspend/resume (as opposed to runtime suspend/resume), we shouldn't tell the firmware we are going to gcoff, for some reason. Fixing that fixed suspend/resume.
While I was head down on fixing this, the bug trickled up into a few other places and I had complaints from a laptop vendor and RH internal QA all lined up when I found the fix. The fix is now in drm-misc-fixes.
Bug 2: A while ago Mary, a nouveau developer, enabled larger-pages support in the kernel/mesa for nouveau/nvk. This enables a number of cool things like compression and gives good speedups for games. However Mel, another nvk developer, reported random page faults running the Vulkan CTS with large pages enabled. Mary produced a workaround which would have violated some locking rules, but it showed that there was some race in the page table reference counting.
NVIDIA GPUs post-Pascal have a concept of a dual page table. At the 64k level you can have two tables, one with 64K entries and one with 4K entries, and the addresses of both are put in the page directory. The hardware then uses the state of entries in the 64k table to decide what to do with the 4k entries. nouveau creates these 4k/64k tables dynamically and reference counts them. However, the nouveau code was written pre-VMBIND, and fully expected the operation ordering to be reference/map/unmap/unreference; we would always do a complete cycle on 4k before moving to 64k and vice versa. VMBIND, though, means we delay unrefs to a safe place, which might be after later refs happen. Fun orderings like ref 4k, map 4k, unmap 4k, ref 64k, map 64k, unref 4k, unmap 64k, unref 64k can occur, and the code just wasn't ready to handle them: an unref on 4k would sometimes overwrite the entry in the 64k table to invalid, even when it was valid. This took a lot of thought and 5 or 6 iterations on ideas before we stopped seeing failures. In the end the main things were to reference count the 4k and 64k tables separately, and to have the last thing to do a map operation own the 64k entry, which should conform to how userspace uses this interface.
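The fix can be modeled in a few lines: keep separate refcounts for the 4k and 64k tables, and let the last map operation own the shared directory entry, so a delayed unref on one granularity can't clobber a live mapping on the other. All names here are invented; the real code lives in nouveau's MMU layer.

```python
class DualPageTable:
    """Toy model of nouveau's dual 4k/64k page table fix."""

    def __init__(self):
        self.refs = {"4k": 0, "64k": 0}  # separate refcounts per granularity
        self.owner = None                # which granularity last mapped
        self.pde_valid = False           # shared page-directory entry state

    def ref(self, size):
        self.refs[size] += 1

    def map(self, size):
        self.owner = size       # the last map operation owns the entry
        self.pde_valid = True

    def unmap(self, size):
        if self.owner == size:  # only the owner may invalidate the entry
            self.pde_valid = False

    def unref(self, size):
        self.refs[size] -= 1
        # Dropping the last 4k reference must not clobber the shared entry
        # if a 64k mapping took ownership in the meantime (and vice versa).
        if self.refs[size] == 0 and self.owner == size:
            self.pde_valid = False
            self.owner = None
```

Running the problematic VMBIND-era ordering through this model shows why ownership matters: the delayed "unref 4k" arrives while the 64k mapping is live, and with per-granularity refcounts plus ownership it leaves the entry intact.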
The fixes for this are now in drm-misc-next-fixes.
Thanks to everyone who helped, Lyude/Milos on the suspend/resume, Mary/Mel on the page tables.
04 Feb 2026 9:04pm GMT
30 Jan 2026
Natalie Vock: Inside Mesa 26.0’s RADV RT improvements
Mesa 26.0 is big for RADV's ray tracing. In fact, it's so big it single-handedly revived this blog.
There are a lot of improvements to talk about, and some of them were in the making for a little over two years at this point.
In this blog post I'll focus on the things I myself worked on specifically, most of which revolve around how ray tracing pipelines are compiled and dispatched. Of course, there's more than just what I did myself: Konstantin Seurer worked on a lot of very cool improvements to how we build BVHs, the data structure that RT hardware uses for the triangle soup making up the geometry in game scenes so the HW can trace rays against them efficiently.
RT pipeline compilation
The rest of this blog post will assume some basic idea of how GPU ray tracing and ray tracing pipelines work. I've written about this in more detail one and a half years ago, in my blog post about RT pipelines being enabled by default.
Let's take a bit of a closer look at what I said about RT pipelines in RADV back then. In a footnote, I said:
Any-hit and Intersection shaders are still combined into a single traversal shader. This still shows some of the disadvantages of the combined shader method, but generally compile times aren't that ludicrous anymore.
I spent a significant amount of time in that blogpost detailing how there tend to be a really large number of shaders, and how combining them into a single megashader is very slow because shader sizes get genuinely ridiculous at that point.
So clearly, it was only a matter of time until the any-hit/intersection shader combination would blow up spectacularly on a spectacular number of shaders, as well.
So there's this thing called Unreal Engine
For illustrating the issues with inlined any-hit/intersection shaders, I'll use Unreal Engine as an example because I noticed it being particularly egregious here. This definitely was an issue with other RT games/workloads as well, and function calls will provide improvements there too.
There's a lot of people going around making fun of Unreal Engine these days, to the point of entire social media presences being built around mocking the ways in which UE is inefficient, slow, badly-designed bloatware and whatnot. Unfortunately, the most popular critics often know the least about what they're actually talking about. I feel compelled to point out here that while there certainly are reasonable complaints to be raised about UE and games made with it, I explicitly don't want this section (or anything else in this post, really) to be misconstrued as "UE doing a bad thing". As you'll see, Unreal is really just using the RT pipeline API as designed.
With the disclaimer aside, what does Unreal actually do here that made RADV fall over so hard?
Let's talk a bit about how big game engines handle shading and materials. As you'll probably know already, to calculate how lighting interacts with objects in a scene, an application will usually run small programs called "shaders" on the GPU that, among other things1, calculate the colors different pixels have according to the material at that pixel.
Different materials interact with light differently, and in a large world with tons of different materials, you might end up having a ton of different shaders.
In a traditional raster setup you draw each object separately, so you can compile a lot of graphics pipelines for all of your materials, and then bind the correct one whenever you draw something with that material.
However, this approach falls apart in ray tracing. Rays can shoot through the scene randomly and they can hit pretty much any object that's loaded in at the moment. You can only ever use one ray tracing pipeline at once, so every single material that exists in your scene and may be hit by a ray needs to be present in the RT pipeline. The more materials a game has, the more ludicrous the number of shaders gets.
Usually, this is most relevant for closest-hit shaders, because these are the shaders that get called for the object hit by the ray (where shading needs to be calculated). However, depending on your material setup, you may have something like translucent materials - where parts of the material are "see-through", and rays should go through these parts to reveal the scene behind it instead of stopping.
This is where any-hit shaders come into play - any-hit shaders can instruct the driver to ignore a ray hitting a geometry, and instead keep searching for the next hit. If you have a ton of (potentially) translucent materials, that would translate into a lot of any-hit shaders being compiled for these materials.
The design of RT pipelines is quite obviously written in a way that accounts for this. In the previous blogpost I already mentioned pipeline libraries - the idea is that a material could just be contained in a "library", and if RT pipelines want to use it, they just need to link to the library instead of compiling the shading code all over again. This also allows for easy addition/removal of materials: Even though you have to re-create the RT pipeline, all you need to do is link to the already compiled libraries for the different materials.
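The economics of that design can be modeled in a few lines: compilation happens once per material library, and building an RT pipeline is just a cheap link step over already-compiled libraries. This is a toy sketch of the concept, not the Vulkan pipeline-library API.

```python
class ShaderLibrary:
    """Toy model of an RT pipeline library: compiling a material's shader
    is the expensive step, and it happens exactly once per material."""
    compiles = 0  # global counter of expensive compilations

    def __init__(self, material):
        self.material = material
        ShaderLibrary.compiles += 1

def link_pipeline(libraries):
    """Cheap step: build a pipeline by referencing compiled libraries,
    with no recompilation of any material."""
    return [lib.material for lib in libraries]

libs = [ShaderLibrary(m) for m in ["wood", "glass", "metal"]]
pipe1 = link_pipeline(libs)
libs.append(ShaderLibrary("water"))  # adding a material: one extra compile
pipe2 = link_pipeline(libs)          # relinking reuses the other three
```

If the driver instead recompiles a combined traversal shader at link time, every call to link_pipeline silently becomes as expensive as compiling all the materials again, which is exactly the stutter described below.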
UE, particularly UE4, is a heavy user of libraries, which makes a lot of sense: It maps very well to what it's trying to achieve. Everything's good, as long as the driver doesn't do silly things.
Silly things like, for example, combining any-hit shaders into one big traversal shader.
Doing something like that pretty much entirely side-steps the point of libraries. The traversal shader can only be compiled when all any-hit shaders are known, which is only at the very final linking step, which is supposed to be very fast…
And if UE4, assuming the linking step is very fast, does that re-linking over and over, very often, what you end up with is horrible pipeline compilation stutter every few seconds. And in this case, it's not really UE's fault, even! Sorry for that, Unreal.
Why can't we just compile any-hit/intersection separately?
Clearly, inlining all the any-hit and intersection shaders won't work. So why not just compile them separately?
To answer that, I'll start by explaining some assumptions that lie at the base of RADV's shader compilation. When ACO (and NIR, too) were written, shaders were usually incredibly simple. They had some control flow (ifs, loops and whatnot), but all the code that would ever execute was contained in one compact program executing top-to-bottom. This perfectly matched what graphics/compute shaders looked like in the APIs, and what the API does is what you want to optimize for.
Unfortunately, this means RADV's shader compilation stack got hit extra hard by the paradigm shift introduced by RT pipelines. Dynamic linking of different programs, and calls across the dynamic link boundaries, is something common in CPU programming languages (C/C++, etc.), but Mesa never really had to deal with something like that before2.
One specific core assumption that prevents us from compiling any-hit/intersection shaders separately just like that is that every piece of code assumes it has exclusive and complete access to things like registers and other hardware resources. Comparing to CPUs again, most of the program code is contained in some functions, and those functions will be called from somewhere else3. The callers of those functions will have used CPU registers and stack memory and so on beforehand, and code inside a called function can't write to just any CPU register, or any location on the stack. Which registers are writable by a function and which ones must have their values preserved (so that the function's callers can store values of their own there without them being overwritten) are governed by little specifications called "calling conventions".
In Mesa, the shader compiler generally used to have no concept of calling conventions, or a concept of "calling" something, for that matter. There was no concept of a register having some value from a function caller and needing to be preserved - if a register exists, the shader might end up writing its own value to it. In cases of graphics/compute shaders, this wasn't a problem - the registers only ever had random uninitialized values in them.
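To make this concrete, here's a toy register machine in Python. Every name here is invented for illustration - this is not Mesa/ACO code. A calling convention partitions registers into scratch registers a callee may clobber freely and preserved registers it must restore before returning:

```python
# Toy calling convention: invented register names, not ACO's real ones.
SCRATCH = {"v0", "v1"}     # the callee may clobber these freely
PRESERVED = {"s0", "s1"}   # the callee must restore these before returning

def call(regs, callee):
    # Stand-in for the callee's prologue/epilogue: back up the preserved
    # registers, run the callee's body, then restore them.
    saved = {r: regs[r] for r in PRESERVED}
    callee(regs)
    regs.update(saved)

def material_shader(regs):
    regs["v0"] = 42   # fine, v0 is scratch
    regs["s0"] = 99   # without the prologue/epilogue backup above, this
                      # would smash a value the caller still needs

regs = {"v0": 0, "v1": 0, "s0": 7, "s1": 8}
call(regs, material_shader)
```

The payoff is that the caller can keep long-lived values in preserved registers across the call without spilling them to memory - which, as we'll see below, is exactly the property an any-hit ABI needs.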
This has always been a problem for separately compiling shaders in RT pipelines, but we had a different solution: At every point a shader called another shader, we'd split the shader in half: One half containing everything before the call, and the other half containing everything after. Of course, sometimes the second half needed variables coming from the first half of the shader. All these variables would be stored to memory in the first half. Then, the first half ends, and execution jumps to the called shader. Once the end of the called shader is reached, execution returns to the second half.
This was good enough for things like calling into traceRay to trace a ray and execute all the associated closest hit/miss shaders. Usually, applications wouldn't have that many variables needing to be backed up to memory, and tracing a ray is supposed to be expensive.
But that concept completely breaks down when you apply it to any-hit shaders. At the point an any-hit shader is called, you're right in the middle of ray traversal. Ray traversal has lots of internal state variables that you really want to keep in registers at all times. If you call an any-hit shader with this approach, you'd have to back up all of these state variables to memory and reload them back afterwards. Any-hit shaders are supposed to be relatively cheap and called potentially lots of times during traversal. All these memory stores and reloads you'd need to insert would completely ruin performance.
So, separately compiling any-hit shaders was an absolute no-go. At least, unless someone were to go off the deep end and change the entire compiler stack to fix the assumptions at their heart.
"So, where have you been the last two years?"
I went and changed more or less the entire compiler stack to fix these assumptions and introduce proper function calls.
The biggest part of this work by far were the absolute basics. How do we best teach the compiler that certain registers need to be preserved and are best left alone? How should the compiler figure out that something like a call instruction might randomly overwrite other registers? How do we represent a calling convention/ABI specification in the driver? All of these problems can be tackled with different approaches and at different stages of compilation, and nailing down a clean solution is pretty important in a rework as fundamental as this one.
I started out by applying function calls to the shaders that were already separately compiled - this means that the function call work itself didn't improve performance by too much, but in retrospect I think it was a very good idea to make sure the baseline functionality was rock-solid before moving on to separately compiling any-hit shaders.
Indeed, once I finally got around to adding the code that splits out any-hit/intersection shaders and uses function calls for them, things worked nearly out of the box! I opened the associated merge request a bit over two weeks ago and got everything merged within a week. (Of course, I would never have gotten it in that fast without all the reviewers teaming up to get everything in ASAP! Big thank you to Daniel, Rhys and Konstantin)
In comparison, I started work on function calls in January of 2024 and got the initial code in a good enough shape to open a merge request in June that year, and the code only got merged on the same day I opened the above merge request, two years after starting the initial drafting (although to be fair, that merge request also had periods of being stalled due to personal reasons).
Shader compilation with function calls
Function calls make shader compilation work in an arguably much more straightforward way. For the most part, the shader just gets compiled like any other - there's no fancy splitting or anything going on. If a shader calls another shader, like when executing traceRay or when calling an any-hit shader, a call instruction is generated. When the called shader finishes, execution resumes after the call instruction.
All the magic happens in ACO, the compiler backend. I've documented the more technical design of how calls and ABIs are represented in a docs article. At first, call instructions in the NIR IR are translated to a p_call "pseudo" instruction. It's not actually a hardware instruction, but serves as a placeholder for the eventual jump to the callee. This instruction also carries information about which specific registers parameters will be stored in, and which registers may be overwritten by the call instruction.
ACO's compiler passes have special handling for calls wherever necessary: For example, passes analyzing how many registers are required in all the different parts of the code take special care to account for the fact that across a call instruction, fewer registers are available for storing values (because anything in a non-preserved register may be overwritten). ACO also has a spilling pass for moving register values to memory whenever the number of registers used exceeds the available amount.
Another fundamental change is that function calls also introduce a call stack. In CPUs, this is no big deal - you have one stack pointer register, and it points to the stack region that your program uses. However, on GPUs, there isn't just one stack - remember that GPUs are highly parallel, and every thread running on the GPU needs its own stack!
Luckily, this sounds worse at first than it actually is. In fact, the hardware already has facilities to help manage stacks. AMD GPUs ever since Vega4 have the concept of "scratch memory" - a memory pool in VRAM where the hardware ensures that each thread has its own private "scratch region". There are special scratch_* memory instructions that load and store from this scratch area. Even though they're also VRAM loads/stores, they don't take any address, just an offset, and for each thread return the value stored in that thread's own scratch memory region.
In my blog post about RT pipelines being enabled by default I claimed AMD GPUs don't implement a call stack. This is actually misleading - the scratch memory functionality is all you need to implement a stack yourself. The "stack pointer" here is just the offset you pass to the scratch_* memory instructions. Pushing to the stack increases the stack offset, and popping from it decreases the offset5.
Eventually, when it comes to converting a call to hardware instructions, all that is needed is to execute the s_swappc instruction. This instruction automatically writes the address of the next instruction to a register before jumping to the called shader. When the called shader wants to return, it merely needs to jump to the address stored in that register, and execution resumes from right after the call instruction.
Finally, any-hit separate compilation was a straightforward task as well - it was merely an issue of defining an ABI that made sure that a ton of registers stay preserved and the caller can stash its values there. In practice, all of the traversal state will be stashed in these preserved registers. No expensive spilling to memory needed, just a quick jump to the any-hit shader and back.
Performance considerations
If you look at the merge request, the performance benefits seem pretty obvious.
Ghostwire Tokyo's RT passes speed up by more than 2x, and of course pipeline compilation times improved massively.
The compilation time difference is quite easy to explain. Generally, compilers will perform a ton of analysis passes on shader code to find everything they can to optimize it to death. However, these analysis passes often require going over the same code more than once, e.g. after gathering more context elsewhere in the shader. This also means that a shader that doubles in size will take more than twice as long to compile. When inlining hundreds or thousands of shaders into one, that also means that shader's compile time grows by a lot more than just a hundred or a thousand times.
Thus, if we reverse things and are suddenly able to stop inlining all the shaders into one, that scaling effect means all the shaders will take less total time to compile than the one big megashader. In practice, all modern games also offload shader compilation to multiple threads. If you can compile the any-hit shaders separately, the game can compile them all in parallel - this just isn't possible with the single megashader which will always be compiled on a single thread.
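A toy cost model makes the scaling argument concrete. The quadratic exponent and all numbers here are assumptions for illustration - real passes vary - but any superlinear cost produces the same effect:

```python
# Assume an analysis pass whose cost grows quadratically with shader
# size. The exact exponent doesn't matter, only that it's superlinear.
def compile_cost(num_instructions):
    return num_instructions ** 2

SHADERS = 100   # any-hit shaders in the pipeline (made-up number)
SIZE = 1000     # instructions per shader (made-up number)

megashader = compile_cost(SHADERS * SIZE)   # everything inlined into one
separate = SHADERS * compile_cost(SIZE)     # each shader compiled alone

# Inlining made compilation 100x more expensive, not just "100 shaders'
# worth" of work - and the separate shaders can be compiled in parallel
# on top of that.
assert megashader == 100 * separate
```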
In the runtime performance department, moving to just having a single call instruction instead of hundreds of shaders in one place means the loop has a much smaller code size. In a loop iteration where you don't call any any-hit shaders, you would still need to jump over all of the code for those shaders, almost certainly causing instruction cache misses, stalls and so on.
Also, forcing any-hit/intersection shaders to be separate also means that any-hit/intersection shaders that consume tons of registers despite nearly never getting called won't have any negative effects on ray traversal as a whole. ACO has heuristics on where to optimally insert memory stores in case something somewhere needs more registers than available. However, these heuristics may decide to insert memory stores inside the generic traversal loop, even if the problematic register usage only comes from a few rarely-called inlined shaders. These stores in the generic loop would now mean that the whole shader is slowed down in every case.
However, separate compilation doesn't exclusively have advantages, either. In an inlined shader, the compiler is able to use the context surrounding the (now-inlined) shader to optimize the code itself. A separately-compiled shader needs to be able to get called from any imaginable context (as long as it conforms to ABI), and this inhibits optimization.
Another consideration is that the jump itself has a small cost (not as big as you'd think, but it does have a cost). RADV currently keeps inlining any-hit shaders as long as you don't have too many of them, and as long as doing so wouldn't inhibit the ability to compile the shaders in parallel.
About that big UE5 Lumen perf improvement
I also opened a merge request that provided massive performance improvements to Lumen's RT right before the branchpoint.
However, these improvements are completely unrelated to function calls. In fact, they're a tiny bit embarrassing, because all that changed was that RADV doesn't make the hardware do ridiculously inefficient things anymore.
Let's talk about dispatching RT shaders. The Vulkan API provides a vkCmdTraceRaysKHR command that takes in the number of rays to dispatch for X, Y and Z dimensions. Usually, compute dispatches are described in terms of how many thread groups to dispatch, but RT is special because one ray corresponds to one thread. So here, we really get the dispatch sizes in threads, not groups.
By itself, that's not an issue. In fact, AMD hardware has always been able to specify dispatch dimensions in threads instead of groups. In that case, the hardware takes on the job of assembling just enough groups to hold the specified number of threads. The issue here comes from how we describe that group to the hardware. The workgroup size itself is also per-dimension, and the simplest case of 32x1x1 threads (i.e. a 1D workgroup) is actually not always the best.
Let's consider a very common ray tracing use case: You might want to trace a ray for each pixel in a 1920x1080 image. That's pretty easy, you just call vkCmdTraceRaysKHR to dispatch 1920 rays in the X dimension and 1080 in the Y dimension.
When you dispatch a 32x1x1 workgroup, the coordinates for each thread in a workgroup look like this:
thread id | 0 | 1 | 2 | ... | 16 | 17 |...| 31 |
coord |(0,0)|(1,0)|(2,0)| ... |(16,0)|(17,0)|...|(31,0)|
Or, if you consider how the thread IDs are laid out in the image:
-------------------
0 | 1 | 2 | 3 | ..
-------------------
That's a straight line in image space. That's not the best, because it means that the pixels will most likely cover different objects which may have very different trace characteristics. This means divergence during RT will be higher, which can make the overall process slower.
Let's instead look at what happens when you make the workgroup 2D, with an 8x4 size:
thread id | 0 | 1 | 2 | ... | 16 | 17 |...| 31 |
coord |(0,0)|(1,0)|(2,0)| ... |(0,2) |(1,2) |...|(7,3) |
In image space:
-------------------
0 | 1 | 2 | 3 | ..
------------------
8 | 9 | 10| 11| ..
------------------
16| 17| 18| 19| ..
-------------------
That's much better. Threads are now arranged in a little square, and these squares are much more likely to all cover the same objects, have similar RT characteristics, etc.
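The two layouts above come down to a bit of index arithmetic. A quick Python sketch (illustrative only, not driver code):

```python
def coord_32x1(tid):
    # 32x1x1 workgroup: thread IDs form a straight line in image space.
    return (tid, 0)

def coord_8x4(tid):
    # 8x4 workgroup: thread IDs form an 8-wide, 4-tall block.
    return (tid % 8, tid // 8)

# Reproduces the tables above:
assert coord_32x1(17) == (17, 0)
assert coord_8x4(17) == (1, 2)
assert coord_8x4(31) == (7, 3)
```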
This is why RADV used 8x4 workgroups as well. Now let's get to when this breaks down. What if the RT dispatch doesn't actually have 2 dimensions? What if there are 1920 rays in the X dimension, but the Y dimension is just 1?
It turns out that the hardware can only run 8 threads in a single wavefront in this case. This is because the rest of the workgroup is out-of-bounds of the dispatch - it has a non-zero Y coordinate, but the size in the Y dimension is only 1, so it would exceed the dispatch bounds.
The hardware also can't pull in threads from other workgroups, because one wavefront can only ever execute one workgroup. The end result is that the wave runs with only 8 out of 32 threads active - at 1/4 theoretical performance. For no real reason.
I actually had noticed this issue years ago (with UE4, ironically). Back then I worked around it by rearranging the game's dispatch sizes into a 2D one behind its back, and recalculating a 1-dimensional dispatch ID inside the RT shader so the game doesn't notice. That worked just fine… as long as we're actually aware of the dispatch sizes.
UE5 doesn't actually use vkCmdTraceRaysKHR. It uses vkCmdTraceRaysIndirectKHR, a variant of the command where the dispatch size is read from GPU memory, not specified on the CPU. This command is really cool and allows for some nifty GPU-driven rendering setups where you only dispatch as many rays as you're definitely going to trace (as determined by previous GPU commands). This command also rips a giant hole in the approach of rearranging dispatch sizes, because we don't even know the dispatch size before the dispatch is actually executed. That means the super simple workaround I built was never hit, and we had the same embarrassingly inefficient RT performance as a few years ago all over again.
Obviously, if UE5 is too smart for your workaround, then the solution is to make an even smarter workaround. The ideal solution would work with a 1D thread ID (so that we don't run into any more issues when there is a 1D dispatch), but if a 2D dispatch is detected, we turn that "line" of 1D IDs into a "square". The whole idea of turning a linear coordinate into a square reminded me a lot of how Z-order curves work. In fact, the GPU already arranges things like image data on a Z-order curve by interleaving the address bits of X and Y, because nearby pixels are often accessed together and it's better if they're close to each other.
However, instead of interleaving an X and Y coordinate pair to make a linear memory address, we want the opposite: We have a linear dispatch ID, and we want to recover a 2D coordinate inside a square from it. That's not too hard, you just do the opposite operation: Deinterleave the bits, where the even/odd bits of the dispatch ID form the X/Y coordinate. As it turned out, you can actually do this entirely from inside the shader with just a few bit twiddling tricks, so this approach works for both indirect and direct (non-indirect) trace commands.
With that approach, dispatch IDs and coordinates look something like this:
thread id | 0 | 1 | 2 | ... | 16 | 17 |...| 31 |
coord |(0,0)|(1,0)|(0,1)| ... |(4,0) |(5,0) |...|(7,3) |
In image space:
-------------------
0 | 1 | 4 | 5 | ..
------------------
2 | 3 | 6 | 7 | ..
------------------
8 | 9 | 12| 13| ..
-------------------
10| 11| 14| 15| ..
-------------------
Not only are the thread IDs now arranged in squares, the squares themselves get recursively subdivided into more squares! Theoretically, this should be a further improvement w.r.t. divergence, but I don't think it has resulted in a measurable speedup in practice anywhere.
The most important thing, though, is that now UE5 RT doesn't run 4x slower than it should. Oops.
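For the curious, the de-interleaving trick itself is easy to play with in Python. The real lowering happens inside RADV's shader compiler with a few bit twiddling instructions; this sketch just demonstrates the bit pattern:

```python
def deinterleave(dispatch_id):
    # Recover a 2D coordinate from a linear dispatch ID by splitting
    # its bits: even-position bits form X, odd-position bits form Y.
    x = y = 0
    for bit in range(16):
        x |= ((dispatch_id >> (2 * bit)) & 1) << bit
        y |= ((dispatch_id >> (2 * bit + 1)) & 1) << bit
    return (x, y)

# Matches the table above:
assert deinterleave(2) == (0, 1)
assert deinterleave(16) == (4, 0)
assert deinterleave(31) == (7, 3)
```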
Bonus content: Function call bug bonanza
The second most fun thing about function calls is that you can just jump to literally any program anywhere, provided the program doesn't completely thrash your preserved registers and stack space.
The most fun thing about function calls is what happens when the program does just that.
I'm going to use this section to scream into the void about two very real function call bugs that were reported after I had already merged the MR. This is not an exhaustive list - you can trust that I had much, much more fun just like this while I was testing and developing function calls.
Avowed gets stuck in an infinite loop
On the scale of function call bugs, this one was rather tame, even. Having infinite loops isn't the most optimal for hang debugging, but it does mean that you can use a tool like umr to sample which wavefronts are active, and get some register dumps. The program counter will at least point to some instruction in the loop that it's stuck in, and you can get yourself the disassembly of the whole shader to try and figure out what's going on in the loop and why the exit conditions aren't met.
The loop in Avowed was rather simple: It traced a ray in a loop, and when the loop counter was equal to an exit value, control flow would break out of the loop. The register dumps also immediately highlighted the loop exit counter being random garbage. So far so good.
During the traceRay call, the loop exit counter was backed up to the shader's stack. Okay, so it's pretty obvious that the stack got smashed somehow and that corrupted the loop exit counter.
What was not obvious, however, was what smashed the stack. Debugging this is generally a bit of an issue - GPUs are far, far away from tools like AddressSanitizer, especially at a compiler level. There are no tools that would help me catch a faulty access at runtime. All I could really do was look at all the shaders in that ray tracing pipeline (luckily that one didn't have too many) and see if they somehow store to wrong stack locations.
All shaders in that pipeline were completely fine, though. I checked every single scratch instruction in every shader to see if the offsets were correct (luckily, the offsets are constants encoded in the disassembly, so this part was trivial). I also verified that the stack pointer was incremented by the correct values - everything was completely fine. No shader was smashing its callers' stack.
I found the bug more or less by complete chance. The shader code was indeed completely correct, there were no miscompilations happening. Instead, the "scratch memory" area the HW allocated was smaller than what each thread actually used, because I forgot to multiply by the number of threads in a wavefront in one place.
The stack wasn't smashed by the called function, it was smashed by a completely different thread. Whether your stack would get smashed was essentially complete luck, depending on where the HW placed your scratch memory area relative to other wavefronts' scratch, and how those wavefronts' execution was timed relative to yours. I don't think I would ever have been able to deduce this from any debugger output, so I should probably count myself lucky I stumbled upon the fix regardless.
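The class of bug is easy to illustrate with toy numbers (nothing here resembles the actual driver code): if the scratch allocation forgets the wavefront-size multiplier, every thread past the first writes its "private" region outside the allocation, into memory belonging to someone else.

```python
WAVE_SIZE = 32          # threads per wavefront
PER_THREAD_BYTES = 256  # scratch each thread actually uses

correct_alloc = PER_THREAD_BYTES * WAVE_SIZE  # what should be allocated
buggy_alloc = PER_THREAD_BYTES                # the forgotten multiplier

def thread_region(tid):
    # Byte range thread `tid` writes to, given per-thread addressing.
    return (tid * PER_THREAD_BYTES, (tid + 1) * PER_THREAD_BYTES)

# Already thread 1's region lies entirely outside the buggy allocation,
# i.e. in whatever happens to live there - like another wave's stack.
start, end = thread_region(1)
assert start >= buggy_alloc
assert end <= correct_alloc
```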
Silent Hill 2's reflections sample the sky color
Did I talk about Unreal Engine yet? Let's talk about Unreal Engine some more. Silent Hill 2 uses Lumen for its reflection/GI system, and somehow Lumen from UE 5.3 specifically was the only thing that seemed to reproduce this particular bug.
In every way the Avowed bug was tolerable to debug, this one was pure suffering. There were no GPU hangs, all shaders ran completely fine. That means using umr and getting a rough idea of where the issue is was off the table from the start. Unfortunately, the RT pipeline was also way too large to analyze - there were a few hundred hit shaders, but there also were seven completely different ray generation shaders.
Having little other recourse, I started trying to at least narrow down the ray generation shader that triggered the fault. I used Mesa's debugging environment variables to dump the SPIR-V of all the shaders the driver encountered, and then used spirv-cross on all of them to turn them into editable GLSL. For each ray generation shader, I'd comment out the imageStore instructions that stored the RT result to some image, recompiled the modified GLSL to SPIR-V, and instructed Mesa to sneakily swap out the original ray-gen SPIR-V with my modified one. Then I re-ran the game to see if anything changed.
This indeed led me to find the correct ray generation shader, but the lead turned into a dead end - there was little insight other than that the ray was indeed executing the miss shader. Everything seemed correct so far, and if I hadn't known that these rays didn't miss three commits ago, I honestly wouldn't even have suspected anything was wrong at all.
The next thing I tried was commenting out random things in ray traversal code. Skipping over all any-hit/intersection shaders yielded no change, and neither did replacing the ray flags/culling masks with known good constants to rule out wrong values being passed as parameters. What did "fix" the result, however, was… commenting out the calls to closest-hit shaders.
Now, if closest-hit shaders get called and that somehow makes miss shaders execute, you'd perhaps think we'd be calling the wrong function. Maybe we read the wrong entry from the shader binding table, where we load the addresses of shaders to call? To verify that assumption, I also disabled calling any and all miss shaders. I zeroed out the addresses in the shader handles to make extra sure there was no possible way that a miss shader could ever get called. To keep things working, I replaced the code that calls miss shaders with the relevant code fragment from UE's miss shader (essentially inlining the shader myself).
Nothing changed from that. That means a closest-hit shader being executed somehow resulted in a ray traversal itself returning a miss, not the wrong function being called.
Perhaps the closest-hit shaders corrupt some caller values again? Since the RT pipeline was too big to analyze, I tried to narrow down the suspicious shaders by only disabling specific closest-hit shaders. I also discovered that just making all closest-hit shaders no-ops "fixed" things as well, even if they do get called.
Sure enough, at some point I had a specific closest-hit shader where the issue went away once I deleted all code from it/made it a no-op. I even figured out a specific register that, if explicitly preserved, would make the issue go away.
The only problem was that this register corresponded to one part of the return value of the closest-hit shader - that is, a register that the shader was supposed to overwrite.
From here on out it gets completely nonsensical. I will save you the multiple days of confusion, hair-pulling, desperation and agony over the complete and utter undebuggableness of Lumen's RT setup and skip to the solution:
It turned out the "faulty" closest-hit shader I found was nothing but a red herring. Lumen's RT consists of 6+ RT dispatches, most of which I haven't exactly figured out the purpose of, but what I seemed to observe was that the faulty RT dispatch used the results of the previous RT dispatch to make decisions on whether to trace any rays or not. Making the closest-hit shaders a no-op did nothing but disable the subsequent traceRays that actually exhibited the issue.
Since these RT dispatches used the same RT pipelines, that meant virtually any avenue I had of debugging this driver-side was completely meaningless. Any hacks inside the shader compiler might actually work around the issue, or just affect a conceptually unrelated dispatch that happens to disable the actually problematic rays. Determining which was the case was nearly impossible, especially in a general case.
I never really figured out how to debug this issue. Once again, what saved me was a random epiphany out of the blue. In fact, now that I know what the bug was, I'm convinced I would've never found this through a debugger either.
The issue turned out to be in an optimization for what's commonly called tail-calls. If you have a function that calls another function at the very end just before returning, a common optimization is to simply turn that call into a jump, and let the other function return directly to the caller.
Imagine ray traversal working a bit like this C code:
/* hitT is the t value of the ray at the hit point */
payload closestHit(float hitT);

/* tMax is the maximum range of the ray; if there is
 * no hit with a t <= tMax, the ray misses instead */
payload traversal(float tMax) {
    /* ... do traversal work ... */
    if (hit)
        /* gets replaced with a jmp; closestHit returns
         * directly to traversal's caller */
        return closestHit(hitT);
}
More specifically, the bug was in how preserved parameters and tail-calls interact. Function callers are generally allowed to assume that preserved parameters do not change their value over a function call. That means it's safe to reuse that register after the call and assume it still has the value the caller put in.
However, in the example above, let's assume closestHit has the same calling convention as traversal. That means closestHit's parameter needs to go into the same register as traversal's parameter, and thus the register gets overwritten.
If traversal's caller was assuming that the parameter is preserved, that would mean the value of tMax has just been overwritten with the value of hitT without the caller knowing. If traversal now gets called again from the same place, the value of tMax is not the intended value, but the hitT value from the previous iteration, which is definitely smaller than tMax.
Put shortly: If all these conditions are met, a smaller-than-intended tMax could cause rays to miss when they were intended to hit.
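The failure mode can be modeled with a toy register file in Python (all names invented; the real bug involves hardware registers, not a dict):

```python
regs = {"p0": 0.0}  # p0: a parameter register the ABI marks as preserved

def closest_hit(hit_t):
    regs["p0"] = hit_t  # closestHit's parameter also lives in p0, so the
    return "hit"        # tail-call overwrites the caller's tMax

def traversal(hit_t):
    t_max = regs["p0"]  # read tMax from the "preserved" register
    if hit_t <= t_max:
        # Tail-call: turned into a jump; closest_hit returns
        # directly to traversal's caller.
        return closest_hit(hit_t)
    return "miss"

regs["p0"] = 100.0         # caller stores tMax once, assuming it
result_1 = traversal(1.5)  # survives the call... first ray hits fine,
result_2 = traversal(2.0)  # but now "tMax" has shrunk to 1.5
```

The first ray correctly hits, but the second misses even though 2.0 is well within the intended range of 100.0 - which is exactly how Lumen's reflection rays ended up sampling the sky.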
Once again, I got incredibly lucky and stumbled upon the bug by complete chance.
The GPU gods seem to be in good spirits for my endeavours. I pray it stays this way.
Footnotes
1. "Shader" in this context really means any program that runs on the GPU. The RT pipeline is also made of shaders, shaders determine where the points and triangles making up each object end up on screen, there are compute shaders for generic computing, and so on… ↩
2. There actually is another use-case where this becomes relevant on GPUs - and that is GPGPU code like CUDA/HIP/OpenCL. CUDA/HIP allow you to write C++ for the GPU in a much more "CPU-like" programming environment (OpenCL uses C), and you run into all the same problems there. This also means all the major GPU vendors had already written their solutions for these problems when raytracing came around. There are OpenCL kernels that end up really, really bad if you don't have proper function calls in the compiler (which Rusticl suffers from right now), and the function calls work in RADV/ACO may end up proving useful for those as well. ↩
3. Even your main function works like that, actually. Unless you have some form of freestanding environment, all your program code works like that. ↩
4. We support raytracing before Vega, too. We support function calls on all GPUs as well, through a little magic: dreaming up a buffer descriptor with specific memory swizzling to achieve the same addressing that scratch_* instructions use on Vega and later. ↩
5. In RADV, the stack pointer is actually constant across a function, and pushing/popping to/from the stack is implemented by adding another offset to the constant stack pointer in load/store instructions. This allows making the stack pointer an SGPR instead of a VGPR and simplifies stack accesses that aren't pushes/pops. ↩
30 Jan 2026 12:00am GMT