28 Feb 2021


Roman Gilg: Curious Child

Last week we studied window children on X11 and Wayland at a high level. With this general knowledge acquired, we will quickly go through the recent changes to window children in KWinFT's new version.

All the X11 Children

As mentioned in last week's article there is not only one kind of transient child on X11. There are, for one, the usual transients, defined by setting the WM_TRANSIENT_FOR window property to the window id of a toplevel parent window. But there are also group transients, defined by setting that property to null or, alternatively, to the id of the root window.

Transient Leads

Normal transients and group transients were in the past handled in KWinFT by different means. In the class for managed X11 windows, a field indicated whether the window is a transient. For normal transients the field held the id of the parent window; for group transients it was always set to the root window id. When set to null, the window was not a transient at all.

Additionally there was a function mainClients() which returned a list of all transient leads. As a reminder, these are the other windows a window is a transient for. For normal transients the returned list obviously contained only a single element.

This has now been unified and encapsulated in a single class, simply called transient, which is composed into the classes representing windows. With that there is one way to check whether a window is a transient. More importantly, the same mechanism is used for all kinds of window children, including, as we will see later, those on Wayland.
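To illustrate the idea, here is a minimal sketch of such a composed transient class. All names here are invented for illustration; the real KWinFT class differs in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class Toplevel;  // a window (hypothetical stand-in name)

// Sketch of a unified transient relation, composed into window classes.
// A single code path serves normal transients (one lead), group
// transients (several leads) and Wayland children alike.
class transient {
    std::vector<Toplevel*> m_leads;     // windows this one is a transient for
    std::vector<Toplevel*> m_children;  // transients of this window
public:
    bool is_transient() const { return !m_leads.empty(); }

    void add_lead(Toplevel* lead) { m_leads.push_back(lead); }
    void remove_lead(Toplevel* lead) {
        m_leads.erase(std::remove(m_leads.begin(), m_leads.end(), lead),
                      m_leads.end());
    }
    void add_child(Toplevel* child) { m_children.push_back(child); }

    std::vector<Toplevel*> const& leads() const { return m_leads; }
    std::vector<Toplevel*> const& children() const { return m_children; }
};
```

With this, "is this window a transient?" becomes a single query on the composed object instead of interpreting a raw window id field.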

Around and Around

We said the notion of group transients increases the complexity when dealing with transient windows. One reason for that is the danger of cyclic relations.

When a window is a group transient, every other window in the window group is a transient lead for it. This naturally leads to cyclic transient relations as soon as there is more than one group transient in the group. Cycles would also arise when one of the other windows names a group transient as its transient lead in the usual way, by setting its WM_TRANSIENT_FOR window property to that window's id. Thinking further, this can even happen through an indirect transient relation via several windows inside or outside the group.

Since we use the transient relation for stacking windows, cyclic transient relations make no sense. Such relations are also likely to cause tricky bugs like infinite loops. So we simply filter them out.

In the old code this was ensured through different means scattered around. In the new implementation it happens in one place, when the group or the WM_TRANSIENT_FOR value is updated. For all affected windows the updated relations are then saved into their composed transient objects.
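The filtering itself can be sketched like this (names invented): before accepting a new lead for a window, walk up from the prospective lead through its own leads; if the walk ever reaches the window, the new relation would close a cycle and is rejected.

```cpp
#include <cassert>
#include <vector>

// Toy window with only its transient leads. Assumes the existing
// relations are already acyclic, which this check maintains.
struct Win {
    std::vector<Win*> leads;
};

// Would making `lead` a transient lead of `child` close a cycle?
bool creates_cycle(Win const* child, Win const* lead) {
    if (lead == child) {
        return true;
    }
    for (auto const* next : lead->leads) {
        if (creates_cycle(child, next)) {
            return true;
        }
    }
    return false;
}

void set_transient_lead(Win* child, Win* lead) {
    if (creates_cycle(child, lead)) {
        return;  // filter the cyclic relation out
    }
    child->leads.push_back(lead);
}
```

The recursive walk also catches the indirect case, where the cycle would run through several intermediate windows.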

Same but Different Wayland Children

The grand goal of this rewrite was to unify the handling of window children not only for windows on X11 but also for windows on Wayland. This has been achieved by always falling back on the same general idea of window children, and only defining explicitly in detail what differs.

Subsurfaces as Annexed Children

In the past subsurfaces were handled in separate trees per surface. The decision to implement them this way must have seemed natural because KWin as an X11 window manager did not handle the parent-child relations of the X11 window tree. This was done, as mentioned in the previous article, by the X Server itself.

The problem with this approach though is that it ignores the tree which already exists in the window manager: the tree for stacking all toplevel windows. We basically doubled the implementation cost by maintaining separate trees for subsurfaces. With the rewrite this has been corrected. Subsurfaces are now tracked in the same internal stack in KWinFT as all other X11 and Wayland windows. The stacking algorithm ensures that they are always above their parent surface, and input is redirected accordingly.

This greatly simplifies the handling of subsurfaces. There is one difference though: in contrast to normal X11 transient windows, subsurfaces do not have control of their own, they are not independent entities. They are rather annexed to their parent surface. It made sense to introduce a property of that name for them. When painting the final image the Scene is responsible for painting subsurfaces as part of their parent surface.

Toplevel Transients and Popups

In the last article we saw that the xdg-shell protocol defines parent-child relations between xdg-toplevels. These relations can be represented in the same way as normal transients on X11. The implementations on Wayland and X11 are therefore very similar, differing only in how the information is received via each protocol.

The case of xdg-popups is more interesting. On X11 we saw that popups are basically ignored by the window manager. But on Wayland popups need to be stacked and positioned by the window manager as there is no other entity like the X Server doing that for us.

Obviously we want to interpret them as window children again so we can reuse our tools, and the refactored implementation was designed that way. But this is also a good example of how defining the right notion can make all the difference, because we interpret them not only as normal window children.

An effect acts on a Wayland popup as it is an annexed child.

They are now, in the same way as subsurfaces, set to be annexed children. This way effects that affect the parent also affect them.

Restacked Perception

One interesting aspect of the unification work on window children is that the annexed children motivated an overhaul of the central stacking algorithm. The old algorithm was difficult to understand with several counters and nested loops. The new algorithm uses the transient class with its leads and children to compute the new stack recursively, keeping a child above its parent but below other windows.
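The recursive idea can be sketched with a few lines (a toy model with invented names, not the actual KWinFT code): emit each window into the new stack and immediately recurse into its transient children, so every child ends up directly above its lead but below unrelated windows.

```cpp
#include <cassert>
#include <vector>

// Toy window: an id and its transient children in bottom-to-top order.
struct Window {
    int id;
    std::vector<Window*> children;
};

// Emit a window, then all of its children recursively above it.
void stack_subtree(Window* win, std::vector<int>& stack) {
    stack.push_back(win->id);
    for (auto* child : win->children) {
        stack_subtree(child, stack);
    }
}

// `roots` are the windows without transient leads, in the desired
// bottom-to-top order; the result interleaves children above parents.
std::vector<int> restack(std::vector<Window*> const& roots) {
    std::vector<int> stack;
    for (auto* root : roots) {
        stack_subtree(root, stack);
    }
    return stack;
}
```

Compared with counters and nested loops, the recursion mirrors the transient relation directly, which is what makes the new algorithm easier to reason about.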

It is therefore fair to say that in this case a different viewpoint on window children led to improvements in parts that at first seemed unrelated or at least of no concern. And this was not the only time that happened when I worked on this Windowing Revolution. In general one can say that fundamental progress is achieved when traditional views are challenged. The crucial step here was to find a new definition for unity and difference of window children.

28 Feb 2021 10:00pm GMT

27 Feb 2021


Pekka Paalanen: Testing 4x4 matrix inversion precision

It is extremely rare that a hobby software project of mine gets completed, but now it has happened. Behold! Fourbyfour!

Have you ever had to implement a mathematical algorithm, say, matrix inversion? You want it to be fast and measuring the speed is fairly simple, right. But what about correctness? Or precision? Behavior around inputs that are on the edge? You can hand-pick a few example inputs, put those into your test suite, and verify the result is what you expect. If you do not pick only trivial inputs, this is usually enough to guarantee your algorithm does not have fundamental mistakes. But what about those almost invalid inputs, can you trust your algorithm to not go haywire on them? How close to invalid can your inputs be before things break down? Does your algorithm know when it stops working and tell you?

Inverting a square matrix requires that the inverse matrix exists to begin with. Matrices that do not mathematically have an inverse matrix are called singular. Can your matrix inversion algorithm tell you when you are trying to invert a matrix that cannot be inverted, or does it just give you a bad result pretending it is ok?

Working with computers often means working with floating-point numbers. With floating-point, the usual mathematics is not enough, it can actually break down. You calculate something and the result a computer gives you is total nonsense, like 1+2=2 in spirit. In the case of matrix inversion, it's not enough that the input matrix is not singular mathematically, it needs to be "nice enough" numerically as well. How do you test your matrix inversion algorithm with this in mind?
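The "1+2=2 in spirit" breakdown is easy to reproduce with 32-bit floats: a float carries a 24-bit significand, so at magnitude 1e8 the gap between adjacent representable values is 8, and adding 1.0f is swallowed whole by rounding.

```cpp
#include <cassert>

// Returns true when adding 1.0f to `big` changes nothing at all,
// i.e. when the usual mathematics has broken down.
bool addition_breaks_down(float big) {
    return big + 1.0f == big;
}
```

The same effect is what makes "nice enough numerically" a real requirement for inversion inputs, on top of the mathematical non-singularity.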

These questions I tried to answer with Fourbyfour. The README has the links to the sub-pages discussing how I solved this, so I will not repeat it here. However, as the TL;DR, if there is one thing you should remember, it is this:

Do not use the matrix determinant to test if a matrix is invertible!

Yes, the determinant is zero for a singular matrix. No, close to zero determinant does not tell you how close to singular the matrix is. There are better ways.

However, the conclusion I came to is that if you want a clear answer for a specific input matrix (is it invertible?), the only way to know for sure is to actually invert it, multiply the input matrix with the inverse you computed, and measure how far off from the identity matrix the product is. Of course, you also need to set a threshold for how close to the identity matrix is close enough for your application, because with numerical algorithms you will almost never get the exact answer. Also, pick an appropriate matrix norm for the matrix difference.

The reason for this conclusion is what one of the tools I wrote tells me about a matrix that would be typical for a display server with two full-HD monitors. The matrix is simply the pixel offset of the second monitor on the desktop. The analysis of the matrix is the example I used to demonstrate fourbyfour-analyse. If you read through it, you should be shocked. The mathematics, as far as I can understand, seems to tell us that if you use 32-bit floating-point, inverting this matrix gives us a result that leads to no correct digits at all. Obviously this is nonsense, the inverse is trivial and algorithms should give the exact correct result. However, the math does not lie (unless I did). If I did my research right, then what fourbyfour-analyse tells us is true, with an important detail: it is the upper error bound. It guarantees that we cannot get errors larger than that (heh, zero correct digits is pretty hard to make much worse). But I also read that there is no better error bound possible for a generic matrix inversion algorithm. (If you take the obvious-to-human constraints into account that those elements must be one and those must be zero, the analysis would likely be very different.) Therefore the only thing left to do is to actually go on with the matrix inversion and then verify the result.

Here is a list of the cool things the Fourbyfour project does or has:

In this project I also tried out several project quality assurance features:

I'm really happy this project is now "done", well, version 1.0.0 so to say. One thing I have realized it is still missing is a determinant sweep mode. The precision testing mode sweeps over condition numbers and allows plotting the inversion behavior. It should have another mode where the sweep controls the determinant value, with some fixed condition number for the random test matrices. This determinant mode could point out inversion algorithms that use determinant value for matrix singularity testing and show how it leads to completely arbitrary results.

If you want to learn about numerical methods for matrices, I recommend the book: Gene H. Golub, Charles F. van Loan, Matrix Computations, The Johns Hopkins University Press. I used the third edition, 1996, when implementing the Weston matrix inversion years ago.

27 Feb 2021 10:47am GMT

26 Feb 2021


Ben Widawsky: Framebuffer Modifiers Part 1


In a now pretty well established tradition on my part, I am posting on things I no longer work on!

I gave a talk on modifiers at XDC 2017 and at Linux Plumbers 2017 (audio only). It was always my goal to have a blog post accompany the work. Relatively shortly after the talks, I ended up leaving graphics and so it dropped on the priority list.

I'm splitting this up into two posts. This post will go over the problem, and solutions. The next post will go over the implementation details.


Each 3d computational unit in an Intel GPU is called an Execution Unit (EU). Aside from what you might expect them to do, like execute shaders, they may be used for copy operations (itself a shader), or compute operations (also, shaders). All of these things require memory bandwidth in order to complete their task in a timely manner.

Modifiers were the chosen solution in order to allow end to end renderbuffer [de]compression to work, which is itself designed to reduce memory bandwidth needs in the GPU and display pipeline. End to end renderbuffer compression simply means that through all parts of the GPU and display pipeline, assets are read and written to in a compression scheme that is capable of reducing bandwidth (more on this later).

Modifiers are a relatively simple concept. They are modifications applied to a buffer's layout. Typically a buffer has a few properties: width, height, and pixel format, to name a few. Modifiers can be thought of as ancillary information passed along with the pixel data; they impact how the data is processed or displayed. One such example is tiling, a mechanism that changes how pixels are stored (not sequentially) so that operations make better use of locality for caching and similar reasons. Modifiers were primarily designed to help negotiate modified buffers between the GPU rendering engine and the display engine (usually by way of the compositor). Other uses can crop up as well, such as the video decode/encode engines.

My understanding is that even now, 3 years later, full modifier support isn't readily available across all corners of the graphics ecosystem, and many hardware features go entirely unrealized. Upstreaming sweeping graphics features like this one can be very time consuming, and I would seriously advise hardware designers to take that into consideration (or better yet, ask your local driver maintainer) before they spend the gates. If you can make changes that don't require software, just do it. If you need software involvement, the longer you wait, the worse it will be.

They weren't new even when I made the presentation 3.5 years ago.

commit e3eb3250d84ef97b766312345774367b6a310db8
Author: Rob Clark <robdclark@gmail.com>
Date:   6 years ago

    drm: add support for tiled/compressed/etc modifier in addfb2

I managed to land some stuff:

commit db1689aa61bd1efb5ce9b896e7aa860a85b7f1b6
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   3 years, 7 months ago

    drm: Create a format/modifier blob

Admiring the Problem

Back of the envelope requirements for a midrange Skylake GPU from the time can be calculated relatively easily. 4 years ago, at the frequencies we ran our GPUs and given their ISA, we could expect roughly 1GB/s of bandwidth demand for each of the 24 EUs.

A 4k display:

3840px × 2160rows × 4Bpp × 60Hz = 1.85GB/s

24GB/s + 1.85GB/s = 25.85GB/s

This by itself will oversaturate single channel DDR4 bandwidth (which was what was around at the time) at the fastest possible clock. As it turns out, it gets even worse with compositing. Most laptops sporting a SKL of this range wouldn't have a 4k display, but you get the idea.
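The back-of-the-envelope math above can be sketched as a tiny calculation (GiB/s in base-2 units, which is what the 1.85 figure implies):

```cpp
#include <cassert>

// Scanout bandwidth of a display: pixels per frame times bytes per
// pixel times refresh rate, converted to GiB/s.
double scanout_gibs(double width, double height, double bytes_per_pixel,
                    double hz) {
    return width * height * bytes_per_pixel * hz
           / (1024.0 * 1024.0 * 1024.0);
}
```

Adding the ~24GB/s of EU demand to the 4k scanout figure lands above what a single DDR4 channel can deliver, which is the whole point of the exercise.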

The picture (click for larger SVG) is a typical "flow" for a composited desktop using direct rendering with X or a Wayland compositor using EGL. In this case, drawing a Rubik's cube looking thing into a black window.

Admiring the problem

Using this simple Rubik's cube example I'll explain each of the steps so that we can understand where our bandwidth is going and how we might mitigate that. This is just the overview, so feel free to move on to the next section. Since the example is trivial and the window is small (and a singleton), it won't hit that level of bandwidth, but it will demonstrate how and where the bandwidth is being consumed and open up a discussion on how savings can be achieved.

Rendering and Texturing

For the example, no processing happens other than texturing. In a simple world, the processing of the shader instructions doesn't increase the memory bandwidth cost. As such, we'll omit that from the details.

The main steps for getting this Rubik's cube displayed are texture upload, texture sampling, composition, and scanout.

More details below...

Texture Upload

Getting the texture from the application, usually from disk, into main memory, is what I'm referring to as texture upload. In terms of memory bandwidth, you are using write bandwidth to write into the memory.

Assets are transferred from persistent storage to memory

Textures may either be generated by the 3d application, which would be trivial for this example, or authored using a set of offline tools and baked into the application. For any consequential use, the latter predominates. Certain surface types are commonly generated dynamically though; for example, the shadow mapping technique generates depth maps. Those dynamically generated surfaces actually benefit even more (more on this later).

This is pseudo code (but close to real) to upload the texture in OpenGL:

const unsigned height = 128;
const unsigned width = 64;
const void *data = ... // rubik's cube
GLuint tex;

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0, GL_RGB, GL_UNSIGNED_BYTE, data);

I'm going to punt on explaining mipmaps, which are themselves a mechanism to conserve memory bandwidth. If you have no understanding, I'd recommend reading up on mipmaps. This wikipedia article looks decent to me.

Texture Sampling

Once the texture is bound, the graphics runtime can execute shaders which reference those textures. When the shader requests a color value (also known as sampling) from the texture, it's possible (likely, even) that the calculated coordinate within the texture will fall in between pixels. The hardware has to return a single color value for the sample point, and the way it interpolates is chosen by the graphics runtime. This is referred to as a filter.

Texture Fetch/Filtering
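The simplest interpolating filter is bilinear: blend the four texels surrounding the sample point by its fractional position. This is also why bilinear shows up as a 4x texel-fetch multiplier further down. A single-channel sketch, without the edge clamping and wrap modes real hardware handles (x and y must stay inside the texture interior here):

```cpp
#include <cassert>

// Bilinear sample of a single-channel texture at texel-space (x, y).
// Caller must keep x in [0, width-2] and y in [0, height-2].
float sample_bilinear(const float* tex, unsigned width, float x, float y) {
    unsigned x0 = static_cast<unsigned>(x);
    unsigned y0 = static_cast<unsigned>(y);
    float fx = x - static_cast<float>(x0);
    float fy = y - static_cast<float>(y0);
    float t00 = tex[y0 * width + x0];
    float t10 = tex[y0 * width + x0 + 1];
    float t01 = tex[(y0 + 1) * width + x0];
    float t11 = tex[(y0 + 1) * width + x0 + 1];
    float top = t00 + fx * (t10 - t00);  // horizontal blend, top row
    float bot = t01 + fx * (t11 - t01);  // horizontal blend, bottom row
    return top + fy * (bot - top);       // vertical blend
}
```

Four texel reads per sample instead of one is exactly where the extra read bandwidth goes.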

Here's the GLSL to fetch the texture:

#version 330

uniform sampler2D tex;
in vec2 texCoord;
out vec4 fragColor;

void main() {
    vec4 temp = texelFetch(tex, ivec2(texCoord));
    fragColor = temp;
}
The above actually does something that's perhaps not immediately obvious. fragColor = temp;. This actually instructs the fragment shader to write out that value to a surface which is bound for output (usually a framebuffer). In other words, there are two steps here, read and filter a value from the texture, write it back out.

The part of the overall diagram that represents this step:


Composition

In the old days of X, and even still when not using the composite extension, graphics applications could be given a window to write pixels directly into the resulting output. The X window manager would mediate resize and move events, letting the client update as needed. This has a lot of downsides which I'll say are out of scope here. There is one upside that is in scope though: there's no extra copy needed to create the screen composition. It just is what it is, tearing and all.

If you don't know if you're currently using a compositor, you almost certainly are using one. Wayland only composites, and the number of X window managers that don't composite is very few. So what exactly is compositing? Simply put it's a window manager that marshals frame updates from clients and is responsible for drawing them on the final output. Often the compositor may add its own effects such as the infamous wobbly windows. Those effects themselves may use up bandwidth!

Simplified compositor block diagram

Applications will write their output into what's referred to as an offscreen buffer. 👋👋 The compositor will read the output and copy it into what will become the next frame. What this means from a bandwidth consumption perspective is that the compositor will need to use both read and write bandwidth just to build the final frame. 👋👋
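A toy sketch of why composition costs both kinds of bandwidth: every client pixel is fetched from the offscreen buffer and stored again into the frame being assembled. (A real compositor blits via the GPU, often with format conversion and effects; this is only the bandwidth shape.)

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Copy a client's offscreen buffer into the composed frame at
// window position (x, y). One read plus one write per pixel, per frame.
void composite_window(const std::vector<uint32_t>& offscreen,
                      unsigned win_w, unsigned win_h,
                      std::vector<uint32_t>& frame, unsigned frame_w,
                      unsigned x, unsigned y) {
    for (unsigned row = 0; row < win_h; ++row) {
        for (unsigned col = 0; col < win_w; ++col) {
            frame[(y + row) * frame_w + (x + col)] =
                offscreen[row * win_w + col];
        }
    }
}
```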


Scanout

It's the mundane part of this whole thing. Pixels are fetched from memory and pushed out over whatever display protocol is in use.

Display Engine

Perhaps the interesting thing about the display engine is it has fairly isochronous timing requirements and can't tolerate latency very well. As such, it will likely have a dedicated port into memory that bypasses arbitration with other agents in the system that are generating memory traffic.

Out of scope here, but I'll briefly mention that this also gets a bit into tiling. Display wants to read things row by row, whereas rendering works a bit different. In short this is the difference between X-tiling (good for display), and Y-tiling (good for rendering). Until Skylake, the display engine couldn't even understand Y-tiled buffers.

Summing the Bandwidth Cost

Running through our 64x64 example...

Operation             | Color Depth  | Description     | Bandwidth              | R/W
Texture Upload        | 1Bpc (RGBX8) | File to DRAM    | 16KB (64 × 64 × 4)     | W
Texel Fetch (nearest) | 1Bpc         | DRAM to Sampler | 16KB (64 × 64 × 4)     | R
FB Write              | 1Bpc         | GPU to DRAM     | 16KB (64 × 64 × 4)     | W
Compositing           | 1Bpc         | DRAM to DRAM    | 32KB (64 × 64 × 4 × 2) | R+W
Scanout               | 1Bpc         | DRAM to PHY     | 16KB (64 × 64 × 4)     | R

Total = (16 + 16 + 16 + 32 + 16)KB × 60Hz = 5.625MB/s

But actually, the display engine will always scan out the whole screen, so really with a 4k display:

Total = (16 + 16 + 16 + 32 + 32400)KB × 60Hz = 1.9GB/s
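The totals can be checked with a one-liner: per-frame kilobytes times refresh rate, reported in MB/s (base-2, which is what the 5.625 figure implies).

```cpp
#include <cassert>

// Per-frame KB summed from the bandwidth table, times refresh rate.
double total_mbs(double per_frame_kb, double hz) {
    return per_frame_kb * hz / 1024.0;
}
```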

Don't forget about those various filter modes though!

Filter Mode | Multiplier (texel fetch) | Total Bandwidth
Bilinear    | 4x                       | 11.25MB/s
Trilinear   | 8x                       | 18.75MB/s
Aniso 4x    | 32x                      | 63.75MB/s
Aniso 16x   | 128x                     | 243.75MB/s

Proposing some solutions

Without actually doing the math, I think cache is probably the biggest win you can get. One spot where caching could help, if software were aware of it, is that the framebuffer write step and the composition step could avoid the trip to main memory. Another is texture upload and fetch. Assuming you don't blow out your cache, you can avoid the main memory trip.

While caching can buy you some relief, ultimately you have to flush your caches to get the display engine to be able to read your buffer. At least as of 2017, I was unaware of an architecture that had a shared cache between display and 3d.

Also, cache sizes are limited...

Wait for DRAM to get faster

Instead of doing anything, why not just wait until memory gets higher bandwidth?

Here's a quick breakdown of the progression at the high end of the specs. For the DDR memory types, I took a swag at number of populated channels because for a fair comparison the expectation with DDR is you'll have at least dual channel, nowadays.


Looking at the graph it seems like the memory vendors aren't hitting Moore's Law any time soon, and if they are, they're fooling me. A similar chart should be made for execution unit counts, but I'm too lazy. A Tigerlake GT2 has 96 EUs. If you go back to our back of the envelope calculation we had a midrange GPU with 24 EUs, so that has quadrupled. In other words, the system architects will use all the bandwidth they can get.

Improving memory technologies is vitally important, it just isn't enough.


Hardware Composition

One obvious place we want to try to reduce bandwidth is composition. It was after all the biggest individual consumer of available memory bandwidth.

With composition as we described earlier, there was presumed to be a single plane. Software would arrange the various windows onto the plane, which if you recall from the section on composition added quite a bit to the bandwidth consumption, then the display engine could display from that plane.

Hardware composition is the notion that each of those windows could have a separate display plane, directly write into that, and all the compositor would have to do is make sure those display planes occupied the right part of the overall screen. It's conceptually similar to the direct scanout we described earlier in the section on composition.

Operation   | Color Depth | Description  | Bandwidth              | R/W
Compositing | 1Bpc        | DRAM to DRAM | 32KB (64 × 64 × 4 × 2) | R+W

TOTAL SAVINGS = 1.875MB/s (33% savings)

Hardware Composition Verdict

33% savings is really really good, and certainly if you have hardware with this capability, the driver should enable it, but there are some problems that come along with this that make it not so appealing.

  1. Hardware has a limited number of planes.
  2. Formats. One thing I left out about the compositor earlier is that one of the things it may opt to do is convert the application's window into a format that the display hardware understands. This means some amount of negotiation has to take place so the application knows about this. Prior to this work, that wasn't in place.
  3. Doesn't reduce any other parts of the process, i.e. a full screen application wouldn't benefit at all.

Texture Compression

So far, in order to solve the not-enough-bandwidth problem, we've tried adding more bandwidth and reducing usage with hardware composition. The next step is to tackle the bandwidth consumed by texturing.

If you recall, we split texturing into two stages: texture upload and texture fetch. This third proposed solution attempts to reduce bandwidth by storing a compressed texture in memory. Texture upload compresses it, and texture sampling understands the compression scheme and avoids doing all the lookups. Compressing the texture usually comes with some barely perceptible degradation. In terms of sampling, it's a bit handwavy to say you reduce bandwidth by the compression factor, but for simplicity's sake, let's say that's what it does.

Some common formats at the time of the original materials were

Format | Compression Ratio
DXT1   | 8:1
ETC2   | 4:1
ASTC   | Variable, 6:1

Using DXT1 as an example of the savings:

Operation             | Color Depth | Bandwidth              | R/W
Texture Upload        | DXT1        | 2KB (64 × 64 × 4 / 8)  | W
Texel Fetch (nearest) | DXT1        | 2KB (64 × 64 × 4 / 8)  | R
FB Write              | 1Bpc        | 16KB (64 × 64 × 4)     | W
Compositing           | 1Bpc        | 32KB (64 × 64 × 4 × 2) | R+W
Scanout               | 1Bpc        | 16KB (64 × 64 × 4)     | R

Here's an example with the simple DXT1 format:

Texture Compression Verdict

Texture compression solves a couple of the limitations that hardware composition left. Namely it can work for full screen applications, and if your hardware supports it, there isn't a limit to how many applications can make use of it. Furthermore, it scales a bit better because an application might use many many textures but only have 1 visible window.

There are of course some downsides.

Click for SVG

For comparison, here is the same cube scaled down with an 8:1 ratio. As you can see DXT1 does a really good job.

Scaled cube

We can't ignore the degradation though as certain rendering may get very distorted as a result.

*TOTAL SAVINGS (DXT1) = 1.64MB/s (30% savings)

*total savings here is kind of a theoretical max

End to end lossless compression

So what if I told you there was a way to reduce your memory bandwidth consumption without having to modify your application, without being subject to hardware limits on planes, and without having to wait for new memory technologies to arrive?

End to end lossless compression attempts to provide both "end to end" and "lossless" compression transparently to software. Explanation coming up.

End to End

As mentioned in the previous section on texture compression, one of the pitfalls is that you'd have to decompress the texture in order for it to be used outside of your 3d engine. Typically this would mean for the display engine to scan out from, but you could also envision a case where perhaps you'd like to share these surfaces with the hardware video encoder. The nice thing about this "end to end" attribute is that every stage we mentioned in previous sections that required bandwidth gets the savings just by running on hardware and drivers that enable this.


Now because this is all transparent to the application running, a lossless compression scheme has to be used so that there aren't any unexpected results. While lossless might sound great on the surface (why would you want to lose quality?), it reduces the potential savings because lossless compression algorithms are always less efficient than lossy ones, but it's still a pretty big win.

What's with the box, bro?

I want to provide an example of how this can be possible. Going back to our original image of the full picture, everything looks sort of the same. The only difference is there is a little display engine decompression step, and all of the sampler and framebuffer write steps now have a little purple box accompanying them.

One sort of surprising aspect of this compression is it reduces bandwidth, not overall memory usage (that's also true of the Intel implementation). In order to store the compression information, hardware carves off a little bit of extra memory which is referenced for each operation on a texture (yes, that might use bandwidth too if it's not cached).

Here's a made-up implementation which tracks state in a similar way to Skylake era hardware, but the rest is entirely made up by me. It shows that even a naive implementation can get up to a lossless 2:1 compression ratio. Remember though, this comes at the cost of adding gates to the design, so you'd probably want something better performing than this.

2:1 compression

Everything is tracked as cacheline pairs. In this example we have state called "CCS". For every pair of cachelines in the image, 2b are in this state to track the current compression. When the pair of cachelines uses 12 or fewer colors (which is surprisingly often in real life), we're able to compress the data into a single cacheline (state becomes '01'). When the data is compressed, we can reassemble the image losslessly from a single cacheline, this is 2:1 compression because 1 cacheline gets us back 2 cachelines worth of pixel data.

Walking through the example we've been using of the Rubik's cube.

  1. As the texture is being uploaded, the hardware observes all the runs of the same color and stores them in this compressed manner by building the lookup table. On doing this it modifies the state bits in the CCS to be 01 for those cachelines.
  2. On texture fetch, the texture sampler checks the CCS. If the encoding is 01, then the hardware knows to use the LUT mechanism instead for all the color values.
  3. Throughout the rest of rendering, steps 1 & 2 are repeated as needed.
  4. When display is ready to scanout the next frame, it too can look at the CCS determine if there is compression, and decompress as it's doing the scanout.

The memory consumed is minimal, which also means that any bandwidth usage overhead is minimal. In the example we have a 64x128 image. In total that's 512 cachelines. At 2 bits per pair of cachelines, the CCS for the example fits in a single 64B cacheline: 512 / 2 pairs × 2b = 512b = 64B
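By that rule, the bookkeeping cost of this made-up scheme is easy to compute for any image size (64-byte cachelines, 2 bits of CCS state per pair):

```cpp
#include <cassert>

// CCS size in bytes for a width x height image at bpp bytes per pixel,
// under this made-up scheme: 2 bits per pair of 64B cachelines.
unsigned ccs_bytes(unsigned width, unsigned height, unsigned bpp) {
    unsigned cachelines = width * height * bpp / 64;
    unsigned pairs = cachelines / 2;
    return pairs * 2 / 8;  // 2 bits per pair, 8 bits per byte
}
```

Even a full 4k frame needs only about 63KB of CCS, which is why the bandwidth overhead of consulting it stays small.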

* Unless you really want to understand how hardware might actually work, ignore the 00 encoding for clear color.

* There's a caveat here that we assume texture upload and fetch use the sampler. At the time of the original presentation, this was not usually the case and so until the FB write occurred, you didn't actually get compression.

Theoretical best savings would compress everything:

Operation             | Color Depth     | Description     | Bandwidth                  | R/W
Texture Upload        | 1Bpc compressed | File to DRAM    | 8KB (64 × 64 × 4) / 2      | W
Texel Fetch (nearest) | 1Bpc compressed | DRAM to Sampler | 8KB (64 × 64 × 4) / 2      | R
FB Write              | 1Bpc compressed | GPU to DRAM     | 8KB (64 × 64 × 4) / 2      | W
Compositing           | 1Bpc compressed | DRAM to DRAM    | 16KB (64 × 64 × 4 × 2) / 2 | R+W
Scanout               | 1Bpc compressed | DRAM to PHY     | 8KB (64 × 64 × 4) / 2      | R

TOTAL SAVINGS = 2.8125MB/s (50% savings)

And if you use HW compositing in addition to this...

TOTAL SAVINGS = 3.75MB/s (66% savings)

Ending Notes

Hopefully it's somewhat clear how 3d applications are consuming memory bandwidth, and how quickly the consumption grows when adding more applications, textures, screen size, and refresh rate.

End to end lossless compression isn't always going to be a huge win, but in many cases it can really chip away at the problem enough to be measurable. The challenge, as it turns out, is actually getting it hooked up in the driver and the rest of the graphics software stack. As I said earlier, just because a feature seems good doesn't necessarily mean it's worth the software effort to implement. End to end lossless compression is one feature that you cannot just turn on by setting a bit, and the fact that it's still not enabled anywhere, to me, is an indication that the effort and gates may have been better spent elsewhere.

However, the next section will be all about how we got it hooked up through the graphics stack.

If you've made it this far, you probably could use a drink. I know I can.

26 Feb 2021 12:00am GMT

25 Feb 2021


Mike Blumenkrantz: Delete The Code

A Losing Battle

For a long time, I've tried very, very, very, very hard to work around problems with NIR variables when it comes to UBOs and SSBOs.

Really, I have.

But the bottom line is that, at least for gallium-based drivers, they're unusable. They're so unreliable that it's only by sheer luck (and a considerable amount of it) that zink has worked at all until this point.

Don't believe me? Here's a list of just some of the hacks that are currently in use by zink to handle support for these descriptor types, along with the reason(s) why they're needed:

Hack | Reason It's Used | Bad Because?
-----|------------------|-------------
iterating the list of variables backwards | this indexing vaguely matches the value used by shaders to access the descriptor | only works coincidentally for as long as nothing changes this ordering, and explodes entirely with GL-SPIRV
skipping non-array variables with data.location > 0 | these are (usually) explicit references to components of a block BO object in a shader | sometimes they're the only reference to the BO, and skipping them means the whole BO interface gets skipped
using different indexing for SSBO variables depending on whether data.explicit_binding is set | this (sometimes) works to fix indexing for SSBOs with random bindings and also atomic counters | the value is set randomly by other optimization passes and so it isn't actually reliable
atomic counters are identified by using !strcmp(glsl_get_type_name(var->interface_type), "counters") | counters get converted to SSBOs, but they require different indexing in order to be accessed correctly | c'mon.
runtime arrays (array[]) are randomly tacked onto SPIRV SSBO variables based on the variable type | fixes atomic counter array access and the SSBO length() method | not actually needed most of the time

And then there's this monstrosity that's used for linking up SSBO variable indices with their instruction's access value (comments included for posterity):

unsigned ssbo_idx = 0;
if (!is_ubo_array && var->data.explicit_binding &&
    (glsl_type_is_unsized_array(var->type) || glsl_get_length(var->interface_type) == 1)) {
    /* - block ssbos get their binding broken in gl_nir_lower_buffers,
     *   but also they're totally indistinguishable from lowered counter buffers which have valid bindings
     * hopefully this is a counter or some other non-block variable, but if not then we're probably fucked
     */
    ssbo_idx = var->data.binding;
} else if (base >= 0) {
   /* we're indexing into a ssbo array and already have the base index */
   ssbo_idx = base + i;
} else {
   if (ctx->ssbo_mask & 1) {
      /* 0 index is used, iterate through the used blocks until we find the first unused one */
      for (unsigned j = 1; j < ctx->num_ssbos; j++) {
         if (!(ctx->ssbo_mask & (1 << j))) {
            /* we're iterating forward through the blocks, so the first available one should be
             * what we're looking for
             */
            base = ssbo_idx = j;
            break;
         }
      }
   } else
      /* we're iterating forward through the ssbos, so always assign 0 first */
      base = ssbo_idx = 0;
   assert(ssbo_idx < ctx->num_ssbos);
}
ctx->ssbos[ssbo_idx] = var_id;
ctx->ssbo_mask |= 1 << ssbo_idx;
ctx->ssbo_vars[ssbo_idx] = var;

Does it work?

Amazingly, yes, it does work the majority of the time.

But is this really how we should live our lives?

A Methodology To Live By

As the great compiler-warrior Jasonus Ekstrandimus once said, "Just Delete All The Code".

Truly this is a pivotal revelation, one that can induce many days of deep thinking, but how can it be applied to this scenario?

Today I present the latest in zink code deletion: a NIR pass that deletes all the broken variables and makes new ones.


Let's get into it.

uint32_t ssbo_used = 0;
uint32_t ubo_used = 0;
uint64_t max_ssbo_size = 0;
uint64_t max_ubo_size = 0;
bool ssbo_sizes[PIPE_MAX_SHADER_BUFFERS] = {false};

if (!shader->info.num_ssbos && !shader->info.num_ubos && !shader->num_uniforms)
   return false;
nir_function_impl *impl = nir_shader_get_entrypoint(shader);
nir_foreach_block(block, impl) {
   nir_foreach_instr(instr, block) {
      if (instr->type != nir_instr_type_intrinsic)
         continue;

      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
      switch (intrin->intrinsic) {
      case nir_intrinsic_store_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[1]));
         break;

      case nir_intrinsic_get_ssbo_size: {
         uint32_t slot = nir_src_as_uint(intrin->src[0]);
         ssbo_used |= BITFIELD_BIT(slot);
         ssbo_sizes[slot] = true;
         break;
      }
      case nir_intrinsic_ssbo_atomic_add:
      case nir_intrinsic_ssbo_atomic_imin:
      case nir_intrinsic_ssbo_atomic_umin:
      case nir_intrinsic_ssbo_atomic_imax:
      case nir_intrinsic_ssbo_atomic_umax:
      case nir_intrinsic_ssbo_atomic_and:
      case nir_intrinsic_ssbo_atomic_or:
      case nir_intrinsic_ssbo_atomic_xor:
      case nir_intrinsic_ssbo_atomic_exchange:
      case nir_intrinsic_ssbo_atomic_comp_swap:
      case nir_intrinsic_ssbo_atomic_fmin:
      case nir_intrinsic_ssbo_atomic_fmax:
      case nir_intrinsic_ssbo_atomic_fcomp_swap:
      case nir_intrinsic_load_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      case nir_intrinsic_load_ubo:
      case nir_intrinsic_load_ubo_vec4:
         ubo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      default:
         break;
      }
   }
}

The start of the pass iterates over the instructions in the shader. All UBOs and SSBOs that are used get tagged into a bitfield of their index, and any SSBOs which have the length() method called are similarly tagged.

nir_foreach_variable_with_modes(var, shader, nir_var_mem_ssbo | nir_var_mem_ubo) {
   const struct glsl_type *type = glsl_without_array(var->type);
   if (type_is_counter(type))
      continue;
   unsigned size = glsl_count_attribute_slots(type, false);
   if (var->data.mode == nir_var_mem_ubo)
      max_ubo_size = MAX2(max_ubo_size, size);
   else
      max_ssbo_size = MAX2(max_ssbo_size, size);
   var->data.mode = nir_var_shader_temp;
}
NIR_PASS_V(shader, nir_remove_dead_variables, nir_var_shader_temp, NULL);

Next, the existing SSBO and UBO variables get iterated over. A maximum size is stored for each type, and then the variable mode is set to temp so it can be deleted. These variables aren't actually used by the shader anymore, so this is definitely okay.


if (!ssbo_used && !ubo_used)
   return false;

Early return if it turns out that there's not actually any UBO or SSBO use in the shader, and all the variables are gone to boot.

struct glsl_struct_field *fields = rzalloc_array(shader, struct glsl_struct_field, 2);
fields[0].name = ralloc_strdup(shader, "base");
fields[1].name = ralloc_strdup(shader, "unsized");

The new variables are all going to be the same type, one which matches what's actually used during SPIRV translation: a simple struct containing an array of uints, aka base. SSBO variables which need the length() method will get a second struct member that's a runtime array, aka unsized.
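In C-like notation the recreated types look roughly like this (illustrative only; MAX_UBO_SIZE and MAX_SSBO_SIZE stand in for the computed maxima):

```c
/* UBO slots, and SSBO slots that never call length() */
struct slot       { uint32_t base[MAX_UBO_SIZE * 4]; };

/* SSBO slots where length() is used get the runtime array appended */
struct sized_slot { uint32_t base[MAX_SSBO_SIZE * 4]; uint32_t unsized[]; };
```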

if (ubo_used) {
   const struct glsl_type *ubo_type = glsl_array_type(glsl_uint_type(), max_ubo_size * 4, 4);
   fields[0].type = ubo_type;
   u_foreach_bit(slot, ubo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ubo_slot_%u", slot);
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ubo, glsl_struct_type(fields, 1, "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

If there's a valid bitmask of UBOs that are used by the shader, the index slots get iterated over, and a variable is created for each slot using the same type. The size is determined by the size of the biggest UBO variable that previously existed, which ensures that there won't be any errors or weirdness with access past the boundary of the variable. All the GLSL compilation and NIR passes to this point have already handled bounds detection, so this is also fine.

if (ssbo_used) {
   const struct glsl_type *ssbo_type = glsl_array_type(glsl_uint_type(), max_ssbo_size * 4, 4);
   const struct glsl_type *unsized = glsl_array_type(glsl_uint_type(), 0, 4);
   fields[0].type = ssbo_type;
   u_foreach_bit(slot, ssbo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ssbo_slot_%u", slot);
      if (ssbo_sizes[slot])
         fields[1].type = unsized;
      else
         fields[1].type = NULL;
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ssbo,
                                              glsl_struct_type(fields, 1 + !!ssbo_sizes[slot], "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

SSBOs are almost the same, but as previously mentioned, they also get a bonus member if they need the length() method. The GLSL compiler has already pre-computed the adjustment for the value that will be returned by length(), so it doesn't actually matter what the size of the variable is anymore.

And that's it! The entire encyclopedia of hacks can now be removed, and I can avoid ever having to look at any of this again.

25 Feb 2021 12:00am GMT

23 Feb 2021


Robert Foss: Upstream camera support for Qualcomm platforms

Linaro has been working together with Qualcomm to enable camera support on their platforms since 2017. The Open Source CAMSS driver was written to support the ISP IP-block with the same name that is present on Qualcomm SoCs coming from the smartphone space.

The first development board targeted by this work was the DragonBoard 410C, which was followed in 2018 by DragonBoard 820C support. Recently support for the Snapdragon 660 SoC was added to the driver, which will be part of the v5.11 Linux Kernel release. These SoCs all contain the CAMSS (Camera SubSystem) version of the ISP architecture.

Currently, support for the ISP found in the Snapdragon 845 SoC and the DragonBoard 845C is in the process of being upstreamed to the mailing lists. Having …

23 Feb 2021 1:54pm GMT

19 Feb 2021


Roman Gilg: Window Kindergarten

In the last post about KWinFT's Windowing Revolution I promised follow-up articles with detailed explorations of two elements of that revolution, which due to their complexity deserve such.

One of them was a new way how Wayland subsurfaces are managed inside KWinFT. Accompanying the 5.21 release of KWinFT this week, which was made available in sync with the KDE Plasma release, let me live up to my promise and start with an exploration of that.

But since even this topic alone is overly complex with a lot of windowing history behind it, we will split it up further and in this first article only look at subsurfaces and related concepts from a high level but without yet looking at the new and improved implementation in KWinFT.

On a high level the notions we are dealing with can always be interpreted as some form of parent-child relation between windows, on Wayland just as much as on X11. We will see that this is a very powerful mental model.

Childhood Legacy

The idea that a window can have children is old. X11 has known that for a long time. It is in fact a central concept in the protocol and what it comes down to is a single, global window tree.

This tree, stored in the X Server, forms a simple hierarchical structure of all windows starting with a generic root window at the top while every child window is contained inside its parent. This means also that the root window spans the whole visible area over all screens. Child windows on the other hand can be further partitioned, moved and resized, ordered and even switch nodes in the tree.

VLC playing a video in another child window.

While the details of that windowing logic can become complex very quickly, I think the general idea behind the window tree itself is simple to grasp. You still might want to read the window tree chapter in Xplain to get a feel for it.
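Schematically, a session with two applications might produce a tree like this (a simplified illustration):

```
root window (spans all screens)
├─ toplevel: VLC main window
│  └─ child: video view
└─ toplevel: file manager window
   ├─ child: toolbar
   └─ child: folder view
```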

Now as you can imagine such a hierarchical structure of windows is not something only X11 employs. But before we come to Wayland, there is one more relation between windows on X11, similar to the parent-child relation in the window tree, that we should explore.

Distant Relatives

A second concept of window relation in X11 is that of transient windows. It is similar to the parent-child relation in the global window tree. But while an X11 window always has a parent, with the exception of the root window at the top of the tree, and so in particular is a child of that parent, transient windows are much less often encountered in the wild.

Jumping the Tree

The parent-child relation in the window tree is simple to understand. Transient windows break this model up in some way and build links between windows across different branches of the tree. That sounds complicated, and indeed it sometimes is. In fact I would argue already their name is misleading.

But before coming to that, let us first hold onto what is unambiguous. Windows that are direct children of the root window are special, they are often called toplevel windows because of that. And a child of such a window is obviously - per this definition - not a toplevel window.

Transient windows in general come as toplevel windows. Seen as part of the window tree, they are siblings of the window they are transient for.

Dialog as a transient window for the Kate window.

Typical examples of such are dialog windows, shown when your application asks you if you really want to do what you just tried to do. The client sets such a helper-window as a transient for the main window of the application.

The window manager still paints decorations around the transient window and you can move it by grabbing its window bar, but the window manager normally ensures that it can't be stacked below the window it is a transient for. And if for example you switch the main application window to a different virtual desktop, the transient often follows the main window to that other desktop. This is how KWinFT does it.

What They Are Transient For

You may have noticed that in the last paragraphs I've used the word construct "transient for" an awful lot.

What defines a window to become transient is not specified in the X11 protocol itself, but in the Inter-Client Communication Conventions Manual (ICCCM) in the form of a window property with this very name.

This name feels weird, doesn't it? Let's try to understand what it means: A window W2 has the WM_TRANSIENT_FOR property set to another window W1 if window W2 is a transient window for window W1.

For example when a file explorer opens a dialog to ask you for a new name for a directory that it wants to create, the dialog window is W2 and the file explorer window, that was there before, is W1.
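In code, a toolkit establishes this relation by setting the property on W2, typically through the Xlib convenience call (a pseudocode-level sketch; the display and window variables are assumed to already exist):

```c
/* dialog = W2, file_explorer = W1; sets WM_TRANSIENT_FOR on the dialog */
XSetTransientForHint(display, dialog, file_explorer);
```

The window manager then reads the property back to learn about the relation.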

According to Wikipedia, we call such a relation transient at all because W2 only exists as long as W1 exists. But child windows in the window tree are also only mapped, i.e. visible and by that existing in practice, as long as their parents are mapped. So why are they not called transient windows too?

In theory nothing forces the user to close the dialog either; it could be kept around forever. In the case of a dialog that wouldn't make much sense from a practical viewpoint, but there are other clients with permanent transient windows, like a Plasma applet when "pinned" to stay open. The bottom line is that this name feels wrong.

And then there's its indirect nature! The "transient for" property names the window that the other window, the one carrying the property, is a transient for. That is difficult to understand because we usually don't name things by what another thing does to them.

Alfred Krupp, an important "work for" in the 19th century.

For example, we don't call the owner of a company the "work for" just because there are employees working in that company. We say instead that the owner is an employer, a name which comes from what the owner does to the employees and not the other way around.

But besides the weird naming scheme the technicalities are clear: we set the property to another window and by that the relation is established.

As you can set the property only once per window, but many windows can point it at the same window, the relation is one-to-many. That feels natural as it reminds us of a single node with its children in a tree structure. But the X ecosystem of course has means to generalize that concept so it becomes far more difficult to understand.

Group Transients

Complexity rises with the introduction of so-called group transients. They are defined by the Extended Window Manager Hints (EWMH) specification in the most ridiculous way, which warrants a direct quotation:

If the WM_TRANSIENT_FOR property is set to None or Root window, the window should be treated as a transient for all other windows in the same group. It has been noted that this is a slight ICCCM violation, but as this behavior is pretty standard for many toolkits and window managers, and is extremely unlikely to break anything, it seems reasonable to document it as standard.


This makes transient relations between windows much more complicated, as a window might not be transient for a single other window but for many. With that, window transiency becomes a many-to-many relation.

A rare example of a group transient: Latte Dock's settings window.

I have yet to see a use case for group transients that could not also be solved with normal transients, but as some X11 clients expect group transients to be a thing, KWinFT needs to support them.

While this was done in the past in KWin in a separate fashion basically doubling the implementation cost, the recent windowing refactor has unified it under a single mechanism together with usual X11 transients and even all kinds of child windows on Wayland.

But before we finally shift our attention over to Wayland I have to get some words off my chest about the naming of things.

Nomenclature in Perpetuity

Naming things is difficult. Technically minded people in particular often underestimate the challenge just as much as its importance. In contrast, the humanities have a long tradition of describing and criticizing concepts of notions and language.

The importance is emphasised in the German language, where the word for "notion" is "Begriff", which is related to "begreifen", meaning to understand something. "Etwas auf den Begriff bringen" means not only to find a name for something but to understand it.

When we name something we should remember that, not only for ourselves, but also for other people who have to understand and memorize the terminology we defined. This is important for open source projects that rely on voluntary contributions just as much as for companies as it lowers costs when onboarding new employees.

But that doesn't mean a once established nomenclature has to be perfect. On the contrary, I would say that is impossible, and as time moves on we should verify that the terminology still makes sense, and revise it if not.

The reality though is that this is rarely done systematically. Instead the terminology stays frozen in time while its meaning shifts naturally. That is not necessarily an issue either, but one should be aware of it, and if the gap between the intuitive and the original meaning of a notion becomes too wide, one must consider redefining or at least annotating it. The notions of child windows and transients are good examples of that.

Transiently Incomprehensible

We learned above what it means for a window to be a transient for another window. This included a short discussion of why "transient for" is a silly name for the window that is not the transient in this relation. But that does not answer the question of why such an unusual naming scheme was used.

The real story behind it probably only a few people alive can tell, as the concept seems to be very old. But I have a suspicion it went down like this:

  1. The better-fitting notion for a transient window, that of a child of some parent window, was already taken by a different concept, and we know by which one: the window tree. So a different name had to be invented.
  2. The first transient windows were indeed only very short-lived windows, so it made sense to base the new name on the temporal context.
  3. And last but definitely worst: while the "child" in that relationship was the transient, nobody had thought of a name for the counterpart. So it was just named the "transient for" later on.

That story makes sense to me as it assumes best intentions and a natural progression to the suboptimal status quo.

Better Names

The terminology in documents like the ICCCM of course won't change anymore, but here is how I bent the names internally when referencing them in KWinFT's code and how I will speak about them in the future. Take it also as some advice on how to name things yourself if you ever have to do it.

First off, "transient for" must go. With the story above we have an idea of what might have led to the creation of this abomination of a notion. Let's go back to that. We said the notion of a child window was already in use, although it would have been the far better notion. Why? One reason is obviously that it already comes with a name for its counterpart: parent.

Another reason is more fundamental: we directly have an image in our mind of what this notion means in purely logical terms, namely a relation between a more important primary entity and a secondary, dependent one. Such images are strong and we should use them whenever possible.

To cut it short: in KWinFT's code I just went with the following. I call windows with transient relationships between them transient children instead of just transients, and transient lead instead of transient for.

I opted for "lead" instead of "parent" because as a group transient a window might have several transient leads while the word "parent" would rather signal that there is only a single one.

Children of Tomorrow

A lot has been said about child windows on X11 now. But the interesting technology stack today is Wayland. So is the situation similarly tricky as on X11 with normal children, transients and group transients? From my experience luckily that is not the case, albeit interestingly the basic ideas are remarkably similar.

There is one big difference right away though: the concept of a single global window tree does not exist in the Wayland protocol. In particular there is no root window. This makes sense because clients don't have any knowledge about the global state of the compositor. The compositor itself might implement some form of a global tree, but that is of no relevance to the protocol.

On the other side what was in the past handled inside the X Server is now the responsibility of the window manager, in particular managing local replacements for the previously global window tree and its parent-child relations in the form of subsurfaces.


The previous all-encompassing system of parent-child relations via the global window tree has no equivalent in Wayland anymore, but in some cases there is still a need for such relations on a local basis, that means per window, or rather, in Wayland terminology, per surface. Subsurfaces are objects in the core protocol which do exactly that.

Their use case is similar to that of child windows in X11. Clients might use them to display certain UI elements, for example a drop-down menu. But their real power comes with views that use different buffer configurations, for example the video view of a media player and other controls around it.

Weston provides demos like this one, which tests subsurfaces.

We talked a lot about language when we looked at X11. The situation is much better on Wayland. "Subsurface" is a nice name for what the objects intend to do and specific enough to not take away generic names from other concepts like the parent-child relation of the global window tree in X11 did.

But the Wayland specs and by that also other documentation luckily still uses that metaphor of a parent with children when describing subsurfaces. So to understand them we can think in that terminology without hesitation.
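To give a rough feel for the client side, creating a subsurface boils down to a handful of requests (a pseudocode-level sketch; the subcompositor, the surfaces and all error handling are assumed to exist already):

```c
/* embed a video surface into the player's main surface */
struct wl_subsurface *sub =
    wl_subcompositor_get_subsurface(subcompositor, video_surface, main_surface);
wl_subsurface_set_position(sub, 16, 48); /* relative to the parent's top-left corner */
wl_subsurface_set_desync(sub);           /* let the video commit independently */
wl_surface_commit(main_surface);         /* position is applied with the parent's commit */
```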

For more information about subsurfaces read either the well-written descriptions in the core protocol or the subsurfaces chapter of the Wayland book linked above. If you are new to Wayland development though, you should rather start with reading up on Wayland surfaces.

Toplevel Children and Foreign Surfaces

We said toplevel windows on X11 are windows that are direct children of the root window.

For an X11 window manager these windows are the only ones of importance. These are the windows it may move and resize, and in general whose state it manages, while it generally does not care about all other windows. And as noted, these other windows are, per definition, children of one of these toplevel windows or of the root window itself.

In this light, the xdg-shell protocol extension provides very similar objects for Wayland with the xdg-toplevel type. These objects behave like the toplevels in X11, in that the Wayland window manager may move and resize them on user input or following other events.

And while subsurfaces are the spiritual successors of the classical parent-child relation of X11, setting a parent on an xdg-toplevel object reminds us of the previously discussed transient windows.

Here, as before with transient windows, we establish relations between windows of the same kind, in that the windows are independent toplevel windows. Like with transient windows, the window manager is supposed to stack these windows relative to each other, with one of them above the other.

Additionally there is the extension xdg-foreign-unstable-v2, which allows setting the relation across process boundaries. This is important for example for Flatpak apps and other sandboxed applications. It builds though upon the parent-child relation in the xdg-shell protocol. Thanks to Simon Ser for pointing this out!

Coming quickly back to the discussion about terminology, it is great to see that the request in xdg-shell to establish a parent-child relation between xdg-toplevels is simply called set_parent. And because of the object-oriented design of Wayland, this name is not used up by that, but we can still use it in other protocol extensions when it makes sense.
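For completeness, that request is a one-liner in a client (a sketch; the two xdg_toplevel objects are assumed to exist already):

```c
/* mark the dialog's toplevel as a child of the main window's toplevel */
xdg_toplevel_set_parent(dialog_toplevel, main_toplevel);
```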

In the case of the second protocol extension the adjective foreign is very descriptive about what the extension is meant for. And the document, which specifies the extension, keeps using the parent-child metaphor to describe its usage. This is also great, as it allows us to keep using our mental model.

What is not good is that the request to set the child of a parent surface is called set_parent_of. I wouldn't be surprised if the name was inspired by the ominous "transient for" construct. Old habits die hard.

But in the next version of that protocol let us call that request just set_child, ok?

You might ask now: but what about group transients? I am happy to tell you, they are not a thing on Wayland.

One Last Thing: Pop It!

There is one more type of child window on Wayland, which should get mentioned but does not have a direct equivalence on X11: popups, usually in the form of context menus.

On Wayland these can be realized with the xdg-shell protocol extension providing the xdg-popup type.

Now you might say: "Wait a minute, context menus are also a thing on X11!" That's true, but on X11 they are not really child windows of anything other than the root window. They are placed by the client directly in global coordinates and superimposed by the X Server as override-redirect windows. The window manager just ignores them.

On Wayland that is different, because for one the window manager is the server and secondly clients can not place surfaces in global coordinates.

Wayland popups are placed and can be moved by the compositor, here together with the infamous wobbly windows effect.

A popup must therefore be placed relative to some other surface the client knows about. For example a right-click context menu will be opened directly at the position of the cursor. A hamburger menu opens relative to the visual boundaries of the hamburger button.

The xdg-shell protocol provides means to allow such initial placement and later correction. Admittedly that can become complex pretty quickly.
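As a pseudocode-level sketch of what a client does (object names are illustrative), opening a menu relative to a button uses an xdg_positioner:

```c
/* anchor the popup to the button's rectangle in the parent surface */
struct xdg_positioner *pos = xdg_wm_base_create_positioner(wm_base);
xdg_positioner_set_anchor_rect(pos, btn_x, btn_y, btn_w, btn_h);
xdg_positioner_set_anchor(pos, XDG_POSITIONER_ANCHOR_BOTTOM_LEFT);
xdg_positioner_set_gravity(pos, XDG_POSITIONER_GRAVITY_BOTTOM_RIGHT);
/* let the compositor flip the menu above the button if it wouldn't fit below */
xdg_positioner_set_constraint_adjustment(pos, XDG_POSITIONER_CONSTRAINT_ADJUSTMENT_FLIP_Y);
struct xdg_popup *popup = xdg_surface_get_popup(menu_xdg_surface, parent_xdg_surface, pos);
xdg_positioner_destroy(pos);
```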

The important fact to remember, though, is that on Wayland the xdg-popup type also establishes a parent-child relation with another window as its parent, while popups on X11 did not.

Recap and Next

In this introduction we took a tour through our "Window Kindergarten". We learned about the different kinds of window children by looking at their basic definitions in the X11 protocol and other X specifications and in the Wayland protocols with its extensions. We also mentioned for each of them what their use case is.

The mental model of window children and parents is powerful and transcends the mere technicalities of each individual protocol. Due to historic circumstances or just inexperience in developing intuitive terminologies, these fundamental ideas are sometimes more difficult to understand than necessary. Luckily this has improved with Wayland.

There seem to be two complementary basic types of children:

* children that are embedded into their parent and positioned relative to it: the window tree's child windows on X11, and subsurfaces and popups on Wayland;
* children that are themselves toplevel windows linked to one or more leads: transients and group transients on X11, and xdg-toplevels with a parent set (including foreign surfaces) on Wayland.

Additionally, on X11 we need a more complex implementation because of group transients, while on Wayland subsurfaces and foreign surfaces are more straightforward. On the other side, on Wayland the window manager needs to take over some work for subsurfaces which was in the past handled by the X Server.

So we learned quite a lot about child windows. In the upcoming article, which is targeted to be published next week, we will use this knowledge to take an in-depth look at the recent innovations in KWinFT when handling X11 transients, Wayland subsurfaces and foreign surfaces in a unified way.

19 Feb 2021 3:00pm GMT

Mike Blumenkrantz: Notes

Quickly: ES 3.2

I've been getting a lot of pings over the past week or two about ES 3.2 support.

Here's the deal.

It's not happening soon. Probably.

Zink currently supports every 3.2 extension except for photoshop. There's two ways to achieve support for that extension at present:

* Yes, I know that Nvidia supports advanced blend, but zink+nvidia is not currently capable of doing ES of any version, so that's not testable.

So in short, it's going to be a while.

But there's not really a technical reason to rush towards full ES 3.2 anyway other than to fill out a box on mesamatrix. If you have an app that requires 3.2, chances are that it probably doesn't really require it; few apps actually use the advanced blend extension, and so it should be possible for the app to require only 3.1 and then verify the presence of whatever 3.2-based extensions it may use in order to be more compatible.

Of course, this is unlikely to happen. It's far easier for app developers to just say "give me 3.2" if maybe they just want geometry shaders, and I wouldn't expect anyone is going to be special-casing things just to run on zink.

Nonetheless, it's really not a priority for me given the current state of the Vulkan ecosystem. As time moves on and various extensions/features become more common that may change, but for now I'm focusing on things that are going to be the most useful.

19 Feb 2021 12:00am GMT

Ben Widawsky: bwidawsk.net 2.0


After a lot of effort over short stints in the last several months, I have completed my blog migration to Lektor in the hopes that when I migrate again in the future, it won't be as painful.

Despite my efforts, many old posts might not be perfect. This is a job for the wayback machine.

In case you're curious, I did this primarily for one reason (and lots of smaller ones): I wanted my data back. Wordpress is an open source blogging platform with huge adoption. It has a very large plugin ecosystem and is very actively updated and maintained. While security issues have come up here and there, at some point automatic updates became an option and that helped a bit. In 2010 it was the obvious choice.

If you've gained anything from my blog posts, you should thank Wordpress. Wordpress' ease of setup and relative ease of use is a big reason I was able to author things as well as I did.

So what happened - plugins

I wanted my data back. It was a self-hosted instance and I had all my information stored in a SQL database. Obviously I never lost my data, but...


I used plugins for my tables (multiple plugins). I used plugins for code highlighting. Plugins for LaTeX. Plugins for table of contents, social media integration, post tagging, image captioning and formatting, spelling. You get the idea. The result of all this was that I ended up with blog posts that were entirely useless in their text-only form, with plugins storing the data in non-standard places so it could be processed and look fancy.

The WYSIWYG editor interface was a huge plus for me. I spent all day in front of a terminal breaking graphics and display (meaning I really was in front of an 80x24 terminal at times). I didn't want to have to deal with fanciful layout engines or styles. Those plugins ended up destroying the WYSIWYG editor experience and I ended up doing everything in quasi markdown anyway.

Plugins themselves introduced security issues, even when they weren't intentionally malicious.

What was next?

These static site generators seemed really appealing as a solution to this problem. Everything in markdown. Assets stored together in the filesystem. Jekyll is obviously hugely popular. Hugo, Pelican, Gatsby, and Sphinx are all generators I considered. The number of static site generators is staggering. I wish I could remember what made me choose Lektor, but I can't - Python-based was my only requirement.

Python, because I wanted a platform that did most of what I wanted but was extensible by me if absolutely necessary.

Migrating was definitely a lot of work. I was tempted several times to abort the effort and just rely on the Wayback Machine. Ultimately I decided that migrating the posts would be a good way to learn how well the platform would meet my needs (that being an annual blog post or so).

There are definitely some features I miss that I may or may not get to.

  1. Comments. There is Disqus integration. I'm not convinced this is what I want.
  2. Post grouping. There are categories. It was too complicated for me to figure out in a short time, so I'm punting on it for now.
  3. I'd really like to not have to learn CSS and jinja2. I can scrape by a bit, but changing anything drastic takes a lot of effort for me.


I followed this. I did have to make some minor changes specific to my needs and posts did still require some touchups, in large part due to plugins and my obsessive use of SVG.

See you soon

Now that I'm back, I hope to post more often. Next up will be a recap of some of the pathfinding projects I worked on after FreeBSD enabling.

19 Feb 2021 12:00am GMT

18 Feb 2021


Peter Hutterer: A pre-supplied "custom" keyboard layout for X11

Last year I wrote about how to create a user-specific XKB layout, followed by a post explaining that this won't work in X. But there's a pandemic going on, which is presumably the only reason people haven't all switched to Wayland yet. So it was time to figure out a workaround for those still running X.

This Merge Request (scheduled for xkeyboard-config 2.33) adds a "custom" layout to the evdev.xml and base.xml files. These XML files are parsed by the various GUI tools to display the selection of available layouts. An entry in there will thus show up in the GUI tool.

Our rulesets, i.e. the files that convert a layout/variant configuration into the components to actually load, already have wildcard matching [1]. So the custom layout will resolve to the symbols/custom file in your XKB data dir - usually /usr/share/X11/xkb/symbols/custom.

This file is not provided by xkeyboard-config. It can be created by the user though and whatever configuration is in there will be the "custom" keyboard layout. Because xkeyboard-config does not supply this file, it will not get overwritten on update.

From XKB's POV it is just another layout and it thus uses the same syntax. For example, to override the +/* key on the German keyboard layout with a key that produces a/b/c/d on the various Shift/Alt combinations, use this:

xkb_symbols "basic" {
    include "de(basic)"
    key <AD12> { [ a, b, c, d ] };
};

This example includes the "basic" section from the symbols/de file (i.e. the default German layout), then overrides the 12th alphanumeric key from left in the 4th row from bottom (D) with the given symbols. I'll leave it up to the reader to come up with a less useful example.

There are a few drawbacks:

So overall, it's a hack[2]. But it's a hack that fixes real user issues and given we're talking about X, I doubt anyone notices another hack anyway.

[1] If you don't care about GUIs, setxkbmap -layout custom -variant foobar has been possible for years.
[2] Sticking with the UNIX principle, it's a hack that fixes the issue at hand, is badly integrated, and weird to configure.

18 Feb 2021 1:57am GMT

Mike Blumenkrantz: Roadmapping

What's Next

It's been a busy week. The CTS fixes and patch drops are coming faster and faster, and progress is swift. Here's a quick note on some things that are on the horizon.

Features Landing Soon

Zink's in a tough spot right now in master. GL 4.6 is available, but there are still plenty of things that won't work, e.g., running anything at 60fps. These are things I expect (hope) to see land in the repo in the next month or so:

All told, just as an example, Unigine Heaven (which can now run in color!) should see roughly a 100% performance improvement (possibly more) once this is in, and I'd expect substantial performance gains across the board.

Will you be able to suddenly play all your favorite GL-based Steam games?


I can't even play all your favorite GL-based Steam games yet, so it's a long ways off for everyone else.

But you'll probably be able to get surprisingly good speed on what things you can run considering the amount of time that will pass between hitting 4.6 and these patchsets merging.

Features I'm Working On

I spent some time working on Wolfenstein over the past week, but there's some non-zink issues in the way, so that's on the backburner for a while. Instead, I've turned my attention to CTS and begun unloading a dumptruck of resulting fixes into the codebase.

There comes a time when performance is "good enough" for a while, and, after some intense optimizing since the start of the year, that time has come. So now it's back to stabilization mode, and I'm now aiming to have a vaguely decent pass rate in the near term.

Hopefully I'll find some time to post some of the crazy bugs I've been hunting, but maybe not. Time will tell.

18 Feb 2021 12:00am GMT

11 Feb 2021


Mike Blumenkrantz: Two With One Blow

By Now

…or in the very near future, the ol' bumperino will have landed, putting zink at GL 4.5.

But that's boring, so let's check out something very slightly more interesting.

Steam Games

What are they and how do they work?

I'm not going to answer these questions, but I am going to be looking into getting them working on zink.

To that end, as I hinted at yesterday, I began with Wolfenstein: The New Order, as chosen by Daniel Schuermann, the lucky winner of the What Steam Game Should Zink Use As Its Primary Test Case And Benchmark contest that was recently held.

Early tests of this game were unimpressive. That is to say I got an immediate crash. It turns out that having the GL compatibility context restricted to 3.0 is bad for getting AAA games running, so zink-wip now enables 4.6 compat contexts.

But then I was still getting a crash without any clear error message. Suddenly, I was back in 2004 trying to figure out how to debug wine apps.

Things are much simpler now, however. PROTON_DUMP_DEBUG_COMMANDS enables dumping scripts for debugging from steam, including one which attaches a debugger to the game. This solved the problem of getting a debugger in before the almost-immediate crash, but it didn't get me closer to a resolution.

The problem now is that I'd attached a debugger to the in-wine process, which is just a sandbox for the Windows API. What I actually wanted was to attach to the wine process itself so I could see what was going on in the driver.

gdb --pid=$(pidof WolfNewOrder_x64.exe) ended up being what I needed, but this was complicated by the fact that I had to attach before the game crashed and without triggering the steam error reporter. So in the end, I had to attach using the proton script, then while it was paused, attach to the outer process for driver debugging. But then also I had to attach to the outer process after zink was loaded, so it was a real struggle.

Then, as per usual, another problem: I had no symbols loaded because proton runs a static binary. After cluelessly asking around in the DXVK discord, @Herbert helpfully provided a gdb python script for proton in-process debugging that I was able to repurpose for my needs. The gist (haha) of the script is that it scans /proc/$pid/maps and then manually loads the required library files.
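The discovery step can be sketched in shell (a simplified stand-in for the actual gdb script, which additionally computes each library's load address and issues gdb's add-symbol-file command; using our own shell's PID here is purely illustrative):

```shell
# Walk /proc/<pid>/maps and list the file-backed mappings -- the same
# list a symbol-loading gdb script iterates over. Field 6 of each maps
# line is the pathname, present only for file-backed mappings.
pid=$$
awk '$6 ~ /^\// { print $6 }' "/proc/$pid/maps" | sort -u
```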

At last, I had attached to the game, I had symbols, and I could see that I was hitting a zink assert I'd added to catch int overflows. A quick one-liner to change the order of a calculation fixed that, and now I'm on to an entirely new class of bugs.

11 Feb 2021 12:00am GMT

10 Feb 2021


Martin Peres: Setting up a CI system part 2: Generating and deploying your test environment

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Check out part 1 where we expose the context/high-level principles of the whole CI system, and make the machine fully controllable remotely (power on, OS to boot, keyboard/screen emulation using a serial console).

In this article, we will start demystifying the boot process, and discuss different ways to generate and boot an OS image along with a kernel for your machine. Finally, we will introduce boot2container, a project that makes running containers on bare metal a breeze!

This work is sponsored by the Valve Corporation.

Generating a kernel & rootfs for your Linux-based testing

To boot your test environment, you will need to generate the following items:

* a kernel;
* an initramfs (optional);
* a rootfs.

The initramfs is optional because the drivers and their firmware can be built into the kernel directly.

Let's not generate these items just yet, but instead let's look at the different ways one could generate them, depending on their experience.

The embedded way

Buildroot's logo

If you are used to dealing with embedded devices, you are already familiar with projects such as Yocto or Buildroot. They are well-suited to generating a tiny rootfs, which can be useful for netbooted systems such as the one we set up in part 1 of this series. They usually allow you to describe everything you want on your rootfs, then will configure, compile, and install all the wanted programs in the rootfs.

If you are wondering which one to use, I suggest you check out the presentation from Alexandre Belloni / Thomas Petazzoni which will give you an overview of both projects, and help you decide on what you need.



The Linux distribution way

Debian Logo, www.debian.org

If you are used to installing Linux distributions, your first instinct might be to install your distribution of choice in a chroot or a Virtual Machine, install the packages you want, and package the folder/virtual disk into a tarball.

Some tools such as debos or virt-builder make this process relatively painless, although they will compile neither an initramfs nor a kernel for you.

Fortunately, building the kernel is relatively simple, and there are plenty of tutorials on the topic (see ArchLinux's wiki). Just make sure to compile modules and firmware in the kernel, to avoid the complication of using an initramfs. Don't forget to also compress your kernel if you decide to netboot it!
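To illustrate that advice, the relevant kernel configuration could look like the following fragment (a sketch: the driver and firmware file are examples for an Intel GPU, not something this article prescribes):

```
# Build the GPU driver into the kernel image (=y rather than =m)
CONFIG_DRM_I915=y
# Embed the firmware the driver needs, so that no initramfs is required
CONFIG_EXTRA_FIRMWARE="i915/kbl_dmc_ver1_04.bin"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
```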



The refined distribution way: containers

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc.

Containers are an evolution of the old chroot trick, made secure thanks to the addition of multiple namespaces to Linux. Containers and their runtimes have addressed pretty much all the cons of the "Linux distribution way", and have become a standard way to share applications.

On top of generating a rootfs, containers also allow setting environment variables and controlling the command line of the program, and they have a standardized transport mechanism which simplifies sharing images.

Finally, container images are constituted of cacheable layers, which can be used to share base images between containers, and also speed up the generation of the container image by only re-computing the layer that changed and all the layers applied on top of it.

The biggest drawback of containers is that they are usually meant to be run on pre-configured hosts. This means that if you want to run the container directly, you will need to make sure to include an init script or install systemd in your container, and set it as the entrypoint of the container. It is however possible to perform these tasks before running the container, as we'll explain in the following sections.



Deploying and booting a rootfs

Now we know how we could generate a rootfs, so the next step is to be able to deploy and boot it!

Challenge #1: Deploying the Kernel / Initramfs

There are multiple ways to deploy an operating system:

* Flash and reboot: write the OS image to the machine's local storage, then reboot into it;
* Netboot: download the kernel/initramfs over the network at every boot.

The former solution is great at preventing the bricking of a device that depends on an Operating System to be flashed again, as it enables checking the deployment on the device itself before rebooting.

The latter solution enables diskless test machines, which is an effective way to reduce state (the enemy #1 of reproducible results). It also enables a faster deployment/boot time as the CI system would not have to boot the machine, flash it, then reboot. Instead, the machine simply starts up, requests an IP address through BOOTP/DHCP, downloads the kernel/initramfs, and executes the kernel. This was the solution we opted for in part 1 of this blog series.
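The server side of such a netboot setup can be as small as a single dnsmasq instance. A hypothetical minimal configuration (interface name, address range, and paths are placeholders) might look like:

```
# /etc/dnsmasq.d/netboot.conf -- illustrative only
interface=eth1                 # NIC facing the test machines
dhcp-range=192.168.0.10,192.168.0.100,12h
dhcp-boot=pxelinux.0           # bootloader handed to PXE clients
enable-tftp
tftp-root=/srv/tftp            # holds pxelinux.0, the kernel and initramfs
```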

Whatever solution you end up picking, you will now be presented with your next challenge: making sure the rootfs remains the same across reboots.

Challenge #2: Deploying the rootfs efficiently

If you have chosen the Flash and reboot deployment method, you may be prepared to re-flash the entire Operating System image every time you boot. This would make sure that the state of a previous boot won't leak into following boots.

This method can however become a big burden on your network when scaled to tens of machines, so you may be tempted to use a Network File System such as NFS to spread the load over a longer period of time. Unfortunately, using NFS brings its own set of challenges (how deep is this rabbit hole?):

So, instead of trying to spread the load, we could try to reduce the size of the rootfs by only sending the content that changed. For example, the rootfs could be split into the following layers:

Layers can be downloaded by the test machine, through a short-lived-state network protocol such as HTTP, as individual SquashFS images. Additionally, SquashFS provides compression which further reduces the storage/network bandwidth needs.

The layers can then be directly combined by first mounting the layers to separate folders in read-only mode (only mode supported by SquashFS), then merging them using OverlayFS. OverlayFS will store all the writes done to this file system into the workdir directory. If this work directory is backed up by a ramdisk (tmpfs) or a never-reused temporary directory, then this would guarantee that no information from previous boots would impact the new boots!
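The sequence described above could look roughly like this (a sketch requiring root; image and directory names are made up for illustration):

```shell
# Mount each SquashFS layer read-only (the only mode SquashFS supports)
mkdir -p /mnt/base /mnt/driver /mnt/rootfs /tmp/overlay
mount -t squashfs -o ro base.squashfs   /mnt/base
mount -t squashfs -o ro driver.squashfs /mnt/driver

# Back the writable layer with a tmpfs so nothing survives a reboot
mount -t tmpfs tmpfs /tmp/overlay
mkdir /tmp/overlay/upper /tmp/overlay/work

# Merge the layers with OverlayFS; all writes land in upperdir
mount -t overlay overlay \
    -o lowerdir=/mnt/driver:/mnt/base,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
    /mnt/rootfs
```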

If you are familiar with containers, you may have recognized this approach as what is used by containers: layers + overlay2 storage driver. The only difference is that container runtimes depend on tarballs rather than SquashFS images, probably because SquashFS is a Linux-only filesystem.

If you are anything like me, you should now be pretty tempted to simply use containers for the rootfs generation, transport, and boot! That would be a wise move, given that thousands of engineers have been working on them over the last decade or so, and whatever solution you may come up with will inevitably have even more quirks than these industry standards.

I would thus recommend using containers to generate your rootfs, as there are plenty of tools that will generate them for you, with varying degrees of complexity. Check out buildah, if Docker or Podman are too high-level for your needs!

Let's now brace for the next challenge, deploying a container runtime!

Challenge #3: Deploying a container runtime to run the test image

In the previous challenge, we realized that a great way to deploy a rootfs efficiently was to simply use a container runtime to do everything for us, rather than re-inventing the wheel.

This would enable us to create an initramfs which would be downloaded along with the kernel through the usual netboot process, and would be responsible for initializing the machine, connecting to the network, mounting the layer cache partition, setting the time, downloading a container, then executing it. The last two steps would be performed by the container runtime of our choice.

Generating an initramfs is way easier than one can expect. Projects like dracut are meant to simplify their creation, but my favourite has been u-root, coming from the LinuxBoot project. I generated my first initramfs in less than 5 minutes, so I was incredibly hopeful to achieve the outlined goals in no time!

Unfortunately, the first setback came quickly: container runtimes (Docker, or Podman) are huge (~150 to 300 MB), if we are to believe Alpine Linux's size of their respective packages and dependencies! While this may not be a problem for the Flash and reboot method, it is definitely a significant issue for the Netboot method which would need to download it for every boot.

Challenge #3.5: Minifying the container runtime

After spending a significant amount of time studying container runtimes, I identified the following functions:

Thus started my quest to find lightweight solutions that could do all of these steps... and to wonder just how deep this rabbit hole goes??

The usual executor found in the likes of Podman and Docker is runc. It is written in Golang, which compiles everything statically and leads to giant binaries. In this case, runc clocks in at ~12MB. Fortunately, a knight in shining armour came to the rescue, re-implemented runc in C, and named it crun. The final binary size is ~400 KB, and it is fully compatible with runc. That's good enough for me!

To download and unpack the rootfs from the container image, I found genuinetools/img which supports that out of the box! Its size was however much bigger than expected, at ~28.5MB. Fortunately, compiling it ourselves, stripping the symbols, then compressing it using UPX led to a much more manageable ~9MB!

What was left was to generate the container manifest according to the runtime spec. I started by hardcoding it to verify that I could indeed run the container. I was relieved to see it would work on my development machine, even though it failed in my initramfs. After spending a couple of hours diffing straces, poking a couple of sysfs/config files, and realizing that pivot_root does not work in an initramfs, I finally managed to run the container with crun run --no-pivot!

I was over the moon, as the only thing left was to generate the container manifest by patching genuinetools/img to generate it according to the container image manifest (like docker or podman do). This is where I started losing my grip: lured by the prospect of a simple initramfs solving all my problems, and being so close to the goal, I started free-falling down what felt like the deepest rabbit hole of my engineering career... Fortunately, after a couple of weeks, I emerged, covered in mud but victorious! Cue the gory battle log :)

When trying to access the container image's manifest in img, I realized that it was re-creating the layers and manifest, and thus was losing information such as entrypoint, environment variables, and other important parameters. After scouring through its source code and its 500 kLOC of dependencies, I came to the conclusion that it would be easier to start a project from scratch that would use Red Hat's image and storage libraries to download and store the container on the cache partition. I then needed to unpack the layers, generate the container manifest, and start runc. After a couple of days, ~250 lines of code, and tons of staring at straces to get it working, it finally did! Out was img, and the new runtime's size was under 10 MB \o/!

The last missing piece in the puzzle was performance-related: use OverlayFS to merge the layers, rather than unpacking them ourselves.

This is when I decided to have another look at Podman, saw that they have their own internal library for all the major functions, and decided to compile podman to try it out. The binary size was ~50 MB, but after removing some features, setting the -w -s LDFLAGS, and compressing it using upx --best, I got the final size to be ~14 MB! Of course, Podman is more than just one binary, so trying to run a container with it failed. However, after a bit of experimentation and stracing, I realized that running the container with --privileged --network=host would work using crun... provided we force-added the --no-pivot parameter to crun. My happiness was however short-lived, replaced by a MAJOR FACEPALM MOMENT:

After a couple of minutes of constant facepalming, I realized I was also relieved, as Podman is a battle-tested container runtime, and I would not need to maintain a single line of Go! Also, I now knew how deep the rabbit hole was, and we just needed to package everything nicely in an initramfs and we would be good. Success, at last!

Boot2container: Run your containers from an initramfs!

If you have managed to read through the article up to this point, congratulations! For those who just gave up and jumped straight to this section, I forgive you for teleporting yourself to the bottom of the rabbit hole directly! In both cases, you are likely wondering: where is this breeze you were promised in the introduction?

     Boot2container enters the chat.

Boot2container is a lightweight (sub-20 MB) and fast initramfs I developed that will allow you to ignore the subtleties of operating a container runtime and focus on what matters, your test environment!

Here is an example of how to run boot2container, using SYSLINUX:

LABEL root
    MENU LABEL Run docker's hello world container, with caching disabled
    LINUX /vmlinuz-linux
    APPEND b2c.container=docker://hello-world b2c.cache_device=none b2c.ntp_peer=auto
    INITRD /initramfs.linux_amd64.cpio.xz

The hello-world container image will be run in privileged mode, with the host network, which is what you want when running the container for bare metal testing!

Make sure to check out the list of features and options before either generating the initramfs yourself or downloading it from the releases page. Try it out with your kernel, or the example one bundled in the release!

With this project mostly done, we pretty much conclude the work needed to set up the test machines, and the next articles in this series will be focusing on the infrastructure needed to support a fleet of test machines, and expose it to Gitlab/Github/...

That's all for now, thanks for reading that far!

10 Feb 2021 8:11am GMT

Mike Blumenkrantz: New Order



10 Feb 2021 12:00am GMT

09 Feb 2021


Andrés Gómez Garc: Replaying 3D traces with piglit

If you don't know what traces-based rendering regression testing is, read the appendix before continuing.

The Mesa community has witnessed an explosion of the Continuous Integration interest in the last two years.

In addition to checking the proper building of the project, integrating the testing of its functional correctness has become a priority. The user space graphics drivers exhibit a wide variety of types of tests and test suites. One kind of those tests is traces-based rendering regression testing.

The public effort to add this kind of tests into Mesa's CI started with this mail from Alexandros Frantzis.

At some point, we had support for replaying OpenGL, Vulkan and D3D11 traces using apitrace, RenderDoc and GFXReconstruct with the in-tree tool tracie. However, it was a very custom solution tailored to the needs of Mesa, so I proposed to move this codebase and integrate it into the piglit test suite. It was a natural step forward.

This is how replayer was born into piglit.


The first step to test a trace is, actually, obtaining a trace. I won't go into the details about how to create one from scratch. The process is well documented on each of the tools listed above. However, the Mesa community has been collecting publicly distributable traces for a while and placing them in traces-db whose CI is copying them to Freedesktop.org's MinIO instance.

To make things simple, once we have built and installed piglit, if we would like to test an apitrace created OpenGL trace, we can download from there with:

$ replayer.py download \
         --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
         --db-path ./traces-db \
         --force-download \
         glxgears/glxgears-2.trace
The parameters are self explanatory. The downloaded trace will now exist at ./traces-db/glxgears/glxgears-2.trace.

The next step will be to dump an image from the trace. Since it is a .trace file we will need to have apitrace installed in the system. If we do not specify the call(s) from which to dump the image(s), we will just get the last frame of the trace:

$ replayer.py dump ./traces-db/glxgears/glxgears-2.trace

The dumped PNG image will be at ./results/glxgears-2.trace-0000001413.png. Notice, the number suffix is the snapshot id from the trace.

Dumping from a trace may result in a range of different possible images. One example is when the trace makes use of uninitialized values, leading to undefined behaviors.

However, since the original aim was performing pre-merge rendering regression testing in Mesa's CI, the idea is that replaying any of the provided traces should be quick and the dumped image should be consistent. In other words, if we dump the same frame of a trace several times with the same GFX stack, the image will always be the same.

With this precondition, we can test whether two images are the same just by hashing their content. replayer can obtain the hash for the generated dumped image:

$ replayer.py checksum ./results/glxgears-2.trace-0000001413.png 
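Under the hood this is just a file hash comparison; the 32-hex-digit values suggest MD5, though that's my assumption rather than something shown here. Conceptually the check boils down to:

```shell
# Hypothetical re-implementation of the checksum comparison, for
# illustration: hash the rendered image and compare it against the
# reference checksum.
check_image() {  # usage: check_image <image> <expected-checksum>
    actual=$(md5sum "$1" 2>/dev/null | cut -d' ' -f1)
    if [ "$actual" = "$2" ]; then
        echo "Images match"
    else
        echo "Images differ (actual: $actual)"
    fi
}
check_image ./results/glxgears-2.trace-0000001413.png \
            f8eba0fec6e3e0af9cb09844bc73bdc8
```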

Now, if we built a different commit of Mesa, we could check the image generated at this new point against the previously generated reference image. If everything goes well, we will see something like:

$ replayer.py compare trace \
         --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
         --device-name gl-vmware-llvmpipe \
         --db-path ./traces-db \
         --keep-image \
         glxgears/glxgears-2.trace f8eba0fec6e3e0af9cb09844bc73bdc8
[dump_trace_images] Info: Dumping trace ./traces-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./traces-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./blog-traces-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:

PIGLIT: {"images": [{"image_desc": "glxgears/glxgears-2.trace", "image_ref": "f8eba0fec6e3e0af9cb09844bc73bdc8.png", "image_render": "./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413-f8eba0fec6e3e0af9cb09844bc73bdc8.png"}], "result": "pass"}

replayer's compare subcommand is the one spitting a piglit formatted test expectations output.

Putting everything together

We can make the whole process way simpler by passing replayer a YAML tests list file. For example:

$ cat testing-traces.yml
traces-db:
  download-url: https://minio-packet.freedesktop.org/mesa-tracie-public/

traces:
  - path: gputest/triangle.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: c8848dec77ee0c55292417f54c0a1a49
  - path: glxgears/glxgears-2.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: f53ac20e17da91c0359c31f2fa3f401e
$ replayer.py compare yaml \
         --device-name gl-vmware-llvmpipe \
         --yaml-file testing-traces.yml 
[check_image] Downloading file gputest/triangle.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/gputest/triangle.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/gputest/triangle.trace
// process.name = "/home/anholt/GpuTest_Linux_x64_0.7.0/GpuTest"
397 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)

510 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)

[dump_trace_images] Running: eglretrace --headless --snapshot=510 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace- ./replayer-db/gputest/triangle.trace
Wrote ./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace-0000000510.png

    actual: c8848dec77ee0c55292417f54c0a1a49
  expected: c8848dec77ee0c55292417f54c0a1a49
[check_image] Images match for:

[check_image] Downloading file glxgears/glxgears-2.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./replayer-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:

replayer also features the query subcommand, which is just a helper to read the YAML files with the tests configuration.

Testing the other kinds of supported 3D traces doesn't change much from what's shown here. Just make sure to have the needed tools installed: RenderDoc, GFXReconstruct, the VK_LAYER_LUNARG_screenshot layer, Wine and DXVK. A good reference for building, installing and configuring these tools is Mesa's GL and VK test containers building scripts.

replayer also accepts several configuration options to tweak how it behaves and where to find the actual tracing tools needed for replaying the different types of traces. Make sure to check the replay section in piglit's configuration example file.

replayer's README.md file is also a good read for further information.


replayer is a test runner in a similar fashion to shader_runner or glslparsertest. What we are now missing is how it integrates with piglit, so we can do piglit runs which will produce piglit formatted results.

This is done through the replay test profile.

This profile needs a couple of configuration values. The easiest way is just to set the PIGLIT_REPLAY_DESCRIPTION_FILE and PIGLIT_REPLAY_DEVICE_NAME env variables. They are self explanatory, but make sure to check the documentation for this and other configuration options for this profile.

The following example performs a run similar to the one above, which invoked replayer directly, but with piglit integration, producing formatted results:

$ PIGLIT_REPLAY_DESCRIPTION_FILE=testing-traces.yml PIGLIT_REPLAY_DEVICE_NAME=gl-vmware-llvmpipe piglit run replay -n replay-example replay-results
[2/2] pass: 2   
Thank you for running Piglit!
Results have been written to replay-results

We can create a summary based on the results:

# piglit summary console replay-results/
trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace: pass
trace/gl-vmware-llvmpipe/gputest/triangle.trace: pass
       name: replay-example
       ----  --------------
       pass:              2
       fail:              0
      crash:              0
       skip:              0
    timeout:              0
       warn:              0
 incomplete:              0
 dmesg-warn:              0
 dmesg-fail:              0
    changes:              0
      fixes:              0
regressions:              0
      total:              2
       time:       00:00:00

Creating an HTML summary may also be interesting, especially when investigating failures!


Thanks a lot to the whole Mesa community for helping with the creation of this tool. Alexandros Frantzis, Rohan Garg and Tomeu Vizoso did a lot of the initial development of the in-tree tracie tool, while Dylan Baker was very patient reviewing my patches for the piglit integration.

Finally, thanks to Igalia for allowing me to work on this.


In 3D computer graphics we say "traces", for short, to name the files generated by 3D API capturing tools, which store not only the calls to the specific 3D API but also the internal state of the 3D program during the capturing process: shaders, textures, buffers, etc.

Being able to "record" the execution of a 3D program is very useful. Usually it allows us to replay the execution without needing the original program from which the trace was generated, it allows in-depth analysis for debugging and performance optimization, it's a very good solution for sharing with other developers, and, in some cases, it lets us check how the replay behaves with different GPUs.

In this post, however, I focus on a specific usage: rendering regression testing.

When doing a regression test, we compare a specific metric obtained by replaying the trace with one version of the GFX software stack against the same metric obtained with a different version of the stack. If the value of the metric changes, we have found a regression (or an improvement!).

To make things simpler, we would like to check changes happening in just one of the many elements of the software stack. The most relevant component is the user space driver. In particular, I care about the Mesa drivers and the GNU/Linux stack.

Mainly, there are two kinds of regression testing we can do with a trace: performance and rendering regression testing. In a performance test, the checked metrics are usually speed or memory usage. In a rendering test, we compare the rendered output at one (or many) points during the trace replay. This output, a bitmap image, is the metric we compare between two different revisions of the Mesa driver. If the images differ, we may have found a regression: artifacts, improper colors, etc. Or an enhancement, if the reference image is the one featuring any of these problems.

09 Feb 2021 8:47am GMT

Mike Blumenkrantz: Milestone

If you're on Intel…

Your zink built from git master now has GL 4.3.

Turns out having actual hardware available when doing feature support is important, so I need to do some fixups there for stencil texturing before you can enjoy things.

09 Feb 2021 12:00am GMT

08 Feb 2021


Roman Gilg: The Windowing Revolution

The beta for the upcoming 5.21 release of the KWinFT projects is now available. It contains a monumental rewrite of KWinFT's windowing logic. Read on for an overview of the changes and why this rewrite was necessary.

A Confused Heart

Let's first define what windowing logic is. In my definition this means all structures and algorithms in code that decide where a window should be stacked, placed or moved, or in which other ways its geometry can be manipulated to allow the user to interact with and organize the totality of all windows.

And if you agree that such windowing logic is of central importance for a window manager, and is what distinguishes it in the end from others, we may call it the heart of KWinFT.

The KWinFT compositor is based on KWin, KDE's official compositor for the Plasma Workspace. KWin was founded over two decades ago. Necessarily some of its code is very old, does not adhere to any modern development principles and sometimes, due to changes in other levels of the graphics stack, it is just plain wrong.

It is kind of unexpected, though, that this has been the case in particular for the windowing logic, the heart of KWinFT. For example, at the HEAD of KWin's current master branch, do a git-blame over the ludicrous code in layers.cpp, which is responsible for all window stacking, and count how many lines are older than a decade.

But old code is not necessarily bad. The reason why this old code is bad is two-fold: for one, under the leadership of the former maintainer, the Wayland support was shoehorned into an already complex code base; and secondly, he followed a strategy of keeping the old code untouched as much as possible. Instead of doing the necessary incremental refactors of the old code, he tried to firewall it with an abundance of tests.

For sure one can find reasons and excuses for picking such a strategy, but ultimately one has to say it failed. This cannot be judged from the outside, of course, but I feel comfortable making this assessment as someone who knows the code in detail, and because I am not the only one who abandoned his strategy.

Who Does the Work Is Not Always Right

In fact I am not the first one to refactor the old windowing logic. The current de facto maintainer of KWin, Vlad Zahorodnii, has done so in the past.

The results of his work were often massive merge requests, and back then, when I was still contributing directly to KWin, I had a feeling this was going in the wrong direction. But I was also working on other upstream projects and was in no position to tell someone who worked exclusively on KWin that his work should not go in as is.

This is actually enforced through an unwritten rule in KDE, which prescribes that "the one who does the work decides". This sounds good at first, but the one who does the work is not always right, and in the case of KWin, Vlad's refactors made the old code even more complicated, more fragile and less coherent.

Simple is Difficult

The problem with Vlad's work on KWin is that he likes to create solutions through the addition of new things. He still does.

I call that the "easy way" to solve a problem in an existing code base: you add new code, written against the problem you want to solve. You ensure the new code does not break any of the old unit tests. For compliance, you add another unit test for your new code.

The big downside of this approach is that the complexity of the code increases every time you do it. And KWin's windowing code has become absurdly complex over the years. As an example take a look at the different types of geometries, which describe the position and size of a window.

In contrast I chose the hard way: I made the code simpler.

This would of course also be kind of easy if I just removed features, but I was able to keep all features of KWinFT's windowing logic while simplifying major internal concepts and algorithms.

There is one exception though: the shading of windows was removed. Sorry to the few people who used it, but it is one of those features not meant for a Wayland world, and whoever implemented it at some point in the ancient history of KWin did so by littering special cases and boolean traps all over the code base in order to get it done.

Battle Plans and Front Lines

After this prelude let me give you an overview of what this revolution actually contains.

Flattening the Hierarchy

To get the revolution started, I drafted at the beginning, as I always do with bigger projects like this, a general plan that I published in an issue ticket.

You can see that my primary focus was to simplify the sprawling hierarchy of different window types, which had grown in number over the years, mostly because of the Wayland changes.

The old windows hierarchy.

My first idea was to flatten the hierarchy through the use of C++ templates and by replacing inheritance with composition. And while not yet fully finished, the current state absolutely reaffirms my decision to follow through with this idea.

The new windows hierarchy.

The classes AbstractClient and XwaylandClient, which represented different kinds of windows, have been removed completely. This simplifies the hierarchy to only two levels.

In the future I want to also get rid of the Toplevel class. My plan for that is to template the Workspace class over its supported window types. This would mean no more dynamic inheritance at all.
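
To illustrate the direction with a hypothetical sketch (made-up names, not KWinFT's actual classes): a Workspace templated over the window types it supports needs no dynamic common base class, and windows can gain optional behavior such as a "control" part through composition instead of inheritance.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical "control" part: composited into a window instead of being
// inherited from an AbstractClient-like base class.
struct control {
    bool active = false;
};

struct x11_window {
    std::string caption;
    std::optional<control> ctrl;  // unmanaged windows simply omit this part
};

struct wayland_window {
    std::string caption;
    std::optional<control> ctrl;
};

// Hypothetical workspace templated over its supported window types.
// Each type gets its own storage; no virtual dispatch is required.
template<typename... Windows>
struct workspace {
    std::tuple<std::vector<Windows>...> stacks;

    template<typename Win>
    void add(Win win)
    {
        std::get<std::vector<Win>>(stacks).push_back(std::move(win));
    }

    template<typename Win>
    std::size_t count() const
    {
        return std::get<std::vector<Win>>(stacks).size();
    }
};
```

A `workspace<x11_window, wayland_window>` can then be instantiated directly, with the compiler generating the per-type storage and accessors.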

Other dependent properties that were previously stuffed into AbstractClient I carefully dissected out of it. For example everything related to Scripting is now contained in a single independent interface.

Clean Code is Comprehensible Code

While moving forward with my initial goals I realized that huge parts of the code were so outdated, so ugly, so rotten, that I could not just refactor the logic, but also had to improve the code styling. Often the internal logic was incomprehensible because of the style.

So this project also became about replacing archaic macros with modern lambdas, reducing code duplication, adding whitespace where it made sense, and so on.
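
As a made-up illustration of this kind of change (not actual KWin code): logic that might once have been hidden behind a preprocessor macro can be written as a local lambda, keeping it named, scoped and type-checked.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical example: collect the captions of all visible windows.
// An old-style macro like FOR_ALL_VISIBLE(list, body) would hide this loop
// from the reader and the compiler; a local lambda keeps it visible.
std::vector<std::string> visible_captions(
    std::vector<std::pair<std::string, bool>> const& windows)
{
    auto is_visible = [](auto const& win) { return win.second; };

    std::vector<std::string> captions;
    for (auto const& win : windows) {
        if (is_visible(win)) {
            captions.push_back(win.first);
        }
    }
    return captions;
}
```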

Overall I improved the readability and reduced clutter. I ensured there is a single coherent style in all refactored files. One of the largest single commits in that endeavor was the overhaul of the X11Client class.

When deciding how to clean up code, I follow modern C++ principles in general. I orient myself towards the Standard Library and the C++ Core Guidelines instead of the outdated Qt library style. This falls in line with my long-term plan to factor out libraries that will be pure C++ and no longer depend on Qt.

The Big Ones: Subsurfaces and Window Geometries

While my focus at the beginning of the windowing refactor was to simplify the hierarchy of windows, that was not the initial motivation for this project. My motivation was to fix a certain issue with Wayland subsurfaces: they were not correctly transformed by effects.

A patch for that had landed in KWin in the middle of last year, but I had a feeling it was once again a half-baked attempt at a solution, leading to more complexity instead of less and not solving the problem in a holistic way. My further analysis of the patch confirmed my initial thoughts, and I decided to look at the problem from a completely different angle.

The solution I came up with I would in fact call revolutionary. In the Merge Request I described it as a "huge mental shift in what we understand under subsurfaces". I reused existing concepts from X11 and Wayland but interpreted them in a new way, which simplified the code and unified the logic across all windows.

As there is much to say about this specific solution, I split out the discussion of it into a follow-up article. Stay tuned.

Note: the first article of that follow-up discussion is now available.

I will also write a separate article about the other big change: a total redesign of how we store and change the geometries of windows.

These geometries had been a pain point for me for a long time already. Any aspiring new contributor to KWin must feel absolutely shell-shocked when trying to understand what all the different geometry types of windows are supposed to mean and how they relate to each other.

As a reminder, these are just the getters for the different kinds of geometries in the abstract Toplevel interface class. And this is one of many ways to change a single one of them. Yes, that's a pure virtual function in a subclass, and yes, the second argument of that setter is a masked boolean trap.
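
To make the term concrete with an invented example (not KWin's actual setter): a "boolean trap" is a bare true/false argument whose meaning is invisible at the call site, and replacing it with an enum makes the intent readable.

```cpp
// Hypothetical window type illustrating a boolean trap and its fix.
enum class size_mode { exact, keep_aspect };

struct window {
    int width = 0;
    int height = 0;

    // Boolean trap version: resize(200, 100, true) - what does "true" mean?
    // With an enum, the call site reads resize(200, 100, size_mode::keep_aspect).
    void resize(int w, int h, size_mode mode)
    {
        if (mode == size_mode::keep_aspect && width > 0) {
            h = w * height / width;  // derive height from the current aspect ratio
        }
        width = w;
        height = h;
    }
};
```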

To finally squash any hope that new contributor might have, show them all the different forms of saving a geometry here, here, here and here. And so far we have only looked at header files.

To simplify all that, eradicate this glaringly unnecessary complexity and make the code actually comprehensible again, I redesigned everything about it from the ground up. This was for sure the most comprehensive and most difficult task. I had to go through several iterations before a final overarching model emerged for handling all geometries of all windows, and for saving and manipulating them via clearly defined structures and processes.

Some explanation for that model can be found in the primary Merge Request of the geometries rework. But as said, like with subsurfaces, I plan to write about the reworked geometries soon in a more detailed follow-up article.

A Blossoming Heart

Why did I call this project the Windowing Revolution? Does it deserve this pathos? The project was massive, that's for sure. In sheer numbers the result is over 50,000 changed lines in over 300 commits. For months I sacrificed all my time for this project, and my health.

But size or sacrifices alone do not make this a revolution. Instead it comes through changes in our way of thinking and how this project will reshape our future: we radically redesigned the heart of our window manager, we broke with outdated beliefs and traditions, we simplified and reworked what had been left to rot for decades.

In the end this paves the road for all future improvements, enables us to build them on solid foundations, on a rebuilt core of what defines KWinFT, the most advanced, most modern windowing compositor in the world.

That is why this revolution was necessary now, and that is why I decided to push every other potential work item until afterwards. We first needed to reshape KWinFT's vibrating, pumping and now finally again blossoming heart before work on anything else made sense, be it features for our Wayland session or bug fixes on X11.

Silence in Between the Storms

The last months felt at times like being in a hurricane. The volume of work was just that large. I have to thank several other contributors to KWinFT who helped me throughout this whole time by testing the constantly changing feature branch of the project. This feedback was invaluable and pushed me forward in creating what will now be served to the general public with the upcoming release of 5.21.

I would like to tell you that the work on KWinFT's heart is complete now, that the windowing code is in a perfect state and there is nothing more to do. But that's not yet the case.

What has now been merged to KWinFT's master branch and will be included in the upcoming release next week is a well-progressed intermediate state. I believe the biggest and most important objectives have been achieved, but there are still some smaller refactors to do.

For example, one of these smaller refactors is representing unmanaged X11 windows with the same x11::window class as managed ones, just without compositing the control interface into them. This will further reduce the complexity and afterwards allow us to consolidate more X11-only functionality in a single place. If you are interested in helping with this small but important task, take a look at its issue ticket.

Besides that, there are lots of small code portions which can now be moved to their respective places in the win namespace in order to further clean up the root directory of the repo. If you want to help with that, pick one from the list I created.

The Next Revolution

While there is still some smaller work to do for this Windowing Revolution, I want to start the next one right away by setting a new focus for the upcoming release cycle.

This upcoming revolution is about a refactor of our render code. And while we called the windowing logic the heart of a window manager, we may call the render code its guts.

I will write more about this project in the future, but for now let me say that some of the most anticipated features on Wayland will be part of it. If you already want to know more about it, take a look at the overview ticket.

Join the Cause

If you feel inspired, of course you are invited to take part in this next revolution. And the same holds, if you want to help with the remaining tasks of the last one, the windowing refactor.

Test the current code and give feedback. Or if you want to start contributing code, pick one of the tasks from our GitLab issues list.

And join us in our Gitter community for a friendly chat.

08 Feb 2021 8:00pm GMT