11 Mar 2025
planet.freedesktop.org
Ricardo Garcia: Device-Generated Commands at Vulkanised 2025
A month ago I attended Vulkanised 2025 in Cambridge, UK, to present a talk about Device-Generated Commands in Vulkan. The event was organized by Khronos and took place in the Arm Cambridge office. The talk I presented was similar to the one from XDC 2024 but, instead of being a 5-minute lightning talk, I had 25-30 minutes to present and could expand the contents to include proper explanations of almost all major DGC concepts that appear in the spec.
I attended the event together with my Igalia colleagues Lucas Fryzek and Stéphane Cerveau, who presented about lavapipe and Vulkan Video, respectively. We had a fun time in Cambridge and I can sincerely recommend attending the event to any Vulkan enthusiasts out there. It allows you to meet Khronos members and people working on both the specification and drivers, as well as many other Vulkan users from a wide variety of backgrounds.
The recordings for all sessions are now publicly available, and the one for my talk can be found embedded below. For those of you preferring slides and text, I'm also providing a transcription of my presentation together with slide screenshots further down.
In addition, at the end of the video there's a small Q&A section but I've always found it challenging to answer questions properly on the fly and with limited time. For this reason, instead of transcribing the Q&A section literally, I've taken the liberty of writing down the questions and providing better answers in written form, and I've also included an extra question that I got in the hallways as bonus content. You can find the Q&A section right after the embedded video.
Vulkanised 2025 recording
Questions and answers with longer explanations
Question: can you give an example of when it's beneficial to use Device-Generated Commands?
There are two main use cases where DGC would improve performance: on the one hand, many times game engines use compute pre-passes to analyze the scene they want to draw and prepare some data for that scene. This includes maybe deciding LOD levels, discarding content, etc. After that compute pre-pass, results would need to be analyzed from the CPU in some way. This implies a stall: the output from that compute pre-pass needs to be transferred to the CPU so the CPU can use it to record the right drawing commands, or maybe you do this compute pre-pass during the previous frame and it contains data that is slightly out of date. With DGC, this compute dispatch (or set of compute dispatches) could generate the drawing commands directly, so you don't stall or you can use more precise data. You also save some memory bandwidth because you don't need to copy the compute results to host-visible memory.
On the other hand, sometimes scenes contain so much detail and geometry that recording all the draw calls from the CPU takes a nontrivial amount of time, even if you distribute this draw call recording among different threads. With DGC, the GPU itself can generate these draw calls, so potentially it saves you a lot of CPU time.
Question: as the extension makes heavy use of buffer device addresses, what are the challenges for tools like GFXReconstruct when used to record and replay traces that use DGC?
The extension makes use of buffer device addresses for two separate things. First, it uses them to pass some buffer information to different API functions, instead of passing buffer handles, offsets and sizes. This is no different from other APIs that existed before. The VK_KHR_buffer_device_address extension contains APIs like vkGetBufferOpaqueCaptureAddressKHR and vkGetDeviceMemoryOpaqueCaptureAddressKHR that are designed to take care of those cases and make it possible to record and replay those traces. Contrary to VK_KHR_ray_tracing_pipeline, which has a feature to indicate if you can capture and replay shader group handles (fundamental for capture and replay when using ray tracing), DGC does not have any specific feature for capture-replay. DGC does not add any new problem from that point of view.
Second, the data stored in the DGC buffer for some commands sometimes includes device addresses. This is the case for the index buffer bind command, the vertex buffer bind command, indirect draws with count (a double indirection here) and the ray tracing command. But, again, the addresses in those commands are buffer device addresses, which do not add new challenges for capture and replay compared to what we already had.
Question: what is the deal with the last token being the one that dispatches work?
One minor detail from DGC, that's important to remember, is that, by default, DGC respects the order in which sequences appear in the DGC buffer and the state used for those sequences. If you have a DGC buffer that dispatches multiple draws, you know the state that is used precisely for each draw: it's the state that was recorded before the execute-generated-commands call, plus the small changes that a particular sequence modifies like push constant values or vertex and index buffer binds, for example. In addition, you know precisely the order of those draws: executing the DGC buffer is equivalent, by default, to recording those commands in a regular command buffer from the CPU, in the same order they appear in the DGC buffer.
However, when you create an indirect commands layout you can indicate that the sequences in the buffer may run in an undefined order (this is VK_INDIRECT_COMMANDS_LAYOUT_USAGE_UNORDERED_SEQUENCES_BIT_EXT). If the sequences could dispatch work and then change state, we would have a logical problem: what do those state changes affect? The sequence that is executed right after the current one? Which one is that? We would not know the state used for each draw. Forcing the work-dispatching command to be the last one is much easier to reason about and is also logically tight.
Naturally, if you have a series of draws on the CPU where, for some of them, you change some small bits of state (e.g. like disabling the depth or stencil tests) you cannot do that in a single DGC sequence. For those cases, you need to batch your sequences in groups with the same state (and use multiple DGC buffers) or you could use regular draws for parts of the scene and DGC for the rest.
Question from the hallway: do you know what drivers do exactly at preprocessing time that is so important for performance?
Most GPU drivers these days have a kernel side and a userspace side. The kernel driver does a lot of things like talking to the hardware, managing different types of memory and buffers, talking to the display controller, etc. The kernel driver normally also has facilities to receive a command list from userspace and send it to the GPU.
These command lists are particular for each GPU vendor and model. The packets that form it control different aspects of the GPU. For example (this is completely made-up), maybe one GPU has a particular packet to modify depth buffer and test parameters, and another packet for the stencil test and its parameters, while another GPU from another vendor has a single packet that controls both. There may be another packet that dispatches draw work of all kinds and is flexible to accommodate the different draw commands that are available on Vulkan.
The Vulkan userspace driver translates Vulkan command buffer contents to these GPU-specific command lists. In many drivers, the preprocessing step in DGC takes the command buffer state, combines it with the DGC buffer contents and generates a final command list for the GPU, storing that final command list in the preprocess buffer. Once the preprocess buffer is ready, executing the DGC commands is only a matter of sending that command list to the GPU.
Talk slides and transcription
Hello, everyone! I'm Ricardo from Igalia and I'm going to talk about device-generated commands in Vulkan.
First, some bits about me. I have been part of the graphics team at Igalia since 2019. For those that don't know us, Igalia is a small consultancy company specialized in open source, and my colleagues in the graphics team work on things such as Mesa drivers, Linux kernel drivers, compositors… that kind of thing. In my particular case the focus of my work is contributing to the Vulkan Conformance Test Suite, and I do that as part of a collaboration between Igalia and Valve that has been going on for a number of years now. Just to highlight a couple of things, I'm the main author of the tests for the mesh shading extension and for the device-generated commands extension we are talking about today.
So what are device-generated commands? Basically, it's a new extension, a new functionality, that allows a driver to read command sequences from a regular buffer: something like, for example, a storage buffer, instead of the usual regular command buffers that you use. The contents of the DGC buffer could be filled from the GPU itself. This is what saves you the round trip to the CPU and, that way, you can improve the GPU-driven rendering process in your application. It's like one step ahead of indirect draws and dispatches, and one step behind work graphs. And it's also interesting because device-generated commands provide a better foundation for translating DX12. If you have a translation layer that implements DX12 on top of Vulkan like, for example, Proton, and you want to implement ExecuteIndirect, you can do that much more easily with device-generated commands. This is important for Proton, which Valve uses to run Windows games on top of Linux on the Steam Deck.
If we set aside Vulkan for a moment, and we stop thinking about GPUs and such, and you want to come up with a naive CPU-based way of running commands from a storage buffer, how do you do that? Well, one immediate solution we can think of is: first of all, I'm going to assign a token, an identifier, to each of the commands I want to run, and I'm going to store that token in the buffer first. Then, depending on what the command is, I want to store more information.
For example, if we have a sequence like we see here in the slide where we have a push constant command followed by dispatch, I'm going to store the token for the push constants command first, then I'm going to store some information that I need for the push constants command, like the pipeline layout, the stage flags, the offset and the size. Then, after that, depending on the size that I said I need, I am going to store the data for the command, which is the push constant values themselves. And then, after that, I'm done with it, and I store the token for the dispatch, and then the dispatch size, and that's it.
But this doesn't really work: this is not how GPUs work. A GPU would have a hard time running commands from a buffer if we store them this way. And this is not how Vulkan works because in Vulkan you want to provide as much information as possible in advance and you want to make things run in parallel as much as possible, and take advantage of the GPU.
So what do we do in Vulkan? In Vulkan, and in the Vulkan VK_EXT_device_generated_commands extension, we have this central concept, which is called the Indirect Commands Layout. This is the main thing, and if you want to remember just one thing about device generated commands, you can remember this one.
The indirect commands layout is basically like a template for a short sequence of commands. The way you build this template is using the tokens and the command information that we saw colored red and green in the previous slide, and you build that in advance and pass that in advance so that, in the end, in the command buffer itself, in the buffer that you're filling with commands, you don't need to store that information. You just store the data for each command. That's how you make it work.
And the result of this is that with the commands layout, that I said is a template for a short sequence of commands (and by short I mean a handful of them like just three, four or five commands, maybe 10), the DGC buffer can be pretty large, but it does not contain a random sequence of commands where you don't know what comes next. You can think about it as divided into small chunks that the specification calls sequences, and you get a large number of sequences stored in the buffer but all of them follow this template, this commands layout. In the example we had, push constant followed by dispatch, the contents of the buffer would be push constant values, dispatch size, push constant values, dispatch size, many times repeated.
The second thing that Vulkan does to be able to make this work is that we limit a lot what you can do with device-generated commands. There are a lot of things you cannot do. In fact, the only things you can do are the ones that are present in this slide.
You have some things like, for example, update push constants, you can bind index buffers, vertex buffers, and you can draw in different ways, using mesh shading maybe, you can dispatch compute work and you can dispatch raytracing work, and that's it. You also need to check which features the driver supports, because maybe the driver only supports device-generated commands for compute or ray tracing or graphics. But you notice you cannot do things like start render passes or insert barriers or bind descriptor sets or that kind of thing. No, you cannot do that. You can only do these things.
This indirect commands layout, which is the backbone of the extension, specifies, as I said, the layout for each sequence in the buffer and it has additional restrictions. The first one is that it must specify exactly one token that dispatches some kind of work and it must be the last token in the sequence. You cannot have a sequence that dispatches graphics work twice, or that dispatches compute work twice, or that dispatches compute first and then draws, or something like that. No, you can only do one thing with each DGC buffer and each commands layout and it has to be the last one in the sequence.
And one interesting thing that also Vulkan allows you to do, that DX12 doesn't let you do, is that it allows you (on some drivers, you need to check the properties for this) to choose which shaders you want to use for each sequence. This is a restricted version of the bind pipeline command in Vulkan. You cannot choose arbitrary pipelines and you cannot change arbitrary states but you can switch shaders. For example, if you want to use a different fragment shader for each of the draws in the sequence, you can do that. This is pretty powerful.
How do you create one of those indirect commands layouts? Well, with one of those typical Vulkan creation calls, which takes one of the CreateInfo structures that are ever-present in Vulkan.
And, as you can see, you have to pass these shader stages that will be used, will be active, while you draw or you execute those indirect commands. You have to pass the pipeline layout, and you have to pass in an indirect stride. The stride is the amount of bytes for each sequence, from the start of a sequence to the next one. And the most important information of course, is the list of tokens: an array of tokens that you pass as the token count and then the pointer to the first element.
Now, each of those tokens contains a bit of information and the most important one is the type, of course. Then you can also pass an offset that tells the driver how many bytes into the sequence the data for that command starts. Together with the stride, this means you don't need to pack the data for those commands together: if you want to include some padding, because it's convenient or something, you can do that.
And then there's also the token data which allows you to pass the information that I was painting in green in other slides like information to be able to run the command with some extra parameters. Only a few tokens, a few commands, need that. Depending on the command it is, you have to fill one of the pointers in the union but for most commands they don't need this kind of information. Knowing which command it is you just know you are going to find some fixed data in the buffer and you just read that and process that.
One thing that is interesting, like I said, is the ability to switch shaders and to choose which shaders are going to be used for each of those individual sequences. Some form of pipeline switching, or restricted pipeline switching. To do that you have to create something that is called Indirect Execution Sets.
Each of these execution sets is like a group or an array, if you want to think about it like that, of pipelines: similar pipelines or shader objects. They have to share something in common, which is that all of the state in the pipeline has to be identical, basically. Only the shaders can change.
When you create these execution sets and you start adding pipelines or shaders to them, you assign an index to each pipeline in the set. Then, you pass this execution set beforehand, before executing the commands, so that the driver knows which set of pipelines you are going to use. And then, in the DGC buffer, when you have this pipeline token, you only have to store the index of the pipeline that you want to use. You create the execution set with 20 pipelines and you pass an index for the pipeline that you want to use for each draw, for each dispatch, or whatever.
The way to create the execution sets is the one you see here, where we have, again, one of those CreateInfo structures. There, we have to indicate the type, which is pipelines or shader objects. Depending on that, you have to fill one of the pointers from the union on the top right here.
If we focus on pipelines because it's easier on the bottom left, you have to pass the maximum pipeline count that you're going to store in the set and an initial pipeline. The initial pipeline is what is going to set the template that all pipelines in the set are going to conform to. They all have to share essentially the same state as the initial pipeline and then you can change the shaders. With shader objects, it's basically the same, but you have to pass more information for the shader objects, like the descriptor set layouts used by each stage, push-constant information… but it's essentially the same.
Once you have that execution set created, you can use those two functions (vkUpdateIndirectExecutionSetPipelineEXT and vkUpdateIndirectExecutionSetShaderEXT) to update and add pipelines to that execution set. You need to take into account that you have to pass a couple of special creation flags to the pipelines, or the shader objects, to tell the driver that you may use those inside an execution set because the driver may need to do something special for them. And one additional restriction that we have is that if you use an execution set token in your sequences, it must appear only once and it must be the first one in the sequence.
The recap, so far, is that the DGC buffer is divided into small chunks that we call sequences. Each sequence follows a template that we call the Indirect Commands Layout. Each sequence must dispatch work exactly once, and you may be able to switch the set of shaders used with each sequence with an Indirect Execution Set.
How do we go about actually telling Vulkan to execute the contents of a specific buffer? Well, before executing the contents of the DGC buffer the application needs to have bound all the needed state to run those commands. That includes descriptor sets, initial push constant values, initial shader state, initial pipeline state. Even if you are going to use an Execution Set to switch shaders later, you have to specify some kind of initial shader state.
Once you have that, you can call vkCmdExecuteGeneratedCommandsEXT. You bind all the state into your regular command buffer and then you record this command to tell the driver: at this point, execute the contents of this buffer. As you can see, you typically pass a regular command buffer as the first argument. Then there's some kind of boolean value called isPreprocessed, which is kind of confusing because it's the first time it appears and you don't know what it is about, but we will talk about it in a minute. And then you pass a relatively larger structure containing information about what to execute.
In that GeneratedCommandsInfo structure, you need to pass again the shader stages that will be used. You have to pass the handle for the Execution Set, if you're going to use one (if not you can use the null handle). Of course, the indirect commands layout, which is the central piece here. And then you pass the information about the buffer that you want to execute, which is the indirect address and the indirect address size as the buffer size. We are using buffer device address to pass information.
And then we have something again mentioning some kind of preprocessing thing, which is really weird: preprocess address and preprocess size which looks like a buffer of some kind (we will talk about it later). You have to pass the maximum number of sequences that you are going to execute. Optionally, you can also pass a buffer address for an actual counter of sequences. And the last thing that you need is the max draw count, but you can forget about that if you are not dispatching work using draw-with-count tokens as it only applies there. If not, you leave it as zero and it should work.
We have a couple of things here that we haven't talked about yet, which are the preprocessing things. Starting from the bottom, that preprocess address and size give us a hint that there may be a pre-processing step going on. Some kind of thing that the driver may need to do before actually executing the commands, and we need to pass information about the buffer there.
The boolean value that we pass to the command ExecuteGeneratedCommands tells us that the pre-processing step may have happened before so it may be possible to explicitly do that pre-processing instead of letting the driver do that at execution time. Let's take a look at that in more detail.
First of all, what is the pre-process buffer? The pre-process buffer is auxiliary space, a scratch buffer, because some drivers need to take a look at what the command sequence looks like before actually starting to execute things. They need to go over the sequence first and they need to write a few things down just to be able to properly do the job later to execute those commands.
Once you have the commands layout and the maximum number of sequences that you are going to execute, you can call vkGetGeneratedCommandsMemoryRequirementsEXT and the driver is going to tell you how much space it needs. Then, you can create a buffer and allocate space for it, passing a special new buffer usage flag (VK_BUFFER_USAGE_2_PREPROCESS_BUFFER_BIT_EXT) and, once you have that buffer, you pass its address and size in the previous structure.
Now the second thing is that we have the possibility of doing this preprocessing step explicitly. Explicit pre-processing is something that's optional, but you probably want to do it if you care about performance because it's the key to performance with some drivers.
When you use explicit pre-processing you don't want to (1) record the state, (2) call vkCmdPreprocessGeneratedCommandsEXT and (3) immediately call vkCmdExecuteGeneratedCommandsEXT in the same command buffer. That is effectively what implicit pre-processing does, so doing it this way gains you nothing.
This is designed so that, if you want to do explicit pre-processing, you're going to probably want to use a separate command buffer for pre-processing. You want to batch pre-processing calls together and submit them all together to keep the GPU busy and to give you the performance that you want. While you submit the pre-processing steps you may be still preparing the rest of the command buffers to enqueue the next batch of work. That's the key to doing pre-processing optimally.
You need to decide beforehand if you are going to use explicit pre-processing or not because, if you're going to use explicit preprocessing, you need to pass a flag when you create the commands layout, and then you have to call the function to preprocess generated commands. If you don't pass that flag, you cannot call the preprocessing function, so it's an all or nothing. You have to decide, and you do what you want.
One thing that is important to note is that pre-processing needs to see the same state and the same input buffer contents as execution, so it can run properly.
The video contains a cut here because the presentation laptop ran out of battery.
If the pre-processing step needs to have the same state as the execution, you need to have bound the same pipeline state, the same shaders, the same descriptor sets, the same contents. I said that explicit pre-processing is normally used using a separate command buffer that we submit before actual execution. You have a small problem to solve, which is that you would need to record state twice: once on the pre-process command buffer, so that the pre-process step knows everything, and once on the execution, the regular command buffer, when you call execute. That would be annoying.
Instead of that, the pre-process generated commands function takes an argument that is a state command buffer and the specification tells you: this is a command buffer that needs to be in the recording state, and the pre-process step is going to read the state from it. This is the first time, and I think the only time in the specification, that something like this is done. You may be puzzled about what this is exactly: how do you use this and how do we pass this?
I just wanted to get this slide out to tell you: if you're going to use explicit pre-processing, the ergonomic way of using it and how we thought about using the processing step is like you see in this slide. You take your main command buffer and you record all the state first and, just before calling execute-generated-commands, the regular command buffer contains all the state that you want and that preprocess needs. You stop there for a moment and then you prepare your separate preprocessing command buffer passing the main one as an argument to the preprocess call, and then you continue recording commands in your regular command buffer. That's the ergonomic way of using it.
You do need some synchronization at some steps. The main one is that, if you generate the contents of the DGC buffer from the GPU itself, writes to that buffer need to be synchronized with what comes later: executing or reading those commands from the buffer.
Depending on whether you use explicit preprocessing, you synchronize those writes either against the new command-preprocess pipeline stage with preprocess-read access, or against regular device-generated-commands execution, which is considered part of the existing draw-indirect stage with indirect-command-read access.
If you use explicit pre-processing you also need to make sure that writes to the pre-process buffer happen before you start reading from it, so you use the preprocess stage and access flags (VK_PIPELINE_STAGE_COMMAND_PREPROCESS_BIT_EXT, VK_ACCESS_COMMAND_PREPROCESS_WRITE_BIT_EXT) to synchronize preprocessing with execution (VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, VK_ACCESS_INDIRECT_COMMAND_READ_BIT).
The quick how-to: I just wanted to get this slide out for those wanting a reference that says exactly what you need to do. All the steps that I mentioned here about creating the commands layout, the execution set, allocating the preprocess buffer, etc. This is the basic how-to.
And that's it. Thanks for watching! Questions?
11 Mar 2025 4:30pm GMT
Mike Blumenkrantz: Znvk
27 Feb 2025
Mike Blumenkrantz: Slow Down
Once Again We Return Home
It's been a while, but for the first time this year I have to do it. Some of you are shaking your heads, saying you knew it, and you were right. Here we are again.
It's time to vkoverhead.
The Numbers Must Go Up
I realized while working on some P E R F that there was a lot of perf to be gained in places I wasn't testing. That makes sense, right? If there's no coverage, the perf can't go up.
So I added a new case for the path I was using, and boy howdy did I start to see some weird stuff.
Normally this is where I'd post up some gorgeous flamegraphs, and we would sit back in our expensive leather armchairs debating the finer points of optimization. But you know what? We can't do that anymore.
Why, you're asking. The reason is simple: perf
is totally fucking broken and has been for a while. But only on certain machines. Specifically, mine. So no more flamegraphs for you, and none for me.
Despite this massive roadblock, the perf gains must continue. Through the power of guesswork and frustration, I've managed some sizable gains:
| # | Draw Tests | 1000op/s (before) | % relative to 'draw' (before) | 1000op/s (after) | % relative to 'draw' (after) |
|---|---|---|---|---|---|
| 0 | draw | 46298 | 100.0% | 46426 | 100.0% |
| 16 | vbo change | 17741 | 38.3% | 22413 | 48.3% |
| 17 | vbo change dynamic (new!) | 4544 | 9.8% | 8686 | 18.7% |
| 18 | 1vattrib change | 3021 | 6.5% | 3316 | 7.1% |
| 20 | 16vattrib 16vbo change | 5266 | 11.4% | 6398 | 13.8% |
| 21 | 16vattrib change | 2352 | 5.1% | 2512 | 5.4% |
| 22 | 16vattrib change dynamic | 3976 | 8.6% | 5003 | 10.8% |
Though I was mainly targeting the case of using dynamic vertex input and binding new vertex buffers for every draw (and managed a nearly 100% improvement there), I ended up seeing noteworthy gains across the board for binding vertex buffers, even when using fully static state. This should provide some minor gains to general RADV perf.
Future Improvements
Given the still-massive perf gap between using static and dynamic vertex state when only vertex buffers change, it seems likely there's still some opportunities to reclaim more perf. Only time will tell what can be achieved here, but for now this is what I've got.
27 Feb 2025 12:00am GMT
26 Feb 2025
Mike Blumenkrantz: CLthulhu
Insanity Has A Name
Karol Herbst. At SGC, we know this man. We fear him. His photo is on the wall over a break-in-case-of-emergency glass panel which shields a button activating a subterranean escape route set to implode as soon as I sprint through.
Despite this, and despite all past evidence leading me to be wary of any idea he pitched, the madman got me again.
cl_khr_image2d_from_buffer. On the surface, an innocuous little extension used to access a buffer like a 2D image. Vulkan already has this support for 1D images in the form of VkBufferView
, so why would adding a stride to that be any harder (aside from the fact that the API doesn't support it)?
I was deep into otherworldly optimizations at this point, far beyond the point where I was able to differentiate between improvement and neutral, let alone sane or twisted. His words seemed so reasonable: why couldn't I just throw a buffer to the GPU as a 2D image? I'd have to be an idiot not to be able to do something as simple as that. Wouldn't I?
Dammit, Karol.
How to 2D a Buffer
You can't. I mean, I can, but you? Vulkan won't let you do it. There's (currently) no extension that enables a 2D bufferview. Rumor has it some madman on a typewriter is preparing to fax over an extension specification to add this, but only time will tell whether Khronos accepts submissions in this format.
Here at SGC, we're all smart HUMANS though, so there's an obvious solution to this.
It's not memory aliasing. Sure, rebinding buffer memory onto an image might work. But in reading the spec, the synchronization guarantees for buffer-image aliasing didn't seem that strong. And also it'd be a whole bunch of code to track it, and maybe do weird layout stuff, and add some kind of synchronization on the buffer too, and pray the driver isn't buggy, and doesn't this sound a lot like the we-have-this-at-home version of another, better mechanism that zink already has incredible support for?
Yeah. What about these things? How do they wORK?
DMA Buffers: Totally Normal
A DMAbuf is basically a pipe. On one end you have memory. And if you yell TRIANGLE into the other end really loud, something unimaginable and ineffable that lurks deep within the void will slither and crawl its way up the pipe until it GAZES UPON YOU IN YOUR FLESHY MORTAL SHELL ATTEMPTING TO USURP THE POWERS OF THE OLD ONES. It's a fun little experiment with absolutely no unwanted consequences. Try it at home!
The nice thing about dmabufs is I know they work. And I know they work in zink. That's because in order to run an x̸̧̠͓̣̣͎͚̰͎̍̾s̶̡̢͙̞̙͍̬̝̠̩̱̞̮̩̣̑͂͊̎͆̒̓͐͛͊̒͆̄̋ȩ̶̡̨̳̭̲̹̲͎̪̜͒̓̈́̏r̶̩̗͖͙͖̬̟̞̜̠͙̠̎͑̉̌̎̍̑́̏̓̏̒̍͜͝v̶̞̠̰̘̞͖̙̯̩̯̝̂̃̕͜e̴̢̡͎̮͔̤͖̤͙̟̳̹͛̓͌̈̆̈́̽͘̕ŕ̶̫̾͐͘ or a Wayland compositor (e.g., Ŵ̶̢͍̜̙̺͈͉̼̩̯̺̗̰̰͕͍̱͊͊̓̈̀͛̾̒̂̚̕͝ͅḙ̵̛̬̜͔̲͕͖̜̱̻͊̌̾͊͘s̶̢̗̜͈̘͎̠̘̺͉͕̣̯̘̦͓͈̹̻͙̬̘̿͆̏̃̐̍̂̕ͅt̷̨͈̠͕͔̬̙̣͈̪͕̱͕̙̦͕̼̩͙̲͖͉̪̹̼͛̌͋̃̂̂̓̏̂́̔͠͝ͅơ̸̢̛̛̲̟͙͚̰͇̞̖̭̲͍͇̫̘̦̤̩̖͍̄̓́͑̉̿̅̀̉͒͋͒̂́̆̋̚͝ͅͅn̶̢̡̝̥̤̣͔̣͉͖̖̻̬̝̥̦͇͕̘͋͂͛̌̃͠ͅͅ, the reference compositor), dmabufs have to work. Zink can run both of those just fine, so I know there's absolutely zero bugs. There can't be any bugs. No. Not bugs again. NO MORE BUGS
Even better, I know that I can do imports and exports of dmabufs in any dimensionality thanks to that crazy CL-GL sharing extension Karol already suckered me into supporting at the expense of every Vulkan driver's bug tracker. That KAROL HERBST guy, hah, he's such a kidder!
So obviously (it's just common sense at this point) I should just be able to hook up the pipes here. Export a buffer and then import a 2D image with whatever random CAUSALITY IS A LIE passes for stride. Right? Basically a day at the beach for me.
And of course it works perfectly with no problems whatsoever, giving DaVinci Resolve a nice performance boost.
Stay sane, readers.
26 Feb 2025 12:00am GMT
24 Feb 2025
planet.freedesktop.org
Hans de Goede: ThinkPad X1 Carbon Gen 12 camera support and other IPU6 camera work
I have been working on getting the camera on the ThinkPad X1 Carbon Gen 12 to work under Fedora.
This requires 3 things:
- Some ov08x40 sensor patches, these are available as downstream cherry-picks in Fedora kernels >= 6.12.13
- A small pipewire fix to avoid WirePlumber listing a bunch of bogus extra "ipu6" Video Sources, these fixes are available in Fedora's pipewire packages >= 1.2.7-4
- I2C and GPIO drivers for the new Lattice USB IO-expander, these drivers are not available in the upstream / mainline kernel yet
I have also rebased the out of tree IPU6 ISP and proprietary userspace stack in rpmfusion and I have integrated the USBIO drivers into the intel-ipu6-kmod package. So for now getting the cameras to work on the X1 Carbon Gen 12 requires installing the out of tree drivers through rpmfusion. Follow these instructions to enable rpmfusion, you need both the free and nonfree repos.
Then make sure you have a new enough kernel installed and install the rpmfusion akmod for the USBIO drivers:
sudo dnf update 'kernel*'
sudo dnf install akmod-intel-ipu6
The latest version of the out of tree IPU6 ISP driver can co-exist with the mainline / upstream IPU6 CSI receiver kernel driver. So both the libcamera software ISP FOSS stack and Intel's proprietary stack can co-exist now. If you do not want to use the proprietary stack you can disable it by running 'sudo ipu6-driver-select foss'.
After installing the kmod package, reboot and then in Firefox go to Mozilla's webrtc test page and click on the "Camera" button. You should now get a camera permission dialog with 2 cameras: "Built in Front Camera" and "Intel MIPI Camera (V4L2)". The "Built in Front Camera" is the FOSS stack and the "Intel MIPI Camera (V4L2)" is the proprietary stack. Note the FOSS stack will show a strongly zoomed in (cropped) image; this is caused by the GUM test page, in e.g. google-meet this will not be the case.
I have also been making progress with some of the other open IPU6 issues:
- Cameras failing on Dell XPS laptops due to iVSC errors (rhbz#2316918, rhbz#2324683): after a long debugging session this is finally fixed, and the fix will be available in Fedora kernels >= 6.13.4, which should show up in updates-testing today
- Cameras not working on the Microsoft Surface Book with ov7251 sensor: the fix for this has landed upstream
24 Feb 2025 2:44pm GMT
Peter Hutterer: libinput and 3-finger dragging
Ready in time for libinput 1.28 [1] and after a number of attempts over the years we now finally have 3-finger dragging in libinput. This is a long-requested feature that allows users to drag by using a 3-finger swipe on the touchpad. Instead of the normal swipe gesture you simply get a button down, pointer motion, button up sequence. Without having to tap or physically click and hold a button, so you might be able to see the appeal right there.
Now, as with any interaction that relies on the mere handful of fingers that are on our average user's hand, we are starting to have usage overlaps. Since the only difference between a swipe gesture and a 3-finger drag is in the intention of the user (and we can't detect that yet, stay tuned), 3-finger swipes are disabled when 3-finger dragging is enabled. Otherwise it does fit in quite nicely with the rest of the features we have though.
There really isn't much more to say about the new feature except: It's configurable to work on 4-finger drag too so if you mentally substitute all the threes with fours in this article before re-reading it that would save me having to write another blog post. Thanks.
[1] "soonish" at the time of writing
24 Feb 2025 5:38am GMT
Peter Hutterer: GNOME 48 and a changed tap-and-drag drag lock behaviour
This is a heads up as mutter PR!4292 got merged in time for GNOME 48. It (subtly) changes the behaviour of drag lock on touchpads, but (IMO) very much so for the better. Note that this feature is currently not exposed in GNOME Settings so users will have to set it via e.g. the gsettings commandline tool. I don't expect this change to affect many users.
This is a feature of a feature of a feature, so let's start at the top.
"Tapping" on touchpads refers to the ability to emulate button presses via short touches ("taps") on the touchpad. When enabled, a single-finger tap emulates a left mouse button click, a two-finger tap a right button click, etc. Taps are short interactions: to be recognised, the finger must be set down and released again within a certain time and must not move more than a certain distance. Clicking is useful but it's not everything we do with touchpads.
"Tap-and-drag" refers to the ability to drag something while the mouse button is logically held down. The sequence required to do this is a tap immediately followed by the finger down (and held down). This will press the left mouse button so that any finger movement results in a drag. Releasing the finger releases the button. This is convenient, but especially on large monitors or for users with different-than-whatever-we-guessed-is-average dexterity this can make it hard to drag something to its final position: a user may run out of touchpad space before the pointer reaches the destination. For those, the tap-and-drag "drag lock" is useful.
"Drag lock" refers to the ability to keep the mouse button pressed until "unlocked", even if the finger moves off the touchpad. It's the same sequence as before: tap followed by the finger down and held down. But releasing the finger will not release the mouse button; instead another tap is required to unlock and release the mouse button. The whole sequence thus becomes tap, down, move.... tap, with any number of finger releases in between. It sounds (and is) complicated to explain, but is quite easy to try, and once you're used to it it will feel quite natural.
The above behaviour is the new behaviour, which non-coincidentally also matches the macOS behaviour (if you can find the toggle in the settings, good practice for easter eggs!). The previous behaviour used a timeout instead, so the mouse button was released automatically if the finger stayed up after a certain timeout. This was less predictable and caused issues with users who weren't fast enough. The new "sticky" behaviour resolves this issue and is (alanis morissette-style ironically) faster to release (a tap can be performed before the previous timeout would've expired).
Anyway, TLDR, a feature that very few people use has changed defaults subtly. Bring out the pitchforks!
As said above, this is currently only accessible via gsettings and the drag-lock behaviour change only takes effect if tapping, tap-and-drag and drag lock are enabled:
$ gsettings set org.gnome.desktop.peripherals.touchpad tap-to-click true
$ gsettings set org.gnome.desktop.peripherals.touchpad tap-and-drag true
$ gsettings set org.gnome.desktop.peripherals.touchpad tap-and-drag-lock true
All features above are actually handled by libinput, this is just about a default change in GNOME.
24 Feb 2025 4:17am GMT
22 Feb 2025
planet.freedesktop.org
Simon Ser: Using Podman, Compose and BuildKit
For my day job, I need to build and run a Docker Compose project. However, because Docker doesn't play well with nftables and I prefer a rootless + daemonless approach, I'm using Podman.
Podman supports Docker Compose projects with two possible solutions: either by connecting the official Docker Compose CLI to a Podman socket, or by using their own drop-in replacement. They ship a small wrapper to select one of these options. (The wrapper has the same name as the replacement, which makes things confusing.)
Unfortunately, both options have downsides. When using the official Docker Compose CLI, the classic builder is used instead of the newer BuildKit builder. As a result, some features such as additional contexts are not supported. When using the podman-compose replacement, some other features are missing, such as !reset, configs and referencing another service in additional contexts. It would be possible to add these features to podman-compose, but that's an endless stream of work (Docker Compose regularly adds new features) and I don't really see the value in re-implementing all of this (the fact that it's Python doesn't help me get motivated).
I've started looking for a way to convince the Docker Compose CLI to run under Podman with BuildKit enabled. I tried a few months ago and never got it to work, but it seems like this recently became easier! The podman-compose wrapper force-disables BuildKit, so we need to use the Docker Compose CLI directly, without the wrapper. On Arch Linux, this can be achieved by enabling the Podman socket and creating a new Docker context (same as setting DOCKER_HOST, but more permanent):
pacman -S docker-compose docker-buildx
systemctl --user start podman.socket
docker context create podman --docker host=unix://$XDG_RUNTIME_DIR/podman/podman.sock
docker context use podman
With that, docker compose just works! It turns out it automagically creates a buildx_buildkit_default container under the hood to run the BuildKit daemon. Since I don't like automagical things, I immediately tried to run the BuildKit daemon myself:
pacman -S buildkit
systemctl --user start buildkit.service
docker buildx create --name local unix://$XDG_RUNTIME_DIR/buildkit/rootless
docker buildx use local
Now docker compose uses our systemd-managed BuildKit service. But we're not done yet! One of the reasons I like Podman is that it's daemonless, and we've got a daemon running in the background. This isn't the end of the world, but it'd be nicer to be able to run the build without BuildKit.
Fortunately, there's a way around this: any Compose project can be turned into a JSON description of the build commands called Bake. docker buildx bake --print will print that JSON file (and the Docker Compose CLI will use Bake files if COMPOSE_BAKE=true is set, since v2.33). Note, Bake supports way more features (e.g. HCL files) but we don't really need these for our purposes (and the command above can lower fancy Bake files into dumb JSON ones).
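For a sense of what the translation starts from, here is a rough sketch of the kind of JSON that docker buildx bake --print emits for a trivial single-service Compose project (the app target name and the exact field values are illustrative, not taken from a real project):

```json
{
  "group": {
    "default": { "targets": ["app"] }
  },
  "target": {
    "app": {
      "context": ".",
      "dockerfile": "Dockerfile",
      "tags": ["docker.io/library/app:latest"]
    }
  }
}
```

Each entry under "target" maps fairly directly onto podman build flags (context, --file, --tag), which is what makes the translation tractable.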
The JSON file is pretty similar to the podman build CLI arguments. It's not that hard to do the translation, so I've written Bakah, a small tool which does exactly this. It uses Buildah instead of shelling out to Podman (Buildah is the library used by Podman under the hood to build images). A few details required a bit more attention, for instance dependency resolution and parallel builds, but it's quite simple. It can be used like so:
docker buildx bake --print >bake.json
bakah --file bake.json
Bakah is still missing the fancier Bake features (HCL files, inheritance, merging/overriding files, variables, and so on), but it's enough to build complex Compose projects. I plan to use it for soju-containers in the future, to better split my Dockerfiles (one for the backend, one for the frontend) and remove the CI shell script (which contains a bunch of Podman CLI invocations). I hope it can be useful to you as well!
22 Feb 2025 10:00pm GMT
20 Feb 2025
planet.freedesktop.org
Mike Blumenkrantz: Againicl
Busy.
I didn't forget to blog. I know you don't believe me, but I've been accumulating items to blog about for the past month. Powering up. Preparing. And now, finally, it's time to begin opening the valves.
Insanity Returns
When I got back from hibernation, I was immediately accosted by a developer I'd forgotten. One with whom I spent an amount of time consuming adult beverages at XDC again. One who walks with a perpetual glint of madness in his eyes, ready at the drop of a hat to tackle the nearest driver developer and begin raving about the benefits of supporting OpenCL.
Obviously I'm talking about Karol "HOW IS THE PUB ALREADY CLOSED IT'S ONLY 10:30???" Herbst.
I was minding my own business, fixing bugs and addressing perf issues when he assaulted me with a vicious nerdsnipe late one night in January. "Hey, why can't I run DaVinci Resolve on Zink?" he casually asked me, knowing full well the ramifications of such a question.
I tried to put him off, but he persisted. "You know, RadeonSI supports all those features," he said next, and my entire week was ruined. As everyone knows, Zink can only ever be compared to one driver, and the comparisons can't be too uneven.
So it was that I started looking at the CL CTS for the first time this year to implement cl_khr_gl_sharing. This extension is basically EXT_external_objects for CL. It should "just work". Right?
Right…
The thing is, this mechanism (on Linux) uses dmabufs. You know, that thing we all love because they make display servers go vroom. dmabufs allow sharing memory regions between processes through file descriptors. Or just within the same process. Anywhere, really. One side exports the memory object to the FD, and the other side imports it.
But that's how normal people use dmabufs. 2D image import/export for display server usage. Or, occasionally, some crazy multi-process browser engine thing. But still 2D.
You know who uses dmabufs with all-the-Ds? OpenCL.
You know who doesn't implement all-the-Ds? Any Vulkan drivers. Probably. Case in point, I had to hack it in for RADV before I could get CTS to pass and VVL to stop screaming at me.
From there, it turned out zink mostly supported everything already. A minor bugfix and some conditionals to enable raw buffer import/export, and it just works.
Brace yourselves, because this is the foundation for getting Cthulhu-level insane next time.
20 Feb 2025 12:00am GMT
17 Feb 2025
planet.freedesktop.org
Simon Ser: Status update, February 2025
Hi!
This month has been pretty hectic, with FOSDEM and all. I've really enjoyed meeting face-to-face all of these folks I work online with the rest of the year! My talk about modern IRC has been published on the FOSDEM website (unfortunately the audio quality isn't great).
In Wayland news, the color management protocol has finally been merged! I haven't done much apart from cheering from the sidelines: huge thanks to everyone involved for carrying this over the finish line, especially Pekka Paalanen, Sebastian Wick and Xaver Hugl! I've started a wlroots implementation, which was enough, with some hacks, to get MPV to display an HDR video on Sway. I've also posted a patch to convert to BT2020 and encode to PQ, but I still need to figure out why red shows up as pink (or rebrand it as lipstick-filter in the Sway config file).
I've released sway 1.10.1 with a bunch of bugfixes, as well as wlr-randr 0.5.0 which adds relative positioning options (e.g. --left-of) and a man page. I've rewritten makoctl in C (the shell script approach has been showing its limitations for a while), and merged support for icon border radius, per-corner radius settings, and a new signal in the mako-specific D-Bus API to notify when the current modes are changed.
delthas has contributed support for showing redacted messages as such in gamja. goguma's compact mode now displays an unread and date delimiter, just like the default mode (thanks Eigil Skjæveland!). I've added a basic UI to my WebDAV server, sogogi, to display directory listings and easily upload files from the browser.
That's all, see you next month!
17 Feb 2025 10:00pm GMT
03 Feb 2025
planet.freedesktop.org
Christian Schaller: Looking ahead at 2025 and Fedora Workstation and jobs on offer!
So, as we are a little bit into the new year, I hope everybody had a great break and a good start to 2025. Personally I had a blast having gotten the kids an air hockey table as a Yuletide present :). Anyway, I wanted to put this blog post together talking about what we are looking at for the new year and to let you all know that we are hiring.
Artificial Intelligence
One big item on our list for the year is looking at ways Fedora Workstation can make use of artificial intelligence. Thanks to IBM's Granite effort we now have an AI engine that is available under proper open source licensing terms and which can be extended for many different use cases. Also, the IBM Granite team has an aggressive plan for releasing updated versions of Granite, incorporating new features of special interest to developers, like making Granite a great engine to power IDEs and similar tools. We have been brainstorming various ideas in the team for how we can make use of AI to provide improved or new features to users of GNOME and Fedora Workstation. This includes making sure Fedora Workstation users have access to great tools like RamaLama, that setting up accelerated AI inside Toolbx is simple, that we offer a good Code Assistant based on Granite, and that we come up with other cool integration points.
Wayland
The Wayland community had some challenges last year with frustrations boiling over a few times due to new protocol development taking a long time. Some of it was simply the challenge of finding enough people across multiple projects having the time to follow up and help review while other parts are genuine disagreements of what kind of things should be Wayland protocols or not. That said I think that problem has been somewhat resolved with a general understanding now that we have the 'ext' namespace for a reason, to allow people to have a space to review and make protocols without an expectation that they will be universally implemented. This allows for protocols of interest only to a subset of the community going into 'ext' and thus allowing protocols that might not be of interest to GNOME and KDE for instance to still have a place to live.
The other more practical problem is that of having people available to help review protocols or provide reference implementations. In a space like Wayland where you need multiple people from multiple different projects it can be hard at times to get enough people involved at any given time to move things forward, as different projects have different priorities and of course the developers involved might be busy elsewhere. One thing we have done to try to help out there is to set up a small internal team, led by Jonas Ådahl, to discuss in-progress Wayland protocols and assign people the responsibility to follow up on those protocols we have an interest in. This has been helpful both as a way for us to develop internal consensus on the best way forward, but also I think our contribution upstream has become more efficient due to this.
All that said, I also believe Wayland protocols will fade a bit into the background going forward. We are currently at the last stage of a community 'ramp up' on Wayland and thus there is a lot of focus on it, but once we are over that phase we will probably see what we saw with X.org extensions over time: that most of the time new extensions are so niche that 95% of the community don't pay attention or care. There will always be some new technology creating the need for important new protocols, but those are likely to come at a relatively slow cadence.
High Dynamic Range

HDR support in GNOME Control Center
As for concrete Wayland protocols, the single biggest thing for us for a long while now has of course been HDR support for Linux. And it was great to see the HDR protocol get merged just before the holidays. I also want to give a shout out to Xaver Hugl from the KWin project. As we were working to ramp up HDR support in both GNOME Shell and GTK+ we ended up working with Xaver and using KWin for testing, especially for the GTK+ implementation. Xaver was very friendly and collaborative and I think HDR support in both GNOME and KDE is more solid thanks to that collaboration, so thank you Xaver!
Talking about concrete progress on HDR support, Jonas Ådahl submitted merge requests for HDR UI controls for GNOME Control Center. This means you will be able to configure the use of HDR on your system in the next Fedora Workstation release.
PipeWire
I have been sharing a lot of cool PipeWire news here in the last couple of years, but things might slow down a little as we go forward, just because all the major features are basically working well now. The PulseAudio support is working well and we get very few bug reports against it now. The reports we are getting from the pro-audio community are that PipeWire works just as well as or better than JACK for most people, in terms of latency for instance, and when we do see issues with pro-audio they tend to be caused by driver issues triggered by PipeWire trying to use the device in ways that JACK didn't. We have been resolving those by adding more and more options to hardcode certain settings in PipeWire, so that just as with JACK you can force PipeWire to not try things the driver has problems with. Of course fixing the drivers would be the best outcome, but some of these pro-audio cards are so niche that it is hard to find developers who want to work on them or who have hardware to test with.
We are still maturing the video support, although even that is getting very solid now. The screen capture support is considered fully mature, but the camera support is still a bit of a work in progress, partially because we are going through a generational change in the camera landscape, with UVC cameras being supplanted by MIPI cameras. Resolving that generational change isn't just on PipeWire of course, but it does make for a more volatile landscape to mature something in. An advantage here is that applications using PipeWire can easily switch between V4L2 UVC cameras and libcamera MIPI cameras, helping users have a smooth experience through this transition period.
But even with the challenges posed by this we are moving rapidly forward with Firefox PipeWire camera support being on by default in Fedora now, Chrome coming along quickly and OBS Studio having PipeWire support for some time already. And last but not least SDL3 is now out with PipeWire camera support.
MIPI camera support
Hans de Goede and Kate Hsuan keep working on making sure MIPI cameras work under Linux. MIPI cameras are a step forward in terms of technical capabilities, but at the moment a bit of a step backward in terms of open source, as a lot of vendors believe they have 'secret sauce' in their MIPI camera stacks. Our work focuses mostly on getting the Intel MIPI stack fully working under Linux, with the Lattice MIPI aggregator currently being the biggest hurdle for some laptops. Luckily Alan Stern, the USB kernel maintainer, is looking at this now as he has the hardware himself.
Flatpak
Some major improvements to the Flatpak stack have happened recently, with the USB portal merged upstream. The USB portal came out of the Sovereign Tech Fund funding for GNOME and it gives us a more secure way to give sandboxed applications access to your USB devices. On a somewhat related note, we are still working on making system daemons installable through Flatpak, the use case being applications that have a system daemon to communicate with a specific piece of hardware, for example (usually through USB). Christian Hergert has this on his todo list, but we are at the moment waiting for Lennart Poettering to merge some prerequisite work into systemd that we want to base this on.
Accessibility
We are putting in a lot of effort towards accessibility these days. This includes working on portals and Wayland extensions to help facilitate accessibility, working on the Orca screen reader and its dependencies to ensure it works great under Wayland, working on GTK4 to ensure we have top-notch accessibility support in the toolkit, and more.
GNOME Software
Last year Milan Crha landed the support for signing the NVIDIA driver for use with secure boot. The main feature Milan is looking at now is getting support for DNF5 into GNOME Software. Doing this will resolve one of the longest-standing annoyances we had, which is that the dnf command line and GNOME Software would maintain two separate package caches. Once the DNF5 transition is done that should be a thing of the past, and thus there will be less risk of disk space being wasted on an extra set of cached packages.
Firefox
Martin Stransky and Jan Horak have been working hard at making Firefox ready for the future, with a lot of work going into making sure it supports the portals needed to function as a flatpak and into bringing HDR support to Firefox. In fact, Martin just got his HDR patches for Firefox merged this week. So with the PipeWire camera support, Flatpak support and HDR support in place, Firefox will be ready for the future.
We are hiring! looking for 2 talented developers to join the Red Hat desktop team
We are hiring! We have 2 job openings on the Red Hat desktop team, so if you are interested in joining us in pushing the boundaries of desktop Linux forward please take a look and apply. For these 2 positions we are open to remote workers across the globe, and while the job ads list specific seniorities we are somewhat flexible on that front too for the right candidate. So be sure to check out the two job listings and get your application in! If you ever wanted to work fulltime on GNOME and related technologies this is your chance.
03 Feb 2025 12:29pm GMT
20 Jan 2025
planet.freedesktop.org
André Almeida: Linux 6.13, I WANT A GUITAR PEDAL
Just as 2025 is starting, we got a new Linux release in mid January, tagged as 6.13. In the spirit of holidays, Linus Torvalds even announced during 6.13-rc6 that he would be building and raffling a guitar pedal for a random kernel developer!
As usual, this release comes with a pack of exciting news done by the kernel community:
- This release has two important improvements for task scheduling: lazy preemption and proxy execution. The goal with lazy preemption is to find a better balance between throughput and response time. A secondary goal is being able to make it the preferred non-realtime scheduling policy for most cases. Tasks that really need a reschedule in a hurry will use the older TIF_NEED_RESCHED flag. Preliminary work for proxy execution was merged, which will let us avoid priority-inversion scenarios when using real time tasks with deadline scheduling, for use cases such as Android.
- New important Rust abstractions arrived, such as VFS data structures and interfaces, and also abstractions for misc devices.
- Lightweight guard pages: guard pages are used to raise a fatal signal when accessed. This feature had the drawback of having a heavy performance impact, but in this new release the MADV_GUARD_INSTALL flag was added for the madvise() syscall, offering a lightweight way to guard pages.
To know more about the community improvements, check out the summary made by Kernel Newbies.
Now let's highlight the contributions made by Igalians for this release.
Case-insensitive support for tmpfs
Case sensitivity has been a traditional difference between Linux distros and MS Windows, with the most popular filesystems being on opposite sides: while ext4 is case sensitive, NTFS is case insensitive. This difference proved to be challenging when Windows apps, mainly games, started to be a common use case for Linux distros (thanks to Wine!). For instance, games running through Steam's Proton would expect that the paths assets/player.png and assets/PLAYER.PNG point to the same file, but this is not the case in ext4. To avoid doing workarounds in userspace, ext4 has supported casefolding since Linux 5.2.
Now, tmpfs joins the group of filesystems with case-insensitive support. This is particularly useful for running games inside containers, like the combination of Wine + Flatpak. In such scenarios, the container shares a subset of the host filesystem with the application, mounting it using tmpfs. To keep the filesystem consistent, with the same expectations of the host filesystem about the mounted one, if the host filesystem is case-insensitive we can do the same thing for the container filesystem too. You can read more about the use case in the patchset cover letter.
While the container frameworks implement proper support for this feature, you can play with it and try it yourself:
$ mount -t tmpfs -o casefold fs_name /mytmpfs
$ cd /mytmpfs # case-sensitive by default, we still need to enable it
$ mkdir a
$ touch a; touch A
$ ls
A a
$ mkdir B; cd b
cd: The directory 'b' does not exist
$ # now let's create a case-insensitive dir
$ mkdir case_dir
$ chattr +F case_dir
$ cd case_dir
$ touch a; touch A
$ ls
a
$ mkdir B; cd b
$ pwd
/home/user/mytmpfs/case_dir/B
V3D Super Pages support
As part of Igalia's effort for enhancing the graphics stack for Raspberry Pi, the V3D DRM driver now has support for Super Pages, improving performance and making memory usage more efficient for Raspberry Pi 4 and 5. Using Linux 6.13, the driver will enable the MMU to allocate not only the default 4KB pages, but also 64KB "Big Pages" and 1MB "Super Pages".
To measure the difference that Super Pages make to performance, a series of benchmarks was used, and the highlights are:
- +8.36% of FPS boost for Warzone 2100 in RPi4
- +3.62% of FPS boost for Quake 2 in RPi5
- 10% time reduction for the Mesa CI job v3dv-rpi5-vk-full:arm64
- Aether SX2 emulator is more fluid to play
You can read a detailed post about this, with all benchmark results, in Maíra's blog post, including a super cool PlayStation 2 emulation showcase!
New transparent_hugepage_shmem= command-line parameter
Igalia contributed new kernel command-line parameters to improve the configuration of multi-size Transparent Huge Pages (mTHP) for shmem. These parameters, transparent_hugepage_shmem= and thp_shmem=, enable more flexible and fine-grained control over the allocation of huge pages when using shmem.
The transparent_hugepage_shmem= parameter allows users to set a global default huge page allocation policy for the internal shmem mount. This is particularly valuable for DRM GPU drivers. Just like CPU architectures, GPUs can also take advantage of huge pages, but this is possible only if DRM GEM objects are backed by huge pages.
Since GEM uses shmem to allocate anonymous pageable memory, having control over the default huge page allocation policy allows for the exploration of huge pages use on GPUs that rely on GEM objects backed by shmem.
In addition, the thp_shmem= parameter provides fine-grained control over the default huge page allocation policy for specific huge page sizes.
By configuring page sizes and policies of huge-page allocations for the internal shmem mount, these changes complement the V3D Super Pages feature, as we can now tailor the size of the huge pages to the needs of our GPUs.
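As a concrete illustration, a boot command line could combine both knobs roughly as follows. The syntax is as described in the kernel's transparent hugepage admin guide; the sizes and policies below are only an example, not a recommendation, and the supported huge page sizes depend on the architecture:

```shell
# Kernel boot parameters (appended in the bootloader config):
# - transparent_hugepage_shmem= sets the global default policy for the
#   internal shmem mount ("within_size" allocates huge pages only when
#   they fit within the size of the object).
# - thp_shmem= overrides that policy per huge page size,
#   here 64K pages vs 2M pages.
transparent_hugepage_shmem=within_size thp_shmem=64K:within_size;2M:advise
```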
DRM and AMDGPU improvements
As usual in Linux releases, this one collects a list of improvements made by our team in the DRM core and the AMDGPU driver over the last cycle.
Cosmic (the desktop environment behind Pop!_OS) users discovered some bugs in the AMD display driver regarding the handling of overlay planes. These issues were pre-existing and came to light with the introduction of cursor overlay mode; they were causing page faults and divide errors. We debugged the issue together with the reporters and proposed a set of solutions that were ultimately accepted by AMD developers in time for this release.
In addition, we worked with AMD developers to migrate the driver-specific handling of EDID data to the DRM common code, using drm_edid opaque objects to avoid handling raw EDID data. The first phase was incorporated and allowed the inclusion of new functionality to get EDID from ACPI. However, some dependencies between the AMD driver's Linux-dependent and OS-agnostic components were left to be resolved in later iterations, so the next steps will focus on removing the legacy way of handling this data.
Also in the AMD driver, we fixed an out-of-bounds memory write, fixed a warning on a boot regression, and exposed special GPU memory pools via the common DRM fdinfo framework.
In the DRM scheduler code, we added some missing locking, removed a couple of re-lock cycles for slightly reduced command submission overheads and clarified the internal documentation.
In the common dma-fence code, we fixed one memory leak on the failure path and one significant runtime memory leak caused by incorrect merging of fences. The latter was found by the community and was manifesting itself as a system out of memory condition after a few hours of gameplay.
sched_ext
sched_ext landed in kernel 6.12 to enable efficient development of BPF-based custom schedulers. During the 6.13 development cycle, the sched_ext community worked on hardening the code to make it more reliable and on cleaning up the BPF APIs and documentation for clarity.
Igalia contributed to hardening the sched_ext core code. We fixed incorrect use of the scheduler run queue lock, especially during initialization and finalization of the BPF scheduler. We also added missing RCU lock protection when the sched_ext core selects a CPU for a task. Without these fixes, the sched_ext core could, in the worst case, crash or trigger a kernel oops.
Other Contributions & Fixes
syzkaller, a kernel fuzzer, has been an important instrument for finding kernel bugs. With the help of KASAN, a memory error detector, and syzbot, numerous such bugs have been reported and fixed.
Igalians have contributed fixes across many subsystems (media, networking, etc.), helping reduce the number of open bugs.
Check the complete list of Igalia's contributions for the 6.13 release
Authored (70)
André Almeida
- unicode: Fix utf8_load() error path
- MAINTAINERS: Add Unicode tree
- scripts/kernel-doc: Fix build time warnings
- libfs: Create the helper function generic_ci_validate_strict_name()
- ext4: Use generic_ci_validate_strict_name helper
- unicode: Export latest available UTF-8 version number
- unicode: Recreate utf8_parse_version()
- libfs: Export generic_ci_ dentry functions
- tmpfs: Add casefold lookup support
- tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirs
- tmpfs: Expose filesystem features via sysfs
- docs: tmpfs: Add casefold options
- libfs: Fix kernel-doc warning in generic_ci_validate_strict_name
- tmpfs: Fix type for sysfs' casefold attribute
- tmpfs: Initialize sysfs during tmpfs init
Changwoo Min
- sched_ext: Replace rq_lock() to raw_spin_rq_lock() in scx_ops_bypass()
- sched_ext: Clarify sched_ext_ops table for userland scheduler
- sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
- MAINTAINERS: add me as reviewer for sched_ext
Christian Gmeiner
Guilherme G. Piccoli
- Documentation: Improve crash_kexec_post_notifiers description
- wifi: rtlwifi: Drastically reduce the attempts to read efuse in case of failures
Maíra Canal
- drm/v3d: Address race-condition in MMU flush
- drm/v3d: Flush the MMU before we supply more memory to the binner
- drm/v3d: Fix return if scheduler initialization fails
- drm/gem: Create a drm_gem_object_init_with_mnt() function
- drm/v3d: Introduce gemfs
- drm/gem: Create shmem GEM object in a given mountpoint
- drm/v3d: Reduce the alignment of the node allocation
- drm/v3d: Support Big/Super Pages when writing out PTEs
- drm/v3d: Use gemfs/THP in BO creation if available
- drm/v3d: Add modparam for turning off Big/Super Pages
- drm/v3d: Expose Super Pages capability
- drm/vc4: Use vc4_perfmon_find()
- MAINTAINERS: Add Maíra to VC4 reviewers
- mm: shmem: control THP support through the kernel command line
- mm: move get_order_from_str() to internal.h
- mm: shmem: override mTHP shmem default with a kernel parameter
- mm: huge_memory: use strscpy() instead of strcpy()
- drm/v3d: Enable Performance Counters before clearing them
- drm/v3d: Ensure job pointer is set to NULL after job completion
Melissa Wen
- drm/amd/display: switch amdgpu_dm_connector to use struct drm_edid
- drm/amd/display: switch to setting physical address directly
- drm/amd/display: always call connector_update when parsing freesync_caps
- drm/amd/display: remove redundant freesync parser for DP
- drm/amd/display: add missing tracepoint event in DM atomic_commit_tail
- drm/amd/display: fix page fault due to max surface definition mismatch
- drm/amd/display: increase MAX_SURFACES to the value supported by hw
- drm/amd/display: fix divide error in DM plane scale calcs
Thadeu Lima de Souza Cascardo
- media: uvcvideo: Require entities to have a non-zero unique ID
- hfsplus: don't query the device logical block size multiple times
- Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Tvrtko Ursulin
- drm/v3d: Appease lockdep while updating GPU stats
- drm/sched: Add locking to drm_sched_entity_modify_sched
- Documentation/gpu: Document the situation with unqualified drm-memory-
- drm/amdgpu: Drop unused fence argument from amdgpu_vmid_grab_used
- drm/amdgpu: Use drm_print_memory_stats helper from fdinfo
- drm/amdgpu: Drop impossible condition from amdgpu_job_prepare_job
- drm/amdgpu: Remove the while loop from amdgpu_job_prepare_job
- drm/sched: Optimise drm_sched_entity_push_job
- drm/sched: Stop setting current entity in FIFO mode
- drm/sched: Re-order struct drm_sched_rq members for clarity
- drm/sched: Re-group and rename the entity run-queue lock
- drm/sched: Further optimise drm_sched_entity_push_job
- drm/amd/pm: Vangogh: Fix kernel memory out of bounds write
- drm/amdgpu: Stop reporting special chip memory pools as CPU memory in fdinfo
- drm/amdgpu: Expose special on chip memory pools in fdinfo
- dma-fence: Fix reference leak on fence merge failure path
- dma-fence: Use kernel's sort for merging fences
- workqueue: Do not warn when cancelling WQ_MEM_RECLAIM work from !WQ_MEM_RECLAIM worker
Reviewed (41)
André Almeida
- futex: Use atomic64_inc_return() in get_inode_sequence_number()
- futex: Use atomic64_try_cmpxchg_relaxed() in get_inode_sequence_number()
- mm: shmem: use signed int for version handling in casefold option
Christian Gmeiner
- drm/vc4: Use vc4_perfmon_find()
- drm/etnaviv: Request pages from DMA32 zone on addressing_limited
- drm/etnaviv: Use unsigned type to count the number of pages
- drm/etnaviv: Use 'unsigned' type to count the number of pages
- drm/etnaviv: Drop the <linux/pm_runtime.h> header
- drm/etnaviv: Fix missing mutex_destroy()
- drm/etnaviv: hold GPU lock across perfmon sampling
- drm/etnaviv: assert GPU lock held in perfmon pipe_*_read functions
- drm/etnaviv: unconditionally enable debug registers
- drm/etnaviv: update hardware headers from rnndb
- drm/etnaviv: take current primitive into account when checking for hung GPU
- drm/etnaviv: always allocate 4K for kernel ringbuffers
- drm/etnaviv: flush shader L1 cache after user commandstream
Iago Toral Quiroga
- drm/v3d: Address race-condition in MMU flush
- drm/v3d: Flush the MMU before we supply more memory to the binner
- drm/v3d: Fix return if scheduler initialization fails
- drm/v3d: Introduce gemfs
- drm/v3d: Reduce the alignment of the node allocation
- drm/v3d: Expose Super Pages capability
- drm/v3d: Enable Performance Counters before clearing them
Jose Maria Casanova Crespo
Juan A. Suarez
Maíra Canal
- drm/v3d: Use v3d_perfmon_find()
- drm/vc4: Run default client setup for all variants.
- drm/vc4: Match drm_dev_enter and exit calls in vc4_hvs_lut_load
- drm/vc4: Match drm_dev_enter and exit calls in vc4_hvs_atomic_flush
- drm/vc4: Correct generation check in vc4_hvs_lut_load
- drm/vkms: Drop unnecessary call to drm_crtc_cleanup()
Tvrtko Ursulin
- drm/gem: Create a drm_gem_object_init_with_mnt() function
- drm/gem: Create shmem GEM object in a given mountpoint
- drm/v3d: Support Big/Super Pages when writing out PTEs
- drm/v3d: Use gemfs/THP in BO creation if available
- drm/v3d: Add modparam for turning off Big/Super Pages
- drm: add DRM_SET_CLIENT_NAME ioctl
- drm: use drm_file client_name in fdinfo
- drm/amdgpu: make drm-memory-* report resident memory
- dma-buf: fix dma_fence_array_signaled v4
- dma-buf: Fix __dma_buf_debugfs_list_del argument for !CONFIG_DEBUG_FS
Tested (1)
Christian Gmeiner
Acked (5)
Changwoo Min
- sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
- sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
- sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() to scx_bpf_dsq_move[_vtime]*()
Maíra Canal
Maintainer SoB (6)
Maíra Canal
- MAINTAINERS: remove myself as a VKMS maintainer
- MAINTAINERS: Add myself as VKMS Maintainer
- drm/vkms: Add documentation
- drm/vkms: Suppress context imbalance detected by sparse warning
- drm/vkms: Add missing check for CRTC initialization
- drm/v3d: Drop allocation of object without mountpoint
20 Jan 2025 12:00am GMT
18 Jan 2025
planet.freedesktop.org
Simon Ser: Status update, January 2025
Hi all!
FOSDEM is approaching rapidly! I'll be there and will give a talk about modern IRC.
In wlroots land, we've finally merged support for the next-generation screen capture protocols, ext-image-capture-source-v1 and ext-image-copy-capture-v1! Compared to the previous wlroots-specific protocol, the new one provides better damage tracking, enables cursor capture (useful for remote desktop apps) and per-window capture (this part is not yet implemented in wlroots). Thanks to Kirill Primak, wlroots now supports the xdg-toplevel-icon-v1 protocol, useful for clients which want to update their window icon without changing their application ID (either by providing an icon name or pixel buffers). Kirill also added safety assertions everywhere in wlroots to ensure that all listeners are properly removed when a struct is destroyed.
I've revived some old patches to better identify outputs in wlroots and libdisplay-info. Currently, there are two common ways to refer to an output: either by its name (e.g. "DP-2"), or by its make+model+serial (e.g. "Foo Corp C4FE 42424242"). Unfortunately, both of these naming schemes have downsides. The name is ill-suited to configuration files because it's unstable and might change on reboot or unplug (it depends on driver load order, and DP-MST connectors get a new name each time they are re-plugged). The make+model+serial uses a database to look up the human-readable manufacturer name (so database updates break config files), and is not unique enough (different models might share a duplicate string). A new wlr_output.port field and a libdisplay-info device tag should address these shortcomings.
Jacob McNamee has contributed a Sway patch to add security context properties to IPC, criteria and title format. With this patch, scripts can now figure out whether an application is sandboxed, and a special title can be set for sandboxed (or unsandboxed) apps. There are probably more use-cases we didn't think of!
I've managed to put aside some time to start reviewing the DRM color pipeline patches. As discussed at the last XDC, it's in a pretty good shape, so I've started dropping some Reviewed-by tags. While discussing with David Turner about libliftoff, I realized that the DRM_MODE_PAGE_FLIP_EVENT flag was missing some documentation (it's not obvious how it interacts with the atomic uAPI), so I've sent a patch to fix that.
I continue pushing small updates to go-imap, bringing it little by little closer to version 2.0. I've added helpers to make it easier for servers to implement the FETCH command, implemented FETCH BINARY and header field decoding for SEARCH in the built-in in-memory server, added limits on IMAP command size to prevent denial-of-service, and fixed a few bugs. While testing with ImapTest, I discovered and fixed a bug in Go's mime/quotedprintable package.
Thanks to pounce, goguma now internally keeps track of message reactions. This is not used just yet, but will be soon once we add a user interface to display and send reactions. Support for deleting messages (called "redact" in the spec) has been merged. I've also implemented a small date indicator which shows up when scrolling in a conversation.
That's all for this month, see you at FOSDEM!
18 Jan 2025 10:00pm GMT
16 Jan 2025
Christian Gmeiner: Multiple Render Targets for etnaviv
Modern graphics programming revolves around achieving high-performance rendering and visually stunning effects. Among OpenGL's capabilities, Multiple Render Targets (MRTs) are particularly valuable for enabling advanced rendering techniques with greater efficiency. With the latest Mesa 24.3 release and the commitment from Igalia, the etnaviv GPU driver now includes support for MRTs. If you've ever wondered how MRTs can transform your graphics pipeline or are curious about the challenges of implementing this feature, this blog post is for you.
16 Jan 2025 12:00am GMT
14 Jan 2025
Hans de Goede: IPU6 camera support status update
The initial IPU6 camera support that landed in Fedora 41 works only on a limited set of laptops. The reason for this is that with MIPI cameras, every different sensor and glue chip (such as IO expanders) needs to be supported separately.
I have been working on making the camera work on more laptop models. After receiving and sending many emails and blog post comments about this I have started filing Fedora bugzilla issues on a per sensor and/or laptop-model basis to be able to properly keep track of all the work.
Currently, the following issues are either actively being worked on or being tracked to be fixed in the future.
Issues which have fixes pending (review) upstream:
- IPU6 camera on TERRA PAD 1262 V2 not working, fix has been accepted upstream.
- IPU6 camera on Dell XPS 9x40 models with ov02c10 sensor not working, sensor driver has been submitted upstream.
Open issues with various states of progress:
- IPU6 camera on Dell Latitude 7450 laptop not working
- IPU6 camera on HP Spectre x360 14-eu0xxx / Spectre 16 MeteorLake with ov08x40 not working
- IPU6 camera on HP Spectre x360 2-in-1 16-f1xxx/891D with hi556 sensor not working
- IPU6 camera on Lenovo ThinkPad X1 Carbon Gen 12 not working
- Lattice MIPI Aggregator support for IPU6 cameras
- Lunar Lake MIPI camera / IPU7 CSI receiver support
- ov01a10 camera sensor driver lacks 1296x816 mode support
- No driver for ov01a1s camera sensor
- iVSC fails to probe with ETIMEDOUT
- iVSC fails to probe with EINVAL on XPS 9315
See all the individual bugs for more details. I plan to post semi-regular status updates on this on my blog.
The above list of issues can also be found on my Fedora 42 change proposal tracking this, and I intend to keep an updated complete list of all x86 MIPI camera issues (including closed ones) there.
14 Jan 2025 2:21pm GMT
09 Jan 2025
Mike Blumenkrantz: Rake In Bike
First Perf of the Year
I got a ticket last year about this game Everspace having bad perf on zink. I looked at it a little then, but it was the end of the year and I was busy doing other stuff. More important stuff. I definitely wasn't just procrastinating.
In any case, I didn't fix it last year, so I dusted it off the other day and got down to business. Unsurprisingly, it was still slow.
Easing Into Speed
The first step is always a flamegraph, and as expected, I got a hit:
Huge bottlenecking when checking query results, specifically in semaphore waits. What's going on here?
What's going on is this game is blocking on timestamp queries, and the overhead of doing vkWaitSemaphores(t=0) to check drm syncobj progress for the result is colossal. Who could have guessed that using core Vulkan mechanics in a hotpath would obliterate perf?
Fixing this is very stupid: directly checking query results with vkGetQueryPoolResults avoids syncobj access inside drivers by accessing what are effectively userspace fences, which Vulkan doesn't directly permit. If an app starts polling on query results, zink now uses this rather than its usual internal QBO mechanism.
Bottleneck uncorked and performance fixed. Right?
Naaaaaa
The perf is still pretty bad. It's time to check in with the doctor. Looking through some of the renderpasses reveals all kinds of begin/end tomfoolery. Paring this down, renderpasses are being split for layout changes to toggle feedback loops:
The game is rendering to one miplevel of a framebuffer attachment while sampling from another miplevel of the same image. This breaks zink's heuristic for detecting implicit feedback loops. Improvements here tighten up that detection to flatten out the renderpasses.
Gottagofastium
Perf recovered: the game runs roughly 150% faster, putting it on par with RadeonSI. Maybe some other games will be affected? Who can say.
09 Jan 2025 12:00am GMT