26 May 2025
planet.freedesktop.org
Mike Blumenkrantz: Monthly Post
Well.
I had intended to be writing this post over a month ago, but [for reasons] I'm here writing it now.
Way back in March of '25, I was doing work that I could talk about publicly, and a sizable chunk of that was working to improve Gallium. The stopping point of that work was the colossal !34054, which roughly amounts to "remove a single * from a struct". The result was rewriting every driver and frontend in the tree to varying extents:
` 260 files changed, 2179 insertions(+), 2331 deletions(-)`
So as I was saying, I expected to merge this right after the 25.1 branchpoint back around mid-April, which would have allowed me to keep my train of thought and momentum. Sadly this did not come to pass, and as a result I've forgotten most of the key points of that blog post (and related memes). But I still have this:
But Hwhy?
As readers of this blog, you're all very smart. You can smell bullshit a country mile away. That's why I'm going to treat you like the intelligent rhinoceroses you are and tell you right now that I no longer have any of the performance statistics I'd gathered for this post. We're all gonna have to go on vibes and #TrustMeBuddy energy. I'll begin by posing a hypothetical to you.
Suppose you're running a complex application. Suppose this application has threads which share data. Now suppose you're running this on an AMD CPU. What is your most immediate, significant performance concern?
If you said atomic operations, you are probably me from way back in February. Take that time machine and get back where you belong. The problems are not fixed.
AMD CPUs are bad with atomic operations. It's a feature. No, I will not go into more detail; months have passed since I read all those dissertations, and I can't remember what I ate for breakfast an hour ago. #TrustMeBuddy.
I know what you're thinking. Mike, why aren't you just pinning your threads?
Well, you incredibly handsome reader, the thing is thread pinning is a lie. You can pin threads by setting their affinity to keep them on the same CCX, and L3 cache, and blah blah blah, and even when you do that sometimes it has absolutely zero fucking effect and your fps is still 6. There is no explanation. PhDs who work on atomic operations in compilers cannot explain this. The dark lord Yog-Sothoth cowers in fear when pressed for details. Even tariffs on performance penalties cannot mitigate this issue.
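For reference, the kind of pinning being dismissed here looks roughly like the sketch below, using pthread_setaffinity_np; the core IDs are an assumption on my part, since the actual core-to-CCX mapping is machine-specific:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to cores 0-7, hypothetically one CCX.
 * Check the real topology (hwloc, /sys/devices/system/cpu) before
 * assuming any particular core-to-CCX mapping. */
static int pin_to_ccx(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

And even with all of that done correctly, per the above: sometimes your fps is still 6.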
In that sense, when you have your complex threadful application which uses atomic operations on an AMD CPU, and when you want to achieve the same performance it can have for free on a different type of CPU, you have four options:
- pray that CPUs, kernels, and compilers improve to the extent that the problem eventually goes away on its own
- stop using AMD CPUs
- stop using threads
- stop using atomic operations
Obviously none of these options are very appealing. If you have a complex application, you need threads, you need your AMD CPU with its bazillion cores, you need atomic operations, and, being realistic, the situation here with hardware/kernel/compiler is not going to improve before AI takes over my job and I quit to become a full-time novel writer in the budding rom-pixel-com genre.
While eliminating all atomic operations isn't viable, eliminating a certain class of them is theoretically possible. I'm talking, of course, about reference counting, the means by which C developers LARP as Java developers.
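Since that's the crux of it: reference counting is atomic operations. A minimal sketch of the general pattern (not Mesa's actual pipe_reference code) shows every ref and unref doing an atomic read-modify-write, which is exactly the class of operation in question:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct refcounted {
    atomic_int refcount;
    /* ... object state ... */
};

static void ref(struct refcounted *obj)
{
    atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
}

static void unref(struct refcounted *obj)
{
    /* acq_rel ordering so the thread doing the final unref sees all
     * prior writes to the object before freeing it */
    if (atomic_fetch_sub_explicit(&obj->refcount, 1,
                                  memory_order_acq_rel) == 1)
        free(obj);
}
```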
In Mesa, nearly every object is reference counted, especially the ones which have no need for garbage collection. Haters will scream REWRITE IT IN RUST, but I'm not going to do that until someone finally rewrites the GLSL compiler in Rust to kick off the project. That's right, I'm talking to all you rustaceans out there: do something useful for once instead of rewriting things that aren't the best graphics stack on the planet.
A great example of this reference counting overreliance was sampler views, which I took a hatchet to some months ago. This is a context-specific object which has a clear ownership pattern. Why was it reference counted? Science cannot explain this, but psychologists will tell you that engineers will always follow existing development patterns without question regardless of how non-performant they may be. Don't read any zink code to find examples. #TrustMeBuddy.
Sampler views were a relatively easy pickup, more like a proof of concept to see if the path was viable. Upon succeeding, I immediately rushed to the hardest possible task: the framebuffer. Framebuffer surfaces can be shared between contexts, which makes them extra annoying to solve in this case. For that reason, the solution was not to try a similar approach, it was to step back and analyze the usage and ownership pattern.
Why pipe_surface*?
Originally the pipe_surface object was used solely for framebuffers, but this concept has since metastasized to clear operations and even video. It's a useful object at a technical level: it provides a resource, format, miplevel, and layers. But does it really need to be an object?
Deeper analysis said no: the vast majority of drivers didn't use this for anything special, and few drivers invested into architecture based on this being an actual object vs just having the state available. The majority of usage was pointlessly passing the object around because the caller handed it off to another function.
Of course, in the process of this analysis, I noted that zink was one of the heaviest investors into pipe_surface*. Pour one out for my past decision-making process. But I pulled myself up by my bootstraps, and I rewrote every driver and every frontend, and now whenever the framebuffer changes there are at least num_attachments * (frontend_thread + tc_thread + driver_thread) fewer atomic operations.
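To make the shape of the change concrete, here's a hypothetical before/after sketch; the struct and field names are invented for illustration and are not the actual Mesa diff:

```c
/* Hypothetical sketch -- names invented, not the actual Mesa change.
 * "struct pipe_surface" stands in for Mesa's real type. */
struct pipe_surface { int dummy; /* resource, format, miplevel, layers */ };

/* Before: refcounted pointers; every framebuffer change ref/unrefs
 * each attachment, i.e. atomic read-modify-write operations. */
struct fb_state_before {
    struct pipe_surface *cbufs[8];
};

/* After: surfaces embedded by value; a framebuffer change is a
 * plain struct copy, no reference counting involved. */
struct fb_state_after {
    struct pipe_surface cbufs[8];
};
```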
More Work
This saga is not over. There's still base buffers and images to go, which is where a huge amount of performance is lost if you are hitting an affected codepath. Ideally those changes will be smaller and more concentrated than the framebuffer refactor.
Ideally I will find time for it.
#TrustMeBuddy.
26 May 2025 12:00am GMT
André Almeida: Linux 6.15, DRM scheduler, wedged events, sched_ext and more
Linux 6.15 has just been released, bringing a lot of new features:
- nova-core, the "base" driver for the new NVIDIA GPU driver, written in Rust. The nova project will eventually replace the Nouveau driver for all GSP-based GPUs.
- RISC-V gained support for some extensions: BFloat16 floating-point, Zaamo, Zalrsc and ZBKB.
- The fwctl subsystem has been merged. This new family of drivers acts as a transport layer between userspace and complex firmware. To understand more about its controversies and how it got merged, check out this LWN article.
- Support for MacBook touch bars, both as a DRM driver and input source.
- Support for Adreno 623 GPU.
As always, I suggest having a look at the Kernel Newbies summary. Now, let's have a look at Igalia's contributions.
DRM wedged events
In 3D graphics APIs such as Vulkan and OpenGL, there are mechanisms that applications can rely on to check whether the GPU has been reset (you can read more about this in the kernel documentation). However, there was no generic mechanism to inform userspace that a GPU reset has happened. This is useful because in some cases the reset affects not only the app involved in the reset, but the whole graphics stack, which then needs some action to recover, like a module rebind or even a bus reset to recover the hardware. For this release, we helped add a userspace event for this, so a daemon or the compositor can listen to it and trigger some recovery measure after the GPU has reset. Read more in the kernel docs.
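As a rough sketch of how a recovery daemon might consume this, assuming the event arrives as a uevent on the drm subsystem with a WEDGED property carrying the recovery hint (key name per the kernel docs; verify against your kernel version):

```c
#include <libudev.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
    struct udev *udev = udev_new();
    struct udev_monitor *mon =
        udev_monitor_new_from_netlink(udev, "udev");

    udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
    udev_monitor_enable_receiving(mon);

    struct pollfd pfd = {
        .fd = udev_monitor_get_fd(mon),
        .events = POLLIN,
    };
    while (poll(&pfd, 1, -1) > 0) {
        struct udev_device *dev = udev_monitor_receive_device(mon);
        if (!dev)
            continue;
        /* WEDGED is assumed to carry the suggested recovery method,
         * e.g. "none", "rebind" or "bus-reset". */
        const char *wedged =
            udev_device_get_property_value(dev, "WEDGED");
        if (wedged)
            printf("GPU wedged, recovery hint: %s\n", wedged);
        udev_device_unref(dev);
    }
    return 0;
}
```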
DRM scheduler work
In the DRM scheduler area, in preparation for the future scheduling improvements, we worked on cleaning up the code base, better separating the internal and external interfaces, and adding formal interfaces where individual drivers had too much knowledge of the scheduler internals.
General GPU/DRM stack
In the wider GPU stack area we optimised the most frequent dma-fence single fence merge operation to avoid memory allocations and array sorting. This should slightly reduce the CPU utilisation with workloads which use the DRM sync objects heavily, such as the modern composited desktops using Vulkan explicit sync.
Some releases ago, we helped to enable async page flips in the atomic DRM uAPI. Until now, this feature was only enabled for the primary plane. In this release, we added a mechanism for the driver to decide which planes can perform async flips. We used this to enable overlay planes to do async flips in the AMDGPU driver.
We also fixed a bug in the DRM fdinfo common layer which could cause a use-after-free after driver unbind.
Intel Xe driver improvements
On the Intel GPU specific front we worked on adding better Alderlake-P support to the new Intel Xe driver by identifying and adding missing hardware workarounds, fixed the workaround application in general and also made some other smaller improvements.
sched_ext
When developing and optimizing a sched_ext-based scheduler, it is important to understand the interactions between the BPF scheduler and the in-kernel sched_ext core. If there is a mismatch between what the BPF scheduler developer expects and how the sched_ext core actually works, such a mismatch could often be the source of bugs or performance issues.
To address such a problem, we added a mechanism to count and report the internal events of the sched_ext core. This significantly improves the visibility of subtle edge cases, which might easily slip by otherwise. So far, eight events have been added, and the events can be monitored through a BPF program, sysfs, and a tracepoint.
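As a trivial example of the sysfs side, a reader like the one below should work, assuming the node exported by the sysfs 'events' patch in the list below lives at /sys/kernel/sched_ext/root/events (path taken from the patch series; verify on your kernel):

```c
#include <stdio.h>

int main(void)
{
    /* Path assumed from the sched_ext sysfs 'events' patch;
     * confirm it exists on your kernel before relying on it. */
    FILE *f = fopen("/sys/kernel/sched_ext/root/events", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout); /* one "EVENT_NAME count" pair per line */
    fclose(f);
    return 0;
}
```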
A few less bugs
As usual, as part of our work on diverse projects, we keep an eye on automated test results to look for potential security and stability issues in different kernel areas. We're happy to have contributed to making this release a bit more robust by fixing bugs in memory management, network (SCTP), ext4, suspend/resume and other subsystems.
This is the complete list of Igalia's contributions for this release:
Authored (75)
André Almeida
- drm/amdgpu: Use device wedged event
- drm/atomic: Let drivers decide which planes to async flip
- drm/amdgpu: Enable async flip on overlay planes
- drm/amdgpu: Log the creation of a coredump file
- drm/amdgpu: Log after a successful ring reset
- drm/amdgpu: Create a debug option to disable ring reset
- drm/amdgpu: Trigger a wedged event for ring reset
Angelos Oikonomopoulos
Bhupesh
Changwoo Min
- sched_ext: Implement event counter infrastructure
- sched_ext: Add an event, SCX_EV_SELECT_CPU_FALLBACK
- sched_ext: Add an event, SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE
- sched_ext: Add an event, SCX_EV_DISPATCH_KEEP_LAST
- sched_ext: Add an event, SCX_EV_ENQ_SKIP_EXITING
- sched_ext: Add an event, SCX_EV_BYPASS_ACTIVATE
- sched_ext: Add an event, SCX_EV_BYPASS_DISPATCH
- sched_ext: Add an event, SCX_EV_BYPASS_DURATION
- sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers
- sched_ext: Print core event count in scx_central scheduler
- sched_ext: Print core event count in scx_qmap scheduler
- sched_ext: Add an event, SCX_EV_ENQ_SLICE_DFL
- sched_ext: Print an event, SCX_EV_ENQ_SLICE_DFL, in scx_qmap/central
- tools/sched_ext: Compatible testing of SCX_ENQ_CPU_SELECTED
- tools/sched_ext: Update enum_defs.autogen.h
- sched_ext: Provides a sysfs 'events' to expose core event counters
- sched_ext: Change the event type from u64 to s64
- sched_ext: Add trace point to track sched_ext core events
Gavin Guo
Guilherme G. Piccoli
- x86/tsc: Always save/restore TSC sched_clock() on suspend/resume
- scripts: add script to extract built-in firmware blobs
Luis Henriques
Maíra Canal
- drm/v3d: Remove
v3d->cpu_job
- drm/vc4: Use DRM Execution Contexts
- drm/vc4: Use DMA Resv to implement VC4 wait BO IOCTL
- drm/vc4: Remove BOs seqnos
- drm/v3d: Fix Indirect Dispatch configuration for V3D 7.1.6 and later
- drm/v3d: Add job to pending list if the reset was skipped
Melissa Wen
- drm/amd/display: restore edid reading from a given i2c adapter
- drm/amd/display: Fix null check of pipe_ctx->plane_state for update_dchubp_dpp
- Revert "drm/amd/display: Hardware cursor changes color when switched to software cursor"
Ricardo Cañuelo Navarro
- sctp: detect and prevent references to a freed transport in sendmsg
- mm: fix copy_vma() error handling for hugetlb mappings
Rodrigo Siqueira
- Documentation/gpu: Add acronyms for some firmware components
- MAINTAINERS: Change my role from Maintainer to Reviewer
- mailmap: Add entry for Rodrigo Siqueira
Thadeu Lima de Souza Cascardo
- dlm: prevent NPD when writing a positive value to event_done
- char: misc: improve testing Kconfig description
- parisc: perf: use named initializers for struct miscdevice
- drm/amd/display: avoid NPD when ASIC does not support DMUB
- tpm: do not start chip while suspended
- i2c: cros-ec-tunnel: defer probe if parent EC is not present
- char: misc: register chrdev region with all possible minors
Tvrtko Ursulin
- dma-fence: Add a single fence fast path for fence merging
- dma-fence: Add some more fence-merge-unwrap tests
- drm/sched: Delete unused update_job_credits
- drm/sched: Remove weak paused submission checks
- drm/sched: Add helper to check job dependencies
- drm/imagination: Use the drm_sched_job_has_dependency helper
- drm/scheduler: Remove some unused prototypes
- drm/sched: Add internal job peek/pop API
- drm/amdgpu: Pop jobs from the queue more robustly
- drm/sched: Remove a hole from struct drm_sched_job
- drm/sched: Move drm_sched_entity_is_ready to internal header
- drm/sched: Move internal prototypes to internal header
- drm/sched: Group exported prototypes by object type
- drm/xe: Fix GT "for each engine" workarounds
- drm/xe/xelp: Move Wa_16011163337 from tunings to workarounds
- drm/xe/xelp: Add Wa_1604555607
- drm/xe/xelp: L3 recommended hashing mask
- drm/xe: Add performance tunings to debugfs
- drm/xe: Fix MOCS debugfs LNCF readout
- drm/xe: Fix ring flush invalidation
- drm/xe: Pass flags directly to emit_flush_imm_ggtt
- drm/xe: Use correct type width for alignment in fb pinning code
- drm/fdinfo: Protect against driver unbind
Reviewed (30)
André Almeida
- drm: Introduce device wedged event
- drm/doc: Document device wedged event
- selftests/futex: futex_waitv wouldblock test should fail
- drm: Fix potential overflow issue in event_string array
Christian Gmeiner
Iago Toral Quiroga
- drm/v3d: Fix Indirect Dispatch configuration for V3D 7.1.6 and later
- drm/v3d: Add job to pending list if the reset was skipped
Jose Maria Casanova Crespo
Luis Henriques
Maíra Canal
- drm/vkms: Switch to managed for connector
- drm/vkms: Switch to managed for encoder
- drm/sched: Use struct for drm_sched_init() params
- drm/v3d: Add clock handling
Melissa Wen
- drm/vc4: Use DRM Execution Contexts
- drm/vc4: Use DMA Resv to implement VC4 wait BO IOCTL
- drm/vc4: Remove BOs seqnos
Rodrigo Siqueira
- Documentation/gpu: remove duplicate entries in different glossaries
- drm/amd/display: change kzalloc to kcalloc in dcn30_validate_bandwidth()
- drm/amd/display: change kzalloc to kcalloc in dcn31_validate_bandwidth()
- drm/amd/display: change kzalloc to kcalloc in dcn314_validate_bandwidth()
- drm/amd/display: change kzalloc to kcalloc in dml1_validate()
- drm/amd/display: Remove incorrect macro guard
- drm/amd/display: avoid NPD when ASIC does not support DMUB
Thadeu Lima de Souza Cascardo
Tvrtko Ursulin
- dma-buf: add selftest for fence order after merge
- drm/i915/pmu: Drop custom hotplug code
- drm/file: Add fdinfo helper for printing regions with prefix
- drm/sched: Use struct for drm_sched_init() params
- drm/sched: drm_sched_job_cleanup(): correct false doc
- drm/xe/rtp: Drop sentinels from arg to xe_rtp_process_to_sr()
Tested (2)
Changwoo Min
Guilherme G. Piccoli
Acked (12)
Changwoo Min
- tool/sched_ext: Event counter dumping updates
- sched_ext: Count SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE in the right spot
- sched_ext: Add SCX_EV_ENQ_SKIP_MIGRATION_DISABLED
- sched_ext: Take NUMA node into account when allocating per-CPU cpumasks
- tools/sched_ext: Provide consistent access to scx flags
- sched_ext: Track currently locked rq
- sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
Maíra Canal
Tvrtko Ursulin
- Documentation/gpu: Clarify format of driver-specific fidnfo keys
- drm/i915: Use device wedged event
- drm/doc: Document KUnit expectations
- drm/v3d: Add job to pending list if the reset was skipped
Maintainer SoB (2)
Maíra Canal
Tvrtko Ursulin
26 May 2025 12:00am GMT
23 May 2025
Hans de Goede: IPU6 cameras with ov02c10 / ov02e10 now supported in Fedora
I'm happy to share that 3 major IPU6 camera-related kernel changes from linux-next have been backported to Fedora and have been available for about a week now in the Fedora kernel-6.14.6-300.fc42 (or later) package:
- Support for the OV02C10 camera sensor; this should e.g. enable the camera to work out of the box on all Dell XPS 9x40 models.
- Support for the OV02E10 camera sensor; this should e.g. enable the camera to work out of the box on Dell Precision 5690 laptops. When combined with item 3 below and the USBIO drivers from rpmfusion, this should also enable the camera on other laptop models, e.g. the Dell Latitude 7450.
- Support for the special handshake GPIO used to turn on the sensor and allow sensor i2c-access on various new laptop models using the Lattice MIPI aggregator FPGA / USBIO chip.
If you want to give this a test using the libcamera-softwareISP FOSS stack, run the following commands:
sudo rm -f /etc/modprobe.d/ipu6-driver-select.conf
sudo dnf update 'kernel*'
sudo dnf install libcamera-qcam
reboot
qcam
Note that washed-out colors and/or a slightly over- or under-exposed image are expected behavior ATM; this is due to the software ISP needing more work to improve the image quality. If your camera still does not work after these changes and you've not filed a bug for this camera already, please file a bug following these instructions.
See my previous blogpost on how to also test Intel's proprietary stack from rpmfusion if you also have that installed.
23 May 2025 4:09pm GMT
Hans de Goede: IPU6 FOSS and proprietary stack co-existence
Since the set of rpmfusion intel-ipu6-kmod + ipu6-camera-* package updates from last February, the FOSS libcamera-softwareISP and Intel's proprietary stack using the Intel hardware ISP can now co-exist on Fedora systems, sharing the mainline IPU6-CSI2 receiver driver.
Because of this it is no longer necessary to blacklist the kernel modules from the other stack. Unfortunately, when the rpmfusion packages first generated "/etc/modprobe.d/ipu6-driver-select.conf" for blacklisting, this file was not marked as "%ghost" in the specfile, and now with the February ipu6-camera-hal the file has been removed from the package. This means that if you've jumped from an old ipu6-camera-hal where the file was not marked as "%ghost" directly to the latest, you may still have the modprobe.d conf file around causing issues. To fix this run:
sudo rm -f /etc/modprobe.d/ipu6-driver-select.conf
and then reboot. I'll also add this as a post-install script to the ipu6-camera-hal packages, to fix systems broken because of this.
If you want the rpmfusion packages because your system needs the USBIO drivers, but you do not want the proprietary stack, you can run the following command to disable the proprietary stack:
sudo ipu6-driver-select foss
Or if you have disabled the proprietary stack in the past and want to give it a try, run:
sudo ipu6-driver-select proprietary
To test switching between the 2 stacks in Firefox, go to Mozilla's webrtc test page and click on the "Camera" button. You should now get a camera permission dialog with 2 cameras: "Built in Front Camera" and "Intel MIPI Camera (V4L2)". The "Built in Front Camera" is the FOSS stack and the "Intel MIPI Camera (V4L2)" is the proprietary stack. Note the FOSS stack will show a strongly zoomed in (cropped) image; this is caused by the GUM test page, in e.g. google-meet this will not be the case.
Unfortunately switching between the 2 cameras in jitsi does not work well. The jitsi camera selector tries to show a preview of both cameras at the same time, and while one stack is streaming the other stack cannot access the camera. You should be able to switch by: 1. selecting the camera you want, 2. closing the jitsi tab, 3. waiting a few seconds for the camera to stop streaming, 4. opening jitsi in a new tab.
Note I already mentioned most of this in my previous blog post but it was a bit buried there.
23 May 2025 3:42pm GMT
21 May 2025
Peter Hutterer: libinput and Lua plugins
First of all, what's outlined here should be available in libinput 1.29 but I'm not 100% certain on all the details yet so any feedback (in the libinput issue tracker) would be appreciated. Right now this is all still sitting in the libinput!1192 merge request. I'd specifically like to see some feedback from people familiar with Lua APIs. With this out of the way:
Come libinput 1.29, libinput will support plugins written in Lua. These plugins sit logically between the kernel and libinput and allow modifying the evdev device and its events before libinput gets to see them.
The motivation for this is a few unfixable issues - issues we knew how to fix but could not actually implement and/or ship the fixes for without breaking other devices. One example for this is the inverted Logitech MX Master 3S horizontal wheel. libinput ships quirks for the USB/Bluetooth connection but not for the Bolt receiver. Unlike the Unifying Receiver, the Bolt receiver doesn't give the kernel sufficient information to know which device is currently connected. Which means our quirks could only apply to the Bolt receiver (and thus any mouse connected to it) - that's a rather bad idea though, we'd break every other mouse using the same receiver. Another example is an issue with worn out mouse buttons - on that device the behavior was predictable enough, but any heuristics would catch a lot of legitimate buttons. That's fine when you know your mouse is slightly broken and at least it works again. But it's not something we can ship as a general solution. There are plenty more examples like that - custom pointer deceleration, different disable-while-typing, etc.
libinput has quirks but they are internal API and subject to change without notice at any time. They're very definitely not for configuring a device and the local quirk file libinput parses is merely to bridge over the time until libinput ships the (hopefully upstreamed) quirk.
So the obvious solution is: let the users fix it themselves. And this is where the plugins come in. They are not full access into libinput, they are closer to a udev-hid-bpf in userspace. Logically they sit between the kernel event devices and libinput: input events are read from the kernel device, passed to the plugins, then passed to libinput. A plugin can look at and modify devices (add/remove buttons for example) and look at and modify the event stream as it comes from the kernel device. For this libinput changed internally to now process something called an "evdev frame" which is a struct that contains all struct input_events up to the terminating SYN_REPORT. This is the logical grouping of events anyway but so far we didn't explicitly carry those around as such. Now we do and we can pass them through to the plugin(s) to be modified.
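Conceptually, an evdev frame is just the batch of events up to the terminator; the struct below is an illustration of the idea, not libinput's actual internal type:

```c
#include <linux/input.h>
#include <stddef.h>

/* Illustrative only -- libinput's real struct differs. A frame is
 * all events read from the kernel device up to and including the
 * EV_SYN/SYN_REPORT terminator. */
struct evdev_frame {
    size_t nevents;
    struct input_event events[64];
};
```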
The aforementioned Logitech MX master plugin would look like this: it registers itself with a version number, then sets a callback for the "new-evdev-device" notification and (where the device matches) we connect that device's "evdev-frame" notification to our actual code:
libinput:register(1) -- register plugin version 1
libinput:connect("new-evdev-device", function (_, device)
    if device:vid() == 0x046D and device:pid() == 0xC548 then
        device:connect("evdev-frame", function (_, frame)
            for _, event in ipairs(frame.events) do
                if event.type == evdev.EV_REL and
                   (event.code == evdev.REL_HWHEEL or
                    event.code == evdev.REL_HWHEEL_HI_RES) then
                    event.value = -event.value
                end
            end
            return frame
        end)
    end
end)
This file can be dropped into /etc/libinput/plugins/10-mx-master.lua and will be loaded on context creation. I'm hoping the approach using named signals (similar to e.g. GObject) makes it easy to add different calls in future versions. Plugins also have access to a timer so you can filter events and re-send them at a later point in time. This is useful for implementing something like disable-while-typing based on certain conditions.
So why Lua? Because it's very easy to sandbox. I very explicitly did not want the plugins to be a side-channel to get into the internals of libinput - specifically no IO access to anything. This ruled out using C (or anything that's a .so file, really) because those would run a) in the address space of the compositor and b) be unrestricted in what they can do. Lua solves this easily. And, as a nice side-effect, it's also very easy to write plugins in.[1]
Whether plugins are loaded or not will depend on the compositor: an explicit call to set up the paths to load from and to actually load the plugins is required. No run-time plugin changes at this point either, they're loaded on libinput context creation and that's it. Otherwise, all the usual implementation details apply: files are sorted and if there are files with identical names the one from the highest-precedence directory will be used. Plugins that are buggy will be unloaded immediately.
If all this sounds interesting, please have a try and report back any APIs that are broken, or missing, or generally ideas of the good or bad persuasion. Ideally before we ship it and the API is stable forever :)
[1] Benjamin Tissoires actually had a go at WASM plugins (via rust). But ... a lot of effort for rather small gains over Lua
21 May 2025 4:09am GMT
19 May 2025
Melissa Wen: A Look at the Latest Linux KMS Color API Developments on AMD and Intel
This week, I reviewed the last available version of the Linux KMS Color API. Specifically, I explored the proposed API by Harry Wentland and Alex Hung (AMD), their implementation for the AMD display driver and tracked the parallel efforts of Uma Shankar and Chaitanya Kumar Borah (Intel) in bringing this plane color management to life. With this API in place, compositors will be able to provide better HDR support and advanced color management for Linux users.
To get a hands-on feel for the API's potential, I developed a fork of drm_info compatible with the new color properties. This allowed me to visualize the display hardware color management capabilities being exposed. If you're curious and want to peek behind the curtain, you can find my exploratory work on the drm_info/kms_color branch. The README there will guide you through the simple compilation and installation process.
Note: You will need to update libdrm to match the proposed API. You can find an updated version in my personal repository here. To avoid potential conflicts with your official libdrm installation, you can compile and install it in a local directory. Then, use the following command: export LD_LIBRARY_PATH="/usr/local/lib/"
In this post, I invite you to familiarize yourself with the new API that is about to be released. You can start doing as I did below: just deploy a custom kernel with the necessary patches and visualize the interface with the help of drm_info. Or, better yet, if you are a userspace developer, you can start developing use cases by experimenting with it.
The more eyes the better.
KMS Color API on AMD
The great news is that AMD's driver implementation for plane color operations is being developed right alongside their Linux KMS Color API proposal, so it's easy to apply to your kernel branch and check it out. You can find details of their progress in the AMD's series.
I just needed to compile a custom kernel with this series applied, intentionally leaving out the AMD_PRIVATE_COLOR flag. The AMD_PRIVATE_COLOR flag guards driver-specific color plane properties, which experimentally expose hardware capabilities while we don't have the generic KMS plane color management interface available.
If you don't know or don't remember the details of AMD driver specific color properties, you can learn more about this work in my blog posts [1] [2] [3]. As driver-specific color properties and KMS colorops are redundant, the driver only advertises one of them, as you can see in AMD workaround patch 24.
So, with the custom kernel image ready, I installed it on a system powered by AMD DCN3 hardware (i.e. my Steam Deck). Using my custom drm_info, I could clearly see the Plane Color Pipeline with eight color operations as below:
└───"COLOR_PIPELINE" (atomic): enum {Bypass, Color Pipeline 258} = Bypass
├───Bypass
└───Color Pipeline 258
├───Color Operation 258
│ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, Multiplier, 3D LUT} = 1D Curve
│ ├───"BYPASS" (atomic): range [0, 1] = 1
│ └───"CURVE_1D_TYPE" (atomic): enum {sRGB EOTF, PQ 125 EOTF, BT.2020 Inverse OETF} = sRGB EOTF
├───Color Operation 263
│ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, Multiplier, 3D LUT} = Multiplier
│ ├───"BYPASS" (atomic): range [0, 1] = 1
│ └───"MULTIPLIER" (atomic): range [0, UINT64_MAX] = 0
├───Color Operation 268
│ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, Multiplier, 3D LUT} = 3x4 Matrix
│ ├───"BYPASS" (atomic): range [0, 1] = 1
│ └───"DATA" (atomic): blob = 0
├───Color Operation 273
│ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, Multiplier, 3D LUT} = 1D Curve
│ ├───"BYPASS" (atomic): range [0, 1] = 1
│ └───"CURVE_1D_TYPE" (atomic): enum {sRGB Inverse EOTF, PQ 125 Inverse EOTF, BT.2020 OETF} = sRGB Inverse EOTF
├───Color Operation 278
│ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, Multiplier, 3D LUT} = 1D LUT
│ ├───"BYPASS" (atomic): range [0, 1] = 1
│ ├───"SIZE" (atomic, immutable): range [0, UINT32_MAX] = 4096
│ ├───"LUT1D_INTERPOLATION" (immutable): enum {Linear} = Linear
│ └───"DATA" (atomic): blob = 0
├───Color Operation 285
│ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, Multiplier, 3D LUT} = 3D LUT
│ ├───"BYPASS" (atomic): range [0, 1] = 1
│ ├───"SIZE" (atomic, immutable): range [0, UINT32_MAX] = 17
│ ├───"LUT3D_INTERPOLATION" (immutable): enum {Tetrahedral} = Tetrahedral
│ └───"DATA" (atomic): blob = 0
├───Color Operation 292
│ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, Multiplier, 3D LUT} = 1D Curve
│ ├───"BYPASS" (atomic): range [0, 1] = 1
│ └───"CURVE_1D_TYPE" (atomic): enum {sRGB EOTF, PQ 125 EOTF, BT.2020 Inverse OETF} = sRGB EOTF
└───Color Operation 297
├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, Multiplier, 3D LUT} = 1D LUT
├───"BYPASS" (atomic): range [0, 1] = 1
├───"SIZE" (atomic, immutable): range [0, UINT32_MAX] = 4096
├───"LUT1D_INTERPOLATION" (immutable): enum {Linear} = Linear
└───"DATA" (atomic): blob = 0
Note that Gamescope is currently using AMD driver-specific color properties implemented by me, Autumn Ashton and Harry Wentland. It doesn't use this KMS Color API, and therefore COLOR_PIPELINE is set to Bypass. Once the API is accepted upstream, all users of the driver-specific API (including Gamescope) should switch to the KMS generic API, as this will be the official plane color management interface of the Linux kernel.
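For userspace developers wanting to experiment, here is a rough sketch of how a compositor might enable a color pipeline on a plane with an atomic commit. The property IDs are assumed to have been discovered beforehand (e.g. via drmModeObjectGetProperties), and the exact semantics may differ from this sketch while the API is still under review:

```c
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Hypothetical helper: set the COLOR_PIPELINE enum on a plane.
 * color_pipeline_prop and pipeline_value must be looked up from
 * the plane's property list first; this only test-commits. */
static int select_color_pipeline(int drm_fd, uint32_t plane_id,
                                 uint32_t color_pipeline_prop,
                                 uint64_t pipeline_value)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    if (!req)
        return -1;

    int ret = drmModeAtomicAddProperty(req, plane_id,
                                       color_pipeline_prop,
                                       pipeline_value);
    if (ret >= 0)
        ret = drmModeAtomicCommit(drm_fd, req,
                                  DRM_MODE_ATOMIC_TEST_ONLY, NULL);

    drmModeAtomicFree(req);
    return ret < 0 ? ret : 0;
}
```

Passing DRM_MODE_ATOMIC_TEST_ONLY first lets userspace probe whether the configuration would be accepted before committing it for real.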
KMS Color API on Intel
On the Intel side, the driver implementation available upstream was built upon an earlier iteration of the API. This meant I had to apply a few tweaks to bring it in line with the latest specification. You can explore their latest work here. For simpler handling, check out my dedicated branch, which combines the V9 of the Linux Color API, Intel's contributions, and my necessary adjustments.
I then compiled a kernel from this integrated branch and deployed it on a system featuring Intel TigerLake GT2 graphics. Running my custom drm_info revealed a Plane Color Pipeline with three color operations as follows:
├───"COLOR_PIPELINE" (atomic): enum {Bypass, Color Pipeline 480} = Bypass
│ ├───Bypass
│ └───Color Pipeline 480
│ ├───Color Operation 480
│ │ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, 1D LUT Mult Seg, 3x3 Matrix, Multiplier, 3D LUT} = 1D LUT Mult Seg
│ │ ├───"BYPASS" (atomic): range [0, 1] = 1
│ │ ├───"HW_CAPS" (atomic, immutable): blob = 484
│ │ └───"DATA" (atomic): blob = 0
│ ├───Color Operation 487
│ │ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, 1D LUT Mult Seg, 3x3 Matrix, Multiplier, 3D LUT} = 3x3 Matrix
│ │ ├───"BYPASS" (atomic): range [0, 1] = 1
│ │ └───"DATA" (atomic): blob = 0
│ └───Color Operation 492
│ ├───"TYPE" (immutable): enum {1D Curve, 1D LUT, 3x4 Matrix, 1D LUT Mult Seg, 3x3 Matrix, Multiplier, 3D LUT} = 1D LUT Mult Seg
│ ├───"BYPASS" (atomic): range [0, 1] = 1
│ ├───"HW_CAPS" (atomic, immutable): blob = 496
│ └───"DATA" (atomic): blob = 0
Observe that Intel's approach introduces additional properties like "HW_CAPS" at the color operation level, along with two new color operation types: 1D LUT with Multiple Segments and 3x3 Matrix. It's important to remember that this implementation is based on an earlier stage of the KMS Color API and is awaiting review.
A Shout-Out to Those Who Made This Happen
I'm impressed by the solid implementation and clear direction of the V9 of the KMS Color API. It aligns with the many insightful discussions we've had over the past years. A huge thank you to Harry Wentland and Alex Hung for their dedication in bringing this to fruition!
Beyond their efforts, I deeply appreciate Uma and Chaitanya's commitment to updating Intel's driver implementation to align with the freshest version of the KMS Color API. The collaborative spirit of the AMD and Intel developers in sharing their color pipeline work upstream is invaluable. We're now gaining a much clearer picture of the color capabilities embedded in modern display hardware, all thanks to their hard work, comprehensive documentation, and engaging discussions.
Finally, thanks to all the userspace developers, color science experts, and kernel developers from various vendors who actively participate in the upstream discussions, meetings, workshops, each iteration of this API, and the crucial code review process. I'm happy to be part of the final stages of this long kernel journey, but I know that when it comes to colors, one step is completed only for new challenges to be unlocked.
Looking forward to meeting you at this year's Linux Display Next hackfest, organized by AMD in Toronto, to further discuss HDR, advanced color management, and other display trends.
19 May 2025 9:05pm GMT
14 May 2025
Simon Ser: Status update, May 2025
Hi!
Today wlroots 0.19.0 has finally been released! Among the newly supported protocols, color-management-v1 lays the first stone of HDR support (backend and renderer bits are still being reviewed) and ext-image-copy-capture-v1 enhances the previous screen capture protocol with better performance. Explicit synchronization is now fully supported, and display-only devices such as gud or DisplayLink can now be used with wlroots. See the release notes for more details! I hope I'll be able to go back to some feature work and reviews now that the release is out of the way.
In other graphics news, I've finished my review of the core DRM patches for the new KMS color pipeline. Other kernel folks have reviewed the patches, we're just waiting on a user-space implementation now (which various compositor folks are working on). I've started a discussion about libliftoff support.
In addition to wlroots, this month I've also released a new version of my mobile IRC client, Goguma 0.8.0. delthas has sent a patch to synchronize pinned and muted conversations across devices via soju. Thanks to pounce, Goguma now supports message reactions (not included in the release).
My extended-isupport IRCv3 specification has been accepted. It allows servers to advertise metadata such as the maximum nickname length or IRC network name early (before the user provides a nickname and authentication details), which is useful for building nice connection UIs. I've posted another proposal for IRC network icons.
go-smtp 0.22.0 has been released with an improved DATA command API, RRVS support (Require Recipient Valid Since), and custom hello after reset or STARTTLS. I've also spent quite a bit of time reaching out to companies for XDC 2025 sponsorships.
See you next month!
14 May 2025 10:00pm GMT
12 May 2025
Tomeu Vizoso: Rockchip NPU update 5: Progress on the kernel driver
It has been almost a year since my last update on the Rockchip NPU, and though I'm a bit sad that I haven't had more time to work on it, I'm happy that I found some time earlier this year for this.
Quoting from my last update on the Rockchip NPU driver:
The kernel driver is able to fully use the three cores in the NPU, giving us the possibility of running 4 simultaneous object detection inferences such as the one below on a stream, at almost 30 frames per second.
All feedback has been incorporated in a new revision of the kernel driver and it was submitted to the Linux kernel mailing list.
Though I'm very happy with the direction the kernel driver is taking, I would have liked to make faster progress on it. I have spent the time since the first revision on making the Etnaviv NPU driver ready to be deployed in production (will be blogging about this soon), and also had to take some non-upstream work to pay my bills.
Next I plan to clean up the userspace driver so it's ready for review, and then I will go for a third revision of the kernel driver.
12 May 2025 5:30am GMT
22 Apr 2025
Melissa Wen: 2025 FOSDEM: Don't let your motivation go, save time with kworkflow
2025 was my first year at FOSDEM, and I can say it was an incredible experience where I met many colleagues from Igalia who live around the world, and also many friends from the Linux display stack who are part of my daily work and contributions to DRM/KMS. In addition, I met new faces and recognized others with whom I had interacted on some online forums and we had good and long conversations.
During FOSDEM 2025 I had the opportunity to present about kworkflow in the kernel devroom. Kworkflow is a set of tools that help kernel developers with their routine tasks, and it is the tool I use for my own development. In short, every contribution I make to the Linux kernel is assisted by kworkflow.
The goal of my presentation was to spread the word about kworkflow. I aimed to show how the suite consolidates good practices and recommendations of the kernel workflow in short commands. These commands are easily configurable and memorized for your current work setup, or for your multiple setups.
For me, Kworkflow is a tool that accommodates the needs of different agents in the Linux kernel community. Active developers and maintainers are the main target audience for kworkflow, but it is also inviting for users and user-space developers who just want to report a problem and validate a solution without needing to know every detail of the kernel development workflow.
Something I didn't emphasize during the presentation, but would like to correct here, is that the main author and developer of kworkflow is my colleague at Igalia, Rodrigo Siqueira. Being honest, my contributions are mostly requesting and validating new features, fixing bugs, and sharing scripts to increase feature coverage.
So, the video and slide deck of my FOSDEM presentation are available for download here.
And, as usual, you will find in this blog post the script of this presentation and more detailed explanation of the demo presented there.
Kworkflow at FOSDEM 2025: Speaker Notes and Demo
Hi, I'm Melissa, a GPU kernel driver developer at Igalia and today I'll be giving a very inclusive talk to not let your motivation go by saving time with kworkflow.
So, you're a kernel developer, or you want to be a kernel developer, or you don't want to be a kernel developer. But you're all united by a single need: you need to validate a custom kernel with just one change, and you need to verify that it fixes or improves something in the kernel.
And that's a given change for a given distribution, or for a given device, or for a given subsystem…
Look to this diagram and try to figure out the number of subsystems and related work trees you can handle in the kernel.
So, whether you are a kernel developer or not, at some point you may come across this type of situation:
There is a userspace developer who wants to report a kernel issue and says:
- Oh, there is a problem in your driver that can only be reproduced by running this specific distribution. And the kernel developer asks:
- Oh, have you checked if this issue is still present in the latest kernel version of this branch?
But the userspace developer has never compiled and installed a custom kernel before. So they have to read a lot of tutorials and kernel documentation to create a kernel compilation and deployment script. Finally, the reporter managed to compile and deploy a custom kernel and reports:
- Sorry for the delay, this is the first time I have installed a custom kernel. I am not sure if I did it right, but the issue is still present in the kernel of the branch you pointed out.
And then the kernel developer needs to reproduce this issue on their side, but they have never worked with this distribution, so they just create a new script - essentially the same script the reporter already created.
What's the problem with this situation? The problem is that you keep creating new scripts!
Every time you change distribution, architecture, hardware, or project - even in the same company, the development setup may change when you switch to a different project - you create another script for your new kernel development workflow!
You know, you have a lot of babies, you have a collection of "my precious scripts", like Sméagol (Lord of the Rings) with the precious ring.
Instead of creating and accumulating scripts, save yourself time with kworkflow. Here is a typical script that many of you may have. This is a Raspberry Pi 4 script and contains everything you need to memorize to compile and deploy a kernel on your Raspberry Pi 4.
With kworkflow, you only need to memorize two commands, and those commands are not specific to Raspberry Pi. They are the same commands to different architecture, kernel configuration, target device.
What is kworkflow?
Kworkflow is a collection of tools and software combined to:
- Optimize Linux kernel development workflow.
- Reduce time spent on repetitive tasks, since we are spending our lives compiling kernels.
- Standardize best practices.
- Ensure reliable data exchange across the kernel workflow. For example: two people describe the same setup but are not seeing the same thing; kworkflow can ensure both are actually running the same kernel, with the same modules and options enabled.
I don't know if you will get this analogy, but kworkflow is for me a megazord of scripts. You are combining all of your scripts to create a very powerful tool.
What are the main features of kworkflow?
There are many, but these are the most important for me:
- Build & deploy custom kernels across devices & distros.
- Handle cross-compilation seamlessly.
- Manage multiple architecture, settings and target devices in the same work tree.
- Organize kernel configuration files.
- Facilitate remote debugging & code inspection.
- Standardize Linux kernel patch submission guidelines. You don't need to double-check the documentation, nor does Greg need to tell you that you are not following the Linux kernel guidelines.
- Upcoming: Interface to bookmark, apply and "reviewed-by" patches from mailing lists (lore.kernel.org).
This is the list of commands you can run with kworkflow. The first subset is to configure your tool for various situations you may face in your daily tasks.
# Manage kw and kw configurations
kw init - Initialize kw config file
kw self-update (u) - Update kw
kw config (g) - Manage kernel .config files
The second subset is to build and deploy custom kernels.
# Build & Deploy custom kernels
kw kernel-config-manager (k) - Manage kernel .config files
kw build (b) - Build kernel
kw deploy (d) - Deploy kernel image (local/remote)
kw bd - Build and deploy kernel
We have some tools to manage and interact with target machines.
# Manage and interact with target machines
kw ssh (s) - SSH support
kw remote (r) - Manage machines available via ssh
kw vm - QEMU support
To inspect and debug a kernel.
# Inspect and debug
kw device - Show basic hardware information
kw explore (e) - Explore string patterns in the work tree and git logs
kw debug - Linux kernel debug utilities
kw drm - Set of commands to work with DRM drivers
To automatize best practices for patch submission - codestyle, maintainers, and the correct list of recipients and mailing lists for a change - ensuring we send the patch to those interested in it.
# Automatize best practices for patch submission
kw codestyle (c) - Check code style
kw maintainers (m) - Get maintainers/mailing list
kw send-patch - Send patches via email
And the last one, the upcoming patch hub.
# Upcoming
kw patch-hub - Interact with patches (lore.kernel.org)
How can you save time with Kworkflow?
So how can you save time building and deploying a custom kernel?
First, you need a .config file.
- Without kworkflow: You may be manually extracting and managing .config files from different targets and saving them with different suffixes to link the kernel to the target device or distribution, or any descriptive suffix to help identify which is which. Or even copying and pasting from somewhere.
- With kworkflow: you can use the kernel-config-manager command, or simply kw k, to store, describe and retrieve a specific .config file very easily, according to your current needs.
Then you want to build the kernel:
- Without kworkflow: You are probably now memorizing a combination of commands and options.
- With kworkflow: you just need kw b (kw build) to build the kernel with the correct settings for cross-compilation, compilation warnings, cflags, etc. It also shows some information about the kernel, like the number of modules.
Finally, to deploy the kernel in a target machine.
- Without kworkflow: You might be doing things like: SSH connecting to the remote machine, copying and removing files according to distributions and architecture, and manually updating the bootloader for the target distribution.
- With kworkflow: you just need kw d, which does a lot of things for you, like: deploying the kernel, preparing the target machine for the new installation, listing available kernels and uninstalling them, creating a tarball, rebooting the machine after deploying the kernel, etc.
You can also save time on debugging kernels locally or remotely.
- Without kworkflow: you do ssh, manual setup and trace enablement, copy & paste of logs.
- With kworkflow: more straightforward access to debug utilities: events, trace, dmesg.
You can save time on managing multiple kernel images in the same work tree.
- Without kworkflow: you may be cloning the same repository multiple times so you don't lose compiled files when changing kernel configuration or compilation options, and manually managing build and deployment scripts.
- With kworkflow: you can use kw env to isolate multiple contexts in the same worktree as environments, so you can keep different configurations in the same worktree and switch between them easily without losing anything from the last time you worked in a specific context.
Finally, you can save time when submitting kernel patches. In kworkflow, you can find everything you need to wrap your changes in patch format and submit them to the right list of recipients, those who can review, comment on, and accept your changes.
This is a demo that the lead developer of the kw patch-hub feature sent me. With this feature, you will be able to check out a series on a specific mailing list, bookmark those patches in the kernel for validation, and when you are satisfied with the proposed changes, you can automatically submit a reviewed-by for that whole series to the mailing list.
Demo
Now a demo of how to use kw environment to deal with different devices, architectures and distributions in the same work tree without losing compiled files, build and deploy settings, .config file, remote access configuration and other settings specific for those three devices that I have.
Setup
- Three devices:
  - laptop (debian x86 intel local)
  - SteamDeck (steamos x86 amd remote)
  - RaspberryPi 4 (raspbian arm64 broadcomm remote)
- Goal: To validate a change on DRM/VKMS using a single kernel tree.
- Kworkflow commands:
- kw env
- kw d
- kw bd
- kw device
- kw debug
- kw drm
Demo script
In the same terminal and worktree.
First target device: Laptop (debian|x86|intel|local)
$ kw env --list # list environments available in this work tree
$ kw env --use LOCAL # select the environment of local machine (laptop) to use: loading pre-compiled files, kernel and kworkflow settings.
$ kw device # show device information
$ sudo modinfo vkms # show VKMS module information before applying kernel changes.
$ <open VKMS file and change module info>
$ kw bd # compile and install kernel with the given change
$ sudo modinfo vkms # show VKMS module information after kernel changes.
$ git checkout -- drivers
Second target device: RaspberryPi 4 (raspbian|arm64|broadcomm|remote)
$ kw env --use RPI_64 # move to the environment for a different target device.
$ kw device # show device information and kernel image name
$ kw drm --gui-off-after-reboot # set the system to not load graphical layer after reboot
$ kw b # build the kernel with the VKMS change
$ kw d --reboot # deploy the custom kernel in a Raspberry Pi 4 with Raspbian 64, and reboot
$ kw s # connect with the target machine via ssh and check the kernel image name
$ exit
Third target device: SteamDeck (steamos|x86|amd|remote)
$ kw env --use STEAMDECK # move to the environment for a different target device
$ kw device # show device information
$ kw debug --dmesg --follow --history --cmd="modprobe vkms" # run a command and show the related dmesg output
$ kw debug --dmesg --follow --history --cmd="modprobe -r vkms" # run a command and show the related dmesg output
$ <add a printk with a random msg to appear on dmesg log>
$ kw bd # deploy and install custom kernel to the target device
$ kw debug --dmesg --follow --history --cmd="modprobe vkms" # run a command and show the related dmesg output after build and deploy the kernel change
Q&A
Most of the questions raised at the end of the presentation were actually suggestions and additions of new features to kworkflow.
The first participant, who is also a kernel maintainer, asked about two features: (1) automating getting patches from patchwork (or lore) and triggering the process of building, deploying and validating them using the existing workflow, and (2) bisecting support. They are both very interesting features. The first one fits well in the patch-hub subproject, which is under development, and I had actually made a similar request a couple of weeks before the talk. The second is an already existing request in the kworkflow github project.
Another request was to use kexec and avoid rebooting the kernel for testing. Reviewing my presentation, I realized I wasn't very clear that kworkflow doesn't support kexec. As I replied, what it does is install the modules, and you can load/unload them for validation; but for built-in parts, you need to reboot the kernel.
Another two questions: one about Android Debug Bridge (ADB) support instead of SSH, and another about support for alternative ways of booting when the custom kernel ends up broken but you only have one kernel image there. Kworkflow doesn't manage this yet, but I agree it would be a very useful feature for embedded devices. On Raspberry Pi 4, kworkflow mitigates the issue by preserving the distro kernel image and using the config.txt file to set a custom kernel for booting. For ADB, there is no support either, and as I don't currently see KW users working with Android, I don't think we will have this support any time soon, unless we find new volunteers and increase the pool of contributors.
The last two questions were regarding the status of b4 integration, which is under development, and other debugging features that the tool doesn't support yet.
Finally, when Andrea and I were changing turns on the stage, he suggested adding support for virtme-ng to kworkflow. So I opened an issue to track this feature request in the project's github.
With all these questions and requests, I could see the general need for a tool that integrates the variety of kernel developer workflows, as proposed by kworkflow. Also, there are still many cases to be covered by kworkflow.
Despite the high demand, this is a completely voluntary project and it is unlikely that we will be able to meet these needs given the limited resources. We will keep trying our best in the hope we can increase the pool of users and contributors too.
22 Apr 2025 7:30pm GMT
16 Apr 2025
Simon Ser: Status update, April 2025
Hi!
Last week wlroots 0.19.0-rc1 was released! It includes the new color management protocol, however it doesn't include HDR10 support because the renderer and backend bits haven't yet been merged. Also worth noting is full explicit synchronization support as well as the new screen capture protocols. I plan to release new release candidates weekly until we're happy with the stability. Please test!
Sway is also getting close to its first release candidate. I plan to publish version 1.11.0-rc1 this week-end. Thanks to Ferdinand Bachmann, Sway no longer aborts on shutdown due to dangling signal listeners. I've also updated my HDR10 patch to add an output hdr command (but it's Sway 1.12 material).
I've spent a bit of time on libicc, my C library to manipulate ICC profiles. I've introduced an encoder to make it easy to write new ICC profiles, and used that to write a small program to create an ICC profile which inverts colors. The encoder doesn't support as many ICC elements as the decoder yet (patches welcome!), but does support many interesting bits for display profiles: basic matrices and curves, lut16Type elements and more advanced lutAToBType elements. New APIs have been introduced to apply ICC profile transforms to a color value. I've also added tests which compare the results given by libicc and by LittleCMS. For some reason lut16Type and lutAToBType results are multiplied by 2 by LittleCMS; I haven't yet understood why that is, even after reading the spec in depth and staring at LittleCMS source code for a few hours (if you have a guess please ping me). In the future I'd like to add a small tool to convert ICC profiles to and from JSON files to make it easy to create new files or adjust existing ones.
Version 0.9.0 of the soju IRC bouncer has been released. Among the most notable changes, the database is used by default to store messages, pinned/muted channels and buffers can be synchronized across devices, and database queries have been optimized. I've continued working on the Goguma mobile IRC client, fixing a few bugs such as dangling Firebase push subscriptions and message notifications being dismissed too eagerly.
Max Ehrlich has contributed a mako patch to introduce a Notifications property to the mako-specific D-Bus API, so that external programs can monitor active notifications (e.g. display a count in a status bar, or display a list on a lockscreen).
That's all I have in store, see you next month!
16 Apr 2025 10:00pm GMT
Mike Blumenkrantz: Another Milestone
16 Apr 2025 12:00am GMT
15 Apr 2025
Christian Schaller: Fedora Workstation 42 is upon us!
We are excited about Fedora Workstation 42, released today, having worked on some great features for it.
Fedora Workstation 42 HDR edition
I would say that the main feature that landed was HDR or High Dynamic Range. It is a feature we spent years on with many team members involved and a lot of collaboration with various members of the wider community.
GNOME Settings menu showing HDR settings
The fact that we got this over the finish line was especially due to all the work Sebastian Wick put into it in collaboration with Pekka Paalanen around HDR Wayland specification and implementations.
Another important aspect was tools like libdisplay which was co-created with Simon Ser, with others providing more feedback and assistance in the final stretch of the effort.

HDR setup in Ori and the Will of the Wisps
That said, a lot of other people at Red Hat and in the community deserve shout-outs for this too, like Xaver Hugl, whose work on HDR in KWin was a very valuable effort that helped us move the GNOME support forward too; Matthias Clasen and Benjamin Otte for their work on HDR support in GTK+; Martin Stransky for his work on HDR support in Firefox; Jonas Aadahl and Olivier Fourdan for their protocol and patch reviews; and Jose Exposito for packaging up the Mesa Vulkan support for Fedora 42.
One area that should benefit from HDR support is games. In the screenshot above you see the game Ori and the Will of the Wisps, which is known for great HDR support. Valve will need to update to a Wine version for Proton that supports Wayland natively before this just works; at the moment you can get it working using gamescope, but hopefully soon it will just work under both Mutter and KWin.
Also a special shoutout to the MPV community for quickly jumping on this and releasing an HDR-capable video player recently.

MPV video player playing HDR content
Of course, getting Fedora Workstation 42 out the door with these features is just the beginning. With the baseline support in place, now is really the time when application maintainers have a real chance to start making use of these features, so I would expect various content creation applications, for instance, to start gaining support over the next year.
For the desktop itself there are also open questions we need to decide on like:
- Format to use for HDR screenshots
- Better backlight and brightness handling
- Better offloading
- HDR screen recording video format
- How to handle HDR webcams (it seems a lot of them are not really capable of producing HDR output)
- A version of the binary NVIDIA driver supporting the VK_EXT_hdr_metadata and VK_COLOR_SPACE_HDR10_ST2084_EXT Vulkan extensions on Linux (the sketch after this list illustrates how applications use these)
- A million smaller issues we will need to iron out
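As a rough illustration of what those two Vulkan extensions give applications, here is a hedged sketch (mine, not from the post; surface and device are assumed to exist, and all luminance values are placeholders) of requesting an HDR10 swapchain and attaching mastering metadata:

```c
/* Sketch: create a swapchain with the HDR10/PQ color space and attach
 * HDR metadata via VK_EXT_hdr_metadata. Values are illustrative only. */
VkSwapchainCreateInfoKHR sc_info = {
	.sType = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,
	.surface = surface,
	.imageFormat = VK_FORMAT_A2B10G10R10_UNORM_PACK32,
	.imageColorSpace = VK_COLOR_SPACE_HDR10_ST2084_EXT,
	/* ...remaining fields (extent, usage, present mode, ...) as usual... */
};
VkSwapchainKHR swapchain;
vkCreateSwapchainKHR(device, &sc_info, NULL, &swapchain);

VkHdrMetadataEXT metadata = {
	.sType = VK_STRUCTURE_TYPE_HDR_METADATA_EXT,
	.displayPrimaryRed   = {0.708f, 0.292f},  /* BT.2020 primaries */
	.displayPrimaryGreen = {0.170f, 0.797f},
	.displayPrimaryBlue  = {0.131f, 0.046f},
	.whitePoint          = {0.3127f, 0.3290f}, /* D65 */
	.maxLuminance = 1000.0f,                  /* nits, placeholder */
	.minLuminance = 0.005f,
	.maxContentLightLevel = 1000.0f,
	.maxFrameAverageLightLevel = 400.0f,
};
vkSetHdrMetadataEXT(device, 1, &swapchain, &metadata);
```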
Accessibility
Our accessibility team, with Lukas Tyrychtr and Bohdan Milar, has been working hard together with others to ensure that Fedora Workstation 42 has the best accessibility support you can get on Linux. One major effort that landed was the new keyboard monitoring interface, which is critical for making Orca work well under Wayland. This was a collaboration between Lukas Tyrychtr, Matthias Clasen and Carlos Garnacho on our team. If you are interested in accessibility, as a user or a developer or both, make sure to join in by reaching out to the Accessibility Working Group.
PipeWire
PipeWire also keeps going strong with continuous improvements and bugfixes. Thanks to the great work by Jan Grulich, the support for PipeWire in Firefox and Chrome is now working great, including for camera handling. It is an area where we want to do an even better job though, so Wim Taymans is currently looking at improving video handling to ensure we are using the best possible video stream the camera can provide and handling conversion between formats transparently. He is currently testing it out using an ffmpeg software backend, but the end goal is to have it all hardware accelerated by using Vulkan directly.
Another feature Wim Taymans added recently is MIDI2 support. This is the next generation of MIDI, with only a limited set of hardware currently supporting it, but it feels good that we are now able to be ahead of the curve instead of years behind, thanks to the solid foundation we built with PipeWire.
Wayland
For a long time the team has been focused on making sure Wayland has all the critical pieces and is, functionality-wise, on the same level as X11. For instance, we spent a lot of time and effort on ensuring proper remote desktop support. That work all landed in the previous Fedora release, which means that over the last six months the team has had more time to look at things like various proposed Wayland protocols and get them supported in GNOME. Thanks to that, we helped ensure the Cursor Shape and Toplevel Drag protocols landed in time for this release. We are already looking at what to help land for the next release, so expect continued acceleration in Wayland protocol adoption going forward.
First steps into AI
So, an effort we've been plugging away at recently is starting to bring AI tooling to open source desktop applications. Our first effort in this regard is Granite.code, an extension for Visual Studio Code that sets up a local AI engine on your system to help with various tasks, including code generation and chat, inside Visual Studio Code. What is special about this effort is that it relies on downloading and running a copy of the open source Granite LLM model on your system, instead of relying on it being run in a cloud instance somewhere. That means you can use Granite.code without having to share your data and work with someone else. Granite.code is still at a very early stage, and it requires an NVIDIA or AMD GPU with over 8 GB of video RAM to use under Linux (it also runs under Windows and macOS). It is still in a pre-release stage; we are waiting for the Granite 3.3 model update to enable some major features for us before we make the first formal release, but for those willing to help us test, you can search for Granite in the Visual Studio Code extension marketplace and install it.
We are hoping, though, that this will be just the starting point, where our work can get picked up and used by other IDEs out there too, and we are also thinking about how we can offer AI features in other parts of the desktop.

Granite.code running on Linux
15 Apr 2025 2:38pm GMT
28 Mar 2025
planet.freedesktop.org
André Almeida: Linux 6.14, an almost forgotten release
Linux 6.14 is the second release of 2025, and as usual Igalia took part in it. It's a very normal release, except that it was released on a Monday instead of the usual Sunday release that has been going on for years now. The reason behind this? Well, quoting Linus himself:
I'd like to say that some important last-minute thing came up and delayed things.
But no. It's just pure incompetence.
But we did not forget about it, so here's our Linux 6.14 blog post!
A part of the development cycle for this release happened during late December, when a lot of maintainers and developers were taking their well-deserved breaks. As a result, this release contains fewer changes than usual; LWN described it as showing the "lowest level of merge-window activity seen in years". Nevertheless, some cool features made it into this release:
- NT synchronization primitives: Elizabeth Figura, from CodeWeavers, is known for her work on improving Wine's sync functions, like mutexes and semaphores. She was one of the main collaborators behind the futex_waitv() work, and has now developed a virtual driver that is more compliant with the precise semantics that the NT kernel exposes. This allows Wine to behave closer to Windows without the need to create new syscalls, since this driver uses ioctl() as the front-end uAPI.
- RWF_UNCACHED: Linux has two ways of dealing with storage I/O: buffered I/O (usually the preferred one), which stores data in a temporary buffer and regularly syncs the cached data with the device; and direct I/O, which doesn't use a cache and always writes/reads synchronously with the storage device. Now a new mixed approach is available: uncached buffered I/O. This method aims to provide a fast way to write or read data that will not be needed again in the short term. For reading, the device writes data into the buffer, and as soon as the user has finished reading it, it's cleared from the cache. For writing, as soon as userspace fills the cache, the device reads it and removes it from the cache. This way we still have the advantage of using a fast cache while reducing cache pressure (a minimal usage sketch follows this list).
- amdgpu panic support: AMD developers added kernel panic support for amdgpu driver, "which displays a pretty user friendly message on the screen when a Linux kernel panic occurs" instead of just a black screen or a partial dmesg log.
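Here is the promised sketch of an uncached buffered write via pwritev2(). It is a hedged illustration, not from the post: the flag name follows the post, but early revisions of the series used a different name, so check your kernel's uapi headers for the flag that actually shipped; the #ifdef guard keeps the program building either way.

```c
/* Minimal sketch: uncached buffered I/O with pwritev2(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
	int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) { perror("open"); return 1; }

	char buf[4096];
	memset(buf, 0xab, sizeof(buf));
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

#ifdef RWF_UNCACHED
	/* Buffered write whose page-cache pages are dropped once written back. */
	ssize_t n = pwritev2(fd, &iov, 1, 0, RWF_UNCACHED);
#else
	/* Headers without the flag: fall back to a plain buffered write. */
	ssize_t n = pwritev2(fd, &iov, 1, 0, 0);
#endif
	if (n < 0)
		perror("pwritev2");
	close(fd);
	return 0;
}
```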
As usual, Kernel Newbies provides a very good summary; you should check it for more details: Linux 6.14 changelog. Now let's see Igalia's merged contributions for this release!
DRM
For the DRM common infrastructure, we helped to land a standardization for DRM client memory usage reporting. Additionally, we contributed to improve and fix bugs found in drivers of AMD, Intel, Broadcom, and Vivante.
AMDGPU
For the AMD driver, we fixed bugs experienced by users of the Cosmic Desktop Environment on several AMD hardware versions. One was uncovered by the introduction of overlay cursor mode, where a definition mismatch across the display driver caused a page fault when using multiple overlay planes. Another bug was a division by zero in plane scaling calculations. Also, we fixed regressions on VRR and MST caused by the series of changes migrating the AMD display driver from open-coded EDID handling to the drm_edid struct.
Intel
For the Intel drivers, we fixed a bug in the xe GPU driver which prevented certain types of workarounds from being applied, helped with the maintainership of the i915 driver, handled external code contributions, maintained the development branch and sent several pull requests.
Raspberry Pi (V3D)
We fixed GPU resets on the Raspberry Pi 4, which we found out were broken thanks to a user bug report.
Also in the V3D driver, the active performance monitor is now properly stopped before being destroyed, addressing a potential use-after-free issue. Additionally, support for a global performance monitor has been added via a new DRM_IOCTL_V3D_PERFMON_SET_GLOBAL ioctl. This allows all jobs to share a single, globally configured perfmon, enabling more consistent performance tracking and paving the way for integration with user-space tools such as perfetto.
A small video demo of perfetto integration with V3D
etnaviv
On the etnaviv side, fdinfo support has been implemented to expose memory usage statistics per file descriptor, enhancing observability and debugging capabilities for memory-related behavior.
sched_ext
Many BPF schedulers (e.g., scx_lavd) frequently call bpf_ktime_get_ns() for tracking tasks' runtime properties. bpf_ktime_get_ns() eventually reads a hardware timestamp counter (TSC). However, reading a hardware TSC is not performant on some hardware platforms, degrading instructions per cycle (IPC).
We addressed the performance problem of reading the hardware TSC by leveraging the rq clock in the scheduler core, introducing a scx_bpf_now() function for BPF schedulers. Whenever the rq clock is fresh and valid, scx_bpf_now() provides the rq clock, which is already updated by the scheduler core, so it can avoid reading the hardware TSC. Using scx_bpf_now() reduces the number of hardware TSC reads by 50-80% (e.g., 76% for scx_lavd).
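To make the shape of the API concrete, here is a minimal, hedged sketch of how a sched_ext BPF scheduler might timestamp a task's run interval with scx_bpf_now(). It is not a complete scheduler, and it assumes the in-tree sched_ext BPF headers (scx/common.bpf.h) and a kernel exposing the scx_bpf_now() kfunc:

```c
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

/* Per-task timestamp of the last time the task started running. */
struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, u64);
} start_ts SEC(".maps");

void BPF_STRUCT_OPS(demo_running, struct task_struct *p)
{
	u64 *ts = bpf_task_storage_get(&start_ts, p, 0,
				       BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (ts)
		*ts = scx_bpf_now(); /* rq clock when fresh; TSC only as fallback */
}

void BPF_STRUCT_OPS(demo_stopping, struct task_struct *p, bool runnable)
{
	u64 *ts = bpf_task_storage_get(&start_ts, p, 0, 0);
	if (ts)
		bpf_printk("pid %d ran %llu ns", p->pid, scx_bpf_now() - *ts);
}
```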
Assorted kernel fixes
Continuing our efforts on cleaning up kernel bugs, we provided a few fixes that address issues reported by syzbot with the goal of increasing stability and security, leveraging the fuzzing capabilities of syzkaller to bring to the surface certain bugs that are hard to notice otherwise. We're addressing bug reports from different kernel areas, including drivers and core subsystems such as the memory manager. As part of this effort, several fixes were done for the probe path of the rtlwifi driver.
Check the complete list of Igalia's contributions for the 6.14 release
Authored (38)
Changwoo Min
- sched_ext: Relocate scx_enabled() related code
- sched_ext: Implement scx_bpf_now()
- sched_ext: Add scx_bpf_now() for BPF scheduler
- sched_ext: Add time helpers for BPF schedulers
- sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now()
- sched_ext: Use time helpers in BPF schedulers
- sched_ext: Fix incorrect time delta calculation in time_delta()
Christian Gmeiner
- drm/v3d: Stop active perfmon if it is being destroyed
- drm/etnaviv: Add fdinfo support for memory stats
- drm/v3d: Add DRM_IOCTL_V3D_PERFMON_SET_GLOBAL
Luis Henriques
Maíra Canal
- drm/v3d: Fix performance counter source settings on V3D 7.x
- drm/v3d: Fix miscellaneous documentation errors
- drm/v3d: Assign job pointer to NULL before signaling the fence
- drm/v3d: Don't run jobs that have errors flagged in its fence
- drm/v3d: Set job pointer to NULL when the job's fence has an error
Melissa Wen
- drm/amd/display: fix page fault due to max surface definition mismatch
- drm/amd/display: increase MAX_SURFACES to the value supported by hw
- drm/amd/display: fix divide error in DM plane scale calcs
- drm/amd/display: restore invalid MSA timing check for freesync
- drm/amd/display: restore edid reading from a given i2c adapter
Ricardo Cañuelo Navarro
- mm,madvise,hugetlb: check for 0-length range after end address adjustment
- mm: shmem: remove unnecessary warning in shmem_writepage()
Rodrigo Siqueira
Thadeu Lima de Souza Cascardo
- wifi: rtlwifi: do not complete firmware loading needlessly
- wifi: rtlwifi: rtl8192se: rise completion of firmware loading as last step
- wifi: rtlwifi: wait for firmware loading before releasing memory
- wifi: rtlwifi: fix init_sw_vars leak when probe fails
- wifi: rtlwifi: usb: fix workqueue leak when probe fails
- wifi: rtlwifi: remove unused check_buddy_priv
- wifi: rtlwifi: destroy workqueue at rtl_deinit_core
- wifi: rtlwifi: fix memory leaks and invalid access at probe error path
- wifi: rtlwifi: pci: wait for firmware loading before releasing memory
- Revert "media: uvcvideo: Require entities to have a non-zero unique ID"
- char: misc: deallocate static minor in error path
Tvrtko Ursulin
- drm/amdgpu: Use DRM scheduler API in amdgpu_xcp_release_sched
- drm/xe: Fix GT "for each engine" workarounds
Reviewed (36)
André Almeida
- ASoC: cs35l41: Fallback to using HID for system_name if no SUB is available
- ASoC: cs35l41: Fix acpi_device_hid() not found
Christian Gmeiner
- drm/v3d: Fix performance counter source settings on V3D 7.x
- drm/etnaviv: Convert timeouts to secs_to_jiffies()
Iago Toral Quiroga
- drm/v3d: Fix performance counter source settings on V3D 7.x
- drm/v3d: Assign job pointer to NULL before signaling the fence
- drm/v3d: Don't run jobs that have errors flagged in its fence
- drm/v3d: Set job pointer to NULL when the job's fence has an error
Jose Maria Casanova Crespo
Luis Henriques
- fuse: rename to fuse_dev_end_requests and make non-static
- fuse: Move fuse_get_dev to header file
- fuse: Move request bits
- fuse: Add fuse-io-uring design documentation
- fuse: make args->in_args[0] to be always the header
- fuse: {io-uring} Handle SQEs - register commands
- fuse: Make fuse_copy non static
- fuse: Add fuse-io-uring handling into fuse_copy
- fuse: {io-uring} Make hash-list req unique finding functions non-static
- fuse: Add io-uring sqe commit and fetch support
- fuse: {io-uring} Handle teardown of ring entries
- fuse: {io-uring} Make fuse_dev_queue_{interrupt,forget} non-static
- fuse: Allow to queue fg requests through io-uring
- fuse: Allow to queue bg requests through io-uring
- fuse: {io-uring} Prevent mount point hang on fuse-server termination
- fuse: block request allocation until io-uring init is complete
- fuse: enable fuse-over-io-uring
- fuse: prevent disabling io-uring on active connections
Maíra Canal
- drm/vkms: Remove index parameter from init_vkms_output
- drm/vkms: Code formatting
- drm/vkms: Use drm_frame directly
- drm/vkms: Use const for input pointers in pixel_read an pixel_write functions
- drm/v3d: Add DRM_IOCTL_V3D_PERFMON_SET_GLOBAL
Tvrtko Ursulin
- drm/etnaviv: Add fdinfo support for memory stats
- drm: make drm-active- stats optional
- Documentation/gpu: Clarify drm memory stats definition
- drm/sched: Fix preprocessor guard
Tested (2)
André Almeida
Christian Gmeiner
Acked (1)
Iago Toral Quiroga
Maintainer SoB (6)
Maíra Canal
- drm/v3d: Stop active perfmon if it is being destroyed
- drm/v3d: Add DRM_IOCTL_V3D_PERFMON_SET_GLOBAL
- drm/vc4: plane: Remove WARN on state being set in plane_reset
Tvrtko Ursulin
- drm/i915: Remove deadcode
- drm/i915: Remove unused intel_huc_suspend
- drm/i915: Remove unused intel_ring_cacheline_align
28 Mar 2025 12:00am GMT
15 Mar 2025
planet.freedesktop.org
Simon Ser: Status update, March 2025
Hi all!
This month I've finally finished my initial work on HDR10 support for wlroots! My branch supports playing both SDR and HDR content on either an SDR or HDR output. It's a pretty basic version: wlroots only performs very basic gamut mapping, and has a simple luminance multiplier instead of proper tone mapping. Additionally, the source content luminance and mastering display metadata aren't taken into account. Thus the result isn't as good as it could be, but that can be improved once the initial work is merged!
I've also been talking with dnkl about blending optical color values rather than electrical values in foot ("gamma-correct blending"). Thanks to the color-management protocol, foot can specify that its buffers contain linearly encoded values (as opposed to the default, sRGB) and can implement this blending method without sacrificing performance. See the foot pull request for more details.
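To see why this matters, here is a small self-contained sketch (mine, not from the post) contrasting blending the sRGB-encoded (electrical) values directly with decoding to linear light (optical) first:

```c
/* Minimal sketch: blending two colors in electrical (sRGB-encoded) space
 * vs. optical (linear) space. */
#include <math.h>
#include <stdio.h>

static float srgb_to_linear(float e) {
	return e <= 0.04045f ? e / 12.92f
	                     : powf((e + 0.055f) / 1.055f, 2.4f);
}
static float linear_to_srgb(float o) {
	return o <= 0.0031308f ? o * 12.92f
	                       : 1.055f * powf(o, 1.0f / 2.4f) - 0.055f;
}

int main(void) {
	float a = 0.0f, b = 1.0f; /* black and white, sRGB-encoded */

	/* Naive: average the electrical values directly. */
	float electrical = (a + b) / 2.0f;

	/* Gamma-correct: decode, blend optically, re-encode. */
	float optical =
		linear_to_srgb((srgb_to_linear(a) + srgb_to_linear(b)) / 2.0f);

	/* Prints roughly 0.500 vs 0.735: the two methods disagree visibly. */
	printf("electrical: %.3f, optical: %.3f\n", electrical, optical);
	return 0;
}
```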
We've been working on fixing the few last known blockers remaining for the next wlroots release, in particular related to scene-graph clipping, custom modes, and explicit synchronization. I hope we'll be able to start the release candidate dance soon.
The NPotM is Bakah, a small utility to build Docker Bake configuration files with Buildah (the library powering Podman). I've written more about the motivation and design of this tool in a separate article.
I've released tlstunnel 0.4 with better support for certificate files and some bugfixes. The sogogi WebDAV file server got support for graceful shutdown and Unix socket listeners thanks to Krystian Chachuła. Last, mako 1.10 adds a bunch of useful features such as include directives, more customization for border sizes and icon border radius, and a --no-history flag for makoctl dismiss.
See you next month!
15 Mar 2025 10:00pm GMT
13 Mar 2025
planet.freedesktop.org
Pekka Paalanen: Wayland color-management, SDR vs. HDR, and marketing
This time I have three topics.
First, I want to promote the blog post I wrote to celebrate the landing of the Wayland color-management extension into the wayland-protocols staging area. It's a brief history of the journey.
Second, I want to discuss SDR and HDR video modes on monitors and TVs. I have seen people expect that the same sRGB content displayed on the SDR video mode and the HDR (BT.2100/PQ) video mode on the same monitor will look the same, and they can arbitrarily switch between the modes at any time. I have argued that this is a false expectation. Why?
Monitors tend to have a slew of settings. I tend to call them monitor "knobs". There are brightness, contrast, color temperature, picture mode, dynamic contrast, sharpness, gamma, and whatever. Many people have noticed that when the video source puts the monitor into BT.2100/PQ video mode, the monitor locks out some settings, often brightness and/or contrast included. So, SDR and HDR video modes do not play by the same rules. Hence, one cannot generally expect a match even if the video source does everything correctly.
Third, there is marketing. Have a look at the first third of this video. They discuss video streaming services, TV selling, and HDR from the picture quality point of view. My take on that is that (some? most?) monitors and TVs come with a screaming broken picture out of the box because marketing has to sell them. If all displays displayed a given content as intended, they would all look the same, major technology differences notwithstanding, but marketing wants to make each individual stand out.
Have you heard of TV calibration services? If I buy a new TV from a local electronics department store, they offer a calibration service, for a considerable additional fee. Why would anyone need a calibration service, the factory settings should be good, right?
13 Mar 2025 9:40am GMT
11 Mar 2025
planet.freedesktop.org
Ricardo Garcia: Device-Generated Commands at Vulkanised 2025
A month ago I attended Vulkanised 2025 in Cambridge, UK, to present a talk about Device-Generated Commands in Vulkan. The event was organized by Khronos and took place in the Arm Cambridge office. The talk I presented was similar to the one from XDC 2024, but instead of being a 5-minute lightning talk, I had 25-30 minutes to present and I could expand the contents to include proper explanations of almost all major DGC concepts that appear in the spec.
I attended the event together with my Igalia colleagues Lucas Fryzek and Stéphane Cerveau, who presented about lavapipe and Vulkan Video, respectively. We had a fun time in Cambridge and I can sincerely recommend attending the event to any Vulkan enthusiasts out there. It allows you to meet Khronos members and people working on both the specification and drivers, as well as many other Vulkan users from a wide variety of backgrounds.
The recordings for all sessions are now publicly available, and the one for my talk can be found embedded below. For those of you preferring slides and text, I'm also providing a transcription of my presentation together with slide screenshots further down.
In addition, at the end of the video there's a small Q&A section but I've always found it challenging to answer questions properly on the fly and with limited time. For this reason, instead of transcribing the Q&A section literally, I've taken the liberty of writing down the questions and providing better answers in written form, and I've also included an extra question that I got in the hallways as bonus content. You can find the Q&A section right after the embedded video.
Vulkanised 2025 recording
Questions and answers with longer explanations
Question: can you give an example of when it's beneficial to use Device-Generated Commands?
There are two main use cases where DGC would improve performance: on the one hand, many times game engines use compute pre-passes to analyze the scene they want to draw and prepare some data for that scene. This includes maybe deciding LOD levels, discarding content, etc. After that compute pre-pass, results would need to be analyzed from the CPU in some way. This implies a stall: the output from that compute pre-pass needs to be transferred to the CPU so the CPU can use it to record the right drawing commands, or maybe you do this compute pre-pass during the previous frame and it contains data that is slightly out of date. With DGC, this compute dispatch (or set of compute dispatches) could generate the drawing commands directly, so you don't stall or you can use more precise data. You also save some memory bandwidth because you don't need to copy the compute results to host-visible memory.
On the other hand, sometimes scenes contain so much detail and geometry that recording all the draw calls from the CPU takes a nontrivial amount of time, even if you distribute this draw call recording among different threads. With DGC, the GPU itself can generate these draw calls, so potentially it saves you a lot of CPU time.
Question: as the extension makes heavy use of buffer device addresses, what are the challenges for tools like GFXReconstruct when used to record and replay traces that use DGC?
The extension makes use of buffer device addresses for two separate things. First, it uses them to pass some buffer information to different API functions, instead of passing buffer handles, offsets and sizes. This is not different from other APIs that existed before. The VK_KHR_buffer_device_address extension contains APIs like vkGetBufferOpaqueCaptureAddressKHR, vkGetDeviceMemoryOpaqueCaptureAddressKHR that are designed to take care of those cases and make it possible to record and reply those traces. Contrary to VK_KHR_ray_tracing_pipeline, which has a feature to indicate if you can capture and replay shader group handles (fundamental for capture and replay when using ray tracing), DGC does not have any specific feature for capture-replay. DGC does not add any new problem from that point of view.
Second, the data for some commands that is stored in the DGC buffer sometimes includes device addresses. This is the case for the index buffer bind command, the vertex buffer bind command, indirect draws with count (double indirection here) and ray tracing command. But, again, the addresses in those commands are buffer device addresses. That does not add new challenges for capture and replay compared to what we already had.
Question: what is the deal with the last token being the one that dispatches work?
One minor detail from DGC, that's important to remember, is that, by default, DGC respects the order in which sequences appear in the DGC buffer and the state used for those sequences. If you have a DGC buffer that dispatches multiple draws, you know the state that is used precisely for each draw: it's the state that was recorded before the execute-generated-commands call, plus the small changes that a particular sequence modifies like push constant values or vertex and index buffer binds, for example. In addition, you know precisely the order of those draws: executing the DGC buffer is equivalent, by default, to recording those commands in a regular command buffer from the CPU, in the same order they appear in the DGC buffer.
However, when you create an indirect commands layout you can indicate that the sequences in the buffer may run in an undefined order (this is VK_INDIRECT_COMMANDS_LAYOUT_USAGE_UNORDERED_SEQUENCES_BIT_EXT). If the sequences could dispatch work and then change state, we would have a logical problem: what do those state changes affect? The sequence that is executed right after the current one? Which one is that? We would not know the state used for each draw. Forcing the work-dispatching command to be the last one is much easier to reason about and is also logically tight.
Naturally, if you have a series of draws on the CPU where, for some of them, you change some small bits of state (e.g. like disabling the depth or stencil tests) you cannot do that in a single DGC sequence. For those cases, you need to batch your sequences in groups with the same state (and use multiple DGC buffers) or you could use regular draws for parts of the scene and DGC for the rest.
Question from the hallway: do you know what drivers do exactly at preprocessing time that is so important for performance?
Most GPU drivers these days have a kernel side and a userspace side. The kernel driver does a lot of things like talking to the hardware, managing different types of memory and buffers, talking to the display controller, etc. The kernel driver normally also has facilities to receive a command list from userspace and send it to the GPU.
These command lists are particular to each GPU vendor and model. The packets that form them control different aspects of the GPU. For example (this is completely made-up), maybe one GPU has a particular packet to modify depth buffer and test parameters, and another packet for the stencil test and its parameters, while another GPU from another vendor has a single packet that controls both. There may be another packet that dispatches draw work of all kinds and is flexible enough to accommodate the different draw commands that are available on Vulkan.
The Vulkan userspace driver translates Vulkan command buffer contents to these GPU-specific command lists. In many drivers, the preprocessing step in DGC takes the command buffer state, combines it with the DGC buffer contents and generates a final command list for the GPU, storing that final command list in the preprocess buffer. Once the preprocess buffer is ready, executing the DGC commands is only a matter of sending that command list to the GPU.
Talk slides and transcription
Hello, everyone! I'm Ricardo from Igalia and I'm going to talk about device-generated commands in Vulkan.
First, some bits about me. I have been part of the graphics team at Igalia since 2019. For those that don't know us, Igalia is a small consultancy company specialized in open source and my colleagues in the graphics team work on things such as Mesa drivers, Linux kernel drivers, compositors… that kind of things. In my particular case the focus of my work is contributing to the Vulkan Conformance Test Suite and I do that as part of a collaboration between Igalia and Valve that has been going on for a number of years now. Just to highlight a couple of things, I'm the main author of the tests for the mesh shading extension and device-generated commands that we are talking about today.
So what are device-generated commands? So basically it's a new extension, a new functionality, that allows a driver to read command sequences from a regular buffer: something like, for example, a storage buffer, instead of the usual regular command buffers that you use. The contents of the DGC buffer could be filled from the GPU itself. This is what saves you the round trip to the CPU and, that way, you can improve the GPU-driven rendering process in your application. It's like one step ahead of indirect draws and dispatches, and one step behind work graphs. And it's also interesting because device-generated commands provide a better foundation for translating DX12. If you have a translation layer that implements DX12 on top of Vulkan like, for example, Proton, and you want to implement ExecuteIndirect, you can do that much more easily with device generated commands. This is important for Proton, which Valve uses to run games on the Steam Deck, i.e. Windows games on top of Linux.
If we set aside Vulkan for a moment, and we stop thinking about GPUs and such, and you want to come up with a naive CPU-based way of running commands from a storage buffer, how do you do that? Well, one immediate solution we can think of is: first of all, I'm going to assign a token, an identifier, to each of the commands I want to run, and I'm going to store that token in the buffer first. Then, depending on what the command is, I want to store more information.
For example, if we have a sequence like we see here in the slide where we have a push constant command followed by dispatch, I'm going to store the token for the push constants command first, then I'm going to store some information that I need for the push constants command, like the pipeline layout, the stage flags, the offset and the size. Then, after that, depending on the size that I said I need, I am going to store the data for the command, which is the push constant values themselves. And then, after that, I'm done with it, and I store the token for the dispatch, and then the dispatch size, and that's it.
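As a rough illustration, that naive encoding might look like the following sketch. This is the strawman the talk sets up, not the Vulkan API: every struct and token name here is made up.

```c
/* Strawman only: a naive CPU-style "commands in a buffer" encoding. */
#include <vulkan/vulkan.h>
#include <stdint.h>
#include <string.h>

enum token { TOKEN_PUSH_CONSTANTS, TOKEN_DISPATCH };

struct push_constants_info {           /* made-up command info */
	VkPipelineLayout layout;
	VkShaderStageFlags stages;
	uint32_t offset, size;
};

struct dispatch_size { uint32_t x, y, z; };  /* made-up */

/* Encode one "push constants + dispatch" sequence, returning bytes written. */
static size_t encode_sequence(unsigned char *p, VkPipelineLayout layout,
                              const void *values, uint32_t size)
{
	unsigned char *start = p;
	enum token t;

	/* Push constants: token, then fixed info, then the data itself. */
	t = TOKEN_PUSH_CONSTANTS;
	memcpy(p, &t, sizeof(t)); p += sizeof(t);
	struct push_constants_info info = {
		layout, VK_SHADER_STAGE_COMPUTE_BIT, 0, size,
	};
	memcpy(p, &info, sizeof(info)); p += sizeof(info);
	memcpy(p, values, size); p += size;

	/* Dispatch: token, then the dispatch size, and we're done. */
	t = TOKEN_DISPATCH;
	memcpy(p, &t, sizeof(t)); p += sizeof(t);
	struct dispatch_size dim = { 8, 8, 1 };
	memcpy(p, &dim, sizeof(dim)); p += sizeof(dim);

	return (size_t)(p - start);
}
```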
But this doesn't really work: this is not how GPUs work. A GPU would have a hard time running commands from a buffer if we store them this way. And this is not how Vulkan works because in Vulkan you want to provide as much information as possible in advance and you want to make things run in parallel as much as possible, and take advantage of the GPU.
So what do we do in Vulkan? In Vulkan, and in the Vulkan VK_EXT_device_generated_commands extension, we have this central concept, which is called the Indirect Commands Layout. This is the main thing, and if you want to remember just one thing about device generated commands, you can remember this one.
The indirect commands layout is basically like a template for a short sequence of commands. The way you build this template is using the tokens and the command information that we saw colored red and green in the previous slide, and you build that in advance and pass that in advance so that, in the end, in the command buffer itself, in the buffer that you're filling with commands, you don't need to store that information. You just store the data for each command. That's how you make it work.
And the result of this is that with the commands layout, that I said is a template for a short sequence of commands (and by short I mean a handful of them like just three, four or five commands, maybe 10), the DGC buffer can be pretty large, but it does not contain a random sequence of commands where you don't know what comes next. You can think about it as divided into small chunks that the specification calls sequences, and you get a large number of sequences stored in the buffer but all of them follow this template, this commands layout. In the example we had, push constant followed by dispatch, the contents of the buffer would be push constant values, dispatch size, push constant values, dispatch size, repeated many times.
The second thing that Vulkan does to be able to make this work is that we limit a lot what you can do with device-generated commands. There are a lot of things you cannot do. In fact, the only things you can do are the ones that are present in this slide.
You have some things like, for example, update push constants, you can bind index buffers, vertex buffers, and you can draw in different ways, using mesh shading maybe, you can dispatch compute work and you can dispatch raytracing work, and that's it. You also need to check which features the driver supports, because maybe the driver only supports device-generated commands for compute or ray tracing or graphics. But you notice you cannot do things like start render passes or insert barriers or bind descriptor sets or that kind of thing. No, you cannot do that. You can only do these things.
This indirect commands layout, which is the backbone of the extension, specifies, as I said, the layout for each sequence in the buffer and it has additional restrictions. The first one is that it must specify exactly one token that dispatches some kind of work and it must be the last token in the sequence. You cannot have a sequence that dispatches graphics work twice, or that dispatches computer work twice, or that dispatches compute first and then draws, or something like that. No, you can only do one thing with each DGC buffer and each commands layout and it has to be the last one in the sequence.
And one interesting thing that also Vulkan allows you to do, that DX12 doesn't let you do, is that it allows you (on some drivers, you need to check the properties for this) to choose which shaders you want to use for each sequence. This is a restricted version of the bind pipeline command in Vulkan. You cannot choose arbitrary pipelines and you cannot change arbitrary states but you can switch shaders. For example, if you want to use a different fragment shader for each of the draws in the sequence, you can do that. This is pretty powerful.
How do you create one of those indirect commands layouts? Well, with one of those typical Vulkan calls to create an object, where you pass one of these CreateInfo structures that are always present in Vulkan.
And, as you can see, you have to pass these shader stages that will be used, will be active, while you draw or you execute those indirect commands. You have to pass the pipeline layout, and you have to pass in an indirect stride. The stride is the amount of bytes for each sequence, from the start of a sequence to the next one. And the most important information of course, is the list of tokens: an array of tokens that you pass as the token count and then the pointer to the first element.
Now, each of those tokens contains a bit of information and the most important one is the type, of course. Then you can also pass an offset that tells you how many bytes into the sequence for the start of the data for that command. Together with the stride, it tells us that you don't need to pack the data for those commands together. If you want to include some padding, because it's convenient or something, you can do that.
And then there's also the token data which allows you to pass the information that I was painting in green in other slides like information to be able to run the command with some extra parameters. Only a few tokens, a few commands, need that. Depending on the command it is, you have to fill one of the pointers in the union but for most commands they don't need this kind of information. Knowing which command it is you just know you are going to find some fixed data in the buffer and you just read that and process that.
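Putting those pieces together, here is a hedged sketch of creating an indirect commands layout for the running "push constants + dispatch" example, following the VK_EXT_device_generated_commands API as described in the talk; device and pipeline_layout are assumed to have been created elsewhere:

```c
/* Sketch: each sequence stores 16 bytes of push constant values followed by
 * a VkDispatchIndirectCommand. */
VkIndirectCommandsPushConstantTokenEXT pc_token = {
	.updateRange = { VK_SHADER_STAGE_COMPUTE_BIT, 0, 16 },
};

VkIndirectCommandsLayoutTokenEXT tokens[2] = {
	{
		.sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
		.type = VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT,
		.data.pPushConstant = &pc_token,
		.offset = 0,  /* push constant values at the sequence start */
	},
	{
		.sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
		.type = VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT,
		.offset = 16, /* dispatch size right after the push constants */
	},
};

VkIndirectCommandsLayoutCreateInfoEXT layout_info = {
	.sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_EXT,
	.shaderStages = VK_SHADER_STAGE_COMPUTE_BIT,
	.indirectStride = 16 + sizeof(VkDispatchIndirectCommand),
	.pipelineLayout = pipeline_layout,
	.tokenCount = 2,
	.pTokens = tokens,
};

VkIndirectCommandsLayoutEXT cmds_layout;
vkCreateIndirectCommandsLayoutEXT(device, &layout_info, NULL, &cmds_layout);
```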
One thing that is interesting, like I said, is the ability to switch shaders and to choose which shaders are going to be used for each of those individual sequences. Some form of pipeline switching, or restricted pipeline switching. To do that you have to create something that is called Indirect Execution Sets.
Each of these execution sets is like a group or an array, if you want to think about it like that, of pipelines: similar pipelines or shader objects. They have to share something in common, which is that all of the state in the pipeline has to be identical, basically. Only the shaders can change.
When you create these execution sets and you start adding pipelines or shaders to them, you assign an index to each pipeline in the set. Then, you pass this execution set beforehand, before executing the commands, so that the driver knows which set of pipelines you are going to use. And then, in the DGC buffer, when you have this pipeline token, you only have to store the index of the pipeline that you want to use. You create the execution set with 20 pipelines and you pass an index for the pipeline that you want to use for each draw, for each dispatch, or whatever.
The way to create the execution sets is the one you see here, where we have, again, one of those CreateInfo structures. There, we have to indicate the type, which is pipelines or shader objects. Depending on that, you have to fill one of the pointers from the union on the top right here.
If we focus on pipelines because it's easier on the bottom left, you have to pass the maximum pipeline count that you're going to store in the set and an initial pipeline. The initial pipeline is what is going to set the template that all pipelines in the set are going to conform to. They all have to share essentially the same state as the initial pipeline and then you can change the shaders. With shader objects, it's basically the same, but you have to pass more information for the shader objects, like the descriptor set layouts used by each stage, push-constant information… but it's essentially the same.
Once you have that execution set created, you can use those two functions (vkUpdateIndirectExecutionSetPipelineEXT and vkUpdateIndirectExecutionSetShaderEXT) to update and add pipelines to that execution set. You need to take into account that you have to pass a couple of special creation flags to the pipelines, or the shader objects, to tell the driver that you may use those inside an execution set because the driver may need to do something special for them. And one additional restriction that we have is that if you use an execution set token in your sequences, it must appear only once and it must be the first one in the sequence.
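A hedged sketch of that flow for pipelines, again assuming the API as described in the talk; device, initial_pipeline and other_pipeline are assumed to exist, and the pipelines must have been created with the indirect-bindable creation flag mentioned above:

```c
/* Sketch: an execution set of up to 8 compute pipelines sharing the state
 * of initial_pipeline. */
VkIndirectExecutionSetPipelineInfoEXT pipeline_info = {
	.sType = VK_STRUCTURE_TYPE_INDIRECT_EXECUTION_SET_PIPELINE_INFO_EXT,
	.initialPipeline = initial_pipeline,
	.maxPipelineCount = 8,
};

VkIndirectExecutionSetCreateInfoEXT set_info = {
	.sType = VK_STRUCTURE_TYPE_INDIRECT_EXECUTION_SET_CREATE_INFO_EXT,
	.type = VK_INDIRECT_EXECUTION_SET_INFO_TYPE_PIPELINES_EXT,
	.info.pPipelineInfo = &pipeline_info,
};

VkIndirectExecutionSetEXT exec_set;
vkCreateIndirectExecutionSetEXT(device, &set_info, NULL, &exec_set);

/* Register another pipeline at index 1; an execution set token in the DGC
 * buffer then selects pipelines by these indices. */
VkWriteIndirectExecutionSetPipelineEXT write = {
	.sType = VK_STRUCTURE_TYPE_WRITE_INDIRECT_EXECUTION_SET_PIPELINE_EXT,
	.index = 1,
	.pipeline = other_pipeline,
};
vkUpdateIndirectExecutionSetPipelineEXT(device, exec_set, 1, &write);
```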
The recap, so far, is that the DGC buffer is divided into small chunks that we call sequences. Each sequence follows a template that we call the indirect commands layout. Each sequence must dispatch work exactly once, and you may be able to switch the set of shaders used with each sequence with an Indirect Execution Set.
How do we go about actually telling Vulkan to execute the contents of a specific buffer? Well, before executing the contents of the DGC buffer the application needs to have bound all the needed state to run those commands. That includes descriptor sets, initial push constant values, initial shader state, initial pipeline state. Even if you are going to use an Execution Set to switch shaders later you have to specify some kind of initial shader state.
Once you have that, you can call vkCmdExecuteGeneratedCommandsEXT. You bind all the state into your regular command buffer and then you record this command to tell the driver: at this point, execute the contents of this buffer. As you can see, you typically pass a regular command buffer as the first argument. Then there's a boolean value called isPreprocessed, which is kind of confusing because it's the first time it appears and you don't know what it's about, but we will talk about it in a minute. And then you pass a relatively large structure containing information about what to execute.
In that GeneratedCommandsInfo structure, you need to pass again the shader stages that will be used. You have to pass the handle for the Execution Set, if you're going to use one (if not you can use the null handle). Of course, the indirect commands layout, which is the central piece here. And then you pass the information about the buffer that you want to execute, which is the indirect address and the indirect address size as the buffer size. We are using buffer device address to pass information.
And then we have something again mentioning some kind of preprocessing thing, which is really weird: preprocess address and preprocess size which looks like a buffer of some kind (we will talk about it later). You have to pass the maximum number of sequences that you are going to execute. Optionally, you can also pass a buffer address for an actual counter of sequences. And the last thing that you need is the max draw count, but you can forget about that if you are not dispatching work using draw-with-count tokens as it only applies there. If not, you leave it as zero and it should work.
We have a couple of things here that we haven't talked about yet, which are the preprocessing things. Starting from the bottom, that preprocess address and size give us a hint that there may be a pre-processing step going on. Some kind of thing that the driver may need to do before actually executing the commands, and we need to pass information about the buffer there.
The boolean value that we pass to the command ExecuteGeneratedCommands tells us that the pre-processing step may have happened before so it may be possible to explicitly do that pre-processing instead of letting the driver do that at execution time. Let's take a look at that in more detail.
First of all, what is the pre-process buffer? The pre-process buffer is auxiliary space, a scratch buffer, because some drivers need to take a look at how the command sequence looks like before actually starting to execute things. They need to go over the sequence first and they need to write a few things down just to be able to properly do the job later to execute those commands.
Once you have the commands layout and the maximum number of sequences that you are going to execute, you can call vkGetGeneratedCommandsMemoryRequirementsEXT and the driver is going to tell you how much space it needs. Then you can create a buffer and allocate that space; you need to pass a special new buffer usage flag (VK_BUFFER_USAGE_2_PREPROCESS_BUFFER_BIT_EXT) and, once you have that buffer, you pass its address and its size in the previous structure.
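A hedged sketch of sizing and creating that scratch buffer (names follow the extension as described plus the maintenance5-style VkBufferUsageFlags2 machinery; memory allocation and the device-address query are elided):

```c
/* Sketch: query preprocess buffer size, then create the buffer. */
VkGeneratedCommandsMemoryRequirementsInfoEXT req_info = {
	.sType = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_MEMORY_REQUIREMENTS_INFO_EXT,
	.indirectExecutionSet = VK_NULL_HANDLE,
	.indirectCommandsLayout = cmds_layout,
	.maxSequenceCount = sequence_count,
	.maxDrawCount = 0,
};
VkMemoryRequirements2 reqs = {
	.sType = VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2,
};
vkGetGeneratedCommandsMemoryRequirementsEXT(device, &req_info, &reqs);

/* The new usage bit lives in VkBufferUsageFlags2, chained into the
 * buffer create info. */
VkBufferUsageFlags2CreateInfoKHR usage2 = {
	.sType = VK_STRUCTURE_TYPE_BUFFER_USAGE_FLAGS_2_CREATE_INFO_KHR,
	.usage = VK_BUFFER_USAGE_2_PREPROCESS_BUFFER_BIT_EXT |
	         VK_BUFFER_USAGE_2_SHADER_DEVICE_ADDRESS_BIT_KHR,
};
VkBufferCreateInfo buf_info = {
	.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
	.pNext = &usage2,
	.size = reqs.memoryRequirements.size,
};
VkBuffer preprocess_buf;
vkCreateBuffer(device, &buf_info, NULL, &preprocess_buf);
/* ...allocate/bind memory and query its device address as usual... */
```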
Now the second thing is that we have the possibility of doing this preprocessing step explicitly. Explicit pre-processing is something that's optional, but you probably want to do that if you care about performance because it's the key to performance with some drivers.
When you use explicit pre-processing you don't want to simply (1) record the state, (2) call vkCmdPreprocessGeneratedCommandsEXT and (3) call vkCmdExecuteGeneratedCommandsEXT, all on the same command buffer. That is what implicit pre-processing effectively does, so doing it this way doesn't give you anything.
This is designed so that, if you want to do explicit pre-processing, you're probably going to want to use a separate command buffer for pre-processing. You want to batch pre-processing calls together and submit them all together to keep the GPU busy and to get the performance that you want. While you submit the pre-processing steps you may still be preparing the rest of the command buffers to enqueue the next batch of work. That's the key to doing pre-processing optimally.
You need to decide beforehand if you are going to use explicit pre-processing or not because, if you're going to use explicit preprocessing, you need to pass a flag when you create the commands layout, and then you have to call the function to preprocess generated commands. If you don't pass that flag, you cannot call the preprocessing function, so it's an all or nothing. You have to decide, and you do what you want.
One thing that is important to note is that preprocessing needs to see the same state and the same contents of the input buffers as execution, so it can run properly.
The video contains a cut here because the presentation laptop ran out of battery.
If the pre-processing step needs to have the same state as the execution, you need to have bound the same pipeline state, the same shaders, the same descriptor sets, the same contents. I said that explicit pre-processing is normally used using a separate command buffer that we submit before actual execution. You have a small problem to solve, which is that you would need to record state twice: once on the pre-process command buffer, so that the pre-process step knows everything, and once on the execution, the regular command buffer, when you call execute. That would be annoying.
Instead of that, the pre-process generated commands function takes an argument that is a state command buffer and the specification tells you: this is a command buffer that needs to be in the recording state, and the pre-process step is going to read the state from it. This is the first time, and I think the only time in the specification, that something like this is done. You may be puzzled about what this is exactly: how do you use this and how do we pass this?
I just wanted to get this slide out to tell you: if you're going to use explicit pre-processing, the ergonomic way of using it and how we thought about using the processing step is like you see in this slide. You take your main command buffer and you record all the state first and, just before calling execute-generated-commands, the regular command buffer contains all the state that you want and that preprocess needs. You stop there for a moment and then you prepare your separate preprocessing command buffer passing the main one as an argument to the preprocess call, and then you continue recording commands in your regular command buffer. That's the ergonomic way of using it.
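A hedged sketch of that ergonomic flow (main_cmd, preprocess_cmd and the other handles are assumed; cmds_layout must have been created with VK_INDIRECT_COMMANDS_LAYOUT_USAGE_EXPLICIT_PREPROCESS_BIT_EXT):

```c
/* 1. Main command buffer: bind everything the sequences need. */
vkCmdBindPipeline(main_cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
vkCmdBindDescriptorSets(main_cmd, VK_PIPELINE_BIND_POINT_COMPUTE,
                        pipeline_layout, 0, 1, &desc_set, 0, NULL);

/* 2. Separate preprocess command buffer reads state from main_cmd, which
 *    must still be in the recording state at this point. */
vkCmdPreprocessGeneratedCommandsEXT(preprocess_cmd, &exec_info, main_cmd);

/* 3. Back on the main command buffer: execute, telling the driver the
 *    preprocessing will have been submitted before this runs. */
vkCmdExecuteGeneratedCommandsEXT(main_cmd, VK_TRUE /* isPreprocessed */,
                                 &exec_info);
```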
You do need some synchronization at some steps. The main one is that, if you generate the contents of the DGC buffer from the GPU itself, you're going to need some synchronization: writes to that buffer need to be synchronized with something else that comes later, which is executing or reading those commands from the buffer.
Depending on whether you use explicit pre-processing, you synchronize those writes either with the new command-preprocess pipeline stage and its preprocess-read access, or with regular device-generated-commands execution, which is considered part of the existing draw-indirect stage using indirect-command-read access.
If you use explicit pre-processing you also need to make sure that writes to the pre-process buffer happen before you start reading from it, so you synchronize the pre-processing stage and its write access (VK_PIPELINE_STAGE_COMMAND_PREPROCESS_BIT_EXT, VK_ACCESS_COMMAND_PREPROCESS_WRITE_BIT_EXT) with execution (VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, VK_ACCESS_INDIRECT_COMMAND_READ_BIT).
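A hedged sketch of those two barriers using synchronization2, assuming the DGC buffer is filled from a compute shader and explicit pre-processing is recorded into the same command buffer:

```c
/* 1. GPU writes filling the DGC buffer -> pre-processing reads it. */
VkMemoryBarrier2 fill_to_preprocess = {
	.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2,
	.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
	.srcAccessMask = VK_ACCESS_2_SHADER_WRITE_BIT,
	.dstStageMask  = VK_PIPELINE_STAGE_2_COMMAND_PREPROCESS_BIT_EXT,
	.dstAccessMask = VK_ACCESS_2_COMMAND_PREPROCESS_READ_BIT_EXT,
};
VkDependencyInfo dep1 = {
	.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
	.memoryBarrierCount = 1,
	.pMemoryBarriers = &fill_to_preprocess,
};
vkCmdPipelineBarrier2(cmd_buf, &dep1);

/* ...vkCmdPreprocessGeneratedCommandsEXT(...) goes here... */

/* 2. Pre-process buffer writes -> execution reads (draw-indirect stage). */
VkMemoryBarrier2 preprocess_to_execute = {
	.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER_2,
	.srcStageMask  = VK_PIPELINE_STAGE_2_COMMAND_PREPROCESS_BIT_EXT,
	.srcAccessMask = VK_ACCESS_2_COMMAND_PREPROCESS_WRITE_BIT_EXT,
	.dstStageMask  = VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT,
	.dstAccessMask = VK_ACCESS_2_INDIRECT_COMMAND_READ_BIT,
};
VkDependencyInfo dep2 = {
	.sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
	.memoryBarrierCount = 1,
	.pMemoryBarriers = &preprocess_to_execute,
};
vkCmdPipelineBarrier2(cmd_buf, &dep2);
```

Without explicit pre-processing, the first barrier's destination would instead be the draw-indirect stage with indirect-command-read access, since the driver does the preprocessing as part of execution.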
The quick how-to: I just wanted to get this slide out for those wanting a reference that says exactly what you need to do. All the steps that I mentioned here about creating the commands layout, the execution set, allocating the preprocess buffer, etc. This is the basic how-to.
And that's it. Thanks for watching! Questions?
11 Mar 2025 4:30pm GMT