30 Oct 2024

feedplanet.freedesktop.org

Christian Gmeiner: CI-Tron: A Long Road to a Better Board Farm

I'm a big supporter of finding problems before they get into the code base. The earlier you catch issues, the easier they are to fix. One of the main tools that helps with this is a Continuous Integration (CI) farm. A CI farm allows you to run extensive tests like deqp or piglit on a merge request or even on a private git branch before any code is merged, which significantly helps catch problems early.

I'm not the first one at Igalia to think this is really important. We already have a large Raspberry Pi board farm available on freedesktop's GitLab instance that serves as a powerful tool for validating changes before they hit the main branch.

For a while, however, the etnaviv board farm has been offline. The main reason? I needed to clean up the setup: re-house it in a proper rack, redo all the wiring, and add more devices. What initially seemed like a few days' worth of work spiraled into months of delay, mostly because I wanted to transition to using ci-tron.

Getting Familiar with the Ci-Tron Setup

Before diving into my journey, let's quickly cover what makes up a ci-tron board farm.

The Long Road to a Working Farm

Over the past few months, I've been slowly preparing for the big ci-tron transition. The first step was ensuring my PDU was compatible. It wasn't initially supported, but after some hacking, I got it working and submitted a merge request (MR). After a few rounds of revisions, it was merged, expanding ci-tron's PDU support significantly.

The next and most critical step was getting a DUT to boot up correctly. Initially, ci-tron only supported iPXE as a boot method, but my devices are using U-Boot. I tried to make it work anyway, but the network initialization failed too often, and I found myself sinking hours into debugging.

Thankfully, rudimentary support for a U-Boot based boot flow was eventually added. After some tweaks, I managed to get my DUTs booting - but not without complications. A major problem was getting the correct Device Tree Blob (DTB) to load, which was needed for ci-tron's training rounds. A Device Tree Blob (DTB) is a binary representation of the hardware layout of a device. The DTB is used by the Linux kernel to understand the hardware configuration, including components like the CPU, memory, and peripherals. In my case, ensuring that the correct DTB was provided was crucial for the DUT to boot and be correctly managed by ci-tron. While integrating the DTB into U-Boot was suggested, it wasn't ideal. Updating the bootloader just to change a DTB is cumbersome, especially with multiple devices in the farm.

With the booting issue taking up too much time, I decided to put it on hold and focus on something else: gfxinfo.

Gfxinfo Integration Challenges

gfxinfo is a neat feature that automatically tags a DUT based on the GPU model in the system, avoiding the need for manually assigning tags like gc2000. In theory, it's very convenient-but in practice, there were hurdles.

gfxinfo tags Vivante GPUs using the device tree node information. However, since Vivante GPUs are quite generic, they don't have a specific model property that uniquely identifies them. The plan was to pull this information using ioctl() calls to the etnaviv kernel driver. It took a lot of back and forth in review due to the internal gfxinfo API being under-documented, but after a lot of effort, I finally got the necessary code merged. You can find all of it in this MR.

Final Push: Getting Everything to Boot

There was still one major obstacle - getting the DUT to boot reliably. Luckily, mupuf was already working on it and made a significant MR with over 80 patches to address the boot issues. Introducing "boots db," a feature designed to decouple the boot process, granting full control over DHCP, TFTP, and HTTP servers to each job. This is paired with YAML configurations to flexibly define the boot specifics for each board.

As of a few days ago, the latest official ci-tron gateway image contains everything needed to get an etnaviv DUT up and running successfully.

tags

I have to say, I'm very impressed with the end result. It took a lot longer than I had anticipated, but we finally have a plug-and-play CI farm solution for etnaviv. There are still a few missing features-like Network Block Device (NBD) support and some advanced statistics-but the ci-tron team is doing an excellent job, and I'm optimistic about what's coming next.

Conclusion: A Long Road, but Worth It

The journey to get the etnaviv board farm back online was longer than expected, full of unexpected challenges and technical hurdles. But it was worth it. The result is a robust, automated solution that makes CI testing easier and more reliable for everyone. With ci-tron, it's easier to find and fix problems before they ever make it into the code base, which is exactly what a good CI setup should be all about. There is still some work to be done on the GitLab side to switch all etnaviv jobs to the new board farm.

If you're thinking about setting up your own CI farm or migrating to ci-tron, I hope my experience helps smooth the road for you a bit. It might be a long journey, but the end results are absolutely worth it.

30 Oct 2024 12:00am GMT

28 Oct 2024

feedplanet.freedesktop.org

Maira Canal: Unleashing Power: Enabling Super Pages on the RPi

Unleashing the power of 3D graphics in the Raspberry Pi is a key commitment for Igalia through its collaboration with Raspberry Pi. The introduction of Super Pages for the Raspberry Pi 4 and 5 marks another step in this journey, offering some performance enhancements and more efficient memory usage. In this post, we'll dive deep into the technical details of Super Pages, discuss the challenges we faced during implementation, and illustrate the benefits this feature brings to the Raspberry Pi ecosystem.

What are Super Pages?

A Memory Management Unit (MMU) is a hardware component responsible for handling memory access at the system level. It translates virtual addresses used by programs into physical addresses in main memory, enabling efficient memory management and protection. The MMU allows the operating system to allocate memory dynamically, isolating processes from one another to prevent them from interfering with each other's memory.

Recommendation: đź“š Structured computer organization by Andrew Tanenbaum

The V3D MMU, which is part of the Broadcom GPU found in the Raspberry Pi 4 and 5, is responsible for translating 32-bit virtual addresses (VA) used by V3D into 40-bit physical addresses used externally to V3D. The MMU relies on a page table, stored in physical memory, which maps virtual addresses to their corresponding physical addresses. The operating system manages this page table, and the MMU uses it to perform address translation during memory access.

A fundamental principle of modern operating systems is that memory is not stored contiguously. Instead, a contiguous block of memory is divided into smaller blocks, called "pages", which are scattered across the entire address space. These pages are typically 4KB in size. This approach enables more efficient memory management and allows for features like virtual memory and memory protection.

Over the years, the amount of available memory in computers has increased dramatically. An early IBM PC had up to 640 KiB of RAM, whereas the ThinkPad I'm typing on right now has 32 GB of RAM. Naturally, memory demands have grown alongside this increase. Today, it's common for web browsers to consume several gigabytes of RAM, and a single shader can take up multiple megabytes.

As memory usage grows, a 4KB page size may become inefficient for managing large memory blocks. Handling a large number of small pages for a single block means the MMU must perform multiple address translations, which increases overhead. This can reduce the effectiveness of the Translation Lookaside Buffer (TLB), as it must store and handle more entries, potentially leading to more cache misses and reduced overall performance.

This is why many CPU manufacturers have introduced support for larger page sizes. For instance, x86 CPUs typically support 4KB and 2MB pages, with 1GB pages available if supported by the hardware. Similarly, ARM64 CPUs can support 4KB, 16KB, and 64KB page sizes. These larger page sizes help reduce the number of pages the MMU needs to manage, improving performance by reducing the overhead of address translation and making more efficient use of the TLB.

So, if CPUs are using bigger sizes, why shouldn't GPUs do the same?

By default, V3D supports 4KB pages. However, by setting specific bits in the page table entry, it is possible to create 64KB "Big Pages" and 1MB "Super Pages." The issue is that the current V3D driver available in Linux does not enable the use of Big or Super Pages, meaning this hardware feature is currently unused.

The advantage of enabling Big and Super Pages is that once an entry for any page within a Big or Super Page is cached in the MMU, it can be used to translate all virtual addresses within that page's range without needing to fetch additional entries. In theory, this should result in improved performance, especially for applications with high memory demands, such as those using multiple large buffer objects (BOs).

As Igalia continually strives to enhance the experience for Raspberry Pi users, we decided to implement this feature in the upstream kernel. But before diving into the implementation details, let's take a look at the real-world results and see if the theoretical benefits of Super Pages have translated into measurable improvements for Raspberry Pi users.

What Does This Feature Mean for RPi Users?

With Super Pages implemented, let's now explore the actual performance improvements observed on the Raspberry Pi and see how impactful this feature is for users.

Benchmarking Super Pages: Traces and FPS Improvements

To measure the impact of Super Pages, we tested a variety of games and demos traces on the Raspberry Pi 4 and 5, covering genres from action to racing. On average, we observed a +1.40% FPS improvement on the Raspberry Pi 4 and a +1.30% improvement on the Raspberry Pi 5.

For instance, on the Raspberry Pi 4, Warzone 2100 saw an 8.36% FPS increase, and on the Raspberry Pi 5, Quake II enjoyed a 3.62% boost. These examples demonstrate the benefits of Super Pages in resource-demanding applications, where optimized memory handling becomes critical.

Raspberry Pi 4 FPS Improvements

Trace Before Super Pages After Super Pages Improvement
warzone2100.30secs.1024x768.trace 56.39 61.10 +8.36%
ue4_shooter_game_shooting_low_quality_640x480.gfxr 20.71 21.47 +3.65%
quake3e_capture_frames_1800_through_2400_1920x1080.gfxr 60.88 62.50 +2.67%
supertuxkart-menus_1024x768.trace 112.62 115.61 +2.65%
ue4_shooter_game_shooting_high_quality_640x480.gfxr 20.45 20.88 +2.10%
quake2-gles3-1280x720.trace 59.76 60.84 +1.82%
ue4_sun_temple_640x480.gfxr 27.60 28.03 +1.54%
vkQuake_capture_frames_1_through_1200_1280x720.gfxr 54.59 55.30 +1.29%
ue4_shooter_game_low_quality_640x480.gfxr 32.75 33.08 +1.00%
sponza_demo02_800x600.gfxr 20.90 21.03 +0.61%
supertuxkart-racing_1024x768.trace 8.58 8.63 +0.60%
ue4_shooter_game_high_quality_640x480.gfxr 19.62 19.74 +0.59%
serious_sam_trace02_1280x720.gfxr 44.00 44.21 +0.50%
ue4_vehicle_game-2_640x480.gfxr 12.59 12.65 +0.49%
sponza_demo01_800x600.gfxr 21.42 21.46 +0.19%
quake3e-1280x720.trace 84.45 84.52 +0.09%

Raspberry Pi 5 FPS Improvements

Trace Before Super Pages After Super Pages Improvement
quake2-gles3-1280x720.trace 151.77 157.26 +3.62%
supertuxkart-menus_1024x768.trace 306.79 313.88 +2.31%
warzone2100.30secs.1024x768.trace 140.92 144.03 +2.21%
vkQuake_capture_frames_1_through_1200_1280x720.gfxr 131.45 134.20 +2.10%
ue4_vehicle_game-2_640x480.gfxr 24.42 24.88 +1.89%
ue4_shooter_game_high_quality_640x480.gfxr 32.12 32.53 +1.29%
ue4_sun_temple_640x480.gfxr 42.05 42.55 +1.20%
ue4_shooter_game_shooting_high_quality_640x480.gfxr 52.77 53.31 +1.04%
quake3e-1280x720.trace 238.31 240.53 +0.93%
warzone2100.70secs.1024x768.trace 151.09 151.81 +0.48%
sponza_demo02_800x600.gfxr 50.81 51.05 +0.46%
supertuxkart-racing_1024x768.trace 20.91 20.98 +0.33%
ue4_shooter_game_low_quality_640x480.gfxr 59.68 59.86 +0.29%
quake3e_capture_frames_1_through_1800_1920x1080.gfxr 167.70 168.17 +0.29%
ue4_shooter_game_shooting_low_quality_640x480.gfxr 53.40 53.51 +0.22%
quake3e_capture_frames_1800_through_2400_1920x1080.gfxr 163.37 163.64 +0.17%
serious_sam_trace02_1280x720.gfxr 60.00 60.03 +0.06%
sponza_demo01_800x600.gfxr 45.04 45.04 <.01%

While an average +1% FPS improvement might seem modest, Super Pages can deliver more noticeable gains in memory-intensive 3D applications and when the GPU is under heavy usage. Let's see how the Super Pages perform on Mesa CI.

Benchmarking Super Pages: Mesa CI Job Duration

To avoid introducing regressions in user-space, I usually test my custom kernels with Mesa CI, focusing on the "broadcom-postmerge" stage to verify that all Piglit and CTS tests ran smoothly. For Super Pages, I was pleasantly surprised by the job duration results, as some job durations were reduced by several minutes.

Mesa CI Jobs Duration Improvements

Job Before Super Pages After Super Pages
v3d-rpi4-traces:arm64 ~4m30s ~3m40s
v3d-rpi5-traces:arm64 ~3m30s ~2m45s
v3d-rpi4-gl-full:arm64 */6 ~24-25 minutes ~22-23 minutes
v3d-rpi5-gl-full:arm64 ~48 minutes ~48 minutes
v3dv-rpi4-vk-full:arm64 */6 ~44 minutes ~41 minutes
v3dv-rpi5-vk-full:arm64 ~102 minutes ~92 minutes

Seeing these reductions is especially rewarding. For example, the "v3dv-rpi5-vk-full:arm64" job duration decreased by 10 minutes, meaning more FPS for users and shorter wait times for Mesa developers.

Benchmarking Super Pages: PS2 Emulation

After sharing a couple of tables, I'll admit that showcasing performance improvements solely through numbers doesn't always convey the real impact. Personally, I find it more satisfying to see performance gains in action with real-world applications.

This led me to explore PlayStation 2 (PS2) emulation on the RPi 5. From watching YouTube videos, I noticed that PS2 is a popular console for the RPi 5. While the PlayStation (PS1) emulates well even on the RPi 4, and Nintendo 64 and Sega Saturn struggle across most hardware, PS2 hits a sweet spot for testing the RPi 5's limits.

Fortunately, I still have my childhood PS2 - my second console after the Nintendo GameCube, and one of the most successful consoles worldwide, including in Brazil. With a library packed with titles like Metal Gear Solid, Resident Evil, Tomb Raider, and Shadow of the Colossus, the PS2 remains a great system for collectors and retro gamers alike.

I selected a few games from my collection to benchmark on the RPi 5 using a PS2 emulator. My emulator of choice was Aether SX2 with Vulkan support. Although AetherSX2 is no longer in development, it still performs well on the RPi.

Initially, many games were barely playable, especially those with large buffer objects, like Shadow of the Colossus and Gran Turismo 4. However, after enabling Super Pages support, I noticed immediate improvements. For example, Shadow of the Colossus wouldn't even open before Super Pages, and while it's not fully playable yet, it does load now. This isn't a silver bullet, but it's a step forward in improving the driver one piece at a time.

I ended up selecting four games for a video comparison: Burnout 3: Takedown, Metal Gear Solid 3: Snake Eater, Resident Evil 4, and Tekken 4.

Disclaimer: The BIOS used in the emulator was extracted from my own PS2, and I played only games I own, with ROMs I personally extracted. Neither I nor Igalia encourage using downloaded BIOS or ROM files from the internet.

From the video, we can see noticeable improvements in all four games. Although they aren't perfectly playable yet, the performance gains are evident, particularly in Resident Evil 4, where the gameplay saw a solid 5 FPS boost. I realize 18 FPS might not satisfy most players, but I still had a lot of fun playing Resident Evil 4 on the RPi 5.

When tracking the FPS for these games, it's clear that the performance gains go well beyond the average 1% seen in other benchmarks. Super Pages show their true potential in high-memory applications like PS2 emulation.

Having seen the performance gains Super Pages can bring to the Raspberry Pi, let's now dive into the technical aspects of the feature.

Implementing Super Pages

The first challenge was figuring out how to allocate a contiguous block of memory using shmem. The Shared Memory Virtual Filesystem (shmem) is used as a flexible memory mechanism that allows the GPU and CPU to share access to BOs through the system's temporary filesystem, tmpfs. tmpfs is a volatile filesystem that stores files in RAM, making it ideal for temporary or high-speed data that doesn't need to persist on RAM.

For example, to allocate a 256KB BO across four 64KB pages, we need four contiguous 64KB memory blocks. However, by default, tmpfs only allocates memory in PAGE_SIZE chunks (as seen in shmem_file_setup()), whereas PAGE_SIZE is 4KB on the Raspberry Pi 4 and 16KB on the Raspberry Pi 5. Since the function drm_gem_object_init() - which initializes an allocated shmem-backed GEM object - relies on shmem_file_setup() to back these objects in memory, we had to consider alternatives, as the default PAGE_SIZE would divide memory into increments that are too small to ensure the large, contiguous blocks needed by the GPU.

The solution we proposed was to create drm_gem_object_init_with_mnt(), which allows us to specify the tmpfs mountpoint where the GEM object will be created. This enables us to allocate our BOs in a mountpoint that supports larger page sizes. Additionally, to ensure that our BOs are allocated in the correct mountpoint, we introduced drm_gem_shmem_create_with_mnt(), which allows the mountpoint to be specified when creating a new DRM GEM shmem object.

[PATCH v6 04/11] drm/gem: Create a drm_gem_object_init_with_mnt() function

[PATCH v6 06/11] drm/gem: Create shmem GEM object in a given mountpoint

The next challenge was figuring out how to create a new mountpoint that would allow for different page sizes based on the allocation. Simply creating a new tmpfs mountpoint with a fixed bigger page size wouldn't suffice, as we needed flexibility for various allocations. Inspired by the i915 driver, we decided to use a tmpfs mountpoint with the "huge=within_size" flag. This flag, which requires the kernel to be configured with CONFIG_TRANSPARENT_HUGEPAGE, enables the allocation of huge pages.

Transparent Huge Pages (THP) is a kernel feature that automatically manages large memory pages to improve performance without needing changes from applications. THP dynamically combines smaller pages into larger ones, typically 2MB, reducing memory management overhead and improving cache efficiency.

To support our new allocation strategy, we created a dedicated tmpfs mountpoint for V3D, called gemfs, which provides us an ideal space for managing these larger allocations.

[PATCH v6 05/11] drm/v3d: Introduce gemfs

With everything in place for contiguous allocations, the next step was configuring V3D to enable Big/Super Page support.

We began by addressing a major source of memory pressure on the Raspberry Pi: the current 128KB alignment for allocations in the virtual memory space. This alignment wastes space when handling small BO allocations, especially since the userspace driver performs a large number of these small allocations.

As a result, we can't fully utilize the 4GB address space available for the GPU on the Raspberry Pi 4 or 5. For example, we can currently allocate up to 32,000 BOs of 4KB (~140MB) and 3,000 BOs of 400KB (~1.3GB). This becomes a limitation for memory-intensive applications. By reducing the page alignment to 4KB, we can significantly increase the number of BOs, allowing up to 1,000,000 BOs of 4KB (~4GB) and 10,000 BOs of 400KB (~4GB).

Therefore, the first change I made was reducing the VA alignment of all allocations to 4KB.

[PATCH v6 07/11] drm/v3d: Reduce the alignment of the node allocation

With the alignment issue resolved, we can now implement the code to properly set the flags on the Page Table Entries (PTE) for Big/Super Pages. Setting these flags is straightforward - a simple bitwise operation. The challenge lies in determining which BOs can be allocated in Super Pages. For a BO to be eligible for a Big Page, its virtual address must be aligned to 64KB, and the same applies to its physical address. Same thing for Super Pages, but now the addresses must be aligned to 1MB.

If the BO qualifies for a Big/Super Page, we need to iterate over 16 4KB pages (for Big Pages) or 256 4KB pages (for Super Pages) and insert the appropriate PTE.

Additionally, we modified the way we iterate through the BO's memory. This was necessary because the THP may not always allocate the entire BO contiguously. For example, it might only allocate contiguously 1MB of a 2MB block. To handle this, we now iterate over the blocks of contiguous memory scattered across the scatterlist, ensuring that each segment is properly handled during the allocation process.

What is a scatterlist? It is a Linux Kernel data structure that manages non-contiguous memory as if it were contiguous. It organizes separate memory blocks into a single logical buffer, allowing efficient data handling, especially in Direct Memory Access (DMA) operations, without needing a physically contiguous memory allocation.

[PATCH v6 08/11] drm/v3d: Support Big/Super Pages when writing out PTEs

However, the last few patches alone don't fully enable the use of Super Pages. While PATCH 08/11 technically allows for Super Pages, we're still relying on DRM GEM shmem objects, meaning allocations are still happening in PAGE_SIZE chunks. Although Big/Super Pages could potentially be used if the system naturally allocated 1MB or 64KB contiguously, this is quite rare and not our intended outcome. Our goal is to actively use Big/Super Pages as much as possible.

To achieve this, we'll utilize the V3D-specific mountpoint we created earlier for BO allocation whenever possible. By creating BOs through drm_gem_shmem_create_with_mnt(), we can ensure that large pages are allocated contiguously when possible, enabling the consistent use of Big/Super Pages.

[PATCH v6 09/11] drm/v3d: Use gemfs/THP in BO creation if available

And there you have it - Big/Super Pages are now fully enabled in V3D. The only requirement to activate this feature in any given kernel is ensuring that CONFIG_TRANSPARENT_HUGEPAGE is enabled.

Final Words

You can learn more about ongoing enhancements to the Raspberry Pi driver stack in this XDC 2024 talk by José María "Chema" Casanova Crespo. In the talk, Chema discusses the Super Pages work I developed, along with other advancements in the driver stack.

Of course, there are still plenty of improvements on the horizon at Igalia. I'm currently experimenting with 64KB CLE allocations in user-space, and I hope to share more good news soon.

Finally, I'd like to express my gratitude to Iago Toral and Tvrtko Ursulin for their invaluable support in developing Super Pages for the V3D kernel driver. Thank you both for sharing your experience with me!

28 Oct 2024 12:00pm GMT

23 Oct 2024

feedplanet.freedesktop.org

Bastien Nocera: wireless_status kernel sysfs API

(I worked on this feature last year, before being moved off desktop related projects, but I never saw it documented anywhere other than in the original commit messages, so here's the opportunity to shine a little light on a feature that could probably see more use)

The new usb_set_wireless_status() driver API function can be used by drivers of USB devices to export whether the wireless device associated with that USB dongle is turned on or not.

To quote the commit message:

This will be used by user-space OS components to determine whether the
battery-powered part of the device is wirelessly connected or not,
allowing, for example:
- upower to hide the battery for devices where the device is turned off
  but the receiver plugged in, rather than showing 0%, or other values
  that could be confusing to users
- Pipewire to hide a headset from the list of possible inputs or outputs
  or route audio appropriately if the headset is suddenly turned off, or
  turned on
- libinput to determine whether a keyboard or mouse is present when its
  receiver is plugged in.
This is not an attribute that is meant to replace protocol specific
APIs [...] but solely for wireless devices with
an ad-hoc "lose it and your device is e-waste" receiver dongle.
 

Currently, the only 2 drivers to use this are the ones for the Logitech G935 headset, and the Steelseries Arctis 1 headset. Adding support for other Logitech headsets would be possible if they export battery information (the protocols are usually well documented), support for more Steelseries headsets should be feasible if the protocol has already been reverse-engineered.

As far as consumers for this sysfs attribute, I filed a bug against Pipewire (link) to use it to not consider the receiver dongle as good as unplugged if the headset is turned off, which would avoid audio being sent to headsets that won't hear it.

UPower supports this feature since version 1.90.1 (although it had a bug that makes 1.90.2 the first viable release to include it), and batteries will appear and disappear when the device is turned on/off.

A turned-on headset

23 Oct 2024 12:06pm GMT

20 Oct 2024

feedplanet.freedesktop.org

Simon Ser: Status update, October 2024

Hi!

This month XDC 2024 took place in Montreal. I wasn't there in-person, but thanks to the organizers I could still ask questions and attend workshops remotely (thanks!). As usual, XDC has been a great reminder of many things I wanted to do but which got buried under a pile of emails. We've discussed the upcoming KMS color management uAPI again, I've taken a bit of time to send more comments and it looks like this one is getting close to completion (famous last words). We've also discussed about display muxing (switching a connector from one GPU to another one), it's quite fun how surprisingly tricky this process is. Another topic was better multi-GPU support, in particular how to avoid going through the main GPU when an application is rendered and displayed on a secondary GPU. I've sent a proposal to improve the kernel DMA-BUF uAPI.

New this year was the Wayland workshop organized by Mike Blumenkrantz, Daniel Stone and Jonas Ă…dahl. We've discussed the governance change proposals sent earlier this month. Various changes are being discussed, all have the goal to lower the barrier to entry when contributing a protocol and preventing patches from getting stuck. I'm excited to see how this turns out!

We've finally started the release candidate cycle for Sway 1.10. I've released Sway 1.10-rc4 this weekend with a bunch more fixes, I'm hoping the final release can go out soon! I've also released the long overdue cage 0.2.0, which fast forwards wlroots to version 0.18 and adds primary selection support.

I've sent a patch to add a udmabuf allocator to wlroots. This is useful for running the wlroots GLES2 and Vulkan renderers with software rendering (e.g. llvmpipe and lavapipe), which is handy for CI and exercises the same codepaths as real hardware instead of the seldom used Pixman renderer.

wlroots-rs has been updated to wlroots v0.18, and I've revamped the way the compositor state is managed. Previously the library forced the use of Rc<RefCell<T>> to hold the state, which caused issues with double mutable borrows at runtime when compositor callbacks were nested (wlroots invokes compositor callback which borrows state and calls into wlroots which invokes another compositor callback which borrows state). With the new design the compositor must pass its state as an argument to all wlroots functions which may emit signals and call back into the compositor.

delthas has contributed a whole bunch of soju patches used by his new hosted bouncer service, IRC Today. Uploaded videos and PDF files can now be viewed inline in Web browsers, a new HTTP basic authentication backend has been added, file uploads can now be delegated to a separate HTTP backend, a new soju.im/SAFERATE specification indicates when clients don't need to rate-limit their messages, and a bunch of various smaller improvements and fixes. A bunch of exciting new features are in the pipeline as well (but I won't spoil them just yet)!

Matthew Hague has contributed TLS certificate pinning to Goguma. When hitting an invalid certificate, Goguma will now offer the user a choice to trust this specific certificate (trust on first use). gamja now supports drag-and-drop for file uploads thanks to xse. Both gamja and Goguma have moved to Codeberg, I hope this lowers the barrier to entry for contributing. A tiny NPotM is soju-containers¸ a repository containing Dockerfiles for soju and gamja, for easy deployment and testing.

Both hottub and yojo now have support for build secrets. For hottub, secrets are only enabled when the owner pushes commits (and enables the feature at setup time). For yojo, the owner needs to enable the feature at setup time and can then select specific secrets to expose on specific repositories. All of this is locked down to prevent collaborators from gaining access to arbitrary secrets when pushing to a repository.

That's all for now, see you next month!

20 Oct 2024 10:00pm GMT

15 Oct 2024

feedplanet.freedesktop.org

Mike Blumenkrantz: Recovery

Struggling

Last week was XDC. I did too much Wayland, and now I've been stricken with a plague for my hubris.

I have some updates, but I lack the ability to fully capture the exploits of Mesa's most sane developer in the ten minutes I'm awake every day. In the meanwhile, let's take a look another potential example of great hubris.

Hm.

Have you ever made a decision that seemed great at the time but then you realized later it was actually maybe not that great? Like, maybe it was actually really, uh, well, not dumb since nobody reading this blog would do something like that, but not…smart. And everyone else was kinda going along with your decision and trusting that you knew what you were talking about because let's face it, you're smart. Everyone knows how smart you are. That's why they trust you to make these decisions.

Long-time SGC readers know I'm not one to make decisions of any kind, but we all remember that time Microsoft famously introduced Work Graphs to D3D and also (quietly) deprecated ExecuteIndirect. The argument was compelling: why not just move all the work to the GPU?

Haters described Work Graphs as just another attempt by the driver cartel to blame bugs on app developers by making tooling impossible. The rest of us were all in-We jumped on that bandwagon like it was the last triangle in the pipe before a crash. It wasn't long before the high-powered players were aboard:

Details were light at this stage. There were no benchmarks, no performance numbers, no games or applications using Work Graphs, but everyone trusted Microsoft. Everyone knew the idea of this tech was sound, that it had to be faster.

Microsoft doubled down: Work Graphs would support mesh nodes for drawing!

Other graphics wizards began to get involved. The developerverse was in a tizzy. Everyone wanted in on the action.

The hype train had departed the station.

Hm?

Six months after GDC, the first notable performance figures for Work Graphs were blogged about by AAA graphics rockstar, Kostas Anagnostou. I was at a Khronos F2F when it happened, and the number of laptop screens open to the post when it dropped was nonzero. Very nonzero.

At best, the figures were whelming.

Still there was no real analysis of Work Graph performance in comparison to alternative solutions. Haters will say I'm biased after recently shipping Vulkan's device generated commands extension, but this was going to ship regardless since vkd3d-proton requires cross-vendor compatibility for ExecuteIndirect functionality used in games like Halo Infinite and Starfield. I'm all about the numbers. Show me the graphs. The perf graphs, that is.

Fortunately, friend of the blog and veteran vertex wrangler, Hans-Kristian Arntzen, always has my back. He's spent the past few months heroically writing vkd3d-proton emulation for Work Graphs, and he has recently posted his findings to an obscure README in that repository.

READ IT. SERIOUSLY. YES, THIS IS A FULL PAGE-WIDTH LINK SO YOU CAN'T POSSIBLY MISS IT.

If you're just here for the quick summary (which you shouldn't be considering how much time he has spent making charts and graphs, and taking screenshots, and summing everything up in bite-sized morsels for easy consumption):

The principle of charity requires taking serious claims in the best possible light. This should have yielded robust, powerful ExecuteIndirect benchmark usage (and even base compute/mesh shader usage) to provide competitive benchmarks against Work Graph functionality. At the time of writing, those benchmarks have yet to materialize, and the only test cases are closer to strawmen that can be held up for an easy victory.

I'm not saying that Work Graphs are inherently bad.

Yet.

At this point, however, I haven't seen compelling evidence which validates the hype surrounding the tech. I haven't seen great benchmarks and demos. Maybe it's a combination of that and still-improving driver support. Maybe it's as-yet available functionality awaiting future hardware. In any case, I haven't seen a strong, fact-based technical argument which proves, beyond a doubt, that this is the future of graphics.

Before anyone else tries to jump on the Work Graph hype train, I think we owe it to ourselves to thoroughly interrogate this new paradigm and make sure it provides the value that everyone expects.

15 Oct 2024 12:00am GMT

10 Oct 2024

feedplanet.freedesktop.org

Alyssa Rosenzweig: AAA gaming on Asahi Linux

Gaming on Linux on M1 is here! We're thrilled to release our Asahi game playing toolkit, which integrates our Vulkan 1.3 drivers with x86 emulation and Windows compatibility. Plus a bonus: conformant OpenCL 3.0.

Asahi Linux now ships the only conformant OpenGL®, OpenCL™, and Vulkan® drivers for this hardware. As for gaming… while today's release is an alpha, Control runs well!

Control

Installation

First, install Fedora Asahi Remix. Once installed, get the latest drivers with dnf upgrade --refresh && reboot. Then just dnf install steam and play. While all M1/M2-series systems work, most games require 16GB of memory due to emulation overhead.

The stack

Games are typically x86 Windows binaries rendering with DirectX, while our target is Arm Linux with Vulkan. We need to handle each difference:

There's one curveball: page size. Operating systems allocate memory in fixed size "pages". If an application expects smaller pages than the system uses, they will break due to insufficient alignment of allocations. That's a problem: x86 expects 4K pages but Apple systems use 16K pages.

While Linux can't mix page sizes between processes, it can virtualize another Arm Linux kernel with a different page size. So we run games inside a tiny virtual machine using muvm, passing through devices like the GPU and game controllers. The hardware is happy because the system is 16K, the game is happy because the virtual machine is 4K, and you're happy because you can play Fallout 4.

Fallout 4

Vulkan

The final piece is an adult-level Vulkan driver, since translating DirectX requires Vulkan 1.3 with many extensions. Back in April, I wrote Honeykrisp, the only Vulkan 1.3 driver for Apple hardware. I've since added DXVK support. Let's look at some new features.

Tessellation

Tessellation enables games like The Witcher 3 to generate geometry. The M1 has hardware tessellation, but it is too limited for DirectX, Vulkan, or OpenGL. We must instead tessellate with arcane compute shaders, as detailed in today's talk at XDC2024.

The Witcher 3

Geometry shaders

Geometry shaders are an older, cruder method to generate geometry. Like tessellation, the M1 lacks geometry shader hardware so we emulate with compute. Is that fast? No, but geometry shaders are slow even on desktop GPUs. They don't need to be fast - just fast enough for games like Ghostrunner.

Ghostrunner

Enhanced robustness

"Robustness" permits an application's shaders to access buffers out-of-bounds without crashing the hardware. In OpenGL and Vulkan, out-of-bounds loads may return arbitrary elements, and out-of-bounds stores may corrupt the buffer. Our OpenGL driver exploits this definition for efficient robustness on the M1.

Some games require stronger guarantees. In DirectX, out-of-bounds loads return zero, and out-of-bounds stores are ignored. DXVK therefore requires VK_EXT_robustness2, a Vulkan extension strengthening robustness.

Like before, we implement robustness with compare-and-select instructions. A naĂŻve implementation would compare a loaded index with the buffer size and select a zero result if out-of-bounds. However, our GPU loads are vector while arithmetic is scalar. Even if we disabled page faults, we would need up to four compare-and-selects per load.

load R, buffer, index * 16
ulesel R[0], index, size, R[0], 0
ulesel R[1], index, size, R[1], 0
ulesel R[2], index, size, R[2], 0
ulesel R[3], index, size, R[3], 0

There's a trick: reserve 64 gigabytes of zeroes using virtual memory voodoo. Since every 32-bit index multiplied by 16 fits in 64 gigabytes, any index into this region loads zeroes. For out-of-bounds loads, we simply replace the buffer address with the reserved address while preserving the index. Replacing a 64-bit address costs just two 32-bit compare-and-selects.

ulesel buffer.lo, index, size, buffer.lo, RESERVED.lo
ulesel buffer.hi, index, size, buffer.hi, RESERVED.hi
load R, buffer, index * 16

Two instructions, not four.

Next steps

Sparse texturing is next for Honeykrisp, which will unlock more DX12 games. The alpha already runs DX12 games that don't require sparse, like Cyberpunk 2077.

Cyberpunk 2077

While many games are playable, newer AAA titles don't hit 60fps yet. Correctness comes first. Performance improves next. Indie games like Hollow Knight do run full speed.

Hollow Knight

Beyond gaming, we're adding general purpose x86 emulation based on this stack. For more information, see the FAQ.

Today's alpha is a taste of what's to come. Not the final form, but enough to enjoy Portal 2 while we work towards "1.0".

Portal 2

Acknowledgements

This work has been years in the making with major contributions from…

… Plus hundreds of developers whose work we build upon, spanning the Linux, Mesa, Wine, and FEX projects. Today's release is thanks to the magic of open source.

We hope you enjoy the magic.

Happy gaming.

10 Oct 2024 5:00am GMT

04 Oct 2024

feedplanet.freedesktop.org

Dave Airlie (blogspot): zinking the video

A few years ago Mike and I discussed adding video support to zink, so that we could provide vaapi on top of vulkan video implementations.

This of course got onto a long TODO list and we nerdsniped each other into moving it along, this past couple of weeks we finally dragged it over the line.

This MR adds initial support for zink video decode on top of Vulkan Video. It provides vaapi support. Currently it only support H264 decode, but I've implemented AV1 decode and I've played around a bit with H264 encode. I think adding H265 decode shouldn't be too horrible.

I've tested this mainly on radv, and a bit on anv (but there are some problems I should dig into).


04 Oct 2024 1:00am GMT

Peter Hutterer: HIOCREVOKE merged for kernel 6.12

TLDR: if you know what EVIOCREVOKE does, the same now works for hidraw devices via HIDIOCREVOKE.

The HID standard is the most common hardware protocol for input devices. In the Linux kernel HID is typically translated to the evdev protocol which is what libinput and all Xorg input drivers use. evdev is the kernel's input API and used for all devices, not just HID ones.

evdev is mostly compatible with HID but there are quite a few niche cases where they differ a fair bit. And some cases where evdev doesn't work well because of different assumptions, e.g. it's near-impossible to correctly express a device with 40 generic buttons (as opposed to named buttons like "left", "right", ...[0]). In particular for gaming devices it's quite common to access the HID device directly via the /dev/hidraw nodes. And of course for configuration of devices accessing the hidraw node is a must too (see Solaar, openrazer, libratbag, etc.). Alas, /dev/hidraw nodes are only accessible as root - right now applications work around this by either "run as root" or shipping udev rules tagging the device with uaccess.

evdev too can only be accessed as root (or the input group) but many many moons ago when dinosaurs still roamed the earth (version 3.12 to be precise), David Rheinsberg merged the EVIOCREVOKE ioctl. When called the file descriptor immediately becomes invalid, any further reads/writes will fail with ENODEV. This is a cornerstone for systemd-logind: it hands out a file descriptor via DBus to Xorg or the Wayland compositor but keeps a copy. On VT switch it calls the ioctl, thus preventing any events from reaching said X server/compositor. In turn this means that a) X no longer needs to run as root[1] since it can get input devices from logind and b) X loses access to those input devices at logind's leisure so we don't have to worry about leaking passwords.

Real-time forward to 2024 and kernel 6.12 now gained the HIDIOCREVOKE for /dev/hidraw nodes. The corresponding logind support has also been merged. The principle is the same: logind can hand out an fd to a hidraw node and can revoke it at will so we don't have to worry about data leakage to processes that should not longer receive events. This is the first of many steps towards more general HID support in userspace. It's not immediately usable since logind will only hand out those fds to the session leader (read: compositor or Xorg) so if you as application want that fd you need to convince your display server to give it to you. For that we may have something like the inputfd Wayland protocol (or maybe a portal but right now it seems a Wayland protocol is more likely). But that aside, let's hooray nonetheless. One step down, many more to go.

One of the other side-effects of this is that logind now has an fd to any device opened by a user-space process. With HID-BPF this means we can eventually "firewall" these devices from malicious applications: we could e.g. allow libratbag to configure your mouse' buttons but block any attempts to upload a new firmware. This is very much an idea for now, there's a lot of code that needs to be written to get there. But getting there we can now, so full of optimism we go[2].

[0] to illustrate: the button that goes back in your browser is actually evdev's BTN_SIDE and BTN_BACK is ... just another button assigned to nothing particular by default.
[1] and c) I have to care less about X server CVEs.
[2] mind you, optimism is just another word for naïveté

04 Oct 2024 12:27am GMT

02 Oct 2024

feedplanet.freedesktop.org

Hans de Goede: IPU6 camera support in Fedora 41

I'm happy to announce that the last tweaks have landed and that the fully FOSS libcamera software ISP based IPU6 camera support in Fedora 41 now has no known bugs left. See the Changes page for testing instructions.

Supported hardware

Unlike USB UVC cameras where all cameras work with a single kernel driver, MIPI cameras like the Intel IPU6 cameras require multiple drivers. The IPU6 input-system CSI receiver driver is common to all laptops with an IPU6 camera, but different laptops use different camera sensors and each sensor needs its own driver and then there are glue ICs like the LJCA USB IO-expander and the iVSC (Intel Visual Sensing Controller) and there also is the ipu-bridge code which translates Windows oriented ACPI tables with sensor info into the fwnodes which the Linux drivers expect.

This means that even though IPU6 support has landed in Fedora 41 not all laptops with an IPU6 camera will work. Currently the IPU6 integrated in the following CPU models works if the sensor + glue hw/sw is also supported:


Jasper Lake and Meteor Lake also have an IPU6 but there is some more integration work necessary to get things to work there. Getting Meteor Lake IPU6 cameras to work is high on my TODO list.

The mainline kernel IPU6 CSI receiver + libcamera software ISP has been successfully tested on the following models:


To see which sensor your laptop has run: "ls /sys/bus/i2c/devices" this will show e.g. "i2c-INT3474:00" if you have an ov2740, with INT3474 being the ACPI Hardware ID (HID) for the sensor. See here for a list of currently known HID to sensor mappings. Note not all of these have upstream drivers yet. In that cases chances are that there might be a sensor driver for your sensor here.

We could really use help with people submitting drivers from there upstream. So if you have a laptop with a sensor which is not in the mainline but is available there, you know a bit of C-programming and you are willing to help, then please drop me an email so that we can work together to get the driver upstream.

1) on some ThinkPads the ov2740 sensor fails to start streaming most of the time. I plan to look into this next week and hopefully I can come up with a fix.

MIPI camera Integration work done for Fedora 41

After landing the kernel IPU6 CSI receiver and libcamera software ISP support upstream early in the Fedora 41 cycle, there still was a lot of work to do with regards to integrating this into the rest of the stack so that the cameras can actually be used outside of the qcam test app.

The whole stack looks like this "kernel → libcamera → pipewire | pipewire-camera-consuming-app". Where the 2 currently supported pipewire-camera consuming apps are Firefox and GNOME Snapshot.

Once this was all up and running testing found quite a few bugs which have all been fixed now:




comment count unavailable comments

02 Oct 2024 6:09pm GMT

27 Sep 2024

feedplanet.freedesktop.org

Ricardo Garcia: Waiter, there's an IES in my DGC!

Finally! Yesterday Khronos published Vulkan 1.3.296 including VK_EXT_device_generated_commands. Thousands of engineering hours seeing the light of day, and awesome news for Linux gaming.

Device-Generated Commands, or DGC for short, are Vulkan's equivalent to ExecuteIndirect in Direct3D 12. Thanks to this extension, originally based on a couple of NVIDIA vendor extensions, it will be possible to prepare sequences of commands to run directly from the GPU, and executing those sequences directly without any data going through the CPU. Also, Proton now has a much-more official leg to stand on when it has to translate ExecuteIndirect from D3D12 to Vulkan while you run games such as Starfield.

The extension not only provides functionality equivalent to ExecuteIndirect. It goes beyond that and offers more fine-grained control like explicit preprocessing of command sequences, or switching shaders and pipelines with each sequence thanks to something called Indirect Execution Sets, or IES for short, that potentially work with ray tracing, compute and graphics (both regular and mesh shading).

As part of my job at Igalia, I've implemented CTS tests for this extension and I had the chance to work very closely with an awesome group of developers discussing specification, APIs and test needs. I hope I don't forget anybody and apologize in advance if so.

  • Mike Blumenkrantz, of course. Valve contractor, Super Good Coder and current OpenGL Working Group chair who took the initial specification work from Patrick Doane and carried it across the finish line. Be sure to read his blog post about DGC. Also incredibly important for me: he developed, and kept up-to-date, an implementation of the extension for lavapipe, the software Vulkan driver from Mesa. This was invaluable in allowing me to create tests for the extension much faster and making sure tests were in good shape when GPU driver authors started running them.

  • Spencer Fricke from LunarG. Spencer did something fantastic here. For the first time, the needed changes in the Vulkan Validation Layers for such a large extension were developed in parallel while tests and the spec were evolving. His work will be incredibly useful for app developers using the extension in their games. It also allowed me to detect test bugs and issues much earlier and fix them faster.

  • Samuel Pitoiset (Valve contractor), Connor Abbott (Valve contractor), Lionel Landwerlin (Intel) and Vikram Kushwaha (NVIDIA) providing early implementations of the extension, discussing APIs, reporting test bugs and needs, and making sure the extension works as good as possible for a variety of hardware vendors out there.

  • To a lesser degree, most others mentioned as spec contributors for the extension, such as Hans-Kristian Arntzen (Valve contractor), Baldur Karlsson (Valve contractor), Faith Ekstrand (Collabora), etc, making sure the spec works for them too and makes sense for Proton, RenderDoc, and drivers such as NVK and others.

If you've noticed, a significant part of the people driving this effort work for Valve and, from my side, the work has also been carried as part of Igalia's collaboration with them. So my explicit thanks to Valve for sponsoring all this work.

If you want to know a bit more about DGC, stay tuned for future talks about this topic. In about a couple of weeks, I'll present a lightning talk (5 mins) with an overview at XDC 2024 in Montreal. Don't miss it!

27 Sep 2024 9:42am GMT

Mike Blumenkrantz: Unsticking The Very Sticky

Day 4 of Wayland governance hacking

I wake at 5 AM. This is the perfect time to wake up in NYC TZ, as it affords me the ability to eat a whole apple in the time it takes my little internet-browsing chromebook to load all the IRC and Discord backlogs from the five hours that I snuck away for a nap when nobody was watching.

I slather the apple with a haphazard scoop of peanut butter; getting away from a keyboard for more than twenty six seconds in a given stretch is difficult, and I need protein. While entering into a fraught negotiation over the meaning of 30-day discussion period with my left hand, I carefully scoop protein powder into a shaker with my right. There's no time to waste. Not even a single second-Another argument could break out, steal a 1973 Pontiac Firebird, and go joyriding on the wrong side of the freeway.

I'm writing this blog post with my toes. They know their way around a keyboard, but they're slow and prone to mistakes. My cat is in charge of hitting an oversized backspace key when I dangle his favorite toy over it. It'll be hours before we get something together that can be read coherently.

This is my life now.

This is what it takes to do Open Source.

Final Day: Everything, Everywhere, All At Once

I've put up a couple sizable proposals to resolve longstanding issues and oversights in the governance model. Today is Friday, however, which means it's the final day. Once we hit the weekend, everyone will collectively fuck off and forget everything that happened this week, which means I have to maintain peak velocity and finish strong.

Let's fucking go.

Last proposal.

Problem 1: HOW IS THIS %#$@$#@#$%%$ PROTOCOL STILL STUCK AFTER 4 YEARS?!?!?!?!?

It's a great question. I asked it myself. The answers are myriad and nebulous, but I'm the guy who explains things, so I'm gonna break it down.

Imagine you're wayland-protocols. You've got all these puppies. And you're walking them-so you tell yourself, but really they're walking themselves. They're walking you. And they're going in whatever direction they want. And out of all these puppies you've got two, one's trying to go left to chase a car, and the other one's trying to sniff a telephone pole on the right. The other fifty seven puppies just want to keep moving because they love their walkies. But these two puppies are the biggest ones, and they're pulling the others along with them. So now your leashes are getting all tangled, and you're being dragged around, and everyone's pointing at you because you look like you don't know what you're doing.

That's where we're at now. Everyone's laughing at you.

Look at this idiot trying to walk fifty nine puppies at once. This absolute moron. Who would ever do that? Why not just walk one or maybe two puppies at a time like everyone else? That's the way you're supposed to walk them. The way people have always walked them.

But you know what? Walking fifty nine puppies individually would take all day. Nobody has the time to walk fifty nine puppies individually no matter how cute or eloquent they are. So you need some way to resolve this. Or something.

Look, you get where I'm going with this.

Wayland protocol discussions get bogged down by people throwing out hypotheticals that can't truly be resolved, or by people talking past each other, or by people disappearing, or the phase of the moon, or any number of reasons, and there's no official way to get past these blockages. That's why I'm proposing tie-breaker votes as a simple way of moving past these problems when they arise.

Everyone understands tie-breakers: you vote, and the side with the most votes wins. It's that simple.

In this context, the wayland-protocols member projects vote (with one of them representing the author for non-members) and the majority wins. If there's another tie, the author gets to break it.

Simple. Done.

Problem 2: Perfect Is the Enemy of Good

Sometimes a protocol in staging/ is "good enough". The author has checked out, people are using the protocol, and everyone is happy with it.

But it's still not a stable/ protocol.

In this scenario, after an extended period of time without changes, any staging/ protocol can be nominated by a member project for stable/ promotion. Some discussion happens, and then it becomes stable.

Simple. Done.

Problem 3: Start Times

The governance model talks about discussion periods, but it doesn't specify exactly when they begin. For example, on any of my governance MRs, does the 30-day period start when I open the MR or when the MR is approved?

Obviously it starts when I open the MR. We gotta keep things moving.

Done.

Problem 4: Project Representation

The governance document specifies that a member project may have up to two official representatives. This can be problematic, as it puts pressure on 1-2 people to be on top of every active protocol discussion.

Instead, projects should be represented by as many individuals as they want (pending the usual process for adding points-of-contact). This ensures that protocols don't get blocked waiting for a given project to take a look when all representatives are busy. It also helps more diverse projects (e.g., wlroots) ensure that opinions from more of its constituents are officially represented.

Each project still only gets one vote, but now that vote can be more readily deployed and voiced.

I think we're done here?

From what I've seen, this should cover all the major issues that have been negatively impacting Wayland development. Sure, there are other, more minor issues, but I'm not aware of anything that can't be solved through good old person-to-person discussion.

Maybe all this works, and maybe it doesn't. But at least now if we decide to throw away some puppies, nobody can question whether we really tried everything.

27 Sep 2024 12:00am GMT

26 Sep 2024

feedplanet.freedesktop.org

Mike Blumenkrantz: Gettin Nacky

Rejection

It's hard. Nobody likes that feeling, especially after putting in a bunch of work, double-especially when that work is on a Wayland protocol.

That's right, the target of today's wayland-protocols governance update: NACKs.

A NACK is intended to mean something like:

this idea does not belong in wayland-protocols for [technical reason]

It's supposed to be the last resort when all other alternatives and gentler nudges have been exhausted.

There's been a lot of confusion over this concept over the years, specifically along the lines of:

I'm glad you asked.

Definition

I've put up a comprehensive proposal to reform and define the NACK. The short of it is:

This should cover all the basic cases. It's important to remember that a NACK can always be removed, which is to say that there's always room for discussion in Open Source.

If you're considering submitting a protocol proposal, don't worry too much about this! A NACK won't ever be the first thing you see, and you'll have ample time and room to discuss your ideas before anyone even considers bringing it up.

26 Sep 2024 12:00am GMT

Mike Blumenkrantz: Device Generated Commands

Big.

While other development has been progressing, in the background I've been working on something big. Now, finally, I can talk about it.

VK_EXT_device_generated_commands is a new extension which, it's no exageration to say, is the biggest thing Vulkan has shipped since ray-tracing. I had the privilege of working with people across the industry while driving it, from both desktop and mobile hardware vendors, and despite it being EXT, we're going to see some truly broad adoption here.

Big shoutout to Patrick Doane, formerly of Activision-Blizzard and now (I think) at Deviation Games, for kickstarting this many years ago. Thanks for your work. I hope you're satisfied with the final product.

What does this do?

DGC enables applications to record commands from shaders to then be executed directly. This means no more ping-ponging back and forth between CPU and GPU, which can help to eliminate performance bottlenecks. See also the NV extension and D3D12 ExecuteIndirect as prior art.

While this functionality is used in big games such as Starfield and Halo Infinite, those examples are ETOOBIG to really comprehend. Also the code is proprietary, so I can't share it publicly. Also I don't have the code.

Fortunately, I've hacked together a small demo program for people to look over to get a feel for the functionality.

dgcgears is a rough fork of vkgears from mesa-demos (thanks to zink's own godfather, Erik Faye-Lund for the original work!) which utilizes DGC to execute draws rather than record them directly.

Now here's where the crazy stuff starts.

Changing shaders from shaders

EXT DGC adds the ability to change shaders from shaders. By creating an Indirect Execution Set, multiple sets of shaders can be bundled together and indexed into from within shaders. dgcgears uses a different vertex shader to draw each gear.

While the NV extension had this functionality, EXT takes it further, enabling it to be supported on all hardware.

Shader Objects: fully supported

Another big feature of EXT DGC is that it is agnostic to pipelines vs shader objects vs whatever new stuff comes out in the future. If you prefer one over the other, you're free to go ahead and use that.

VKD3D-proton: supported

I've already written the code, and it should land at some point.

Drivers: supported

3

Device. Generated. Commands.

Count 'em.

26 Sep 2024 12:00am GMT

25 Sep 2024

feedplanet.freedesktop.org

Melissa Wen: Reflections on 2024 Linux Display Next Hackfest

Hey everyone!

The 2024 Linux Display Next hackfest concluded in May, and its outcomes continue to shape the Linux Display stack. Igalia hosted this year's event in A Coruña, Spain, bringing together leading experts in the field. Samuel Iglesias and I organized this year's edition and this blog post summarizes the experience and its fruits.

One of the highlights of this year's hackfest was the wide range of backgrounds represented by our 40 participants (both on-site and remotely). Developers and experts from various companies and open-source projects came together to advance the Linux Display ecosystem. You can find the list of participants here.

The event covered a broad spectrum of topics affecting the development of Linux projects, user experiences, and the future of display technologies on Linux. From cutting-edge topics to long-term discussions, you can check the event agenda here.

Organization Highlights

The hackfest was marked by in-depth discussions and knowledge sharing among Linux contributors, making everyone inspired, informed, and connected to the community. Building on feedback from the previous year, we refined the unconference format to enhance participant preparation and engagement.

Structured Agenda and Timeboxes: Each session had a defined scope, time limit (1h20 or 2h10), and began with an introductory talk on the topic.

Engaging Sessions: The hackfest featured a variety of topics, including presentations and discussions on how participants were addressing specific subjects within their companies.

Strengthening Community Connections: The hackfest offered ample opportunities for networking among attendees.

Fruitful Discussions and Follow-up

The structured agenda and breaks allowed us to cover multiple topics during the hackfest. These discussions have led to new display feature development and improvements, as evidenced by patches, merge requests, and implementations in project repositories and mailing lists.

With the KMS color management API taking shape, we discussed refinements and best approaches to cover the variety of color pipeline from different hardware-vendors. We are also investigating techniques for a performant SDR<->HDR content reproduction and reducing latency and power consumption when using the color blocks of the hardware.

Color Management/HDR

Color Management and HDR continued to be the hottest topic of the hackfest. We had three sessions dedicated to discuss Color and HDR across Linux Display stack layers.

Color/HDR (Kernel-Level)

Harry Wentland (AMD) led this session.

Here, kernel Developers shared the Color Management pipeline of AMD, Intel and NVidia. We counted with diagrams and explanations from HW-vendors developers that discussed differences, constraints and paths to fit them into the KMS generic color management properties such as advertising modeset needs, IN\_FORMAT, segmented LUTs, interpolation types, etc. Developers from Qualcomm and ARM also added information regarding their hardware.

Upstream work related to this session:

Color/HDR (Compositor-Level)

Sebastian Wick (RedHat) led this session.

It started with Sebastian's presentation covering Wayland color protocols and compositor implementation. Also, an explanation of APIs provided by Wayland and how they can be used to achieve better color management for applications and discussions around ICC profiles and color representation metadata. There was also an intensive Q&A about LittleCMS with Marti Maria.

Upstream work related to this session:

Color/HDR (Use Cases and Testing)

Christopher Cameron (Google) and Melissa Wen (Igalia) led this session.

In contrast to the other sessions, here we focused less on implementation and more on brainstorming and reflections of real-world SDR and HDR transformations (use and validation) and gainmaps. Christopher gave a nice presentation explaining HDR gainmap images and how we should think of HDR. This presentation and Q&A were important to put participants at the same page of how to transition between SDR and HDR and somehow "emulating" HDR.

We also discussed on the usage of a kernel background color property.

Finally, we discussed a bit about Chamelium and the future of VKMS (future work and maintainership).

Power Savings vs Color/Latency

Mario Limonciello (AMD) led this session.

Mario gave an introductory presentation about AMD ABM (adaptive backlight management) that is similar to Intel DPST. After some discussions, we agreed on exposing a kernel property for power saving policy. This work was already merged on kernel and the userspace support is under development.

Upstream work related to this session:

Strategy for video and gaming use-cases

Leo Li (AMD) led this session.

Miguel Casas (Google) started this session with a presentation of Overlays in Chrome/OS Video, explaining the main goal of power saving by switching off GPU for accelerated compositing and the challenges of different colorspace/HDR for video on Linux.

Then Leo Li presented different strategies for video and gaming and we discussed the userspace need of more detailed feedback mechanisms to understand failures when offloading. Also, creating a debugFS interface came up as a tool for debugging and analysis.

Real-time scheduling and async KMS API

Xaver Hugl (KDE/BlueSystems) led this session.

Compositor developers have exposed some issues with doing real-time scheduling and async page flips. One is that the Kernel limits the lifetime of realtime threads and if a modeset takes too long, the thread will be killed and thus the compositor as well. Also, simple page flips take longer than expected and drivers should optimize them.

Another issue is the lack of feedback to compositors about hardware programming time and commit deadlines (the lastest possible time to commit). This is difficult to predict from drivers, since it varies greatly with the type of properties. For example, color management updates take much longer.

In this regard, we discusssed implementing a hw_done callback to timestamp when the hardware programming of the last atomic commit is complete. Also an API to pre-program color pipeline in a kind of A/B scheme. It may not be supported by all drivers, but might be useful in different ways.

VRR/Frame Limit, Display Mux, Display Control, and more… and beer

We also had sessions to discuss a new KMS API to mitigate headaches on VRR and Frame Limit as different brightness level at different refresh rates, abrupt changes of refresh rates, low frame rate compensation (LFC) and precise timing in VRR more.

On Display Control we discussed features missing in the current KMS interface for HDR mode, atomic backlight settings, source-based tone mapping, etc. We also discussed the need of a place where compositor developers can post TODOs to be developed by KMS people.

The Content-adaptive Scaling and Sharpening session focused on sharpening and scaling filters. In the Display Mux session, we discussed proposals to expose the capability of dynamic mux switching display signal between discrete and integrated GPUs.

In the last session of the 2024 Display Next Hackfest, participants representing different compositors summarized current and future work and built a Linux Display "wish list", which includes: improvements to VTTY and HDR switching, better dmabuf API for multi-GPU support, definition of tone mapping, blending and scaling sematics, and wayland protocols for advertising to clients which colorspaces are supported.

We closed this session with a status update on feature development by compositors, including but not limited to: plane offloading (from libcamera to output) / HDR video offloading (dma-heaps) / plane-based scrolling for web pages, color management / HDR / ICC profiles support, addressing issues such as flickering when color primaries don't match, etc.

After three days of intensive discussions, all in-person participants went to a guided tour at the Museum of Extrela Galicia beer (MEGA), pouring and tasting the most famous local beer.

Feedback and Future Directions

Participants provided valuable feedback on the hackfest, including suggestions for future improvements.

Thank you for joining the 2024 Display Next Hackfest

We can't help but thank the 40 participants, who engaged in-person or virtually on relevant discussions, for a collaborative evolution of the Linux display stack and for building an insightful agenda.

A big thank you to the leaders and presenters of the nine sessions: Christopher Cameron (Google), Harry Wentland (AMD), Leo Li (AMD), Mario Limoncello (AMD), Sebastian Wick (RedHat) and Xaver Hugl (KDE/BlueSystems) for the effort in preparing the sessions, explaining the topic and guiding discussions. My acknowledge to the others in-person participants that made such an effort to travel to A Coruña: Alex Goins (NVIDIA), David Turner (Raspberry Pi), Georges Stavracas (Igalia), Joan Torres (SUSE), Liviu Dudau (Arm), Louis Chauvet (Bootlin), Robert Mader (Collabora), Tian Mengge (GravityXR), Victor Jaquez (Igalia) and Victoria Brekenfeld (System76). It was and awesome opportunity to meet you and chat face-to-face.

Finally, thanks virtual participants who couldn't make it in person but organized their days to actively participate in each discussion, adding different perspectives and valuable inputs even remotely: Abhinav Kumar (Qualcomm), Chaitanya Borah (Intel), Christopher Braga (Qualcomm), Dor Askayo, Jiri Koten (RedHat), Jonas Ådahl (Red Hat), Leandro Ribeiro (Collabora), Marti Maria (Little CMS), Marijn Suijten, Mario Kleiner, Martin Stransky (Red Hat), Michel Dänzer (Red Hat), Miguel Casas-Sanchez (Google), Mitulkumar Golani (Intel), Naveen Kumar (Intel), Niels De Graef (Red Hat), Pekka Paalanen (Collabora), Pichika Uday Kiran (AMD), Shashank Sharma (AMD), Sriharsha PV (AMD), Simon Ser, Uma Shankar (Intel) and Vikas Korjani (AMD).

We look forward to another successful Display Next hackfest, continuing to drive innovation and improvement in the Linux display ecosystem!

25 Sep 2024 1:50pm GMT

Mike Blumenkrantz: My Wayland Your Wayland Our Wayland

I <3 Open Source

That should be obvious by now, right? I've been out here blogging about Open Source stuff for over a decade, and occasionally I still have time to actually write code.

Haha. But seriously.

I believe in Open Source. I believe in the collaborative development model, the teamwork, the shared vision of contributing to projects that can stand up to and even surpass big-name proprietary products.

I believe in getting shit done. I believe in discussion, in deliberation, in review, but at the end of the day I also believe that people need to be realistic and accept compromises rather than dying on every fucking hill of a gitlab review comment. I believe in it enough that I'm sleeping five or fewer hours a night this week.

And believe it or not, I'm also a Wayland developer.

Conflict: Just Part of the Process

I work for Valve. I've blogged about what it's like working for Valve, and nothing has changed since then. It's still great, and Rhys Perry (not you, the other one) is still an unsung hero.

As I mentioned in that post, one of the great things about my job is the freedom to improve projects without managerial overhead. To self-determine. To see the goal in the distance and get there in a way that suits me.

I'm not the only person working for Valve, and I'm not the only one who enjoys this freedom. By now, everyone has seen the latest happenings with regards to the efforts to improve Wayland display mechanisms around refresh rates and swapchain-related stuttering. I sympathize with the frustration radiating out of this effort. I see myself in it; this is someone who saw a problem, saw a goal in the distance, and chose a path forward that suited them.

I don't believe there was malice or ill-will involved here, just growing frustration at a long-standing problem that was defying efforts to resolve it. I've been there. We've all been there. Everyone has had at least some Open Source interaction where a review has stalled, or gotten heated, or failed to progress in finite time for some reason.

It happens to be the case that wayland-protocols, as a project, has these sorts of interactions more often than other projects. Again, I don't believe there is malice or ill-will involved. Maybe I'm optimistic, or an idealist, or whatever, but I believe everyone participating in Wayland development wants it to succeed. Maybe they tunnel-vision too much, maybe they get sidetracked and disappear, maybe they get heated because WHY CAN'T YOU JUST UNDERSTAND THAT I KNOW WHTA I'M TALKNG ABOUT??!!1, maybe they post long walls of text that nobody really wants to read but inevitably you know there's some kind of a point hidden somewhere in there so you gotta just reach in and tweeze it out with your fingers like it's stuck between the cushions of your sofa so you can tell them what an idiot they are, etc. Again, we've all been there.

This is how Open Source works, though. This is how the sausage is made. It's not always pretty. It's not always easy. It's not free as in freedom or beer, it's free as in puppies: sometimes they just do what they want, and you gotta be okay with that.

Resolutions

Open Source has disagreements, and I'm okay with that. I still believe that at the end of the day, everyone is capable of stepping back, seeing the goal, and arriving at some sort of agreement that gets us all there together.

It's for that reason I've proposed some updates to wayland-protocols governance to address some of the systemic issues that have persisted since before there was official 'governance'. I agree that the system is a little broken, but I don't think the solution is to throw out the whole system.

We can't just throw away puppies. Well, we can, but it's complicated and people might get the wrong idea. We gotta have a real good reason to throw those puppies out on the street. If we don't at least say we tried to make it work, to get them to stop chewing on our shoes, and peeing on the sofa, and eating the sandwiches we forget about because we gotta go back and argue with those idiots in Mesa who can't understand the brilliance of our latest idea to ship OpenGL ray-tracing-if we can't at least show that we cannot fundamentally coexist with these puppies, then some people might think we aren't capable of being good dog owners. They might wonder whether their puppies are safe around us, or worse: whether their GPUs are.

The problems facing wayland-protocols are many, and my blogging time is (somehow) not infinite. In short:

The first problem is more tractable, and it's the one causing the most frustration at present, so that's where I started. An inability to land new protocols, even just for broader development purposes, hurts the ecosystem, both by stifling innovation and frustrating would-be contributors. Many solutions have been proposed over the years, though few have been official. There was one from @emersion last year that sort of almost gained traction but then also wasn't quite what people wanted. There was also…

That's it, actually. That's the only time an official proposal has been made to change the governance in a substantial way. Which is to say that though everyone knew and acknowledged problems exist, nobody (else) took the action to try solving it, as plainly spelled out by section 4.1 of the governance model document.

It's Open Source all the way down: Just create a merge request*. * Yes, I know it says only members can propose changes, but surely someone could have wrangled one of the members and drafted something? Surely the governance members would react positively to a good-faith proposal made by a non-member? We don't have to act like every other person on the planet is out to get us, do we?

To be clear, I'm not blaming anyone.

I get it.

Complaining about hard problems is way easier than fixing them. I complain a lot too. That's why I have this blog, where I can complain about spaghetti, bad memes, and drivers that are faster than minethat have yet to be surpassed-you know what SGC is about by now.

At the end of the day, however, sometimes this is what peak Open Source performance looks like:

dog-walking.png

25 Sep 2024 12:00am GMT

19 Sep 2024

feedplanet.freedesktop.org

Simon Ser: Status update, September 2024

Hi!

Once again, this status update will be rather short due to limited time bandwidth. I hope to be able to allocate a bit more time slots for my open-source projects next month.

We're getting closer to a new Sway release (fingers crossed), with lots of help from Kenny and Alexander to iron out the remaining bugs. We've just shipped wlroots 0.18.1 today (thanks to Simon Zeni for leading the backporting efforts!). I've been expanding wlroots' explicit synchronization support by adapting our multi-GPU logic, the Vulkan renderer and the libliftoff backend. I've released wayland 1.23.1 with some Clang and wayland-scanner fixes. I've ported the cage kiosk compositor to wlroots 0.18. Last but not least, I've rewritten makoctl in C because shell scripts only get you so far.

I've been giving feedback and contributing to KDE's SVG cursor spec. The cursor theme landscape isn't in a great spot at the moment, because we're stuck with XCursor images. Now that the cursor-shape protocol is gaining adoption there is an opportunity to more easily switch the underlying image format. Thanks to KDE folks for pushing this forward! I'd really like to see the spec standardized under the freedesktop.org umbrella.

delthas has been contributing some nifty new features to soju: admins can now configure per-user network count limits, can now impersonate a user via SASL, and the file upload endpoint now sends back an error early when the file is too large. soju 0.8.2 has been released with a bunch of bug fixes.

The NPotM is varlinkgen (better name TBD). It's a Varlink C library and code generator. If you've been following my projects for a while, you probably know how much I love code generators producing type-safe APIs from schemas. I must admit, I appreciate Varlink's simplicity and lack of central bus. I plan to use varlinkgen in kanshi, and maybe other daemons in need of an IPC.

See you next month!

19 Sep 2024 10:00pm GMT