14 Sep 2024
planet.freedesktop.org
Hans de Goede: Fedora plymouth boot splash not showing on systems with AMD GPUs
Recently there have been a number of reports (bug 2183743, bug 2276698, bug 2283839, bug 2312355) about the plymouth boot splash not showing properly on PCs using AMD GPUs.
The problem with plymouth and AMD GPUs is that the amdgpu driver is a really, really big driver, which can easily take up to 10 seconds to load on older PCs. This delay may cause plymouth to time out while waiting for the GPU to be initialized, making it fall back to the 3-dot text-mode boot splash.
There are two workarounds for this, depending on the PC's configuration:
1. On older AMD GPUs the radeon driver is actually used to drive the GPU, but the amdgpu driver still loads even though it is unused, slowing things down.
To check if this is the case for your PC, start a terminal in a graphical login session and run "lsmod | grep -E '^radeon|^amdgpu'". This will output something like this:
amdgpu 17829888 0
radeon 2371584 37
The second number after each driver is the usage count. As you can see, in this example the amdgpu driver is not used. In this case you can disable loading of the amdgpu driver by adding "modprobe.blacklist=amdgpu" to your kernel commandline:
sudo grubby --update-kernel=ALL --args="modprobe.blacklist=amdgpu"
2. If the amdgpu driver is actually used on your PC, then plymouth not showing can be worked around by telling plymouth to use the simpledrm drm/kms device created from the EFI framebuffer early in boot, rather than waiting for the real GPU driver to load. Note this depends on your PC booting in EFI mode. To do this run:
sudo grubby --update-kernel=ALL --args="plymouth.use-simpledrm"
After applying one of these workarounds, plymouth should show normally again on boot (and booting should be a bit faster).
14 Sep 2024 1:38pm GMT
06 Sep 2024
planet.freedesktop.org
Mike Blumenkrantz: Architechair
What Am I Even Doing
It was some time ago that I created my first MR touching WSI stuff.
That was also the first time I broke Mesa.
Did I learn anything?
The answer is no, but then again it would have to be given the topic of this sleep-deprived post.
Maybe I'm The Problem
WSI has a lot of issues, but most of them stem from its mere existence. If people stopped wanting to see the triangles, we would all have easier lives and performance would go through the fucking roof. That's ignoring the raw sweat and verbiage dedicated to insane ideas like determining the precise time at which the triangles should be made visible on a display or literally how are colors.
I'm nowhere near as smart as the people arguing about these things: I'm the guy who plays jenga with the tower constructed from popsicle sticks, marshmallow fluff, and wishful thinking. That's why, a while ago, I declared war on DRI interfaces and then also definitely won that war without any issues. In fact, it works perfectly.
But why did I embark upon this journey which required absolutely no fixups?
The answer lies in architecture. In the before-times, DRI (a massively overloaded acronym that no longer means anything) allowed Xorg to plug directly into Mesa to utilize hardware-accelerated rendering. It sidestepped the GL API in favor of a contract with Mesa that certain API would never change. And that was great for Xorg since it provided an optimal path to do xserver stuff. But it was (eventually) terrible for Mesa.
Renegotiation
When an API contract is made, it remains binding forever. A case when the contract is broken is called a Bug Report. Mesa has no bugs, however, except for the ones I didn't cause, and so this DRI contract that enables Xorg to shortcut more sensible APIs like EGL remains identical to this day, decades later. What is not identical, however, is Mesa.
In those intervening years, Mesa has developed into an entire ecosystem for driver development and other, less sane ideas. Gallium was created and then became the only method for implementing GL drivers. EGL and GBM are things now. But still, that DRI contract remains binding. Xorg must work. Like that one reviewer who will suggest changes for every minuscule flaw in your stupid, idiotic, uneducated, cretinous abuse of whitespace, it is not going away.
DRIL was the method by which Mesa could finally unshackle itself. The only parts of DRI still used by Xorg are for determining rendertarget capabilities, effectively eglGetConfigs. So @ajax and I punted out the relevant API into a stub which is mostly just a wrapper around eglGetConfigs. This enabled change and cleanup in every part of the codebase that was previously immutable.
Bidirectional Hell
As anyone who has tried to debug Mesa's DRI frontend knows, it sucks. It's one of the worst pieces of code to debug. A significant reason for this is (was) how the DRI callback system perpetuated circular architecture.
At the time of DRIL's merge, a user of GLX/EGL/GBM would engage with this sort of control flow:
- GLX/EGL/GBM API call
- direct API internals
- function pointer into gallium/frontends/dri
- DRI frontend
- function pointer back to GLX/EGL/GBM
- <loop back to 2 until operation completes>
- return to user
In terms of functionality, it was functional. But debugging at a glance was impossible, and trying to eyeball any execution path required the type of PhD held by fewer than five people globally. The cyclical back-and-forth function pointering was a vertical cliff of a learning curve for anyone who didn't already know how things worked, and even things as "simple" as eglInitialize went through several impenetrable cycles of idiot-looping to determine success or failure. The absolute state of it made adding new features a nightmarish and daunting prospect, and reviewing any changes had, at best, even odds of breaking things because of how difficult it is to test this stuff.
Better Now?
Maybe.
The juiciest refactoring is over, and now function pointering only occurs when the DRI frontend needs to access API-specific data for its drawables. It's actually possible to follow execution just by reading the code. Not that it's necessarily easy, but it's possible.
There's still a lot of work to be done here. There's still some corner case bugs with DRIL, there's probably EGL issues that have yet to be discovered because much of that code is still fairly opaque, and half the codebase is still prefixed with dri2_.
At the least, I think it's now possible to work on WSI in Mesa and have some idea what's going on. Or maybe I've just been down in the abyss for so long that I'm the one staring back.
Onward
I've been cooking. I mean like really cooking. Expect big things related to the number 3 later this month.
* UPDATE: At the urging of my legal team, I've been advised to mention that no part of this post, blog, or site has any association with, bearing on, or endorsement from Half Life 3.
06 Sep 2024 12:00am GMT
04 Sep 2024
planet.freedesktop.org
Tvrtko Ursulin: DRM scheduling cgroup controller
Introduction #
The topic of a Direct Rendering Manager (DRM) cgroup controller is something which has been proposed a few times in the past, but so far is still missing from the Linux graphics stack. Some of those attempts were focusing on controlling the GPU memory usage aspect, while some were concerned with scheduling. As I am continuing to explore this area as part of my work at Igalia, in this post we will discuss one possible way of implementing the latter.
The general problem statement we are trying to address is the fact that many GPUs (and their respective kernel drivers) can simultaneously schedule workloads from different clients, and that there are use-cases where having external control over scheduling decisions would be beneficial.
But first, to clarify what we mean by "external control": by that term we refer to the scheduling decisions being influenced from outside of the actual process doing the rendering. If we were to draw a parallel to CPU scheduling, that would be the difference between a process (or a thread) issuing a system call such as setpriority(2) or nice(2) itself ("internal control"), versus its scheduling priority being modified by an external entity such as the user issuing the renice(1) shell command, launching the executable via the nice(1) shell command, or even using the CPU scheduling cgroup controller ("external control").
This has two benefits. Firstly, it is the user who typically knows which tasks are higher priority and which should run in the background, and should therefore be prevented, as much as possible, from starving the foreground tasks of resources. Secondly, external control can be applied to any process in a unified manner, without the need for applications to individually expose the means to control their scheduling priority.
If we now return to the world of GPU scheduling, we find ourselves in a landscape where internal scheduling control is possible with many GPU drivers, but external control is not. Improving on that poses some technical and conceptual challenges, because GPUs are not as nice and uniform in their scheduling needs and capabilities as CPUs are, but if we can come up with something reasonable, even if not perfect, it could bring improvements to the user experience in a variety of scenarios.
Past attempts - Priority based controllers #
The earliest attempt I can remember was from 2018, by Matt Roper[1], who proposed to implement a driver-specific priority based controller. The RFC limited itself to i915 (kernel driver for Intel GPUs) and, although the priority-based setup is well established in the world of CPU scheduling, and it is easy to understand its effects, the proposal did not gain much traction.
Because of the aforementioned advantages, when I proposed my version of the controller in 2022[2], it also included a slightly different version of a priority-based controller. In contrast to the earlier one, this proposal was in principle driver-agnostic and the priority levels were also abstracted.
The proposal was also accompanied by benchmark results showing that the approach was effective in allowing users on Linux to launch GPU tasks in the background, while leaving more GPU bandwidth to the foreground task than when not using the controller. Similarly on ChromeOS, when wired into the focused versus un-focused window cgroup management, it was able to demonstrate relatively more GPU time given to the foreground window.
Current proposal - Weight based controller #
Anticipating the potential lack of sufficient support for this approach, the same RFC also included a second controller which takes a different route. It abstracts things one step further and implements a weight based controller based on GPU utilisation[3].
The basic idea is that the GPU time budget is split based on relative group weights across the cgroup hierarchy, and that the controller notifies the individual DRM drivers when their clients are over budget. From there it is left for the individual drivers to know how to best manage this situation, depending on the specific scheduling capabilities of the driver and the GPU hardware.
The user interface completely mimics the existing CPU and IO cgroup controllers with the single drm.weight control file. The weights carry no absolute meaning and are only relative within a single group of siblings. Their only purpose is to split out the time budget between them.
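To make this more concrete, a hypothetical configuration of two sibling groups could look roughly like the sketch below. The paths assume a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the proposed controller enabled, and this illustrates the proposed interface rather than something already merged upstream:
mkdir /sys/fs/cgroup/foreground /sys/fs/cgroup/background
echo 100 > /sys/fs/cgroup/foreground/drm.weight
echo 10 > /sys/fs/cgroup/background/drm.weight
echo "$GAME_PID" > /sys/fs/cgroup/foreground/cgroup.procs
echo "$ENCODER_PID" > /sys/fs/cgroup/background/cgroup.procs
The GAME_PID and ENCODER_PID variables are placeholders for whichever processes one wants in each group; the pattern is the same as with the existing cpu.weight and io.weight files.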
Visually one potential cgroup configuration could look like this:
The DRM cgroup controller then executes a periodic scanning task which queries each DRM client for its GPU usage and notifies drivers when clients are over their allocated budget.
If we expand the concept with runtime adjustment of group weights based on window focus status, with two graphically active clients such as a game and a web browser, we can end up with the following two scenarios:
Here we show the actual GPU utilisation of each group together with their drm.weight. On the left hand side the web browser is the focused window, with the weights 100-to-10 in its favour.
The compositor is not using its full 200 / (200 + 100) share, so a portion is passed on to the desktop group, up to the full 80% that group requires. Inside the desktop group the game is currently using 70%, while its actual allocation is 80% * (10 / (100 + 10)) = 7.27%. It is therefore consuming more than its budget, and the corresponding DRM driver will be notified by the controller and will be able to do something about it.
After the user has given focus to the game window, relative weights will be adjusted and so will the budgets. Now the web browser will be over budget and therefore it can be throttled down, limiting the effect of its background activity on the foreground game window.
First driver implementation - i915 #
Back when I started developing this idea Intel GPU's were my main focus, which is why i915 was the first driver I wired up with the controller.
There I implemented a rather simple approach of dynamically adjusting the scheduling priority of the throttled contexts, by an amount proportional to how much the client is over budget in relative terms.
The implementation would also cross-check against the physical engine utilisation, since in i915 we have easy access to that metric, and only throttle if the latter is close to being fully utilised. (Why this makes sense could be an interesting digression relating to the fact that a single cgroup can in theory contain multiple GPUs and multiple clients using a mix of those GPUs. But let's leave that for later.)
One of the scenarios I used to test how well this works is to run two demanding GPU clients, each in its own cgroup, tweak their relative weights, and see what happens. The results were encouraging and are shown in the following table.
We can see that, when a client's group weight was decreased, the GPU bandwidth it was receiving also went down, as a consequence of the lowered context priority after receiving the over-budget notification.
This is a suitable moment to mention that the DRM cgroup controller does not promise perfect control, that is, achieving the actual GPU sharing ratios as expressed by group-relative weights. As we have mentioned before, GPU scheduling is not nearly at the same level of quality and granularity as in the CPU world, so the goal it sets is simply to improve things - to do something which has a positive impact on user experience. At the same time, the mechanism and control interface proposed do not preclude individual drivers doing as good a job as they can. Or even a future possibility of replacing the inner workings of the controller with something smarter, with no need to change the user space control interface.
Going back to the initial i915 implementation, the second test I have done was attempting to wire it up with the background/foreground window focus handling in ChromeOS. There I experimented with a game (Android VM) running in parallel with a WebGL demo in a browser. At a certain point after both clients were running I lowered the weight of the background game, and in the screenshot below we can see how the FPS metric in the browser jumped up.
This illustrates how having the controller can indeed improve the user experience. The user's focus will be on the foreground window and therefore it does make sense to prioritise GPU access for that client, for better interactivity and smoother rendering there. In fact, in this example the actual FPS jumped from around 48-49 to 60fps, meaning that throttling the background client has allowed the foreground one to match its rendering to the display's refresh rate.
Second implementation - amdgpu #
AMD's kernel module was the next interesting driver which I wired up with the controller.
The fact that its scheduling is built on top of the DRM scheduler with only three distinct priority levels mandated a different approach to throttling. We keep a sorted list of "most offending" clients (most out of budget, or most borrowed unused budget from the sibling group), with the idea that the top client on that list gets throttled by lowering its scheduling priority. That was relatively straightforward to implement and sounded like it could potentially satisfy the most basic use case of background task isolation.
To test the runtime behaviour we set up two sibling cgroups and vary their relative scheduling weights. In one cgroup we run glxgears with vsync turned off and log its frame rate over time, while in the second group we run glmark2.
Let us first have a look at how the glxgears frame rate varies during this test, depending on three different scheduling weight ratios between the cgroups. The scheduling weight ratio is expressed as glxgears:glmark2, i.e. 10:1 means glxgears' scheduling weight was ten times that configured for glmark2.
We can observe that, as the glmark2 is progressing through its various sub-benchmarks, glxgears frame rate is changing too. But it was overall higher in the runs where the scheduling weight ratio was in its favour. That is a positive result showing that even a simple implementation seems to be having the desired effect, at least to some extent.
For the second test we can look from the perspective of glmark2, checking how the benchmark scores change depending on the ratio of scheduling weights.
Again we see that the scores are generally improving when the scheduling weight ratio is increased in favour of the benchmark.
However, in neither case is the change in the result proportional to the actual ratios. This is because the primitive implementation is not able to precisely limit the "background" client, but is only able to achieve some throttling. Also, there is an inherent delay in how fast the controller can react, given the control loop is based on periodic scanning. This period is configurable and was set to two seconds for the above tests.
Conclusion #
Hopefully this write-up has managed to demonstrate two main points:
- First, that a generic and driver agnostic approach to a DRM scheduling cgroup controller can improve user experience and enable new use cases, while at the same time following the established control interface as it exists for CPU and IO control, which makes it future-proof and extendable;
- Secondly, that even relatively basic driver implementations can be somewhat effective in providing positive control effects.
It also probably needs to be re-iterated that neither the driver implementations nor the cgroup controller implementation itself are limited by the user interface proposed. Both could be independently improved under the hood in the future.
What is next? There is more work to be done such as conducting more detailed testing, polishing the implementation and potentially attempting to wire up more drivers to the controller. Further advocacy work in the DRM community too.
References #
04 Sep 2024 12:00am GMT
30 Aug 2024
planet.freedesktop.org
Dave Airlie (blogspot): On Rust, Linux, developers, maintainers
There have been a couple of mentions of Rust4Linux in the past week or two, one from Linus on the speed of engagement and one about Wedson departing the project due to non-technical concerns. This got me thinking about project phases and developer types.
Archetypes:
1. Wayfinders/Mapmakers
2. Road builders
3. Road maintainers
Interactions:
The interaction between wayfinders and maintainers is the most difficult one. Wayfinders like to move freely and quickly, while maintainers have other priorities that slow them down. I believe there need to be road builders engaged between the wayfinders and maintainers.
Road builders have to be willing to expend the extra time to resolve roadblocks in the best way possible for all parties. The time it takes to resolve a single roadblock may be greater than the time expended on the whole wayfinding expedition, and this frustrates wayfinders. The builder has to understand what the maintainers' concerns are and where they come from, and why the wayfinder made certain decisions. They work via education and trust building to get them aligned to move past the block. They then move down the road and repeat this process until the road is open. How this is done might change depending on the type of maintainers.
Maintainer types:
1. Positive and engaged
2. Positive with real concerns
Agrees with the road's direction, might not like some of the intersections, willing to be educated and give feedback on newer intersection designs. Moves to group 1 or trusts that others are willing to maintain intersections on their road.
3. Negative with real concerns
4. Negative and unwilling
5. Don't care/Disengaged
Where are we now?
I think my request from this is that contributors should try and identify the archetype they currently resonate with and find the next group over to interact with.
For wayfinders, it's fine to just keep wayfinding, just don't be surprised when the road building takes longer, or the road that gets built isn't what you envisaged.
For road builders, just keep building, find new techniques for bridging gaps and blowing stuff up when appropriate. Figure out when to use higher authorities. Take the high road, and focus on the big picture.
For maintainers, try and keep up with modern road building, don't say 20 year old roads are the pinnacle of innovation. Be willing to install the rumble strips, widen the lanes, add crash guardrails, and truck safety offramps. Understand that wayfinders show you opportunities for longer term success and that road builders are going to keep building the road, and the result is better if you engage positively with them.
30 Aug 2024 1:52am GMT
17 Aug 2024
planet.freedesktop.org
Simon Ser: Status update, August 2024
Hi!
After months of bikeshedding finishing touches we've finally merged ext-image-capture-source-v1 and ext-image-copy-capture-v1 in wayland-protocols! These two new protocols supersede the old wlr-screencopy-v1 protocol. They unlock some nice features such as toplevel and cursor capture, as well as improved damage tracking. Thanks a lot to Andri Yngvason! He's written a blog post about the new protocols with more details. The wlroots MR doesn't have toplevel capture implemented yet, but that's next on the TODO list.
In other Wayland news, we've merged full support for explicit synchronization in wlroots. This generally results in a better system architecture than implicit synchronization, reduces over-synchronization for complicated pipelines, and makes wlroots work correctly with drivers lacking implicit synchronization support (e.g. NVIDIA).
Alexander has implemented automatic X11 surface restacking in wlroots' scene-graph. That way, all scene-graph compositors get proper X11 stack handling for free (Sway's implementation was buggy). This should fix issues where the X11 server and the compositor don't have the same idea of the relative ordering of surfaces, resulting in clicks going "through" windows or reaching invisible windows.
Ricardo Steijn has contributed Sway support for tearing-control-v1. This allows users to opt-in to immediate page-flips which don't wait for the vertical sync point (VSync) to program new frames into the hardware. For tearing to be enabled, two conditions need to be fulfilled: tearing needs to be enabled per-output via the output allow_tearing command, and tearing needs to be enabled per-application either via the tearing-control-v1 Wayland protocol or manually via the window allow_tearing command. I've also pushed kernel patches from André Almeida and me to fix a few bugs around tearing page-flips with the atomic KMS API, so once these land forcing the legacy KMS API shouldn't be necessary anymore.
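As a rough sketch, the two settings could look like this in the Sway configuration file (the output name and app_id are placeholders, and the exact syntax is best double-checked against the sway man pages):
output DP-1 allow_tearing yes
for_window [app_id="some-game"] allow_tearing yes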
drm_info v2.7.0 has been released with a few new features and cleanups. Support for DRM_CLIENT_CAP_CURSOR_PLANE_HOTSPOT and DRM_CAP_ATOMIC_ASYNC_PAGE_FLIP has been added, and a new flag has been introduced to display information from a JSON dump.
Last, I've released a new version of go-maildir with a brand new API. Instead of referring to messages by their Maildir key and fishing back their full filename on each operation, the API exposes a Message type. It should be much nicer to use than the previous one.
That's all for August, see you next month!
17 Aug 2024 10:00pm GMT
04 Aug 2024
planet.freedesktop.org
Matthias Klumpp: Freedesktop Specs Website Update
The Freedesktop.org Specifications directory contains a list of common specifications that have accumulated over the decades and define how common desktop environment functionality works. The specifications are designed to increase interoperability between desktops. Common specifications make the life of both desktop-environment developers and especially application developers (who will almost always want to maximize the number of Linux DEs their app can run on and behave as expected on, to increase their app's target audience) a lot easier.
Unfortunately, building the HTML specifications and maintaining the directory of available specs has become a bit of a difficult chore, as the pipeline for building the site has become fairly old and unmaintained (parts of it still depended on Python 2). In order to make my life of maintaining this part of Freedesktop easier, I aimed to carefully modernize the website. I do have bigger plans to maybe eventually restructure the site to make it easier to navigate and not just a plain alphabetical list of specifications, and to integrate it with the Wiki, but in the interest of backwards compatibility and to get anything done in time (rather than taking on a mega-project that can't be finished), I decided to just do the minimum modernization first to get a viable website, and do the rest later.
So, long story short: Most Freedesktop specs are written in DocBook XML. Some were plain HTML documents, some were DocBook SGML, a few were plaintext files. To make things easier to maintain, almost every specification is written in DocBook now. This also simplifies the review process and we may be able to switch to something else like AsciiDoc later if we want to. Of course, one could have switched to something else than DocBook, but that would have been a much bigger chore with a lot more broken links, and I did not want this to become an even bigger project than it already was and keep its scope somewhat narrow.
DocBook is a markup language for documentation which has been around for a very long time, and therefore has older tooling around it. But fortunately our friends at openSUSE created DAPS (DocBook Authoring and Publishing Suite) as a modern way to render DocBook documents to HTML and other file formats. DAPS is now used to generate all Freedesktop specifications on our website. The website index and the specification revisions are also now defined in structured TOML files, to make them easier to read and to extend. A bunch of specifications that had been missing from the original website are also added to the index and rendered on the website now.
Originally, I wanted to put the website live in a temporary location and solicit feedback, especially since some links have changed and not everything may have redirects. However, due to how GitLab Pages worked (and due to me not knowing GitLab CI well enough…) the changes went live before their MR was actually merged. Rather than reverting the change, I decided to keep it (as the old website did not build properly anymore) and to see if anything breaks. So far, no dead links or bad side effects have been observed, but:
If you notice any broken link to specifications.fd.o or anything else weird, please file a bug so that we can fix it!
Thank you, and I hope you enjoy reading the specifications with better rendering and a more coherent look!
04 Aug 2024 6:54pm GMT
02 Aug 2024
planet.freedesktop.org
Mike Blumenkrantz: Juicy
REVIEWERS ARE ASLEEP POST DUMP TRUCKS
02 Aug 2024 12:00am GMT
31 Jul 2024
planet.freedesktop.org
Tomeu Vizoso: Etnaviv NPU update 20: Fast object detection on the NXP i.MX 8M Plus SoC
I'm happy to announce that my first project regarding support for the NPU in NXP's i.MX 8M Plus SoC has reached the feature complete stage.
CC BY-NC 4.0 Henrik Boye
For the last several weeks I have been working full-time on adding support for the NPU to the existing Etnaviv driver. Most of the existing code that supports the NPU in the Amlogic A311D was reused, but NXP used a much more recent version of the NPU IP so some advancements required new code, and this in turn required reverse engineering.
This work has been kindly sponsored by the Open Source consultancy Ideas On Board, for which I am very grateful. I hope this will be useful to those companies that need full mainline support in their products, even if it is just the start.
This company is unique in working on both NPU and camera drivers in Linux mainline, so they have the best experience for products that require long term support and vision processing.
Since the last update I have fixed the last bugs in the compression of the weights tensor and implemented support for a new hardware-assisted way of executing depthwise convolutions. Some improvements to how the tensor addition operation is lowered to convolutions were needed as well.
Performance is pretty good already, allowing for detecting objects in video streams at 30 frames per second, so at a similar performance level as the NPU in the Amlogic A311D. Some performance features are left to be implemented, so I think there is still substantial room for improvement.
31 Jul 2024 1:09pm GMT
26 Jul 2024
planet.freedesktop.org
Alberto Ruiz: Booting with Rust: Chapter 2
In a previous post I gave the context for my pet project ieee1275-rs, a framework to build bootable ELF payloads on Open Firmware (IEEE 1275). OF is a standard developed by Sun for SPARC, aimed at providing a standardized firmware interface that was rich and nice to work with; it was later adopted by IBM and Apple for POWER, and even by the OLPC XO.
The crate is intended to provide a similar set of facilities as uefi-rs, that is, an abstraction over the entry point and the interfaces. I started the ieee1275-rs crate specifically for IBM's POWER platforms, although if people want to provide support for SPARC, G3/4/5s and the OLPC XO I would welcome contributions.
There are several ways the firmware takes a payload to boot; in Fedora we use a PReP partition, which is a ~4MB partition labelled with the 41h type in MBR or 9E1A2D38-C612-4316-AA26-8B49521E5A8B as the GUID in the GPT table. The ELF is written as raw data in the partition.
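As an illustration, a hypothetical way to create such a partition on a GPT disk image with sgdisk (whose 4100 type code maps to the PReP boot GUID; adjust the size and layout to your needs) would be:
$ truncate -s 64M disk.img
$ sgdisk --new=1:2048:+4M --typecode=1:4100 disk.img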
Another alternative is a so-called CHRP script in "ppc/bootinfo.txt"; this script can load an ELF located in the same filesystem, and this is what the bootable CD/DVD installer uses. I have yet to test whether this is something that can be used across Open Firmware implementations.
To avoid compatibility issues, the ELF payload has to be compiled as a 32bit big-endian binary as the firmware interface would often assume that endianness and address size.
The entry point
As I entered this problem space I already had some experience writing UEFI binaries; the entry point in UEFI looks like this:
#![no_main]
#![no_std]
use uefi::prelude::*;
#[entry]
fn main(_image_handle: Handle, mut system_table: SystemTable<Boot>) -> Status {
uefi::helpers::init(&mut system_table).unwrap();
system_table.boot_services().stall(10_000_000);
Status::SUCCESS
}
Basically you get a pointer to a table of functions, and that's how you ask the firmware to perform system functions for you. I thought that maybe Open Firmware did something similar, so I had a look at how GRUB does this: it uses a ppc assembler snippet that jumps to grub_ieee1275_entry_fn(), and yaboot does a similar thing. I was already grumbling about having to look into how to embed an asm binary in my Rust project. But it turns out this snippet conforms to the PPC function calling convention, and since those snippets mostly take care of zeroing the BSS segment, and the ELF Rust outputs does not seem to generate one (although I am not sure this means there isn't a runtime one, I need to investigate this further), I decided to just create a small ppc32be ELF binary with the start function at the top of the .text section at address 0x10000.
I have created a repository with the most basic setup that you can run. With some cargo configuration to get the right linking options, and a script to create the disk image with the ELF payload on the PReP partition and run qemu, we can get this source code being run by Open Firmware:
#![no_std]
#![no_main]
use core::{panic::PanicInfo, ffi::c_void};
#[panic_handler]
fn _handler (_info: &PanicInfo) -> ! {
loop {}
}
#[no_mangle]
#[link_section = ".text"]
extern "C" fn _start(_r3: usize, _r4: usize, _entry: extern "C" fn(*mut c_void) -> usize) -> isize {
loop {}
}
Provided we have already created the disk image (check the run_qemu.sh script for more details), we can run our code by executing the following commands:
$ cargo +nightly build --release --target powerpc-unknown-linux-gnu
$ dd if=target/powerpc-unknown-linux-gnu/release/openfirmware-basic-entry of=disk.img bs=512 seek=2048 conv=notrunc
$ qemu-system-ppc64 -M pseries -m 512 --drive file=disk.img
[...]
Welcome to Open Firmware
Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
This program and the accompanying materials are made available
under the terms of the BSD License available at
http://www.opensource.org/licenses/bsd-license.php
Trying to load: from: /vdevice/v-scsi@71000003/disk@8000000000000000 ... Successfully loaded
Ta da! The wonders of getting your firmware to run an infinite loop. Here's where the fun begins.
Doing something actually useful
Now, to complete the hello world, we need to do something useful. Remember our _entry argument in the _start() function? That's our gateway to the firmware functionality. Let's look at how the IEEE1275 spec tells us we can work with it.
This function is a universal entry point that takes a structure as an argument telling the firmware what to run; depending on the service, it expects some extra arguments attached. Let's look at how we can at least print "Hello World!" on the firmware console.
The basic structure looks like this:
#[repr(C)]
pub struct Args {
pub service: *const u8, // null terminated ascii string representing the name of the service call
pub nargs: usize, // number of arguments
pub nret: usize, // number of return values
}
This is just the header of every possible call; nargs and nret determine the size of the memory of the entire argument payload. Let's look at an example to just exit the program:
#[no_mangle]
#[link_section = ".text"]
extern "C" fn _start(_r3: usize, _r4: usize, entry: extern "C" fn(*mut Args) -> usize) -> isize {
let mut args = Args {
service: "exit\0".as_ptr(),
nargs: 0,
nret: 0
};
entry (&mut args as *mut Args);
0 // The program will exit in the line before, we return 0 to satisfy the compiler
}
When we run it in qemu we get the following output:
Trying to load: from: /vdevice/v-scsi@71000003/disk@8000000000000000 ... Successfully loaded
W3411: Client application returned.
Aha! We successfully called firmware code!
To be continued…
To summarize, we've learned that we don't really need assembly code to produce an entry point to our OF bootloader (though we do need to zero our BSS segment if we have one), how to build a valid OF ELF for the PPC architecture, and how to call a basic firmware service.
In a follow up post I intend to show a hello world text output and how the ieee1275 crate helps to abstract away most of the grunt work needed to access common firmware services. Stay tuned!
26 Jul 2024 3:06pm GMT
Mike Blumenkrantz: Aftermath
After Action Report
The DRIL merge is done, and things are mostly working again after a tumultuous week. To recap, here's everything that went wrong leading up to 24.2-rc1, the reason why it went wrong, and the potential steps that could be taken (but almost certainly won't) to avoid future issues.
Library Paths
One of the big changes that went in last-minute was a MR linking all the GL frontend libs to Gallium, which is a huge improvement to the old way of using dlopen to directly trigger version mismatch errors.
It had some problems, like how it broke Steam. As some readers may have inferred, this was Very Bad, as my employer has some interest in ensuring that Steam does not break.
The core problem in this case has to do with library paths, distro policies, and Steam's own library handling:
- Mesa's libGLX/libEGL/libgbm all link directly to libgallium.so now, which means this library must be in the library path
- Traditionally, libgallium.so has been installed to ${libdir}/dri
- I initially suggested installing it to ${libdir} to avoid library pathing issues, but the criticism I received was that distros would not be friendly towards shipping an unstable library here
- Thus, I came upon the decision to use rpath to ensure the dri directory was appended to the library path for libgallium.so
Unfortunately, there are lots of things that don't fully handle all variations of rpath, chief among them Steam. Furthermore, some distros don't use the most optimal implementation of rpath (i.e., they use DT_RPATH instead of DT_RUNPATH), which hits those unimplemented parts of Steam.
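For reference, a quick way to check which of the two tags a given binary ended up with (a diagnostic sketch; the library path is a placeholder for wherever your distro installs it) is:
# DT_RUNPATH is the modern tag; DT_RPATH is the legacy one that more tooling mishandles
readelf -d /usr/lib64/libGLX_mesa.so.0 | grep -E 'RPATH|RUNPATH'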
The reason(s) this managed to land without issues?
- I was juggling the MR across multiple repos during final testing and CI mashing when I was trying to get it landed, and an intermediate version of the MR landed which updated all the CI LD_LIBRARY_PATH variables to include ${libdir}/dri, which I had used for a test run but did not intend to land with the final version
- My test machines all add a lot of extra directories to my LD_LIBRARY_PATH to avoid random issues when testing obscure apps
Combined, I wasn't getting adequate testing, so it appeared everything was fine when really nothing was fine.
Lucky for me, Simon McVittie wrote a full textbook analysis of the issue and possible solutions, so this is now fixed.
Ideally in the future I'll have better testing environments and won't be trying to hammer in big MRs minutes before a RC goes out.
FB Configs Went Missing
DRI is (now) a simple interface that tells Xorg which rendering formats can be used for drawables. This is dependent on the device and driver, but fbconfigs aren't typically something that should vary too much between driver versions. DRIL is meant to split this functionality out of the rest of Mesa so that all the internal interfaces don't have to be a Gordian Knot.
Unfortunately, this means if DRIL has problems determining which formats are usable, the xserver also has problems. There were a lot of problems:
- The original implementation used some pretty suboptimal looping to calculate valid configs (If you're ever using eglChooseConfigs, you're probably fucking up) which made it hard to adequately review
- No double-buffered configs
- No sRGB variants
This is why there was a sudden deluge of issues about broken colors
On my end, I didn't check glxinfo
output thoroughly enough, nor did I do an exceptionally thorough testing of desktop apps. All the unit tests passed along with CI, which seemed like it should have been enough. Too bad there are no piglit tests which check to see whether various fbconfigs are supported. Maybe I'll write one to ensure there's a CI baseline and catch any future regressions.
Drivers Stopped Loading
This is a pretty dumb issue, but it was an issue nonetheless: drivers simply stopped loading. This affected any number of embedded (etnaviv) devices and was fixed by a pretty trivial MR. Also I broke KMSRO, which broke even more devices.
Whoops.
The problem here is there's no CI testing, and I have no such devices for testing. Hard to evaluate these types of things when they silently fail.
But Now We're All Good.
I promise.
26 Jul 2024 12:00am GMT
25 Jul 2024
planet.freedesktop.org
Alberto Ruiz: Booting with Rust: Chapter 1
I have been doing random coding experiments in my spare time that I never got to publicize much outside of my inner circles. I thought I would dust off my blog a bit to talk about what I did, in case it is useful for others.
For some background, I used to manage the bootloader team at Red Hat a few years ago alongside Peter Jones and Javier Martinez. I learned a great deal from them, fell in love with this particular problem space, and have come to enjoy tinkering with experiments in it.
There are many open challenges in this space that we could tackle to have a more robust boot path across Linux distros, from boot attestation for the initramfs and cmdline, to A/B rollbacks, to TPM LUKS decryption (à la BitLocker)…
One that particularly interests me is unifying the firmware-kernel boot interface across implementations in the hypothetical absence of GRUB.
Context: the issue with GRUB
The priority of the team was to support the RHEL boot path on all the architectures we supported, namely x86_64 (legacy BIOS & UEFI), aarch64 (UEFI), s390x and ppc64le (Open Power and PowerVM).
These are extremely heterogeneous firmware interfaces, some are on their way to extinction (legacy PC BIOS) and some will remain weird for a while.
GRUB (GRand Unified Bootloader), as its name states, intends to be a unified bootloader for all platforms. GRUB has to support a superset of firmware interfaces; some of those, like legacy BIOS, do not support much beyond rudimentary disk or network access and basic graphics handling.
To load a kernel and its initramfs, this means that GRUB has to implement basic drivers for storage, networking, TCP/IP, filesystems, volume management… Every time there is a new storage technology, we need to implement a driver twice: once in the kernel and once in GRUB itself. GRUB is, for all intents and purposes, an entire operating system that has to be maintained.
The maintenance burden is actually quite big, and recently it has been a target for the InfoSec community after the Boot Hole vulnerability. GRUB is implemented in C and it is an extremely complex code base, not as well staffed as it should be. It implements its own scripting language (parser et al) and it is clear there are quite a few CVEs lurking in there.
So, we are basically maintaining code we already have to write, test and maintain in the Linux kernel, in a different OS whose whole purpose (in the context of RHEL, CentOS and Fedora) is to boot a Linux kernel.
This realization led to the initiative that is these days taking shape in the discussions around nmbl (no more boot loader). You can read more about it in that blog post; I am not actively participating in that effort, but I encourage you to read about it. I do want to focus on something else and very specific, which is what you do before you load the nmbl kernel.
Booting from disk
I want to focus on the code that goes from the firmware interface to loading the kernel (nmbl or otherwise) from disk. We want some sort of A/B boot protocol that is somewhat normalized across the platforms we support; we need to pick the kernel from the disk.
The systemd community has led some of the boot modernization initiatives, vocally supporting the adoption of UKI and signed pre-built initramfs images, developing the Boot Loader Spec, and other efforts.
At some point I heard Lennart making the point that we should standardize on using the EFI System Partition as /boot to place the kernel as most firmware implementations know how to talk to a FAT partition.
This proposal caught my attention and I have been pondering whether we could have a relatively small codebase written in a safe language (you know which) that could support a well-defined protocol for A/B booting a kernel on legacy BIOS, s390 and Open Firmware (UEFI and Open Power already support BLS snippets so we are covered there).
My modest inroad into testing this hypothesis so far has been the development of ieee1275-rs, a Rust module to write programs for the Open Firmware interface, so far I have not been able to load a kernel by myself but I still think the lessons learned and some of the code could be useful to others. Please note this is a personal experiment and nothing Red Hat is officially working on.
I will be writing more about the technical details of this crate in a follow-up blog post where I get into some of the details of writing Rust code for a firmware interface; this post is long enough already. Stay tuned.
25 Jul 2024 2:10pm GMT
17 Jul 2024
planet.freedesktop.org
Mike Blumenkrantz: Long Road To DRIL
I'm Cookin
Lot of stuff happening. I can't talk about much of it yet, but trust me when I say the following:
It's happening.
When it happens, you'll know what I meant.
Today Is A Great Day
Remember way back when I put DRI interfaces on notice?
Now, only four months later, DRI interfaces are finally going away.
Begun by @ajax and then finished off by me and Pavel (ghostwritten by @daniels), the DRIL (DRI Legacy) interface is a tiny shim which matches Xorg's ABI expectations to provide a list of sensible fbconfig formats during startup. Then it does nothing. And by doing nothing, it saves the rest of Mesa from being shackled to ancient ABI constraints.
Let the refactoring begin.
But Wait, There's More!
Obviously I'm not going to stop here. SGC leaves no code half-krangled. That's why, as soon as DRIL lands, I'll also be hammering in this followup MR which finally makes all the GL frontends link directly to the Gallium backend driver.
Why is this so momentous, you ask? How many of you have gotten the error DRI driver not from this Mesa build when trying to use your custom Mesa build?
With this MR, that error is going away. Permanently. Now you can have as many Mesa builds on your system as you want. No longer do you need to set LIBGL_DRIVERS_PATH for any reason.
The future is here.
17 Jul 2024 12:00am GMT
15 Jul 2024
planet.freedesktop.org
Simon Ser: Status update, July 2024
Hi!
This month wlroots 0.18.0 has been released! This new version includes a fair share of niceties: ICC profiles, GPU reset recovery, less black screens when plugging in a monitor on Intel, a whole bunch of new protocol implementations, and much more. Thanks a lot to all contributors! Two recent merge requests made it in the release: Kenny's Vulkan renderer optimizations, and support for the SIZE_HINTS KMS property to use a smaller cursor plane on Intel to save power. For the next release we'll be trying out release candidates to formally focus on bugfixing and leave time for compositors and language bindings to update and report issues.
I've continued working on various graphics-related topics, for instance the wlroots implementation of the upcoming ext-screencopy-v1 protocol is now complete and the protocol itself is almost ready (still figuring out the most difficult part: how to name it). I also sent out a kernel patch to fix tearing page-flips when cursor/overlay planes don't change (and are included in the atomic commit). I reviewed patches by Enrico Weigelt to improve libdrm's portability to OpenBSD and Solaris. Last, I've released libdisplay-info 0.2.0 with a new high-level API for colorimetry and support for more EDID/CTA/DisplayID blocks.
To get the releases over with, let's briefly mention Goguma 0.7.0. This one unlocks file uploads, a new look based on Material You with an adaptive color scheme, many improvements to the iOS port, and text/media can be shared to Goguma from other apps. slingamn has played with a gamja/Ergo setup configured with Forgejo as an OAuth server, and it worked nicely after fixing a gamja SASL-related bug and implementing a missing feature in Forgejo's OAuth token introspection endpoint!
Last, I also added a new libscfg API to write files - this can be useful to auto-generate some configuration files for instance. And I also performed some more boring X.Org Foundation sysadmin stuff, such as dealing with domain-related issues, recovering a server running out of disk space again, and convincing Postfix to start up.
See you next month!
15 Jul 2024 10:00pm GMT
12 Jul 2024
planet.freedesktop.org
Madeeha Javed: Igalia's Latest Contributions to Graphics
The Igalia Graphics team has been expanding and making significant contributions in the space of open source graphics. An earlier blog post by our team member Lucas provides an excellent insight into the team's evolution over the past years. The following series of posts will attempt to summarize the team's recent engagements:
- This post covers our updates on GPU color management, Turnip, V3DV, DRM/KMS, Etnaviv and community events we have been participating in.
- The next post will cover news from our CTS, Vulkan Video, Mesa CI and GPU reset work, and talks about some new initiatives that we recently got involved in.
Before delving into details, it is worth mentioning the recent highlights: Igalia hosted the 2024 Linux Display Next Hackfest in May this year and the X.Org Developers Conference 2023 in October last year, both in the beautiful city of A Coruña. These events were a huge success in creating a hub for graphics experts to foster open innovation. Continue reading for more details on these events.
A Vibrant Linux #
Last year brought great news for AMD GPU color management: the AMD driver-specific color management properties reached the upstream linux-next! My Igalia colleague Melissa Wen has been spearheading this effort for some time now and has journalled every detail in a series of blog posts.
AMD has been improving its display color management pipeline with each new hardware generation. The new color capabilities, before and after plane composition, can be used by compositors and userspace applications to provide a vibrant experience to the end-user. Exposing AMD driver-specific color properties is a step towards advanced color management on Linux, allowing gamut mapping, HDR rendering, HDR on SDR, and SDR on HDR.
On a very high level, there are 2 parts of this support:
- Upgrading the DRM/KMS Linux interface to expose the new features to the user-space. One major challenge was the limited DRM/KMS interface, which only exposed a small set of post-blending color properties. Latest AMD Display Core Next hardware has many more post-blending and pre-blending capabilities. Melissa's work involved mapping these capabilities to the AMD driver's display core interface and then to the DRM interface. Her blog post provides a brief overview of this extensive mapping effort.
- Updating the AMD's Linux display driver to expose the new hardware features. AMD DCN 3.0 comes with cutting edge color capabilities described by Melissa here and this blog post also talks about the AMD's Linux display subsystem components and about the new properties.
I quote here some of Melissa's write-ups that helped me get some understanding about this vast subject:
- Navigating the Linux display subsystem
- Melissa's XDC2023 talk
Turnip Upgrades #
Turnip, the open-source Vulkan driver for Qualcomm Adreno GPUs, has been receiving major upgrades this year for Qualcomm's Adreno 7XX GPUs.
From my colleague Danylo Piliaiev's Turnip update at FOSDEM 2024, Turnip seems to be in a great state: major Vulkan extensions and better debug support have landed, AAA desktop games can now run via FEX + Turnip on Linux, and some in the Termux community are even running desktop games on Android with Box64/FEX + Turnip.
The highlight of Danylo's talk is the A7XX support. The team started the year with A7XX bring-up and is now ramping up on adding support for the new features introduced in A7XX:
- Mark Collins, who also represents Igalia at the Khronos Vulkan WG, implemented GMEM rendering for A7XX, which can be considerably faster and more power efficient than sysmem rendering depending on what's being rendered. Followed up by support for unidirectional LRZ, bringing A7XX to parity with A6XX's GMEM rendering feature set and further boosting performance, with more performance improvements for A7XX on the horizon.
- Our colleague Amber Harmonia added support for allowing a shader to contain 64-bit atomic operations on signed and unsigned integers and support for allowing rasterizing wide lines, while Fixed Stride Draw Table support is work-in-progress.
In addition to new feature support, we are committed to providing a robust and performant driver.
Recently, Job Noorman has joined our Turnip team to improve the IR3 compiler. He improved handling of predicate registers and added support for predication. Adreno GPUs have special registers, called predicate registers, that store the result of a condition; utilizing these registers can eliminate branches in the generated code, thereby improving performance. Similarly, a more than 10% code size reduction was observed in shader-db with his patch for using rptN instructions.
Turnip has come far and has recently been giving Adreno's proprietary driver real competition. Here is Assassin's Creed running on Adreno + Turnip. Check the FPS on that screen!
Turnip Development Resources #
Danylo usually talks about analyzing some of the major Turnip issues in his series of blog posts "Turnips in the wild" with part 3 being the latest addition. This is exactly what you need to jump start Turnip development.
As always, the team also discovered many new techniques for debugging GPU issues. GPU driver developers want to modify the GPU command stream at run-time to see the outcome of editing it in different ways. Danylo implemented this highly sought-after feature as a tool for Adreno and describes how this tool can be used.
DRM/KMS Improvements #
The management of the display, graphics and composition in Linux lies in the kernel DRM/KMS framework. Igalian Maíra Canal provides full disclosure of our notable contributions authoring, reviewing and testing kernel DRM patches, while I provide a few highlights here:
- My Igalia colleague André Almeida and Simon Ser have been working on Asynchronous Page Flips, an optimization that allows applications to flip a plane for immediate presentation. The support for this feature is now available in the atomic API. Plus, with André's patch, it is enabled for all planes including the primary plane if the hardware supports it.
- Maíra has been working on features crucial to graphics development on the RPi. She supplied per-client GPU usage statistics as well as global GPU utilization.
- In order to ensure continuous job submission to the GPU, CPU jobs submitted from userspace must be avoided. With a series of patches, Maíra moved the CPU job mechanisms from the V3DV driver to the V3D kernel driver.
We want more Pi! #
After achieving Vulkan 1.2 conformance on V3DV, the Igalia team working on V3DV has been focusing on instrumental enhancements of the driver. V3DV is the Broadcom VideoCore GPU's Vulkan driver on the Raspberry Pi.
The RPi 5 was launched in October last year with a new BCM GPU. Alejandro provided an overview of the team's journey through V3DV development since the RPi 4 and then talks about the challenges of RPi 5 support in V3DV:
More improvements and new Vulkan extensions were supported last year.
This year Iago landed support for the Vulkan dynamic rendering extension. VK_KHR_dynamic_rendering is a popular Vulkan extension that has added flexibility to the Vulkan API by allowing users to skip render pass and framebuffer objects and start immediate rendering. And now it's available on the Pi.
As mentioned in the DRM/KMS improvements above, Maíra, together with José María Casanova (Chema) and Melissa, supported GPU utilization stats and CPU job optimizations. Here is a snapshot of GPU stats collection on the Pi 5:
The RPi 5 continues to use the OpenGL/Wayland-based Wayfire compositor. Christopher was therefore tasked with enabling Wayfire to run on the RPi 3 and 4 as well. He achieved this by implementing software rendering through a Pixman back-end. Check out the demo:
Iago also made some interesting observations while experimenting with SuperTuxKart on the Pi. You will be pleasantly surprised to know how Vulkan out-performed OpenGL.
The team has been working towards Vulkan 1.3 and we will hopefully be able to share more news on that front very soon.
Etnaviv #
Christian Gmeiner, one of the maintainers of Etnaviv (open-source graphics driver for Vivante GPUs), joined our team last year. We are very excited to have him on-board because it is a testament to Igalia's dedication towards open source graphics software development.
Christian is also enjoying being at Igalia, as he discusses in his blog post, where he also reveals his plans for Etnaviv:
- Improving Etnaviv's Gallium driver.
- Exposing GLES3.
- Moving towards a new back-end compiler.
One of his latest updates is the user-space hardware database. He explains that a user-space driver HW database has been introduced to obtain GPU specific information like GPU features and limits, corresponding to the introduction of an in-kernel hardware database. I am sure this will be super helpful for the reverse engineers out there!
News & Community Events #
Igalians are always eager to share their knowledge and expertise with the open source community by participating in key organizations and events.
Good bye 'Xorg' and Hello 'Linux Foundation' #
There is quite a trend of Igalians serving on the X.Org Foundation's Board of Directors. Samuel Iglesias took on this responsibility for a number of terms, but this year he is stepping down. He reminisces about his role in this blog post.
Ricardo, however, was elected to the board of directors in 2022 and stayed on the board until Q1 2024, leaving Christopher Michael as the only Igalian currently on it. In his blog post, Ricardo introduces the X.Org Foundation and also tackles some questions about its future.
Samuel was invited to join the Linux Foundation (Europe) advisory board and he has accepted the invitation. This is a huge milestone for the whole graphics team. Congratulations Sam!
2024 Linux Display Hackfest #
This is a rather new event that has materialized in the Linux community to enhance the Linux display stack.
Melissa's work on HDR and AMD color management, together with interesting discussions during the XDC 2023 Color Management workshop, paved the way for this year's event, and Igalia therefore graciously offered to host it.
The event attracted key participants from the Linux community, AMD, Nvidia, Google, Fedora, and GNOME, focusing on topics like HDR/color management, variable refresh rate, tearing, multi-plane/hardware overlays for video and gaming, real-time scheduling, the async KMS API, power saving vs. color/latency, content-adaptive scaling and sharpening, and display control. The success of this event has highlighted the need for future editions.
Embedded Open Source Summit 2024 #
At EOSS this year, we presented the following talks:
- Alejandro Piñeiro, Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driver for a New GPU
FOSDEM 2024 #
At FOSDEM this year, we presented the following talks:
- Danylo Piliaiev, "turnip: Update on Open Source Vulkan Driver for Adreno GPUs"
- José María Casanova Crespo, Juan A. Suarez, "Graphics stack updates for Raspberry Pi devices"
Vulkanised 2024 #
At Vulkanised this year, we presented the following talks:
- Stéphane Cerveau & Hyunjun Ko, "Implementing a Vulkan Video Encoder From Mesa to GStreamer"
- Iago Toral, Faith Ekstrand, "8 Years of Open Drivers, including the State of Vulkan in Mesa"
Igalians who attended the event found it quite informative on the subject.
XDC 2023 #
Igalia hosted XDC 2023 in the city of their headquarters, A Coruña. We also presented many talks and demos.
- Melissa Wen, "The rainbow treasure map: advanced color management on Linux with AMD/Steam Deck"
- Danylo Piliaiev, "Debugging GPU faults: QoL tools for your driver"
- Eric Engestrom with Martin Roukala and David Heidelberg, "Hosting a CI system at home - Slaying the regression dragon to bring stability to driver kingdom"
- Iago Toral, Juan A. Suarez, Maíra Canal, "On-going challenges in the Raspberry Pi driver stack: OpenGL 3, Vulkan and more"
- Maíra Canal, Melissa Wen, "Status Update of the VKMS DRM driver"
- André Almeida, "Having fun with GPU resets in Linux"
- Lucas Fryzek, "Freedreno on Android"
- Christian Gmeiner, "etnaviv: status update"
The lightning talks and demos saw equally active participation from Igalians:
- Christopher Michael, "Wayfire - Making an OpenGL Wayland compositor render using Pixman"
- Guilherme G. Piccoli, "To crash or not to crash: if you do, at least recover fast!"
- Charles Turner, "Status of the Vulkan Video ecosystem"
- Alejandro Piñeiro, "v3dv: experience using gfxreconstruct/apitrace traces for performance evaluation"
- Eric Engestrom, "Being a Mesa release maintainer"
Workshops were organized to discuss larger subjects like advanced color management (discussion summary) and continuous integration (discussion summary).
The Future #
The Igalia graphics team has profound expertise in Mesa, Vulkan, OpenGL and the Linux kernel. We have also embraced new and really interesting graphics technologies, which I talk about in my next post.
12 Jul 2024 12:00am GMT
11 Jul 2024
planet.freedesktop.org
Christian Gmeiner: It All Started With a Nop - Part I
Note
This blog post is part 1 of a series of blog posts about isaspec and its usage in the etnaviv GPU stack.
I will add here links to the other blog posts, once they are published.
The first time I heard about isaspec, I was blown away by the possibilities it opens. I am really thankful that Igalia made it possible to complete this crucial piece of core infrastructure for the etnaviv GPU stack.
If isaspec is new to you, here is what the Mesa docs have to say about it:
isaspec provides a mechanism to describe an instruction set in XML, and generate a disassembler and assembler. The intention is to describe the instruction set more formally than hand-coded assembler and disassembler, and better decouple the shader compiler from the underlying instruction encoding to simplify dealing with instruction encoding differences between generations of GPU.
Benefits of a formal ISA description, compared to hand-coded assemblers and disassemblers, include easier detection of new bit combinations that were not seen before in previous generations due to more rigorous description of bits that are expected to be '0' or '1' or 'x' (dontcare) and verification that different encodings don't have conflicting bits (i.e. that the specification cannot result in more than one valid interpretation of any bit pattern).
If you are interested in more details, I highly recommend Rob Clark's introduction to isaspec presentation.
Target ISA
Vivante uses a fixed-size (128 bits), predictable instruction format with explicit inputs and outputs.
As of today, there are three different encodings seen in the wild:
- Base Instruction Set
- Extended Instruction Set
- Enhanced Vision Instruction Set (EVIS)
Why do I want to switch to isaspec
There are several reasons.
The current state
The current ISA documentation is not very explicit and leaves a lot of room for interpretation and speculation. One thing it does provide is some nice explanations of what an instruction does. isaspec does not support <doc> tags yet, but there is a PoC MR that generates really nice-looking and informative ISA documentation based on the XML.
I think you might soon find all of etnaviv's isaspec documentation at docs.mesa3d.org.
No unit tests
There are no unit tests based on instructions generated by the blob driver. This might not sound too bad, but it opens the door to generating badly encoded instructions that could trigger all sorts of weird and hard-to-debug problems. Such breakages could be caused by some compiler rework, etc.
In an ideal world, there would be a unit test that does the following:
- Disassembles the binary representation of an instruction from the blob to a string representation.
- Verifies that it matches our expectation.
- Assembles the string representation back to 128 bits.
- Verifies that it matches the binary representation from the blob driver.
This is our ultimate goal, which we really must reach. etnaviv will not be the only driver that does such deep unit testing - e.g. freedreno does it too.
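A minimal sketch of what such a round-trip test could look like is below. The functions etna_isa_disasm() and etna_isa_assemble() are hypothetical placeholder names for whatever the generated disassembler and assembler entry points end up being, and the expected string simply follows the {NAME} void, void, void, void display template shown later in this post:
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
/* Hypothetical entry points of the isaspec-generated disassembler and
 * assembler; placeholder names, not the actual generated API. */
void etna_isa_disasm(const uint32_t *dwords, char *out, size_t out_len);
bool etna_isa_assemble(const char *str, uint32_t *dwords);
/* Round-trip test: binary -> string -> binary. */
static void
test_nop_roundtrip(void)
{
   /* 128-bit nop instruction as captured from the blob driver. */
   const uint32_t blob[4] = { 0x00000000, 0x00000000, 0x00000000, 0x00000000 };
   char str[128];
   uint32_t reassembled[4];
   etna_isa_disasm(blob, str, sizeof(str));
   assert(strcmp(str, "nop void, void, void, void") == 0);
   assert(etna_isa_assemble(str, reassembled));
   assert(memcmp(blob, reassembled, sizeof(blob)) == 0);
}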
Easier to understand code
Do you remember the rusticl OpenCL attempt for etnaviv? It contains lines like:
if (nir_src_is_const(intr->src[1])) {
   inst.tex.swiz = 128;
}
if (rmode == nir_rounding_mode_rtz)
   inst.tex.amode = 0x4 + INST_ROUND_MODE_RTZ;
else /*if (rmode == nir_rounding_mode_rtne)*/
   inst.tex.amode = 0x4 + INST_ROUND_MODE_RTNE;
Do you clearly see what is going on? Why do we need to set tex.amode for an ALU instruction?
I always found it quite disappointing to see such code snippets. Sure, they mimic what the blob driver is doing, but you lose all the knowledge about why these bits are used that way mere days after you worked on it. There must be a cleaner, more understandable, and thus more maintainable way to document the ISA.
This situation might get even worse if we want to support the other encodings; we could end up with more of these bad patterns, resulting in a maintenance nightmare.
Oh, and if you wonder what happened to OpenCL and etnaviv - I promise there will be an update later this year.
Python opens the door to generating a lot of code
As isaspec is written in Python, it is really easy to extend it and add support for new functionality.
At its core, we can generate a disassembler and an assembler based on isaspec. This alone saves us from writing a lot of code that needs to be kept in sync with all the ISA reverse engineering findings that happen over time.
As isaspec is just an ordinary XML file, you can use any programming language you like to work with it.
One source of truth
I really fell in love with the idea of having one source of truth that models our target ISA, contains written documentation, and extends each opcode with meta information that can be used in the upper layers of the compiler stack.
Missing Features
I think I have sold you the idea quite well, so it must be a matter of just a few days to switch to it. Sadly no, as there are some missing features:
- Only ISAs up to 64 bits wide are supported
- Its home is in src/freedreno
- Alignment support is missing
- No <meta> tags are supported
Add support for 128 bit wide instructions
The first big MR I worked on extended the BITSET API with features needed for isaspec - here we are talking about bitwise AND, OR and NOT, and left shifts.
The next step was to switch isaspec over to the BITSET API to support wider ISAs. This resulted in a lot of commits, as some new APIs were needed along the way. After these 31 commits, we were able to start looking into isaspec support for etnaviv.
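For readers unfamiliar with Mesa's BITSET helpers, the sketch below shows the general flavour of the API. The whole-bitset AND/OR/NOT/shift operations mentioned above are the ones this work added; their exact names in the comment are my assumption, so treat this purely as an illustration:
#include <assert.h>
#include "util/bitset.h"
static void
bitset_example(void)
{
   BITSET_DECLARE(instr, 128);   /* a 128-bit wide bitset */
   BITSET_ZERO(instr);
   BITSET_SET(instr, 80);        /* e.g. set the bit at position 80 */
   assert(BITSET_TEST(instr, 80));
   /* The helpers added for isaspec operate on whole bitsets, along the
    * lines of (assumed names):
    *   BITSET_OR(dst, a, b);
    *   BITSET_NOT(x);
    *   BITSET_SHL(x, n);
    * which is what lets isaspec-generated code handle >64-bit ISAs. */
}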
Decode Support
Now it is time to start writing an isaspec XML for etnaviv, and the easiest opcode to start with is the nop. As the name suggests, it does nothing and has no srcs, no dst, nor any other modifier.
As I do not have this initial version anymore, I tried to recreate it - it might have looked something like this:
<?xml version="1.0" encoding="UTF-8"?>
<isa>
<bitset name="#instruction">
<display>
{NAME} void, void, void, void
</display>
<pattern low="6" high="10">00000</pattern>
<pattern pos="11">0</pattern>
<pattern pos="12">0</pattern>
<pattern low="13" high="26">00000000000000</pattern>
<pattern low="27" high="31">00000</pattern>
<pattern pos="32">0</pattern>
<pattern pos="33">0</pattern>
<pattern pos="34">0</pattern>
<pattern low="35" high="38">0000</pattern>
<pattern pos="39">0</pattern>
<pattern low="40" high="42">000</pattern>
<!-- SRC0 -->
<pattern pos="43">0</pattern> <!-- SRC0_USE -->
<pattern low="44" high="52">000000000</pattern> <!-- SRC0_REG -->
<pattern pos="53">0</pattern>
<pattern low="54" high="61">00000000</pattern> <!-- SRC0_SWIZ -->
<pattern pos="62">0</pattern> <!-- SRC0_NEG -->
<pattern pos="63">0</pattern> <!-- SRC0_ABS -->
<pattern low="64" high="66">000</pattern> <!-- SRC0_AMODE -->
<pattern low="67" high="69">000</pattern> <!-- SRC0_RGROUP -->
<!-- SRC1 -->
<pattern pos="70">0</pattern> <!-- SRC1_USE -->
<pattern low="71" high="79">000000000</pattern> <!-- SRC1_REG -->
<pattern low="81" high="88">00000000</pattern> <!-- SRC1_SWIZ -->
<pattern pos="89">0</pattern> <!-- SRC1_NEG -->
<pattern pos="90">0</pattern> <!-- SRC1_ABS -->
<pattern low="91" high="93">000</pattern> <!-- SRC1_AMODE -->
<pattern pos="94">0</pattern>
<pattern pos="95">0</pattern>
<pattern low="96" high="98">000</pattern> <!-- SRC1_RGROUP -->
<!-- SRC2 -->
<pattern pos="99">0</pattern> <!-- SRC2_USE -->
<pattern low="100" high="108">000000000</pattern> <!-- SRC2_REG -->
<pattern low="110" high="117">00000000</pattern> <!-- SRC2_SWIZ -->
<pattern pos="118">0</pattern> <!-- SRC2_NEG -->
<pattern pos="119">0</pattern> <!-- SRC2_ABS -->
<pattern pos="120">0</pattern>
<pattern low="121" high="123">000</pattern> <!-- SRC2_AMODE -->
<pattern low="124" high="126">000</pattern> <!-- SRC2_RGROUP -->
<pattern pos="127">0</pattern>
</bitset>
<!-- opcodes sorted by opc number -->
<bitset name="nop" extends="#instruction">
<pattern low="0" high="5">000000</pattern> <!-- OPC -->
<pattern pos="80">0</pattern> <!-- OPCODE_BIT6 -->
</bitset>
</isa>
With the knowledge of the old ISA documentation, I went fishing for instructions, using only instructions generated by the binary blob for this process. It is quite important for me to have as many unit tests as I can write, so that decoding does not break with any isaspec XML changes I make - and that was a huge lifesaver at the time.
After I reached almost feature parity with the old disassembler, I thought it was time to land etnaviv.xml and replace the current handwritten disassembler with a generated one - yeah, so I submitted an MR to make the switch.
As this is only a driver internal disassembler used by maybe 2-3 human beings, it would not be a problem if there were some regressions.
Today I would say the isaspec disassembler is superior to the handwritten one.
Encode Support
The next item on my list was to add encoding support. As you can imagine, there was some work needed upfront to support ISAs that are bigger than 64 bits. This time the MR only contains two commits 😄.
With everything ready, it was time to add isaspec-based encoding support to etnaviv.
The goal is to drop our custom (and too simple) assembler and switch to one powered by isaspec.
This opens the door to:
- Modeling special cases for instructions like a branch with no src's to a new jump instruction.
- Doing the NIR src -> instruction src mapping in isaspec.
- Supporting different instruction encodings.
- Adding meta information to instructions.
- Supporting special instructions that are used in compiler unit tests.
In the end, all the magic that is needed is shown in the following diff:
diff --git a/src/etnaviv/isa/etnaviv.xml b/src/etnaviv/isa/etnaviv.xml
index eca8241a2238a..c9a3ebe0a40c2 100644
--- a/src/etnaviv/isa/etnaviv.xml
+++ b/src/etnaviv/isa/etnaviv.xml
@@ -125,6 +125,13 @@ SPDX-License-Identifier: MIT
<field name="AMODE" low="0" high="2" type="#reg_addressing_mode"/>
<field name="REG" low="3" high="9" type="uint"/>
<field name="COMPS" low="10" high="13" type="#wrmask"/>
+
+ <encode type="struct etna_inst_dst *">
+ <map name="DST_USE">p->DST_USE</map>
+ <map name="AMODE">src->amode</map>
+ <map name="REG">src->reg</map>
+ <map name="COMPS">p->COMPS</map>
+ </encode>
</bitset>
<bitset name="#instruction" size="128">
@@ -137,6 +144,46 @@ SPDX-License-Identifier: MIT
<derived name="TYPE" type="#type">
<expr>{TYPE_BIT2} << 2 | {TYPE_BIT01}</expr>
</derived>
+
+ <encode type="struct etna_inst *" case-prefix="ISA_OPC_">
+ <map name="TYPE_BIT01">src->type & 0x3</map>
+ <map name="TYPE_BIT2">(src->type & 0x4) >> 2</map>
+ <map name="LOW_HALF">src->sel_bit0</map>
+ <map name="HIGH_HALF">src->sel_bit1</map>
+ <map name="COND">src->cond</map>
+ <map name="RMODE">src->rounding</map>
+ <map name="SAT">src->sat</map>
+ <map name="DST_USE">src->dst.use</map>
+ <map name="DST">&src->dst</map>
+ <map name="DST_FULL">src->dst_full</map>
+ <map name="COMPS">src->dst.write_mask</map>
+ <map name="SRC0">&src->src[0]</map>
+ <map name="SRC0_USE">src->src[0].use</map>
+ <map name="SRC0_REG">src->src[0].reg</map>
+ <map name="SRC0_RGROUP">src->src[0].rgroup</map>
+ <map name="SRC0_AMODE">src->src[0].amode</map>
+ <map name="SRC1">&src->src[1]</map>
+ <map name="SRC1_USE">src->src[1].use</map>
+ <map name="SRC1_REG">src->src[1].reg</map>
+ <map name="SRC1_RGROUP">src->src[1].rgroup</map>
+ <map name="SRC1_AMODE">src->src[1].amode</map>
+ <map name="SRC2">&src->src[2]</map>
+ <map name="SRC2_USE">src->src[2].use</map>
+ <map name="SRC2_REG">src->src[2].reg</map>
+ <map name="SRC2_RGROUP">src->src[2].rgroup</map>
+ <map name="SRC2_AMODE">src->src[2].amode</map>
+
+ <map name="TEX_ID">src->tex.id</map>
+ <map name="TEX_SWIZ">src->tex.swiz</map>
+ <map name="TARGET">src->imm</map>
+
+ <!-- sane defaults -->
+ <map name="PMODE">1</map>
+ <map name="SKPHP">0</map>
+ <map name="LOCAL">0</map>
+ <map name="DENORM">0</map>
+ <map name="LEFT_SHIFT">0</map>
+ </encode>
</bitset>
<bitset name="#src-swizzle" size="8">
@@ -148,6 +195,13 @@ SPDX-License-Identifier: MIT
<field name="SWIZ_Y" low="2" high="3" type="#swiz"/>
<field name="SWIZ_Z" low="4" high="5" type="#swiz"/>
<field name="SWIZ_W" low="6" high="7" type="#swiz"/>
+
+ <encode type="uint8_t">
+ <map name="SWIZ_X">(src & 0x03) >> 0</map>
+ <map name="SWIZ_Y">(src & 0x0c) >> 2</map>
+ <map name="SWIZ_Z">(src & 0x30) >> 4</map>
+ <map name="SWIZ_W">(src & 0xc0) >> 6</map>
+ </encode>
</bitset>
<enum name="#thread">
@@ -272,6 +326,13 @@ SPDX-License-Identifier: MIT
</expr>
</derived>
</override>
+
+ <encode type="struct etna_inst_src *">
+ <map name="SRC_SWIZ">src->swiz</map>
+ <map name="SRC_NEG">src->neg</map>
+ <map name="SRC_ABS">src->abs</map>
+ <map name="SRC_RGROUP">p->SRC_RGROUP</map>
+ </encode>
</bitset>
<bitset name="#instruction-alu-no-src" extends="#instruction-alu">
One nice side effect of this work is the removal of the isa.xml.h file that had been part of etnaviv since day one. We are now able to generate all of its contents with isaspec and some custom Python 3 scripts. Moving instruction src swizzling from the driver into etnaviv.xml was super easy - less code to maintain!
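To make the <encode> mechanism a bit more concrete: the #src-swizzle maps in the diff above are simply the XML equivalent of unpacking a packed swizzle byte, roughly like this C helper (an illustration only, not code from the driver):
#include <stdint.h>
/* Unpack a packed 8-bit swizzle into its four 2-bit components,
 * mirroring the SWIZ_X/Y/Z/W maps in the #src-swizzle <encode> block. */
static inline void
unpack_swiz(uint8_t swiz, uint8_t *x, uint8_t *y, uint8_t *z, uint8_t *w)
{
   *x = (swiz & 0x03) >> 0;
   *y = (swiz & 0x0c) >> 2;
   *z = (swiz & 0x30) >> 4;
   *w = (swiz & 0xc0) >> 6;
}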
Summary
I am really happy with the end result, even though it took quite some time from the initial idea to the point when everything was integrated into Mesa's main git branch.
There is so much more to share - I can't wait to publish parts II and III.
11 Jul 2024 12:00am GMT
28 Jun 2024
planet.freedesktop.org
Tomeu Vizoso: Etnaviv NPU update 19: Ideas On Board sponsors support for the NXP i.MX 8M Plus SoC
Last week I started work on adding support to the Etnaviv driver for the NPU inside the NXP i.MX 8M Plus SoC (VeriSilicon's VIPNano-SI+).
This work is sponsored by the open source consultancy Ideas On Board, and will include the same level of support as for the Amlogic A311D SoC, which means full acceleration for the SSDLite MobileDet object detection model.
Right now all kinds of basic convolutions are supported, and work is well on its way for strided convolutions.
For basic convolutions, most of the work was switching to a totally different way of encoding weights. At the lowest level, the weights are Huffman encoded, with zero run-length encoding on top. This low-level encoding had already been reverse engineered and implemented by Philipp Zabel of Pengutronix, as mentioned in my previous update on the variant of this NPU shipped inside the Amlogic S905D3.
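Just to illustrate the general idea of the run-length part, here is a generic zero run-length decoder in plain C. The actual VeriSilicon weight stream combines a scheme like this with Huffman coding and a hardware-specific layout, so the details below are a simplification, not the real format:
#include <stddef.h>
#include <stdint.h>
/* Toy zero run-length decode: a zero byte is followed by a byte giving
 * how many zeros to emit; any other byte is copied through as-is. */
static size_t
zero_rle_decode(const uint8_t *in, size_t in_len, uint8_t *out, size_t out_len)
{
   size_t i = 0, o = 0;
   while (i < in_len && o < out_len) {
      uint8_t v = in[i++];
      if (v != 0) {
         out[o++] = v;                 /* literal non-zero byte */
      } else if (i < in_len) {
         uint8_t run = in[i++];        /* zero followed by its run length */
         while (run-- && o < out_len)
            out[o++] = 0;
      }
   }
   return o;
}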
How the weights are laid out on top of that encoding is also different, so I had to reverse engineer it and implement it in the Mesa driver. That, plus some changes to how tiling is computed, got basic convolutions working; then I moved on to strided convolutions. Pointwise convolutions were supported at the same time as basic convolutions, as they are not any different on this particular hardware.
Strided convolutions are still not natively supported by the hardware, so I reused the code that lowers them to basic convolutions. But the existing jobs that use the tensor manipulation cores to transform the input tensor for strides contained many assumptions that don't hold on this hardware.
So I have been reverse engineering these differences, and now I have all kinds of strided convolutions supported up to 32 output channels. I feel these will be done after addressing a couple of details about how the tensor reshuffle jobs are distributed among the available TP cores.
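To give an idea of what such an input-transform job computes, here is a rough space-to-depth style reshuffle in plain C: folding each s x s spatial block into the channel dimension lets a stride-s convolution be expressed as a stride-1 convolution over the reshuffled input (with correspondingly rearranged weights). This is only an illustration of the lowering concept; the real TP jobs operate on the hardware's own tiling and data layout:
#include <stddef.h>
/* Reshuffle an HWC-layout tensor (h x w x c) into (h/s x w/s x c*s*s),
 * folding each s x s spatial block into the channel dimension. */
static void
space_to_depth(const float *in, float *out, int h, int w, int c, int s)
{
   const int ow = w / s;
   const int oc = c * s * s;
   for (int y = 0; y < h / s; y++)
      for (int x = 0; x < ow; x++)
         for (int dy = 0; dy < s; dy++)
            for (int dx = 0; dx < s; dx++)
               for (int k = 0; k < c; k++) {
                  size_t src = ((size_t)(y * s + dy) * w + (x * s + dx)) * c + k;
                  size_t dst = ((size_t)y * ow + x) * oc + (dy * s + dx) * c + k;
                  out[dst] = in[src];
               }
}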
Afterwards I will look at depthwise convolutions, which may be supported natively by the hardware, while on the A311D these were lowered to basic convolutions.
Then on to tensor addition operations, and that should be all that is needed to get SSDLite MobileDet running, hopefully close to the performance of the closed source driver.
I'm very grateful to Ideas On Board for sponsoring this work, for their trust in me to get it done, and for their vision of a fully featured mainline platform that all companies can base their products on without being held captive by any single vendor.
I'm testing all this on a Verdin iMX8M Plus board that was kindly offered by Daniel Lang at Toradex, thanks!
28 Jun 2024 7:08am GMT