14 Sep 2024
planet.freedesktop.org
Hans de Goede: Fedora plymouth boot splash not showing on systems with AMD GPUs
Recently there have been a number of reports (bug 2183743, bug 2276698, bug 2283839, bug 2312355) about the plymouth boot splash not showing properly on PCs using AMD GPUs.
The problem with plymouth and AMD GPUs is that the amdgpu driver is a really really big driver, which easily takes up to 10 seconds to load on older PCs. The delay caused by this may cause plymouth to time out while waiting for the GPU to be initialized, causing it to fall back to the 3 dot text-mode boot splash.
There are 2 workarounds for this, depending on the PC's configuration:
1. With older AMD GPUs the radeon driver is actually used to drive the GPU, but even though it is unused the amdgpu driver still loads, slowing things down.
To check if this is the case for your PC, start a terminal in a graphical login session and run: "lsmod | grep -E '^radeon|^amdgpu'". This will output something like this:
amdgpu 17829888 0
radeon 2371584 37
The second number on each line is the usage count. As you can see in this example, the amdgpu driver is not used. In this case you can disable the loading of the amdgpu driver by adding "modprobe.blacklist=amdgpu" to your kernel command line:
sudo grubby --update-kernel=ALL --args="modprobe.blacklist=amdgpu"
2. If the amdgpu driver is actually used on your PC, then plymouth not showing can be worked around by telling plymouth to use the simpledrm drm/kms device created from the EFI framebuffer early during boot, rather than waiting for the real GPU driver to load. Note this depends on your PC booting in EFI mode. To do this run:
sudo grubby --update-kernel=ALL --args="plymouth.use-simpledrm"
After using 1 of these workarounds plymouth should show normally again on boot (and booting should be a bit faster).
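The check from workaround 1 can also be scripted. What follows is a minimal Python sketch, not part of plymouth or any Fedora tooling: the function names `parse_usage_counts` and `pick_workaround` and the decision rule are assumptions made for illustration. It parses `lsmod`-style text and suggests one of the two kernel arguments mentioned above.

```python
def parse_usage_counts(lsmod_text):
    """Map module name -> usage count from lsmod-style lines (name, size, used)."""
    counts = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip the "Module Size Used by" header and any malformed line.
        if len(fields) >= 3 and fields[2].isdigit():
            counts[fields[0]] = int(fields[2])
    return counts


def pick_workaround(lsmod_text):
    """Suggest a kernel argument based on which AMD driver is actually in use."""
    counts = parse_usage_counts(lsmod_text)
    if "amdgpu" not in counts:
        return None  # amdgpu is not loaded, nothing to work around
    if counts["amdgpu"] == 0 and counts.get("radeon", 0) > 0:
        # radeon drives the GPU; the unused amdgpu module only slows down boot.
        return "modprobe.blacklist=amdgpu"
    if counts["amdgpu"] > 0:
        # amdgpu is really used; let plymouth start on simpledrm instead.
        return "plymouth.use-simpledrm"
    return None


sample = "amdgpu 17829888 0\nradeon 2371584 37"
print(pick_workaround(sample))
```

The returned string is what would go into the `--args=` option of the grubby commands shown above. The sketch does not check whether the PC boots in EFI mode, which the second workaround depends on.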
14 Sep 2024 1:38pm GMT
06 Sep 2024
Mike Blumenkrantz: Architechair
What Am I Even Doing
It was some time ago that I created my first MR touching WSI stuff.
That was also the first time I broke Mesa.
Did I learn anything?
The answer is no, but then again it would have to be given the topic of this sleep-deprived post.
Maybe I'm The Problem
WSI has a lot of issues, but most of them stem from its mere existence. If people stopped wanting to see the triangles, we would all have easier lives and performance would go through the fucking roof. That's ignoring the raw sweat and verbiage dedicated to insane ideas like determining the precise time at which the triangles should be made visible on a display or literally how are colors.
I'm nowhere near as smart as the people arguing about these things: I'm the guy who plays jenga with the tower constructed from popsicle sticks, marshmallow fluff, and wishful thinking. That's why, a while ago, I declared war on DRI interfaces and then also definitely won that war without any issues. In fact, it works perfectly.
But why did I embark upon this journey which required absolutely no fixups?
The answer lies in architecture. In the before-times, DRI (a massively overloaded acronym that no longer means anything) allowed Xorg to plug directly into Mesa to utilize hardware-accelerated rendering. It sidestepped the GL API in favor of a contract with Mesa that certain API would never change. And that was great for Xorg since it provided an optimal path to do xserver stuff. But it was (eventually) terrible for Mesa.
Renegotiation
When an API contract is made, it remains binding forever. A case when the contract is broken is called a Bug Report. Mesa has no bugs, however, except for the ones I didn't cause, and so this DRI contract that enables Xorg to shortcut more sensible APIs like EGL remains identical to this day, decades later. What is not identical, however, is Mesa.
In those intervening years, Mesa has developed into an entire ecosystem for driver development and other, less sane ideas. Gallium was created and then became the only method for implementing GL drivers. EGL and GBM are things now. But still, that DRI contract remains binding. Xorg must work. Like that one reviewer who will suggest changes for every minuscule flaw in your stupid, idiotic, uneducated, cretinous abuse of whitespace, it is not going away.
DRIL was the method by which Mesa could finally unshackle itself. The only parts of DRI still used by Xorg are for determining rendertarget capabilities, effectively eglGetConfigs. So @ajax and I punted out the relevant API into a stub which is mostly just a wrapper around eglGetConfigs. This enabled change and cleanup in every part of the codebase that was previously immutable.
Bidirectional Hell
As anyone who has tried to debug Mesa's DRI frontend knows, it sucks. It's one of the worst pieces of code to debug. A significant reason for this is (was) how the DRI callback system perpetuated circular architecture.
At the time of DRIL's merge, a user of GLX/EGL/GBM would engage with this sort of control flow:
1. GLX/EGL/GBM API call
2. direct API internals
3. function pointer into gallium/frontends/dri
4. DRI frontend
5. function pointer back to GLX/EGL/GBM
6. <loop back to 2 until operation completes>
7. return to user
In terms of functionality, it was functional. But debugging at a glance was impossible, and trying to eyeball any execution path required the type of PhD held by fewer than five people globally. The cyclical back-and-forth function pointering was a vertical cliff of a learning curve for anyone who didn't already know how things worked, and even things as "simple" as eglInitialize went through several impenetrable cycles of idiot-looping to determine success or failure. The absolute state of it made adding new features a nightmarish and daunting prospect, and reviewing any changes had, at best, even odds of breaking things because of how difficult it is to test this stuff.
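For anyone who hasn't had the pleasure, the shape of that control flow can be imitated in a few lines. This is a toy Python sketch and not Mesa code: the class names, the `get_drawable_info` callback and the `trace` list are all made up for illustration. It only shows how two layers holding function pointers into each other make a single call bounce back and forth.

```python
trace = []


class DriFrontend:
    """Stand-in for the DRI frontend; it reaches the API layer only via callbacks."""

    def __init__(self, callbacks):
        self.callbacks = callbacks  # function pointers back into the API layer

    def create_screen(self):
        trace.append("dri: create_screen")
        # The frontend cannot finish on its own, so it calls back into the API layer.
        self.callbacks["get_drawable_info"]()
        trace.append("dri: create_screen done")
        return True


class ApiLayer:
    """Stand-in for the GLX/EGL/GBM internals."""

    def __init__(self):
        # The API layer hands its own functions to the frontend...
        self.frontend = DriFrontend({"get_drawable_info": self.get_drawable_info})

    def initialize(self):
        trace.append("api: initialize")
        # ...and then calls into the frontend through a function pointer.
        ok = self.frontend.create_screen()
        trace.append("api: return to user")
        return ok

    def get_drawable_info(self):
        trace.append("api: get_drawable_info (called back from dri)")


ApiLayer().initialize()
print("\n".join(trace))
```

Even in this tiny version, reading `initialize` alone does not tell you what runs; you have to chase the callback table. Multiply that by several round trips per operation and the debugging experience described above follows.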
Better Now?
Maybe.
The juiciest refactoring is over, and now function pointering only occurs when the DRI frontend needs to access API-specific data for its drawables. It's actually possible to follow execution just by reading the code. Not that it's necessarily easy, but it's possible.
There's still a lot of work to be done here. There's still some corner case bugs with DRIL, there's probably EGL issues that have yet to be discovered because much of that code is still fairly opaque, and half the codebase is still prefixed with dri2_.
At the least, I think it's now possible to work on WSI in Mesa and have some idea what's going on. Or maybe I've just been down in the abyss for so long that I'm the one staring back.
Onward
I've been cooking. I mean like really cooking. Expect big things related to the number 3 later this month.
* UPDATE: At the urging of my legal team, I've been advised to mention that no part of this post, blog, or site has any association with, bearing on, or endorsement from Half Life 3.
06 Sep 2024 12:00am GMT
04 Sep 2024
Tvrtko Ursulin: DRM scheduling cgroup controller
Introduction
The topic of a Direct Rendering Manager (DRM) cgroup controller is something which has been proposed a few times in the past, but so far is still missing from the Linux graphics stack. Some of those attempts were focusing on controlling the GPU memory usage aspect, while some were concerned with scheduling. As I am continuing to explore this area as part of my work at Igalia, in this post we will discuss one possible way of implementing the latter.
The general problem we are trying to address is that many GPUs (and their respective kernel drivers) can simultaneously schedule workloads from different clients, and that there are use cases where having external control over scheduling decisions would be beneficial.
But first, let us clarify what we mean by "external control". By that term we refer to scheduling decisions being influenced from outside the actual process doing the rendering. If we were to draw a parallel to CPU scheduling, that would be the difference between a process (or a thread) issuing a system call such as setpriority(2) or nice(2) itself ("internal control"), versus its scheduling priority being modified by an external entity such as the user issuing the renice(1) shell command, launching the executable via the nice(1) shell command, or even using the CPU scheduling cgroup controller ("external control").
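The CPU-side distinction can be illustrated with a short Python sketch using the standard `os` module on Linux. The helper names `lower_own_priority` and `lower_priority_of` are hypothetical and exist only for this illustration; the first corresponds to a process calling nice(2) on itself, the second to what renice(1) does from the outside.

```python
import os
import subprocess
import sys


def lower_own_priority(increment=1):
    """Internal control: the process raises its own nice value."""
    return os.nice(increment)


def lower_priority_of(pid, increment=1):
    """External control: another process raises the nice value of `pid`."""
    current = os.getpriority(os.PRIO_PROCESS, pid)
    target = min(19, current + increment)  # 19 is the highest nice value on Linux
    os.setpriority(os.PRIO_PROCESS, pid, target)
    return os.getpriority(os.PRIO_PROCESS, pid)


if __name__ == "__main__":
    # A child that knows nothing about priorities, like an unmodified application.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
    try:
        print("child nice value is now", lower_priority_of(child.pid, 5))
    finally:
        child.terminate()
        child.wait()
    print("own nice value is now", lower_own_priority(1))
```

The child process contains no priority handling at all, yet its scheduling is still adjusted. That is the property which is currently missing on the GPU side.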
External control has two benefits. Firstly, it is the user who typically knows which tasks are higher priority and which should run in the background, and should therefore be prevented as much as possible from starving the foreground tasks of resources. Secondly, external control can be applied to any process in a unified manner, without the need for applications to individually expose the means to control their scheduling priority.
If we now return to the world of GPU scheduling, we find ourselves in a landscape where internal scheduling control is possible with many GPU drivers, but external control is not. Improving on that involves some technical and conceptual challenges, because GPUs are not as nice and uniform in their scheduling needs and capabilities as CPUs are, but if we were able to come up with something reasonable, even if not perfect, it could bring improvements to the user experience in a variety of scenarios.
Past attempts - Priority based controllers
The earliest attempt I can remember was from 2018, by Matt Roper[1], who proposed to implement a driver-specific priority based controller. The RFC limited itself to i915 (kernel driver for Intel GPUs) and, although the priority-based setup is well established in the world of CPU scheduling, and it is easy to understand its effects, the proposal did not gain much traction.
Because of the aforementioned advantages, when I proposed my version of the controller in 2022[2], it also included a slightly different version of a priority-based controller. In contrast to the earlier one, this proposal was in principle driver-agnostic and the priority levels were also abstracted.
The proposal was also accompanied by benchmark results showing that the approach was effective in allowing users on Linux to launch GPU tasks in the background, while leaving more GPU bandwidth to the foreground task than when not using the controller. Similarly on ChromeOS, when wired into the focused versus un-focused window cgroup management, it was able to demonstrate relatively more GPU time given to the foreground window.
Current proposal - Weight based controller
Anticipating the potential lack of sufficient support for this approach, the same RFC also included a second controller which takes a different route. It abstracts things one step further and implements a weight based controller based on GPU utilisation[3].
The basic idea is that the GPU time budget is split based on relative group weights across the cgroup hierarchy, and that the controller notifies the individual DRM drivers when their clients are over budget. From there it is left for the individual drivers to know how to best manage this situation, depending on the specific scheduling capabilities of the driver and the GPU hardware.
The user interface completely mimics the existing CPU and IO cgroup controllers with the single drm.weight control file. The weights carry no absolute meaning and are only relative within a single group of siblings. Their only purpose is to split out the time budget between them.
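To make the "relative within a group of siblings" point concrete, here is a minimal Python sketch of a proportional split. It is an illustration only and not the controller code from the RFC; the function name `split_budget` and the group names are hypothetical.

```python
def split_budget(parent_budget, sibling_weights):
    """Split a parent's GPU time budget among siblings by relative weight."""
    total = sum(sibling_weights.values())
    if total <= 0:
        raise ValueError("at least one sibling needs a positive weight")
    return {name: parent_budget * w / total for name, w in sibling_weights.items()}


# Only the ratio matters: 100:300 and 1:3 describe the same split.
print(split_budget(1.0, {"a": 100, "b": 300}))
print(split_budget(1.0, {"a": 1, "b": 3}))
```

Both calls yield the same shares, which is what it means for the weights to carry no absolute meaning. Applying the same function at every level of the hierarchy, with each group's result as the `parent_budget` of its children, gives the budget of every leaf group.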
Visually one potential cgroup configuration could look like this:
The DRM cgroup controller then executes a periodic scanning task which queries each DRM client for its GPU usage and notifies drivers when clients are over their allocated budget.
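One pass of such a scan can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the kernel implementation: `scan_once`, its dictionary inputs and the `notify_over_budget` callback (standing in for the driver notification) are all hypothetical names, and the usage numbers are made up.

```python
def scan_once(usage_by_client, budget_by_client, notify_over_budget):
    """Compare each client's measured GPU usage with its budget.

    usage_by_client:  name -> GPU time used, as a fraction of the scan period
    budget_by_client: name -> allocated fraction of GPU time
    notify_over_budget: callback standing in for the driver notification
    """
    notified = []
    for name, usage in usage_by_client.items():
        budget = budget_by_client[name]
        if usage > budget:
            notify_over_budget(name, usage, budget)
            notified.append(name)
    return notified


def log_notification(name, usage, budget):
    print(f"{name} is over budget: using {usage:.0%}, allocated {budget:.0%}")


scan_once({"a": 0.60, "b": 0.20}, {"a": 0.50, "b": 0.50}, log_notification)
```

In the real controller this check would run periodically, and what happens after the notification is left to each driver.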
If we expand the concept with runtime adjustment of group weights based on window focus status, with two graphically active clients such as a game and a web browser, we can end up with the following two scenarios:
Here we show the actual GPU utilisation of each group together with their drm.weight. On the left hand side the web browser is the focused window, with the weights 100-to-10 in its favour.
The compositor is not using its full 200 / (200 + 100) share, so a portion is passed on to the desktop group, to the extent of the full 80% it requires. Inside the desktop group the game is currently using 70%, while its actual allocation is 80% * (10 / (100 + 10)) = 7.27%. It is therefore consuming more than its budget, so the corresponding DRM driver will be notified by the controller and will be able to do something about it.
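The arithmetic of this scenario can be reproduced with a few lines of Python. The variable names are chosen for illustration; the weights and percentages are the ones given in the text, and the 80% desktop budget is taken as given rather than derived.

```python
# Weights from the scenario where the web browser is the focused window.
compositor_weight, desktop_weight = 200, 100
browser_weight, game_weight = 100, 10

# Share the compositor is entitled to among its siblings.
compositor_share = compositor_weight / (compositor_weight + desktop_weight)

# The desktop group requires 80% and receives it, because the compositor
# leaves part of its share unused.
desktop_budget = 0.80

game_budget = desktop_budget * game_weight / (browser_weight + game_weight)
browser_budget = desktop_budget * browser_weight / (browser_weight + game_weight)
game_usage = 0.70

print(f"compositor share: {compositor_share:.2%}")
print(f"game budget:      {game_budget:.2%}")
print(f"browser budget:   {browser_budget:.2%}")
print(f"game over budget: {game_usage > game_budget}")
```

The game budget comes out as 7.27%, matching the figure above, while the game is using 70%, which is why the notification is triggered.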
After the user has given focus to the game window, relative weights will be adjusted and so will the budgets. Now the web browser will be over budget and therefore it can be throttled down, limiting the effect of its background activity on the foreground game window.
First driver implementation - i915
Back when I started developing this idea Intel GPUs were my main focus, which is why i915 was the first driver I wired up with the controller.
There I implemented a rather simple approach of dynamically adjusting the scheduling priority of the throttled contexts by an amount proportional to how far the client is over budget in relative terms.
The implementation would also cross-check against the physical engine utilisation, since in i915 we have easy access to that metric, and only throttle if the latter is close to being fully utilised. (Why this makes sense could be an interesting digression relating to the fact that a single cgroup can in theory contain multiple GPUs and multiple clients using a mix of those GPUs. But let's leave that for later.)
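The two rules just described (a penalty proportional to the relative overshoot, applied only when the engine is nearly saturated) could look roughly like the Python sketch below. This is not the i915 code: the function name, the 0.95 busyness threshold, the scale of 100 priority steps per 100% overshoot and the 1023 cap are all assumptions made for illustration.

```python
def throttled_priority(base_priority, usage, budget, engine_busy,
                       busy_threshold=0.95, max_penalty=1023):
    """Return a lowered context priority for an over-budget client.

    All numeric defaults are illustrative assumptions, not i915 values.
    """
    # Leave the client alone when it is within budget, or when the physical
    # engine still has spare capacity.
    if usage <= budget or engine_busy < busy_threshold:
        return base_priority
    if budget <= 0:
        return base_priority - max_penalty
    # Relative overshoot: 0.0 at budget, 1.0 when using twice the budget.
    overshoot = (usage - budget) / budget
    penalty = min(max_penalty, round(overshoot * 100))
    return base_priority - penalty


# Twice the budget on a saturated engine: priority drops by 100 steps.
print(throttled_priority(0, usage=0.5, budget=0.25, engine_busy=1.0))
# Same overshoot on a half-idle engine: no throttling.
print(throttled_priority(0, usage=0.5, budget=0.25, engine_busy=0.5))
```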
One of the scenarios I used to test how well this works is to run two demanding GPU clients, each in its own cgroup, tweak their relative weights, and see what happens. The results were encouraging and are shown in the following table.
We can see that, when a client's group weight was decreased, the GPU bandwidth it was receiving also went down, as a consequence of the lowered context priority after receiving the over-budget notification.
This is a suitable moment to mention that the DRM cgroup controller does not promise perfect control, that is, achieving the actual GPU sharing ratios as expressed by group-relative weights. As we have mentioned before, GPU scheduling is not nearly at the same level of quality and granularity as in the CPU world, so the goal it sets is simply to improve things - do something which has a positive impact on user experience. At the same time, the mechanism and control interface proposed do not preclude individual drivers from doing as good a job as they can, or even a future possibility of replacing the inner workings of the controller with something smarter, with no need to change the user space control interface.
Going back to the initial i915 implementation, the second test I did was an attempt to wire it up with the background/foreground window focus handling in ChromeOS. There I experimented with a game (Android VM) running in parallel with a WebGL demo in a browser. At a certain point after both clients were running I lowered the weight of the background game, and in the screenshot below we can see how the FPS metric in the browser jumped up.
This illustrates how having the controller can indeed improve the user experience. The user's focus will be on the foreground window, and therefore it does make sense to prioritise GPU access for that client for better interactiveness and smoother rendering there. In fact, in this example the actual FPS jumped from around 48-49 to 60fps, meaning that throttling the background client allowed the foreground one to match its rendering to the display's refresh rate.
Second implementation - amdgpu
AMD's kernel module was the next interesting driver which I wired up with the controller.
The fact that its scheduling is built on top of the DRM scheduler with only three distinct priority levels mandated a different approach to throttling. We keep a sorted list of "most offending" clients (most out of budget, or most borrowed unused budget from the sibling group), with the idea that the top client on that list gets throttled by lowering its scheduling priority. That was relatively straightforward to implement and sounded like it could potentially satisfy the most basic use case of background task isolation.
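A rough Python sketch of that idea is shown below. It is an illustration and not the amdgpu patch: the `Priority` names, the client dictionaries and `throttle_worst_offender` are hypothetical, and the "borrowed unused budget" criterion is left out for brevity.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Three levels, mirroring the three distinct priority levels mentioned
    # above; the names are illustrative.
    LOW = 0
    NORMAL = 1
    HIGH = 2


def throttle_worst_offender(clients):
    """Lower the priority of the client that is furthest over budget.

    clients: list of dicts with "name", "usage", "budget" and "priority".
    Returns the name of the throttled client, or None if nobody is over budget.
    """
    offenders = [c for c in clients if c["usage"] > c["budget"]]
    if not offenders:
        return None
    # Sort so that the most out-of-budget client comes first.
    offenders.sort(key=lambda c: c["usage"] - c["budget"], reverse=True)
    worst = offenders[0]
    if worst["priority"] > Priority.LOW:
        worst["priority"] = Priority(worst["priority"] - 1)
    return worst["name"]


clients = [
    {"name": "glmark2", "usage": 0.8, "budget": 0.1, "priority": Priority.NORMAL},
    {"name": "glxgears", "usage": 0.2, "budget": 0.9, "priority": Priority.NORMAL},
]
print(throttle_worst_offender(clients), clients[0]["priority"].name)
```

With only three levels to move between, a scheme like this can push a background client down but cannot dial in an exact share of GPU time.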
To test the runtime behaviour we set up two sibling cgroups and vary their relative scheduling weights. In one cgroup we run glxgears with vsync turned off and log its frame rate over time, while in the second group we run glmark2.
Let us first look at how the glxgears frame rate varies during this test, depending on three different scheduling weight ratios between the cgroups. The scheduling weight ratio is expressed as glxgears:glmark2, i.e. 10:1 means the glxgears scheduling weight was ten times that configured for glmark2.
We can observe that, as the glmark2 is progressing through its various sub-benchmarks, glxgears frame rate is changing too. But it was overall higher in the runs where the scheduling weight ratio was in its favour. That is a positive result showing that even a simple implementation seems to be having the desired effect, at least to some extent.
For the second test we can look from the perspective of glmark2, checking how the benchmark scores change depending on the ratio of scheduling weights.
Again we see that the scores are generally improving when the scheduling weight ratio is increased in favour of the benchmark.
However, in neither case is the change in the result proportional to the actual ratios. This is because the primitive implementation is not able to precisely limit the "background" client, but is only able to achieve some throttling. Also, there is an inherent delay in how fast the controller can react, given that the control loop is based on periodic scanning. This period is configurable and was set to two seconds for the above tests.
Conclusion
Hopefully this write-up has managed to demonstrate two main points:
- First, that a generic and driver-agnostic approach to a DRM scheduling cgroup controller can improve user experience and enable new use cases, while at the same time following the established control interface as it exists for CPU and IO control, which makes it future-proof and extensible;
- Secondly, that even relatively basic driver implementations can be somewhat effective in providing positive control effects.
It also probably needs to be reiterated that neither the driver implementations nor the cgroup controller implementation itself are limited by the user interface proposed. Both could be independently improved under the hood in the future.
What is next? There is more work to be done, such as conducting more detailed testing, polishing the implementation and potentially attempting to wire up more drivers to the controller. Further advocacy work in the DRM community is needed too.
References
04 Sep 2024 12:00am GMT