23 Nov 2020


Mike Blumenkrantz: Don’t Call It A Comeback

I've Been Here For…

I guess I never left, really, since I've been vicariously living the life of someone who still writes zink patches through reviewing and discussing some great community efforts that are ongoing.

But now I'm back living that life of someone who writes zink patches.

Valve has generously agreed to sponsor my work on graphics-related projects.

For the time being, that work happens to be zink.

Ambition

I don't want to just make a big post about leaving and then come back after a couple weeks like nothing happened.

It's 2020.

We need some sort of positive energy and excitement here.

As such, I'm hereby announcing Operation Oxidize, an ambitious endeavor between me and the formidably skillful Erik Faye-Lund of Collabora.

We're going to land 99% of zink-wip into mainline Mesa by the end of the year, bringing the driver up to basic GL 4.6 and ES 3.2 support with vastly improved performance.

Or at least, that's the goal.

Will we succeed?

Stay tuned to find out!

23 Nov 2020 12:00am GMT

22 Nov 2020


Mike Blumenkrantz: Roundup 20201122

Another Brief Review

This was a (relatively) quiet week in zink-world. Here are some updates, once more in no particular order:

Stay tuned for further updates.

22 Nov 2020 12:00am GMT

21 Nov 2020


Hans de Goede: Acer Aspire Switch 10 E SW3-016's and SW5-012's and S1002's horrible EFI firmware

Recently I acquired an Acer Aspire Switch 10 E SW3-016; this device was the main reason for writing my blog post about the shim boot loop. The EFI firmware of this device is bad in a number of ways:


  1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file.
  2. But it will actually boot EFI/Boot/bootx64.efi! (wait, what? yes, really)
  3. It will only boot from a USB disk connected to its micro-USB connector, not from the USB-A connector on the keyboard-dock.
  4. You must first set a BIOS admin password before you can disable secure-boot (which is necessary to boot home-built kernels without doing your own signing).
  5. Last but not least, it has one more nasty "feature": it detects whether the OS being booted is Windows, Android or unknown, and it updates the ACPI DSDT based on this!

Some more details on the OS-detection misfeature. The ACPI Device (SDHB) node for the MMC controller connected to the SDIO wifi module contains:

Name (WHID, "80860F14")
Name (AHID, "INT33BB")


Depending on what OS the BIOS thinks it is booting, it renames one of these 2 to _HID. This is weird given that it will only boot if EFI/Microsoft/Boot/bootmgfw.efi exists, but it still does this. Worse, it looks at the actual contents of EFI/Boot/bootx64.efi for this. It seems that that file must be signed; otherwise the firmware goes into OS-unknown mode and keeps the 2 DSDT bits above as-is, so there is no _HID defined for the wifi's MMC controller and thus no wifi. I hit this issue when I replaced EFI/Boot/bootx64.efi with grubx64.efi to break the bootloop. grubx64.efi is not signed, so the DSDT as Linux saw it contained the AML code above. Using the proper workaround for the bootloop from my previous blog post, this bit of the DSDT morphs into:

Name (_HID, "80860F14")
Name (AHID, "INT33BB")


And the wifi works.
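
If you want to check which _HID your firmware actually handed to Linux, you can dump and disassemble the DSDT yourself. A quick sketch, assuming the acpica-tools (acpidump / iasl) are installed and using the SDHB device name quoted above:

sudo acpidump -b                        # dump each ACPI table to a .dat file in the current directory
iasl -d dsdt.dat                        # disassemble the DSDT into dsdt.dsl
grep -B1 -A3 'Device (SDHB)' dsdt.dsl   # shows whether WHID or AHID got renamed to _HID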

The Acer Aspire Switch 10 E SW3-016's firmware also triggers an actual bug / issue in Linux' ACPI implementation, causing the bluetooth to not work. This is discussed in much detail here. I have a patch series fixing this here.

And the older Acer Aspire Switch 10 SW5-012's and S1002's firmware has some similar issues:


  1. It considers its eMMC unbootable unless its ESP contains an EFI/Microsoft/Boot/bootmgfw.efi file
  2. These models will actually always boot the EFI/Microsoft/Boot/bootmgfw.efi file, so that is somewhat more sensible.
  3. On the SW5-012 you must first set a BIOS admin password before you can disable secure-boot.
  4. The SW5-012 is missing an ACPI device node for the PWM controller used for controlling the backlight brightness. I guess that the Windows gfx driver (the equivalent of i915) just directly pokes the registers (which are in a whole other IP block), rather than relying on a separate PWM driver as Linux does. Unfortunately there is no way to fix this other than using a DSDT overlay. I have a DSDT overlay for the v1.20 BIOS (and only for the v1.20 BIOS) available here.

Because of 1. and 2. you need to take the following steps to get Linux to boot on the Acer Aspire Switch 10 SW5-012 or the S1002:


  1. Rename the original bootmgfw.efi (so that you can chainload it in the multi-boot case)
  2. Replace bootmgfw.efi with shimia32.efi
  3. Copy EFI/fedora/grubia32.efi to EFI/Microsoft/Boot

This assumes that you have the files from a 32 bit Windows install in your ESP already.
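
Concretely, those three steps look something like this when run from the root of the mounted ESP (a sketch: the bootmgfw-orig.efi backup name and the EFI/fedora source paths are just examples, adjust to your distro and layout):

cd /boot/efi                                                                   # or wherever your ESP is mounted
sudo mv EFI/Microsoft/Boot/bootmgfw.efi EFI/Microsoft/Boot/bootmgfw-orig.efi   # 1. rename the original so it can still be chainloaded
sudo cp EFI/fedora/shimia32.efi EFI/Microsoft/Boot/bootmgfw.efi                # 2. shim now sits where the firmware expects bootmgfw.efi
sudo cp EFI/fedora/grubia32.efi EFI/Microsoft/Boot/                            # 3. grub next to it so shim can find it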

21 Nov 2020 8:58pm GMT

20 Nov 2020


Pekka Paalanen: Developing Wayland Color Management and High Dynamic Range

(This post was first published with Collabora on Nov 19, 2020.)

Wayland (the protocol and architecture) is still lacking proper consideration for color management. Wayland also lacks support for high dynamic range (HDR) imagery, which has been around in the movie and broadcasting industries for a while now (e.g. Netflix HDR UI).

While there are well established tools and workflows for how to do color management on X11, even X11 has not gained support for HDR. There were plans for it (Alex Goins, DeepColor Visuals), but as far as I know nothing really materialized from them. Right now, the only way to watch HDR content on a HDR monitor in Linux is to use the DRM KMS API directly, in other words, not use any window system, which means not using any desktop environment. Kodi is one of the very few applications that can do this at all.

This is a story about starting the efforts to fix the situation on Wayland.

History and People

Color management for Wayland has been talked about on and off for many years by dozens of people. To me it was obvious from the start that color management architecture on Wayland must be fundamentally different from X11. I thought the display server must be part of the color management stack instead of an untrusted, unknown entity that must be bypassed and overridden by applications that fight each other for who gets to configure the display. This opinion was wildly controversial and it took a long time to get my point across, but over the years some color management experts started to open up to new ideas and other people joined in the opinion as well. Whether these new ideas are actually better than the ways of old remains to be seen, though. I think the promise of getting everything and more to work better is far too great to not try it out.

The discussions started several times over the years, but they always dried out mostly without any tangible progress. Color management is a wide, deep and difficult topic, and the required skills, knowledge, interest, and available time did not come together until fairly recently. People did write draft protocol extensions, but I would claim that it was not really until Sebastian Wick started building on top of them that things started moving forward. But one person cannot push such a huge effort alone even for the simple reason that there must be at least one reviewer before anything can be merged upstream. I was very lucky that since summer 2020 I have been able to work on Wayland color management and HDR for improving ChromeOS, letting me support Sebastian's efforts on a daily basis. Vitaly Prosyak joined the effort this year as well, researching how to combine the two seemingly different worlds of ICC and HDR, and how tone-mapping could be implemented.

I must also note the past efforts of Harish Krupo, who submitted a major Weston merge request, but unfortunately at the time reviewers in Weston upstream were not much available. Even before that, there were experiments by Ville Syrjälä. All these are now mostly superseded by the on-going work.

Currently the active people around the topic are me (Collabora), Vitaly Prosyak (AMD), and Naveen Kumar (Intel). Sebastian Wick (unaffiliated) is still around as well. None of us is a color management or HDR expert by trade, so we are all learning things as we go.

Design

The foundation of the color management protocol is ICC profile files for describing both output and content color spaces. The aim is for ICCv4, while also allowing ICCv2, as these are well known and well supported in general. Adding iccMAX support or anything else will be possible any time in the future.

As color management is all about color spaces and gamuts, and high dynamic range (HDR) is also very much about color spaces and gamuts plus extended luminance range, Sebastian and I decided that the Wayland color management extension should cater for both from the beginning. Combining traditional color management and HDR is a fairly new thing as far as I know, and I'm not sure we have much prior art to base upon, so this is an interesting research journey as well. There is a lot of prior art on HDR and color management separately, but they tend to have fundamental differences that make the combination not obvious.

To help us keep focused and explain to the community what we actually intend with Wayland color management and HDR support, I wrote the section "Wayland Color Management and HDR Design Goals" in color.rst (draft). I very much recommend reading it so that you get a picture of what we (or I, at least) want to aim for.

Elle Stone explains in her article how color management should work on X11. As I wanted to avoid repeating the massive email threads that were had on the wayland-devel mailing list, I wrote the section "Color Pipeline Overview" in color.rst (draft) more or less as a response to her article, trying to explain in what ways Wayland will be different from X11. I think that understanding that section is paramount before anyone makes any comment on our efforts with the Wayland protocol extension.

HDR brings even more reasons to put color space conversions in the display server than just the idea that all applications should be color managed if not explicitly then implicitly. Most of the desktop applications (well, literally all right now) are using Standard Dynamic Range (SDR). SDR is a fuzzy concept referring to all traditional, non-HDR image content. Therefore, your desktop is usually 100% SDR. You run your fancy new HDR monitor in SDR mode, which means it looks just like any old monitor with nothing fancy. What if you want to watch a HDR video? The monitor won't display HDR in SDR mode. If you simply switch the monitor to HDR mode, you will be blinded by all the over-bright SDR applications. Switching monitor modes may also cause flicker and take a bit of time. That would be a pretty bad user experience, right?

A solution is to run your monitor in HDR mode all the time, and have the window system compositor convert all SDR application windows appropriately to the HDR luminance, so that they look normal in spite of the HDR mode. There will always be applications that will never support HDR at all, so the compositor doing the conversion is practically the only way.

For the protocol, we are currently exploring the use of relative luminance. The reason is that people look at monitors in wildly varying viewing environments, under standard office lighting for example. The environment and personal preferences affect what monitor brightness you want. Also monitors themselves can be wildly different in their capabilities. Most prior art on HDR uses absolute luminance, but absolute luminance has the problem that it assumes a specific viewing environment, usually a dark room, similar to a movie theatre. If a display server would show a movie with the absolute luminance it was mastered for, in most cases it would be far too dark to see. Whether using relative luminance at the protocol level turns out to be a good idea or not, we shall see.

Development

The Wayland color management and HDR protocol extension proposal is known as wayland/wayland-protocols!14 (MR14). Because it is a very long running merge request (the bar for landing a new protocol into wayland-protocols is high) and there are several people working on it, we started using sub-merge-requests to modify the proposal. You can find the sub-MRs in Sebastian's fork. If you have a change to propose, that is how to do it.

Obviously using sub-MRs also splits the review discussions into multiple places, but in this case I think it is a good thing, because the discussion threads in Gitlab are already massive.

There are several big and small open questions we haven't had the time to tackle yet even among the active group; questions that I feel we should have some tentative answers to before asking for wider community comments. There is also no set schedule, so don't hold your breath. This work is likely to take months still before there is a complete tentative protocol, and probably years until these features are available in your favourite Wayland desktop environments.

If you are an expert on the topics of color management or HDR displays and content, you are warmly welcome to join the development.

If you are an interested developer or an end user looking to try out things, sorry, there is nothing really for you yet.

20 Nov 2020 5:44pm GMT

18 Nov 2020


Hans de Goede: How to fix Linux EFI secure-boot shim bootloop issue

How to fix the Linux EFI secure-boot shim bootloop issue seen on some systems.

Quite a few Bay- and Cherry-Trail based systems have bad firmware which completely ignores any efibootmgr-set boot options. They basically completely reset the boot order, doing some sort of auto-detection at boot. Some of these will even give an error about their eMMC not being bootable unless the ESP has an EFI/Microsoft/Boot/bootmgfw.efi file!

Many of these end up booting EFI/Boot/bootx64.efi unconditionally every boot. This will cause a boot loop since, when Linux is installed, EFI/Boot/bootx64.efi is now shim. When shim is started with a path of EFI/Boot/bootx64.efi, shim will add a new efibootmgr entry pointing to EFI/fedora/shimx64.efi and then reset. The goal of this is so that the firmware's F12 boot menu can be used to easily switch between Windows and Linux (without chainloading, which breaks bitlocker). But since these bad EFI implementations ignore efibootmgr stuff, the EFI/Boot/bootx64.efi shim will run again after the reset and we have a loop.

There are 2 ways to fix this loop:

1. The right way: Stop shim from trying to add a boot entry pointing to EFI/fedora/shimx64.efi:

rm EFI/Boot/fbx64.efi
cp EFI/fedora/grubx64.efi EFI/Boot


The first command will stop shim from trying to add a new efibootmgr entry (it calls fbx64.efi to do that for it); instead it will try to execute grubx64.efi from the directory from which it was executed, so we must put a grubx64.efi in the EFI/Boot dir, which the second command does. Do not use the livecd EFI/Boot/grubx64.efi file for this as I did at first; that one searches for its config and env under EFI/Boot, which is not what we want.

Note that upgrading shim will restore EFI/Boot/fbx64.efi. To avoid this you may want to back up EFI/Boot/bootx64.efi, then do "sudo rpm -e shim-x64" and then restore the backup.
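
That sequence looks something like this (a sketch assuming a Fedora system with the ESP mounted at /boot/efi; the backup path is arbitrary):

sudo cp /boot/efi/EFI/Boot/bootx64.efi ~/bootx64.efi.backup        # keep a copy of shim itself
sudo rpm -e shim-x64                                               # with the package gone, updates can no longer put fbx64.efi back
sudo cp ~/bootx64.efi.backup /boot/efi/EFI/Boot/bootx64.efi        # restore shim so the machine still boots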

2. The wrong way: Replace EFI/Boot/bootx64.efi with a copy of EFI/fedora/grubx64.efi

This is how I used to do this until hitting the scenario which caused me to write this blog post. There are 2 problems with this:

2a) This requires disabling secure-boot (which I could live with so far)
2b) Some firmwares change how they behave, exporting a different DSDT to the OS depending on whether EFI/Boot/bootx64.efi is signed or not (even with secure boot disabled), and their behavior is totally broken when it is not signed. I will post another rant ^W blogpost about this soon. For now let's just say that you should use workaround 1 from above, since it simply is a better workaround.

Note that for better readability the above text uses bootx64, shimx64, fbx64 and grubx64 throughout. When using a 32 bit EFI (which is typical on Bay Trail systems) you should replace these with bootia32, shimia32, fbia32 and grubia32. Note that 32 bit EFI Bay Trail systems should still use a 64 bit Linux distro; the firmware being 32 bit is a weird Windows-related thing.

Also note that your system may use another key than F12 to show the firmware's boot menu.

18 Nov 2020 9:05am GMT

15 Nov 2020


Mike Blumenkrantz: Roundup 20201115

A Brief Review

As time/sanity permit, I'll be trying to do roundup posts for zink happenings each week. Here's a look back at things that happened, in no particular order:

15 Nov 2020 12:00am GMT

13 Nov 2020


Dave Airlie (blogspot): lavapipe: a *software* swrast vulkan layer FAQ

(project was renamed from vallium to lavapipe)

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.

tl;dr: DO NOT USE LAVAPIPE OVER A GALLIUM HW DRIVER.

What is lavapipe?

The lavapipe layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a low-level API; it allows the user to allocate memory, create resources, and record command buffers, amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw-specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
In order to bridge the gap, the lavapipe layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.

That sounds horrible, isn't it slow?

Yes.

Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow; the Vulkan->gallium translation overhead isn't going to be the reason it gets much slower.
For real HW drivers which are meant to record their own command buffers in the GPU domain and submit them direct to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead and one that can't easily be removed from the lavapipe layer.

The lavapipe execution context is also pretty horrible: it has to connect all the state pieces like shaders etc. to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue and one context to be used. A lot of hardware exposes more queues etc. that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient driver hw usage. They are one of the most difficult to understand and hard to get right pieces of writing a vulkan driver. For a software rasterizer they are also mostly unneeded. When I get a barrier I just completely hardflush the gallium context because I know the sw driver behind it. For a real hardware driver this would be a horrible solution. You spend a lot of time trying to make anything optimal here.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or can't expose the features needed for Vulkan, this can't be fixed with a software layer that just introduces overhead.


13 Nov 2020 2:16am GMT

Adam Jackson: on abandoning the X server

There's been some recent discussion about whether the X server is abandonware. As the person arguably most responsible for its care and feeding over the last 15 years or so, I feel like I have something to say about that.

The thing about being the maintainer of a public-facing project for nearly the whole of your professional career is it's difficult to separate your own story from the project. So I'm not going to try to be dispassionate, here. I started working on X precisely because free software had given me options and capabilities that really matter, and I feel privileged to be able to give that back. I can't talk about that without caring about it.

So here's the thing: X works extremely well for what it is, but what it is is deeply flawed. There's no shame in that, it's 33 years old and still relevant, I wish more software worked so well on that kind of timeframe. But using it to drive your display hardware and multiplex your input devices is choosing to make your life worse.

It is, however, uniquely well suited to a very long life as an application compatibility layer. Though the code happens to implement an unfortunate specification, the code itself is quite well structured, easy to hack on, and not far off from being easily embeddable.

The issue, then, is how to get there. And I don't have any real desire to get there while still pretending that the xfree86 hardware-backed server code is a real thing. Sorry, I guess, but I've worked on xfree86-derived servers for very nearly as long as XFree86-the-project existed, and I am completely burnt out on that on its own merits, let alone doing that and also being release manager and reviewer of last resort. You can only apply so much thrust to the pig before you question why you're trying to make it fly at all.

So, is Xorg abandoned? To the extent that that means using it to actually control the display, and not just keep X apps running, I'd say yes. But xserver is more than xfree86. Xwayland, Xwin, Xephyr, Xvnc, Xvfb: these are projects with real value that we should not give up. A better way to say it is that we can finally abandon xfree86.

And if that sounds like a world you'd like to see, please, come talk to us, let's make it happen. I'd be absolutely thrilled to see someone take this on, and I'm happy to be your guide through the server internals.

13 Nov 2020 12:03am GMT

12 Nov 2020


Dave Airlie (blogspot): Linux graphics, why sharing code with Windows isn't always a win.

A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.

I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.

tl;dr there is a big difference between open source released and open source developed projects in terms of sustainability and community.

The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open source, vendor agnostic practices. There is no vendor controlling either project and both projects have a goal of trying to maximise shared code and shared processes/coding standards across drivers from all vendors.

This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.

The value to distros is they have central places to pick up driver stacks with good release cycles and a minimal number of places they have to deal with to interact with those communities. Now usually hardware vendors don't see the value in the external communities as much as Linux distros do. From a hardware vendor internal point of view they see more benefit in creating a single stack shared between their Windows and Linux drivers to maximise their return on investment, or make their orgchart prettier, or produce fewer powerpoints about why their orgchart isn't optimal.

A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.

Why is it a bad idea?

I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of open source development model, however most vendors seem to fail at this. They see open source as a release model, they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of cycles. They build products containing these open source pieces, but they never expend the time building projects or communities around them.

As an example take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee. Even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa, where it allowed Valve to contribute the ACO backend compiler and provide better results than AMD vendor shared code could ever have done, with far less investment and manpower.

Intel have a non-mesa compiler called the Intel Graphics Compiler mentioned in the article. This is fully developed by Intel internally; there is little info on project direction, how to get involved, or where the community is. There doesn't seem to be much public review, and patches seem to get merged to the public repo by igcbot, which may mean they are being mirrored from some internal repo. They are not using github merge requests etc. Compare this to development of a Mesa NIR backend, where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.

One area where it has mostly sort of worked out is with the AMD display code in the kernel. I believe this code to be shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring was a pain. So even in the best case it's not an optimal outcome even for the vendor. They have to work hard to make the shared code be capable of supporting different OS interactions.

How would I do it?

If I had to share Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example if I needed to create a Windows GL driver, I could:

a) write a complete GL implementation, throw it over the wall every few weeks, and make Windows/Linux use it. Linux users lose out on the shared stack, distros lose out by having to build a stack of multiple per-vendor deps instead of one dependency, Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).

b) use Mesa and upstream my driver to share with the Linux stack, adding the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors, Windows gains that benefit, and Linux retains the benefits to its ecosystem.

A warning then to anyone wishing for more vendor code sharing between OSes: it generally doesn't end with Linux being better off, it ends up with Linux being more fragmented, harder to support, and in the long run unsustainable.


12 Nov 2020 12:05am GMT

06 Nov 2020


Jason Ekstrand: Getting the most out of your Intel integrated GPU on Linux

About a year ago, I got a new laptop: a late 2019 Razer Blade Stealth 13. It sports an Intel i7-1065G7 with the best of Intel's Ice Lake graphics along with an NVIDIA GeForce GTX 1650. Apart from needing an ACPI lid quirk and the power management issues described here, it's been a great laptop so far and the Linux experience has been very smooth.

Unfortunately, the out-of-the-box integrated graphics performance of my new laptop was less than stellar. My first task with the new laptop was to debug a rendering issue in the Linux port of Shadow of the Tomb Raider which turned out to be a bug in the game. In the process, I discovered that the performance of the game's built-in benchmark was almost half of Windows. We've had some performance issues with Mesa from time to time on some games but half seemed a bit extreme. Looking at system-level performance data with gputop revealed that the GPU clock rate was unable to get above about 60-70% of the maximum in spite of the GPU being busy the whole time. Why? The GPU wasn't able to get enough power. Once I sorted out my power management problems, the benchmark went from about 50-60% the speed of Windows to more like 104% the speed of Windows (yes, that's more than 100%).

This blog post is intended to serve as a bit of a guide to understanding memory throughput and power management issues and configuring your system properly to get the most out of your Intel integrated GPU. Not everything in this post will affect all laptops so you may have to do some experimentation with your system to see what does and does not matter. I also make no claim that this post is in any way complete; there are almost certainly other configuration issues of which I'm not aware or which I've forgotten.

Update your drivers

This should go without saying but if you want the best performance out of your hardware, running the latest drivers is always recommended. This is especially true for hardware that has just been released. Generally, for graphics, most of the big performance improvements are going to be in Mesa but your Linux kernel version can matter as well. In the case of Intel Ice Lake processors, some of the power management features aren't enabled until Linux 5.4.

I'm not going to give a complete guide to updating your drivers here. If you're running a distro like Arch, chances are that you're already running something fairly close to the latest available. If you're on Ubuntu, the padoka PPA provides versions of the userspace components (Mesa, X11, etc.) that are usually no more than about a week out-of-date, but upgrading your kernel is more complicated. Other distros may have something similar, but I'll leave that as an exercise to the reader.

This doesn't mean that you need to be obsessive about updating kernels and drivers. If you're happy with the performance and stability of your system, go ahead and leave it alone. However, if you have brand new hardware and want to make sure you have new enough drivers, it may be worth attempting an update. Or, if you have the patience, you can just wait 6 months for the next distro release cycle and hope to pick up with a distro update.

Make sure you have dual-channel RAM

One of the big bottleneck points in 3D rendering applications is memory bandwidth. Most standard monitors run at a resolution of 1920x1080 and a refresh rate of 60 Hz. A 1920x1080 RGBA (32bpp) image is just shy of 8 MiB in size and, if the GPU is rendering at 60 FPS, that adds up to about 474 MiB/s of memory bandwidth to write out the image every frame. If you're running a 4K monitor, multiply by 4 and you get about 1.8 GiB/s. Those numbers are only for the final color image, assume we write every pixel of the image exactly once, and don't take into account any other memory access. Even in a simple 3D scene, there are other images than just the color image being written such as depth buffers or auxiliary gbuffers, each pixel typically gets written more than once depending on app over-draw, and shading typically involves reading from uniform buffers and textures. Modern 3D applications typically also have things such as depth pre-passes, lighting passes, and post-processing filters for depth-of-field and/or motion blur. The result of this is that actual memory bandwidth for rendering a 3D scene can be 10-100x the bandwidth required to simply write the color image.
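
If you want to double-check those numbers, the arithmetic is simple enough to do in a shell:

echo $((1920 * 1080 * 4))        # bytes in one 1080p RGBA frame: 8294400, just shy of 8 MiB
echo $((1920 * 1080 * 4 * 60))   # per second at 60 FPS: 497664000, roughly 474 MiB/s
echo $((3840 * 2160 * 4 * 60))   # the same for 4K: 1990656000, roughly 1.8 GiB/s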

Because of the incredible amount of bandwidth required for 3D rendering, discrete GPUs use memories which are optimized for bandwidth above all else. These go by different names such as GDDR6 or HBM2 (current as of the writing of this post) but they all use extremely wide buses and access many bits of memory in parallel to get the highest throughput they can. CPU memory, on the other hand, is typically DDR4 (current as of the writing of this post) which runs on a narrower 64-bit bus and so the over-all maximum memory bandwidth is lower. However, as with anything in engineering, there is a trade-off being made here. While narrower buses have lower over-all throughput, they are much better at random access which is necessary for good CPU memory performance when crawling complex data structures and doing other normal CPU tasks. When 3D rendering, on the other hand, the vast majority of your memory bandwidth is consumed in reading/writing large contiguous blocks of memory and so the trade-off falls in favor of wider buses.

With integrated graphics, the GPU uses the same DDR RAM as the CPU so it can't get as much raw memory throughput as a discrete GPU. Some of the memory bottlenecks can be mitigated via large caches inside the GPU but caching can only do so much. At the end of the day, if you're fetching 2 GiB of memory to draw a scene, you're going to blow out your caches and load most of that from main memory.

The good news is that most motherboards support a dual-channel RAM configuration where, if your DDR units are installed in identical pairs, the memory controller will split memory access between the two DDR units in the pair. This has similar benefits to running on a 128-bit bus but without some of the drawbacks. The result is about a 2x improvement in over-all memory throughput. While this may not affect your CPU performance significantly outside of some very special cases, it makes a huge difference to your integrated GPU which cares far more about total throughput than random access. If you are unsure how your computer's RAM is configured, you can run "dmidecode -t memory" and see if you have two identical devices reported in different channels.
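
For example, filtering the output down to the interesting fields makes it easy to eyeball (exact field names vary a little between firmwares):

sudo dmidecode -t memory | grep -E 'Locator|Size|Speed'   # two identical populated devices in different channels = dual-channel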

Power management 101

Before getting into the details of how to fix power management issues, I should explain a bit about how power management works and, more importantly, how it doesn't. If you don't care to learn about power management and are just here for the system configuration tips, feel free to skip this section.

Why is power management important? Because the clock rate (and therefore the speed) of your CPU or GPU is heavily dependent on how much power is available to the system. If it's unable to get enough power for some reason, it will run at a lower clock rate and you'll see that as processes taking more time or lower frame rates in the case of graphics. There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop. However, there are some things which you can do from a system configuration perspective which can greatly affect power management and your performance.

First, we need to talk about thermal design power, or TDP. There are a lot of misunderstandings about TDP on the internet and we need to clear some of them up. Wikipedia defines TDP as "the maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate under any workload." The Intel Product Specifications site defines TDP as follows:

Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload. Refer to Datasheet for thermal solution requirements.

In other words, the TDP value provided on the Intel spec sheet is a pretty good design target for OEMs but doesn't provide nearly as many guarantees as one might hope. In particular, there are several things that the TDP value on the spec sheet is not:

If you look at the Intel Product Specifications page for the i7-1065G7, you'll see three TDP values: the nominal TDP of 15W, a configurable TDP-up value of 25W and a configurable TDP-down value of 12W. The nominal TDP (simply called "TDP") is the base TDP which is enough for the CPU to run all of its cores at the base frequency which, given sufficient cooling, it can do in the steady state. The TDP-up and TDP-down values provide configurability that gives the OEM options when they go to make a laptop based on the i7-1065G7. If they're making a performance laptop like Razer and are willing to put in enough cooling, they can configure it to 25W and get more performance. On the other hand, if they're going for battery life, they can put the exact same chip in the laptop but configure it to run as low as 12W. They can also configure the chip to run at 12W or 15W and then ship software with the computer which will bump it to 25W once Windows boots up. We'll talk more about this reconfiguration later on.

Beyond just the numbers on the spec sheet, there are other things which may affect how much power the chip can get. One of the big ones is cooling. The law of conservation of energy dictates that energy is never created or destroyed. In particular, your CPU doesn't really consume energy; it turns that electrical energy into heat. For every Watt of electrical power that goes into the CPU, a Watt of heat has to be pumped out by the cooling system. (Yes, a Watt is also a measure of heat flow.) If the CPU is using more electrical energy than the cooling system can pump back out, energy gets temporarily stored in the CPU as heat and you see this as the CPU temperature rising. Eventually, however, the CPU has to back off and let the cooling system catch up or else that built up heat may cause permanent damage to the chip.

Another thing which can affect CPU power is the actual power delivery capabilities of the motherboard itself. In a desktop, the discrete GPU is typically powered directly by the power supply and it can draw 300W or more without affecting the amount of power available to the CPU. In a laptop, however, you may have more power limitations. If you have multiple components requiring significant amounts of power such as a CPU and a discrete GPU, the motherboard may not be able to provide enough power for both of them to run flat-out so it may have to limit CPU power while the discrete GPU is running. These types of power balancing decisions can happen at a very deep firmware level and may not be visible to software.

The moral of this story is that the TDP listed on the spec sheet for the chip isn't what matters; what matters is how the chip is configured by the OEM, how much power the motherboard is able to deliver, and how much power the cooling system is able to remove. Just because two laptops have the same processor with the same part number doesn't mean you should expect them to get the same performance. This is unfortunate for laptop buyers but it's the reality of the world we live in. There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop. However, there are some things which you can do from a system configuration perspective and that's what we'll talk about next.

If you want to experiment with your system and understand what's going on with power, there are two tools which are very useful for this: powertop and turbostat. Both are open-source and should be available through your distro package manager. I personally prefer the turbostat interface for CPU power investigations but powertop is able to split your power usage up per-process which can be really useful as well.
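
Both need root to read the relevant counters; typical invocations while a game or benchmark is running look something like this:

sudo turbostat --quiet --interval 5   # per-core frequencies plus package/graphics power, sampled every 5 seconds
sudo powertop                         # interactive view, including per-process power estimates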

Update GameMode to at least version 1.5

About two and a half years ago (1.0 was released in May of 2018), Feral Interactive released their GameMode daemon, which is able to tweak some of your system settings when a game starts up to get maximal performance. One of the settings that GameMode tweaks is your CPU performance governor. By default, GameMode will set it to "performance" when a game is running. While this seems like a good idea ("performance" is better, right?), it can actually be counterproductive on integrated GPUs and cause you to get worse over-all performance.

Why would the "performance" governor cause worse performance? First, understand that the names "performance" and "powersave" for CPU governors are a bit misleading. The powersave governor isn't just for when you're running on battery and want to use as little power as possible. When on the powersave governor, your system will clock all the way up if it needs to and can even turbo if you have a heavy workload. The difference between the two governors is that the powersave governor tries to give you as much performance as possible while also caring about power; it's quite well balanced. Intel typically recommends the powersave governor even in data centers because, even though they have piles of power and cooling available, data centers typically care about their power bill. The performance governor, on the other hand, doesn't care about power consumption and only cares about getting the maximum possible performance out of the CPU so it will typically burn significantly more power than needed.

So what does this have to do with GPU performance? On an integrated GPU, the GPU and CPU typically share a power budget and every Watt of power the CPU is using is a Watt that's unavailable to the GPU. In some configurations, the TDP is enough to run both the GPU and CPU flat-out but that's uncommon. Most of the time, however, the CPU is capable of using the entire TDP if you clock it high enough. When running with the performance governor, that extra unnecessary CPU power consumption can eat into the power available to the GPU and cause it to clock down.

This problem should be mostly fixed as of GameMode version 1.5 which adds an integrated GPU heuristic. The heuristic detects when the integrated GPU is using significant power and puts the CPU back to using the powersave governor. In the testing I've done, this pretty reliably chooses the powersave governor in the cases where the GPU is likely to be TDP limited. The heuristic is dynamic so it will still use the performance governor if the CPU power usage way overpowers the GPU power usage such as when compiling shaders at a loading screen.

What do you need to do on your system? First, check what version of GameMode you have installed on your system (if any). If it's version 1.4 or earlier and you intend to play games on an integrated GPU, I recommend either upgrading GameMode or disabling or uninstalling the GameMode daemon.
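
To check where your system currently stands, you can query the GameMode daemon's version and look at which governor is in use (standard cpufreq sysfs paths; the -v flag is present in recent gamemoded releases, otherwise check your package manager):

gamemoded -v                                                # prints the installed GameMode daemon version
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # current governor per CPU ("powersave" or "performance")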

Use thermald

In "power management 101" I talked about how sometimes OEMs will configure a laptop to 12W or 15W in BIOS and then re-configure it to 25W in software. This is done via the "Intel Dynamic Platform and Thermal Framework" driver on Windows. The DPTF driver manages your over-all system thermals and keep the system within its thermal budget. This is especially important for fanless or ultra-thin laptops where the cooling may not be sufficient for the system to run flat-out for long periods. One thing the DPTF driver does is dynamically adjust the TDP of your CPU. It can adjust it both up if the laptop is running cool and you need the power or down if the laptop is running hot and needs to cool down. Some OEMs choose to be very conservative with their TDP defaults in BIOS to prevent the laptop from overheating or constantly running hot if the Windows DPTF driver is not available.

On Linux, the equivalent to this is thermald. When installed and enabled on your system, it reads the same OEM configuration data from ACPI as the windows DPTF driver and is also able to scale up your package TDP threshold past the BIOS default as per the OEM configuration. You can also write your own configuration files if you really wish but you do so at your own risk.

Most distros package thermald but it may not be enabled nor work quite properly out-of-the-box. This is because, historically, it has relied on the closed-source dptfxtract utility that's provided by Intel as a binary. It requires dptfxtract to fetch the OEM provided configuration data from the ACPI tables. Since most distros don't usually ship closed-source software in their main repositories and since thermald doesn't do much without that data, a lot of distros don't bother to ship or enable it by default. You'll have to turn it on manually.

To fix this, install both thermald and dptfxtract and ensure that thermald is enabled. On most distros, thermald is packaged normally even if it isn't enabled by default because it is open-source. The dptfxtract utility is usually available in your distro's non-free repositories. On Ubuntu, dptfxtract is available as a package in multiverse. For Fedora, dptfxtract is available via RPM Fusion's non-free repo. There are also packages for Arch and likely others as well. If no one packages it for your distro, it's just one binary so it's pretty easy to install manually.
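
On Fedora, for example, the whole thing boils down to something like this (assuming the RPM Fusion non-free repo is already enabled; package and repo names differ on other distros):

sudo dnf install thermald dptfxtract   # thermald from the main repo, dptfxtract from RPM Fusion non-free
sudo systemctl enable --now thermald   # start it now and on every boot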

Some of this may change going forward, however. Recently, Matthew Garrett did some work to reverse-engineer the DPTF framework and provide support for fetching the DPTF data from ACPI without the need for the binary blob. When running with a recent kernel and Matthew's fork of thermald, you should be able to get OEM-configured thermals without the need for the dptfxtract blob, at least on some hardware. Whether or not you get the right configuration will depend on your hardware, your kernel version, your distro, and whether they ship the Intel version of thermald or Matthew's fork. Even then, your distro may leave it uninstalled or disabled by default. It's still disabled by default in Fedora 33, for instance.

It should be noted at this point that, if thermald and dptfxtract are doing their job, your laptop is likely to start running much hotter when under heavy load than it did before. This is because thermald is re-configuring your processor with a higher thermal budget which means it can now run faster but it will also generate more heat and may drain your battery faster. In theory, thermald should keep your laptop's thermals within safe limits; just not within the more conservative limits the OEM programmed into BIOS. If all the additional heat makes you uncomfortable, you can just disable thermald and it should go back to the BIOS defaults.

Enable NVIDIA's dynamic power-management

On my laptop (the late 2019 Razer Blade Stealth 13), the BIOS has the CPU configured to 35W out-of-the-box. (Yes, 35W is higher than TDP-up and I've never seen it burn anything close to that much power; I have no idea why it's configured that way.) This means that we have no need for DPTF and the cooling is good enough that I don't really need thermald on it either. Instead, its power management problems come from the power balancing that the motherboard does between the CPU and the discrete NVIDIA GPU.

If the NVIDIA GPU is powered on at all, the motherboard configures the CPU to the TDP-down value of 12W. I don't know exactly how it's doing this but it's at a very deep firmware level that seems completely opaque to software. To make matters worse, it doesn't just restrict CPU power when the discrete GPU is doing real rendering; it restricts CPU power whenever the GPU is powered on at all. In the default configuration with the NVIDIA proprietary drivers, that's all the time.

Fortunately, if you know where to find it, there is a configuration option available in recent drivers for Turing and later GPUs which lets the NVIDIA driver completely power down the discrete GPU when it isn't in use. You can find this documented in Chapter 22 of the NVIDIA driver README. The runtime power management feature is still beta as of the writing of this post and does come with some caveats such as that it doesn't work if you have audio or USB controllers (for USB-C video) on your GPU. Fortunately, with many laptops with a hybrid Intel+NVIDIA graphics solution, the discrete GPU exists only for render off-loading and doesn't have any displays connected to it. In that case, the audio and USB-C can be disabled and don't cause any problems. On my laptop, as soon as I properly enabled runtime power management in the NVIDIA driver, the motherboard stopped throttling my CPU and it started running at the full TDP-up of 25W.
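
For reference, the knob documented in that chapter is a kernel module option; here is a minimal sketch of enabling it and checking that the GPU really powers down (the modprobe.d file name is arbitrary, the PCI address is an example, and some setups also need the udev rules from the README):

echo 'options nvidia "NVreg_DynamicPowerManagement=0x02"' | sudo tee /etc/modprobe.d/nvidia-pm.conf
# after a reboot, "suspended" here means the discrete GPU is actually powered down (adjust the PCI address to your dGPU, see lspci)
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status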

I believe that nouveau has some capabilities for runtime power management. However, I don't know for sure how good they are and whether or not they're able to completely power down the GPU.

Look for other things which might be limiting power

In this blog post, I've covered some of the things which I've personally seen limit GPU power when playing games and running benchmarks. However, it is by no means an exhaustive list. If there's one thing that's true about power management, it's that every machine is a bit different. The biggest challenge with my laptop was the NVIDIA discrete GPU draining power. On some other laptop, it may be something else.

You can also look for background processes which may be using significant CPU cycles. With a discrete GPU, a modest amount of background CPU work will often not hurt you unless the game is particularly CPU-hungry. With an integrated GPU, however, it's far more likely that a background task such as a backup or software update will eat into the GPU's power budget. Just this last week, a friend of mine was playing a game on Proton and discovered that the game launcher itself was burning enough power with the CPU to prevent the GPU from running at full power. Once he suspended the game launcher, his GPU was able to run at full power.

Especially with laptops, you're also likely to be affected by the computer's cooling system as was mentioned earlier. Some laptops such as my Razer are designed with high-end cooling systems that let the laptop run at full power. Others, particularly the ultra-thin laptops, are far more thermally limited and may never be able to hit the advertised TDP for extended periods of time.

Conclusion

When trying to get the most performance possible out of a laptop, RAM configuration and power management are key. Unfortunately, due to the issues documented above (and possibly others), the out-of-the-box experience on Linux is not what it should be. Hopefully, we'll see this situation improve in the coming years but for now this post will hopefully give people the tools they need to configure their machines properly and get the full performance out of their hardware.

06 Nov 2020 6:27pm GMT

Mike Blumenkrantz: Last Day

This Is The End

…of my full-time hobby work on zink.

At least for a while.

More on that at the end of the post.

Before I get to that, let's start with yesterday's riddle. Anyone who chose this pic

[image: benchmark screenshot showing 51 fps]

with 51 fps as being zink, you were correct.

That's right, zink is now at around 95% of native GL performance for this benchmark, at least on my system.

I know there's been a lot of speculation about the capability of the driver to reach native or even remotely-close-to-native speeds, and I'm going to say definitively that it's possible, and performance is only going to increase further from here.

A bit of a different look on things can also be found on my Fall roundup post here.

A Big Boost From Threads

I've long been working on zink using a single-thread architecture, and my goal has been to make it as fast as possible within that constraint. Part of my reasoning is that it's been easier to work within the existing zink architecture than to rewrite it, but the main issue is just that threads are hard, and if you don't have a very stable foundation to build off of when adding threading to something, it's going to get exponentially more difficult to gain that stability afterwards.

Reaching a 97% pass rate on my piglit tests at GL 4.6 and ES 3.2 gave me a strong indicator that the driver was in good enough shape to start looking at threads more seriously. Sure, piglit tests aren't CTS; they fail to cover a lot of areas, and they're certainly less exhaustive about the areas that they do cover. With that said, CTS isn't a great tool for zink at the moment due to the lack of provoking vertex compatibility support in the driver (I'm still waiting on a Vulkan extension for this, though it's looking likely that Erik will be providing a fallback codepath for this using a geometry shader in the somewhat near future) which will fail lots of tests. Given the sheer number of CTS tests, going through the failures and determining which ones are failing due to provoking vertex issues and which are failing due to other issues isn't a great use of my time, so I'm continuing to wait on that. The remaining piglit test failures are mostly due either to provoking vertex issues or some corner case missing features such as multisampled ZS readback which are being worked on by other people.

With all that rambling out of the way, let's talk about threads and how I'm now using them in zink-wip.

At present, I'm using u_threaded_context, aka glthread, making zink the only non-radeon driver to implement it. The way this works is by using Gallium to write the command stream to a buffer that is then processed asynchronously, freeing up the main thread for application use and avoiding any sort of blocking from driver overhead. For systems where zink is CPU-bound in the driver thread, this massively increases performance, as seen from the ~40% fps improvement that I gained after the implementation.

This transition presented a number of issues, the first of which was that u_threaded_context required buffer invalidation and rebinding. I'd had this on my list of targets for a while, so it was a good opportunity to finally hook it up.

Next up, u_threaded_context was very obviously written to work for the existing radeon driver architecture, and this was entirely incompatible with zink, specifically in how the batch/command buffer implementation is hardcoded like I talked about yesterday. Switching to monotonic, dynamically scaling command buffer usage resolved that and brought with it some other benefits.

The other big issue was, as I'm sure everyone expected, documentation.

I certainly can't deny that there's lots of documentation for u_threaded_context. It exists, it's plentiful, and it's quite detailed in some cases.

It's also written by people who know exactly how it works with the expectation that it's being read by other people who know exactly how it works. I had no idea going into the implementation how any of it worked other than a general knowledge of the asynchronous command stream parts that are common to all thread queue implementations, so this was a pretty huge stumbling block.

Nevertheless, I persevered, and with the help of a lot of RTFC, I managed to get it up and running. This is a more general overview post rather than a more in-depth, technical one, so I'm not going to go into any deep analysis of the (huge amounts of) code required to make it work, but here are some key points from the process in case anyone reading this hits some of the same issues/annoyances that I did:

All told, fixing all the regressions took much longer than the actual implementation, but that's just par for the course with driver work.

Anyone interested in testing should take note that, as always, this has only been used on Intel hardware (and if you're on Intel, this post is definitely worth reading), and so on systems which were not CPU-bound previously or haven't been worked on by me, you may not yet see these kinds of gains.

But you will eventually.

And That's It

This is a sort of bittersweet post as it marks the end of my full-time hobby work with zink. I've had a blast over the past ~6 months, but all things change eventually, and such is the case with this situation.

Those of you who have been following me for a long time will recall that I started hacking on zink while I was between jobs in order to improve my skills and knowledge while doing something productive along the way. I succeeded in all regards, at least by my own standards, and I got to work with some brilliant people at the same time.

But now, at last, I will once again become employed, and the course of that employment will take me far away from this project. I don't expect that I'll have a considerable amount of mental energy to dedicate to hobbyist Open Source projects, at least for the near term, so this is a farewell of sorts in that sense. This means (again, for at least the near term):

This does not mean that zink is dead, or the project is stalling development, or anything like that, so don't start overreaching on the meaning of this post.

I still have 450+ patches left to be merged into mainline Mesa, and I do plan to continue driving things towards that end, though I expect it'll take a good while. I'll also be around to do patch reviews for the driver and continue to be involved in the community.

I look forward to a time when I'll get to write more posts here and move the zink user experience closer to where I think it can be.

This is Mike, signing off for now.

Happy rendering.

06 Nov 2020 12:00am GMT

05 Nov 2020

feedplanet.freedesktop.org

Iago Toral: V3DV + Zink

During my presentation at the X Developers Conference I stated that we had been mostly using the Khronos Vulkan Conformance Test suite (aka Vulkan CTS) to validate our Vulkan driver for Raspberry Pi 4 (aka V3DV). While the CTS is an invaluable resource for driver testing and validation, it doesn't exactly compare to real-world applications, and so I made the point that we should try to do more real-world testing for the driver after completing initial Vulkan 1.0 support.

To be fair, we had been doing a little bit of this already when I worked on getting the Vulkan ports of all 3 Quake game classics to work with V3DV, which allowed us to identify and fix a few driver bugs during development. The good thing about these games is that we could get the source code and compile them natively for ARM platforms, so testing and debugging was very convenient.

Unfortunately, there isn't a plethora of Vulkan applications and games like these that we can easily test and debug on a Raspberry Pi as of today, which posed a problem. One way to work around this limitation that was suggested after my presentation at XDC was to use Zink, the OpenGL to Vulkan layer in Mesa. Using Zink, we can take existing OpenGL applications that are currently available for Raspberry Pi and use them to test our Vulkan implementation a bit more thoroughly, expanding our options for testing while we wait for the Vulkan ecosystem on Raspberry Pi 4 to grow.

So last week I decided to get hands on with that. Zink requires a few things from the underlying Vulkan implementation depending on the OpenGL version targeted. Currently, Zink only targets desktop OpenGL versions, so that limits us to OpenGL 2.1, which is the maximum version of desktop OpenGL that Raspberry Pi 4 can support (we support up to OpenGL ES 3.1 though). For that desktop OpenGL version, Zink required a few optional Vulkan 1.0 features that we were missing in V3DV, namely:

The first two were trivial: they were already implemented and we only had to expose them in the driver. Notably, when I was testing these features with the relevant CTS tests I found a bug in the alpha to one tests, so I proposed a fix to Khronos which is currently in review.
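
As a purely hypothetical illustration of what checking and enabling one of these optional features looks like at device creation time (this is not V3DV or Zink code; alphaToOne is used here only because it's mentioned above, and the physical device plus the queue setup are assumed to exist elsewhere):

#include <string.h>
#include <vulkan/vulkan.h>

static void
enable_optional_features(VkPhysicalDevice physical_device,
                         VkDeviceCreateInfo *dev_info,
                         VkPhysicalDeviceFeatures *enabled)
{
   VkPhysicalDeviceFeatures supported;
   vkGetPhysicalDeviceFeatures(physical_device, &supported);

   memset(enabled, 0, sizeof(*enabled));
   if (supported.alphaToOne)
      enabled->alphaToOne = VK_TRUE;   /* only enable what the implementation reports */

   dev_info->sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
   dev_info->pEnabledFeatures = enabled;
   /* queue create infos etc. still need to be filled in before vkCreateDevice() */
}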

I also noticed that Zink was also implicitly requiring support for timestamp queries, so I also implemented that in V3DV and then also wrote a patch for Zink to handle this requirement better.

Finally, Zink doesn't use Vulkan swapchains; instead, it creates presentable images directly, which was problematic for us because our platform needs to handle allocations for presentable images specially, so a patch for Zink was also required to address this.

As of the writing of this post, all this work has been merged in Mesa and it enables Zink to run OpenGL 2.1 applications over V3DV on Raspberry Pi 4. Here are a few screenshots of Quake3 taken with the native OpenGL driver (V3D), with the native Vulkan driver (V3DV) and with Zink (over V3DV). There is a significant performance hit with Zink at present, although that is probably not too unexpected at this stage; otherwise it seems to be rendering correctly, which is what we were really interested to see:


Quake3 Vulkan renderer (V3DV)

Quake3 OpenGL renderer (V3D)

Quake3 OpenGL renderer (Zink + V3DV)

Note: you'll notice that the Vulkan screenshot is darker than the OpenGL versions. As I reported in another post, that is a feature of the Vulkan port of Quake3 and is unrelated to the driver.

Going forward, we expect to use Zink to test more applications and hopefully identify driver bugs that help us make V3DV better.

05 Nov 2020 10:14am GMT

Mike Blumenkrantz: Architecture

It's Time.

I've been busy cramming more code than ever into the repo this week in order to finish up my final project for a while by Friday. I'll talk more about that tomorrow though. Today I've got two things for all of you.

First, A Riddle

Of these two screenshots, one is zink+ANV and one is IRIS. Which is which?

[Screenshot: 2.png]

[Screenshot: 1.png]

Second, Queue Architecture

Let's talk a bit at a high level about how zink uses (non-compute) command buffers.

Currently in the repo zink works like this:

In short, there's a huge bottleneck around the flushing mechanism, and then there's a lesser-reached bottleneck for cases where an application flushes repeatedly before a command buffer's ops are completed.

Some time ago I talked about some modifications I'd done to the above architecture, and then things looked more like this:

The major difference after this work was that the flushing was reduced, which then greatly reduced the impact of that bottleneck that exists when all the command buffers are submitted and the driver wants to continue recording commands.

A lot of speculation has occurred among the developers over "how many" command buffers should be used, and there's been some talk of profiling this, but for various reasons I'll get into tomorrow, I opted to sidestep the question entirely in favor of a more dynamic solution: monotonically-identified command buffers.

Monotony

The basic idea behind this strategy, which is used by a number of other drivers in the tree, is that there's no need to keep a "ring" of command buffers to cycle through, as the driver can just continually allocate new command buffers on-the-fly and submit them as needed, reusing them once they've naturally completed instead of forcibly stalling on them. Here's a visual comparison:

The current design:

Here's the new version:

This way, there's no possibility of stalling based on application flushes (or the rare driver-internal flush which does still exist in a couple places).
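
To make the contrast concrete, here's an illustrative sketch of both designs. This is not the actual zink code: the structs are made up, and fence_signaled / wait_for_fence stand in for the real Vulkan synchronization.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct batch {
   uint32_t id;
   bool fence_signaled;
   struct batch *next;
};

static void wait_for_fence(struct batch *b)
{
   /* stand-in for vkWaitForFences(); this is the stall */
   b->fence_signaled = true;
}

/* the old, hardcoded design: a fixed ring of batches */
#define NUM_BATCHES 4

struct ring_queue {
   struct batch batches[NUM_BATCHES];
   unsigned cur;
};

static struct batch *ring_get_batch(struct ring_queue *q)
{
   struct batch *b = &q->batches[q->cur];
   q->cur = (q->cur + 1) % NUM_BATCHES;

   /* if the application flushes faster than the GPU retires work,
    * everything grinds to a halt right here */
   if (!b->fence_signaled)
      wait_for_fence(b);
   return b;
}

/* the new design: monotonically-identified batches, allocated on demand */
struct mono_queue {
   uint32_t next_id;
   struct batch *submitted;   /* batches previously handed to the GPU */
};

static struct batch *mono_get_batch(struct mono_queue *q)
{
   struct batch *b = NULL;

   /* reuse any batch whose work has naturally completed... */
   struct batch **prev = &q->submitted;
   for (struct batch *it = q->submitted; it; it = it->next) {
      if (it->fence_signaled) {
         *prev = it->next;
         b = it;
         break;
      }
      prev = &it->next;
   }

   /* ...otherwise allocate a fresh one instead of stalling */
   if (!b)
      b = calloc(1, sizeof(*b));

   b->id = q->next_id++;      /* unique, never-reused identifier */
   b->fence_signaled = false;
   b->next = NULL;
   return b;
}

The important property is that mono_get_batch() never has to wait: if nothing has completed yet, it just allocates, and the monotonically increasing id makes ordering trivial to reason about.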

The architectural change here had two great benefits:

The latter of these is due to the way that the queue in zink is split between gfx and compute command buffers. With the hardcoded batch system, the compute queue had its own command buffer while the gfx queue had four, all with unique IDs that were tracked using bitfields all over the place; it was also frustrating never being able to just "know" which command buffer was currently being recorded to for a given command without indexing the array.

Now it's easy to know which command buffer is currently being recorded to, as it'll always be the one associated with the queue (gfx or compute) for the given operation.

This had further implications, however, and I'd done this to pave the way for a bigger project, one that I've spent the past few days on. Check back tomorrow for that and more.

05 Nov 2020 12:00am GMT

02 Nov 2020

feedplanet.freedesktop.org

Mike Blumenkrantz: Bringing The Heat

New Hotness

Quick update today, but I've got some very exciting news coming soon.

The biggest news of the day is that work is underway to merge some patches from Duncan Hopkins which enable zink to run on Mac OS using MoltenVK. This has significant potential to improve OpenGL support on that platform, so it's awesome that work has been done to get the ball rolling there.

In only slightly less monumental news though, Adam Jackson is already underway with Vulkan WSI work for zink, which is going to be huge for performance.

02 Nov 2020 12:00am GMT

30 Oct 2020

feedplanet.freedesktop.org

Dave Airlie (blogspot): llvmpipe is OpenGL 4.5 conformant.

(I just sent the below email to mesa3d developer list).

Just to let everyone know, a month ago I submitted the 20.2 llvmpipe driver for OpenGL 4.5 conformance under the SPI/X.org umbrella, and it is now official[1].

Thanks to everyone who helped me drive this forward, and to all the contributors both to llvmpipe and the general Mesa stack that enabled this.

Big shout out to Roland Scheidegger for helping review the mountain of patches I produced in this effort.

My next plans involve submitting lavapipe for Vulkan 1.0; it's at 99% or so of CTS, but there are line drawing, sampler accuracy and some snorm blending failures I have to work out. I also ran the OpenCL 3.0 conformance suite against clover/llvmpipe yesterday and have some vague hopes of driving that to some sort of completion.

(For GL 4.6, only texture anisotropy is really missing; I've got patches for SPIR-V support, in case someone is feeling adventurous.)

Dave.

[1] https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_272

30 Oct 2020 8:25pm GMT

29 Oct 2020

feedplanet.freedesktop.org

Mike Blumenkrantz: Invalidation

Buffering

I've got a lot of exciting stuff in the pipe now, but for today I'm just going to talk a bit about resource invalidation: what it is, when it happens, and why it's important.

Let's get started.

What is invalidation?

Resource invalidation occurs when the backing buffer of a resource is wholly replaced. Consider the following scenario under zink:
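
As a minimal sketch of the kind of sequence meant here (an illustrative GL fragment that assumes a current context; vbo and verts are just stand-ins, and A/A.buffer refer to the driver-side objects discussed below):

#include <GL/gl.h>

static void respecify_buffer(void)
{
   GLuint vbo;
   static const float verts[256] = {0};

   glGenBuffers(1, &vbo);
   glBindBuffer(GL_ARRAY_BUFFER, vbo);

   /* first upload: the driver resource A gets a backing buffer, A.buffer */
   glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);

   /* ... draw something that reads from the buffer ... */

   /* re-specify the store with no data: this is the call that should trigger
    * invalidation, replacing A.buffer while A itself stays the same object */
   glBufferData(GL_ARRAY_BUFFER, sizeof(verts), NULL, GL_STATIC_DRAW);
}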

On a sane/competent driver, the second glBufferData call will trigger invalidation, which means that A.buffer will be replaced entirely, while A is still the driver resource used by Gallium to represent target.

When does invalidation occur?

Resource invalidation can occur in a number of scenarios, but the most common is when unsetting a buffer's data, as in the above example. The other main case for it is replacing the data of a buffer that's in use for another operation. In such a case, the backing buffer can be replaced to avoid forcing a sync in the command stream, which would stall the application's processing. There are some other cases for this as well, like glInvalidateFramebuffer and glDiscardFramebufferEXT, but the primary usage that I'm interested in is buffers.

Why is invalidation important?

The main reason is performance. In the above scenario without invalidation, the second glBufferData call will write null to the whole buffer, which is going to be much more costly than just creating a new buffer.

That's it

Now comes the slightly more interesting part: how does invalidation work in zink?

Currently, as of today's mainline zink codebase, we have struct zink_resource to represent a resource for either a buffer or an image. One struct zink_resource represents exactly one VkBuffer or VkImage, and there's some passable lifetime tracking that I've written to guarantee that these Vulkan objects persist through the various command buffers that they're associated with.

Each struct zink_resource is, as is the way of Gallium drivers, also a struct pipe_resource, which is tracked by Gallium. Because of this, struct zink_resource objects themselves cannot be invalidated without breaking Gallium, and instead only the inner Vulkan objects can be replaced.

For this, I created struct zink_resource_object, which is an object that stores only the data that directly relates to the Vulkan objects, leaving struct zink_resource to track the states of these objects. Their lifetimes are separate, with struct zink_resource being bound to the Gallium tracker and struct zink_resource_object persisting for either the lifetime of struct zink_resource or its command buffer usage, whichever is longer.
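
In heavily simplified form, the split looks something like this (illustrative only; the real structs carry far more state than shown, and the exact fields here are assumptions):

struct zink_resource_object {
   struct pipe_reference reference;   /* refcounted independently of the pipe_resource */
   VkBuffer buffer;                   /* the actual Vulkan object(s) and their memory */
   VkDeviceMemory mem;
};

struct zink_resource {
   struct pipe_resource base;         /* what Gallium tracks; can never be swapped out */
   struct zink_resource_object *obj;  /* can be replaced wholesale on invalidation */
   struct util_range valid_buffer_range;
};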

Code

The code for this mechanism isn't super interesting since it's basically just moving some parts around. Where it gets interesting is the exact mechanics of invalidation and how struct zink_resource_object can be injected into an in-use resource, so let's dig into that a bit.

Here's what the pipe_context::invalidate_resource hook looks like:

static void
zink_invalidate_resource(struct pipe_context *pctx, struct pipe_resource *pres)
{
   struct zink_context *ctx = zink_context(pctx);
   struct zink_resource *res = zink_resource(pres);
   struct zink_screen *screen = zink_screen(pctx->screen);

   if (pres->target != PIPE_BUFFER)
      return;

This only handles buffer resources, but extending it for images would likely be little to no extra work.

   if (res->valid_buffer_range.start > res->valid_buffer_range.end)
      return;

Zink tracks the valid data segments of its buffers. This conditional is used to check for an uninitialized buffer, i.e., one which contains no valid data. If a buffer has no data, it's already invalidated, so there's nothing to be done here.

   util_range_set_empty(&res->valid_buffer_range);

Invalidating means the buffer will no longer have any valid data, so the range tracking can be reset here.

   if (!get_all_resource_usage(res))
      return;

If this resource isn't currently in use, unsetting the valid range is enough to invalidate it, so it can just be returned right away with no extra work.

   struct zink_resource_object *old_obj = res->obj;
   struct zink_resource_object *new_obj = resource_object_create(screen, pres, NULL, NULL);
   if (!new_obj) {
      debug_printf("new backing resource alloc failed!");
      return;
   }

Here's the old internal buffer object as well as a new one, created using the existing buffer as a template so that it'll match.

   res->obj = new_obj;
   res->access_stage = 0;
   res->access = 0;

struct zink_resource is just a state tracker for the struct zink_resource_object object, so upon invalidate, the states are unset since this is effectively a brand new buffer.

   zink_resource_rebind(ctx, res);

This is the tricky part, and I'll go into more detail about it below.

   zink_descriptor_set_refs_clear(&old_obj->desc_set_refs, old_obj);

If this resource was used in any cached descriptor sets, the references to those sets need to be invalidated so that the sets won't be reused.

   zink_resource_object_reference(screen, &old_obj, NULL);
}

Finally, the old struct zink_resource_object is unrefed, which will ensure that it gets destroyed once its current command buffer has finished executing.

Simple enough, but what about that zink_resource_rebind() call? Like I said, that's where things get a little tricky, but because of how much time I spent on descriptor management, it ends up not being too bad.

This is what it looks like:

void
zink_resource_rebind(struct zink_context *ctx, struct zink_resource *res)
{
   assert(res->base.target == PIPE_BUFFER);

Again, this mechanism is only handling buffer resource for now, and there's only one place in the driver that calls it, but it never hurts to be careful.

   for (unsigned shader = 0; shader < PIPE_SHADER_TYPES; shader++) {
      if (!(res->bind_stages & BITFIELD64_BIT(shader)))
         continue;
      for (enum zink_descriptor_type type = 0; type < ZINK_DESCRIPTOR_TYPES; type++) {
         if (!(res->bind_history & BITFIELD64_BIT(type)))
            continue;

Something common to many Gallium drivers is this idea of "bind history", which is where a resource will have bitflags set when it's used for a certain type of binding. While other drivers have a lot more cases than zink does due to various factors, the only thing that needs to be checked for my purposes is the descriptor type (UBO, SSBO, sampler, shader image) across all the shader stages. If a given resource has the flags set here, this means it was at some point used as a descriptor of this type, so the current descriptor bindings need to be compared to see if there's a match.

         uint32_t usage = zink_program_get_descriptor_usage(ctx, shader, type);
         while (usage) {
            const int i = u_bit_scan(&usage);

This is a handy mechanism that returns the current descriptor usage of a shader as a bitfield. So for example, if a vertex shader uses UBOs in slots 0, 1, and 3, usage will be 11, and the loop will process i as 0, 1, and 3.

            struct zink_resource *cres = get_resource_for_descriptor(ctx, type, shader, i);
            if (res != cres)
               continue;

Now the slot of the descriptor type can be compared against the resource that's being re-bound. If this resource is the one that's currently bound to the specified slot of the specified descriptor type, then steps can be taken to perform additional operations necessary to successfully replace the backing storage for the resource, mimicking the same steps taken when initially binding the resource to the descriptor slot.

            switch (type) {
            case ZINK_DESCRIPTOR_TYPE_SSBO: {
               struct pipe_shader_buffer *ssbo = &ctx->ssbos[shader][i];
               util_range_add(&res->base, &res->valid_buffer_range, ssbo->buffer_offset,
                              ssbo->buffer_offset + ssbo->buffer_size);
               break;
            }

For SSBO descriptors, the only change needed is to add the bound region to the buffer's valid range. This region is passed to the shader, so even if it's never written to, it might be, and so it can be considered a valid region.

            case ZINK_DESCRIPTOR_TYPE_SAMPLER_VIEW: {
               struct zink_sampler_view *sampler_view = zink_sampler_view(ctx->sampler_views[shader][i]);
               zink_descriptor_set_refs_clear(&sampler_view->desc_set_refs, sampler_view);
               zink_buffer_view_reference(ctx, &sampler_view->buffer_view, NULL);
               sampler_view->buffer_view = get_buffer_view(ctx, res, sampler_view->base.format,
                                                           sampler_view->base.u.buf.offset, sampler_view->base.u.buf.size);
               break;
            }

Sampler descriptors require that a new VkBufferView be created since the previous one is no longer valid. Again, the references for the existing buffer view need to be invalidated now since that descriptor set can no longer be reused from the cache, and then the new VkBufferView is set after unrefing the old one.

            case ZINK_DESCRIPTOR_TYPE_IMAGE: {
               struct zink_image_view *image_view = &ctx->image_views[shader][i];
               zink_descriptor_set_refs_clear(&image_view->desc_set_refs, image_view);
               zink_buffer_view_reference(ctx, &image_view->buffer_view, NULL);
               image_view->buffer_view = get_buffer_view(ctx, res, image_view->base.format,
                                                         image_view->base.u.buf.offset, image_view->base.u.buf.size);
               util_range_add(&res->base, &res->valid_buffer_range, image_view->base.u.buf.offset,
                              image_view->base.u.buf.offset + image_view->base.u.buf.size);
               break;
            }

Images are nearly identical to the sampler case, the difference being that while samplers are read-only like UBOs (and therefore reach this point already having valid buffer ranges set), images are more like SSBOs and can be written to. Thus the valid range must be set here like in the SSBO case.

            default:
               break;

Eagle-eyed readers will note that I've omitted a UBO case, and this is because there's nothing extra to be done there. UBOs will already have their valid range set and don't need a VkBufferView.

            }

            invalidate_descriptor_state(ctx, shader, type);

Finally, the incremental descriptor state hash for this shader stage and descriptor type is invalidated. It'll be recalculated normally upon the next draw or compute operation, so this is a quick zero-setting operation.

         }
      }
   }
}

That's everything there is to know about the current state of resource invalidation in zink!

29 Oct 2020 12:00am GMT