28 Feb 2021


Roman Gilg: Curious Child

Last week we studied window children on X11 and Wayland at a high level. With this general knowledge acquired, we will quickly go through the recent changes to window children in KWinFT's new version.

All the X11 Children

As mentioned in last week's article there is not only one kind of transient child on X11. There are, for one, the usual transients, defined by setting the WM_TRANSIENT_FOR window property to the window id of a toplevel parent window. But there are also group transients, defined by setting that property to null or, alternatively, to the id of the root window.

Transient Leads

Normal transients and group transients were in the past handled in KWinFT by different means. In the class for managed X11 windows, a field indicated whether the window is a transient. For normal transients the field held the id of the parent window; for group transients it was always set to the root window id. When set to null, the window was not a transient at all.

Additionally there was a function mainClients() which returned a list of all transient leads. As a reminder, these are the other windows a window is a transient for. For normal transients the returned list obviously contained only a single element.

This has now been unified and encapsulated in a single class, simply called transient, which is composed into the classes representing windows. With that there is one way to check whether a window is a transient. More importantly, the same mechanism is used for all kinds of window children, including, as we will see later, those on Wayland.
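To illustrate the idea, here is a minimal sketch of such a composed transient class. All names here are invented for illustration; the real KWinFT class differs in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class Toplevel;  // a window (hypothetical stand-in name)

// Sketch of a unified transient relation, composed into window classes.
// A single code path serves normal transients (one lead), group
// transients (several leads) and Wayland children alike.
class transient {
    std::vector<Toplevel*> m_leads;     // windows this one is a transient for
    std::vector<Toplevel*> m_children;  // transients of this window
public:
    bool is_transient() const { return !m_leads.empty(); }

    void add_lead(Toplevel* lead) { m_leads.push_back(lead); }
    void remove_lead(Toplevel* lead) {
        m_leads.erase(std::remove(m_leads.begin(), m_leads.end(), lead),
                      m_leads.end());
    }
    void add_child(Toplevel* child) { m_children.push_back(child); }

    std::vector<Toplevel*> const& leads() const { return m_leads; }
    std::vector<Toplevel*> const& children() const { return m_children; }
};
```

With this, "is this window a transient?" becomes a single query on the composed object instead of interpreting a raw window id field.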

Around and Around

We said the notion of group transients increases the complexity when dealing with transient windows. One reason for that is the danger of cyclic relations.

When a window is a group transient, every other window in the window group is a transient lead for it. This naturally leads to cyclic transient relations as soon as there is more than one group transient in the group. Cycles would also arise when one of the other windows names a group transient as its transient lead in the usual way, by setting its WM_TRANSIENT_FOR window property to that window's id. Thinking further, this can even happen through an indirect transient relation via several windows inside or outside the group.

Since we use the transient relation for stacking windows, cyclic transient relations make no sense. Such relations are also likely to cause tricky bugs like infinite loops. So we simply filter them out.

In the old code this was ensured through different means scattered around. In the new implementation it happens in one place, when the group or the WM_TRANSIENT_FOR value is updated. For all affected windows the updated relations are then saved into their composed transient objects.
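The filtering itself can be sketched like this (names invented): before accepting a new lead for a window, walk up from the prospective lead through its own leads; if the walk ever reaches the window, the new relation would close a cycle and is rejected.

```cpp
#include <cassert>
#include <vector>

// Toy window with only its transient leads. Assumes the existing
// relations are already acyclic, which this check maintains.
struct Win {
    std::vector<Win*> leads;
};

// Would making `lead` a transient lead of `child` close a cycle?
bool creates_cycle(Win const* child, Win const* lead) {
    if (lead == child) {
        return true;
    }
    for (auto const* next : lead->leads) {
        if (creates_cycle(child, next)) {
            return true;
        }
    }
    return false;
}

void set_transient_lead(Win* child, Win* lead) {
    if (creates_cycle(child, lead)) {
        return;  // filter the cyclic relation out
    }
    child->leads.push_back(lead);
}
```

The recursive walk also catches the indirect case, where the cycle would run through several intermediate windows.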

Same but Different Wayland Children

The grand goal of this rewrite was to unify the handling of window children not only for windows on X11 but also for windows on Wayland. This has been achieved by always falling back on the same general idea of window children, and only defining explicitly in detail what differs.

Subsurfaces as Annexed Children

In the past subsurfaces were handled in separate trees per surface. The decision to implement them this way must have seemed natural because KWin as an X11 window manager did not handle the parent-child relations of the X11 window tree. This was done, as mentioned in the previous article, by the X Server itself.

The problem with this approach though is that it ignores the tree which already exists in the window manager: the tree for stacking all toplevel windows. We basically doubled the implementation cost by maintaining separate trees for subsurfaces. With the rewrite this has been corrected. Subsurfaces are now tracked in the same internal stack in KWinFT as all other X11 and Wayland windows. The stacking algorithm ensures that they are always above their parent surface, and input is redirected accordingly.

This greatly simplifies the handling of subsurfaces. There is one difference though: in contrast to normal X11 transient windows, subsurfaces do not have control of their own, they are not independent entities. They are rather annexed to their parent surface. It made sense to introduce a property of that name for them. When painting the final image the Scene is responsible for painting subsurfaces as part of their parent surface.

Toplevel Transients and Popups

In the last article we saw that the xdg-shell protocol defines parent-child relations between xdg-toplevels. These relations can be represented in the same way as normal transients on X11. The implementations on Wayland and X11 are therefore very similar, differing only in how the information is received via each protocol.

The case of xdg-popups is more interesting. On X11 we saw that popups are basically ignored by the window manager. But on Wayland popups need to be stacked and positioned by the window manager as there is no other entity like the X Server doing that for us.

Obviously we want to interpret them as window children again so we can reuse our tools, and the refactored implementation was designed that way. But this is also a good example of how defining the right notion can make all the difference, because we interpret them not only as normal window children.

An effect acts on a Wayland popup as it is an annexed child.

They are now, in the same way as subsurfaces, set to be annexed children. This way effects that affect the parent also affect them.

Restacked Perception

One interesting aspect of the unification work on window children is that the annexed children motivated an overhaul of the central stacking algorithm. The old algorithm was difficult to understand with several counters and nested loops. The new algorithm uses the transient class with its leads and children to compute the new stack recursively, keeping a child above its parent but below other windows.
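The recursive idea can be sketched with a few lines (a toy model with invented names, not the actual KWinFT code): emit each window into the new stack and immediately recurse into its transient children, so every child ends up directly above its lead but below unrelated windows.

```cpp
#include <cassert>
#include <vector>

// Toy window: an id and its transient children in bottom-to-top order.
struct Window {
    int id;
    std::vector<Window*> children;
};

// Emit a window, then all of its children recursively above it.
void stack_subtree(Window* win, std::vector<int>& stack) {
    stack.push_back(win->id);
    for (auto* child : win->children) {
        stack_subtree(child, stack);
    }
}

// `roots` are the windows without transient leads, in the desired
// bottom-to-top order; the result interleaves children above parents.
std::vector<int> restack(std::vector<Window*> const& roots) {
    std::vector<int> stack;
    for (auto* root : roots) {
        stack_subtree(root, stack);
    }
    return stack;
}
```

Compared with counters and nested loops, the recursion mirrors the transient relation directly, which is what makes the new algorithm easier to reason about.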

It is therefore fair to say that in this case a different viewpoint on window children led to improvements in parts that at first seemed unrelated or at least of no concern. And this was not the only time that happened when I worked on this Windowing Revolution. In general one can say that fundamental progress is achieved when traditional views are challenged. The crucial step here was to find a new definition for unity and difference of window children.

28 Feb 2021 10:00pm GMT

27 Feb 2021


Pekka Paalanen: Testing 4x4 matrix inversion precision

It is extremely rare that a hobby software project of mine gets completed, but now it has happened. Behold! Fourbyfour!

Have you ever had to implement a mathematical algorithm, say, matrix inversion? You want it to be fast and measuring the speed is fairly simple, right. But what about correctness? Or precision? Behavior around inputs that are on the edge? You can hand-pick a few example inputs, put those into your test suite, and verify the result is what you expect. If you do not pick only trivial inputs, this is usually enough to guarantee your algorithm does not have fundamental mistakes. But what about those almost invalid inputs, can you trust your algorithm to not go haywire on them? How close to invalid can your inputs be before things break down? Does your algorithm know when it stops working and tell you?

Inverting a square matrix requires that the inverse matrix exists to begin with. Matrices that do not mathematically have an inverse matrix are called singular. Can your matrix inversion algorithm tell you when you are trying to invert a matrix that cannot be inverted, or does it just give you a bad result pretending it is ok?

Working with computers often means working with floating-point numbers. With floating-point, the usual mathematics is not enough, it can actually break down. You calculate something and the result a computer gives you is total nonsense, like 1+2=2 in spirit. In the case of matrix inversion, it's not enough that the input matrix is not singular mathematically, it needs to be "nice enough" numerically as well. How do you test your matrix inversion algorithm with this in mind?
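The "1+2=2 in spirit" breakdown is easy to reproduce with 32-bit floats: a float carries a 24-bit significand, so at magnitude 1e8 the gap between adjacent representable values is 8, and adding 1.0f is swallowed whole by rounding.

```cpp
#include <cassert>

// Returns true when adding 1.0f to `big` changes nothing at all,
// i.e. when the usual mathematics has broken down.
bool addition_breaks_down(float big) {
    return big + 1.0f == big;
}
```

The same effect is what makes "nice enough numerically" a real requirement for inversion inputs, on top of the mathematical non-singularity.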

These questions I tried to answer with Fourbyfour. The README has the links to the sub-pages discussing how I solved this, so I will not repeat it here. However, as the TL;DR, if there is one thing you should remember, it is this:

Do not use the matrix determinant to test if a matrix is invertible!

Yes, the determinant is zero for a singular matrix. No, close to zero determinant does not tell you how close to singular the matrix is. There are better ways.

However, the conclusion I came to is that if you want a clear answer for a specific input matrix (is it invertible?), the only way to know for sure is to actually invert it, multiply the input matrix with the inverse you computed, and measure how far off from the identity matrix the product is. Of course, you also need to set a threshold for how close to the identity matrix is close enough for your application, because with numerical algorithms you will almost never get the exact answer. Also, pick an appropriate matrix norm for the matrix difference.

The reason for this conclusion is what one of the tools I wrote tells me about a matrix that would be typical for a display server with two full-HD monitors. The matrix is simply the pixel offset of the second monitor on the desktop. The analysis of the matrix is the example I used to demonstrate fourbyfour-analyse. If you read through it, you should be shocked. The mathematics, as far as I can understand, seems to tell us that if you use 32-bit floating-point, inverting this matrix gives us a result that leads to no correct digits at all. Obviously this is nonsense, the inverse is trivial and algorithms should give the exact correct result. However, the math does not lie (unless I did). If I did my research right, then what fourbyfour-analyse tells us is true, with an important detail: it is the upper error bound. It guarantees that we cannot get errors larger than that (heh, zero correct digits is pretty hard to make much worse). But I also read that there is no better error bound possible for a generic matrix inversion algorithm. (If you take the obvious-to-human constraints into account that those elements must be one and those must be zero, the analysis would likely be very different.) Therefore the only thing left to do is to actually go on with the matrix inversion and then verify the result.

Here is a list of the cool things the Fourbyfour project does or has:

In this project I also tried out several project quality assurance features:

I'm really happy this project is now "done", well, version 1.0.0 so to say. One thing I have realized it is still missing is a determinant sweep mode. The precision testing mode sweeps over condition numbers and allows plotting the inversion behavior. It should have another mode where the sweep controls the determinant value, with some fixed condition number for the random test matrices. This determinant mode could point out inversion algorithms that use determinant value for matrix singularity testing and show how it leads to completely arbitrary results.

If you want to learn about numerical methods for matrices, I recommend the book: Gene H. Golub, Charles F. van Loan, Matrix Computations, The Johns Hopkins University Press. I used the third edition, 1996, when implementing the Weston matrix inversion years ago.

27 Feb 2021 10:47am GMT

26 Feb 2021


Ben Widawsky: Framebuffer Modifiers Part 1


In a now pretty well established tradition on my part, I am posting on things I no longer work on!

I gave a talk on modifiers at XDC 2017 and at Linux Plumbers 2017 (audio only). It was always my goal to have a blog post accompany the work. Relatively shortly after the talks, I ended up leaving graphics and so it dropped on the priority list.

I'm splitting this up into two posts. This post will go over the problem, and solutions. The next post will go over the implementation details.


Each 3d computational unit in an Intel GPU is called an Execution Unit (EU). Aside from what you might expect them to do, like execute shaders, they may be used for copy operations (itself a shader), or compute operations (also, shaders). All of these things require memory bandwidth in order to complete their task in a timely manner.

Modifiers were the chosen solution in order to allow end to end renderbuffer [de]compression to work, which is itself designed to reduce memory bandwidth needs in the GPU and display pipeline. End to end renderbuffer compression simply means that through all parts of the GPU and display pipeline, assets are read and written to in a compression scheme that is capable of reducing bandwidth (more on this later).

Modifiers are a relatively simple concept. They are modifications applied to a buffer's layout. Typically a buffer has a few properties: width, height, and pixel format, to name a few. Modifiers can be thought of as ancillary information passed along with the pixel data; they impact how the data is processed or displayed. One such example is tiling, a mechanism that changes how pixels are stored (not sequentially) so that operations make better use of locality for caching and similar reasons. Modifiers were primarily designed to help negotiate modified buffers between the GPU rendering engine and the display engine (usually by way of the compositor). Other uses can crop up as well, such as the video decode/encode engines.

My understanding is that even now, 3 years later, full modifier support isn't readily available across all corners of the graphics ecosystem, and many hardware features go entirely unrealized. Upstreaming sweeping graphics features like this one can be very time consuming, and I would seriously advise hardware designers to take that into consideration (or better yet, ask your local driver maintainer) before they spend the gates. If you can make changes that don't require software, just do it. If you need software involvement, the longer you wait, the worse it will be.

They weren't new even when I made the presentation 3.5 years ago.

commit e3eb3250d84ef97b766312345774367b6a310db8
Author: Rob Clark <robdclark@gmail.com>
Date:   6 years ago

    drm: add support for tiled/compressed/etc modifier in addfb2

I managed to land some stuff:

commit db1689aa61bd1efb5ce9b896e7aa860a85b7f1b6
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   3 years, 7 months ago

    drm: Create a format/modifier blob

Admiring the Problem

Back of the envelope requirements for a midrange Skylake GPU from the time can be calculated relatively easily. 4 years ago, at the frequencies we ran our GPUs and given their ISA, we could expect roughly 1GB/s of bandwidth demand for each of the 24 EUs.

A 4k display:

3840px × 2160rows × 4Bpp × 60Hz = 1.85GB/s

24GB/s + 1.85GB/s = 25.85GB/s

This by itself will oversaturate single channel DDR4 bandwidth (which was what was around at the time) at the fastest possible clock. As it turns out, it gets even worse with compositing. Most laptops sporting a SKL of this range wouldn't have a 4k display, but you get the idea.
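The back-of-the-envelope math above can be sketched as a tiny calculation (GiB/s in base-2 units, which is what the 1.85 figure implies):

```cpp
#include <cassert>

// Scanout bandwidth of a display: pixels per frame times bytes per
// pixel times refresh rate, converted to GiB/s.
double scanout_gibs(double width, double height, double bytes_per_pixel,
                    double hz) {
    return width * height * bytes_per_pixel * hz
           / (1024.0 * 1024.0 * 1024.0);
}
```

Adding the ~24GB/s of EU demand to the 4k scanout figure lands above what a single DDR4 channel can deliver, which is the whole point of the exercise.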

The picture (click for larger SVG) is a typical "flow" for a composited desktop using direct rendering with X or a Wayland compositor using EGL. In this case, drawing a Rubik's cube looking thing into a black window.

Admiring the problem

Using this simple Rubik's cube example I'll explain each of the steps so that we can understand where our bandwidth is going and how we might mitigate that. This is just the overview, so feel free to move on to the next section. Since the example is trivial and the window is small (and a singleton), it won't hit that level of bandwidth, but it will demonstrate how and where the bandwidth is being consumed and open up a discussion on how savings can be achieved.

Rendering and Texturing

For the example, no processing happens other than texturing. In a simple world, the processing of the shader instructions doesn't increase the memory bandwidth cost. As such, we'll omit that from the details.

The main steps for getting this Rubik's cube displayed are texture upload, texture sampling, composition, and scanout.

More details below...

Texture Upload

Getting the texture from the application, usually from disk, into main memory, is what I'm referring to as texture upload. In terms of memory bandwidth, you are using write bandwidth to write into the memory.

Assets are transferred from persistent storage to memory

Textures may either be generated by the 3d application, which would be trivial for this example, or authored using a set of offline tools and baked into the application. For any consequential use, the latter predominates. Certain surface types are commonly generated dynamically though; for example, the shadow mapping technique generates depth maps. Those dynamically generated surfaces actually benefit even more (more on this later).

This is pseudo code (but close to real) to upload the texture in OpenGL:

const unsigned height = 128;
const unsigned width = 64;
const void *data = ... // rubik's cube
GLuint tex;

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0, GL_RGB, GL_UNSIGNED_BYTE, data);

I'm going to punt on explaining mipmaps, which are themselves a mechanism to conserve memory bandwidth. If you have no understanding, I'd recommend reading up on mipmaps. This wikipedia article looks decent to me.

Texture Sampling

Once the texture is bound, the graphics runtime can execute shaders which reference those textures. When the shader requests a color value (also known as sampling) from the texture, it's possible (likely, even) that the calculated coordinate within the texture will fall in between pixels. The hardware has to return a single color value for the sample point, and the way it interpolates is chosen by the graphics runtime. This is referred to as a filter.

Texture Fetch/Filtering
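The simplest interpolating filter is bilinear: blend the four texels surrounding the sample point by its fractional position. This is also why bilinear shows up as a 4x texel-fetch multiplier further down. A single-channel sketch, without the edge clamping and wrap modes real hardware handles (x and y must stay inside the texture interior here):

```cpp
#include <cassert>

// Bilinear sample of a single-channel texture at texel-space (x, y).
// Caller must keep x in [0, width-2] and y in [0, height-2].
float sample_bilinear(const float* tex, unsigned width, float x, float y) {
    unsigned x0 = static_cast<unsigned>(x);
    unsigned y0 = static_cast<unsigned>(y);
    float fx = x - static_cast<float>(x0);
    float fy = y - static_cast<float>(y0);
    float t00 = tex[y0 * width + x0];
    float t10 = tex[y0 * width + x0 + 1];
    float t01 = tex[(y0 + 1) * width + x0];
    float t11 = tex[(y0 + 1) * width + x0 + 1];
    float top = t00 + fx * (t10 - t00);  // horizontal blend, top row
    float bot = t01 + fx * (t11 - t01);  // horizontal blend, bottom row
    return top + fy * (bot - top);       // vertical blend
}
```

Four texel reads per sample instead of one is exactly where the extra read bandwidth goes.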

Here's the GLSL to fetch the texture:

#version 330

uniform sampler2D tex;
in vec2 texCoord;
out vec4 fragColor;

void main() {
    vec4 temp = texelFetch(tex, ivec2(texCoord));
    fragColor = temp;
}
The above actually does something that's perhaps not immediately obvious. fragColor = temp;. This actually instructs the fragment shader to write out that value to a surface which is bound for output (usually a framebuffer). In other words, there are two steps here, read and filter a value from the texture, write it back out.

The part of the overall diagram that represents this step:


Composition

In the old days of X, and even still when not using the composite extension, graphics applications could be given a window to write pixels directly into the resulting output. The X window manager would mediate resize and move events, letting the client update as needed. This has a lot of downsides which I'll say are out of scope here. There is one upside that is in scope though: there's no extra copy needed to create the screen composition. It just is what it is, tearing and all.

If you don't know if you're currently using a compositor, you almost certainly are using one. Wayland only composites, and the number of X window managers that don't composite is very few. So what exactly is compositing? Simply put it's a window manager that marshals frame updates from clients and is responsible for drawing them on the final output. Often the compositor may add its own effects such as the infamous wobbly windows. Those effects themselves may use up bandwidth!

Simplified compositor block diagram

Applications will write their output into what's referred to as an offscreen buffer. 👋👋 The compositor will read the output and copy it into what will become the next frame. What this means from a bandwidth consumption perspective is that the compositor will need to use both read and write bandwidth just to build the final frame. 👋👋
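A toy sketch of why composition costs both kinds of bandwidth: every client pixel is fetched from the offscreen buffer and stored again into the frame being assembled. (A real compositor blits via the GPU, often with format conversion and effects; this is only the bandwidth shape.)

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Copy a client's offscreen buffer into the composed frame at
// window position (x, y). One read plus one write per pixel, per frame.
void composite_window(const std::vector<uint32_t>& offscreen,
                      unsigned win_w, unsigned win_h,
                      std::vector<uint32_t>& frame, unsigned frame_w,
                      unsigned x, unsigned y) {
    for (unsigned row = 0; row < win_h; ++row) {
        for (unsigned col = 0; col < win_w; ++col) {
            frame[(y + row) * frame_w + (x + col)] =
                offscreen[row * win_w + col];
        }
    }
}
```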


Scanout

It's the mundane part of this whole thing. Pixels are fetched from memory and pushed out over whatever display protocol is in use.

Display Engine

Perhaps the interesting thing about the display engine is it has fairly isochronous timing requirements and can't tolerate latency very well. As such, it will likely have a dedicated port into memory that bypasses arbitration with other agents in the system that are generating memory traffic.

Out of scope here, but I'll briefly mention that this also gets a bit into tiling. Display wants to read things row by row, whereas rendering works a bit different. In short this is the difference between X-tiling (good for display), and Y-tiling (good for rendering). Until Skylake, the display engine couldn't even understand Y-tiled buffers.

Summing the Bandwidth Cost

Running through our 64x64 example...

Operation             | Color Depth  | Description     | Bandwidth              | R/W
Texture Upload        | 1Bpc (RGBX8) | File to DRAM    | 16KB (64 × 64 × 4)     | W
Texel Fetch (nearest) | 1Bpc         | DRAM to Sampler | 16KB (64 × 64 × 4)     | R
FB Write              | 1Bpc         | GPU to DRAM     | 16KB (64 × 64 × 4)     | W
Compositing           | 1Bpc         | DRAM to DRAM    | 32KB (64 × 64 × 4 × 2) | R+W
Scanout               | 1Bpc         | DRAM to PHY     | 16KB (64 × 64 × 4)     | R

Total = (16 + 16 + 16 + 32 + 16)KB × 60Hz = 5.625MB/s

But actually, the display engine will always scan out the whole screen, so really with a 4k display:

Total = (16 + 16 + 16 + 32 + 32400)KB × 60Hz = 1.9GB/s
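The totals can be checked with a one-liner: per-frame kilobytes times refresh rate, reported in MB/s (base-2, which is what the 5.625 figure implies).

```cpp
#include <cassert>

// Per-frame KB summed from the bandwidth table, times refresh rate.
double total_mbs(double per_frame_kb, double hz) {
    return per_frame_kb * hz / 1024.0;
}
```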

Don't forget about those various filter modes though!

Filter Mode | Multiplier (texel fetch) | Total Bandwidth
Bilinear    | 4x                       | 11.25MB/s
Trilinear   | 8x                       | 18.75MB/s
Aniso 4x    | 32x                      | 63.75MB/s
Aniso 16x   | 128x                     | 243.75MB/s

Proposing some solutions

Without actually doing the math, I think cache is probably the biggest win you can get. One spot where caching could help, if software were aware of it, is that the framebuffer write step and the composition step could avoid the trip to main memory. Another is texture upload and fetch. Assuming you don't blow out your cache, you can avoid the main memory trip.

While caching can buy you some relief, ultimately you have to flush your caches to get the display engine to be able to read your buffer. At least as of 2017, I was unaware of an architecture that had a shared cache between display and 3d.

Also, cache sizes are limited...

Wait for DRAM to get faster

Instead of doing anything, why not just wait until memory gets higher bandwidth?

Here's a quick breakdown of the progression at the high end of the specs. For the DDR memory types, I took a swag at number of populated channels because for a fair comparison the expectation with DDR is you'll have at least dual channel, nowadays.


Looking at the graph it seems like the memory vendors aren't hitting Moore's Law any time soon, and if they are, they're fooling me. A similar chart should be made for execution unit counts, but I'm too lazy. A Tigerlake GT2 has 96 EUs. If you go back to our back of the envelope calculation we had a midrange GPU with 24 EUs, so that has quadrupled. In other words, the system architects will use all the bandwidth they can get.

Improving memory technologies is vitally important, it just isn't enough.


Hardware Composition

One obvious place we want to try to reduce bandwidth is composition. It was after all the biggest individual consumer of available memory bandwidth.

With composition as we described earlier, there was presumed to be a single plane. Software would arrange the various windows onto the plane, which if you recall from the section on composition added quite a bit to the bandwidth consumption, then the display engine could display from that plane.

Hardware composition is the notion that each of those windows could have a separate display plane, directly write into that, and all the compositor would have to do is make sure those display planes occupied the right part of the overall screen. It's conceptually similar to the direct scanout we described earlier in the section on composition.

Operation   | Color Depth | Description  | Bandwidth              | R/W
Compositing | 1Bpc        | DRAM to DRAM | 32KB (64 × 64 × 4 × 2) | R+W

TOTAL SAVINGS = 1.875MB/s (33% savings)

Hardware Composition Verdict

33% savings is really really good, and certainly if you have hardware with this capability, the driver should enable it, but there are some problems that come along with this that make it not so appealing.

  1. Hardware has a limited number of planes.
  2. Formats. One thing I left out about the compositor earlier is that one of the things it may opt to do is convert the application's window into a format that the display hardware understands. This means some amount of negotiation has to take place so the application knows about this. Prior to this work, that wasn't in place.
  3. Doesn't reduce any other parts of the process, i.e. a full screen application wouldn't benefit at all.

Texture Compression

So far, in order to solve the not-enough-bandwidth problem, we've tried adding more bandwidth and reducing usage with hardware composition. The next step is to tackle the bandwidth consumed by texturing.

If you recall, we split texturing into two stages: texture upload and texture fetch. This third proposed solution attempts to reduce bandwidth by storing a compressed texture in memory. Texture upload compresses it, and texture sampling understands the compression scheme and avoids doing all the lookups. Compressing the texture usually comes with some barely perceptible degradation. In terms of sampling, it's a bit handwavy to say you reduce bandwidth by the compression factor, but for simplicity's sake, let's say that's what it does.

Some common formats at the time of the original materials were

Format | Compression Ratio
DXT1   | 8:1
ETC2   | 4:1
ASTC   | Variable, 6:1

Using DXT1 as an example of the savings:

Operation             | Color Depth | Bandwidth              | R/W
Texture Upload        | DXT1        | 2KB (64 × 64 × 4 / 8)  | W
Texel Fetch (nearest) | DXT1        | 2KB (64 × 64 × 4 / 8)  | R
FB Write              | 1Bpc        | 16KB (64 × 64 × 4)     | W
Compositing           | 1Bpc        | 32KB (64 × 64 × 4 × 2) | R+W
Scanout               | 1Bpc        | 16KB (64 × 64 × 4)     | R

Here's an example with the simple DXT1 format:

Texture Compression Verdict

Texture compression solves a couple of the limitations that hardware composition left. Namely it can work for full screen applications, and if your hardware supports it, there isn't a limit to how many applications can make use of it. Furthermore, it scales a bit better because an application might use many many textures but only have 1 visible window.

There are of course some downsides.

Click for SVG

For comparison, here is the same cube scaled down with an 8:1 ratio. As you can see DXT1 does a really good job.

Scaled cube

We can't ignore the degradation though as certain rendering may get very distorted as a result.

*TOTAL SAVINGS (DXT1) = 1.64MB/s (30% savings)

*total savings here is kind of a theoretical max

End to end lossless compression

So what if I told you there was a way to reduce your memory bandwidth consumption without having to modify your application, without being subject to hardware limits on planes, and without having to wait for new memory technologies to arrive?

End to end lossless compression attempts to provide both "end to end" and "lossless" compression transparently to software. Explanation coming up.

End to End

As mentioned in the previous section on texture compression, one of the pitfalls is that you'd have to decompress the texture in order for it to be used outside of your 3d engine. Typically this would mean for the display engine to scan out from, but you could also envision a case where perhaps you'd like to share these surfaces with the hardware video encoder. The nice thing about this "end to end" attribute is that every stage we mentioned in previous sections that required bandwidth gets the savings just by running on hardware and drivers that enable this.


Now because this is all transparent to the application running, a lossless compression scheme has to be used so that there aren't any unexpected results. While lossless might sound great on the surface (why would you want to lose quality?), it reduces the potential savings because lossless compression algorithms are always less efficient than lossy ones, but it's still a pretty big win.

What's with the box, bro?

I want to provide an example of how this can be possible. Going back to our original image of the full picture, everything looks sort of the same. The only difference is there is a little display engine decompression step, and all of the sampler and framebuffer write steps now have a little purple box accompanying them.

One sort of surprising aspect of this compression is it reduces bandwidth, not overall memory usage (that's also true of the Intel implementation). In order to store the compression information, hardware carves off a little bit of extra memory which is referenced for each operation on a texture (yes, that might use bandwidth too if it's not cached).

Here's a made-up implementation which tracks state in a similar way to Skylake era hardware, but the rest is entirely made up by me. It shows that even a naive implementation can get up to a lossless 2:1 compression ratio. Remember though, this comes at the cost of adding gates to the design, so you'd probably want something better performing than this.

2:1 compression

Everything is tracked as cacheline pairs. In this example we have state called "CCS". For every pair of cachelines in the image, 2b are in this state to track the current compression. When the pair of cachelines uses 12 or fewer colors (which is surprisingly often in real life), we're able to compress the data into a single cacheline (state becomes '01'). When the data is compressed, we can reassemble the image losslessly from a single cacheline, this is 2:1 compression because 1 cacheline gets us back 2 cachelines worth of pixel data.

Walking through the example we've been using of the Rubik's cube.

  1. As the texture is being uploaded, the hardware observes all the runs of the same color and stores them in this compressed manner by building the lookup table. On doing this it modifies the state bits in the CCS to be 01 for those cachelines.
  2. On texture fetch, the texture sampler checks the CCS. If the encoding is 01, then the hardware knows to use the LUT mechanism instead for all the color values.
  3. Throughout the rest of rendering, steps 1 & 2 are repeated as needed.
  4. When display is ready to scanout the next frame, it too can look at the CCS determine if there is compression, and decompress as it's doing the scanout.

The memory consumed is minimal, which also means that any bandwidth usage overhead is minimal. In the example we have a 64x128 image. In total that's 512 cachelines. At 2 bits per pair of cachelines, the CCS for the example fits in a single 64B cacheline: 512 / 2 pairs × 2b = 512b = 64B
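By that rule, the bookkeeping cost of this made-up scheme is easy to compute for any image size (64-byte cachelines, 2 bits of CCS state per pair):

```cpp
#include <cassert>

// CCS size in bytes for a width x height image at bpp bytes per pixel,
// under this made-up scheme: 2 bits per pair of 64B cachelines.
unsigned ccs_bytes(unsigned width, unsigned height, unsigned bpp) {
    unsigned cachelines = width * height * bpp / 64;
    unsigned pairs = cachelines / 2;
    return pairs * 2 / 8;  // 2 bits per pair, 8 bits per byte
}
```

Even a full 4k frame needs only about 63KB of CCS, which is why the bandwidth overhead of consulting it stays small.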

* Unless you really want to understand how hardware might actually work, ignore the 00 encoding for clear color.

* There's a caveat here that we assume texture upload and fetch use the sampler. At the time of the original presentation, this was not usually the case and so until the FB write occurred, you didn't actually get compression.

Theoretical best savings would compress everything:

Operation             | Color Depth     | Description     | Bandwidth                  | R/W
Texture Upload        | 1Bpc compressed | File to DRAM    | 8KB (64 × 64 × 4) / 2      | W
Texel Fetch (nearest) | 1Bpc compressed | DRAM to Sampler | 8KB (64 × 64 × 4) / 2      | R
FB Write              | 1Bpc compressed | GPU to DRAM     | 8KB (64 × 64 × 4) / 2      | W
Compositing           | 1Bpc compressed | DRAM to DRAM    | 16KB (64 × 64 × 4 × 2) / 2 | R+W
Scanout               | 1Bpc compressed | DRAM to PHY     | 8KB (64 × 64 × 4) / 2      | R

TOTAL SAVINGS = 2.8125MB/s (50% savings)

And if you use HW compositing in addition to this...

TOTAL SAVINGS = 3.75MB/s (66% savings)

Ending Notes

Hopefully it's somewhat clear how 3d applications are consuming memory bandwidth, and how quickly the consumption grows when adding more applications, textures, screen size, and refresh rate.

End to end lossless compression isn't always going to be a huge win, but in many cases it can really chip away at the problem enough to be measurable. The challenge, as it turns out, is actually getting it hooked up in the driver and the rest of the graphics software stack. As I said earlier, just because a feature seems good doesn't necessarily mean it's worth the software effort to implement. End to end lossless compression is one feature that you cannot just turn on by setting a bit, and the fact that it's still not enabled anywhere, to me, is an indication that the effort and gates may have been better spent elsewhere.

However, the next section will be all about how we got it hooked up through the graphics stack.

If you've made it this far, you probably could use a drink. I know I can.

26 Feb 2021 12:00am GMT

25 Feb 2021


Mike Blumenkrantz: Delete The Code

A Losing Battle

For a long time, I've tried very, very, very, very hard to work around problems with NIR variables when it comes to UBOs and SSBOs.

Really, I have.

But the bottom line is that, at least for gallium-based drivers, they're unusable. They're so unreliable that it's only by sheer luck (and a considerable amount of it) that zink has worked at all until this point.

Don't believe me? Here's a list of just some of the hacks that are currently in use by zink to handle support for these descriptor types, along with the reason(s) why they're needed:

Hack | Reason It's Used | Bad Because?
-----|------------------|-------------
iterating the list of variables backwards | this indexing vaguely matches the value used by shaders to access the descriptor | only works coincidentally for as long as nothing changes this ordering, and explodes entirely with GL-SPIRV
skipping non-array variables with data.location > 0 | these are (usually) explicit references to components of a block BO object in a shader | sometimes they're the only reference to the BO, and skipping them means the whole BO interface gets skipped
using different indexing for SSBO variables depending on whether data.explicit_binding is set | this (sometimes) works to fix indexing for SSBOs with random bindings and also atomic counters | the value is set randomly by other optimization passes and so it isn't actually reliable
atomic counters are identified by using !strcmp(glsl_get_type_name(var->interface_type), "counters") | counters get converted to SSBOs, but they require different indexing in order to be accessed correctly | c'mon.
runtime arrays (array[]) are randomly tacked onto SPIRV SSBO variables based on the variable type | fixes atomic counter array access and the SSBO length() method | not actually needed most of the time

And then there's this monstrosity that's used for linking up SSBO variable indices with their instruction's access value (comments included for posterity):

unsigned ssbo_idx = 0;
if (!is_ubo_array && var->data.explicit_binding &&
    (glsl_type_is_unsized_array(var->type) || glsl_get_length(var->interface_type) == 1)) {
    /* - block ssbos get their binding broken in gl_nir_lower_buffers,
     *   but also they're totally indistinguishable from lowered counter buffers which have valid bindings
     * hopefully this is a counter or some other non-block variable, but if not then we're probably fucked
     */
    ssbo_idx = var->data.binding;
} else if (base >= 0) {
   /* we're indexing into a ssbo array and already have the base index */
   ssbo_idx = base + i;
} else {
   if (ctx->ssbo_mask & 1) {
      /* 0 index is used, iterate through the used blocks until we find the first unused one */
      for (unsigned j = 1; j < ctx->num_ssbos; j++) {
         if (!(ctx->ssbo_mask & (1 << j))) {
            /* we're iterating forward through the blocks, so the first available one should be
             * what we're looking for
             */
            base = ssbo_idx = j;
            break;
         }
      }
   } else
      /* we're iterating forward through the ssbos, so always assign 0 first */
      base = ssbo_idx = 0;
   assert(ssbo_idx < ctx->num_ssbos);
}
ctx->ssbos[ssbo_idx] = var_id;
ctx->ssbo_mask |= 1 << ssbo_idx;
ctx->ssbo_vars[ssbo_idx] = var;

Does it work?

Amazingly, yes, it does work the majority of the time.

But is this really how we should live our lives?

A Methodology To Live By

As the great compiler-warrior Jasonus Ekstrandimus once said, "Just Delete All The Code".

Truly this is a pivotal revelation, one that can induce many days of deep thinking, but how can it be applied to this scenario?

Today I present the latest in zink code deletion: a NIR pass that deletes all the broken variables and makes new ones.


Let's get into it.

uint32_t ssbo_used = 0;
uint32_t ubo_used = 0;
uint64_t max_ssbo_size = 0;
uint64_t max_ubo_size = 0;
bool ssbo_sizes[PIPE_MAX_SHADER_BUFFERS] = {false};

if (!shader->info.num_ssbos && !shader->info.num_ubos && !shader->num_uniforms)
   return false;
nir_function_impl *impl = nir_shader_get_entrypoint(shader);
nir_foreach_block(block, impl) {
   nir_foreach_instr(instr, block) {
      if (instr->type != nir_instr_type_intrinsic)
         continue;

      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
      switch (intrin->intrinsic) {
      case nir_intrinsic_store_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[1]));
         break;

      case nir_intrinsic_get_ssbo_size: {
         uint32_t slot = nir_src_as_uint(intrin->src[0]);
         ssbo_used |= BITFIELD_BIT(slot);
         ssbo_sizes[slot] = true;
         break;
      }
      case nir_intrinsic_ssbo_atomic_add:
      case nir_intrinsic_ssbo_atomic_imin:
      case nir_intrinsic_ssbo_atomic_umin:
      case nir_intrinsic_ssbo_atomic_imax:
      case nir_intrinsic_ssbo_atomic_umax:
      case nir_intrinsic_ssbo_atomic_and:
      case nir_intrinsic_ssbo_atomic_or:
      case nir_intrinsic_ssbo_atomic_xor:
      case nir_intrinsic_ssbo_atomic_exchange:
      case nir_intrinsic_ssbo_atomic_comp_swap:
      case nir_intrinsic_ssbo_atomic_fmin:
      case nir_intrinsic_ssbo_atomic_fmax:
      case nir_intrinsic_ssbo_atomic_fcomp_swap:
      case nir_intrinsic_load_ssbo:
         ssbo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      case nir_intrinsic_load_ubo:
      case nir_intrinsic_load_ubo_vec4:
         ubo_used |= BITFIELD_BIT(nir_src_as_uint(intrin->src[0]));
         break;
      default:
         break;
      }
   }
}

The start of the pass iterates over the instructions in the shader. All UBOs and SSBOs that are used get tagged into a bitfield of their index, and any SSBOs which have the length() method called are similarly tagged.

nir_foreach_variable_with_modes(var, shader, nir_var_mem_ssbo | nir_var_mem_ubo) {
   const struct glsl_type *type = glsl_without_array(var->type);
   if (type_is_counter(type))
      continue;
   unsigned size = glsl_count_attribute_slots(type, false);
   if (var->data.mode == nir_var_mem_ubo)
      max_ubo_size = MAX2(max_ubo_size, size);
   else
      max_ssbo_size = MAX2(max_ssbo_size, size);
   var->data.mode = nir_var_shader_temp;
}
NIR_PASS_V(shader, nir_remove_dead_variables, nir_var_shader_temp, NULL);

Next, the existing SSBO and UBO variables get iterated over. A maximum size is stored for each type, and then the variable mode is set to temp so it can be deleted. These variables aren't actually used by the shader anymore, so this is definitely okay.


if (!ssbo_used && !ubo_used)
   return false;

Early return if it turns out that there's not actually any UBO or SSBO use in the shader, and all the variables are gone to boot.

struct glsl_struct_field *fields = rzalloc_array(shader, struct glsl_struct_field, 2);
fields[0].name = ralloc_strdup(shader, "base");
fields[1].name = ralloc_strdup(shader, "unsized");

The new variables are all going to be the same type, one which matches what's actually used during SPIRV translation: a simple struct containing an array of uints, aka base. SSBO variables which need the length() method will get a second struct member that's a runtime array, aka unsized.
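In C-like notation the recreated types look roughly like this (illustrative only; MAX_UBO_SIZE and MAX_SSBO_SIZE stand in for the computed maxima):

```c
/* UBO slots, and SSBO slots that never call length() */
struct slot       { uint32_t base[MAX_UBO_SIZE * 4]; };

/* SSBO slots where length() is used get the runtime array appended */
struct sized_slot { uint32_t base[MAX_SSBO_SIZE * 4]; uint32_t unsized[]; };
```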

if (ubo_used) {
   const struct glsl_type *ubo_type = glsl_array_type(glsl_uint_type(), max_ubo_size * 4, 4);
   fields[0].type = ubo_type;
   u_foreach_bit(slot, ubo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ubo_slot_%u", slot);
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ubo, glsl_struct_type(fields, 1, "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

If there's a valid bitmask of UBOs that are used by the shader, the index slots get iterated over, and a variable is created for each slot using the same type. The size is determined by the size of the biggest UBO variable that previously existed, which ensures that there won't be any errors or weirdness with access past the boundary of the variable. All the GLSL compilation and NIR passes to this point have already handled bounds detection, so this is also fine.

if (ssbo_used) {
   const struct glsl_type *ssbo_type = glsl_array_type(glsl_uint_type(), max_ssbo_size * 4, 4);
   const struct glsl_type *unsized = glsl_array_type(glsl_uint_type(), 0, 4);
   fields[0].type = ssbo_type;
   u_foreach_bit(slot, ssbo_used) {
      char buf[64];
      snprintf(buf, sizeof(buf), "ssbo_slot_%u", slot);
      if (ssbo_sizes[slot])
         fields[1].type = unsized;
      else
         fields[1].type = NULL;
      nir_variable *var = nir_variable_create(shader, nir_var_mem_ssbo,
                                              glsl_struct_type(fields, 1 + !!ssbo_sizes[slot], "struct", false), buf);
      var->interface_type = var->type;
      var->data.driver_location = slot;
   }
}

SSBOs are almost the same, but as previously mentioned, they also get a bonus member if they need the length() method. The GLSL compiler has already pre-computed the adjustment for the value that will be returned by length(), so it doesn't actually matter what the size of the variable is anymore.

And that's it! The entire encyclopedia of hacks can now be removed, and I can avoid ever having to look at any of this again.

25 Feb 2021 12:00am GMT

23 Feb 2021


Robert Foss: Upstream camera support for Qualcomm platforms

Linaro has been working together with Qualcomm to enable camera support on their platforms since 2017. The Open Source CAMSS driver was written to support the ISP IP-block with the same name that is present on Qualcomm SoCs coming from the smartphone space.

The first development board targeted by this work was the DragonBoard 410C, which was followed in 2018 by DragonBoard 820C support. Recently support for the Snapdragon 660 SoC was added to the driver, which will be part of the v5.11 Linux Kernel release. These SoCs all contain the CAMSS (Camera SubSystem) version of the ISP architecture.

Currently, support for the ISP found in the Snapdragon 845 SoC and the DragonBoard 845C is in the process of being upstreamed to the mailing lists. Having …

23 Feb 2021 1:54pm GMT

19 Feb 2021


Roman Gilg: Window Kindergarten

In the last post about KWinFT's Windowing Revolution I promised follow-up articles with detailed explorations of two elements of that revolution, which due to their complexity deserve such.

One of them was a new way how Wayland subsurfaces are managed inside KWinFT. Accompanying the 5.21 release of KWinFT this week, which was made available in sync with the KDE Plasma release, let me live up to my promise and start with an exploration of that.

But since even this topic alone is overly complex with a lot of windowing history behind it, we will split it up further and in this first article only look at subsurfaces and related concepts from a high level but without yet looking at the new and improved implementation in KWinFT.

On a high level the notions we are dealing with can always be interpreted as some form of parent-child relation between windows, on Wayland just as much as on X11. We will see that this is a very powerful mental model.

Childhood Legacy

The idea that a window can have children is old. X11 has known that for a long time. It is in fact a central concept in the protocol and what it comes down to is a single, global window tree.

This tree, stored in the X Server, forms a simple hierarchical structure of all windows starting with a generic root window at the top while every child window is contained inside its parent. This means also that the root window spans the whole visible area over all screens. Child windows on the other hand can be further partitioned, moved and resized, ordered and even switch nodes in the tree.

VLC playing a video in another child window.

While the details of that windowing logic can become complex very quickly, I think the general idea behind the window tree itself is simple to grasp. You still might want to read the window tree chapter in Xplain to get a feel for it.
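Schematically, a session with two applications might produce a tree like this (a simplified illustration):

```
root window (spans all screens)
├─ toplevel: VLC main window
│  └─ child: video view
└─ toplevel: file manager window
   ├─ child: toolbar
   └─ child: folder view
```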

Now as you can imagine such a hierarchical structure of windows is not something only X11 employs. But before we come to Wayland, there is one more relation between windows on X11, similar to the parent-child relation in the window tree, that we should explore.

Distant Relatives

A second concept of window relation in X11 is that of transient windows. It is similar to the parent-child relation in the global window tree. But while an X11 window always has a parent, with the exception of the root window at the top of the tree, and so in particular is a child of that parent, transient windows are much less often encountered in the wild.

Jumping the Tree

The parent-child relation in the window tree is simple to understand. Transient windows break this model up in some way and build links between windows across different branches of the tree. That sounds complicated, and indeed it sometimes is. In fact I would argue already their name is misleading.

But before coming to that, let us first hold onto what is unambiguous. Windows that are direct children of the root window are special, they are often called toplevel windows because of that. And a child of such a window is obviously - per this definition - not a toplevel window.

Transient windows in general come as toplevel windows. Seen as part of the window tree, they are siblings of the window they are transient for.

Dialog as a transient window for the Kate window.

Typical examples of such are dialog windows, shown when your application asks you if you really want to do what you just tried to do. The client sets such a helper-window as a transient for the main window of the application.

The window manager still paints decorations around the transient window and you can move it by grabbing its window bar, but the window manager normally ensures that it can't be stacked below the window it is a transient for. And if for example you switch the main application window to a different virtual desktop, the transient often follows the main window to that other desktop. This is how KWinFT does it.

What They Are Transient For

You may have noticed that in the last paragraphs I've used the word construct "transient for" an awful lot.

What defines a window to become transient is not specified in the X11 protocol itself, but in the Inter-Client Communication Conventions Manual (ICCCM) in the form of a window property with this very name.

This name feels weird, doesn't it? Let's try to understand what it means: A window W2 has the WM_TRANSIENT_FOR property set to another window W1 if window W2 is a transient window for window W1.

For example when a file explorer opens a dialog to ask you for a new name for a directory that it wants to create, the dialog window is W2 and the file explorer window, that was there before, is W1.
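In code, a toolkit establishes this relation by setting the property on W2, typically through the Xlib convenience call (a pseudocode-level sketch; the display and window variables are assumed to already exist):

```c
/* dialog = W2, file_explorer = W1; sets WM_TRANSIENT_FOR on the dialog */
XSetTransientForHint(display, dialog, file_explorer);
```

The window manager then reads the property back to learn about the relation.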

According to Wikipedia, we call such a relation transient at all because W2 only exists as long as W1 exists. But child windows in the window tree are also only mapped, i.e. visible and by that existing in practice, as long as their parents are mapped. So why are they not called transient windows too?

In theory nothing forces the user to close the dialog either; it could be kept around forever. In the case of a dialog that wouldn't make much sense from a practical viewpoint, but there are other clients with permanent transient windows, like a Plasma applet when "pinned" to stay open. The bottom line is that this name feels wrong.

And then there's its indirect nature! The "transient for" property names the window that the other window, the one carrying the property, is a transient for. That is difficult to understand because we usually don't name things by what another thing does to them.

Alfred Krupp, an important "work for" in the 19th century.

For example, we don't call the owner of a company the "work for" just because there are employees working in that company. We say instead that the owner is an employer, a name which comes from what the owner does to the employees and not the other way around.

But besides the weird naming scheme the technicalities are clear: we set the property to another window and by that the relation is established.

As you can set the property only once per window, but many windows can point it at the same window, the relation is one-to-many. That feels natural as it reminds us of a single node with its children in a tree structure. But the X ecosystem of course has means to generalize that concept so it becomes far more difficult to understand.

Group Transients

Complexity rises with the introduction of so-called group transients. They are defined by the Extended Window Manager Hints (EWMH) specification in the most ridiculous way, which warrants a direct quotation:

If the WM_TRANSIENT_FOR property is set to None or Root window, the window should be treated as a transient for all other windows in the same group. It has been noted that this is a slight ICCCM violation, but as this behavior is pretty standard for many toolkits and window managers, and is extremely unlikely to break anything, it seems reasonable to document it as standard.


This makes transient relations between windows much more complicated, as a window might not be transient for a single other window but for many. With that, window transiency becomes a many-to-many relation.

A rare example of a group transient: Latte Dock's settings window.

I have yet to see a use case for group transients that could not also be solved with normal transients, but as some X11 clients expect group transients to be a thing, KWinFT needs to support them.

While this was done in the past in KWin in a separate fashion basically doubling the implementation cost, the recent windowing refactor has unified it under a single mechanism together with usual X11 transients and even all kinds of child windows on Wayland.

But before we finally shift our attention over to Wayland I have to get some words off my chest about the naming of things.

Nomenclature in Perpetuity

Naming things is difficult. Technically minded people in particular often underestimate the challenge just as much as its importance. In contrast, the humanities have a long tradition of describing and criticizing concepts of notions and language.

The importance is emphasised in the German language, where the word for "notion" is "Begriff", which is related to "begreifen", meaning to understand something. "Etwas auf den Begriff bringen" means not only to find a name for something but to understand it.

When we name something we should remember that, not only for ourselves, but also for other people who have to understand and memorize the terminology we defined. This is important for open source projects that rely on voluntary contributions just as much as for companies as it lowers costs when onboarding new employees.

But that doesn't mean a once established nomenclature has to be perfect. On the contrary, I would say that is impossible, and as time moves on we should verify that the terminology still makes sense, and revise it if not.

The reality though is that this is rarely done systematically. Instead the terminology stays frozen in time while its meaning shifts naturally. That is not necessarily an issue either, but one should be aware of it, and if the gap between the intuitive and the original meaning of a notion becomes too wide, one must consider redefining or at least annotating it. The notions of child windows and transients are good examples of that.

Transiently Incomprehensible

We learned above what it means for a window to be a transient for another window. This included a short discussion of why "transient for" is a silly name for the window that is not the transient in this relation. But that does not answer the question of why such an unusual naming scheme was used.

The real story behind it probably only a few people alive can tell, as the concept seems to be very old. But I have a suspicion it went down like this:

  1. The better-fitting notion for a transient window, that of a child of some parent window, was already taken by a different concept, and we know by which one: the window tree. So a different name had to be invented.
  2. The first transient windows were indeed only very short-lived windows, so it made sense to base the new name on the temporal context.
  3. And last but definitely worst: while the "child" in that relationship was the transient, nobody had thought of a name for the counterpart. So it was just named the "transient for" later on.

That story makes sense to me as it assumes best intentions and a natural progression to the suboptimal status quo.

Better Names

The terminology in documents like the ICCCM of course won't change anymore, but here is how I bent the names internally when referencing them in KWinFT's code and how I will speak about them in the future. Take it also as some advice on how to name things yourself if you ever have to do it.

First off, "transient for" must go. With the story above we have an idea of what might have led to the creation of this abomination of a notion. Let's go back to that. We said the notion of a child window was already in use, although it would have been the far better notion. Why? One reason is obviously that it already comes with a name for its counterpart: parent.

Another reason is more fundamental: we directly have an image in our mind of what this notion means in purely logical terms, namely a relation between a more important primary entity and a secondary, dependent one. Such images are strong and we should use them whenever possible.

To cut it short: in KWinFT's code I just went with the following. I call windows with transient relationships between them transient children instead of just transients, and transient lead instead of transient for.

I opted for "lead" instead of "parent" because as a group transient a window might have several transient leads while the word "parent" would rather signal that there is only a single one.

Children of Tomorrow

A lot has been said about child windows on X11 now. But the interesting technology stack today is Wayland. So is the situation similarly tricky as on X11 with normal children, transients and group transients? From my experience luckily that is not the case, albeit interestingly the basic ideas are remarkably similar.

There is one big difference right away though: the concept of a single global window tree does not exist in the Wayland protocol. In particular there is no root window. This makes sense because clients don't have any knowledge about the global state of the compositor. The compositor itself might implement some form of a global tree, but that is of no relevance to the protocol.

On the other side what was in the past handled inside the X Server is now the responsibility of the window manager, in particular managing local replacements for the previously global window tree and its parent-child relations in the form of subsurfaces.


The previous all-encompassing system of parent-child relations via the global window tree has no equivalent in Wayland anymore, but in some cases there is still a need for such relations on a local basis, that means per window, or rather, in Wayland terminology, per surface. Subsurfaces are objects in the core protocol which do exactly that.

Their use case is similar to that of child windows in X11. Clients might use them to display certain UI elements, for example a drop-down menu. But their real power comes with views that use different buffer configurations, for example the video view of a media player and other controls around it.

Weston provides demos like this one, which tests subsurfaces.

We talked a lot about language when we looked at X11. The situation is much better on Wayland. "Subsurface" is a nice name for what the objects intend to do and specific enough to not take away generic names from other concepts like the parent-child relation of the global window tree in X11 did.

But the Wayland specs and by that also other documentation luckily still uses that metaphor of a parent with children when describing subsurfaces. So to understand them we can think in that terminology without hesitation.
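To give a rough feel for the client side, creating a subsurface boils down to a handful of requests (a pseudocode-level sketch; the subcompositor, the surfaces and all error handling are assumed to exist already):

```c
/* embed a video surface into the player's main surface */
struct wl_subsurface *sub =
    wl_subcompositor_get_subsurface(subcompositor, video_surface, main_surface);
wl_subsurface_set_position(sub, 16, 48); /* relative to the parent's top-left corner */
wl_subsurface_set_desync(sub);           /* let the video commit independently */
wl_surface_commit(main_surface);         /* position is applied with the parent's commit */
```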

For more information about subsurfaces read either the well-written descriptions in the core protocol or the subsurfaces chapter of the Wayland book linked above. If you are new to Wayland development though, you should rather start with reading up on Wayland surfaces.

Toplevel Children and Foreign Surfaces

We said toplevel windows on X11 are windows that are direct children of the root window.

For an X11 window manager these windows are the only ones of importance. These are the windows it may move and resize, and in general whose state it manages, while it generally does not care about all other windows. And as noted, these other windows are, per definition, children of one of these toplevel windows or of the root window itself.

In this light, the xdg-shell protocol extension provides very similar objects for Wayland with the xdg-toplevel type. These objects behave like the toplevels in X11, in that the Wayland window manager may move and resize them on user input or following other events.

And while subsurfaces are the spiritual successors of the classical parent-child relation of X11, setting a parent on an xdg-toplevel object reminds us of the previously discussed transient windows.

Here, as before with transient windows, we establish relations between windows of the same kind, in that the windows are independent toplevel windows. Like with transient windows, the window manager is supposed to stack these windows relative to each other, with one of them above the other.

Additionally there is the extension xdg-foreign-unstable-v2, which allows setting the relation across process boundaries. This is important for example for Flatpak apps and other sandboxed applications. It builds though upon the parent-child relation in the xdg-shell protocol. Thanks to Simon Ser for pointing this out!

Coming quickly back to the discussion about terminology, it is great to see that the request in xdg-shell to establish a parent-child relation between xdg-toplevels is simply called set_parent. And because of the object-oriented design of Wayland, this name is not used up by that, but we can still use it in other protocol extensions when it makes sense.
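For completeness, that request is a one-liner in a client (a sketch; the two xdg_toplevel objects are assumed to exist already):

```c
/* mark the dialog's toplevel as a child of the main window's toplevel */
xdg_toplevel_set_parent(dialog_toplevel, main_toplevel);
```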

In the case of the second protocol extension the adjective foreign is very descriptive about what the extension is meant for. And the document, which specifies the extension, keeps using the parent-child metaphor to describe its usage. This is also great, as it allows us to keep using our mental model.

What is not good is that the request to set the child of a parent surface is called set_parent_of. I wouldn't be surprised if the name was inspired by the ominous "transient for" construct. Old habits die hard.

But in the next version of that protocol let us call that request just set_child, ok?

You might ask now: but what about group transients? I am happy to tell you, they are not a thing on Wayland.

One Last Thing: Pop It!

There is one more type of child window on Wayland, which should get mentioned but does not have a direct equivalence on X11: popups, usually in the form of context menus.

On Wayland these can be realized with the xdg-shell protocol extension providing the xdg-popup type.

Now you might say: "Wait a minute, context menus are also a thing on X11!" That's true, but on X11 they are not really child windows of anything other than the root window. They are placed by the client directly in global coordinates and superimposed by the X Server as override-redirect windows. The window manager just ignores them.

On Wayland that is different, because for one the window manager is the server and secondly clients can not place surfaces in global coordinates.

Wayland popups are placed and can be moved by the compositor, here together with the infamous wobbly windows effect.

A popup must therefore be placed relative to some other surface the client knows about. For example a right-click context menu will be opened directly at the position of the cursor. A hamburger menu opens relative to the visual boundaries of the hamburger button.

The xdg-shell protocol provides means to allow such initial placement and later correction. Admittedly that can become complex pretty quickly.
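As a pseudocode-level sketch of what a client does (object names are illustrative), opening a menu relative to a button uses an xdg_positioner:

```c
/* anchor the popup to the button's rectangle in the parent surface */
struct xdg_positioner *pos = xdg_wm_base_create_positioner(wm_base);
xdg_positioner_set_anchor_rect(pos, btn_x, btn_y, btn_w, btn_h);
xdg_positioner_set_anchor(pos, XDG_POSITIONER_ANCHOR_BOTTOM_LEFT);
xdg_positioner_set_gravity(pos, XDG_POSITIONER_GRAVITY_BOTTOM_RIGHT);
/* let the compositor flip the menu above the button if it wouldn't fit below */
xdg_positioner_set_constraint_adjustment(pos, XDG_POSITIONER_CONSTRAINT_ADJUSTMENT_FLIP_Y);
struct xdg_popup *popup = xdg_surface_get_popup(menu_xdg_surface, parent_xdg_surface, pos);
xdg_positioner_destroy(pos);
```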

The important fact to remember, though, is that on Wayland the xdg-popup type also establishes a parent-child relation with another window as its parent, while popups on X11 did not.

Recap and Next

In this introduction we took a tour through our "Window Kindergarten". We learned about the different kinds of window children by looking at their basic definitions in the X11 protocol and other X specifications and in the Wayland protocols with its extensions. We also mentioned for each of them what their use case is.

The mental model of window children and parents is powerful and transcends the mere technicalities of each individual protocol. Due to historic circumstances or just inexperience in developing intuitive terminologies, these fundamental ideas are sometimes more difficult to understand than necessary. Luckily this has improved with Wayland.

There seem to be two complementary basic types of children:

* children that are embedded into their parent and positioned relative to it: the window tree's child windows on X11, and subsurfaces and popups on Wayland;
* children that are themselves toplevel windows linked to one or more leads: transients and group transients on X11, and xdg-toplevels with a parent set (including foreign surfaces) on Wayland.

Additionally, on X11 we need a more complex implementation because of group transients, while on Wayland subsurfaces and foreign surfaces are more straightforward. On the other side, on Wayland the window manager needs to take over some work for subsurfaces which was in the past handled by the X Server.

So we learned quite a lot about child windows. In the upcoming article, which is targeted to be published next week, we will use this knowledge to take an in-depth look at the recent innovations in KWinFT when handling X11 transients, Wayland subsurfaces and foreign surfaces in a unified way.

19 Feb 2021 3:00pm GMT

Mike Blumenkrantz: Notes

Quickly: ES 3.2

I've been getting a lot of pings over the past week or two about ES 3.2 support.

Here's the deal.

It's not happening soon. Probably.

Zink currently supports every 3.2 extension except for photoshop. There's two ways to achieve support for that extension at present:

* Yes, I know that Nvidia supports advanced blend, but zink+nvidia is not currently capable of doing ES of any version, so that's not testable.

So in short, it's going to be a while.

But there's not really a technical reason to rush towards full ES 3.2 anyway other than to fill out a box on mesamatrix. If you have an app that requires 3.2, chances are that it probably doesn't really require it; few apps actually use the advanced blend extension, and so it should be possible for the app to require only 3.1 and then verify the presence of whatever 3.2-based extensions it may use in order to be more compatible.

Of course, this is unlikely to happen. It's far easier for app developers to just say "give me 3.2" if maybe they just want geometry shaders, and I wouldn't expect anyone is going to be special-casing things just to run on zink.

Nonetheless, it's really not a priority for me given the current state of the Vulkan ecosystem. As time moves on and various extensions/features become more common that may change, but for now I'm focusing on things that are going to be the most useful.

19 Feb 2021 12:00am GMT

Ben Widawsky: bwidawsk.net 2.0


After a lot of effort over short stints in the last several months, I have completed my blog migration to Lektor in the hopes that when I migrate again in the future, it won't be as painful.

Despite my efforts, many old posts might not be perfect. This is a job for the wayback machine.

In case you're curious, I did this primarily for one reason (and lots of smaller ones): I wanted my data back. Wordpress is an open source blogging platform with huge adoption. It has a very large plugin ecosystem and is very actively updated and maintained. While security issues have come up here and there, at some point automatic updates became an option and that helped a bit. In 2010 it was the obvious choice.

If you've gained anything from my blog posts, you should thank Wordpress. Wordpress' ease of setup and relative ease of use is a big reason I was able to author things as well as I did.

So what happened - plugins

I wanted my data back. It was a self-hosted instance and I had all my information stored in a SQL database. Obviously I never lost my data, but...


I used plugins for my tables (multiple plugins). I used plugins for code highlighting. Plugins for LaTeX. Plugins for table of contents, social media integration, post tagging, image captioning and formatting, spelling. You get the idea. The result of all this was that I ended up with blog posts that were entirely useless in their text-only form, with plugins storing the data in non-standard places so it could be processed and look fancy.

The WYSIWYG editor interface was a huge plus for me. I spent all day in front of a terminal breaking graphics and display (meaning I really was in front of an 80x24 terminal at times). I didn't want to have to deal with fanciful layout engines or styles. Those plugins ended up destroying the WYSIWYG editor experience and I ended up doing everything in quasi markdown anyway.

Plugins themselves introduced security issues, even when they weren't intentionally malicious.

What was next?

These static site generators seemed really appealing as a solution to this problem. Everything in markdown. Assets stored together in the filesystem. Jekyll is obviously hugely popular. Hugo, Pelican, Gatsby, and Sphinx are all generators I considered. The number of static site generators is staggering. I wish I could remember what made me choose Lektor, but I can't - Python-based was my only requirement.

Python, because I wanted a platform that did most of what I wanted but was extensible by me if absolutely necessary.

Migrating was definitely a lot of work. I was tempted several times to abort the effort and just rely on the Wayback Machine. Ultimately I decided that migrating the posts would be a good way to learn how well the platform would meet my needs (that being an annual blog post or so).

There are definitely some features I miss that I may or may not get to.

  1. Comments. There is Disqus integration. I'm not convinced this is what I want.
  2. Post grouping. There are categories. It was too complicated for me to figure out in a short time, so I'm punting on it for now.
  3. I'd really like to not have to learn CSS and jinja2. I can scrape by a bit, but changing anything drastic takes a lot of effort for me.


I followed this. I did have to make some minor changes specific to my needs and posts did still require some touchups, in large part due to plugins and my obsessive use of SVG.

See you soon

Now that I'm back, I hope to post more often. Next up will be a recap of some of the pathfinding projects I worked on after FreeBSD enabling.

19 Feb 2021 12:00am GMT

18 Feb 2021


Peter Hutterer: A pre-supplied "custom" keyboard layout for X11

Last year I wrote about how to create a user-specific XKB layout, followed by a post explaining that this won't work in X. But there's a pandemic going on, which is presumably the only reason people haven't all switched to Wayland yet. So it was time to figure out a workaround for those still running X.

This Merge Request (scheduled for xkeyboard-config 2.33) adds a "custom" layout to the evdev.xml and base.xml files. These XML files are parsed by the various GUI tools to display the selection of available layouts. An entry in there will thus show up in the GUI tool.

Our rulesets, i.e. the files that convert a layout/variant configuration into the components to actually load, already have wildcard matching [1]. So the custom layout will resolve to the symbols/custom file in your XKB data dir - usually /usr/share/X11/xkb/symbols/custom.

This file is not provided by xkeyboard-config. It can be created by the user though and whatever configuration is in there will be the "custom" keyboard layout. Because xkeyboard-config does not supply this file, it will not get overwritten on update.

From XKB's POV it is just another layout and it thus uses the same syntax. For example, to override the +/* key on the German keyboard layout with a key that produces a/b/c/d on the various Shift/Alt combinations, use this:

xkb_symbols "basic" {
    include "de(basic)"
    key <AD12> { [ a, b, c, d ] };
};

This example includes the "basic" section from the symbols/de file (i.e. the default German layout), then overrides the 12th alphanumeric key from left in the 4th row from bottom (D) with the given symbols. I'll leave it up to the reader to come up with a less useful example.

There are a few drawbacks:

So overall, it's a hack[2]. But it's a hack that fixes real user issues and given we're talking about X, I doubt anyone notices another hack anyway.

[1] If you don't care about GUIs, setxkbmap -layout custom -variant foobar has been possible for years.
[2] Sticking with the UNIX principle, it's a hack that fixes the issue at hand, is badly integrated, and weird to configure.

18 Feb 2021 1:57am GMT

Mike Blumenkrantz: Roadmapping

What's Next

It's been a busy week. The CTS fixes and patch drops are coming faster and faster, and progress is swift. Here's a quick note on some things that are on the horizon.

Features Landing Soon

Zink's in a tough spot right now in master. GL 4.6 is available, but there are still plenty of things that won't work, e.g., running anything at 60fps. These are things I expect (hope) to see land in the repo in the next month or so:

All told, just as an example, Unigine Heaven (which can now run in color!) should see roughly a 100% performance improvement (possibly more) once this is in, and I'd expect substantial performance gains across the board.

Will you be able to suddenly play all your favorite GL-based Steam games?


I can't even play all your favorite GL-based Steam games yet, so it's a long ways off for everyone else.

But you'll probably be able to get surprisingly good speed on what things you can run considering the amount of time that will pass between hitting 4.6 and these patchsets merging.

Features I'm Working On

I spent some time working on Wolfenstein over the past week, but there's some non-zink issues in the way, so that's on the backburner for a while. Instead, I've turned my attention to CTS and begun unloading a dumptruck of resulting fixes into the codebase.

There comes a time when performance is "good enough" for a while, and, after some intense optimizing since the start of the year, that time has come. So now it's back to stabilization mode, and I'm now aiming to have a vaguely decent pass rate in the near term.

Hopefully I'll find some time to post some of the crazy bugs I've been hunting, but maybe not. Time will tell.

18 Feb 2021 12:00am GMT

11 Feb 2021


Mike Blumenkrantz: Two With One Blow

By Now

…or in the very near future, the ol' bumperino will have landed, putting zink at GL 4.5.

But that's boring, so let's check out something very slightly more interesting.

Steam Games

What are they and how do they work?

I'm not going to answer these questions, but I am going to be looking into getting them working on zink.

To that end, as I hinted at yesterday, I began with Wolfenstein: The New Order, as chosen by Daniel Schuermann, the lucky winner of the What Steam Game Should Zink Use As Its Primary Test Case And Benchmark contest that was recently held.

Early tests of this game were unimpressive. That is to say I got an immediate crash. It turns out that having the GL compatibility context restricted to 3.0 is bad for getting AAA games running, so zink-wip now enables 4.6 compat contexts.

But then I was still getting a crash without any clear error message. Suddenly, I was back in 2004 trying to figure out how to debug wine apps.

Things are much simpler now, however. PROTON_DUMP_DEBUG_COMMANDS enables dumping scripts for debugging from steam, including one which attaches a debugger to the game. This solved the problem of getting a debugger in before the almost-immediate crash, but it didn't get me closer to a resolution.

The problem now is that I'd attached a debugger to the in-wine process, which is just a sandbox for the Windows API. What I actually wanted was to attach to the wine process itself so I could see what was going on in the driver.

gdb --pid=$(pidof WolfNewOrder_x64.exe) ended up being what I needed, but this was complicated by the fact that I had to attach before the game crashed and without triggering the steam error reporter. So in the end, I had to attach using the proton script, then while it was paused, attach to the outer process for driver debugging. But then also I had to attach to the outer process after zink was loaded, so it was a real struggle.

Then, as per usual, another problem: I had no symbols loaded because proton runs a static binary. After cluelessly asking around in the DXVK discord, @Herbert helpfully provided a gdb python script for proton in-process debugging that I was able to repurpose for my needs. The gist (haha) of the script is that it scans /proc/$pid/maps and then manually loads the required library files.
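The discovery step can be sketched in shell (a simplified stand-in for the actual gdb script, which additionally computes each library's load address and issues gdb's add-symbol-file command; using our own shell's PID here is purely illustrative):

```shell
# Walk /proc/<pid>/maps and list the file-backed mappings -- the same
# list a symbol-loading gdb script iterates over. Field 6 of each maps
# line is the pathname, present only for file-backed mappings.
pid=$$
awk '$6 ~ /^\// { print $6 }' "/proc/$pid/maps" | sort -u
```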

At last, I had attached to the game, I had symbols, and I could see that I was hitting a zink assert I'd added to catch int overflows. A quick one-liner to change the order of a calculation fixed that, and now I'm on to an entirely new class of bugs.

11 Feb 2021 12:00am GMT

10 Feb 2021


Martin Peres: Setting up a CI system part 2: Generating and deploying your test environment

This article is part of a series on how to setup a bare-metal CI system for Linux driver development. Check out part 1 where we expose the context/high-level principles of the whole CI system, and make the machine fully controllable remotely (power on, OS to boot, keyboard/screen emulation using a serial console).

In this article, we will start demystifying the boot process, and discuss different ways to generate and boot an OS image along with a kernel for your machine. Finally, we will introduce boot2container, a project that makes running containers on bare metal a breeze!

This work is sponsored by the Valve Corporation.

Generating a kernel & rootfs for your Linux-based testing

To boot your test environment, you will need to generate the following items:

* a kernel;
* an initramfs (optional);
* a rootfs.

The initramfs is optional because the drivers and their firmware can be built into the kernel directly.

Let's not generate these items just yet, but instead let's look at the different ways one could generate them, depending on their experience.

The embedded way

Buildroot's logo

If you are used to dealing with embedded devices, you are already familiar with projects such as Yocto or Buildroot. They are well-suited to generating a tiny rootfs, which can be useful for netbooted systems such as the one we set up in part 1 of this series. They usually allow you to describe everything you want on your rootfs, then will configure, compile, and install all the wanted programs in the rootfs.

If you are wondering which one to use, I suggest you check out the presentation from Alexandre Belloni / Thomas Petazzoni which will give you an overview of both projects, and help you decide on what you need.



The Linux distribution way

Debian Logo, www.debian.org

If you are used to installing Linux distributions, your first instinct might be to install your distribution of choice in a chroot or a Virtual Machine, install the packages you want, and package the folder/virtual disk into a tarball.

Some tools such as debos or virt-builder make this process relatively painless, although they will compile neither an initramfs nor a kernel for you.

Fortunately, building the kernel is relatively simple, and there are plenty of tutorials on the topic (see ArchLinux's wiki). Just make sure to compile modules and firmware in the kernel, to avoid the complication of using an initramfs. Don't forget to also compress your kernel if you decide to netboot it!
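To illustrate that advice, the relevant kernel configuration could look like the following fragment (a sketch: the driver and firmware file are examples for an Intel GPU, not something this article prescribes):

```
# Build the GPU driver into the kernel image (=y rather than =m)
CONFIG_DRM_I915=y
# Embed the firmware the driver needs, so that no initramfs is required
CONFIG_EXTRA_FIRMWARE="i915/kbl_dmc_ver1_04.bin"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
```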



The refined distribution way: containers

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc.

Containers are an evolution of the old chroot trick, made secure thanks to the addition of multiple namespaces to Linux. Containers and their runtimes have addressed pretty much all the cons of the "Linux distribution way", and have become a standard way to share applications.

On top of generating a rootfs, containers also allow setting environment variables and controlling the command line of the program, and they have a standardized transport mechanism which simplifies sharing images.

Finally, container images are constituted of cacheable layers, which can be used to share base images between containers, and also speed up the generation of the container image by only re-computing the layer that changed and all the layers applied on top of it.

The biggest drawback of containers is that they are usually meant to be run on pre-configured hosts. This means that if you want to run the container directly, you will need to make sure to include an init script or install systemd in your container, and set it as the entrypoint of the container. It is however possible to perform these tasks before running the container, as we'll explain in the following sections.



Deploying and booting a rootfs

Now we know how we could generate a rootfs, so the next step is to be able to deploy and boot it!

Challenge #1: Deploying the Kernel / Initramfs

There are multiple ways to deploy an operating system:

* Flash and reboot: write the OS image to the machine's local storage, then reboot into it;
* Netboot: download the kernel/initramfs over the network at every boot.

The former solution is great at preventing the bricking of a device that depends on an Operating System to be flashed again, as it enables checking the deployment on the device itself before rebooting.

The latter solution enables diskless test machines, which is an effective way to reduce state (the enemy #1 of reproducible results). It also enables a faster deployment/boot time as the CI system would not have to boot the machine, flash it, then reboot. Instead, the machine simply starts up, requests an IP address through BOOTP/DHCP, downloads the kernel/initramfs, and executes the kernel. This was the solution we opted for in part 1 of this blog series.
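The server side of such a netboot setup can be as small as a single dnsmasq instance. A hypothetical minimal configuration (interface name, address range, and paths are placeholders) might look like:

```
# /etc/dnsmasq.d/netboot.conf -- illustrative only
interface=eth1                 # NIC facing the test machines
dhcp-range=192.168.0.10,192.168.0.100,12h
dhcp-boot=pxelinux.0           # bootloader handed to PXE clients
enable-tftp
tftp-root=/srv/tftp            # holds pxelinux.0, the kernel and initramfs
```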

Whatever solution you end up picking, you will now be presented with your next challenge: making sure the rootfs remains the same across reboots.

Challenge #2: Deploying the rootfs efficiently

If you have chosen the Flash and reboot deployment method, you may be prepared to re-flash the entire Operating System image every time you boot. This would make sure that the state of a previous boot won't leak into following boots.

This method can however become a big burden on your network when scaled to tens of machines, so you may be tempted to use a Network File System such as NFS to spread the load over a longer period of time. Unfortunately, using NFS brings its own set of challenges (how deep is this rabbit hole?):

So, instead of trying to spread the load, we could try to reduce the size of the rootfs by only sending the content that changed. For example, the rootfs could be split into the following layers:

Layers can be downloaded by the test machine, through a short-lived-state network protocol such as HTTP, as individual SquashFS images. Additionally, SquashFS provides compression which further reduces the storage/network bandwidth needs.

The layers can then be directly combined by first mounting the layers to separate folders in read-only mode (only mode supported by SquashFS), then merging them using OverlayFS. OverlayFS will store all the writes done to this file system into the workdir directory. If this work directory is backed up by a ramdisk (tmpfs) or a never-reused temporary directory, then this would guarantee that no information from previous boots would impact the new boots!
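The sequence described above could look roughly like this (a sketch requiring root; image and directory names are made up for illustration):

```shell
# Mount each SquashFS layer read-only (the only mode SquashFS supports)
mkdir -p /mnt/base /mnt/driver /mnt/rootfs /tmp/overlay
mount -t squashfs -o ro base.squashfs   /mnt/base
mount -t squashfs -o ro driver.squashfs /mnt/driver

# Back the writable layer with a tmpfs so nothing survives a reboot
mount -t tmpfs tmpfs /tmp/overlay
mkdir /tmp/overlay/upper /tmp/overlay/work

# Merge the layers with OverlayFS; all writes land in upperdir
mount -t overlay overlay \
    -o lowerdir=/mnt/driver:/mnt/base,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
    /mnt/rootfs
```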

If you are familiar with containers, you may have recognized this approach as what is used by containers: layers + overlay2 storage driver. The only difference is that container runtimes depend on tarballs rather than SquashFS images, probably because SquashFS is a Linux-only filesystem.

If you are anything like me, you should now be pretty tempted to simply use containers for the rootfs generation, transport, and boot! That would be a wise move, given that thousands of engineers have been working on them over the last decade or so, and whatever solution you may come up with will inevitably have even more quirks than these industry standards.

I would thus recommend using containers to generate your rootfs, as there are plenty of tools that will generate them for you, with varying degrees of complexity. Check out buildah, if Docker or Podman are too high-level for your needs!

Let's now brace for the next challenge, deploying a container runtime!

Challenge #3: Deploying a container runtime to run the test image

In the previous challenge, we realized that a great way to deploy a rootfs efficiently was to simply use a container runtime to do everything for us, rather than re-inventing the wheel.

This would enable us to create an initramfs which would be downloaded along with the kernel through the usual netboot process, and would be responsible for initializing the machine, connecting to the network, mounting the layer cache partition, setting the time, downloading a container, then executing it. The last two steps would be performed by the container runtime of our choice.

Generating an initramfs is way easier than one can expect. Projects like dracut are meant to simplify their creation, but my favourite has been u-root, coming from the LinuxBoot project. I generated my first initramfs in less than 5 minutes, so I was incredibly hopeful to achieve the outlined goals in no time!

Unfortunately, the first setback came quickly: container runtimes (Docker, or Podman) are huge (~150 to 300 MB), if we are to believe Alpine Linux's size of their respective packages and dependencies! While this may not be a problem for the Flash and reboot method, it is definitely a significant issue for the Netboot method which would need to download it for every boot.

Challenge #3.5: Minifying the container runtime

After spending a significant amount of time studying container runtimes, I identified the following functions:

Thus started my quest to find lightweight solutions that could do all of these steps... and to wonder just how deep this rabbit hole goes??

The usual executor found in the likes of Podman and Docker is runc. It is written in Golang, which compiles everything statically and leads to giant binaries. In this case, runc clocks in at ~12MB. Fortunately, a knight in shining armour came to the rescue, re-implemented runc in C, and named it crun. The final binary size is ~400 KB, and it is fully compatible with runc. That's good enough for me!

To download and unpack the rootfs from the container image, I found genuinetools/img which supports that out of the box! Its size was however much bigger than expected, at ~28.5MB. Fortunately, compiling it ourselves, stripping the symbols, then compressing it using UPX led to a much more manageable ~9MB!

What was left was to generate the container manifest according to the runtime spec. I started by hardcoding it to verify that I could indeed run the container. I was relieved to see it would work on my development machine, even though it failed in my initramfs. After spending a couple of hours diffing straces, poking a couple of sysfs/config files, and realizing that pivot_root does not work in an initramfs, I finally managed to run the container with crun run --no-pivot!

I was over the moon, as the only thing left was to generate the container manifest by patching genuinetools/img to generate it according to the container image manifest (like docker or podman do). This is where I started losing my grip: lured by the prospect of a simple initramfs solving all my problems, and being so close to the goal, I started free-falling down what felt like the deepest rabbit hole of my engineering career... Fortunately, after a couple of weeks, I emerged, covered in mud but victorious! Cue the gory battle log :)

When trying to access the container image's manifest in img, I realized that it was re-creating the layers and manifest, and thus was losing information such as entrypoint, environment variables, and other important parameters. After scouring through its source code and its 500 kLOC of dependencies, I came to the conclusion that it would be easier to start a project from scratch that would use Red Hat's image and storage libraries to download and store the container on the cache partition. I then needed to unpack the layers, generate the container manifest, and start runc. After a couple of days, ~250 lines of code, and tons of staring at straces to get it working, it finally did! Out was img, and the new runtime's size was under 10 MB \o/!

The last missing piece in the puzzle was performance-related: use OverlayFS to merge the layers, rather than unpacking them ourselves.

This is when I decided to have another look at Podman, saw that they have their own internal library for all the major functions, and decided to compile podman to try it out. The binary size was ~50 MB, but after removing some features, setting the -w -s LDFLAGS, and compressing it using upx --best, I got the final size to be ~14 MB! Of course, Podman is more than just one binary, so trying to run a container with it failed. However, after a bit of experimentation and stracing, I realized that running the container with --privileged --network=host would work using crun... provided we force-added the --no-pivot parameter to crun. My happiness was however short-lived, replaced by a MAJOR FACEPALM MOMENT:

After a couple of minutes of constant facepalming, I realized I was also relieved, as Podman is a battle-tested container runtime, and I would not need to maintain a single line of Go! Also, I now knew how deep the rabbit hole was, and we just needed to package everything nicely in an initramfs and we would be good. Success, at last!

Boot2container: Run your containers from an initramfs!

If you have managed to read through the article up to this point, congratulations! For those who just gave up and jumped straight to this section, I forgive you for teleporting yourself to the bottom of the rabbit hole directly! In both cases, you are likely wondering: where is this breeze you were promised in the introduction?

     Boot2container enters the chat.

Boot2container is a lightweight (sub-20 MB) and fast initramfs I developed that will allow you to ignore the subtleties of operating a container runtime and focus on what matters, your test environment!

Here is an example of how to run boot2container, using SYSLINUX:

LABEL root
    MENU LABEL Run docker's hello world container, with caching disabled
    LINUX /vmlinuz-linux
    APPEND b2c.container=docker://hello-world b2c.cache_device=none b2c.ntp_peer=auto
    INITRD /initramfs.linux_amd64.cpio.xz

The hello-world container image will be run in privileged mode, with the host network, which is what you want when running the container for bare metal testing!

Make sure to check out the list of features and options before either generating the initramfs yourself or downloading it from the releases page. Try it out with your kernel, or the example one bundled in the release!

With this project mostly done, we pretty much conclude the work needed to set up the test machines, and the next articles in this series will be focusing on the infrastructure needed to support a fleet of test machines, and expose it to Gitlab/Github/...

That's all for now, thanks for reading that far!

10 Feb 2021 8:11am GMT

Mike Blumenkrantz: New Order



10 Feb 2021 12:00am GMT

09 Feb 2021


Andrés Gómez Garc: Replaying 3D traces with piglit

If you don't know what traces-based rendering regression testing is, read the appendix before continuing.

The Mesa community has witnessed an explosion of the Continuous Integration interest in the last two years.

In addition to checking the proper building of the project, integrating the testing of its functional correctness has become a priority. The user space graphics drivers exhibit a wide variety of types of tests and test suites. One kind of those tests is traces-based rendering regression testing.

The public effort to add this kind of tests into Mesa's CI started with this mail from Alexandros Frantzis.

At some point, we had support for replaying OpenGL, Vulkan and D3D11 traces using apitrace, RenderDoc and GFXReconstruct with the in-tree tool tracie. However, it was a very custom solution tailored to the needs of Mesa, so I proposed to move this codebase and integrate it into the piglit test suite. It was a natural step forward.

This is how replayer was born into piglit.


The first step to test a trace is, actually, obtaining a trace. I won't go into the details about how to create one from scratch. The process is well documented on each of the tools listed above. However, the Mesa community has been collecting publicly distributable traces for a while and placing them in traces-db whose CI is copying them to Freedesktop.org's MinIO instance.

To make things simple, once we have built and installed piglit, if we would like to test an apitrace created OpenGL trace, we can download from there with:

$ replayer.py download \
         --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
         --db-path ./traces-db \
         --force-download \
         glxgears/glxgears-2.trace
The parameters are self explanatory. The downloaded trace will now exist at ./traces-db/glxgears/glxgears-2.trace.

The next step will be to dump an image from the trace. Since it is a .trace file we will need to have apitrace installed in the system. If we do not specify the call(s) from which to dump the image(s), we will just get the last frame of the trace:

$ replayer.py dump ./traces-db/glxgears/glxgears-2.trace

The dumped PNG image will be at ./results/glxgears-2.trace-0000001413.png. Notice, the number suffix is the snapshot id from the trace.

Dumping from a trace may result in a range of different possible images. One example is when the trace makes use of uninitialized values, leading to undefined behaviors.

However, since the original aim was performing pre-merge rendering regression testing in Mesa's CI, the idea is that replaying any of the provided traces should be quick and the dumped image should be consistent. In other words, if we dump the same frame of a trace several times with the same GFX stack, the image will always be the same.

With this precondition, we can test whether two images are the same just by hashing their content. replayer can obtain the hash for the generated dumped image:

$ replayer.py checksum ./results/glxgears-2.trace-0000001413.png 
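Under the hood this is just a file hash comparison; the 32-hex-digit values suggest MD5, though that's my assumption rather than something shown here. Conceptually the check boils down to:

```shell
# Hypothetical re-implementation of the checksum comparison, for
# illustration: hash the rendered image and compare it against the
# reference checksum.
check_image() {  # usage: check_image <image> <expected-checksum>
    actual=$(md5sum "$1" 2>/dev/null | cut -d' ' -f1)
    if [ "$actual" = "$2" ]; then
        echo "Images match"
    else
        echo "Images differ (actual: $actual)"
    fi
}
check_image ./results/glxgears-2.trace-0000001413.png \
            f8eba0fec6e3e0af9cb09844bc73bdc8
```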

Now, if we built a different commit of Mesa, we could check the image generated at this new point against the previously generated reference image. If everything goes well, we will see something like:

$ replayer.py compare trace \
         --download-url https://minio-packet.freedesktop.org/mesa-tracie-public/ \
         --device-name gl-vmware-llvmpipe \
         --db-path ./traces-db \
         --keep-image \
         glxgears/glxgears-2.trace f8eba0fec6e3e0af9cb09844bc73bdc8
[dump_trace_images] Info: Dumping trace ./traces-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./traces-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./blog-traces-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:

PIGLIT: {"images": [{"image_desc": "glxgears/glxgears-2.trace", "image_ref": "f8eba0fec6e3e0af9cb09844bc73bdc8.png", "image_render": "./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413-f8eba0fec6e3e0af9cb09844bc73bdc8.png"}], "result": "pass"}

replayer's compare subcommand is the one spitting a piglit formatted test expectations output.

Putting everything together

We can make the whole process way simpler by passing replayer a YAML tests list file. For example:

$ cat testing-traces.yml
traces-db:
  download-url: https://minio-packet.freedesktop.org/mesa-tracie-public/

traces:
  - path: gputest/triangle.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: c8848dec77ee0c55292417f54c0a1a49
  - path: glxgears/glxgears-2.trace
    expectations:
      - device: gl-vmware-llvmpipe
        checksum: f53ac20e17da91c0359c31f2fa3f401e
$ replayer.py compare yaml \
         --device-name gl-vmware-llvmpipe \
         --yaml-file testing-traces.yml 
[check_image] Downloading file gputest/triangle.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/gputest/triangle.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/gputest/triangle.trace
// process.name = "/home/anholt/GpuTest_Linux_x64_0.7.0/GpuTest"
397 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)

510 glXSwapBuffers(dpy = 0x7f0ad0005a90, drawable = 56623106)

[dump_trace_images] Running: eglretrace --headless --snapshot=510 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace- ./replayer-db/gputest/triangle.trace
Wrote ./results/trace/gl-vmware-llvmpipe/gputest/triangle.trace-0000000510.png

    actual: c8848dec77ee0c55292417f54c0a1a49
  expected: c8848dec77ee0c55292417f54c0a1a49
[check_image] Images match for:

[check_image] Downloading file glxgears/glxgears-2.trace took 5s.
[dump_trace_images] Info: Dumping trace ./replayer-db/glxgears/glxgears-2.trace...
[dump_trace_images] Running: apitrace dump --calls=frame ./replayer-db/glxgears/glxgears-2.trace
// process.name = "/usr/bin/glxgears"
1384 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

1413 glXSwapBuffers(dpy = 0x56060e921f80, drawable = 31457282)

error: drawable failed to resize: expected 1515x843, got 300x300
[dump_trace_images] Running: eglretrace --headless --snapshot=1413 --snapshot-prefix=./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace- ./replayer-db/glxgears/glxgears-2.trace
Wrote ./results/trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace-0000001413.png

    actual: f8eba0fec6e3e0af9cb09844bc73bdc8
  expected: f8eba0fec6e3e0af9cb09844bc73bdc8
[check_image] Images match for:

replayer also features the query subcommand, which is just a helper to read the YAML files with the tests configuration.

Testing the other kinds of supported 3D traces doesn't change much from what's shown here. Just make sure to have the needed tools installed: RenderDoc, GFXReconstruct, the VK_LAYER_LUNARG_screenshot layer, Wine and DXVK. A good reference for building, installing and configuring these tools is Mesa's GL and VK test containers building scripts.

replayer also accepts several configuration options to tweak how it behaves and where to find the actual tracing tools needed for replaying the different types of traces. Make sure to check the replay section in piglit's configuration example file.

replayer's README.md file is also a good read for further information.


replayer is a test runner in a similar fashion to shader_runner or glslparsertest. What we are now missing is how it integrates with piglit, so we can do piglit runs which will produce piglit formatted results.

This is done through the replay test profile.

This profile needs a couple of configuration values. The easiest way is just to set the PIGLIT_REPLAY_DESCRIPTION_FILE and PIGLIT_REPLAY_DEVICE_NAME env variables. They are self explanatory, but make sure to check the documentation for this and other configuration options for this profile.

The following example performs a run similar to the one above, which invoked replayer directly, but with piglit integration, producing formatted results:

$ PIGLIT_REPLAY_DESCRIPTION_FILE=testing-traces.yml PIGLIT_REPLAY_DEVICE_NAME=gl-vmware-llvmpipe piglit run replay -n replay-example replay-results
[2/2] pass: 2   
Thank you for running Piglit!
Results have been written to replay-results

We can create a summary based on the results:

# piglit summary console replay-results/
trace/gl-vmware-llvmpipe/glxgears/glxgears-2.trace: pass
trace/gl-vmware-llvmpipe/gputest/triangle.trace: pass
       name: replay-example
       ----  --------------
       pass:              2
       fail:              0
      crash:              0
       skip:              0
    timeout:              0
       warn:              0
 incomplete:              0
 dmesg-warn:              0
 dmesg-fail:              0
    changes:              0
      fixes:              0
regressions:              0
      total:              2
       time:       00:00:00

Creating an HTML summary may also be interesting, especially when investigating failures!


Thanks a lot to the whole Mesa community for helping with the creation of this tool. Alexandros Frantzis, Rohan Garg and Tomeu Vizoso did a lot of the initial development of the in-tree tracie tool, while Dylan Baker was very patient reviewing my patches for the piglit integration.

Finally, thanks to Igalia for allowing me to work on this.


In 3D computer graphics we say "traces", for short, to name the files generated by 3D API capturing tools, which store not only the calls to the specific 3D API but also the internal state of the 3D program during the capturing process: shaders, textures, buffers, etc.

Being able to "record" the execution of a 3D program is very useful. Usually it allows us to replay the execution without needing the original program from which the trace was generated, it allows in-depth analysis for debugging and performance optimization, it's a very good solution for sharing with other developers, and, in some cases, it lets us check how the replay behaves with different GPUs.

In this post, however, I focus on a specific usage: rendering regression testing.

When doing a regression test, we compare a specific metric obtained by replaying the trace with one version of the GFX software stack against the same metric obtained with a different version of the stack. If the value of the metric changes, we have found a regression (or an improvement!).

To make things simpler, we would like to check changes happening in just one of the many elements of the software stack. The most relevant component is the user space driver. In particular, I care about the Mesa drivers and the GNU/Linux stack.

Mainly, there are two kinds of regression testing we can do with a trace: performance and rendering regression testing. In a performance test, the checked metrics are usually speed or memory usage. In a rendering test, we compare the rendered output at one (or many) points during the trace replay. This output, a bitmap image, is the metric we compare between two different revisions of the Mesa driver. If the images differ, we may have found a regression: artifacts, improper colors, etc. Or an enhancement, if the reference image is the one featuring any of these problems.

09 Feb 2021 8:47am GMT

Mike Blumenkrantz: Milestone

If you're on Intel…

Your zink built from git master now has GL 4.3.

Turns out having actual hardware available when doing feature support is important, so I need to do some fixups there for stencil texturing before you can enjoy things.

09 Feb 2021 12:00am GMT

08 Feb 2021


Roman Gilg: The Windowing Revolution

The beta for the upcoming 5.21 release of the KWinFT projects is now available. It contains a monumental rewrite of KWinFT's windowing logic. Read on for an overview of the changes and why this rewrite was necessary.

A Confused Heart

Let's first define what windowing logic is. In my definition this means all structures and algorithms in code that decide where a window should be stacked, placed or moved, or in which other ways its geometry can be manipulated to allow the user to interact with and organize the totality of all windows.

And if you agree that such windowing logic is of central importance for a window manager, and is what distinguishes it in the end from others, we may call it the heart of KWinFT.

The KWinFT compositor is based on KWin, KDE's official compositor for the Plasma Workspace. KWin was founded over two decades ago. Necessarily some of its code is very old, does not adhere to any modern development principles and sometimes, due to changes in other levels of the graphics stack, it is just plain wrong.

It is kind of unexpected, though, that this has been the case in particular for the windowing logic, the heart of KWinFT. For example, at the HEAD of KWin's current master branch, do a git-blame over the ludicrous code in layers.cpp, which is responsible for all window stacking, and count how many lines are older than a decade.

But old code is not necessarily bad. The reason why this old code is bad is two-fold: for one, under the leadership of the former maintainer, the Wayland support was shoehorned into an already complex code base; and secondly, he followed a strategy of keeping the old code untouched as much as possible. Instead of doing the necessary incremental refactors of the old code, he tried to firewall it with an abundance of tests.

For sure one can find reasons and excuses for picking such a strategy, but ultimately one has to say it failed. This cannot be judged from the outside, of course, but I feel comfortable making this assessment as someone who knows the code in detail, and because I am not the only one who abandoned his strategy.

Who Does the Work Is Not Always Right

In fact I am not the first one to refactor the old windowing logic. The current de facto maintainer of KWin, Vlad Zahorodnii, has done so in the past.

The results of his work were often massive merge requests, and back then, when I was still contributing directly to KWin, I had a feeling this was going in the wrong direction. But I was also working on other upstream projects and was in no position to tell someone who worked exclusively on KWin that his work should not go in as is.

This is actually enforced through an unwritten rule in KDE, which prescribes that "the one who does the work decides". This sounds good at first, but the one who does the work is not always right, and in the case of KWin, Vlad's refactors made the old code even more complicated, more fragile and less coherent.

Simple is Difficult

The problem with Vlad's work on KWin is that he likes to create solutions through the addition of new things. He still does.

I call that the "easy way" to solve a problem in an existing code base: you add new code, written against the problem you want to solve. You ensure the new code does not break any of the old unit tests. For compliance, you add another unit test for your new code.

The big downside of this approach is that the complexity of the code increases every time you do it. And KWin's windowing code has become absurdly complex over the years. As an example take a look at the different types of geometries, which describe the position and size of a window.

In contrast I chose the hard way: I made the code simpler.

This would of course also be kind of easy if I just removed features, but I was able to keep all features of KWinFT's windowing logic while simplifying major internal concepts and algorithms.

There is one exception though: the shading of windows was removed. Sorry to the few people who used it, but it is one of those features not meant for a Wayland world, and whoever implemented it at some point in the ancient history of KWin did so by littering special cases and boolean traps all over the code base in order to get it done.

Battle Plans and Front Lines

After this prelude let me give you an overview of what this revolution actually contains.

Flattening the Hierarchy

To get the revolution started, I drafted at the beginning, as I always do with bigger projects like this, a general plan that I published in an issue ticket.

You can see that my primary focus was to simplify the sprawling hierarchy of different window types, which had grown in number over the years, mostly because of the Wayland changes.

The old windows hierarchy.

My first idea was to flatten the hierarchy through the use of C++ templates and by replacing inheritance with composition. And while not yet fully finished, the current state absolutely reaffirms my decision to follow through with this idea.

The new windows hierarchy.

The classes AbstractClient and XwaylandClient, which represented different kinds of windows, have been removed completely. This simplifies the hierarchy to only two levels.

In the future I want to also get rid of the Toplevel class. My plan for that is to template the Workspace class over its supported window types. This would mean no more dynamic inheritance at all.
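
To illustrate the direction with a hypothetical sketch (made-up names, not KWinFT's actual classes): a Workspace templated over the window types it supports needs no dynamic common base class, and windows can gain optional behavior such as a "control" part through composition instead of inheritance.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical "control" part: composited into a window instead of being
// inherited from an AbstractClient-like base class.
struct control {
    bool active = false;
};

struct x11_window {
    std::string caption;
    std::optional<control> ctrl;  // unmanaged windows simply omit this part
};

struct wayland_window {
    std::string caption;
    std::optional<control> ctrl;
};

// Hypothetical workspace templated over its supported window types.
// Each type gets its own storage; no virtual dispatch is required.
template<typename... Windows>
struct workspace {
    std::tuple<std::vector<Windows>...> stacks;

    template<typename Win>
    void add(Win win)
    {
        std::get<std::vector<Win>>(stacks).push_back(std::move(win));
    }

    template<typename Win>
    std::size_t count() const
    {
        return std::get<std::vector<Win>>(stacks).size();
    }
};
```

A `workspace<x11_window, wayland_window>` can then be instantiated directly, with the compiler generating the per-type storage and accessors.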

Other dependent properties that were previously stuffed into AbstractClient I carefully dissected out of it. For example everything related to Scripting is now contained in a single independent interface.

Clean Code is Comprehensible Code

While moving forward with my initial goals I realized that huge parts of the code were so outdated, so ugly, so rotten, that I could not just refactor the logic, but also had to improve the code styling. Often the internal logic was incomprehensible because of the style.

So this project also became about replacing archaic macros with modern lambdas, reducing code duplication, adding whitespace where it made sense, and so on.
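
As a made-up illustration of this kind of change (not actual KWin code): logic that might once have been hidden behind a preprocessor macro can be written as a local lambda, keeping it named, scoped and type-checked.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical example: collect the captions of all visible windows.
// An old-style macro like FOR_ALL_VISIBLE(list, body) would hide this loop
// from the reader and the compiler; a local lambda keeps it visible.
std::vector<std::string> visible_captions(
    std::vector<std::pair<std::string, bool>> const& windows)
{
    auto is_visible = [](auto const& win) { return win.second; };

    std::vector<std::string> captions;
    for (auto const& win : windows) {
        if (is_visible(win)) {
            captions.push_back(win.first);
        }
    }
    return captions;
}
```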

Overall I improved the readability and reduced clutter. I ensured there is a single coherent style in all refactored files. One of the largest single commits in that endeavor was the overhaul of the X11Client class.

When deciding how to clean up code, I follow modern C++ principles in general. I orient myself towards the Standard Library and the C++ Core Guidelines instead of the outdated Qt library style. This falls in line with my long-term plan to factor out libraries that will be pure C++ and no longer depend on Qt.

The Big Ones: Subsurfaces and Window Geometries

While my focus at the beginning of the windowing refactor was to simplify the hierarchy of windows, that was not the initial motivation for this project. My motivation was to fix a certain issue with Wayland subsurfaces: they were not correctly transformed by effects.

A patch for that had landed in KWin in the middle of last year, but I had a feeling it was once again a half-baked attempt at a solution, leading to more complexity instead of less and not solving the problem in a holistic way. My further analysis of the patch confirmed my initial thoughts, and I decided to look at the problem from a completely different angle.

The solution I came up with I would in fact call revolutionary. In the Merge Request I described it as a "huge mental shift in what we understand under subsurfaces". I reused existing concepts from X11 and Wayland but interpreted them in a new way, which simplified the code and unified the logic across all windows.

As there is much to say about this specific solution, I split out the discussion of it into a follow-up article. Stay tuned.

Note: the first article of that follow-up discussion is now available.

I will also write a separate article about the other big change: a total redesign of how we store and change the geometries of windows.

These geometries had been a pain point for me for a long time already. Any aspiring new contributor to KWin must feel absolutely shell-shocked when trying to understand what all the different geometry types of windows are supposed to mean and how they relate to each other.

As a reminder, these are just the getters for the different kinds of geometries in the abstract Toplevel interface class. And this is one of many ways to change a single one of them. Yes, that's a pure virtual function in a subclass, and yes, the second argument of that setter is a masked boolean trap.
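
To make the term concrete with an invented example (not KWin's actual setter): a "boolean trap" is a bare true/false argument whose meaning is invisible at the call site, and replacing it with an enum makes the intent readable.

```cpp
// Hypothetical window type illustrating a boolean trap and its fix.
enum class size_mode { exact, keep_aspect };

struct window {
    int width = 0;
    int height = 0;

    // Boolean trap version: resize(200, 100, true) - what does "true" mean?
    // With an enum, the call site reads resize(200, 100, size_mode::keep_aspect).
    void resize(int w, int h, size_mode mode)
    {
        if (mode == size_mode::keep_aspect && width > 0) {
            h = w * height / width;  // derive height from the current aspect ratio
        }
        width = w;
        height = h;
    }
};
```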

To finally squash any hope that new contributor might have, show them all the different forms of saving a geometry here, here, here and here. And so far we have only looked at header files.

To simplify all that, eradicate this glaringly unnecessary complexity and make the code actually comprehensible again, I redesigned everything about it from the ground up. This was for sure the most comprehensive and most difficult task. I had to go through several iterations before a final overarching model emerged for handling all geometries of all windows, and for saving and manipulating them via clearly defined structures and processes.

Some explanation for that model can be found in the primary Merge Request of the geometries rework. But as said, like with subsurfaces, I plan to write about the reworked geometries soon in a more detailed follow-up article.

A Blossoming Heart

Why did I call this project the Windowing Revolution? Does it deserve this pathos? The project was massive, that's for sure. In sheer numbers the result is over 50,000 changed lines in over 300 commits. For months I sacrificed all my time for this project, and my health.

But size or sacrifices alone do not make this a revolution. Instead it comes through changes in our way of thinking and how this project will reshape our future: we radically redesigned the heart of our window manager, we broke with outdated beliefs and traditions, we simplified and reworked what had been left to rot for decades.

In the end this paves the road for all future improvements, enables us to build them on solid foundations, on a rebuilt core of what defines KWinFT, the most advanced, most modern windowing compositor in the world.

That is why this revolution was necessary now, and that is why I decided to push every other potential work item until afterwards. We first needed to reshape KWinFT's vibrating, pumping and now finally again blossoming heart before work on anything else made sense, be it features for our Wayland session or bug fixes on X11.

Silence in Between the Storms

The last months felt at times like being in a hurricane. The volume of work was just that large. I have to thank several other contributors to KWinFT who helped me throughout this whole time by testing the constantly changing feature branch of the project. This feedback was invaluable and pushed me forward in creating what will now be served to the general public with the upcoming release of 5.21.

I would like to tell you that the work on KWinFT's heart is complete now, that the windowing code is in a perfect state and there is nothing more to do. But that's not yet the case.

What has now been merged to KWinFT's master branch and will be included in the upcoming release next week is a well-progressed intermediate state. I believe the biggest and most important objectives have been achieved, but there are still some smaller refactors to do.

For example, one of these smaller refactors is representing unmanaged X11 windows with the same x11::window class as managed ones, just without compositing the control interface into them. This will further reduce the complexity and afterwards allow us to consolidate more X11-only functionality in a single place. If you are interested in helping with this small but important task, take a look at its issue ticket.

Besides that, there are lots of small code portions which can now be moved to their respective places in the win namespace in order to further clean up the root directory of the repo. If you want to help with that, pick one from the list I created.

The Next Revolution

While there is still some smaller work to do for this Windowing Revolution, I want to start the next one right away by setting a new focus for the upcoming release cycle.

This upcoming revolution is about a refactor of our render code. And while we called the windowing logic the heart of a window manager, we may call the render code its guts.

I will write more about this project in the future, but for now let me say that some of the most anticipated features on Wayland will be part of it. If you already want to know more about it, take a look at the overview ticket.

Join the Cause

If you feel inspired, of course you are invited to take part in this next revolution. And the same holds, if you want to help with the remaining tasks of the last one, the windowing refactor.

Test the current code and give feedback. Or if you want to start contributing code, pick one of the tasks from our GitLab issues list.

And join us in our Gitter community for a friendly chat.

08 Feb 2021 8:00pm GMT