24 Apr 2018


Christian Schaller: Warming up for Fedora Workstation 28

Been some time now since my last update on what is happening in Fedora Workstation and with current plans to release Fedora Workstation 28 in early May I thought this could be a good time to write something. As usual this is just a small subset of what the team has been doing and I always end up feeling a bit bad for not talking about the avalanche of general fixes and improvements the team adds to each release.

Christian Kellner has done a tremendous job keeping everyone informed of his work making sure we have proper Thunderbolt support in Fedora Workstation 28. One important aspect for us of this improved Thunderbolt support is that a lot of docking stations coming out will be requiring it and thus without this work being done you would not be able to use a wide range of docking stations. For a lot of screenshots and more details about how the thunderbolt support is done I recommend reading this article in Christians Blog.

3rd party applications
It has taken us quite some time to get there as getting this feature right both included a lot of internal discussion about policies around it and implementation detail. But starting from Fedora Workstation 28 you will be able to find more 3rd party software listed in GNOME Software if you enable it. The way it will work is that you as part of the initial setup will be asked if you want to have 3rd party software show up in GNOME Software. If you are upgrading you will be asked inside GNOME Software if you want to enable 3rd party software. You can also disable 3rd party software after enabling it from the GNOME Software settings as seen below:

GNOME Software settings

GNOME Software settings

In Fedora Workstation 27 we did have PyCharm available, but we have now added the NVidia driver and Steam to the list for Fedora Workstation 28.

We have also been working with Google to try to get Chrome included here and we are almost there as they merged for instance the needed Appstream metadata some time ago, but the last steps requires some tweaking of how Google generates their package repository (basically adding the appstream metadata to their yum repository) and we don't have a clear timeline for when that will happen, but as soon as it does the Chrome will also appear in GNOME Software if you have 3rd party software enabled.

As we speak all 3rd party packages are RPMs, but we expect that going forward we will be adding applications packaged as Flatpaks too.

Finally if you want to propose 3rd party applications for inclusion you can find some instructions for how to do it here.

Virtualbox guest
Another major feature that got some attention that we worked on for this release was Hans de Goedes work to ensure Fedora Workstation could run as a virtualbox guest out of the box. We know there are many people who have their first experience with linux running it under Virtualbox on Windows or MacOSX and we wanted to make their first experience as good as possible. Hans worked with the virtualbox team to clean up their kernel drivers and agree on a stable ABI so that they could be merged into the kernel and maintained there from now on.

Firmware updates
The Spectre/Meltdown situation did hammer home to a lot of people the need to have firmware updates easily available and easy to update. We created the Linux Vendor Firmware service for Fedora Workstation users with that in mind and it was great to see the service paying off for many Linux users, not only on Fedora, but also on other distributions who started using the service we provided. I would like to call out to Dell who was a critical partner for the Linux Vendor Firmware effort from day 1 and thus their users got the most benefit from it when Spectre and Meltdown hit. Spectre and Meltdown also helped get a lot of other vendors off the fence or to accelerate their efforts to support LVFS and Richard Hughes and Peter Jones have been working closely with a lot of new vendors during this cycle to get support for their hardware and devices into LVFS. In fact Peter even flew down to the offices one of the biggest laptop vendors recently to help them resolve the last issues before their hardware will start showing up in the firmware service. Thanks to the work of Richard Hughes and Peter Jones you will both see a wider range of brands supported in the Linux Vendor Firmware Service in Fedora Workstation 28, but also a wider range of device classes.

Server side GL Vendor Neutral Dispatch
This is a bit of a technical detail, but Adam Jackson and Lyude Paul has been working hard this cycle on getting what we call Server side GLVND ready for Fedora Workstation 28. Currently we are looking at enabling it either as a zero-day update or short afterwards. so what is Server Side GLVND you say? Well it is basically the last missing piece we need to enable the use of the NVidia binary driver through XWayland. Currently the NVidia driver works with Wayland native OpenGL applications, but if you are trying to run an OpenGL application requiring X we need this to support it. And to be clear once we ship this in Fedora Workstation 28 it will also require a driver update from NVidia to use it, so us shipping it is just step 1 here. We do also expect there to be some need for further tuning once we got all the pieces released to get top notch performance. Of course over time we hope and expect all applications to become Wayland native, but this is a crucial transition technology for many of our users. Of course if you are using Intel or AMD graphics with the Mesa drivers things already work great and this change will not affect you in any way.

Flatpaks basically already work, but we have kept focus this time around on to fleshing out the story in terms of the so called Portals. Portals are essentially how applications are meant to be able to interact with things outside of the container on your desktop. Jan Grulich has put in a lot of great effort making sure we get portal support for Qt and KDE applications, most recently by adding support for the screen capture portal on top of Pipewire. You can ready more about that on Jan Grulichs blog. He is now focusing on getting the printing portal working with Qt.

Wim Taymans has also kept going full steam ahead of PipeWire, which is critical for us to enable applications dealing with cameras and similar on your system to be containerized. More details on that in my previous blog entry talking specifically about Pipewire.

It is also worth noting that we are working with Canonical engineers to ensure Portals also works with Snappy as we want to ensure that developers have a single set of APIs to target in order to allow their applications to be sandboxed on Linux. Alexander Larsson has already reviewed quite a bit of code from the Snappy developers to that effect.

Performance work
Our engineers have spent significant time looking at various performance and memory improvements since the last release. The main credit for the recently talked about 'memory leak' goes to Georges Basile Stavracas Neto from Endless, but many from our engineering team helped with diagnosing that and also fixed many other smaller issues along the way. More details about the 'memory leak' fix in Georges blog.

We are not done here though and Alberto Ruiz is organizing a big performance focused hackfest in Cambridge, England in May. We hope to bring together many of our core engineers to work with other members of the community to look at possible improvements. The Raspberry Pi will be the main target, but of course most improvements we do to make GNOME Shell run better on a Raspberry Pi also means improvements for normal x86 systems too.

Laptop Battery life
In our efforts to make Linux even better on laptops Hans De Goede spent a lot of time figuring out things we could do to make Fedora Workstation 28 have better battery life. How valuable these changes are will of course depend on your exact hardware, but I expect more or less everyone to have a bit better battery life on Fedora Workstation 28 and for some it could be a lot better battery life. You can read a bit more about these changes in Hans de Goedes blog.

24 Apr 2018 5:15pm GMT

23 Apr 2018


Eric Anholt: 2018-04-23

For VC5, I renamed the kernel driver to "v3d" and submitted it to the kernel. Daniel Vetter came back right away with a bunch of useful feedback, and next week I'm resolving that feedback and continuing to work on the GMP support.

On the vc4 front, I did the investigation of the HDL to determine that the OLED matrix applies before the gamma tables, so we can expose it in the DRM for Android's color correction. Stefan was also interested in reworking his fencing patches to use syncobjs, so hopefully we can merge those and get DRM HWC support in mainline soon. I also pushed Gustavo's patch for using the new core DRM infrastructure for async cursor updates. This doesn't simplify our code much yet, but Boris has a series he's working on that gets rid of a lot of custom vc4 display code by switching more code over to the new async support.

Unfortunately, the vc4 subsystem node removal patch from last week caused the DRM's platform device to not be on the SOC's bus. This caused bus address translations to be subtly wrong and broke caching (so eventually the GPU would hang). I've shelved the patches for now.

I also rebased my user QPU submission code for the Raspberry Pi folks. They keep expressing interest in it, but we'll see if it goes anywhere this time around. Unfortunately I don't see any way to expose this for general distributions: vc4 isn't capable enough for OpenCL or GL compute shaders, and custom user QPU submissions would break the security model (just like GL shaders would have without my shader validator, and I think validating user QPU submissions would be even harder than GL shaders).

23 Apr 2018 12:30am GMT

Daniel Vetter: Linux Kernel Maintainer Statistics

As part of preparing my last two talks at LCA on the kernel community, "Burning Down the Castle" and "Maintainers Don't Scale", I have looked into how the Kernel's maintainer structure can be measured. One very interesting approach is looking at the pull request flows, for example done in the LWN article "How 4.4's patches got to the mainline". Note that in the linux kernel process, pull requests are only used to submit development from entire subsystems, not individual contributions. What I'm trying to work out here isn't so much the overall patch flow, but focusing on how maintainers work, and how that's different in different subsystems.


In my presentations I claimed that the kernel community is suffering from too steep hierarchies. And worse, the people in power don't bother to apply the same rules to themselves as anyone else, especially around purported quality enforcement tools like code reviews.

For our purposes a contributor is someone who submits a patch to a mailing list, but needs a maintainer to apply it for them, to get the patch merged. A maintainer on the other hand can directly apply a patch to a subsystem tree, and will then send pull requests up the maintainer hierarchy until the patch lands in Linus' tree. This is relatively easy to measure accurately in git: If the recorded patch author and committer match, it's a maintainer self-commit, if they don't match it's a contributor commit.

There's a few annoying special cases to handle:

Also note that this is a property of each commit - the same person can be both a maintainer and a contributor, depending upon how each of their patches gets merged.

The ratio of maintainer self-commits compared to overall commits then gives us a crude, but fairly useful metric to measure how steep the kernel community overall is organized.

Measuring review is much harder. For contributor commits review is not recorded consistently. Many maintainers forgo adding an explicit Reviewed-by tag since they're adding their own Signed-off-by tag anyway. And since that's required for all contributor commits, it's impossible to tell whether a patch has seen formal review before merging. A reasonable assumption though is that maintainers actually look at stuff before applying. For a minimal definition of review, "a second person looked at the patch before merging and deemed the patch a good idea" we can assume that merged contributor patches have a review ratio of 100%. Whether that's a full formal review or not can unfortunately not be measured with the available data.

A different story is maintainer self-commits - if there is no tag indicating review by someone else, then either it didn't happen, or the maintainer felt it's not important enough work to justify the minimal effort to record it. Either way, a patch where the git author and committer match, and which sports no review tags in the commit message, strongly suggests it has indeed seen none.

An objection would be that these patches get reviewed by the next maintainer up, when the pull request gets merged. But there's well over a thousand such patches each kernel release, and most of the pull requests containing them go directly to Linus in the 2 week long merge window, when the over 10k feature patches of each kernel release land in the mainline branch. It is unrealistic to assume that Linus carefully reviews hundreds of patches himself in just those 2 weeks, while getting hammered by pull requests all around. Similar considerations apply at a subsystem level.

For counting reviews I looked at anything that indicates some kind of patch review, even very informal ones, to stay consistent with the implied oversight the maintainer's Signed-off-by line provides for merged contributor patches. I therefore included both Reviewed-by and Acked-by tags, including a plethora of misspelled and combined versions of the same.

The scripts also keep track of how pull requests percolate up the hierarchy, which allows filtering on a per-subsystem level. Commits in topic branches are accounted to the subsystem that first lands in Linus' tree. That's fairly arbitrary, but simplest to implement.

Last few years of GPU subsystem history

Since I've pitched the GPU subsystem against the kernel at large in my recent talks, let's first look at what things look like in graphics:

GPU maintainer commit statistics Fig. 1 GPU total commits, maintainer self-commits and reviewed maintainer self-commits GPU relative maintainer commit statistics Fig. 2 GPU percentage maintainer self-commits and reviewed maintainer self-commits

In absolute numbers it's clear that graphics has grown tremendously over the past few years. Much faster than the kernel at large. Depending upon the metric you pick, the GPU subsystem has grown from being 3% of the kernel to about 10% and now trading spots for 2nd largest subsystem with arm-soc and staging (depending who's got a big pull for that release).

Maintainer commits keep up with GPU subsystem growth

The relative numbers have a different story. First, commit rights and the fairly big roll out of group maintainership we've done in the past 2 years aren't extreme by historical graphics subsystem standards. We've always had around 30-40% maintainer self-commits. There's a bit of a downward trend in the years leading towards v4.4, due to the massive growth of the i915 driver, and our failure to add more maintainers and committers for a few releases. Adding lots more committers and creating bigger maintainer groups from v4.5 on forward, first for the i915 driver, then to cope with the influx of new small drivers, brought us back to the historical trend line.

There's another dip happening in the last few kernels, due to AMD bringing in a big new team of contributors to upstream. v4.15 was even more pronounced, in that release the entirely rewritten DC display driver for AMD GPUs landed. The AMD team is already using a committer model for their staging and internal trees, but not (yet) committing directly to their upstream branch. There's a few process holdups, mostly around the CI flow, that need to be fixed first. As soon as that's done I expect this recent dip will again be over.

In short, even when facing big growth like the GPU subsystem has, it's very much doable to keep training new maintainers to keep up with the increased demand.

Review of maintainer self-commits established in the GPU subsystem

Looking at relative changes in how consistently maintainer self-commits are reviewed, there's a clear growth from mostly no review to 80+% of all maintainer self-commits having seen some formal oversight. We didn't just keep up with the growth, but scaled faster and managed to make review a standard practice. Most of the drivers, and all the core code, are now consistently reviewed. Even for tiny drivers with small to single person teams we've managed to pull this off, through combining them into larger teams run with a group maintainership model.

Last few years of kernel w/o GPU history

kernel w/o GPU maintainer commit statistics Fig. 3 kernel w/o GPU maintainer self-commits and reviewed maintainer self-commits kernel w/o GPU relative maintainer commit statistics Fig. 4 kernel w/o GPU percentage maintainer self-commits and reviewed maintainer self-commits

Kernel w/o graphics is an entirely different story. Overall, review is much less a thing that happens, with only about 30% of all maintainer self-commits having any indication of oversight. The low ratio of maintainer self-commits is why I removed the total commit number from the absolute graph - it would have dwarfed the much more interesting data on self-commits and reviewed self-commits. The positive thing is that there's at least a consistent, if very small upward trend in maintainer self-commit reviews, both in absolute and relative numbers. But it's very slow, and will likely take decades until there's no longer a double standard on review between contributors and maintainers.

Maintainers are not keeping up with the kernel growth overall

Much more worrying is the trend on maintainer self-commits. Both in absolute, and much more in relative numbers, there's a clear downward trend, going from around 25% to below 15%. This indicates that the kernel community fails to mentor and train new maintainers at a pace sufficient to keep up with growth. Current maintainers are ever more overloaded, leaving ever less time for them to write patches of their own and get them merged.

Naively extrapolating the relative trend predicts that around the year 2025 large numbers of kernel maintainers will do nothing else than be the bottleneck, preventing everyone else from getting their work merged and not contributing anything of their own. The kernel community imploding under its own bureaucratic weight being the likely outcome of that.

This is a huge contrast to the "everything is getting better, bigger, and the kernel community is very healthy" fanfare touted at keynotes and the yearly kernel report. In my opinion, the kernel community is very much not looking like it is coping with its growth well and an overall healthy community. Even when ignoring all the issues around conduct that I've raised.

It is also a huge contrast to what we've experienced in the GPU subsystem since aggressively rolling out group maintainership starting with the v4.5 release; by spreading the bureaucratic side of applying patches over many more people, maintainers have much more time to create their own patches and get them merged. More crucially, experienced maintainers can focus their limited review bandwidth on the big architectural design questions since they won't get bogged down in the minutiae of every single simple patch.

4.16 by subsystem

Let's zoom into how this all looks at a subsystem level, looking at just the recently released 4.16 kernel.

Most subsystems have unsustainable maintainer ratios

Trying to come up with a reasonable list of subsystems that have high maintainer commit ratios is tricky; some rather substantial pull requests are essentially just maintainers submitting their own work, giving them an easy 100% score. But of course that's just an outlier in the larger scope of the kernel overall having a maintainer self-commit ratio of just 15%. To get a more interesting list of subsystems we need to look at only those with a group of regular contributors and more than just 1 maintainer. A fairly arbitrary cut-off of 200 commits or more in total seems to get us there, yielding the following top ten list:

subsystem total commits maintainer self-commits maintainer ratio
GPU 1683 614 36%
KVM 257 91 35%
arm-soc 885 259 29%
linux-media 422 111 26%
tip (x86, core, …) 792 125 16%
linux-pm 201 31 15%
staging 650 61 9%
linux-block 249 20 8%
sound 351 26 7%
powerpc 235 16 7%

In short there's very few places where it's easier to become a maintainer than in the already rather low, roughly 15%, the kernel scores overall. Outside of these few subsystems, the only realistic way is to create a new subsystem, somehow get it merged, and become its maintainer. In most subsystems being a maintainer is an elite status, and the historical trends suggest it will only become more so. If this trend isn't reversed, then maintainer overload will get a lot worse in the coming years.

Of course subsystem maintainers are expected to spend more time reviewing and managing other people's contribution. When looking at individual maintainers it would be natural to expect a slow decline in their own contributions in patch form, and hence a decline in self-commits. But below them a new set of maintainers should grow and receive mentoring, and those more junior maintainers would focus more on their own work. That sustainable maintainer pipeline seems to not be present in many kernel subsystems, drawing a bleak future for them.

Much more interesting is the review statistics, split up by subsystem. Again we need a cut-off for noise and outliers. The big outliers here are all the pull requests and trees that have seen zero review, not even any Acked-by tags. As long as we only look at positive examples we don't need to worry about those. A rather low cut-off of at least 10 maintainer self-commits takes care of other random noise:

subsystem total commits maintainer self-commits maintainer review ratio
f2fs 72 12 100%
XFS 105 78 100%
arm64 166 23 91%
GPU 1683 614 83%
linux-mtd 99 12 75%
KVM 257 91 74%
linux-pm 201 31 71%
pci 145 37 65%
remoteproc 19 14 64%
clk 139 14 64%
dma-mapping 63 60 60%

Yes, XFS and f2fs have their shit together. More interesting is how wide the spread in the filesystem code is; there's a bunch of substantial fs pulls with a review ratio of flat out zero. Not even a single Acked-by. XFS on the other hand insists on full formal review of everything - I spot checked the history a bit. f2fs is a bit of an outlier with 4.16, barely getting above the cut-off. Usually it has fewer patches and would have been excluded.

Everyone not in the top ten taken together has a review ratio of 27%.

Review double standards in many big subsystems

Looking at the big subsystems with multiple maintainers and huge groups of contributors - I picked 500 patches as the cut-off - there's some really low review ratios: Staging has 7%, networking 9% and tip scores 10%. Only arm-soc is close to the top ten, with 50%, at the 14th position.

Staging having no standard is kinda the point, but the other core subsystems eschewing review is rather worrisome. More than 9 out of 10 maintainer self-commits merged into these core subsystem do not carry any indication that anyone else ever looked at the patch and deemed it a good idea. The only other subsystem with more than 500 commits is the GPU subsystem, at 4th position with a 83% review ratio.

Compared to maintainers overall the review situation is looking a lot less bleak. There's a sizeable group of subsystems who at least try to make this work, by having similar review criteria for maintainer self-commits than normal contributors. This is also supported by the rather slow, but steady overall increase of reviews when looking at historical trend.

But there's clearly other subsystems where review only seems to be a gauntlet inflicted on normal contributors, entirely optional for maintainers themselves. Contributors cannot avoid review, because they can't commit their own patches. When maintainers outright ignore review for most of their patches this creates a clear double standard between maintainers and mere contributors.

One year ago I wrote "Review, not Rocket Science" on how to roll out review in your subsystem. Looking at this data here I can close with an even shorter version:

What would Dave Chinner do?

Thanks a lot to Daniel Stone, Dave Chinner, Eric Anholt, Geoffrey Huntley, Luce Carter and Sean Paul for reading and commenting on drafts of this article.

23 Apr 2018 12:00am GMT

22 Apr 2018


Robert Foss: Running Androind on the Mainline Graphics Stack @ FossNorth


If you're curious about the slides, you can download the PDF or the OTP.


This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of FossNorth, specifically @e8johan for hosting a great event.

22 Apr 2018 10:00pm GMT

Robert Foss: Running Android on the Mainline Graphics Stack @ FossNorth

Intro slide


If you're curious about the slides, you can download the PDF or the OTP.


This post has been a part of work undertaken by my employer Collabora.

I would like to thank the wonderful organizers of FossNorth, specifically @e8johan for hosting a great event.

22 Apr 2018 10:00pm GMT

Alyssa Rosenzweig: A Moving Mesa Midgard Cube

Mmm, a Moving Mesa Midgard Cube

Mmm, a Moving Mesa Midgard Cube

In the last Panfrost status update, a transitory "half-way" driver was presented, with the purpose of easing the transition from a standalone library abstracting the hardware to a full-fledged OpenGL ES driver using the Mesa and Gallium3D infrastructure.

Since then, I've completed the transition, creating such a driver, but retaining support for out-of-tree testing.

Almost everything that was exposed with the custom half-way interface is now available through Gallium3D. Attributes, varyings, and uniforms all work. A bit of rasterisation state is supported. Multiframe programs work, as do programs with multiple non-indexed, direct draws per frame.

The result? The GLES test-cube demo from Freedreno runs using the Mali T760 GPU present in my RK3288 laptop, going through the Mesa/Gallium3D stack. Of course, there's no need to rely on the vendor's proprietary compilers for shaders - the demo is using shaders from the free, NIR-based Midgard compiler.

Look ma, no blobs!

In the past three weeks since the previous update, all aspects of the project have seen fervent progress, culminating in the above demo. The change list for the core Gallium driver is lengthy but largely routine: abstracting features about the hardware which were already understood and integrating it with Gallium, resolving bugs which are discovered in the process, and repeating until the next GLES test passes normally. Enthusiastic readers can read the code of the driver core on GitLab.

Although numerous bugs were solved in this process, one in particular is worthy of mention: the "tile flicker bug", notorious to lurkers of our Freenode IRC channel, #panfrost. Present since the first render, this bug resulted in non-deterministic rendering glitches, where particular tiles would display the background colour in lieu of the render itself. The non-deterministic nature had long suggested it was either the result of improper memory management or a race condition, but the precise cause was unknown. Finally, the cause was narrowed down to a race condition between the vertex/tiler jobs responsible for draws, and the fragment job responsible for screen painting. With this cause in mind, a simple fix squashed the bug, hopefully for good; renders are now deterministic and correct. Huge thanks to Rob Clark for letting me use him as a sounding board to solve this.

In terms of decoding the command stream, some miscellaneous GL state has been determined, like some details about tiler memory management, texture descriptors, and shader linkage (attribute and varying metadata). By far, however, the most significant discovery was the operation of blending on Midgard. It's… well, unique. If I had known how nuanced the encoding was - and how much code it takes to generate from Gallium blend state - I would have postponed decoding like originally planned.

In any event, blending is now understood. Under Midgard, there are two paths in the hardware for blending: the fixed-function fast path, and the programmable slow path, using "blend shaders". This distinction has been discussed sparsely in Mali documentation, but the conditions for the fast path were not known until now. Without further ado, the fixed-function blending hardware works when:

If these conditions are not met, a blend shader is used instead, incurring a presently unknown performance hit.

By dominant and non-dominant modes, I'm essentially referring to the more complex and less complex blend functions respectively, comparing between the functions for the source and the destination. The exact details of the encoding are a little hairy beyond the scope of this post but are included in the corresponding Panfrost headers and the corresponding code in the driver.

In any event, this separation between fixed-function and programmable blending is now more or less understood. Additionally, blend shaders themselves are now intelligible with Connor Abbott's Midgard disassembler; blend shaders are just normal Midgard shaders, with an identical ISA to vertex and fragment shaders, and will eventually be generated with the existing NIR compiler. With luck, we should be able to reuse code from the NIR compiler for the vc4, an embedded GPU lacking fixed-function hardware for any blending whatsoever. Additionally, blend shaders open up some interesting possibilities; we may be able to enable developers to write blend shaders themselves in GLSL through a vendored GL extension. More practically, blend shaders should enable implementation of all blend modes, as this is ES 3.2 class hardware, as well as presumably logic operations.

Command-stream work aside, the Midgard compiler also saw some miscellaneous improvements. In particular, the mystery surrounding varyings in vertex shaders has finally been cracked. Recall that gl_Position stores are accomplished by writing the screen-space coordinate to the special register r27, and then including a st_vary instruction with the mysterious input register r1 to the appropriate address. At the time, I had (erroneously) assumed that the r27 store was responsible for the write, and the subsequent instruction was a peculiar errata workaround.

New findings shows it is quite the opposite: it is the store instruction that does the store, but it uses the value of r27, not r1 for its input. What does the r1 signify, then? It turns out that two different registers can be used for varying writes, r26 and r27. The register in the store instruction selects between these registers: a value of zero uses r26 whereas a value of one uses r27. Why, then, are there two varying source registers? Midgard is a VLIW architecture, in this case meaning that it can execute two store instructions simultaneously for improved performance. To achieve this parallelism, it needs two source registers, to be able to write two different values to the two varyings.

This new understanding clarifies some previous peculiar disassemblies, as the purpose of writes to r26 are now understood. This discovery would have been easier had r26 not also represented a reference to an embedded constant!

More importantly, it enables us to implement varying stores in the vertex shader, allowing for smoothed demos, like the shading on test-cube, to work. As a bonus, it cleans up the code relating to gl_Position writes, as we now know they can use the same compiler code path as writes to normal varyings.

Besides varyings, the Midgard compiler also saw various improvements, notably including a basic register allocator, crucial for compiling even slightly nontrivial shaders, such as that of the cube.

Beyond Midgard, my personal focus, Bifrost has continued to seen sustained progress. Connor Abbott has continued decoding the new shader ISA, uncovering and adding disassembler support for a few miscellaneous new instructions and in particular branching. Branching under Bifrost is somewhat involved - the relevant disassembler commit added over two hundred lines of code - with semantics differing noticeably from Midgard. He has also begun work porting the panwrap infrastructure for capturing, decoding, and replaying command streams from Midgard to Bifrost, to pave the way for a full port of the driver to Bifrost down the line.

While Connor continues work on his disassembler, Lyude Paul has been working on a Bifrost assembler compatible with the disassembler's output, a milestone necessary to demonstrate understanding of the instruction set and a useful prerequisite to writing a Bifrost compiler.

Going forward, I plan on cleaning up technical debt accumulated in the driver to improve maintainability, flexibility, and perhaps performance. Additionally, it is perhaps finally time to address the elephant in the command stream room: textures. Prior to this post, there were two major bugs in the driver: the missing tile bug and the texture reading bug. Seeing as the former was finally solved with a bit of persistence, there's hope for the latter as well.

May the pans frost on.

22 Apr 2018 7:00am GMT

21 Apr 2018


Roman Gilg: Progress on Plasma Wayland for 5.13

In February after Plasma 5.12 was released we held a meeting on how we want to improve Wayland support in Plasma 5.13. Since its beta is now less than one month away it is time for a status report on what has been achieved and what we still plan to work on.

Also today started a week-long Plasma Sprint in Berlin, what will hopefully accelerate the Wayland work for 5.13. So in order to kick-start the sprint this is a good opportunity to sum up where we stand now.


Let us start with a small change, but with huge implications: the decision to not set the environment variable QT_QPA_PLATFORM to wayland anymore in Plasma's startup script.

Qt based applications use this environment variable to determine the platform plugin they should load. The environment variable was set to wayland in Plasma's Wayland session in order to tell Qt based applications that they should act like Wayland native clients. Otherwise they load the default plugin, which is xcb and means that they try to be X clients in a Wayland session.

This also works, thanks to Xwayland, but of course in a Wayland session we want as many applications to be Wayland native clients as possible. That was probably the rationale behind setting the environment variable in the first place. The problem is though, that this is not always possible. While KDE applications are compiled with the Qt Wayland platform plugin, some third-party Qt applications were not. A prominent example is the Telegram desktop client, which would just give up on launch in a Wayland session because of that.

With the change this is no longer a problem. Not being forced in its QT_QPA_PLATFORM environment variable to some unavailable plugin the Telegram binary will just execute using the xcb plugin and therefore run as Xwayland client in our Wayland session.

One drawback is that this now applies to all Qt based applications. While the Plasma processes were adjusted to now be able to select the Wayland plugin themselves based on session information other applications might not although the wayland plugin might be availale and then still run as Xwayland clients. But this problem might go away with Qt 5.11, which is supposed to either change the behavior of QT_QPA_PLATFORM itself or feature a new environment variable such that an application can express preferences for plugins and fallback if to the first supported one by the session.

Martin Flöser, who wrote most of the patches for this change, talked about it and the consequences in his blog as well.


A huge topic on Desktop Wayland was screen recording and sharing. In the past application developers had a single point of entry to write for in order to receive screencasts: the XServer. In Wayland the compositor as Wayland server has replaced the XServer and so an application would need to talk to the compositor if it wants access to screen content.

This rightfully raised the fear that now developers of screencast apps would need to write for every other Wayland compositor a different backend to receive video data. As a spoiler: luckily this won't be necessary.

So how did we achieve this? First of all support for screencasts had to be added to KWin and KWayland. This was done by Oleg Chernovskiy. While this is still a KWayland specific interface the trick was now to proxy via xdg-desktop-portal and PipeWire. Jan Grulich jumped in and implemented the necessary backend code on the xdg-desktop-portal side.

A screencast app therefore in the future only needs to talk with xdg-desktop-portal and receive video data through PipeWire on Plasma Wayland. Other compositors then will have to add a similar backend to xdg-desktop-portal as it was done by Jan, but the screencast app stays the same.

Configure your mouse

I wrote a system settings module (KCM) for touchpad configuration on Wayland last year. The touchpad KCM had higher priority than the Mouse KCM back then because there was no way to configure anything about a touchpad on Wayland, while there was a small hack in KWin to at least control the mouse speed.

Still this was no long term solution in regards to the Mouse KCM, and so I wrote a libinput based Wayland Mouse KCM similar to the one I wrote for touchpads.

Wayland Mouse KCM

I went one step further and made the Mouse KCM interact with Libinput on X as well. There was some work on this in the Mouse KCM done in the past, but now it features a fitting Ui like on Wayland and uses the same backend abstraction.

Dmabuf-based Wayland buffers

Fredrik Höglund uploaded patches for review to add support for dmabuf-based Wayland buffer sharing. This is a somewhat technical topic and will not directly influence the user experience in 5.13. But it is to see in the context of bigger changes upstream in Wayland, X and Mesa. The keyword here is buffer modifiers. You can read more about them in this article by Daniel Stone.

Per output color correction

Adjusting the colors and overall gamma of displays individually is a feature, which is quite important to some people and is provided in a Plasma X session via KGamma in a somewhat simplistic fashion.

Since I wrote Night Color as a replacement for Redshift in our Wayland session not long ago I was already somewhat involved in the color correction game.

But this game is becoming increasingly more complex: my current solution for per output color correction includes changes to KWayland, KWin, libkscreen, libcolorcorrect and adds a KCM replacing KGamma on Wayland to let the user control it.

Additionally there are different opinions on how this should work in general and some explanations by upstream more confused me than guided me to the one best solution. I will most likely ignore these opinions for the moment and concentrate on the one solution I have at the moment, which might already be sufficient for most people. I believe it will be actually quite nice to use, for example I plan to provide a color curve widget borrowed from Krita to set the color curves via some control points and curve interpolation.

More on 5.13 and beyond

In the context of per output color correction another topic, which I am working on right now, is abstracting our output classes in KWin's Drm and Virtual backends to the compositing level. This will first enable my color correction code to be nicely integrated and I anticipate in the long term will be even necessary for two other far more important topics: layered rendering and compositing per output, what will improve performance and allow different refresh rates on multi-monitor setups. But these two tasks will need much more time.

Scaling on Wayland can be done per output and while I am no expert on this topic from what I heard because of that and for other reasons scaling should work much better on Wayland than on X. But there is currently one huge drawback in our Wayland session: we can only scale integer factors. To change this David Edmundson has posted patches for review adding support for xdg-output to KWayland and to KWin. This is one step in allowing fractional scaling on Wayland. There is more to do according to Davd and since he takes part in the sprint I hope we can talk about scaling on Wayland extensively in order for me to better understand the current mechanism and what all needs to be changed in order to provide fractional scaling.

At last there is cursor locking, which is in theory supported by KWin, but in practice does not work well in the games I tried it with. I hope to start work on this topic before 5.13, but I will most likely not finish it for 5.13.

So overall there is lots of progress, but still quite some work to do. In this regard I am certain the Plasma Sprint this week will be fruitful. We can discuss problems, exchange knowledge and simply code in unity (no pun intended). If you have questions or feedback that you want us to address at this sprint, feel free to comment this article.

21 Apr 2018 11:30pm GMT

20 Apr 2018


Alan Coopersmith: Solaris 10 Extended Support Patches & Patchsets Released!

On Tuesday April 17 we released the first batch of Solaris 10 patches & patchsets under Solaris 10 Extended Support. There were a total of 24 Solaris 10 patches, including kernel updates, and 4 patchsets released on MOS!

Solaris 10 Extended Support will run thru January 2021. Scott Lynn put together a very informative Blog on Solaris 10 Extended Support detailing the benefits that customers can get by purchasing Extended Support for Solaris 10 - see https://blogs.oracle.com/solaris/oracle-solaris-10-support-explained.

Those of you that have taken advantage of our previous Extended Support offerings for Solaris 8 and Solaris 9 will notice that we've changed things around a little with Solaris 10 Extended Support; previously we did not publish any updates to the Solaris 10 Recommended Patchsets during the Extended Support period. This meant that the Recommended Patchsets remained available to all customers with Premier Operating Systems support, as all the patches the patchsets contained had Operating Systems entitlement requirements.

Moving forward with Solaris 10 Extended Support, the decision has been made to continue to update the Recommended Patchsets thru the Solaris 10 Extended Support period. This means customers that purchase Solaris 10 Extended Support get the benefit of continued Recommended Patchset updates, as patches that meet the criteria for inclusion in the patchsets are released. During the Solaris 10 Extended Support period, the updates to the Recommended Patchsets will contain patches that require a Solaris 10 Extended Support contract, so the Solaris 10 Recommended Patchsets will also require a Solaris 10 Extended Support contract during this period.

For customers that do not wish to avail of Extended Support and would like to access the last Recommended Patchsets created prior to the beginning of Extended Support for Solaris 10, the January 2018 Critical Patch Updates (CPUs) for Solaris 10 will remain available to those with Premier Operating System Support.

The CPU Patchsets are rebranded versions of the Recommended Patchset on the CPU dates; the patches included in the CPUs are identical to the Recommended Patchset released on those CPU dates, but the CPU READMEs will be updated to reflect their use as CPU resources. CPU patchsets are archived and are always available via MOS at later dates so that customers can easily align to their desired CPU baseline at any time. A further benefit that only Solaris 10 Extended Support customers will receive is access to newly created CPU Patchsets for Solaris 10 thru the Extended Support period.

The following table provides a quick reference to the recent Solaris 10 patchsets that have been released, including details of the support contract required to access them:

Patchset Name Patchset Details README Download Support Contract Required Recommended OS Patchset for Solaris 10 SPARC Patchset Details README Download Extended Support Recommended OS Patchset for Solaris 10 x86 Patchset Details README Download Extended Support CPU OS Patchset 2018/04 Solaris 10 SPARC Patchset Details README Download Extended Support CPU OS Patchset 2018/04 Solaris 10 x86 Patchset Details README Download Extended Support CPU OS Patchset 2018/01 Solaris 10 SPARC Patchset Details README


Operating Systems Support CPU OS Patchset 2018/01 Solaris 10 x86 Patchset Details README Download Operating Systems Support
Please reach out to your local sales representative if you wish to get more information on the benefits of purchasing Extended Support for Solaris 10.

20 Apr 2018 12:13pm GMT

18 Apr 2018


Alan Coopersmith: Oracle Solaris ZFS Device Removal

At long last, we provide the ability to remove a top-level VDEV from a ZFS storage pool in the upcoming Solaris 11.4 Beta refresh release.

For many years, our recommendation was to create a pool based on current capacity requirements and then grow the pool to meet increasing capacity needs by adding VDEVs or by replacing smaller LUNs with larger LUNs. It is trivial to add capacity or replace smaller LUNs with larger LUNs, sometimes with just one simple command.

The simplicity of ZFS is one of its great strengths!

I still recommend the practice of creating a pool that meets current capacity requirements and then adding capacity when needed. If you need to repurpose pool devices in an over-provisioned pool or if you accidentally misconfigure a pool device, you now have the flexibility to resolve these scenarios.

Review the following practical considerations when using this new feature, which should be used as an exception rather than the rule for pool configuration on production systems:

A few implementation details in case you were wondering:

See the examples below.

Repurpose Pool Devices

The following pool, tank, has low space consumption so one VDEV is removed.

# zpool list tank NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT tank 928G 28.1G 900G 3% 1.00x ONLINE # zpool status tank pool: tank state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 c4t2d0 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 errors: No known data errors - # zpool remove tank mirror-1 # zpool status tank pool: tank state: ONLINE status: One or more devices are being removed. action: Wait for the resilver to complete. Run 'zpool status -v' to see device specific details. scan: resilver in progress since Sun Apr 15 20:58:45 2018 28.1G scanned 3.07G resilvered at 40.9M/s, 21.83% done, 4m35s to go config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c3t2d0 ONLINE 0 0 0 c4t2d0 ONLINE 0 0 0 mirror-1 REMOVING 0 0 0 c1t7d0 REMOVING 0 0 0 c5t3d0 REMOVING 0 0 0 errors: No known data errors

Run the zpool iostat command to verify that data is being written to the remaining VDEV.

# zpool iostat -v tank 5 capacity operations bandwidth pool alloc free read write read write ------------------------- ----- ----- ----- ----- ----- ----- tank 28.1G 900G 9 182 932K 21.3M mirror-0 14.1G 450G 1 182 7.90K 21.3M c3t2d0 - - 0 28 4.79K 21.3M c4t2d0 - - 0 28 3.92K 21.3M mirror-1 - - 8 179 924K 21.2M c1t7d0 - - 1 28 495K 21.2M c5t3d0 - - 1 28 431K 21.2M ------------------------- ----- ----- ----- ----- ----- ----- capacity operations bandwidth pool alloc free read write read write ------------------------- ----- ----- ----- ----- ----- ----- tank 28.1G 900G 0 967 0 60.0M mirror-0 14.1G 450G 0 967 0 60.0M c3t2d0 - - 0 67 0 60.0M c4t2d0 - - 0 68 0 60.4M mirror-1 - - 0 0 0 0 c1t7d0 - - 0 0 0 0 c5t3d0 - - 0 0 0 0 ------------------------- ----- ----- ----- ----- ----- ----- Misconfigured Pool Device

In this case, a device was intended to be added as a cache device but was added a single device. The problem is identified and resolved.

# zpool status rzpool pool: rzpool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM rzpool ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 errors: No known data errors # zpool add rzpool c3t3d0 vdev verification failed: use -f to override the following errors: mismatched replication level: pool uses raidz and new vdev is disk Unable to build pool from specified devices: invalid vdev configuration # zpool add -f rzpool c3t3d0 # zpool status rzpool pool: rzpool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM rzpool ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c3t3d0 ONLINE 0 0 0 errors: No known data errors # zpool remove rzpool c3t3d0 # zpool add rzpool cache c3t3d0 # zpool status rzpool pool: rzpool state: ONLINE scan: resilvered 0 in 1s with 0 errors on Sun Apr 15 21:09:35 2018 config: NAME STATE READ WRITE CKSUM rzpool ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 c5t7d0 ONLINE 0 0 0 c2t3d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 cache c3t3d0 ONLINE 0 0 0

In summary, Solaris 11.4 includes a handy new option for repurposing pool devices and resolving pool misconfiguration errors.

18 Apr 2018 7:00pm GMT

17 Apr 2018


Alan Coopersmith: Oracle Solaris 11.4 Open Beta Refreshed!

On January 30, 2018, we released the Oracle Solaris 11.4 Open Beta. It has been quite successful.

Today, we are announcing that we've refreshed the 11.4 Open Beta. This refresh includes new capabilities and additional bug fixes (over 280 of them) as we drive to the General Availability Release of Oracle Solaris 11.4.

Some new features in this release are:

Also, the Oracle Solaris 11.4 Beta refresh includes the changes to mitigate CVE-2017-5753, otherwise known as Spectre Variant 1, for Firefox, the NVIDIA Graphics driver, and the Solaris Kernel (see MOS docs on SPARC and x86 for more information).

Additionally, new bundled software includes, gcc 7.3, libidn2, and qpdf 7.0.0, and more than 45 new bundled software versions.

Before I go further, I have to say:

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle Corporation.

I want to take a few minutes to address some questions I've been getting that the upcoming release of Oracle Solaris 11.4 has sparked.

Oracle Solaris 11.4 runs on Oracle SPARC and x86 systems released since 2011, but not on certain older systems that had been supported in Solaris 11.3 and earlier. Specifically, systems not supported in Oracle Solaris 11.4 include systems based on the SPARC T1, T2, and T3 processors or the SPARC64 VII+ and earlier based "Sun4u" systems such as the SPARC Enterprise M4000. To allow customers time to migrate to newer hardware we intend to provide critical security fixes as necessary on top of the last SRU delivered for 11.3 for the following year. These updates will not provide the same level of content as regular SRUs and are intended solely as a transition vehicle. Customers using newer hardware are encouraged to update to Oracle Solaris 11.4 and subsequent Oracle Solaris 11 SRUs as soon as practical.

Another question I've been getting quite a bit is about the release frequency and strategy for Oracle Solaris 11.

After much discussion internally and externally, with you, our customers, about our current continuous delivery release strategy, we are going forward with our current strategy with some minor changes:

​This should make our releases more predictable, maintain the reliability you've come to depend on, and provide new features to you rapidly, allowing you to test them and deploy them faster.

Oracle Solaris 11.4 is secure, simple and cloud-ready and compatible with all your existing Oracle Solaris 11.3 and earlier applications.

Go give the latest beta a try. You can download it here.

17 Apr 2018 7:00pm GMT

Alan Coopersmith: Oracle Solaris 11.3 SRU 31

We've just released Oracle Solaris 11.3 SRU 31. This is the April Critical Patch update and contains some important security fixes as well as enhancements to Oracle Solaris. SRU31 is now available from My Oracle Support Doc ID 2045311.1, or via 'pkg update' from the support repository at https://pkg.oracle.com/solaris/support .

The following components have been updated to address security issues:

These enhancements have also been added:

Full details of this SRU can be found in My Oracle Support Doc 2385753.1.
For the list of Service Alerts affecting each Oracle Solaris 11.3 SRU, see Important Oracle Solaris 11.3 SRU Issues (Doc ID 2076753.1).

17 Apr 2018 5:37pm GMT

Iago Toral: Frame analysis of a rendering of the Sponza model

For some time now I have been working on a personal project to render the well known Sponza model provided by Crytek using Vulkan. Here is a picture of the current (still a work-in-progress) result:

Sponza rendering

This screenshot was captured on my Intel Kabylake laptop, running on the Intel Mesa Vulkan driver (Anvil).

The following list includes the main features implemented in the demo:

I have been thinking about writing post about this for some time, but given that there are multiple features involved I wasn't sure how to scope it. Eventually I decided to write a "frame analysis" post where I describe, step by step, all the render passes involved in the production of the single frame capture showed at the top of the post. I always enjoyed reading this kind of articles so I figured it would be fun to write one myself and I hope others find it informative, if not entertaining.

To avoid making the post too dense I won't go into too much detail while describing each render pass, so don't expect me to go into the nitty-gritty of how I implemented Screen Space Ambient Occlussion for example. Instead I intend to give a high-level overview of how the various features implemented in the demo work together to create the final result. I will provide screenshots so that readers can appreciate the outputs of each step and verify how detail and quality build up over time as we include more features in the pipeline. Those who are more interested in the programming details of particular features can always have a look at the Vulkan source code (link available at the bottom of the article), look for specific tutorials available on the Internet or wait for me to write feature-specifc posts (I don't make any promises though!).

If you're interested in going through with this then grab a cup of coffe and get ready, it is going to be a long ride!

Step 0: Culling

This is the only step in this discussion that runs on the CPU, and while optional from the point of view of the result (it doesn't affect the actual result of the rendering), it is relevant from a performance point of view. Prior to rendering anything, in every frame, we usually want to cull meshes that are not visible to the camera. This can greatly help performance, even on a relatively simple scene such as this. This is of course more noticeable when the camera is looking in a direction in which a significant amount of geometry is not visible to it, but in general, there are always parts of the scene that are not visible to the camera, so culling is usually going to give you a performance bonus.

In large, complex scenes with tons of objects we probably want to use more sophisticated culling methods such as Quadtrees, but in this case, since the number of meshes is not too high (the Sponza model is slightly shy of 400 meshes), we just go though all of them and cull them individually against the camera's frustum, which determines the area of the 3D space that is visible to the camera.

The way culling works is simple: for each mesh we compute an axis-aligned bounding box and we test that box for intersection with the camera's frustum. If we can determine that the box never intersects, then the mesh enclosed within it is not visible and we flag it as such. Later on, at rendering time (or rather, at command recording time, since the demo has been written in Vulkan) we just skip the meshes that have been flagged.

The algorithm is not perfect, since it is possible that an axis-aligned bounding box for a particular mesh is visible to the camera and yet no part of the mesh itself is visible, but it should not affect a lot of meshes and trying to improve this would incur in additional checks that could undermine the efficiency of the process anyway.

Since in this particular demo we only have static geometry we only need to run the culling pass when the camera moves around, since otherwise the list of visible meshes doesn't change. If dynamic geometry were present, we would need to at least cull dynamic geometry on every frame even if the camera stayed static, since dynamic elements may step in (or out of) the viewing frustum at any moment.

Step 1: Depth pre-pass

This is an optional stage, but it can help performance significantly in many cases. The idea is the following: our GPU performance is usually going to be limited by the fragment shader, and very specially so as we target higher resolutions. In this context, without a depth pre-pass, we are very likely going to execute the fragment shader for fragments that will not end up in the screen because they are occluded by fragments produced by other geometry in the scene that will be rasterized to the same XY screen-space coordinates but with a smaller Z coordinate (closer to the camera). This wastes precious GPU resources.

One way to improve the situation is to sort our geometry by distance from the camera and render front to back. With this we can get fragments that are rasterized from background geometry quickly discarded by early depth tests before the fragment shader runs for them. Unfortunately, although this will certainly help (assuming we can spare the extra CPU work to keep our geometry sorted for every frame), it won't eliminate all the instances of the problem in the general case.

Also, some times things are more complicated, as the shading cost of different pieces of geometry can be very different and we should also take this into account. For example, we can have a very large piece of geometry for which some pixels are very close to the camera while some others are very far away and that has a very expensive shader. If our renderer is doing front-to-back rendering without any other considerations it will likely render this geometry early (since parts of it are very close to the camera), which means that it will shade all or most of its very expensive fragments. However, if the renderer accounts for the relative cost of the shader execution it would probably postpone rendering it as much as possible, so by the time it actually renders it, it takes advantage of early fragment depth tests to avoid as many of its expensive fragment shader executions as possible.

Using a depth-prepass ensures that we only run our fragment shader for visible fragments, and only those, no matter the situation. The downside is that we have to execute a separate rendering pass where we render our geometry to the depth buffer so that we can identify the visible fragments. This pass is usually very fast though, since we don't even need a fragment shader and we are only writing to a depth texture. The exception to this rule is geometry that has opacity information, such as opacity textures, in which case we need to run a cheap fragment shader to identify transparent pixels and discard them so they don't hit the depth buffer. In the Sponza model we need to do that for the flowers or the vines on the columns for example.

Depth pre-pass output

The picture shows the output of the depth pre-pass. Darker colors mean smaller distance from the camera. That's why the picture gets brighter as we move further away.

Now, the remaining passes will be able to use this information to limit their shading to fragments that, for a given XY screen-space position, match exactly the Z value stored in the depth buffer, effectively selecting only the fragments that will be visible in the screen. We do this by configuring the depth test to do an EQUAL test instead of the usual LESS test, which is what we use in the depth-prepass.

In this particular demo, running on my Intel GPU, the depth pre-pass is by far the cheapest of all the GPU passes and it definitely pays off in terms of overall performance output.

Step 2: Shadow map

In this demo we have single source of light produced by a directional light that simulates the sun. You can probably guess the direction of the light by checking out the picture at the top of this post and looking at the direction projected shadows.

I already covered how shadow mapping works in previous series of posts, so if you're interested in the programming details I encourage you to read that. Anyway, the basic idea is that we want to capture the scene from the point of view of the light source (to be more precise, we want to capture the objects in the scene that can potentially produce shadows that are visible to our camera).

With that information, we will be able to inform out lighting pass so it can tell if a particular fragment is in the shadows (not visible from our light's perspective) or in the light (visible from our light's perspective) and shade it accordingly.

From a technical point of view, recording a shadow map is exactly the same as the depth-prepass: we basically do a depth-only rendering and capture the result in a depth texture. The main differences here are that we need to render from the point of view of the light instead of our camera's and that this being a directional light, we need to use an orthographic projection and adjust it properly so we capture all relevant shadow casters around the camera.

Shadow map

In the image above we can see the shadow map generated for this frame. Again, the brighter the color, the further away the fragment is from the light source. The bright white area outside the atrium building represents the part of the scene that is empty and thus ends with the maximum depth, which is what we use to clear the shadow map before rendering to it.

In this case, we are using a 4096×4096 texture to store the shadow map image, much larger than our rendering target. This is because shadow mapping from directional lights needs a lot of precision to produce good results, otherwise we end up with very pixelated / blocky shadows, more artifacts and even missing shadows for small geometry. To illustrate this better here is the same rendering of the Sponza model from the top of this post, but using a 1024×1024 shadow map (floor reflections are disabled, but that is irrelevant to shadow mapping):

Sponza rendering with 1024×1024 shadow map

You can see how in the 1024×1024 version there are some missing shadows for the vines on the columns and generally blurrier shadows (when not also slightly distorted) everywhere else.

Step 3: GBuffer

In deferred rendering we capture various attributes of the fragments produced by rasterizing our geometry and write them to separate textures that we will use to inform the lighting pass later on (and possibly other passes).

What we do here is to render our geometry normally, like we did in our depth-prepass, but this time, as we explained before, we configure the depth test to only pass fragments that match the contents of the depth-buffer that we produced in the depth-prepass, so we only process fragments that we now will be visible on the screen.

Deferred rendering uses multiple render targets to capture each of these attributes to a different texture for each rasterized fragment that passes the depth test. In this particular demo our GBuffer captures:

  1. Normal vector
  2. Diffuse color
  3. Specular color
  4. Position of the fragment from the point of view of the light (for shadow mapping)

It is important to be very careful when defining what we store in the GBuffer: since we are rendering to multiple screen-sized textures, this pass has serious bandwidth requirements and therefore, we should use texture formats that give us the range and precision we need with the smallest pixel size requirements and avoid storing information that we can get or compute efficiently through other means. This is particularly relevant for integrated GPUs that don't have dedicated video memory (such as my Intel GPU).

In the demo, I do lighting in view-space (that is the coordinate space used takes the camera as its origin), so I need to work with positions and vectors in this coordinate space. One of the parameters we need for lighting is surface normals, which are conveniently stored in the GBuffer, but we will also need to know the view-space position of the fragments in the screen. To avoid storing the latter in the GBuffer we take advantage of the fact that we can reconstruct the view-space position of any fragment on the screen from its depth (which is stored in the depth buffer we rendered during the depth-prepass) and the camera's projection matrix. I might cover the process in more detail in another post, for now, what is important to remember is that we don't need to worry about storing fragment positions in the GBuffer and that saves us some bandwidth, helping performance.

Let's have a look at the various GBuffer textures we produce in this stage:

Normal vectors

GBuffer normal texture

Here we see the normalized normal vectors for each fragment in view-space. This means they are expressed in a coordinate space in which our camera is at the origin and the positive Z direction is opposite to the camera's view vector. Therefore, we see that surfaces pointing to the right of our camera are red (positive X), those pointing up are green (positive Y) and those pointing opposite to the camera's view direction are blue (positive Z).

It should be mentioned that some of these surfaces use normal maps for bump mapping. These normal maps are textures that provide per-fragment normal information instead of the usual vertex normals that come with the polygon meshes. This means that instead of computing per-fragment normals as a simple interpolation of the per-vertex normals across the polygon faces, which gives us a rather flat result, we use a texture to adjust the normal for each fragment in the surface, which enables the lighting pass to render more nuanced surfaces that seem to have a lot more volume and detail than they would have otherwise.

For comparison, here is the GBuffer normal texture without bump mapping enabled. The difference in surface detail should be obvious. Just look at the lion figure at the far end or the columns and and you will immediately notice the addditional detail added with bump mapping to the surface descriptions:

GBuffer normal texture (bump mapping disabled)

To make the impact of the bump mapping more obvious, here is a different shot of the final rendering focusing on the columns of the upper floor of the atrium, with and without bump mapping:

Bump mapping enabled Bump mapping disabled

All the extra detail in the columns is the sole result of the bump mapping technique.

Diffuse color

GBuffer diffuse texture

Here we have the diffuse color of each fragment in the scene. This is basically how our scene would look like if we didn't implement a lighting pass that considers how the light source interacts with the scene.

Naturally, we will use this information in the lighting pass to modulate the color output based on the light interaction with each fragment.

Specular color

GBuffer specular texture

This is similar to the diffuse texture, but here we are storing the color (and strength) used to compute specular reflections.

Similarly to normal textures, we use specular maps to obtain per-fragment specular colors and intensities. This allows us to simulate combinations of more complex materials in the same mesh by specifying different specular properties for each fragment.

For example, if we look at the cloths that hang from the upper floor of the atrium, we see that they are mostly black, meaning that they barely produce any specular reflection, as it is to be expected from textile materials. However, we also see that these same cloths have an embroidery that has specular reflection (showing up as a light gray color), which means these details in the texture have stronger specular reflections than its surrounding textile material:

Specular reflection on cloth embroidery

The image shows visible specular reflections in the yellow embroidery decorations of the cloth (on the bottom-left) that are not present in the textile segment (the blue region of the cloth).

Fragment positions from Light

GBuffer light-space position texture

Finally, we store fragment positions in the coordinate space of the light source so we can implement shadows in the lighting pass. This image may be less intuitive to interpret, since it is encoding space positions from the point of view of the sun rather than physical properties of the fragments. We will need to retrieve this information for each fragment during the lighting pass so that we can tell, together with the shadow map, which fragments are visible from the light source (and therefore are directly lit by the sun) and which are not (and therefore are in the shadows). Again, more detail on how that process works, step by step and including Vulkan source code in my series of posts on that topic.

Step 4: Screen Space Ambient Occlusion

With the information stored in the GBuffer we can now also run a screen-space ambient occlusion pass that we will use to improve our lighting pass later on.

The idea here, as I discussed in my lighting and shadows series, the Phong lighting model simplifies ambient lighting by making it constant across the scene. As a consequence of this, lighting in areas that are not directly lit by a light source look rather flat, as we can see in this image:

SSAO disabled

Screen-space Ambient Occlusion is a technique that gathers information about the amount of ambient light occlusion produced by nearby geometry as a way to better estimate the ambient light term of the lighting equations. We can then use that information in our lighting pass to modulate ambient light accordingly, which can greatly improve the sense of depth and volume in the scene, specially in areas that are not directly lit:

SSAO enabled

Comparing the images above should illustrate the benefits of the SSAO technique. For example, look at the folds in the blue curtains on the right side of the images, without SSAO, we barely see them because the lighting is too flat across all the pixels in the curtain. Similarly, thanks to SSAO we can create shadowed areas from ambient light alone, as we can see behind the cloths that hang from the upper floor of the atrium or behind the vines on the columns.

To produce this result, the output of the SSAO pass is a texture with ambient light intensity information that looks like this (after some blur post-processing to eliminate noise artifacts):

SSAO output texture

In that image, white tones represent strong light intensity and black tones represent low light intensity produced by occlusion from nearby geometry. In our lighting pass we will source from this texture to obtain per-fragment ambient occlusion information and modulate the ambient term accordingly, bringing the additional volume showcased in the image above to the final rendering.

Step 6: Lighting pass

Finally, we get to the lighting pass. Most of what we showcased above was preparation work for this.

The lighting pass mostly goes as I described in my lighting and shadows series, only that since we are doing deferred rendering we get our per-fragment lighting inputs by reading from the GBuffer textures instead of getting them from the vertex shader.

Basically, the process involves retrieving diffuse, ambient and specular color information from the GBuffer and use it as input for the lighting equations to produce the final color for each fragment. We also sample from the shadow map to decide which pixels are in the shadows, in which case we remove their diffuse and specular components, making them darker and producing shadows in the image as a result.

We also use the SSAO output to improve the ambient light term as described before, multipliying the ambient term of each fragment by the SSAO value we computed for it, reducing the strength of the ambient light for pixels that are surrounded by nearby geometry.

The lighting pass is also where we put bump mapping to use. Bump mapping provides more detailed information about surface normals, which the lighting pass uses to simulate more complex lighting interactions with mesh surfaces, producing significantly enhanced results, as I showcased earlier in this post.

After combining all this information, the lighting pass produces an output like this. Compare it with the GBuffer diffuse texture to see all the stuff that this pass is putting together:

Lighting pass output

Step 7: Tone mapping

After the lighting pass we run a number of post-processing passes, of which tone mapping is the first one. The idea behind tone mapping is this: normally, shader color outputs are limited to the range [0, 1], which puts a hard cap on our lighting calculations. Specifically, it means that when our light contributions to a particular pixel go beyond 1.0 in any color component, they get clamped, which can distort the resulting color in unrealistic ways, specially when this happens during intermediate lighting calculations (since the deviation from the physically correct color is then used as input to more computations, which then build on that error).

To work around this we do our lighting calculations in High Dynamic Range (HDR) which allows us to produce color values with components larger than 1.0, and then we run a tone mapping pass to re-map the result to the [0, 1] range when we are done with the lighting calculations and we are ready for display.

The nice thing about tone mapping is that it gives the developer control over how that mapping happens, allowing us to decide if we are interested in preserving more detail in the darker or brighter areas of the scene.

In this particular demo, I used HDR rendering to ramp up the intensity of the sun light beyond what I could have represented otherwise. Without tone mapping this would lead to unrealistic lighting in areas with strong light reflections, since would exceed the 1.0 per-color-component cap and lead to pure white colors as result, losing the color detail from the original textures. This effect can be observed in the following pictures if you look at the lit area of the floor. Notice how the tone-mapped picture better retains the detail of the floor texture while in the non tone-mapped version the floor seems to be over-exposed to light and large parts of it just become white as a result (shadow mapping has been disabled to better showcase the effects of tone-mapping on the floor):

Tone mapping disabled Tone mapping enabled

Step 8: Screen Space Reflections (SSR)

The material used to render the floor is reflective, which means that we can see the reflections of the surrounding environment on it.

There are various ways to capture reflections, each with their own set of pros and cons. When I implemented my OpenGL terrain rendering demo I implemented water reflections using "Planar Reflections", which produce very accurate results at the expense of requiring to re-render the scene with the camera facing in the same direction as the reflection. Although this can be done at a lower resolution, it is still quite expensive and cumbersome to setup (for example, you would need to run an additional culling pass), and you also need to consider that we need to do this for each planar surface you want to apply reflections on, so it doesn't scale very well. In this demo, although it is not visible in the reference screenshot, I am capturing reflections from the floor sections of both stories of the atrium, so the Planar Reflections approach might have required me to render twice when fragments of both sections are visible (admittedly, not very often, but not impossible with the free camera).

So in this particular case I decided to experiment with a different technique that has become quite popular, despite its many shortcomings, because it is a lot faster: Screen Space Reflections.

As all screen-space techniques, the technique uses information already present in the screen to capture the reflection information, so we don't have to render again from a different perspective. This leads to a number of limitations that can produce fairly visible artifacts, specially when there is dynamic geometry involved. Nevertheless, in my particular case I don't have any dynamic geometry, at least not yet, so while the artifacts are there they are not quite as distracting. I won't go into the details of the artifacts introduced with SSR here, but for those interested, here is a good discussion.

I should mention that my take on this is fairly basic and doesn't implement relevant features such as the Hierarchical Z Buffer optimization (HZB) discussed here.

The technique has 3 steps: capturing reflections, applying roughness material properties and alpha blending:

Capturing reflections

I only implemented support for SSR in the deferred path, since like in the case of SSAO (and more generally all screen-space algorithms), deferred rendering is the best match since we are already capturing screen-space information in the GBuffer.

The first stage for this requires to have means to identify fragments that need reflection information. In our case, the floor fragments. What I did for this is to capture the reflectiveness of the material of each fragment in the screen during the GBuffer pass. This is a single floating-point component (in the 0-1 range). A value of 0 means that the material is not reflective and the SSR pass will just ignore it. A value of 1 means that the fragment is 100% reflective, so its color value will be solely the reflection color. Values in between allow us to control the strength of the reflection for each fragment with a reflective material in the scene.

One small note on the GBuffer storage: because this is a single floating-point value, we don't necessarily need an extra attachment in the GBuffer (which would have some performance penalty), instead we can just put this in the alpha component of the diffuse color, since we were not using it (the Intel Mesa driver doesn't support rendering to RGB textures yet, so since we are limited to RGBA we might as well put it to good use).

Besides capturing which fragments are reflective, we can also store another piece of information relevant to the reflection computations: the material's roughness. This is another scalar value indicating how much blurring we want to apply to the resulting reflection: smooth metal-like surfaces can have very sharp reflections but with rougher materials that have not smooth surfaces we may want the reflections to look a bit blurry, to better represent these imperfections.

Besides the reflection and roughness information, to capture screen-space reflections we will need access to the output of the previous pass (tone mapping) from which we will retrieve the color information of our reflection points, the normals that we stored in the GBuffer (to compute reflection directions for each fragment in the floor sections) and the depth buffer (from the depth-prepass), so we can check for reflection collisions.

The technique goes like this: for each fragment that is reflective, we compute the direction of the reflection using its normal (from the GBuffer) and the view vector (from the camera and the fragment position). Once we have this direction, we execute a ray marching from the fragment position, in the direction of the reflection. For each point we generate, we take the screen-space X and Y coordinates and use them to retrieve the Z-buffer depth for that pixel in the scene. If the depth buffer value is smaller than our sample's it means that we have moved past foreground geometry and we stop the process. If we got to this point, then we can do a binary search to pin-point the exact location where the collision with the foreground geometry happens, which will give us the screen-space X and Y coordinates of the reflection point. Once we have that we only need to sample the original scene (the output from the tone mapping pass) at that location to retrieve the reflection color.

As discussed earlier, the technique has numerous caveats, which we need to address in one way or another and maybe adapt to the characteristics of different scenes so we can obtain the best results in each case.

The output of this pass is a color texture where we store the reflection colors for each fragment that has a reflective material:

Reflection texture

Naturally, the image above only shows reflection data for the pixels in the floor, since those are the only ones with a reflective material attached. It is immediately obvious that some pixels lack reflection color though, this is due to the various limitations of the screen-space technique that are discussed in the blog post I linked above.

Because the reflections will be alpha-blended with the original image, we use the reflectiveness that we stored in the GBuffer as the base for the alpha component of the reflection color as well (there are other aspects that can contribute to the alpha component too, but I won't go into that here), so the image above, although not visible in the screenshot, has a valid alpha channel.

Considering material roughness

Once we have captured the reflection image, the next step is to apply the material roughness settings. We can accomplish this with a simple box filter based on the roughness of each fragment: the larger the roughness, the larger the box filter we apply and the blurrier the reflection we get as a result. Because we store roughness for each fragment in the GBuffer, we can have multiple reflective materials with different roughness settings if we want. In this case, we just have one material for the floor though.

Alpha blending

Finally, we use alpha blending to incorporate the reflection onto the original image (the output from the tone mapping) ot incorporate the reflections to the final rendering:

SSR output

Step 9: Anti-aliasing (FXAA)

So far we have been neglecting anti-aliasing. Because we are doing deferred rendering Multi-Sample Anti-Aliasing (MSAA) is not an option: MSAA happens at rasterization time, which in a deferred renderer occurs before our lighting pass (specifically, when we generate the GBuffer), so it cannot account for the important effects that the lighting pass has on the resulting image, and therefore, on the eventual aliasing that we need to correct. This is why deferred renderers usually do anti-aliasing via post-processing.

In this demo I have implemented a well-known anti-aliasing post-processing pass known as Fast Approximate Anti Aliasing (FXAA). The technique attempts to identify strong contrast across neighboring pixels in the image to identify edges and then smooth them out using linear filtering. Here is the final result which matches the one I included as reference at the top of this post:

Anti-aliased output

The image above shows the results of the anti-aliasing pass. Compare that with the output of the SSR pass. You can see how this pass has effectively removed the jaggies observed in the cloths hanging from the upper floor for example.

Unlike MSAA, which acts on geometry edges only, FXAA works on all pixels, so it can also smooth out edges produced by shaders or textures. Whether that is something we want to do or not may depend on the scene. Here we can see this happening on the foreground column on the left, where some of the imperfections of the stone are slightly smoothed out by the FXAA pass.

Conclusions and source code

So that's all, congratulations if you managed to read this far! In the past I have found articles that did frame analysis like this quite interesting so it's been fun writing one myself and I only hope that this was interesting to someone else.

This demo has been implemented in Vulkan and includes a number of configurable parameters that can be used to tweak performance and quality. The work-in-progress source code is available here, but beware that I have only tested this on Intel, since that is the only hardware I have available, so you may find issues if you run this on other GPUs. If that happens, let me know in the comments and I might be able to provide fixes at some point.

17 Apr 2018 12:45pm GMT

16 Apr 2018


Eric Anholt: 2018-04-16

On the vc4 front, I did the investigation of the HDL to determine that the OLED matrix applies before the gamma tables, so we can expose it in the DRM for Android's color correction. Stefan was also interested in reworking his fencing patches to use syncobjs, so hopefully we can merge those and get DRM HWC support in mainline soon.

I also took a look at a warning we're seeing when a cursor with a nonzero hotspot goes to the upper left corner of the screen - unfortunately, fixing it properly looks like it'll be a bit of a rework.

I finally took a moment to port over an etnaviv change to remove the need for a DRM subsystem node in the DT. This was a request from Rob Herring long ago, but etnaviv's change finally made it clear what we should be doing instead.

For vc5, I stabilized the GPU scheduler work and pushed it to my main branch. I've now started working on using the GMP to isolate clients from each other (important for being able to have unprivileged GPU workloads running alongside X, and also for making sure that say, some misbehaving webgl doesn't trash your X server's other window contents). Hopefully once this security issue is resolved, I can (finally!) propose merging it to the kernel.

16 Apr 2018 12:30am GMT

13 Apr 2018


Robert Foss: Upstream Linux support for the new NXP i.MX 8M

Dart iMX 8M

The i.MX6 platform has for the past few years enjoyed a large effort to add upstream support to Linux and surrounding projects. Now it is at the point where nothing is really missing any more. Improvements are still being made, to the graphics driver for i.MX 6, but functionally it is complete.

Etnaviv driver development timeline

The i.MX8 is a different story. The newly introduced platform, with hardware still difficult to get access to, is seeing lots of work being done, but much still remains to be done.

That being said, initial support for the GPU, the Vivante GC7000, is in place and is able to successfully run Wayland/Weston, glmark, etc. This should also mean that running Android ontop of the currently not-quite-upstream stack is …

13 Apr 2018 9:39am GMT

09 Apr 2018


Eric Anholt: 2018-04-09

I continued spending time on VC5 in the last two weeks.

First, I've ported the driver over to the AMDGPU scheduler. Prior to this, vc4 and vc5's render jobs get queued to the HW in the order that the GL clients submit them to the kernel. OpenGL requires that jobs within a client effectively happen in that order (though we do some clever rescheduling in userspace to reduce overhead of some render-to-texture workloads due to us being a tiler). However, having submission order to the kernel dictate submission order to the HW means that a single busy client (imagine a crypto miner) will starve your desktop workload, since the desktop has to wait behind all of the bulk-work jobs the other client has submitted.

With the AMDGPU scheduler, each client gets its own serial run queue, and the scheduler picks between them as jobs in the run queues become ready. It also gives us easy support for in-fences on your jobs, one of the requirements for Android. All of this is with a bit less vc5 driver code than I had for my own, inferior scheduler.

Currently I'm making it most of the way through piglit and conformance test runs, before something goes wrong around the time of a GPU reset and the kernel crashes. In the process, I've improved the documentation on the scheduler's API, and hopefully this encourages other drivers to pick it up.

Second, I've been working on debugging some issues that may be TLB flushing bugs. On the piglit "longprim" test, we go through overflow memory quickly, and allocating overflow memory involves updating PTEs and then having the GPU read from those in very short order. I see a lot of GPU segfaults on non-writable PTEs where the new overflow BO was allocated just after the last one (so maybe the lookups that happened near the end of the last one pre-fetched some PTEs from our space?). The confusing part is that I keep getting write errors far past where I would have expected any previous PTE lookups to have gone. Yet, outside of this case and maybe a couple of others within piglit and the CTS, we seem to be completely fine at PTE updates.

On the VC4 front, I wrote some docs for what I think the steps are for people that want to connect new DSI panels to Raspberry Pi. I reviewed Stefan's patches for using the CTM for color correction on Android (promising, except I'm concerned it applies at the wrong stage of the DRM display pipeline), and some of Boris's work on async updates (simplifying our cursor and async pageflip path). I also reviewed an Intel patch that's necessary for a core DRM change we want for our SAND display support, and a Mesa patch fixing a regression with the new modifiers code.

09 Apr 2018 12:30am GMT

08 Apr 2018


Peter Hutterer: GNOME 3.28 uses clickfinger behaviour by default on touchpads

To reduce the number of bugs filed against libinput consider this a PSA: as of GNOME 3.28, the default click method on touchpads is the 'clickfinger' method (see the libinput documentation, it even has pictures). In short, rather than having a separate left/right button area on the bottom edge of the touchpad, right or middle clicks are now triggered by clicking with 2 or 3 fingers on the touchpad. This is the method macOS has been using for a decade or so.

Prior to 3.28, GNOME used the libinput defaults which vary depending on the hardware (e.g. mac touchpads default to clickfinger, most other touchpads usually button areas). So if you notice that the right button area disappeared after the 3.28 update, either start using clickfinger or reset using the gnome-tweak-tool. There are gsettings commands that achieve the same thing if gnome-tweak-tool is not an option:

$ gsettings range org.gnome.desktop.peripherals.touchpad click-method
$ gsettings get org.gnome.desktop.peripherals.touchpad click-method
$ gsettings set org.gnome.desktop.peripherals.touchpad click-method 'areas'

For reference, the upstream commit is in gsettings-desktop-schemas.

Note that this only affects so-called ClickPads, touchpads where the entire touchpad is a button. Touchpads with separate physical buttons in front of the touchpad are not affected by any of this.

08 Apr 2018 10:14pm GMT