06 Feb 2016


Julien Danjou: FOSDEM 2016, recap

Last week-end, I was in Brussels, Belgium for the FOSDEM, one of the greatest open source developer conference. I was not sure to go there this year (I already skipped it in 2015), but it turned out I was requested to do a talk in the shared Lua & GNU Guile devroom.

As a long time Lua user and developer, and a follower of GNU Guile for several years, the organizer asked me to run a talk that would be a link between the two languages.

I've entitled my talk "How awesome ended up with Lua and not Guile" and gave it to a room full of interested users of the awesome window manager 🙂.

We continued with a panel discussion entitled "The future of small languages Experience of Lua and Guile" composed of Andy Wingo, Christopher Webber, Ludovic Courtès, Etiene Dalcol, Hisham Muhammaad and myself. It was a pretty interesting discussion, where both language shared their views on the state of their languages.

It was a bit awkward to talk about Lua & Guile whereas most of my knowledge was year old, but it turns out many things didn't change. I hope I was able to provide interesting hindsight to both community. Finally, it was a pretty interesting FOSDEM to me, and it was a long time I didn't give talk here, so I really enjoyed it. See you next year!

06 Feb 2016 10:54am GMT

04 Feb 2016


Ian Romanick: FOSDEM2016 Presentation

The slides from my FOSDEM talk are now available.

The video of the talk will eventually be available. It takes a lot of time and hard work to prepare all of the videos. The notes in the slides are probably more coherent that I am in the video. There were a few good questions at the end, so you may want to fast forward to that.

I am planning to post the source for the partial Amiga simulator next week. When I do, I will also have another blog post. Watch this space...

04 Feb 2016 11:12pm GMT

03 Feb 2016


Alberto Ruiz: This blog has moved to Wordpress.com

I've moved my blog, you can read my new posts here https://siliconislandblog.wordpress.com/

03 Feb 2016 8:20pm GMT

30 Jan 2016


Simon McVittie: GNOME Developer Experience hackfest: xdg-app + Debian

Over the last few days I've been at the GNOME Developer Experience hackfest in Brussels, looking into xdg-app and how best to use it in Debian and Debian derivatives.

xdg-app is basically a way to run "non-core" software on Linux distributions, analogous to apps on Android and iOS. It doesn't replace distributions like Debian or packaging systems, but it adds a layer above them. It's mostly aimed towards third-party apps obtained from somewhere that isn't your distribution vendor, aiming to address a few long-standing problems in that space:

To address the first point, each application uses a specified "runtime", which is available as /usr inside its sandbox. This can be used to run application bundles with multiple, potentially incompatible sets of dependencies within the same desktop environment. A runtime can be updated within its branch - for instance, if an application uses the "GNOME 3.18" runtime (consisting of a basic Linux system, the GNOME 3.18 libraries, other related libraries like Mesa, and their recursive dependencies like libjpeg), it can expect to see minor-version updates from GNOME 3.18.x (including any security updates that might be necessary for the bundled libraries), but not a jump to GNOME 3.20.

To address the second issue, the plan is for application bundles to be available as a single file, containing metadata (such as the runtime to use), the app itself, and any dependencies that are not available in the runtime (which the app vendor is responsible for updating if necessary). However, the primary way to distribute and upgrade runtimes and applications is to package them as OSTree repositories, which provide a git-like content-addressed filesystem, with efficient updates using binary deltas. The resulting files are hard-linked into place.

To address the last point, application bundles run partially isolated from the wider system, using containerization techniques such as namespaces to prevent direct access to system resources. Resources from outside the sandbox can be accessed via "portal" services, which are responsible for access control; for example, the Documents portal (the only one, so far) displays an "Open" dialog outside the sandbox, then allows the application to access only the selected file.

xdg-app for Debian

One thing I've been doing at this hackfest is improving the existing Debian/Ubuntu packaging for xdg-app (and its dependencies ostree and libgsystem), aiming to get it into a state where I can upload it to Debian experimental. Because xdg-app aims to be a general freedesktop project, I'm currently intending to make it part of the "Utopia" packaging team alongside projects like D-Bus and polkit, but I'm open to suggestions if people want to co-maintain it elsewhere.

In the process of updating xdg-app, I sent various patches to Alex, mostly fixing build and test issues, which are in the new 0.4.8 release.

I'd appreciate co-maintainers and further testing for this stuff, particularly ostree: ostree is primarily a whole-OS deployment technology, which isn't a use-case that I've tested, and in particular ostree-grub2 probably doesn't work yet.

Source code:

Binaries (no trust path, so only use these if you have a test VM):

The "Hello, World" of xdg-apps

Another thing I set out to do here was to make a runtime and an app out of Debian packages. Most of the test applications in and around GNOME use the "freedesktop" or "GNOME" runtimes, which consist of a Yocto base system and lots of RPMs, are rebuilt from first principles on-demand, and are extensive and capable enough that they make it somewhat non-obvious what's in an app or a runtime.

So, here's a step-by-step route through xdg-app, first using typical GNOME instructions, but then using the simplest GUI app I could find - xvt, a small xterm clone. I'm using a Debian testing (stretch) x86_64 virtual machine for all this. xdg-app currently requires systemd-logind to put users and apps in cgroups, either with systemd as pid 1 (systemd-sysv) or systemd-shim and cgmanager; I used the default systemd-sysv. In principle it could work with plain cgmanager, but nobody has contributed that support yet.

Demonstrating an existing xdg-app

Debian's kernel is currently patched to be able to allow unprivileged users to create user namespaces, but make it runtime-configurable, because there have been various security issues in that feature, making it a security risk for a typical machine (and particularly a server). Hopefully unprivileged user namespaces will soon be secure enough that we can enable them by default, but for now, we have to do one of three things to let xdg-app use them:

First, we'll need a runtime. The standard xdg-app tutorial would tell you to download the "GNOME Platform" version 3.18. To do that, you'd add a remote, which is a bit like a git remote, and a bit like an apt repository:

$ wget http://sdk.gnome.org/keys/gnome-sdk.gpg
$ xdg-app remote-add --user --gpg-import=gnome-sdk.gpg gnome \

(I'm ignoring considerations like trust paths and security here, for brevity; in real life, you'd want to obtain the signing key via https and/or have a trust path to it, just like you would for a secure-apt signing key.)

You can list what's available in a remote:

$ xdg-app remote-ls --user gnome

The Platform runtimes are what we want here: they are collections of runtime libraries with which you can run an application. The Sdk runtimes add development tools, header files, etc. to be able to compile apps that will be compatible with the Platform.

For now, all we want is the GNOME 3.18 platform:

$ xdg-app install --user gnome org.gnome.Platform 3.18

Next, we can install an app that uses it, from Alex Larsson's nightly builds of a subset of GNOME. The server they're on doesn't have a great deal of bandwidth, so be nice :-)

$ wget
$ xdg-app remote-add --user --gpg-import=nightly.gpg nightly \
$ xdg-app install --user nightly org.mypaint.MypaintDevel

We now have one app, and the runtime it needs:

$ xdg-app list
$ xdg-app run org.mypaint.MypaintDevel
[you see a GUI window]

Digression: what's in a runtime?

Behind the scenes, xdg-app runtimes and apps are both OSTree trees. This means the ostree tool, from the package of the same name, can be used to inspect them.

$ sudo apt install ostree
$ ostree refs --repo ~/.local/share/xdg-app/repo

A "ref" has roughly the same meaning as in git: something like a branch or a tag. ostree can list the directory tree that it represents:

$ ostree ls --repo ~/.local/share/xdg-app/repo \
d00755 0 0      0 /
-00644 0 0    493 /metadata
d00755 0 0      0 /files
$ ostree ls --repo ~/.local/share/xdg-app/repo \
    runtime/org.gnome.Platform/x86_64/3.18 /files
d00755 0 0      0 /files
l00777 0 0      0 /files/local -> ../var/usrlocal
l00777 0 0      0 /files/sbin -> bin
d00755 0 0      0 /files/bin
d00755 0 0      0 /files/cache
d00755 0 0      0 /files/etc
d00755 0 0      0 /files/games
d00755 0 0      0 /files/include
d00755 0 0      0 /files/lib
d00755 0 0      0 /files/lib64
d00755 0 0      0 /files/libexec
d00755 0 0      0 /files/share
d00755 0 0      0 /files/src

You can see that /files in a runtime is basically a copy of /usr. This is not coincidental: the runtime's /files gets mounted at /usr inside the xdg-app container. There is also some metadata, which is in the ini-like syntax seen in .desktop files:

$ ostree cat --repo ~/.local/share/xdg-app/repo \
    runtime/org.gnome.Platform/x86_64/3.18 /metadata

[Extension org.freedesktop.Platform.GL]

[Extension org.freedesktop.Platform.Timezones]

[Extension org.gnome.Platform.Locale]


Looking at an app, the situation is fairly similar:

$ ostree ls --repo ~/.local/share/xdg-app/repo \
d00755 0 0      0 /
-00644 0 0    258 /metadata
d00755 0 0      0 /export
d00755 0 0      0 /files

This time, /files maps to what will become /app for the application, which was compiled with --prefix=/app:

$ ostree ls --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master /files
d00755 0 0      0 /files
-00644 0 0   4599 /files/manifest.json
d00755 0 0      0 /files/bin
d00755 0 0      0 /files/lib
d00755 0 0      0 /files/share

There is also a /export directory, which is made visible to the host system so that the contained app can appear as a "first-class citizen" in menus:

$ ostree ls --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master /export
d00755 0 0      0 /export
d00755 0 0      0 /export/share
user@debian:~$ ostree ls --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master /export/share
d00755 0 0      0 /export/share
d00755 0 0      0 /export/share/app-info
d00755 0 0      0 /export/share/applications
d00755 0 0      0 /export/share/icons
user@debian:~$ ostree ls --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master /export/share/applications
d00755 0 0      0 /export/share/applications
-00644 0 0    715 /export/share/applications/org.mypaint.MypaintDevel.desktop
$ ostree cat --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master \
[Desktop Entry]
Name=(Nightly) MyPaint
Exec=mypaint %f
Comment=Painting program for digital artists
GenericName=Raster Graphics Editor
GenericName[fr]=Éditeur d'Image Matricielle

Again, there's some metadata:

$ ostree cat --repo ~/.local/share/xdg-app/repo \
    app/org.mypaint.MypaintDevel/x86_64/master /metadata


[Extension org.mypaint.MypaintDevel.Debug]

Building a runtime, probably the wrong way

The way in which the reference/demo runtimes and containers are generated is... involved. As far as I can tell, there's a base OS built using Yocto, and the actual GNOME bits come from RPMs. However, we don't need to go that far to get a working runtime.

In preparing this runtime I'm probably completely ignoring some best-practices and tools - but it works, so it's good enough.

First we'll need a repository:

$ sudo install -d -o$(id -nu) /srv/xdg-apps
$ ostree init --repo /srv/xdg-apps

I'm just keeping this local for this demonstration, but you could rsync it to a web server's exported directory or something - a lot like a git repository, it's just a collection of files. We want everything in /usr because that's what xdg-app expects, hence usrmerge:

$ sudo mount -t tmpfs -o mode=0755 tmpfs /mnt
$ sudo debootstrap --arch=amd64 --include=libx11-6,usrmerge \
    --variant=minbase stretch /mnt
$ sudo mkdir /mnt/runtime
$ sudo mv /mnt/usr /mnt/runtime/files

This obviously has a lot of stuff in it that we don't need - most obviously init, apt and dpkg - but it's Good Enough™.

We will also need some metadata. This is sufficient:

$ sudo sh -c 'cat > /mnt/runtime/metadata'

That's a runtime. We can commit it to ostree, and generate xdg-app metadata:

$ ostree commit --repo /srv/xdg-apps \
    --branch runtime/org.debian.Debootstrap/x86_64/8.20160130 \
$ fakeroot ostree commit --repo /srv/xdg-apps \
    --branch runtime/org.debian.Debootstrap/x86_64/8.20160130
$ fakeroot xdg-app build-update-repo /srv/xdg-apps

(I'm not sure why ostree and xdg-app report "Operation not permitted" when we aren't root or fakeroot - feedback welcome.)

build-update-repo would presumably also be the right place to GPG-sign your repository, if you were doing that.

We can add that as another xdg-app remote:

$ xdg-app remote-add --user --no-gpg-verify local file:///srv/xdg-apps
$ xdg-app remote-ls --user local

Building an app, probably the wrong way

The right way to build an app is to build a "SDK" runtime - similar to that platform runtime, but with development files and tools - and recompile the app and any missing libraries with ./configure --prefix=/app && make && make install. I'm not going to do that, because simplicity is nice, and I'm reasonably sure xvt doesn't actually hard-code /usr into the binary:

$ install -d xvt-app/files/bin
$ sudo apt-get --download-only install xvt
$ dpkg-deb --fsys-tarfile /var/cache/apt/archives/xvt_2.1-20.1_amd64.deb \
    | tar -xvf - ./usr/bin/xvt
$ mv usr xvt-app/files

Again, we'll need metadata, and it's much simpler than the more production-quality GNOME nightly builds:

$ cat > xvt-app/metadata

$ fakeroot ostree commit --repo /srv/xdg-apps \
    --branch app/org.debian.packages.xvt/x86_64/2.1-20.1 xvt-app
$ fakeroot xdg-app build-update-repo /srv/xdg-apps
Updating appstream branch
No appstream data for runtime/org.debian.Debootstrap/x86_64/8.20160130
No appstream data for app/org.debian.packages.xvt/x86_64/2.1-20.1
Updating summary
$ xdg-app remote-ls --user local

The obligatory screenshot

OK, good, now we can install it:

$ xdg-app install --user local org.debian.Debootstrap 8.20160130
$ xdg-app install --user local org.debian.packages.xvt 2.1-20.1
$ xdg-app run --branch=2.1-20.1 org.debian.packages.xvt

xvt in a container

and you can play around with the shell in the xvt and see what you can and can't do in the container.

I'm sure there were better ways to do most of this, but I think there's value in having such a simplistic demo to go alongside the various GNOMEish apps.


Thanks to all those!

30 Jan 2016 6:07pm GMT

25 Jan 2016


Peter Hutterer: libinput and semi-mt touchpads

libinput 1.1.5 has a change in how we deal with semi-mt touchpads, in particular: interpretation of touch points will cease and we will rely on the single touch position and the BTN_TOOL_* flags instead to detect multi-finger interaction. For most of you this will have little effect, even if you have a semi-mt touchpad. As a reminder: semi-mt touchpads are those that can detect the bounding box of two-finger interactions but cannot identify which finger is which. This provides some ambiguity, a pair of touch points at x1/y1 and x2/y2 could be a physical pair of touches at x1/y2 and x2/y1. More importantly, we found issues with semi-mt touchpads that go beyond the ambiguity and reduce the usability of the touchpoints.

Some devices have an extremely low resolution when two-fingers are down (see Bug 91135), the data is little better than garbage. We have had 2-finger scrolling disabled on these touchpads since before libinput 1.0. More recently, Bug 93583 showed that some semi-mt touchpads do not assign the finger positions for some fingers, especially when three fingers are down. This results in touches defaulting to position 0/0 which triggers palm detection or results in scroll jumps, neither of which are helpful. Other semi-mt touchpads assign a straightforward 0/0 as position data and don't update until several events later (see Red Hat Bug 1295073). libinput is not particularly suited to handle this, and even if it did, the touchpad's reaction to a three-finger tap would be noticeably delayed.

In light of these problems, and since these affect all three big semi-mt touchpad manufacturers we decided to drop back and handle semi-mt touchpads as single-finger touchpads with extra finger capability. This means we track only one touchpoint but detect two- and three-finger interactions. Two-finger scrolling is still possible and so is two- and three-finger tapping or the clickfinger behaviour. What isn't possible anymore are pinch gestures and some of the built-in palm detection is deactivated. As mentioned above, this is unlikely to affect you too much, but if you're wondering why gestures don't work on your semi-mt device: the data is garbage.

25 Jan 2016 5:24am GMT

Peter Hutterer: Is Wayland ready yet?

This question turns up a lot, on the irc channel, mailing lists, forums, your local Stammtisch and at weddings. The correct answer is: this is the wrong question. And I'll explain why in this post. Note that I'll be skipping over a couple of technical bits, if you notice those then you're probably not the person that needs to ask the question in the first place.

On your current Linux desktop, right now, you have at least three processes running: the X server, a window manager/compositor and your web browser. The X server is responsible for rendering things to the screen and handling your input. The window manager is responsible for telling the X server where to render the web browser window. Your web browser is responsible for displaying this post. The X server and the window manager communicate over the X protocol, the X server and the web browser do so too. The browser and the window manager communicate through X properties using the X server as a middle man. That too is done via the X protocol. Note: This is of course a very simplified view.

Wayland is a protocol and it replaces the X protocol. Under Wayland, you only need two processes: a compositor and your web browser. The compositor is effectively equivalent to the X server and window manager merged into one thing, and it communicates with the web browser over the Wayland protocol. For this to work you need the compositor and the web browser to be able to understand the Wayland protocol.

This is why the question "is wayland ready yet" does not make a lot of sense. Wayland is the communication protocol and says very little about the implementation of the two sides that you want to communicate.

Let's assume a scenario where we all decide to switch from English to French because it sounds nicer and English was designed in the 80s when ASCII was king so it doesn't support those funky squiggles that the French like to put on every second character. In this scenario, you wouldn't ask "Is French ready yet?" If no-one around you speaks French yet, then that's not the language not being ready, the implementation (i.e. the humans) aren't ready. Maybe you can use French in a restaurant, but not yet in the supermarket. Maybe one waiter speaks both English and French, but the other one French only. So whether you can use French depends very much on the situation. But everyone agrees that eventually we'll all speak French, even though English will hang around for ages until it finally falls out of use. And those squiggles are so cute!

Wayland is the same. The protocol is stable and has been for a while. But not every compositor and/or toolkit/application speak Wayland yet, so it may not be sufficient for your use-case. So rather than asking "Is Wayland ready yet", you should be asking: "Can I run GNOME/KDE/Enlightenment/etc. under Wayland?" That is the right question to ask, and the answer is generally "It depends what you expect to work flawlessly." This also means "people working on Wayland" is often better stated as "people working on Wayland support in ....".

An exception to the above: Wayland as a protocol defines what you can talk about. As a young protocol (compared to X with 30 years worth of extensions) there are things that should be defined in the protocol but aren't yet. For example, Wacom tablet support is currently missing. Those are the legitimate cases where you can say Wayland isn't ready yet and where people are "working on Wayland". Of course, once the protocol is agreed on, you fall back to the above case: both sides of the equation need to implement the new protocol before you can make use of it.

Update 25/01/15: Matthias' answer to Is GNOME on Wayland ready yet?

25 Jan 2016 12:27am GMT

24 Jan 2016


Ben Widawsky: A bit on Intel GPU frequency

I've had this post sitting in my drafts for the last 7 months. The code is stale, but the concepts are correct. I had intended to add some pictures before posting, but it's clear that won't happen now. Words are better than nothing, I suppose… Recently I pushed an intel-gpu-tool tool for modifying GPU frequencies. […]

24 Jan 2016 2:08am GMT

Ben Widawsky: True PPGTT [part 3]

  • EDIT1: I forgot to include a diagram I did of the software state machine for some presentation. I long lost the SVG, and it got kind of messed up, but it's there at the bottom.
  • EDIT2: (Apologies to aggregators) Grammar fixes. Fixed some bugs in a couple of the images.
  • EDIT3: (Again, apologies to aggregators) s/indirect rendering/direct rendering. I had to fix this or else the sentence made no sense.
  • EDIT4 (2017-07-13): I was under the impression we were not yet allowed to talk about preemption. But apparently we are. So feature matrix at the bottom is updated.

The Per-Process Graphics Translation Tables provide real process isolation among the various graphics processes running within an i915 based system. When in use, the combination of the PPGTT and the Hardware Context provide the equivalent of the traditional CPU process. Most of the same capabilities can be provided, and most of the same limitations come with it. True PPGTT encompasses all of the functionality currently merged into the i915 kernel driver that support page tables and address spaces. It's called, "true" because the Aliasing PPGTT was introduced first and often was simply called, "PPGTT."

The True PPGTT patches represent one of the more challenging aspects of working on a project like the Linux kernel. The feature couldn't realistically be enabled in isolation of the existing driver. When regressions occur it's likely that the user gets no display. To say we get chided on occasion would be an understatement. Ipso facto, this feature is not enabled by default. There are quite a few patches on the mailing list that build new functionality on top of this support, and to help stabilize existing support. If one wishes to try enabling the real PPGTT, one must simply use the i915 module parameter: enable_ppgtt=2. I highly recommended that the stability patches be used unless you're reading this in some future where the stability problems are fixed upstream.

Unlike the previous posts where I tried to emphasize the hardware architecture for this feature, the following will not go into almost no detail about how hardware works. There won't be PRM references, or hardware state machines. All of those mechanics have been described in parts 1 and part 2

A Brief History of the i915 Graphics Process

There have been three stages of the definition of a graphics process within the i915 driver. I believe that by explaining the stages one can get a better appreciation for the capabilities. In the following pictures there is meant to be a highlighted region (yellow in the first two, yellow, orange and blue in the last) that denote the scope of a GPU context/process with the specified feature. Incrementally the definition of a process begins to bleed between the CPU, and the GPU.

Unfortunately I have some overlap with my earlier post about Hardware Contexts. I found no good way to write this post without doing so. If you read that post, consider this a refresher.

File Descriptors

Initially all GPU state was shared by every GPU client. The only partition was done via the operating system. Every process that does direct rendering will get a file descriptor for the device. The file descriptor is the thing through which commands are submitted. This could be used by the i915 driver to help disambiguate "who" was doing "what." This permitted the i915 kernel driver to prevent one GPU client from directly referencing the buffers owned by a different GPU client. By making the buffer object handles per file descriptor (this is very easy to implement, it's just an idr in the kernel) there exist no mechanism to reference buffer handles from a different file descriptor. For applications which do not require context saved, non-buggy apps, or non-malicious apps, this separation is still perfectly sufficient. As an example, BO handle #1 for the X server is not the same as BO handle #1 for xonotic since each has a different file descriptor1. Even though we had this partition at the software level, nothing was enforced by the hardware. Provided a GPU client could guess where another buffer resided, it could easily operate on that buffer. Similarly, a GPU client could not expect the GPU state it had set previously to be preserved for any amount of time.

File descriptor isolation. Before hardware contexts.File descriptor isolation.Before hardware contexts.

Hardware Contexts

The next step towards isolation was the Hardware Context2. The hardware contexts built upon the isolation provided by the original file descriptor mechanism. The hardware context was an opt-in interface which meant that those not wishing to use the interface received the old behavior: they could purposefully or accidentally use the state from another GPU client3. There was quite a bit of discussion around this at the time the patches were in review, and there's not really any point in lamenting about how it could be better, now.

The context exists within the domain of the process/file descriptor in the same way that a BO exists in that domain. Contexts cannot be shared [intentionally]. The interface created was, and remains extremely simple.

struct drm_i915_gem_context_create {
    /* output: id of new context*/
    __u32 ctx_id;
    __u32 pad;

struct drm_i915_gem_context_destroy {
    __u32 ctx_id;
    __u32 pad;

As you can see from the two IOCTL payloads above, I wasn't lying about the simplicity. Because there was not a great deal of variable functionality, there just wasn't a lot to add in terms of the interface. Destroy is an optional call because we have the file descriptor and can clean up if a process does not. The primary motivation for destroy() is simply to allow very meticulous and memory conscious GPU clients to keep things tidy. Earlier I had a list of 3 types of GPU clients that could survive without this separation. Considering their inverse; this takes one of those off the list.

  • GPU clients needed HW context preserved
  • Buggy applications writing to random memory
  • Malicious applications

The block diagram is quite similar to above diagram with the exception that now there are discrete blocks for the persistent state. I was a bit lazy with the separation on this drawing. Hopefully, you get the idea.

Hardware context isolationHardware context isolation


The last piece was to provide a discrete virtual address space for each GPU client. For completeness, I will provide the diagram, but by now you should already know what to expect.

PPGTT, full isolationPPGTT, full isolation

If I write about this picture, there would be no point in continuing with an organized blog post :-). So I'll continue to explain this topic. Take my word for it that this addresses the other two types of GPU clients

  • GPU clients needed HW context preserved
  • Buggy applications writing to random memory
  • Malicious applications

Since the GGTT isn't really mentioned much in this post, I'd like to point out that the GTT still exists as you can see in this diagram. It is required for several components that were listed in my previous blog post.

VMAs and Address Spaces (AKA VMs)

The patch series which began to implement PPGTT was actually a separate series. It was the one that introduced the Virtual Memory Area for the PPGTT, simply referred to as, VMA4. You can think of a VMA in a very similar way to a GEM BO. It is an identifiable, continuous range within an address space. Conceptually there isn't much difference between a GEM BO. To try to define it in my horrible math jargon: a logical grouping of virtual addresses representing an operand for some GPU operation within a given PPGTT domain. A VMA is uniquely identified via the tuple (BO, Address space). In the likely case that I made no sense just there, a VMA is just another handle on a chunk of GPU memory used for rendering.

Sharing VMAs

You can't (see the note at the bottom). There's not a whole lot I can say without doing another post about DMA-Buf, and/or Flink. Perhaps someday I will, but for now I'll keep things general and brief.

It is impossible to share a VMA. To repeat, a VMA is uniquely identifiable by the address space, and a BO. It remains possible to share a BO. An address space exists for an individual GPU client's process. Therefore it makes no sense to share a VMA since the address space cannot be shared5. As a result of using the existing sharing interfaces a GPU will get multiple VMAs that reference the same BO. Trying to go back to the math jargon again:

  1. VMA: (BO, Address Space) // Some BO mapped by the address space.
  2. VMA′: (BO′, Address Space) // Another BO mapped into the address space
  3. VMA″: (BO, Address Space′) // The same BO as 1, mapped into a different address space.
VMA : PPGTT :: BO : GGTTM = {1,2,3,…} N = {1,2,3,…}

In case it's still unclear, I'll use an example (which is kind of a simplified/false demonstration). The scanout buffer is the thing which is displayed on the screen. When doing frontbuffer rendering, one directly renders to that buffer. If we remember my previous post, the Display Engine requires a GGTT mapping. Therefore we know we have VMAglobal. Jumping ahead, a GPU client cannot have a global mapping, therefore, to render to the frontbuffer it too has a VMA, VMApp. There you have two VMAs pointing to the same Buffer Object.

NOTE: You can actually share VMAs if you are already sharing a Context/PPGTT. I can't think of any real world examples off of the top of my head, but it is possible, and potentially a useful thing to do.

Data Structures

Here are the relevant data structures cropped for the sake of brevity.

struct i915_address_space {
        struct drm_mm mm;
    unsigned long start;            /* Start offset always 0 for dri2 */
    size_t total;           /* size addr space maps (ex. 2GB for ggtt) */
    struct list_head active_list;
    struct list_head inactive_list;

struct i915_hw_ppgtt {
        struct i915_address_space base;
    int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
             struct intel_engine_cs *ring,
             bool synchronous);

struct i915_vma {
        struct drm_mm_node node;
        struct drm_i915_gem_object *obj;
        struct i915_address_space *vm;

The struct i915_hw_ppgtt is a subclass of a struct i915_address_space. Only two implementors of i915_address space exist: the i915_hw_ppgtt (a PPGTT), and the i915_gtt (the GGTT). It might make some sense to create a new PPGTT subclass for GEN8+ but I've not opted to do this. I feel there is too much duplication for not enough benefit.

I've already explained in different words that a range of used address space is the VMA. If the address space has the drm_mm, then it should make direct sense that the VMA has the drm_mm_node because this is the used part of the address space6. In the i915_vma struct above is a pointer to the address space for which the VMA exists, and the object the VMA is referencing. This provides the tuple that define the VMA.

HOLE 0x0->0x64000 VMA 1 0x64000->0x69000 HOLE 0x69000->512M VMA 2 512M->512.004M HOLE 1.5GBHOLE 0x0->0x64000
VMA 1 0x64000->0x69000
HOLE 0x69000->512M
VMA 2 512M->512.004M
HOLE ~512M->2GB
Allocated space: 0x6000 Free space: 0x7fffa000

Relation to the Hardware Context

struct intel_context {
    struct kref ref;
    int id;
    struct i915_address_space *vm;

With the 3 elements discussed a few times already: file descriptor, context, PPGTT, we get real GPU process isolation. Since the context was historically an opt-in interface, changes needed to be made in order to keep the opt-in behavior yet provide isolation behind the scenes regardless of what the GPU client tried to do. If this was not done, then innocent GPU clients could feel the wrath. The file descriptor was already intimately connected with the direct rendering process (one cannot render without getting a file descriptor), it made sense to hook off of that to create the contexts and PPGTTs.

Implicit Context ("private default context")

From here on out we can consider a, "context" as the 3 elements: fd, HW context, and a PPGTT. In the driver as it exists today if a GPU client does not provide a context for rendering, it cannot rely on GPU state being preserved. A context is created for GPU clients that do not provide one, but the state of this context should be considered completely opaque to all GPU clients. I've called this the Private Default Context as it very much resembles the default context that exists for the whole system (again, let me point you to the previous blog post on contexts). The driver will isolate the various contexts within the system from implicit contexts, and vice versa. Hardware state is undefined while using the private default context. Hardware state maintains it's state from the previous render operation when using the IOCTLs.

The behavior of the implicit context does result in waste when userspace uses contexts (as mesa/libgl does). There are a few solutions to this problem, and I've submitted patches for all of them (I can count 3 off the top of my head). Perhaps one day in the not too distant future, this above section will be false and we can just say - every process will get a context when they open the DRI file. If they want more contexts, they can use the IOCTL.

Multi Context

A GPU client can create more than one context. The context they wish to use for a given rendering command is built into the execbuffer2 API (note that KMS is not context savvy).

struct drm_i915_gem_execbuffer2 {
     * List of gem_exec_object2 structs
    __u64 buffers_ptr;
    __u32 buffer_count;

    /** Offset in the batchbuffer to start execution from. */
    __u32 batch_start_offset;
    /** Bytes used in batchbuffer from batch_start_offset */
    __u32 batch_len;
    __u64 flags;
    __u64 rsvd1; /* now used for context info */
    __u64 rsvd2;

A process may wish to create several GL contexts. The API allows this, and for reasons I don't understand, it's something some applications wish to do. If there was no mechanism to create a new contexts, userspace would be forced to open a new file descriptor for each GL context or else they would not reap the benefits of everything we've discussed for a GL context.

The Big Picture - literally



One of the more contentious topics in the very early stages of development was the relationship and connection of a PPGTT and a HW context.

Quoting myself from one of earlier public declarations, here:

My long term vision is for contexts to have a 1:1 relationship with a PPGTT. Sharing objects between address spaces would work similarly to the flink/dmabuf model if needed.

My idea was to embed the PPGTT within the context structure, and creating a context always resulted in a new PPGTT. Creating a PPGTT by itself would have been impossible. This is not what we ended up doing. The implementation allows multiple hardware contexts to share a PPGTT. I'm still unclear exactly what is needed to support share groups within OpenGL, but it has been speculated that this is a requirement for share groups. Fundamentally this would allow the client to create multiple GPU contexts that share an address space (it resembles what you'd get back when there was only HW contexts). The execbuffer2 IOCTL allows one to specify the context. Behaviorally however, my proposal matches what is in use currently. I think it's a bit easier to think of things this way too.

Current Mesa Current DDX 2 hypothetical scenariosCurrent Mesa
Current DDX
2 hypothetical scenarios


Please feel free to send me issues or questions.
Oh yeah. Here is a state machine that I did for a presentation on this. Things got rendered weird, and I lost the original SVG file, but perhaps it will be of some value to someone.

State MachineState Machine


As I alluded to earlier, there is still some work left to do in order to get this feature turned on by default. I gave the links to some patches, and the parameter to make it happen. If you feel motivated to help get this stuff moving forward, test it, report bugs, try to fix stuff, don't yell at me when things break :-).


That's most of it. I like to give the 10 second summary.

  1. i915_vma, i915_hw_ppgtt, i915_address_space: important things.
  2. The GPU has a virtual address space per DRI file descriptor.
  3. There is a connection between the PPGTT, and a Hardware Context.
  4. VMAs are backed by BOs which are backed by physical pages.
  5. GPU clients have some flexibility with how they interact with contexts, and therefore the PPGTT.

And finally, since I compared our now well defined notion of a GPU process to the traditional CPU process, I wanted to create a quick list of what I think are some interesting data points regarding the capabilities of the processors.

Thing,Modern X86 CPU,Modern i915 GPU
Phys Address Limit,48b?,~40b
Process Isolation, Yes, Yes (with True PPGTT)
Virtual Address Space, Yes, Yes
64b VA Space, Yes, GEN8+ 48b only
PTE access controls, Yes, No
Page Fault Handling, Yes, No
Preemption ((I am defining the word, "preemption" as the ability to switch at an arbitrary point in time between contexts. On the CPU, this is <em>easily</em> accomplished. The GPU running the i915 driver, as of today, has no way to do this. Once a batch is running, it cannot be interrupted except for RC6.)), Yes, *With execlists

So while True PPGTT brings the GPU closer to having all of the [what I consider to be] interesting features of a modern x86 CPU - it still has a ways to go. I would be surprised if things didn't continue going in this direction.

SVG Links

As usual, please feel free to do something useful with the images I've created. Also as usual, they are really poorly named.

Download PDF

  1. It's technically possible to make them be the same BO through the two buffer sharing mechanisms.

  2. Around the same time Hardware Contexts were introduced, so was the Aliasing PPGTT. The Aliasing PPGTT was interesting, however it does not contribute to any part of the GPU "process"

  3. Hardware contexts use a mechanism which will inhibit the restoration of state when not opted-in. This means if one GPU client does opt-in, and another does not, the client without contexts can reuse the state of the client with contexts. As the address space is still shared, this is actually a really dangerous thing to allow.

  4. I would have preferred the reservation of a space within the address space be called a, "GVMA", but that was shot down during review

  5. There's a whole section below describing how this statement could be false. For now, let's pretend address spaces can't be shared

  6. For those unfamiliar with the Direct Render Manager memory manager, a drm_mm is the structure for the memory manager provided by the DRM midlayer helper layer. It does all the things you'd expect out of a memory manager like, find free nodes, allocate nodes, free up nodes… A drm_mm_node is a structure representing an allocation from the memory manager. The PPGTT code relies entirely on thedrm_mm and the DRM helper functions in order to actually do the address space allocations and frees.

24 Jan 2016 1:40am GMT

Ben Widawsky: Future PPGTT [part 4] (Dynamic page table allocations, 64 bit address space, GPU “mirroring”, and yeah, something about relocs too)


GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you're going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I'd write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

  1. CODE: The code is not not merged, not reviewed, and not tested (by anyone but me). There's no indication about the "upstreamability". What this means is that if you read my blog to understand how the i915 driver currently works, you'll be taking a crap-shoot on this one.
  2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can't refer to them directly, I can only refer to the code.
  3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn't possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it's still fresh (for some very warped definition of "fresh").


GPU mirroring is the goal. Dynamic page table allocations are very valuable by itself. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something which should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). TYou would require about 8MB for page tables to map one page of system memory. With the dynamic page table allocations, this would be reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself. On current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though and should anything from the series get merged, and I believe the result is a huge improvement in code readability.

Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn't nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature Status TODO
Dynamic page tables Implemented Test and fix bugs
64b Address space Implemented Test and fix bugs
GPU mirroring Proof of Concept Decide on interface; Implement interface.1

Testing has been limited to just one machine, mine, when I don't have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

Present: Relocations

Throughout many of my previous blog posts I've gone out of the way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It's impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT supportCurrent PPGTT support

To get to the above state, something like the following would happen.

  1. Create BOx
  2. Create BOy
  3. Request BOx be uncached via (IOCTL DRM_IOCTL_I915_GEM_SET_CACHING).
  4. Do one of aforementioned operations on BOx and BOy
  5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages; which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS for instance does not need to know anything about where Scratch Space Base Offset
is actually located. It needs to provide an offset to the General State Base Address. The General State Base Address does need to be known by userspace:

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which needs an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.

Future: No relocations

GPU MirroringGPU Mirroring

The diagram above demonstrates the goal. Symmetric mappings to a BO on both the GPU and the CPU. There are benefits for ditching relocations. One of the nice side effects of getting rid of relocations is it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it's still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and 
kernels executing on devices to directly share complex, pointer-containing data structures such 
as trees and linked lists. It also eliminates the need to marshal data between the host and devices. 
As a result, SVM substantially simplifies OpenCL programming and may improve performance."

Makin' it Happen


As I've already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table HierarchyPage Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you're probably left with a big 'WTF' face. I probably can't help you if you're in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you'd expect them to. Instead of the x86 CPU's CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86. <a href="http://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses" target="_blank">This wikipedia article</a> explains it pretty well, and has links.

"This will take one week. I can just allocate everything up front." (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. "One week. I can just allocate everything up front." If what I have is, "done" then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

Page for the PML4
512 PDP pages per PML4 (512, ok we actually use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 512^2 + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256G = oops

Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4 level pages tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can't use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long   pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
        return pte.pte;

static inline pteval_t pte_flags(pte_t pte)
        return native_pte_val(pte) & PTE_FLAGS_MASK;

static inline int pte_present(pte_t a)
        return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
        return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))

#define pgd_offset(mm, address) ( (mm)->pgd + pgd_index((address)))
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
        return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current->mm;
unsigned long address = 0xdefeca7e;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
ptep = pte_offset_map(pmd, address);
printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");

X86 page table code has a two very distinct property that does not exist here (warning, this is slightly hand-wavy).

  1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don't know where our pages tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
  2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don't want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find if a page is present or not without taking a fault, it need only look to one of those two options. After about of week of making the IOMMU driver do things it shouldn't do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It's difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn't a good solution, but at that time I just wanted to get the thing working and didn't really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won't work. For the following, think of the four level page tables as arrays. ie.

  1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer
  2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM]Find that address 0 is free.
    2. [i915]Allocate PDP for PML4[0]
    3. [i915]Allocate PD for PDP[0][0]
    4. [i915]Allocate PT for PD[0][0][0]/li>
    5. [i915](condensed)Set pointers from PML4->PDP->PD->PT
    6. [i915]Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO's backing page.
  3. [i915] Dispatch work to the GPU on behalf of mesa.
  4. [i915] Observe the hardware has completed
  5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
  6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM]Find that address 0x200000 is free.
    2. [i915]Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
    3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
    4. Abort.

Page Tables Tracking with Bitmaps

Okay. I could have used a sentinel for empty entries. It is possible to achieve this same thing by using a sentinel value (point the page table entry to the scratch page). To implement this involves reading back potentially large amounts of data from the page tables which will be slow. It should work though. I didn't try it.

After I had determined I couldn't reuse x86 code, and that I need some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went - none of the upsides of a hash table are useful here, but all of the downsides are present(space). Bitmaps was sort of the default case. Unfortunately though, I did some math at this point, notice the LaTex!.
\frac{2^{47}bytes}{\frac{4096bytes}{1 page}} = 34359738368 pages \ 34359738368 pages \times \frac{1bit}{1page} = 34359738368 bits \ 34359738368 bits \times \frac{8bits}{1byte} = 4294967296 bytes
That's 4GB simply to track every page. There's some more overhead because page [tables, directories, directory pointers] are also tracked.
 256entries + (256\times512)entries + (256\times512^2)entries = 67240192entries \ 67240192entries \times \frac{1bit}{1entry} = 67240192bits \ 67240192bits \times \frac{8bits}{1byte} = 8405024bytes \ 4294967296bytes + 8405024bytes = 4303372320bytes \ 4303372320bytes \times \frac{1GB}{1073741824bytes} = 4.0078G

I can't remember whether I had planned to statically pre-allocate the bitmaps, or I was so caught up in the details and couldn't see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you're using 128TB of memory is inconsequential. That is 0.3% of the memory used by the GPU. Hopefully you didn't fall into that trap, and I just wasted your time, but there it is anyway.

Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would "walk" to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry)

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry)

static struct i915_tab *
alloc_one_pt(struct i915_pagedir *pd, int entry)

 * alloc_page_tables - Allocate all page tables for the given virtual address.
 * This will allocate all the necessary page tables to map exactly one page at
 * @address. The page tables will not be connected, and the PTE will not point
 * to a page.
 * @ppgtt:  The PPGTT structure encapsulating the virtual address space.
 * @address:    The virtual address for which we want page tables.
static void
alloc_page_tables(ppgtt, unsigned long address)
    struct i915_pagetab *pt;
    struct i915_pagedir *pd;
    struct i915_pagedirpo *pdp;
    struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

    int pml4e = (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
    int pdpe = (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
    int pde = (address >> GEN8_PDE_SHIFT) & I915_PDE_MASK;
    int pte = (address & I915_PDES_PER_PD);

    if (!test_bit(pml4e, pml4->used_pml4es))
        goto alloc;

    pdp = pml4->pagedirpo[pml4e];
    if (!test_bit(pdpe, pdp->used_pdpes;))
        goto alloc;

    pd = pdp->pagedirs[pdpe];
    if (!test_bit(pde, pd->used_pdes)
        goto alloc;

    pt = pd->page_tables[pde];
    if (test_bit(pte, pt->used_ptes))

    pdp = alloc_one_pdp(pml4, pml4e);
    set_bit(pml4e, pml4->used_pml4es);
    pd = alloc_one_pd(pdp, pdpe);
    set_bit(pdpe, pdp->used_pdpes);
    pt = alloc_one_pt(pd, pde);
    set_bit(pde, pd->used_pdes);

Here is a picture which shows the bitmaps for the 2 allocation example above.

Bitmaps tracking page tablesBitmaps tracking page tables

The GPU mirroring interface

I really don't want to spend too much time here. In other words, no more pictures. As I've already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I've submitted, 2 changes were made to the existing userptr interface (which wasn't then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

struct drm_i915_gem_userptr {
    __u64 user_ptr;
    __u64 user_size;
    __u32 ctx_id;
    __u32 flags;</p>

<h1>define I915_USERPTR_READ_ONLY          (1&lt;&lt;0)</h1>

<h1>define I915_USERPTR_GPU_MIRROR         (1&lt;&lt;1)</h1>

<h1>define I915_USERPTR_UNSYNCHRONIZED     (1&lt;&lt;31)</h1>

 * Returned handle for the object.
 * Object handles are nonzero.
__u32 handle;
__u32 pad;


The context argument is to tell the i915 driver for which address space we'll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.

What should be: soft pin

There hasn't been too much discussion here, so it's hard to say. I believe the trends of the discussion (and the author's personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and use the presumed_offset field that already exists. This is sometimes called, "soft pin." It is a bit of a chicken and egg problem since the amount of work in userspace to make this useful is non-trivial, and the feature can't merged until there is an open source userspace. Stay tuned. Perhaps I'll update the blog as the story unfolds.

Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions.

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address space. Hopefully you remember what you just read about 64b and GPU mirror. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that related to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, "buy me a beer if you liked this", I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

Image links

The images I've created. Feel free to do with them as you please.

Download PDF

  1. The patches I posted for enabling GPU mirroring piggyback of of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the point of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN"

  2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have CPU observe data which the GPU has written (output)

  3. The PDP registers are are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect.

  4. I am not sure how KVM works manages page tables. At least conceptually I'd think they'd have a similar problem to the i915 driver's page table management. I should have probably looked a bit closer as I may have been able to leverage that; but I didn't have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took

  5. Let me be clear that I don't think userptr is a bad thing. It's a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring

24 Jan 2016 1:40am GMT

20 Jan 2016


Peter Hutterer: X.Org the project vs X.Org the Foundation

In light of recent general confusion between X.Org the technical project and X.Org the Foundation here's a little overview.

X.Org the project

X.Org is the current reference implementation of the X Window System which has been around since the mid-80s. Its most prominent members is the X server and the related drivers but we put a whole bunch of other things under the same umbrella, e.g. mesa, drm, and - yes - wayland. Like most free software projects it is loosely organised and very few developers are involved in everything, everybody has their niche. If you're running Linux or a BSD and you can see a desktop environment in front of you, X.Org the technical project is somewhere in that stack.

X.Org the Foundation

The foundation is a non-profit organisation tasked with the stewardship of the X Window System, particularly the X.Org implementation. The most important thing is: the X.Org Foundation does not control the technical direction, it acts in a supporting role only. X.Org has a 501(c)3 tax code in the US which means that donations can be tax deducted (though we haven't collected donations in years). It also means that how we can spend money is very restricted. These days the Foundation's supporting roles are largely: sponsoring the annual X Developers Conference (XDC), providing travel sponsorship to XDC attendees and be the organisation to participate in the Google Summer of Code. Oh, and did I mention that the X.Org Foundation does not control the technical direction?

What does it matter?

The difference matters, especially for well-nuanced and thought-out statements like "X must die" in response to articles about the X.Org Foundation. If you want the Foundation to cease to exist, you're essentially saying "XDC and X.Org's GSoC participation must die". Given that a significant percentage of those two are now Wayland-related that may have some unintended side-effects. If you want the technical project to die, it may be wise to consider the side-effects. Wayland isn't quite ready yet, much of the work that is done under the umbrella of X benefits Wayland (libinput, graphics driver work, etc.).

Now if you excuse me, there's a windmill that needs tilting at. Rocinante, where are you?

20 Jan 2016 1:11am GMT

14 Jan 2016


Daniel Vetter: VT Switching with Atomic Modeset

First the title is a slight lie, this really is about compositor switching and not necessarily about using Linux VTs for that. But I hope that the title draws in the right folks and tempts them to read this. Since with atomic there's a bit a problem if you want to switch between different compositors - maybe you have X running and hack on wayland-mutter, or kwin and mutter or just a DE and a login manager - and expect it to not end up in a modern arts project like this.

Now the trouble with atomic modesetting and switching between different compositors is that atomic display updates are incremental for two reasons:

But if you mix this with a bunch of different compositors which all understand different subsets of all the atomic extensions a driver supports, suddenly the assumption that unhandled values have reasonable settings becomes invalid, and partial updates become a problem. Recently there's been a discussions on mailing lists and IRC about how to solve this, which ultimately ended in the conclusion that us kernel folks don't really know what would work best for distros, desktop environments and their compositors. Just going ahead with some new kernel ABI could easily result in a mistake that we have to support for the next 10 years. Therefore this blog post here covers the ideas with come up with to tackle this, just to make it clear that kernel folks are aware of this gap. As soon as userspace people with real clue about this topic run into problems they're more than welcome on dri-devel@lists.freedesktop.org and then we can figure out what to implement.

With that out of the way, let's look at possible solutions.

FBDEV resets to defaults

This is the cheap cop-out that is essentially implemented right now - when switching to a kernel console running on top of the FBDEV emulation KMS drivers can provide, the driver resets atomic properties of new extensions (like rotation, color management, Z-position/alpha/blending, whatever) to hopefully sane defaults. That's good enough for developers hacking around on different compositors, as long as you remember to VT-switch to a kernel console after things went south.

But FBDEV is seriously uncool, and a lot of people are working towards removing the kernel's VT subsytem from modern distros, too. This doesn't really work everywhere, but it's kinda the minimal and what we'll definitely implement. This has also the downside that maybe you only want to restore some properties, while keeping others (since they might be crucial to your setup, for example rotating the screen).

System compositor restores boot-up state

If doing something in the kernel isn't flexible enough then the usual approach is to do it in userspace. Most systems have some kind of master compositor that's run in-between user sessions, like a login manager. That system compositor could restore modeset state to something sensible every time it runs again, and user session compositors could then take over that sensible setup as their starting point.

This has the benefit that more clever and specific stuff like only restoring some properties is easy to implement. This shouldn't need a hole lot of code since a compositor needs to be able to restore it's state anyway to allow switching between them.

But, you're saying, how can a system compositor know about all the possible and future atomic KMS extensions? Won't this scheme break the much heralded extensibility? Well the neat trick is that userspace doesn't need to be able to understand properties to save and restore them - the actual property value transport between kernel and userspace is fully generic. There are a few special cases, like the need to disable outputs that have been unplugged meanwhile, and also some properties with special meaning, like framebuffers. But generic userspace can read out all the metadata, and even if future property types extend e.g. the value range that should still work.

The downsides of this approach is that it depends upon the global system compositor to do this right, so if that crashes and leaves your display setup in shambles you can't easily recover any more. This also requires that everyone who cares about switching between different compositors to have such a system compositors.

Interlude: Only launching compositors matters

The above observation that atomic clients can always faithfully restore a state (even if they don't understand the semantics of all properties) means that switching compositors itself will always work. The real trouble is making sure that a compositor can start up with a sane enough configuration to be able to successfully get pixels onto the screen.

New atomic IOCTL kernel flag

What might be needed is a way to make this safe state persistent in the kernel. The next option tries that, by adding a new flag to the atomic IOCTL which asks the kernel to start out with an atomic KMS state reset to some default value.

But again the trouble here is, like with the FBDEV approach, that it's monolithic, and doesn't easily allow userspace to control what's being restored. And again the question is, should things get restored to the boot-up state (and which boot-up state - something equivalent to what FBDEV emulation would pick, what the firmware would have picked or a mix), or maybe reset values (set everything to unrotated) is better?

On top of that most often compositors don't want to reset state at all, to be able to smoothly take over the display configuration from the preceeding KMS client. Users have lost pretty much all appreciation of unsightly flickering that commonly happened in the pre-KMS world when switching compositors.

Another problem with keeping the boot-up state around is that the kernel then needs to keep a copy of all such state (and some objects like gamma tables are sizeable) around. Just in case there's userspace around to ask for it.

Use SysRq to reset atomic state

A problem with adding a flag to the atomic IOCTL to reset state is that all compositors need to implement support for it, and somehow make a decision for when to employ it. But on most systems compositors shouldn't completely mess up the atomic state. Most likely that's after a crash, and then there's no userspace around anyway to fix things up. Now generally this is, or well should, only be a problem for developers, and a possible solution might be to wire up a SysRq hotkey where the kernel force-resets atomic state to defaults. This would be similar to the FBDEV based solution, except without FBDEV and not tied to VT switching.

An alternative would be to implement this in the boot-splash service, by sampling boot-up state and providing some command to force-reset to that. But that's pretty much the system compositor approach, but using the boot splash, and a tool to restore its state from a stored location, as the system compositor.

Expose reset or boot-up state

An easy fix to give control back to userspace over what will get restored is to instead expose the boot-up values or the reset values to userspace, through an extension to the GET_PROPERTY IOCTL. But again storing boot-up state in the kernel would be wasteful on systems that will never need it (like Android or CrOS), and exposing reset values somewhat pointless if e.g. you really want your screen rotated, always.

Per-compositor atomic state

A slight spiel on all this is to make atomic state per-compositor in the kernel. This sounds a bit like it might help, but on the other hand implementing full state restore isn't more effort when compositors need to restore their state after a VT switch anyway. This leaves the tricky question of what the inherited state should be when a new compositor starts up: Normally you want the current state, so that the compositor can take over smoothly. Except when that's a really bad idea because the current state is badly mangled from a different compositor that just crashed.

Overall lots of different approaches and ideas, but no clear winner. That's why kernel folks need distro, compositor and desktop people to run into this issue first, to make sure the solution that lands actually solves the right problem. And in a way that suits userspace.

Thanks to Daniel Stone, Pekka Paalanen and Ville Syrjälä for input on this.

14 Jan 2016 1:55pm GMT

11 Jan 2016


Daniel Vetter: Neat drm/i915 stuff for 4.5

Kernel version 4.4 is released, it's time for our regular look at what's in store for the Intel graphics driver in the next release.

Overall this cycle has seen lots and lots of bugfixes, and the reason for that is that we're rebuilding our CI infrastructure after it went up in a poof of smoke last summer. Really big thanks to the entire team for the effort invested! And that's why this overview is a bit different and we'll start with bugfix efforts before delving into the few feature additions:

Ville fixed up display fifo underruns all over the place: FDI modeset fixes for Haswell/Broadwell, correctly detecting fused-of VGA on the same, and disabling fifo underrun reporting in some places where we've learned that underruns just happen - mostly around starting up the display pipeline.

Next up is improved runtime PM wakelock debugging from Imre Deak, with efforts from other folks trying to fix up various issues. Unfortunately this turned up so many little buglets that we had to disable the reporting again, at least for now. But at least we'll now have a very clear list of things to address, and a reliable way to audit any failures, to finally be able to enable runtime PM by default hopefully soon.

There's also been lots of fixes for PSR and FBC from Rodrigo and Paulo respectively. PSR enabled by default missed the 4.5 merge window by just a hair - it's already enabled for 4.6. FBC is also pretty close, the last bit Paulo is working is untangling the locking issues. FBC is sitting between GEM and KMS and FBC code gets called by both subsystems. And that's an easy recipe for deadlocks, which Paulo is now working to resolve.

There's also fixes included from Tvrtko Ursulin, Lukas Wunner and Chris Wilson to remedy some long-standing regressions in the fbdev framebuffer setup code. In GEM finally Dave Gordon fixed issues with the page dirty tracking. And Chris Wilson fine-tuned the request polling logic to avoid needlessly wasting CPU cycles.

Imre, Patrik and others have done a lot of work to fix up various issues in the DMC firmware loader for Skylake and the DC5/6 support. It works well now on that platform, and could reenable the overall display power well support again on Skylake. But there's still plenty of issues on Broxton unfortunately.

Since bugfixes have been highly prioritized over feature work this time around there's only very little progress on atomic modesettting and specifically atomic watermark updates. But 4.5 includes a few more prep patches from Maarten Lankhorst and Matt Roper.

There have been some real features though still: Alex Goins from nvidia implementd proper sync for page-flipping dma-buf backed framebuffers, benefitting setups where nVidia renders buffers that the Intel driver displays.

Finally there's also been the usual amount of internal refactoring to prepare the code for the future and keep it maintainable. Jani Nikula rewrote the VBT parsing code. And Ander started to rework the DP detection code as the first step of a large DP support revamp. And finally ther's been a bit of enabling for Kabylake too, but it's not yet complete.

And of courese there's been a lot more smaller things, again mostly bugfixes.

11 Jan 2016 4:12pm GMT

10 Jan 2016


Daniel Vetter: Better Markup for the Kernel GPU DocBook

This summer Intel sponsored some work to improve the kerneldoc toolchain, with the aim to use all that to extend the DRM and i915 driver documentation we have. Most of it landed, but the last bit to integrate some type of text markup processing was stalled until it could be discussed at the kernel summit, see the LWN summary. Unfortunately it died in a bikeshed fest due to an alliance of people who think docs are useless and you should just read the code, and others who didn't even know how to convert the kerneldoc into something pretty.

But we still need this, since without lists, highlighting, basic tables and inserting code snippets it's really hard to write decent documentation. Luckily Dave Airlie is ok with using it for DRM kerneldoc as long as Intel maintains the support. It's purely opt-in and the only downside of not using asciidoc is that the resulting docs won't be as pretty. All the changes to the text itself to use this markup are going into upstream as normal. The only bit that's not in upstream is the tooling, which is available in a topic branch at

        git://anongit.freedesktop.org/drm-intel topic/kerneldoc

If you want to build pretty docs just install asciidoc and base your drm documentation patches on top of drm-intel-nightly from the same repository - that tree also includes all of Dave's tree. Alternatively pull in the above topic branch into your own personal tree. Note that asciidoc is detected automatically, so you really only need it and the tooling branch to check the rendering of your changes.

For added convenience Intel also maintains an autobuilder that pushes latest drm-intel-nightly DRM documentation builds to http://dri.freedesktop.org/docs/drm/.

Aside: If all you want to build is just the GPU DocBook instead of all of them, you can do that with

        $ make DOCBOOKS="gpu.xml" htmldocs

With that have fun reading the new&improved documentation, and if you spot anything please submit a patch to dri-devel@lists.freedesktop.org.

10 Jan 2016 11:00pm GMT

Bastien Nocera: Support for "Airplane mode" keys

As we were working on audio jack notifications, and were wondering whether the type of notification we'd pop up in this case could be reused in other cases, I encountered a feature request that could now be solved easily with the rfkill D-Bus service we added to gnome-settings-daemon for the 3.10 release.

If you have keyboard buttons on your laptop to enable or disable Bluetooth, or Airplane mode, you can now use them. Note that the "UWB" toggle key will toggle the whole airplane mode mainly because no in-kernel driver uses it, and nobody remembers what UWB is.

Note that the labels and icons used are still subject to changes. In particular as you can see that the labels are too long for lower resolutions.

10 Jan 2016 1:24pm GMT

09 Jan 2016


Bastien Nocera: gom is now usable from JavaScript/gjs

Prodded by me while I snoozed on his sofa and with his cat warming me up, a day before the Content Applications hackfest, Florian Müllner started working on fixing a long-standing gjs bug that made it impossible to use gom in GNOME/JavaScript applications. The result of that initial research came a few days later, and is now part of the latest gjs release.

This also fixes using GtkBuilder and json-glib when the libraries create new objects for the benefit of the JavaScript code.

We can finally use gom to store user data in applications like Bolso. Thanks Florian!

09 Jan 2016 3:23pm GMT

05 Jan 2016


Christian Schaller: Fedora Workstation and the quest for stability and robustness

One of the things that makes me really happy in terms of the public reception to the Fedora Workstation is all the people calling out how stable and solid it is, as this was and is one of our big goals from the start of the Fedora Workstation effort.

From the start we wanted to bury the old idea of Fedora being only for people who didn't mind risking a lot of instability in return for being on the so called bleeding edge. We also wanted to bury the related idea that by using Fedora you where basically alpha testing highly unstable and unfinished software for Red Hat Enterprise Linux. Yet at the same time we did want to preserve and build upon the idea that Fedora is a great operating system if you want to experience a lot of the latest and greatest new developments as they are happening. At first glance those two goals might seem a bit contradictory, but we decided that we should be able to do both by both adjusting our policies a bit and also by relying more on the Fedora retrace server as our bug fixing prioritization tool.

So in terms of policies the division of Fedora into a distinct server and workstation images and also the clearer separation of the spins, allowed us to start making decisions without worrying so much how they affected other usecases than our own. Because sometimes what from a user perspective seems like a bug or something being broken was non-workstation policy decisions getting in the way of the desktop behaving as expected, for instance firewall rules hindering basic desktop functions.

Secondly we incorporated a more careful approach into what and when we brought in new stuff, meaning we still try to keep on top of major upstream developments and be a leading edge system, but at the same time we do a little mental exercise for each decision to make sure its a decision that makes us 'leading edge' and not 'bleeding edge'. And if we really want something in, but it isn't 100% ready for prime time yet we do what we have done with Wayland or the GTK3 port of LibreOffice, we make it available as an option for early adopters, but we default to the safer choice while we work out the last wrinkles. (Btw, if you are interested in progress on Wayland, Kevin Martin, sent out an emailing with a link to a good Wayland development status just before the Holidays.

The final piece of the puzzle is regularly checking and identifying important bugs from the Fedora retrace server. Because like almost all developers we get way more bug reports than we realistically can ever address, so having the data from the retrace server allows us to easily identify the crashes that affect the most users, and just as importantly lets us filter out the bug reports that are likely caused by users installing weird stuff on their system. When we started using retrace various desktop modules tended to dominate the top 3 pages when sorting bugs based on count, but due to a continuous effort over the last few years desktop modules appearing in the top crashers list are few and far between and when they do appear we make sure to get fixes done quickly for them. So if you ever wonder if the data collected by these kind of systems are actually helping developers working on the software you use better, I can say that it is true for Fedora for sure.

That said I thought it could be interesting to explain a bit the challenges we have with tracking our progress in this area. So lets start by looking at a graph I pulled from the retrace server.
Looking at that graph one could say that it is clear that we have made great strides in improving system stability and I do believe that is the case, however the graphs doesn't truly prove that inconclusively, they are just an indication. The reason they are not hard evidence is that there are a lot of things you need to take into consideration when reading them. First of all they are not adjusted based on total user population, which means that if you win or lose a lot of users between releases it can create an appearance of increased instability or decreased instability which is actually due to increase or decrease in user population, not in 'how well does the system run on an individual users system'. So from what we see through other metrics our user population has been increasing since we launched the Fedora Workstation which means we shouldn't be getting any 'help' in these graphs from a declining user population.

A second reason is that there are a lot of false positives being reported here, for instance we had an issue for a long while that the Intel graphics drivers generating a ton of this crash reports without it actually being crashes as such. So while they did represent bugs that should ideally be fixed they where not issues you might actually have noticed as a user of the system. So we spent some effort between Fedora Workstation 21 and Fedora Workstation 22 to reduce the amount of noise caused by this, which was an useful effort for us in terms of reducing noise in our retrace server, but from a user perspective it didn't really make a tangible difference. And even with our efforts there are a still a lot of kernel issues showing up here which are not impacting users in a way that they are likely to perceive as the system being unstable.

A third item that might in a given release skewer the statistics is that we currently don't differentiate between Fedora Workstation and spins in the statistics, which means that there might be issues caused by one of the spins generating a lot of bug reports against a module, but that might be a bug or an API usage issue that is not triggered by the Workstation edition and thus those items appearing or disappearing might affect the statistics, but as a user of the Fedora Workstation you would never experience it.

So keeping this is mind the retrace server is an important tool for us and one that at least gives us a decent indication of how we are doing with quality. But we can always do better so we will keep reviewing the reports we get through the ABRT and retrace systems and I also do strong recommend any application or library maintainers out there to look into what major issues are reported against their own modules.

05 Jan 2016 4:06pm GMT