# Preface

GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you're going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I'd write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

1. CODE: The code is not not merged, not reviewed, and not tested (by anyone but me). There's no indication about the "upstreamability". What this means is that if you read my blog to understand how the i915 driver currently works, you'll be taking a crap-shoot on this one.
2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can't refer to them directly, I can only refer to the code.
3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn't possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it's still fresh (for some very warped definition of "fresh").

# Goal

GPU mirroring is the goal. Dynamic page table allocations are very valuable by itself. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something which should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). TYou would require about 8MB for page tables to map one page of system memory. With the dynamic page table allocations, this would be reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself. On current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit though and should anything from the series get merged, and I believe the result is a huge improvement in code readability.

## Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn't nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature Status TODO
Dynamic page tables Implemented Test and fix bugs
64b Address space Implemented Test and fix bugs
GPU mirroring Proof of Concept Decide on interface; Implement interface.1

Testing has been limited to just one machine, mine, when I don't have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

## Present: Relocations

Throughout many of my previous blog posts I've gone out of the way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It's impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT support

To get to the above state, something like the following would happen.

1. Create BOx
2. Create BOy
3. Request BOx be uncached via (IOCTL DRM_IOCTL_I915_GEM_SET_CACHING).
4. Do one of aforementioned operations on BOx and BOy
5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages; which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS for instance does not need to know anything about where Scratch Space Base Offset
is actually located. It needs to provide an offset to the General State Base Address. The General State Base Address does need to be known by userspace:

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which needs an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing update, and finally submit the work to the GPU.

## Future: No relocations

GPU Mirroring

The diagram above demonstrates the goal. Symmetric mappings to a BO on both the GPU and the CPU. There are benefits for ditching relocations. One of the nice side effects of getting rid of relocations is it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it's still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and
kernels executing on devices to directly share complex, pointer-containing data structures such
as trees and linked lists. It also eliminates the need to marshal data between the host and devices.
As a result, SVM substantially simplifies OpenCL programming and may improve performance."


# Makin' it Happen

## 64b

As I've already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you're probably left with a big 'WTF' face. I probably can't help you if you're in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you'd expect them to. Instead of the x86 CPU's CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86. This wikipedia article explains it pretty well, and has links.


## "This will take one week. I can just allocate everything up front." (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. "One week. I can just allocate everything up front." If what I have is, "done" then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

Page for the PML4
512 PDP pages per PML4 (512, ok we actually use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 512^2 + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256G = oops


### Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4 level pages tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can't use the code directly, surely I can copy. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long   pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
{
return pte.pte;
}

static inline pteval_t pte_flags(pte_t pte)
{
return native_pte_val(pte) & PTE_FLAGS_MASK;
}

static inline int pte_present(pte_t a)
{
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
_PAGE_NUMA);
}
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
}

#define pgd_offset(mm, address) ( (mm)->pgd + pgd_index((address)))
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
}

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current->mm;
unsigned long address = 0xdefeca7e;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
ptep = pte_offset_map(pmd, address);
printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");


X86 page table code has a two very distinct property that does not exist here (warning, this is slightly hand-wavy).

1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don't know where our pages tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don't want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find if a page is present or not without taking a fault, it need only look to one of those two options. After about of week of making the IOMMU driver do things it shouldn't do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

### Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It's difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn't a good solution, but at that time I just wanted to get the thing working and didn't really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won't work. For the following, think of the four level page tables as arrays. ie.

• PML4[0-255], each point to a PDP
• PDP[0-255][0-511], each point to a PD
• PD[0-255][0-511][0-511], each point to a PT
• PT[0-255][0-511][0-511][0-511] (where PT[0][0][0][0][0] is the 0th PTE in the system)
1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer
2. [i915] See new BO in the execbuffer list. Allocate page tables for it…
1. [DRM]Find that address 0 is free.
2. [i915]Allocate PDP for PML4[0]
3. [i915]Allocate PD for PDP[0][0]
4. [i915]Allocate PT for PD[0][0][0]/li>
5. [i915](condensed)Set pointers from PML4->PDP->PD->PT
6. [i915]Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO's backing page.
3. [i915] Dispatch work to the GPU on behalf of mesa.
4. [i915] Observe the hardware has completed
5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
6. [i915] See new BO in the execbuffer list. Allocate page tables for it…
1. [DRM]Find that address 0x200000 is free.
2. [i915]Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
4. Abort.

### Page Tables Tracking with Bitmaps

Okay. I could have used a sentinel for empty entries. It is possible to achieve this same thing by using a sentinel value (point the page table entry to the scratch page). To implement this involves reading back potentially large amounts of data from the page tables which will be slow. It should work though. I didn't try it.

After I had determined I couldn't reuse x86 code, and that I need some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went - none of the upsides of a hash table are useful here, but all of the downsides are present(space). Bitmaps was sort of the default case. Unfortunately though, I did some math at this point, notice the LaTex!.
$\frac{2^{47}bytes}{\frac{4096bytes}{1 page}} = 34359738368 pages \\ 34359738368 pages \times \frac{1bit}{1page} = 34359738368 bits \\ 34359738368 bits \times \frac{8bits}{1byte} = 4294967296 bytes$
That's 4GB simply to track every page. There's some more overhead because page [tables, directories, directory pointers] are also tracked.
$256entries + (256\times512)entries + (256\times512^2)entries = 67240192entries \\ 67240192entries \times \frac{1bit}{1entry} = 67240192bits \\ 67240192bits \times \frac{8bits}{1byte} = 8405024bytes \\ 4294967296bytes + 8405024bytes = 4303372320bytes \\ 4303372320bytes \times \frac{1GB}{1073741824bytes} = 4.0078G$

I can't remember whether I had planned to statically pre-allocate the bitmaps, or I was so caught up in the details and couldn't see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you're using 128TB of memory is inconsequential. That is 0.3% of the memory used by the GPU. Hopefully you didn't fall into that trap, and I just wasted your time, but there it is anyway.

#### Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would "walk" to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry)
{
...
}

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry)
{
...
}

static struct i915_tab *
alloc_one_pt(struct i915_pagedir *pd, int entry)
{
...
}

/**
* alloc_page_tables - Allocate all page tables for the given virtual address.
*
* This will allocate all the necessary page tables to map exactly one page at
* @address. The page tables will not be connected, and the PTE will not point
* to a page.
*
* @ppgtt:      The PPGTT structure encapsulating the virtual address space.
* @address:    The virtual address for which we want page tables.
*
*/
static void
alloc_page_tables(ppgtt, unsigned long address)
{
struct i915_pagetab *pt;
struct i915_pagedir *pd;
struct i915_pagedirpo *pdp;
struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

int pml4e = (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
int pdpe = (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
int pde = (address >> GEN8_PDE_SHIFT) & I915_PDE_MASK;
int pte = (address & I915_PDES_PER_PD);

if (!test_bit(pml4e, pml4->used_pml4es))
goto alloc;

pdp = pml4->pagedirpo[pml4e];
if (!test_bit(pdpe, pdp->used_pdpes;))
goto alloc;

pd = pdp->pagedirs[pdpe];
if (!test_bit(pde, pd->used_pdes)
goto alloc;

pt = pd->page_tables[pde];
if (test_bit(pte, pt->used_ptes))
return;

alloc_pdp:
pdp = alloc_one_pdp(pml4, pml4e);
set_bit(pml4e, pml4->used_pml4es);
alloc_pd:
pd = alloc_one_pd(pdp, pdpe);
set_bit(pdpe, pdp->used_pdpes);
alloc_pt:
pt = alloc_one_pt(pd, pde);
set_bit(pde, pd->used_pdes);
}


Here is a picture which shows the bitmaps for the 2 allocation example above.

Bitmaps tracking page tables

# The GPU mirroring interface

I really don't want to spend too much time here. In other words, no more pictures. As I've already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I've submitted, 2 changes were made to the existing userptr interface (which wasn't then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

struct drm_i915_gem_userptr {
__u64 user_ptr;
__u64 user_size;
__u32 ctx_id;
__u32 flags;
#define I915_USERPTR_GPU_MIRROR         (1<<1)
#define I915_USERPTR_UNSYNCHRONIZED     (1<<31)
/**
* Returned handle for the object.
*
* Object handles are nonzero.
*/
__u32 handle;
};


The context argument is to tell the i915 driver for which address space we'll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.

• Pros:
• This interface is very simple.
• Existing userptr code does the hard work for us
• Cons:
• You need 1 IOCTL per object. Much undeeded overhead.
• It's subject to a lot of problems userptr has5
• Userptr was already merged, so unless pad get's repruposed, we're screwed

## What should be: soft pin

There hasn't been too much discussion here, so it's hard to say. I believe the trends of the discussion (and the author's personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and use the presumed_offset field that already exists. This is sometimes called, "soft pin." It is a bit of a chicken and egg problem since the amount of work in userspace to make this useful is non-trivial, and the feature can't merged until there is an open source userspace. Stay tuned. Perhaps I'll update the blog as the story unfolds.

# Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions.

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address space. Hopefully you remember what you just read about 64b and GPU mirror. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that related to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, "buy me a beer if you liked this", I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

1. The patches I posted for enabling GPU mirroring piggyback of of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the point of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN"

2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have CPU observe data which the GPU has written (output)

3. The PDP registers are are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect.

4. I am not sure how KVM works manages page tables. At least conceptually I'd think they'd have a similar problem to the i915 driver's page table management. I should have probably looked a bit closer as I may have been able to leverage that; but I didn't have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took

5. Let me be clear that I don't think userptr is a bad thing. It's a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring

29 Aug 2015 7:51pm GMT

# Overview

Pictures are the right way to start.

Conceptual view of aliasing PPGTT bind/unbind

There is exactly one thing to get from the above drawing, everything else is just to make it as close to fact as possible.

1. The aliasing PPGTT (aliases|shadows|mimics) the global GTT.

## The wordy overview

Support for Per-process Graphics Translation Tables (PPGTT) debuted on Sandybridge (GEN6). The features provided by hardware are a superset of Aliasing PPGTT, which is entirely a software construct. The most obvious unimplemented feature is that the hardware supports multiple PPGTTs. Aliasing PPGTT is a single instance of a PPGTT. Although not entirely true, it's easiest to think of the Aliasing PPGTT as a set page table of page tables that is maintained to have the identical mappings as the global GTT (the picture above). There is more on this in the Summary section

Until recently, aliasing PPGTT was the only way to make use of the hardware feature (unless you accidentally stepped into one of my personal branches). Aliasing PPGTT is implemented as a performance feature (more on this later). It was an important enabling step for us as well as it provided a good foundation for the lower levels of the real PPGTT code.

In the following, I will be using the HSW PRMs as a reference. I'll also assume you've read, or understand part 1.

## Selecting GGTT or PPGTT

Choosing between the GGTT and the Aliasing PPGTT is very straight forward. The choice is provided in several GPU commands. If there is no explicit choice, than there is some implicit behavior which is usually sensible. The most obvious command to be provided with a choice is MI_BATCH_BUFFER_START. When a batchbuffer is submitted, the driver sets a single bit that determines whether the batch will execute out of the GGTT or a Aliasing PPGTT1. Several commands as well, like PIPE_CONTROL, have a bit to direct which to use for the reads or writes that the GPU command will perform.

# Architecture

The names for all the page table data structures in hardware are the same as for IA CPU. You can see the Intel® 64 and IA-32 Architectures Software Developer Manuals for more information. (At the time of this post: page 1988 Vol3. 4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW). I don't want to rehash the HSW PRMs too much, and I am probably not allowed to won't copy the diagrams. However, for the sake of having a consolidated post, I will rehash the most pertinent parts.

There is one conceptual Page Directory for a PPGTT - the docs call this a set of Page Directory Entries (PDEs), however since they are contiguous, calling it a Page Directory makes a lot of sense to me. In fact, going back to the Ironlake docs, that seems to be the case. So there is one page directory with up to 512 entries, each pointing to a page table. There are several good diagrams which I won't bother redrawing in the PRMs2

Page Directory Entry
31:12 11:04 03:02 01 0
Physical Page Address 31:12 Physical Page Address 39:32 Rsvd Page size (4K/32K) Valid
Page Table Entry
31:12 11 10:04 03:01 0
Physical Page Address 31:12 Cacheability Control[3] Physical Page Address 38:32 Cacheability Control[2:0] Valid

There's some things we can get from this for those too lazy to click on the links to the docs.

1. PPGTT page tables exist in physical memory.
2. PPGTT PTEs have the exact same layout as GGTT PTEs.
3. PDEs don't have cache attributes (more on this later).
4. There exists support for big pages3

With the above definitions, we now can derive a lot of interesting attributes about our GPU. As already stated, the PPGTT is a two-level page table (I've not yet defined the size).

• A PDE is 4 bytes wide
• A PTE is 4 bytes wide
• A Page table occupies 4k of memory.
• There are 4k/4 entries in a page table.

With all this information, I now present you a slightly more accurate picture.

An object with an aliased PPGTT mapping

## Size

PP_DCLV - PPGTT Directory Cacheline Valid Register: As the spec tells us, "This register controls update of the on-chip PPGTT Directory Cache during a context restore." This statement is directly contradicted in the very next paragraph, but the important part is the bit about the on-chip cache. This register also determines the amount of virtual address space covered by the PPGTT. The documentation for this register is pretty terrible, so a table is actually useful in this case.

PPGTT Directory Cacheline Valid Register (from the docs)
63:32 31:0
MBZ PPGTT Directory Cache Restore [1..32] 16 entries
DCLV, the right way
31 30 1 0
PDE[511:496] enable PDE [495:480] enable PDE[31:16] enable PDE[15:0] enable

The, "why" is not important. Each bit represents a cacheline of PDEs, which is how the register gets its name4. A PDE is 4 bytes, there are 64b in a cacheline, so 64/4 = 16 entries per bit. We now know how much address space we have.

512 PDEs * 1024 PTEs per PT * 4096 PAGE_SIZE = 2GB


## Location

PP_DIR_BASE: Sadly, I cannot find the definition to this in the public HSW docs. However, I did manage to find a definition in the Ironlake docs yay me. There are several mentions in more recent docs, and it works the same way as is outlined on Ironlake. Quoting the docs again, "This register contains the offset into the GGTT where the (current context's) PPGTT page directory begins." We learn a very important caveat about the PPGTT here - the PPGTT PDEs reside within the GGTT.

## Programming

With these two things, we now have the ability to program the location, and size (and get the thing to load into the on-chip cache). Here is current i915 code which switches the address space (with simple comments added). It's actually pretty ho-hum.

...
ret = intel_ring_begin(ring, 6);
if (ret)
return ret;

intel_ring_emit(ring, RING_PP_DIR_DCLV(ring));
intel_ring_emit(ring, PP_DIR_DCLV_2G);       // program size
intel_ring_emit(ring, RING_PP_DIR_BASE(ring));
intel_ring_emit(ring, get_pd_offset(ppgtt)); // program location
intel_ring_emit(ring, MI_NOOP);
...


As you can see, we program the size to always be the full amount (in fact, I fixed this a long time ago, but never merged). Historically, the offset was at the top of the GGTT, but with my PPGTT series merged, that is abstracted out, and the simple get_pd_offset() macro gets the offset within the GGTT. The intel_ring_emit() stuff is because the docs recommended setting the registers via the GPU's LOAD_REGISTER_IMMEDIATE command, though empirically it seems to be fine if we simply write the registers via MMIO (for Aliasing PPGTT). See my previous blog post if you want more info about the commands execution in the GPU's ringbuffer. If it's easier just pretend it's 2 MMIO writes.

### Initialization

All of the resources are allocated and initialized upfront. There are 3 main steps. Note that the following comes from a relatively new kernel, and I have already submitted patches which change some of the cosmetics. However, the concepts haven't changed for pre-gen8.

1. Allocate space in the GGTT for the PPGTT PDEs

ret = drm_mm_insert_node_in_range_generic(&dev_priv->gtt.base.mm,
&ppgtt->node, GEN6_PD_SIZE,
GEN6_PD_ALIGN, 0,
0, dev_priv->gtt.base.total,
DRM_MM_TOPDOWN);


2. Allocate the page tables

for (i = 0; i < ppgtt->num_pd_entries; i++) {
ppgtt->pt_pages[i] = alloc_page(GFP_KERNEL);
if (!ppgtt->pt_pages[i]) {
gen6_ppgtt_free(ppgtt);
return -ENOMEM;
}
}


3. [possibly] IOMMU map the pages

for (i = 0; i < ppgtt->num_pd_entries; i++) {

pt_addr = pci_map_page(dev->pdev, ppgtt->pt_pages[i], 0, 4096,
PCI_DMA_BIDIRECTIONAL);
...
}


As the system binds, and unbinds objects into the aliasing PPGTT, it simply writes the PTEs for the given object (possibly spanning multiple page tables). The PDEs do not change. PDEs are mapped to a scratch page when not used, as are the PTEs.

### IOMMU

As we saw in step 3 above, I mention that the page tables may be mapped by the IOMMU. This is one important caveat that I didn't fully understand early on, so I wanted to recap a bit. Recall that the GGTT is allocated out of system memory during the boot firmware's initialization. This means that as long as Linux treats that memory as special, everything will just work (just don't look for IOMMU implicated bugs on our bugzilla). The page tables however are special because they get allocated after Linux is already running, and the IOMMU is potentially managing the memory. In other words, we don't want to write the physical address to the PDEs, we want to write the dma address. Deferring to wikipedia again for the description of an IOMMU., that's all.It tripped be up the first time I saw it because I hadn't dealt with this kind of thing before. Our PTEs have worked the same way for a very long time when mapping the BOs, but those have somewhat hidden details because they use the scatter-gather functions.

Feel free to ask questions in the comments if you need more clarity - I'd probably need another diagram to accommodate.

## Cached page tables

Let me be clear, I favored writing a separate post for the Aliasing PPGTT because it gets a lot of the details out of the way for the post about Full PPGTT. However, the entire point of this feature is to get a [to date, unmeasured] performance win. Let me explain… Notice bits 4:3 of the ECOCHK register. Similarly in the i915 code:

ecochk = I915_READ(GAM_ECOCHK);
if (IS_HASWELL(dev)) {
ecochk |= ECOCHK_PPGTT_WB_HSW;
} else {
ecochk |= ECOCHK_PPGTT_LLC_IVB;
ecochk &= ~ECOCHK_PPGTT_GFDT_IVB;
}
I915_WRITE(GAM_ECOCHK, ecochk);


What these bits do is tell the GPU whether (and how) to cache the PPGTT page tables. Following the Haswell case, the code is saying to map the PPGTT page table with write-back caching policy. Since the writes for Aliasing PPGTT are only done at initialization, the policy is really not that important.

Below is how I've chosen to distinguish the two. I have no evidence that this is actually what happens, but it seems about right.

Flow chart for GPU GGTT memory access. Red means slow.

Flow chart for GPU PPGTT memory access. Red means slow.

Red means slow. The point which was hopefully made clear above is that when you miss the TLB on a GGTT access, you need to fetch the entry from memory, which has a relatively high latency. When you miss the TLB on a PPGTT access, you have two caches (the special PDE cache for PPGTT, and LLC) which are backing the request. Note there is an intentional bug in the second diagram - you may miss the LLC on the PTE fetch also. I was trying to keep things simple, and show the hopeful case.

Because of this, all mappings which do not require GGTT mappings get mapped to the aliasing PPGTT.

## Distinctions from the GGTT

At this point I hope you're asking why we need the global GTT at all. There are a few limited cases where the hardware is incapable (or it is undesirable) of using a per process address space.

A brief description of why, with all the current callers of the global pin interface.

• Display: Display actually implements it's own version of the GGTT. Maintaining the logic to support multiple level page tables was both costly, and unnecessary. Anything relating to a buffer being scanned out to the display must always be mapped into the GGTT. Ie xpect this to be true, forever.
• i915_gem_object_pin_to_display_plane(): page flipping
• intel_setup_overlay(): overlays
• Ringbuffer: Keep in mind that the aliasing PPGTT is a special case of PPGTT. The ringbuffer must remain address space and context agnostic. It doesn't make any sense to connect it to the PPGTT, and therefore the logic does not support it. The ringbuffer provides direct communication to the hardware's execution logic - which would be a nightmare to synchronize if we forget about the security nightmare. If you go off and think about how you would have a ringbuffer mapped by multiple address spaces, you will end up with something like execlists.
• allocate_ring_buffer()
• HW Contexts: Extremely similar to ringbuffer.
• intel_alloc_context_page(): Ironlake RC6
• i915_gem_create_context(): Create the default HW context
• i915_gem_context_reset(): Re-pin the default HW context
• do_switch(): Pin the logical context we're switching to
• Hardware status page: The use of this, prior to execlists, is much like rinbuffers, and contexts. There is a per process status page with execlists.
• init_status_page()
• Workarounds:
• init_pipe_control(): Initialize scratch space for workarounds.
• intel_init_render_ring_buffer(): An i830 w/a I won't bother to understand
• render_state_alloc(): Full initialization of GPUs 3d state from within the kernel
• Other
• i915_gem_gtt_pwrite_fast(): Handle pwrites through the aperture. More info here.
• i915_gem_fault(): Map an object into the aperture for gtt_mmap. More info here.
• i915_gem_pin_ioctl(): The DRI1 pin interface.

# GEN8 disambiguation

Off the top of my head, the list of some of the changes on GEN8 which will get more detail in a later post. These changes are all upstream from the original Broadwell integration.

• PTE size increased to 8b
• Therefore, 512 entries per table
• Format mimics the CPU PTEs
• PDEs increased to 8b (remains 512 PDEs per PD)
• Page Directories live in system memory
• GGTT no longer holds the PDEs.
• There are 4 PDPs, and therefore 4 PDs
• PDEs are cached in LLC instead of special cache (I'm guessing)
• New HW PDP (Page Directory Pointer) registers point to the PDs, for legacy 32b addressing.
• PP_DIR_BASE, and PP_DCLV are removed
• Support for 4 level page tables, up to 48b virtual address space.
• PML4[PML4E]->PDP
• PDP[PDPE] -> PD
• PD[PDE] -> PT
• PT{PTE] -> Memory
• Big pages are now 64k instead of 32k (still not implemented)
• New caching interface via PAT like structure

# Summary

There's actually an interesting thing that you start to notice after reading Distinctions from the GGTT. Just about every thing mapped into the GGTT shouldn't be mapped into the PPGTT. We already stated that we try to map everything else into the PPGTT. The set of objects mapped in the GGTT, and the set of objects mapped into the PPGTT are disjoint5). The patches to make this work are not yet merged. I'd put an image here to demonstrate, but I am feeling lazy and I really want to get this post out today.

Recapping:

• The Aliasing PPGTT is a single instance of the hardware feature: PPGTT.
• Aliasing PPGTT was designed as a drop in performance replacement to the GGTT.
• GEN8 changed a lot of architectural stuff.
• The Aliasing PPGTT shouldn't actually alias the GGTT because the objects they map are a disjoint set.

Like last time, links to all the SVGs I've created. Use them as you like.

1. Actually it will use whatever the current PPGTT is, but for this post, that is always the Aliasing PPGTT

2. Big pages have the same goal as they do on the CPU - to reduce TLB pressure. To date, there has been no implementation of big pages for GEN (though a while ago I started putting something together). There has been some anecdotal evidence that there isn't a big win to be had for many workloads we care about, and so this remains a low priority.

3. This register thus allows us to limit, or make a sparse address space for the PPGTT. This mechanism is not used, even in the full PPGTT patches

4. There actually is a case on GEN6 which requires both. Currently this need is implemented by drivers/gpu/drm/i915/i915_gem_execbuffer.c: i915_gem_execbuffer_relocate_entry(

29 Aug 2015 7:41pm GMT

#### Ben Widawsky: OS backtrace with symbol names

For those of you reading this that didn't know, I've had two months of paid vacation - one of the real perks of working for Intel. Today is the last day. It is as hard as I thought it would be.

Most of the vacation was spent vacationing. As I have access to none of the pictures at the moment, and I don't want to make you jealous, I'll skip over that. Toward the end though, I ended up at a coffee shop waiting for someone with nothing to do. I spent a little bit of time working on HOBos, and I thought it could be interesting to write about it.

WARNING: There is nothing novel here.

# A brief history of HOBos

The HOBby operating system is an operating system project I started a while ago. I am a bit embarrassed that I started an OS. In my opinion, it's one of lamer tasks to take on because

1. everyone seems to do it;
2. there really isn't a need, there are many operating systems with permissive licenses already; and
3. sites like OSDev have made much of the work trivial (I like to think that when I started there wasn't quite as much info readily available, but that's a lie).

Larrabee Av in Portland
(not what the project was named after)

HOBos began while I was working on the Larrabee project. The team spent a lot of time optimizing the memory management, and the scheduler for the embedded software. I really wanted to work on these things full time. Unfortunately for me, having had a background in device drivers, I was often required to do other things. As a means to scratch the itch, I started HOBos after not finding anything suitable for my needs. The stuff I found were all either too advanced, or too rudimentary. When I was hired to work on i915, I decided that it was better use of my free time. Since then, I've been tweaking things here or there, and I do try to make sure things still run with the latest QEMU and compilers at least once a year. The last actual feature I added was more than 1300 days ago:

commit 1c9b5c78b22b97246989b00e807c9bf1fbc9e517
Author: Ben Widawsky
Date: Sat Mar 19 21:19:57 2011 -0700

basic backtrace


So back to the coffee shop. I tried to do something or other, got a hang, and didn't want to fire up GDB.

# Backtracing

HOBos had implemented backtraces since the original import from SVN (let's pretend that means, since always). Obtaining a backtrace is actually pretty straightforward on x86.

## The stack frame

The stack frame can be thought of as memory contents that are locally scoped to a function. Declaring a local variable will end up in the stack frame. A global variable will not. As functions call other functions, you end up with multiple frames. A stack is used because the last frames added are the first ones removed (this is not always true for things like exceptions, but nevermind that for now). The fact that a stack decrements is arbitrarily chosen, as far as I can tell. The following shows the stack when the function foo() calls the function bar().

Example Stackframe

void bt_fp(void *fp)
{
do {
uint64_t prev_rbp = *((uint64_t *)fp);
uint64_t prev_ip = *((uint64_t *)(fp + sizeof(prev_rbp)));
struct sym_offset sym_offset = get_symbol((void *)prev_ip);
printf("\t%s (+0x%x)\n";, sym_offset.name, sym_offset.offset);
fp = (void *)prev_rbp;
/* Stop if rbp is not in the kernel
* TODO: need an upper bound too*/
if (fp <= (void *)KVADDR(DMAP_PML,0,0,0))
break;
} while(1);
}


The memory contents shown above are created as a result of two things. First is what the CPU implicitly does upon execution of the call instruction. The second is what the compiler generates. Since we're talking about x86 here, the call instruction always pushes at least the return address. The second I'll detail a bit more in a bit. Correlating this to the picture, the green (foo) and blue (bar) are creations of the compiler. The brown is automatically pushed on to the stack by the hardware, and is automatically popped from the stack on the ret instruction.

In the above there are two registers worth noting, RBP, and RSP. RBP which is the 64b extension to the original 8086 BP, Base Pointer, register is the beginning of the stack frame ie the Frame Pointer. RSP, the extension to the 8086 SP, Stack Pointer, points to the end of the stack frame. By convention the Base Pointer doesn't change throughout a function being executed and therefore it is often used as the reference to local variables stored on the stack. -100(%rbp) as an example above.

Digging further into that disassembly above, one notices a pattern. Every function begins with:

push   %rbp       // Push the old RBP, RSP now points to this
mov    %rsp,%rbp  // Store RSP in RBP


Assuming this is the convention, it implies that at any given point during the execution of a function we can obtain the previous RBP by reading the current RBP and doing some processing. Specifically, reading RBP gives us the old Stack Pointer, which is pointing to the last RBP. As mentioned above, the x86 CPU pushed the return address immediately before the push %rbp - which means as we work backwards through the Base Pointers, we can also obtain the caller for the current stack frame. People have done really nice pictures on this - use your favorite search engine.

Here is the HOBos code (ignore the part about symbols for now):

void bt_fp(void *fp)
{
do {
uint64_t prev_rbp = *((uint64_t *)fp);
uint64_t prev_ip = *((uint64_t *)(fp + sizeof(prev_rbp)));
struct sym_offset sym_offset = get_symbol((void *)prev_ip);
printf("\t%s (+0x%x)\n", sym_offset.name, sym_offset.offset);
fp = (void *)prev_rbp;
/* Stop if rbp is not in the kernel
* TODO: need an upper bound too*/
if (fp <= (void *)KVADDR(DMAP_PML,0,0,0))
break;
} while(1);
}


As far as I know, all modern CPUs work in a similar fashion with differences sprinkled here and there. ARM for example has an LR register for the return address instead of using the stack.

## ABI/Calling Conventions

The fact that we can work backwards this way is a byproduct of the calling convention. One example of an aspect of the calling convention is where to put the arguments to a function. Do they go on the stack, in registers, or somewhere else? In addition to these arguments, the way in which RBP, and RSP are used are strictly software constructs that are part of the convention. As a result, it might not always be possible to get a backtrace if:

1. This convention is not adhered to (or -fomit-frame-pointer)
2. The contents of RBP are destroyed
3. The contents of the stack are corrupted.

How arguments are passed to function are also needed to make sure linkers and loaders (both static and dynamic) can operate to correctly form an executable, or dynamically call a function. Since this isn't really important to obtaining a backtrace, I will leave it there. Some architectures do provide a way to obtain useful backtrace information without caring about the calling convention: Intel's Processor Trace for example.

# Symbol information

The previous section will get us a reverse list of addresses for all function calls working backward from a given point during execution. But having names makes things much more easier to quickly diagnose what is going wrong. There is a lot of information on the internet about this stuff. I'm simply providing all that's relevant to my specific problem.

## ELF Symbols (linking)

The ELF format provides everything we need (assuming things aren't stripped). Glossing over the details (see this simple tool if you're curious), we end up with two "sections" that tell us everything we need. They are conventionally named, ".symtab", and ".strtab" and are conveniently of type, SHT_SYMTAB, and SHT_STRTAB. The symbol table defines the information about each symbol (functions, variables, whatever). Part of the information is a name, which is an index into the string table. In this simplest case, these are provisions for inter-object linking. If I had defined foo() in foo.c, and bar() in bar.c, the compiled object files can be linked together, but the linker needs the information about the symbol bar (in this case) in order to do its job.

readelf -S a.out

[Nr] Name Type Address Offset
[33] .symtab SYMTAB 0000000000000000 000015b8
[34] .strtab STRTAB 0000000000000000 00001c90
> readelf -S a.out | egrep "\.strtab|\.symtab" | wc -l
2
> strip a.out
> readelf -S a.out | egrep "\.strtab|\.symtab" | wc -l
0


Summing that up, if we have an entire ELF file, and the symbol and string tables haven't been stripped, we're good to go. However, ELF sections are not the unit in which an ELF loader decides what to load. The loader loads segments which are of type PT_LOAD. A segment is made up of 0 or more sections, plus padding. Since the Operating System is itself an ELF loaded by an ELF loader (the bootloader) we're not in a good position.

> readelf -l a.out | egrep "\.strtab|\.symtab" | wc -l
0


### Debug Info

Note that what we want is not the same thing as debug information. If one wants to do source level debug, there needs to be some way of correlating a machine instruction to a line of source code. This is also a software abstraction, and there is usually a spec for it unless you are using some proprietary thing. It would technically be possible to include DWARF capabilities within the kernel, but I do not know of a way to get that info to the OS (see multiboot stuff for details).

## From boot to symbols

The HOBos project implements a small Multiboot compliant bootloader called smallboot. When the machine starts up, boot firmware is loaded from a fixed location (this is currently done by SeaBIOS). The boot firmware then loads the smallboot bootloader. The bootloader will load the kernel (smallboot, and most modern bootloaders will do this through a text configuration file on the resident filesystem). In the HOBos case, the kernel is simply an ELF file. smallboot implements a basic ELF loader to load the kernel into memory and give execution over.

The multiboot specification is a standardized communication mechanism (for various things) from the bootloader to the Operating System (or any file really). One of these things is symbol information. Quoting the multiboot spec

If bit 5 in the 'flags' word is set, then the following fields in the Multiboot information structure starting at byte 28 are valid:

+-------------------+
28      | num               |
32      | size              |
36      | addr              |
40      | shndx             |
+-------------------+


These indicate where the section header table from an ELF kernel is, the size of each entry, number of entries, and the string table used as the index of names. They correspond to the 'shdr_*' entries ('shdr_num', etc.) in the Executable and Linkable Format (elf) specification in the program header. All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory (refer to the i386 elf documentation for details as to how to read the section header(s)). Note that 'shdr_num' may be 0, indicating no symbols, even if bit 5 in the 'flags' word is set.

Since the beginning I had implemented these fields in the bootloader:

multiboot_info.flags |= MULTIBOOT_INFO_ELF_SHDR;
multiboot_info.u.elf_sec = *table;


Because of the fact that the symbols weren't in the ELF segments though, I was stumped as to how to get at the data one the OS is loaded. As it turned out, I didn't actually read all 4 sentences and I had missed one very important part.

All sections are loaded, and the physical address fields of the elf section header then refer to where the sections are in memory

What the spec is dictating is that even though the sections are not in loadable segments, they shall exist within memory during the handover to the OS, and the section header information will be updated so that the OS knows where to find it. With this, the OS can copy out, or just make sure not to overwrite the info, and then get access to it.

for (i = 0; i < shnum; i++) {
__ElfN(Shdr) *sh = &shdr[i];
if (sh->sh_size == 0)
continue;

continue;

ASSERT(sizeof(void *) == 4);
}


# et cetera

The code for pulling out the symbols is quite a bit longer, but it can be found in kern/core/syms.c. With the given RBP unwinder near the top, we can easily get the IP for the caller. With that IP, we do a symbol lookup from the symbols we got via the multiboot info.

Screenshot with backtrace

29 Aug 2015 7:24pm GMT

# LAST REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

Here's the last reminder that the systemd.conf 2015 CfP ends on August 31st 11:59:59pm Central European Time (that's monday next week)! Make sure to submit your proposals until then!

Please submit your proposals on our website!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about systemd.conf consult the conference website.

27 Aug 2015 10:00pm GMT

## 26 Aug 2015

### planet.freedesktop.org

#### Peter Hutterer: pointer acceleration in libinput - building a DPI database for mice

Mice have an optical sensor that tells them how far they moved in "mickeys". Depending on the sensor, a mickey is anywhere between 1/100 to 1/8200 of an inch or less. The current "standard" resolution is 1000 DPI, but older mice will have 800 DPI, 400 DPI etc. Resolutions above 1200 DPI are generally reserved for gaming mice with (usually) switchable resolution and it's an arms race between manufacturers in who can advertise higher numbers.

HW manufacturers are cheap bastards so of course the mice don't advertise the sensor resolution. Which means that for the purpose of pointer acceleration there is no physical reference. That delta of 10 could be a millimeter of mouse movement or a nanometer, you just can't know. And if pointer acceleration works on input without reference, it becomes useless and unpredictable. That is partially intended, HW manufacturers advertise that a lower resolution will provide more precision while sniping and a higher resolution means faster turns while running around doing rocket jumps. I personally don't think that there's much difference between 5000 and 8000 DPI anymore, the mouse is so sensitive that if you sneeze your pointer ends up next to Philae. But then again, who am I to argue with marketing types.

For us, useless and unpredictable is bad, especially in the use-case of everyday desktops. To work around that, libinput 0.7 now incorporates the physical resolution into pointer acceleration. And to do that we need a database, which will be provided by udev as of systemd 218 (unreleased at the time of writing). This database incorporates the various devices and their physical resolution, together with their sampling rate. udev sets the resolution as the MOUSE_DPI property that we can read in libinput and use as reference point in the pointer accel code. In the simplest case, the entry lists a single resolution with a single frequency (e.g. "MOUSE_DPI=1000@125"), for switchable gaming mice it lists a list of resolutions with frequencies and marks the default with an asterisk ("MOUSE_DPI=400@50 800@50 *1000@125 1200@125"). And you can and should help us populate the database so it gets useful really quickly.

# How to add your device to the database

We use udev's hwdb for the database list. The upstream file is in /usr/lib/udev/hwdb.d/70-mouse.hwdb, the ruleset to trigger a match is in /usr/lib/udev/rules.d/70-mouse.rules. The easiest way to add a match is with the libevdev mouse-dpi-tool (version 1.3.2). Run it and follow the instructions. The output looks like this:

$sudo ./tools/mouse-dpi-tool /dev/input/event8Mouse Lenovo Optical USB Mouse on /dev/input/event8Move the device along the x-axis.Pause 3 seconds before movement to reset, Ctrl+C to exit.Covered distance in device units: 264 at frequency 125.0Hz | |^CEstimated sampling frequency: 125HzTo calculate resolution, measure physical distance coveredand look up the matching resolution in the table below 16mm 0.66in 400dpi 11mm 0.44in 600dpi 8mm 0.33in 800dpi 6mm 0.26in 1000dpi 5mm 0.22in 1200dpi 4mm 0.19in 1400dpi 4mm 0.17in 1600dpi 3mm 0.15in 1800dpi 3mm 0.13in 2000dpi 3mm 0.12in 2200dpi 2mm 0.11in 2400dpiEntry for hwdb match (replace XXX with the resolution in DPI):mouse:usb:v17efp6019:name:Lenovo Optical USB Mouse: MOUSE_DPI=XXX@125  Take those last two lines, add them to a local new file /etc/udev/hwdb.d/71-mouse.hwdb. Rebuild the hwdb, trigger it, and done: $ sudo udevadm hwdb --update$sudo udevadm trigger /dev/input/event8  Leave out the device path if you're not on systemd 218 yet. Check if the property is set: $ udevadm info /dev/input/event8 | grep MOUSE_DPIE: MOUSE_DPI=1000@125


And that shows everything worked. Restart X/Wayland/whatever uses libinput and you're good to go. If it works, double-check the upstream instructions, then file a bug against systemd with those two lines and assign it to me.

Trackballs are a bit hard to measure like this, my suggestion is to check the manufacturer's website first for any resolution data.

Update 2014/12/06: trackball comment added, udevadm trigger comment for pre 218
Update 2015/08/26: udpated link to systemd bugzilla (now on github)

26 Aug 2015 12:08am GMT

# First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf 2015 sponsors!

Our first Silver sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS's commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here https://coreos.com/, Tectonic here, https://tectonic.com/ or follow CoreOS on Twitter @coreoslinux.

A Bronze sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world's top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We'll shortly announce our second round of sponsors, please stay tuned!

If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on monday, August 31st! Please make sure to submit your proposals on the CfP page until then!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about systemd.conf consult the conference website.

24 Aug 2015 10:00pm GMT

## 19 Aug 2015

### planet.freedesktop.org

#### Christian Schaller: Fedora Workstation Next Steps: Wayland and Graphics

So I realized I hadn't posted a Wayland update in a while. So we are still making good progress on Wayland, but the old saying that the last 10% is 90% of the work is definitely true here. So there was a Wayland BOF at GUADEC this year which tried to create a TODO list for major items remaining before Wayland is ready to replace X.

• Proper menu positioning. Most visible user issue currently for people testing Wayland
• Relative pointer/confinement/locking. Important for games among other things.
• Kinetic scroll in GTK+
• More work needed to remove all X11 dependencies (so that desktop has no dependency on XWayland being available.
• Minimize main thread stalls (could be texture uploads for example)
• Tablet support. Includes gestures, on-screen keyboard and more.

A big thank you to Jonas Ådahl, Owen Taylor, Carlos Garnacho, Rui Matos, Marek Chalupa, Olivier Fourdan and more for their ongoing work on polishing
up the Wayland experience.

So as you can tell there is a still lot of details that needs working out when doing something as major as switching from one Display system to the other, but things are starting to looking really good now.

One new feature I am particularly excited about is what we call multi-DPI support ready for Wayland sessions in Fedora 23. What this means is that if you have a HiDPI laptop screen and a standard DPI external monitor you should be able to drag windows back and forth between the two screens and have them automatically rescale to work with the different DPIs. This is an example of an issue which was relatively straightforward to resolve under Wayland, but which would have been a lot of pain to get working under X.

We will not be defaulting to Wayland in Fedora Workstation 23 though, because as I have said in earlier blog posts about the subject, we will need to have a stable and feature complete Wayland in at least one release before we switch the default. We hope Fedora Workstation 23 will be that stable and feature complete release, which means Fedora Workstation 24 is the one where we can hope to make the default switchover.

Of course porting the desktop itself to Wayland is just part of the story. While we support running X applications using XWayland, to get full benefit from Wayland we need our applications to run on top of Wayland natively. So we spent effort on getting some major applications like LibreOffice and Firefox Wayland ready recently.

Caolan McNamara has been working hard on finishing up the GTK3 port of LibreOffice which is part of the bigger effort to bring LibreOffice nativly to Wayland. The GTK3 version of LibreOffice should be available in Fedora Workstation 23 (as non-default option) and all the necessary code will be included in LibreOffice 5 which will be released pretty soon. The GTK3 version should be default in F24, hopefully with complete Wayland support.

For Firefox Martin Stransky has been looking into ensuring Firefox runs on Wayland now that the basic GTK3 port is done. Martin just reported today that he got Firefox running natively under Wayland, although there are still a lot of glitches and issues he needs to figure out before it can be claimed to be ready for normal use.

Another major piece we are working on which is not directly Wayland related, but which has a Wayland component too is to try to move the state of Linux forward in the context of dealing with multiple OpenGL implementations, multi-GPU systems and the need to free our 3D stack from its close ties to GLX.

This work with is lead by Adam Jackson, but where also Dave Airlie is a major contributor, involves trying to decide and implement what is needed to have things like GL Dispatch, EGLstreams and EGL Device proposals used across the stack. Once this work is done the challenges around for instance using the NVidia binary driver on a linux system or using a discreet GPU like on Optimus laptops should be a thing of the past.

So the first step of this work is getting GL Dispatch implemented. GL Dispatch basically allows you to have multiple OpenGL implementations installed and then have your system pick the right one as needed. So for instance on a system with NVidia Optimus you can use Mesa with the integrated Intel graphics card, but NVidias binary OpenGL implementatioin with the discreet Optimus GPU. Currently that is a pain to do since you can only have one OpenGL implementation used. Bumblebee tries to hack around that requirement, but GL Dispatch will allow us to resolve this without having to 'fight' the assumptions of the system.

We plan to have easy to use support for both Optimus and Prime (the Nouveau equivalent of Optimus) in the desktop, allowing you to choose what GPU to use for your applications without needing to edit any text files or set environment variables.

The final step then is getting the EGL Device and EGLStreams proposal implemented so we can have something to run Wayland on top of. And while GL Dispatch are not directly related to those we do feel that having it in place should make the setup easier to handle as you don't risk conflicts between the binary NVidia driver and the Mesa driver anymore at that point, which becomes even more crucial for Wayland since it runs on top of EGL.

19 Aug 2015 4:50pm GMT

# REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

We'd like to remind you that the systemd.conf 2015 Call for Presentations ends on August 31st! Please submit your presentation proposals before that data on our website.

We are specifically interested in submissions from projects and vendors building today's and tomorrow's products, services and devices with systemd. We'd like to learn about the problems you encounter and the benefits you see! Hence, if you work for a company using systemd, please submit a presentation!

We are also specifically interested in submissions from downstream distribution maintainers of systemd! If you develop or maintain systemd packages in a distribution, please submit a presentation reporting about the state, future and the problems of systemd packaging so that we can improve downstream collaboration!

And of course, all talks regarding systemd usage in containers, in the cloud, on servers, on the desktop, in mobile and in embedded are highly welcome! Talks about systemd networking and kdbus IPC are very welcome too!

Please submit your presentations until August 31st!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about the CfP consult the CfP page.

For further details about systemd.conf consult the conference website.

18 Aug 2015 10:00pm GMT

# REMINDER! systemd.conf 2015 Call for Papers ends August 31st!

We'd like to remind you that the systemd.conf 2015 Call for Presentations ends on August 31st! Please submit your presentation proposals before that data on our website.

We are specifically interested in submissions from projects and vendors building today's and tomorrow's products, services and devices with systemd. We'd like to learn about the problems you encounter and the benefits you see! Hence, if you work for a company using systemd, please submit a presentation!

We are also specifically interested in submissions from downstream distribution maintainers of systemd! If you develop or maintain systemd packages in a distribution, please submit a presentation reporting about the state, future and the problems of systemd packaging so that we can improve downstream collaboration!

And of course, all talks regarding systemd usage in containers, in the cloud, on servers, on the desktop, in mobile and in embedded are highly welcome! Talks about systemd networking and kdbus IPC are very welcome too!

Please submit your presentations until August 31st!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details abou the CfP consult the CfP page.

For further details about systemd.conf consult the conference website.

18 Aug 2015 10:00pm GMT

# First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf 2015 sponsors!

Our first Silver sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS's commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here https://coreos.com/, Tectonic here, https://tectonic.com/ or follow CoreOS on Twitter @coreoslinux.

A Bronze sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world's top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We'll shortly announce our second round of sponsors, please stay tuned!

If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on monday, August 31st! Please make sure to submit your proposals on the CfP page until then!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about systemd.conf consult the conference website.

18 Aug 2015 10:00pm GMT

## 17 Aug 2015

### planet.freedesktop.org

#### Christian Schaller: An Open Letter to Apache Foundation and Apache OpenOffice team

A couple of weeks I visited my mother back home in Norway. She had gotten a new laptop some time ago that my brother-in-law had set up for her. As usual when I come for a visit I was asked to look at some technical issues my mother was experiencing with her computer. Anyway, one thing I discovered while looking at these issues was that my brother-in-law had installed OpenOffice on her computer. So knowing that the OpenOffice project is all but dead upstream since IBM pulled their developers of the project almost a year ago and has significantly fallen behind feature wise, I of course installed LibreOffice on the system instead, knowing it has a strong and vibrant community standing behind it and is going from strength to strength.

And this is why I am writing this open letter. Because while a lot of us who comes from technical backgrounds have already caught on to the need to migrate from OpenOffice to LibreOffice, there are still many non-technical users out there who are still defaulting to installing OpenOffice when they are looking for an open source Office Suite, because that is the one they came across 5 years ago. And I believe that the Apache Foundation, being an organization dedicated to open source software, care about the general quality and perception of open source software and thus would share my interest in making sure that all users of open source software gets the best experience possible, even if the project in question isn't using their code license of preference.

So I realize that the Apache Foundation took a lot of pride in and has invested a lot of effort trying to create an Apache Licensed Office suite based on the old OpenOffice codebase, but I hope that now that it is clear that this effort has failed that you would be willing to re-direct people who go to the openoffice.org website to the LibreOffice website instead. Letting users believe that OpenOffice is still alive and evolving is only damaging the general reputation of open source Office software among non-technical users and thus I truly believe that it would be in everyones interest to help the remaining OpenOffice users over to LibreOffice.

And to be absolutely clear I am only suggesting this due to the stagnant state of the OpenOffice project, if OpenOffice had managed to build a large active community beyond the resources IBM used to provide then it would have been a very different story, but since that did not happen I don't see any value to anyone involved to just let users keep downloading an aging releases of a stagnant codebase until the point where bit rot chases them away or they hear about LibreOffice true mainstream media or friends. And as we all know it is not about just needing a developer or two to volunteer here, maintaining and developing something as large as OpenOffice is a huge undertaking and needs a very sizeable and dedicated community to be able to succeed.

So dear Apache developers, for the sake of open source and free software, please recommend people to go and download LibreOffice, the free office suite that is being actively maintained and developed and which has the best chance of giving them a great experience using free software. OpenOffice is an important part of open source history, but that is also what it is at this point in time.

17 Aug 2015 2:50pm GMT

## 16 Aug 2015

### planet.freedesktop.org

#### Daniel Vetter: Atomic Modesetting Design Overview

After a few years of development the atomic display update IOCTL for drm drivers is finally ready for prime time with the 4.2 pull request from Dave Airlie. It's been a long road, with a lot of drivers already converted over to atomic and even more in progress, the atomic helper libraries and support code in the drm subsystem sufficiently polished. But what's really missing is a design overview of what the overall atomic infrastructure looks like and why some decisions and details are implemented like they are.

That's now done and published on LWN: Part 1 talks about the problem space, issues with the Android atomic display framework and the basic atomic IOCTL interface. Part 2 goes into more detail about a few specific things like locking, helper library design and the exact semantics of atomic modessetting updates. Happy Reading!

16 Aug 2015 1:52pm GMT

## 15 Aug 2015

### planet.freedesktop.org

#### Rob Clark: freedreno - mesa 11.0 progress update, OpenGLES3 and more

So the big news for the upcoming mesa 11.0 release is gl4.x support for radeon and nouveau. Which has been in the works for a long time, and a pretty tremendous milestone (and the reason that the next mesa release is 11.0 rather than 10.7). But on the freedreno side of things, we haven't been sitting still either. In fact, with the transform-feedback support I landed a couple weeks ago (for a3xx+a4xx), plus MRT+z32s8 support for a4xx (Ilia landed the a3xx parts of those a while back), we now support OpenGLES 3.0[1] on both adreno 3xx and 4xx!!

In addition, with the TBO support that landed a few days ago, plus handful of other fixes in the last few days, we have the new antarctica gl3.1 render engine for supertuxkart working!

Note that you need to use MESA_GL_VERSION_OVERRIDE=3.1 and MESA_GLSL_VERSION_OVERRIDE=140, since while we support everything that stk needs, we don't yet support everything needed to advertise gl3.1. (But hey, according to qualcomm, adreno 3xx doesn't even support higher than gles3.0.. I guess we'll have to show them ;-))

The nice thing to see about this working, is that it is utilizing pretty much all of the recent freedreno features (transform feedback, MRT, UBO's, TBO's, etc).

Of course, the new render engine is considerably more heavyweight compared to older versions of stk. But I think there is some low hanging fruit on the stk engine side of things to reclaim some of those lost fps.

update: oh, and the first time around, I completely forgot to mention that qualcomm has recently published *some* gpu docs, for a3xx, for the dragonboard 410c. Not quite as extensive as what broadcom has published for vc4, but it gives us all the a3xx registers, which is quite a lot more than any other SoC vendor has done to date :-)

[1] minus MSAA.. There is a bigger task, which is on the TODO list, to teach mesa st about some extensions to support MSAA resolve on tile->mem.. such as EXT_multisampled_render_to_texture, plus perhaps driconf option to enable it for apps that are not aware, which would make MSAA much more useful on a tiling gpu. Until then, mesa doesn't check MSAA for gles3, and if it did we could advertise PIPE_CAP_FAKE_SW_MSAA. Plus, who really cares about MSAA on a 5" 4k screen?

15 Aug 2015 10:15pm GMT

## 13 Aug 2015

### planet.freedesktop.org

#### Julien Danjou: Reading LWN.net with Pocket

I've started to use Pocket a few months ago to store my backlog of things to read. It's especially useful as I can use it to read content offline since we still don't have any Internet access in places such as airplanes or the Paris metro. It's only 2015 after all.

I am also a LWN.net subscriber for years now, and I really like their articles from the weekly edition. Unfortunately, as the access is restricted to subscribers, you need to login: it makes it impossible to add these articles to Pocket directly. Sad.

Yesterday, I thought about that and decided to start hacking on it. LWN provides a feature called "Subscriber Link" that allows you to share an article with a friend. I managed to use that feature to share the articles with my friend… Pocket!

As doing that every week is tedious, I wrote a small Python program called lwn2pocket that I published on GitHub. Feel free to use it, hack it and send pull requests.

13 Aug 2015 4:39pm GMT

## 11 Aug 2015

### planet.freedesktop.org

#### Matthias Klumpp: AppStream/DEP-11 for everyone! (beta!)

One of the things we discussed at this year's Akademy conference is making AppStream work on Kubuntu.

On Debian-based systems, we use a YAML-based implementation of AppStream, called "DEP-11″. DEP-11 exists for historical reasons (the DEP-11 YAML format was a superset of AppStream once) and because YAML, unlike XML, is one accepted file format by the Debian FTPMasters team, who we want to have on board when adding support for AppStream.

So I've spent the last few days on setting up the DEP-11 generator for Kubuntu, as well as improving it greatly to produce more meaningful error messages and to generate better output. It became a bit slower in the process, but the greatly improved diagnostics data is worth it.

For example, maintainers of Debian packages will now get a more verbose explanation of issues found with metadata in their packages, making them easier to fix for people who didn't get in contact with AppStream yet.

At time, we generate AppStream metadata for Tanglu, Kubuntu and Debian, but so far only Tanglu makes real use of it by shipping it in a .deb package. Shipping the data as package is only a workaround though, for a proper implementation, the data will be downloaded by Apt. To achieve that, the data needs to reach the archive first, which is something that I can hopefully discuss and implement with the FTPMasters team of Debian at this year's Debconf.

When this is done, the new metadata will automatically become available in tools like GNOME-Software or Muon Discover.

## How can I see if there are issues with my package?

The dep11-generator tool will return HTML pages to show both the extracted metadata, as well as issues with it.

You can find the information for the respective distribution here:

Each issue tag will contain a detailed explanation of what went wrong. Errors generally lead to ignoring the metadata, so it will not be processed. Warnings usually concern things which might reduce the amount of metadata or make it less useful, while Info-type hints contain information on how to improve the metadata or make it more useful.

## Can I use the data already?

Yes, you can. You just need to place the compressed YAML files in /var/cache/app-info/yaml and the icons in /var/cache/app-info/icons/<suite>-<component>/<size>, for example: /var/cache/app-info/icons/jessie-amd64/64×64

## I think I found a bug in the generator…

In that case, please report the issue against the appstream-dep11 package at Debian, or file an issue at Github..

The only reason why I announce this feature now is to find remaining generator bugs, before officially announcing the feature on debian-devel-announce.

## When will this be officially announced?

I want to give this feature a little bit more testing, and ideally have the integration into the archive ready, so people can see how the metadata looks like when rendered in GNOME-Software/Discover. I also want to write a bit more documentation to help Debian developers and upstreams to improve their metadata.

Ideally, I also want to incorporate some feedback at Debconf when announcing full AppStream support in Debian. So take all the stuff you've read above as a little sneak peek

I will also give a talk at Debconf, titled "AppStream, Limba, XdgApp - Where we are going." The aim of this talk is to give an insight i tnto the new developments happening in the software distribution area, and what the goal of these different projects is (the talk should give an overview of what's in the oven and how it will impact Debian). So if you are interested, please drop by Maybe setting up a BOF would also be a good idea.

11 Aug 2015 6:46am GMT

## 10 Aug 2015

### planet.freedesktop.org

#### Matthias Klumpp: Akademy 2015

I am very late with this, but I still wanted to write a few words about Akademy 2015.

First of all: It was an awesome conference! Meeting all the great people involved with KDE and seeing who I am working with (as in: face to face, not via email) was awesome. We had some very interesting discussions on a wide variety of topics, and also enjoyed quite some beer together. Particularly important to me was of course talking to Daniel Vrátil who is working on XdgApp, and Aleix Pol of Muon Discover fame.

Also meeting with the other Kubuntu members was awesome - I haven't seen some of them for about 3 years, and also met many cool people for the first time.

My talk on AppStream/Limba went well, except that I got slightly confused by the timer showing that I had only 2 minutes left, after I had just completed the first half of my talk. It turned out that the timer was wrong

Another really nice aspect was to be able to get an insight into areas where I am usually not involved with, like visual design. It was really interesting to learn about the great work others are doing and to talk to people about their work - and I also managed to scratch an itch in the Systemsettings application, where three categories had shown the same icon. Now Systemsettings looks like it is supposed to be, finally

The only thing I could maybe complain about was the weather, which was more Scotland/Wales like than Spanish - but that didn't stop us at all, not even at the social event outside. So I actually don't complain

We also managed to discuss some new technical stuff, like AppStream for Kubuntu, and plenty of other things that I'll write about in a separate blog post.

Generally, I got so many new impressions from this year's Akademy, that I could write a 10-pages long blogpost about it while still having to leave out things.

Kudos to the organizers of this Akademy, you did a fantastic job! I also want to thank the Ubuntu community for funding my flight, and the Kubuntu community for pushing me a little to attend :-).

KDE is a great community with people driving the development of Free Software forward. The diversity of people, projects and ideas in KDE is a pleasure, and I am very happy to be part of this community.

10 Aug 2015 2:42pm GMT