25 Jul 2014

Pekka Paalanen: Wayland protocol design: object lifespan

Now that we have a few years of experience with the Wayland protocol, I thought I would put some of my observations in writing. This, what will hopefully become a series rather than just one post, considers how to design Wayland protocol extensions the right way.

This first post considers protocol object lifespan and the related races between the compositor/server and the client. I assume that the reader is already aware of the Wayland protocol basics. If not, I suggest reading Chapter 4. Wayland Protocol and Model of Operation.

How protocol objects are created

On a new Wayland connection, the only object that exists is the wl_display which is a specially constructed object. You always have it, and there is no wire protocol for creating it.

The only thing the client can create next is a wl_registry through the wl_display. The registry is the root of the whole interface (class) hierarchy. The wl_registry advertises the global objects by numerical name, and using the wl_registry.bind request to bind to a global is the first normal way to create a protocol object.

Binding is slightly special still, as the protocol specification in XML for wl_registry uses the new_id argument type, but does not specify the interface (class) for the new object. In the wire protocol, this special argument gets turned into three arguments: interface name (string), interface version (uint32_t), and the new object ID (uint32_t). This is unique in the Wayland core protocol.

The usual way to create a new protocol object is for the client to send a request that has a new_id type of argument. The protocol specification (XML) defines what the interface is, so there is no need to communicate the interface type over the wire. All that is needed on the wire is the new object ID. Almost all object creation happens this way.

Although rare, the server may also create protocol objects for the client. This happens by having a new_id type of argument in an event. Every time the client receives this event, it receives a new protocol object.

As all requests and events are always part of some interface (like a member of a class), this creates an interface hierarchy. For example, wl_compositor objects are created from wl_registry, and wl_surface objects are created from wl_compositor.
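
In code, the client-side version of that chain looks roughly like this. This is a minimal sketch of mine, not from the article; "name" and "version" are assumed to come from a wl_registry.global event received earlier. Both requests carry a new_id argument, so the returned objects exist as soon as the requests are sent:

#include <wayland-client.h>

/* Bind a wl_compositor global and create a wl_surface from it.
 * Neither call can fail at the protocol level: the new objects
 * exist the moment the requests go out on the wire. */
static struct wl_surface *
create_surface(struct wl_registry *registry, uint32_t name, uint32_t version)
{
    struct wl_compositor *compositor =
        wl_registry_bind(registry, name, &wl_compositor_interface, version);

    return wl_compositor_create_surface(compositor);
}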

Object creation never fails. Once the request or event is sent, the new object it creates exists, period. This keeps the protocol asynchronous, as there is no need to reply or check that the creation succeeded.

How protocol objects are destroyed

There are two ways to destroy a protocol object. By far the most common one is to have a request in the interface that is specified to be a destructor. Most often this request is called "destroy". When the client code calls the function wl_foobar_destroy(), the request is sent to the server and the client side proxy (struct wl_proxy) for the object gets destroyed. The server then handles the destructor request at some point in the future.

The other way is to destroy the object by an event. In that case, the interface's protocol specification must not define a destructor request, and the event must be clearly documented as destructive, because there is no automation nor safeties for this. This is for cases where the server decides when an object dies, and it requires extreme care in protocol design to work right in all cases. When a client receives such an event, all it can do is destroy the proxy. The (in)famous example of an interface like this is wl_callback.
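
As a small illustration (my own sketch, not from the article), handling wl_callback's done event, e.g. from wl_display_sync(), boils down to doing whatever work is needed and then destroying the proxy:

#include <stdio.h>
#include <wayland-client.h>

/* done is a destructor event: after receiving it, the only legal
 * thing left to do with the object is to destroy the proxy. */
static void
callback_done(void *data, struct wl_callback *callback, uint32_t callback_data)
{
    printf("sync done: %u\n", callback_data);
    wl_callback_destroy(callback);
}

static const struct wl_callback_listener callback_listener = {
    .done = callback_done,
};

/* Typical use:
 *   struct wl_callback *cb = wl_display_sync(display);
 *   wl_callback_add_listener(cb, &callback_listener, NULL);
 */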

Enter the boogeyman: races

It is very important that both the client and the server agree on which protocol objects exist. If the client sends a request on, or references as an argument, an object that does not exist in the server's opinion, the server raises a protocol error, and disconnects the client. Obviously this should never happen, nor should it happen that the server sends an event to an object that the client destroyed.

Wayland being a completely asynchronous protocol, we have no implicit guarantees. The server may send an event at the same time as the client destroys the object, and now the event targets an object the client does not know about anymore. Rather than the client shooting itself dead (that's the server's job), we have a trick in libwayland-client: it silently ignores events to destroyed objects, until the server confirms that the object is truly gone.

This works very well for interfaces where the destructor is a request. If the client first sends the destructor request and then sends another request on the destroyed object, it just shot its own head off - no race needed.

Things get tricky for the other case, destructor events. The server may send the destructor event at the same time the client is sending a request on the same object. When the server finally gets the request, the object is already gone, and the client gets taken behind the shed and shot. Therefore pretty much the only safe way to use destructor events is if the interface does not define any requests at all. Ever, not even in future extensions. Furthermore, objects with that interface should not be used as arguments anywhere, or you may hit the race. That is why destructor events are difficult to use right.

The boogeyman's brother

There is yet another nasty race with events that create objects, i.e. server-created objects. If the client is destroying the (parent) object at the same time as the server is sending an event on that object, creating a new (child) object, the server cannot know if the client actually handled the event or not. If the client ignored the event, it will never tell the server to destroy that new object, and you leak in the server.

You could try to make your way out of that pitfall by writing in your protocol specification, that when the (parent) object is destroyed, all the child objects will be destroyed implicitly. But then the client must not send the destructor request for the child objects after it has destroyed the parent, because otherwise the server sees requests on objects it does not know about, and kicks you in the groin, hard. If the child interface defines a destructor, the client cannot destroy its proxies after destroying the parent object. If the child interface does not define a destructor, you can never free the server-side resources until the parent gets destroyed.

The client could destroy all the child objects with a defined destructor in one go, and then immediately destroy the parent object. I am not sure if that works, but it might. If it does not, you have to specify a whole tear-down protocol sequence: the client tells the server it wants to destroy the parent object, the server acks and guarantees it no longer sends any events on it, and then the client actually destroys the parent object. Hey, you have a round-trip and just turned a beautiful asynchronous protocol into a synchronous one, congratulations!

Concluding with recommendations

Here are my recommendations when designing Wayland protocol extensions:

25 Jul 2014 7:01pm GMT

23 Jul 2014

Bastien Nocera: Watch out for DRI3 regressions

DRI3 has plenty of necessary fixes for X.org and Wayland, but it's still young in its integration. It's been integrated in the upcoming Fedora 21, and recently in Arch as well.

If WebKitGTK+ applications hang or become unusably slow when an HTML5 video is supposed to be playing, you might be hitting this bug.

If Totem crashes on startup, it's likely this problem, reported against cogl for now.

Feel free to add a comment if you see other bugs related to DRI3, or have more information about those.

Update: Wayland is already perfect, and doesn't use DRI3. The "DRI2" structures in Mesa are just that, structures. With Wayland, the DRI2 protocol isn't actually used.

23 Jul 2014 12:18pm GMT

Eric Anholt: vc4 driver month 1

I've just pushed the vc4-sim-validate branch to my Mesa tree. It's the culmination of the last week's worth of pondering and false starts since I got my first texture sampling in simulation last Wednesday.

Handling texturing on vc4 safely is a pain. The pointer to texture contents doesn't appear in the normal command stream, and instead it's in the uniform stream. Which uniform happens to contain the pointer depends on how many uniforms have been loaded by the time you get to the QPU_W_TMU[01]_[STRB] writes. Since there's no iommu, I can't trust userspace to tell me where the uniform is, otherwise I'd be allowing them to just lie and put in physical addresses and read arbitrary system memory.

This meant I had to write a shader parser for the kernel, have that spit out a collection of references to texture samples, switch the uniform data from living in BOs in the user -> kernel ABI to being passed in as normal system memory that gets copied to the temporary exec BO, and then do relocations on that.

Instead of trying to write this in the kernel, with a ~10 minute turnaround time per test run, I copied my kernel code into Mesa with a little bit of wrapper code to give a kernel-like API environment, and did my development on that. When I'm looking at possibly 100s of iterations to get all the validation code working, it was well worth the day spent to build that infrastructure so that I could get my testing turnaround time down to about 15 sec.
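
As a rough picture of what such a wrapper can look like, here is a purely illustrative sketch: the kind of compatibility shim that lets kernel-style validation code compile as part of a userspace tool. The macro names below mirror common kernel/DRM idioms and are my assumptions, not the actual vc4 wrapper code.

#include <stdio.h>
#include <stdlib.h>

/* Map the kernel allocation and logging helpers onto their libc
 * equivalents so the validation code builds unchanged in userspace. */
#define kmalloc(size, flags)     malloc(size)
#define kcalloc(n, size, flags)  calloc(n, size)
#define kfree(ptr)               free(ptr)
#define DRM_ERROR(fmt, ...)      fprintf(stderr, "error: " fmt, ##__VA_ARGS__)
#define DRM_INFO(fmt, ...)       printf(fmt, ##__VA_ARGS__)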

I haven't done actual validation to make sure that the texture samples don't access outside of the bounds of the texture yet (though I at least have the infrastructure necessary now), just like I haven't done that validation for so many other pointers (vertex fetch, tile load/stores, etc.). I also need to copy the code back out to the kernel driver, and it really deserves some cleanups to add sanity to the many different addresses involved (unvalidated vaddr, validated vaddr, and validated paddr of the data for each of render, bin, shader recs, uniforms). But hopefully once I do that, I can soon start bringing up glamor on the Pi (though I've got some major issue with tile allocation BO memory management before anything's stable on the Pi).

23 Jul 2014 12:42am GMT

22 Jul 2014

Ben Widawsky: Future PPGTT [part 4] (Dynamic page table allocations, 64 bit address space, GPU “mirroring”, and yeah, something about relocs too)

Preface

GPU mirroring provides a mechanism to have the CPU and the GPU use the same virtual address for the same physical (or IOMMU) page. An immediate result of this is that relocations can be eliminated. There are a few derivative benefits from the removal of the relocation mechanism, but it really all boils down to that. Other people call it other things, but I chose this name before I had heard other names. SVM would probably have been a better name had I read the OCL spec sooner. This is not an exclusive feature restricted to OpenCL. Any GPU client will hopefully eventually have this capability provided to them.

If you're going to read any single PPGTT post of this series, I think it should not be this one. I was not sure I'd write this post when I started documenting the PPGTT (part 1, part2, part3). I had hoped that any of the following things would have solidified the decision by the time I completed part3.

  1. CODE: The code is not merged, not reviewed, and not tested (by anyone but me). There's no indication about the "upstreamability". What this means is that if you read my blog to understand how the i915 driver currently works, you'll be taking a crap-shoot on this one.
  2. DOCS: The Broadwell public Programmer Reference Manuals are not available. I can't refer to them directly, I can only refer to the code.
  3. PRODUCT: Broadwell has not yet shipped. My ulterior motive had always been to rally the masses to test the code. Without product, that isn't possible.

Concomitant with these facts, my memory of the code and interesting parts of the hardware it utilizes continues to degrade. Ultimately, I decided to write down what I can while it's still fresh (for some very warped definition of "fresh").

Goal

GPU mirroring is the goal. Dynamic page table allocations are very valuable by themselves. Using dynamic page table allocations can dramatically conserve system memory when running with multiple address spaces (part 3 if you forgot), which is something that should become pretty common shortly. Consider for a moment a Broadwell legacy 32b system (more details later). You would require about 8MB of page tables to map one page of system memory. With dynamic page table allocations, this is reduced to 8K. Dynamic page table allocations are also an indirect requirement for implementing a 64b virtual address space. Having a 64b virtual address space is a pretty unremarkable feature by itself; on current workloads [that I am aware of] it provides no real benefit. Supporting 64b did require cleaning up the infrastructure code quite a bit, though, and should anything from the series get merged, I believe the result is a huge improvement in code readability.
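
For reference, here is my own back-of-the-envelope arithmetic behind that 8MB figure, assuming every page directory and page table for the 4GB legacy space is pre-allocated (4 PDP registers, each pointing at one PD page of 512 PDEs):

4\ PD\ pages + (4 \times 512)\ PT\ pages = 2052\ pages \\ 2052\ pages \times \frac{4096\ bytes}{1\ page} = 8404992\ bytes \approx 8MB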

Current Status

I briefly mentioned dogfooding these several months ago. At that time I only had the dynamic page table allocations on GEN7 working. The fallout wasn't nearly as bad as I was expecting, but things were far from stable. There was a second posting which is much more stable and contains support of everything through Broadwell. To summarize:

Feature                Status              TODO
-------                ------              ----
Dynamic page tables    Implemented         Test and fix bugs
64b Address space      Implemented         Test and fix bugs
GPU mirroring          Proof of Concept    Decide on interface; Implement interface.1

Testing has been limited to just one machine, mine, when I don't have a million other things to do. With that caveat, on top of my last PPGTT stabilization patches things look pretty stable.

Present: Relocations

Throughout many of my previous blog posts I've gone out of my way to avoid explaining relocations. My reluctance was because explaining the mechanics is quite tedious, not because it is a difficult concept. It's impossible [and extremely unfortunate for my weekend] to make the case for why these new PPGTT features are cool without touching on relocations at least a little bit. The following picture exemplifies both the CPU and GPU mapping the same pages with the current relocation mechanism.

Current PPGTT support

To get to the above state, something like the following would happen.

  1. Create BOx
  2. Create BOy
  3. Request that BOx be uncached via the DRM_IOCTL_I915_GEM_SET_CACHING ioctl.
  4. Do one of the aforementioned operations on BOx and BOy
  5. Perform execbuf2.

Accesses to the BO from the CPU require having a CPU virtual address that eventually points to the pages representing the BO2. The GPU has no notion of CPU virtual addresses (unless you have a bug in your code). Inevitably, all the GPU really cares about is physical pages; which ones. On the other hand, userspace needs to build up a set of GPU commands which sometimes need to be aware of the absolute graphics address.

Several commands do not need an absolute address. 3DSTATE_VS, for instance, does not need to know anything about where the Scratch Space Base Offset is actually located; it only needs to provide an offset relative to the General State Base Address. The General State Base Address itself does need to be known by userspace, and it is set up with the STATE_BASE_ADDRESS command.

Using the relocation mechanism gives userspace a way to inform the i915 driver about the BOs which need an absolute address. The handles plus some information about the GPU commands that need absolute graphics addresses are submitted at execbuf time. The kernel will make a GPU mapping for all the pages that constitute the BO, process the list of GPU commands needing updates, and finally submit the work to the GPU.
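
For the curious, this is the per-pointer record userspace hands to the kernel at execbuf time; it is the struct as found in the i915 uapi header (include/uapi/drm/i915_drm.h), with comments added by me:

struct drm_i915_gem_relocation_entry {
        __u32 target_handle;    /* handle of the BO whose address is needed */
        __u32 delta;            /* value added to the target's final address */
        __u64 offset;           /* where, inside the BO being patched, to write it */
        __u64 presumed_offset;  /* userspace's guess; if correct, no patching is needed */
        __u32 read_domains;
        __u32 write_domain;
};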

Future: No relocations

GPU Mirroring

The diagram above demonstrates the goal: symmetric mappings to a BO on both the GPU and the CPU. There are benefits to ditching relocations. One of the nice side effects of getting rid of relocations is that it allows us to drop the use of the DRM memory manager and simply rely on malloc as the address space allocator. The DRM memory allocator does not get the same amount of attention with regard to performance as malloc does. Even if it did perform as ideally as possible, it's still a superfluous CPU workload. Other people can probably explain the CPU overhead in better detail. Oh, and OpenCL 2.0 requires it.

"OpenCL 2.0 adds support for shared virtual memory (a.k.a. SVM). SVM allows the host and 
kernels executing on devices to directly share complex, pointer-containing data structures such 
as trees and linked lists. It also eliminates the need to marshal data between the host and devices. 
As a result, SVM substantially simplifies OpenCL programming and may improve performance."

Makin' it Happen

64b

As I've already mentioned, the most obvious requirement is expanding the GPU address space to match the CPU.

Page Table Hierarchy

If you have taken any sort of Operating Systems class, or read up on Linux MM within the last 10 years or so, the above drawing should be incredibly unremarkable. If you have not, you're probably left with a big 'WTF' face. I probably can't help you if you're in the latter group, but I do sympathize. For the other camp: Broadwell brought 4 level page tables that work exactly how you'd expect them to. Instead of the x86 CPU's CR3, GEN GPUs have PML4. When operating in legacy 32b mode, there are 4 PDP registers that each point to a page directory and therefore map 4GB of address space3. The register is just a simple logical address pointing to a page directory. The actual changes in hardware interactions are trivial on top of all the existing PPGTT work.

The keen observer will notice that there are only 256 PML4 entries. This has to do with the way in which we've come about 64b addressing in x86. This wikipedia article explains it pretty well, and has links.
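
To make the hierarchy concrete, this is roughly how a 48-bit virtual address is carved up under 4-level paging: 9 index bits per level plus a 12-bit page offset. The helpers below are mine, though the shift macros mirror the names used in the sample code later in this post:

#define GEN8_PML4E_SHIFT 39     /* bits 47..39 index the PML4 */
#define GEN8_PDPE_SHIFT  30     /* bits 38..30 index the PDP  */
#define GEN8_PDE_SHIFT   21     /* bits 29..21 index the PD   */
#define GEN8_PTE_SHIFT   12     /* bits 20..12 index the PT   */
#define GEN8_INDEX_MASK  0x1ff  /* 9 bits: 512 entries per level */

static inline int gen8_pml4e(unsigned long addr) { return (addr >> GEN8_PML4E_SHIFT) & GEN8_INDEX_MASK; }
static inline int gen8_pdpe(unsigned long addr)  { return (addr >> GEN8_PDPE_SHIFT)  & GEN8_INDEX_MASK; }
static inline int gen8_pde(unsigned long addr)   { return (addr >> GEN8_PDE_SHIFT)   & GEN8_INDEX_MASK; }
static inline int gen8_pte(unsigned long addr)   { return (addr >> GEN8_PTE_SHIFT)   & GEN8_INDEX_MASK; }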

"This will take one week. I can just allocate everything up front." (Dynamic Page Table Allocation)

Funny story. I was asked to estimate how long it would take me to get this GPU mirror stuff in shape for a very rough proof of concept. "One week. I can just allocate everything up front." If what I have is "done", then I was off by 10x.

Where I went wrong in my estimate was math. If you consider the above, you quickly see why allocating everything up front is a terrible idea and flat out impossible on some systems.

Page for the PML4
512 PDP pages per PML4 (512, ok we actually use 256)
512 PD pages per PDP (256 * 512 pages for PDs)
512 PT pages per PD (256 * 512 * 512 pages for PTs)
(256 * 512^2 + 256 * 512 + 256 + 1) * PAGE_SIZE = ~256G = oops

Dissimilarities to x86

First and foremost, there are no GPU page faults to speak of. We cannot demand allocate anything in the traditional sense. I was naive though, and one of the first thoughts I had was: the Linux kernel [heck, just about everything that calls itself an OS] manages 4-level page tables on multiple architectures. The page table format on Broadwell is remarkably similar to x86 page tables. If I can't use the code directly, surely I can copy it. Wrong.

Here is some code from the Linux kernel which demonstrates how you can get a PTE for a given address in Linux.

typedef unsigned long   pteval_t;
typedef struct { pteval_t pte; } pte_t;

static inline pteval_t native_pte_val(pte_t pte)
{
        return pte.pte;
}

static inline pteval_t pte_flags(pte_t pte)
{
        return native_pte_val(pte) & PTE_FLAGS_MASK;
}

static inline int pte_present(pte_t a)
{
        return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
                               _PAGE_NUMA);
}
static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
{
        return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
}
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))

#define pgd_offset(mm, address) ( (mm)->pgd + pgd_index((address)))
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
        return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

/* My completely fabricated example of finding page presence */
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep;
struct mm_struct *mm = current->mm;
unsigned long address = 0xdefeca7e;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
ptep = pte_offset_map(pmd, address);
printk("Page is present: %s\n", pte_present(*ptep) ? "yes" : "no");

x86 page table code has two very distinct properties that do not exist here (warning, this is slightly hand-wavy).

  1. The kernel knows exactly where in physical memory the page tables reside4. On x86, it need only read CR3. We don't know where our page tables reside in physical memory because of the IOMMU. When VT-d is enabled, the i915 driver only knows the DMA address of the page tables.
  2. There is a strong correlation between a CPU process and an mm (set of page tables). Keeping mappings around of the page tables is easy to do if you don't want to take the hit to map them every time you need to look at a PTE.

If the Linux kernel needs to find if a page is present or not without taking a fault, it need only look to one of those two options. After about a week of making the IOMMU driver do things it shouldn't do, and trying to push the square block through the round hole, I gave up on reusing the x86 code.

Why Do We Actually Need Page Table Tracking?

The IOMMU interfaces were not designed to pull a physical address from a DMA address. Pre-allocation is right out. It's difficult to try to get the instantaneous state of the page tables…

Another thought I had very early on was that tracking could be avoided if we just never tore down page tables. I knew this wasn't a good solution, but at that time I just wanted to get the thing working and didn't really care if things blew up spectacularly after running for a few minutes. There is actually a really easy set of operations that show why this won't work. For the following, think of the four level page tables as arrays. ie.

  1. [mesa] Create a 2M sized BO. Write to it. Submit it via execbuffer.
  2. [i915] See the new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM] Find that address 0 is free.
    2. [i915] Allocate PDP for PML4[0]
    3. [i915] Allocate PD for PDP[0][0]
    4. [i915] Allocate PT for PD[0][0][0]
    5. [i915] (condensed) Set pointers from PML4->PDP->PD->PT
    6. [i915] Set the 512 PTEs PT[0][0][0][0][511-0] to point to the BO's backing pages.
  3. [i915] Dispatch work to the GPU on behalf of mesa.
  4. [i915] Observe the hardware has completed.
  5. [mesa] Create a 4k sized BO. Write to it. Submit both BOs via execbuffer.
  6. [i915] See the new BO in the execbuffer list. Allocate page tables for it…
    1. [DRM] Find that address 0x200000 is free.
    2. [i915] Allocate PDP[0][0], PD[0][0][0], PT[0][0][0][1].
    3. Set pointers… Wait. Is PDP[0][0] allocated already? Did we already set pointers? I have no freaking idea!
    4. Abort.

Page Tables Tracking with Bitmaps

Okay, I could have used a sentinel for empty entries. It is possible to achieve the same thing by using a sentinel value (point the page table entry to the scratch page). Implementing this involves reading back potentially large amounts of data from the page tables, which would be slow. It should work though. I didn't try it.

After I had determined I couldn't reuse x86 code, and that I needed some way to track which page table elements were allocated, I was pretty set on using bitmaps for tracking usage. The idea of a hash table came and went - none of the upsides of a hash table are useful here, but all of the downsides are present (space). Bitmaps were sort of the default case. Unfortunately though, I did some math at this point - notice the LaTeX!
\frac{2^{47}\ bytes}{\frac{4096\ bytes}{1\ page}} = 34359738368\ pages \\ 34359738368\ pages \times \frac{1\ bit}{1\ page} = 34359738368\ bits \\ 34359738368\ bits \times \frac{1\ byte}{8\ bits} = 4294967296\ bytes
That's 4GB simply to track every page. There's some more overhead because page [tables, directories, directory pointers] are also tracked.
 256\ entries + (256\times512)\ entries + (256\times512^2)\ entries = 67240192\ entries \\ 67240192\ entries \times \frac{1\ bit}{1\ entry} = 67240192\ bits \\ 67240192\ bits \times \frac{1\ byte}{8\ bits} = 8405024\ bytes \\ 4294967296\ bytes + 8405024\ bytes = 4303372320\ bytes \\ 4303372320\ bytes \times \frac{1\ GB}{1073741824\ bytes} = 4.0078\ GB

I can't remember whether I had planned to statically pre-allocate the bitmaps, or I was so caught up in the details that I couldn't see the big picture. I remember thinking, 4GB just for the bitmaps, that will never fly. I probably spent a week trying to figure out a better solution. When we invent time travel, I will go back and talk to my former self: 4GB of bitmap tracking if you're using 128TB of memory is inconsequential. That is about 0.003% of the memory used by the GPU. Hopefully you didn't fall into that trap, and I just wasted your time, but there it is anyway.

Sample code to walk the page tables

This code does not actually exist, but it is very similar to the real code. The following shows how one would "walk" to a specific address allocating the necessary page tables and setting the bitmaps along the way. Teardown is a bit harder, but it is similar.

static struct i915_pagedirpo *
alloc_one_pdp(struct i915_pml4 *pml4, int entry)
{
        ...
}

static struct i915_pagedir *
alloc_one_pd(struct i915_pagedirpo *pdp, int entry)
{
        ...
}

static struct i915_pagetab *
alloc_one_pt(struct i915_pagedir *pd, int entry)
{
        ...
}

/**
 * alloc_page_tables - Allocate all page tables for the given virtual address.
 *
 * This will allocate all the necessary page tables to map exactly one page at
 * @address. The page tables will not be connected, and the PTE will not point
 * to a page.
 *
 * @ppgtt:      The PPGTT structure encapsulating the virtual address space.
 * @address:    The virtual address for which we want page tables.
 *
 */
static void
alloc_page_tables(struct i915_hw_ppgtt *ppgtt, unsigned long address)
{
        struct i915_pagetab *pt;
        struct i915_pagedir *pd;
        struct i915_pagedirpo *pdp;
        struct i915_pml4 *pml4 = &ppgtt->pml4; /* Always there */

        int pml4e = (address >> GEN8_PML4E_SHIFT) & GEN8_PML4E_MASK;
        int pdpe = (address >> GEN8_PDPE_SHIFT) & GEN8_PDPE_MASK;
        int pde = (address >> GEN8_PDE_SHIFT) & I915_PDE_MASK;
        int pte = (address >> PAGE_SHIFT) & (I915_PDES_PER_PD - 1); /* 512 PTEs per PT */

        if (!test_bit(pml4e, pml4->used_pml4es))
                goto alloc_pdp;

        pdp = pml4->pagedirpo[pml4e];
        if (!test_bit(pdpe, pdp->used_pdpes))
                goto alloc_pd;

        pd = pdp->pagedirs[pdpe];
        if (!test_bit(pde, pd->used_pdes))
                goto alloc_pt;

        pt = pd->page_tables[pde];
        if (test_bit(pte, pt->used_ptes))
                return;         /* everything is already in place */
        goto mark_pte;          /* tables exist, just mark the PTE as used */

        /* Each allocation falls through to the next, finer-grained level. */
alloc_pdp:
        pdp = alloc_one_pdp(pml4, pml4e);
        set_bit(pml4e, pml4->used_pml4es);
alloc_pd:
        pd = alloc_one_pd(pdp, pdpe);
        set_bit(pdpe, pdp->used_pdpes);
alloc_pt:
        pt = alloc_one_pt(pd, pde);
        set_bit(pde, pd->used_pdes);
mark_pte:
        set_bit(pte, pt->used_ptes);
}

Here is a picture which shows the bitmaps for the 2 allocation example above.

Bitmaps tracking page tables

The GPU mirroring interface

I really don't want to spend too much time here. In other words, no more pictures. As I've already mentioned, the interface was designed for a proof of concept which already had code using userptr. The shortest path was to simply reuse the interface.

In the patches I've submitted, 2 changes were made to the existing userptr interface (which wasn't then, but is now, merged upstream). I added a context ID, and the flag to specify you want mirroring.

struct drm_i915_gem_userptr {
        __u64 user_ptr;
        __u64 user_size;
        __u32 ctx_id;
        __u32 flags;
#define I915_USERPTR_READ_ONLY          (1<<0)
#define I915_USERPTR_GPU_MIRROR         (1<<1)
#define I915_USERPTR_UNSYNCHRONIZED     (1<<31)
        /**
         * Returned handle for the object.
         *
         * Object handles are nonzero.
         */
        __u32 handle;
        __u32 pad;
};

The context argument is to tell the i915 driver for which address space we'll be mirroring the BO. Recall from part 3 that a GPU process may have multiple contexts. The flag is simply to tell the kernel to use the value in user_ptr as the address to map the BO in the virtual address space of the GEN GPU. When using the normal userptr interface, the i915 driver will pick the GPU virtual address.
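
For illustration only, this is roughly how the proposed interface would be driven from userspace. The ctx_id field and the I915_USERPTR_GPU_MIRROR flag are the additions from the patches above; they do not exist in the upstream ioctl, so this sketch assumes the patched header shown earlier:

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>

static int gem_userptr_mirror(int fd, void *ptr, uint64_t size,
                              uint32_t ctx_id, uint32_t *handle)
{
        struct drm_i915_gem_userptr arg;

        memset(&arg, 0, sizeof(arg));
        arg.user_ptr = (uintptr_t)ptr;          /* also becomes the GPU virtual address */
        arg.user_size = size;
        arg.ctx_id = ctx_id;                    /* proposed field: which PPGTT to mirror into */
        arg.flags = I915_USERPTR_GPU_MIRROR;    /* proposed flag */

        if (drmIoctl(fd, DRM_IOCTL_I915_GEM_USERPTR, &arg))
                return -1;

        *handle = arg.handle;
        return 0;
}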

What should be: soft pin

There hasn't been too much discussion here, so it's hard to say. I believe the trend of the discussion (and the author's personal preference) would be to add flags to the existing execbuf relocation mechanism. The flag would tell the kernel to not relocate it, and to use the presumed_offset field that already exists. This is sometimes called "soft pin". It is a bit of a chicken and egg problem, since the amount of work in userspace to make this useful is non-trivial, and the feature can't be merged until there is an open source userspace. Stay tuned. Perhaps I'll update the blog as the story unfolds.

Wrapping it up (all 4 parts)

As usual, please report bugs or ask questions.

So with the 4 parts you should understand how the GPU interacts with system memory. You should know what the Global GTT is, why it still exists, and how it works. You might recall what a PPGTT is, and the intricacies of multiple address spaces. Hopefully you remember what you just read about 64b and GPU mirroring. Expect a rebased patch series from me soon with all that was discussed (quite a bit has changed around me since my original posting of the patches).

This is the last post I will be writing on how GEN hardware interfaces with system memory, and how that relates to the i915 driver. Unlike the Rocky movie series, I will stop at the 4th. Like the Rocky movie series, I hope this is the best. Yes, I just went there.

Unlike the usual, "buy me a beer if you liked this", I would like to buy you a beer if you read it and considered giving me feedback. So if you know me, or meet me somewhere, feel free to reclaim the voucher.

Image links

The images I've created. Feel free to do with them as you please.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/legacy.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/mirrored.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/table_hierarchy.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/addr-bitmap.svg

  1. The patches I posted for enabling GPU mirroring piggyback off of the existing userptr interface. Before those patches were merged I added some info to the API (a flag + context) for the purpose of testing. I needed to get this working quickly and porting from the existing userptr code was the shortest path. Since then userptr has been merged without this extra info, which makes things difficult for people trying to test things. In any case an interface needs to be agreed upon. My preference would be to do this via the existing relocation flags. One could add a new flag called "SOFT_PIN"

  2. The GEM and BO terminology is a fancy sounding wrapper for the notion that we want an interface to coherently write data which the GPU can read (input), and have CPU observe data which the GPU has written (output)

  3. The PDP registers are not PDPEs because they do not have any of the associated flags of a PDPE. Also, note that in my patch series I submitted a patch which defines the number of these to be PDPE. This is incorrect.

  4. I am not sure how KVM manages page tables. At least conceptually I'd think it has a similar problem to the i915 driver's page table management. I should have probably looked a bit closer as I may have been able to leverage that; but I didn't have the idea until just now… looking at the KVM code, it does have a lot of similarities to the approach I took

  5. Let me be clear that I don't think userptr is a bad thing. It's a very hard thing to get right, and much of the trickery needed for it is *not* needed for GPU mirroring

22 Jul 2014 4:45am GMT

21 Jul 2014

Keith Packard: Glamorous Intel

Reworking Intel Glamor

The original Intel driver Glamor support was based on the notion that it would be better to have the Intel driver capture any fall backs and try to make them faster than Glamor could do internally. Now that Glamor has reasonably complete acceleration, and its fall backs aren't terrible, this isn't as useful as it once was, and because this uses Glamor in a weird way, we're making the Glamor code harder to maintain.

Fixing the Intel driver to not use Glamor in this way took a bit of effort; the UXA support is all tied into the overall operation of the driver.

Separating out UXA functions

The first task was to just identify which functions were UXA-specific by adding "_uxa" to their names. A couple dozen sed runs and now a bunch of the driver is looking better.

Next, a pile of UXA-specific functions were actually inside the non-UXA parts of the code. Those got moved out, and a new "intel_uxa.h" file was created to hold all of the definitions.

Finally, a few non UXA-specific functions were actually in the uxa files; those got moved over to the generic code.

Removing the Glamor paths in UXA

Each one of the UXA functions had a little piece of code at the top like:

if (uxa_screen->info->flags & UXA_USE_GLAMOR) {
    int ok = 0;

    if (uxa_prepare_access(pDrawable, UXA_GLAMOR_ACCESS_RW)) {
        ok = glamor_fill_spans_nf(pDrawable,
                      pGC, n, ppt, pwidth, fSorted);
        uxa_finish_access(pDrawable, UXA_GLAMOR_ACCESS_RW);
    }

    if (!ok)
        goto fallback;

    return;
}

Pulling those out shrank the UXA code by quite a bit.

Selecting Acceleration (or not)

The intel driver only supported UXA before; Glamor was really just a slightly different mode for UXA. I switched the driver from using a bit in the UXA flags to having an 'accel' variable which could be one of three options: ACCEL_NONE, ACCEL_UXA or ACCEL_GLAMOR.

I added ACCEL_NONE to give us a dumb frame buffer mode. That actually supports DRI3 so that we can bring up Mesa and run it under X before we have any acceleration code ready; avoiding a dependency loop when doing new hardware. All that it requires is a kernel that offers mode setting and buffer allocation.
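
As a sketch (mine, not the driver's actual declaration), the selector is just a three-valued enum matching the switch shown below:

enum intel_accel_method {
        ACCEL_NONE,     /* dumb frame buffer, no acceleration */
        ACCEL_UXA,      /* classic UXA acceleration */
        ACCEL_GLAMOR,   /* GL-based acceleration via Glamor */
};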

Initializing Glamor

With UXA no longer supporting Glamor, it was time to plug the Glamor support into the top of the driver. That meant changing a bunch of the entry points to select appropriate Glamor or UXA functionality, instead of just calling into UXA. So, now we've got lots of places that look like:

        switch (intel->accel) {
#if USE_GLAMOR
        case ACCEL_GLAMOR:
                if (!intel_glamor_create_screen_resources(screen))
                        return FALSE;
                break;
#endif
#if USE_UXA
        case ACCEL_UXA:
                if (!intel_uxa_create_screen_resources(screen))
                        return FALSE;
        break;
#endif
        case ACCEL_NONE:
                if (!intel_none_create_screen_resources(screen))
                        return FALSE;
                break;
        }

Using a switch means that we can easily elide code that isn't wanted in a particular build. Of course 'accel' is an enum, so places which are missing one of the required paths will cause a compiler warning.

It's not all perfectly clean yet; there are piles of UXA-only paths still.

Making It Build Without UXA

The final trick was to make the driver build without UXA turned on; that took several iterations before I had the symbols sorted out appropriately.

I built the driver with various acceleration options and then tried to count the lines of source code. What I did was just list the source files named in the driver binary itself. This skips all of the header files and the render program source code, and ignores the fact that there are a bunch of #ifdef's in the uxa directory selecting between uxa, glamor and none.

    Accel                    Lines          Size(B)
    -----------             ------          -------
    none                      7143            73039
    glamor                    7397            76540
    uxa                      25979           283777
    sna                     118832          1303904

    none legacy              14449           152480
    glamor legacy            14703           156125
    uxa legacy               33285           350685
    sna legacy              126138          1395231

The 'legacy' addition supports i810-class hardware, which is needed for a complete driver.

Along The Way, Enable Tiling for the Front Buffer

While hacking the code, I discovered that the initial frame buffer allocated for the screen was created without tiling (!) because a few parameters that depend on the GTT size were not initialized until after that frame buffer was allocated. I haven't analyzed what effect this has on performance.

Page Flipping and Resize

Page flipping (or just flipping) means switching the entire display from one frame buffer to another. It's generally the fastest way of updating the screen as you don't have to copy any bits.

The trick with flipping is that a client hands you a random pixmap and you need to stuff that into the KMS API. With UXA, that's pretty easy as all pixmaps are managed through the UXA API which knows which underlying kernel BO is tied with each pixmap. Using Glamor, only the underlying GL driver knows the mapping. Fortunately (?), we have the EGL Image extension, which lets us take a random GL texture and turn it into a file descriptor for a DMA-BUF kernel object. So, we have this cute little dance:

fd = glamor_fd_from_pixmap(screen, pixmap, &stride, &size);

bo = drm_intel_bo_gem_create_from_prime(intel->bufmgr, fd, size);
close(fd);
intel_glamor_get_pixmap(pixmap)->bo = bo;

That last bit remembers the bo in some local memory so we don't have to do this more than once for each pixmap. glamor_fd_from_pixmap ends up calling eglCreateImageKHR followed by gbm_bo_import and then a kernel ioctl to convert a prime handle into an fd. It's all quite round-about, but it does seem to work just fine.

After I'd gotten Glamor mostly working, I tried a few OpenGL applications and discovered flipping wasn't working. That turned out to have an unexpected consequence - all full-screen applications would run flat-out, and not be limited to frame rate. Present 'recovers' from a failed flip queue operation by immediately performing a CopyArea; not waiting for vblank. This needs to get fixed in Present by having it re-queue the CopyArea for the right time. What I did in the intel driver was to add a bunch more checks for tiling mode, pixmap stride and other things to catch pixmaps that were going to fail before the operation was queued and forcing them to fall back to CopyArea at the right time.

The second adventure was with XRandR. Glamor has an API to fix up the screen pixmap for a new frame buffer, but that pulls the size of the frame buffer out of the pixmap instead of out of the screen. XRandR leaves the pixmap size set to the old screen size during this call; fixing that just meant getting the pixmap size set correctly before calling into glamor. I think glamor should get fixed to use the screen size rather than the pixmap size.

Painting Root before Mode set

The X server has generally done initialization in one order:

  1. Create root pixmap
  2. Set video modes
  3. Paint root window

Recently, we've added a '-background none' option to the X server which causes it to set the root window background to none and have the driver fill in that pixmap with whatever contents were on the screen before the X server started.

In a pre-Glamor world, that was done by hacking the video driver to copy the frame buffer console contents to the root pixmap as it was created. The trouble here is that the root pixmap is created long before the upper layers of the X server are ready for drawing, so you can't use the core rendering paths. Instead, UXA had kludges to call directly into the acceleration functions.

What we really want though is to change the order of operations:

  1. Create root pixmap
  2. Paint root window
  3. Set video mode

That way, the normal root window painting operation will take care of getting the image ready before that pixmap is ever used for scanout. I can use regular core X rendering to get the original frame buffer contents into the root window, and even if we're not using -background none and are instead painting the root with some other pattern (like the root weave), I get that presented without an intervening black flash.

That turned out to be really easy - just delay the call to I830EnterVT (which sets the modes) until the server is actually running. That required one additional kludge - I needed to tell the DIX level RandR functions about the new modes; the mode setting operation used during server init doesn't call up into RandR as RandR lists the current configuration after the screen has been initialized, which is when the modes used to be set.

Calling xf86RandR12CreateScreenResources does the trick nicely. Getting the root window bits from fbcon, setting video modes and updating the RandR/Xinerama DIX info is now all done from the BlockHandler the first time it is called.

Performance

I ran the current glamor version of the intel driver with the master branch of the X server and there were not any huge differences since my last Glamor performance evaluation aside from GetImage. The reason is that UXA/Glamor never called Glamor's image functions, and the UXA GetImage is pretty slow. Using Mesa's image download turns out to have a huge performance benefit:

1. UXA/Glamor from April
2. Glamor from today

       1                 2                 Operation
------------   -------------------------   -------------------------
     50700.0        56300.0 (     1.110)   ShmGetImage 10x10 square 
     12600.0        26200.0 (     2.079)   ShmGetImage 100x100 square 
      1840.0         4250.0 (     2.310)   ShmGetImage 500x500 square 
      3290.0          202.0 (     0.061)   ShmGetImage XY 10x10 square 
        36.5          170.0 (     4.658)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     49800.0        50200.0 (     1.008)   GetImage 10x10 square 
      5690.0        19300.0 (     3.392)   GetImage 100x100 square 
       609.0         1360.0 (     2.233)   GetImage 500x500 square 
      3100.0          206.0 (     0.066)   GetImage XY 10x10 square 
        36.4          183.0 (     5.027)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square

Running UXA from today the situation is even more dire; I suspect that enabling tiling has made CPU reads through the GTT even worse than before?

1: UXA today
2: Glamor today

       1                 2                 Operation
------------   -------------------------   -------------------------
     43200.0        56300.0 (     1.303)   ShmGetImage 10x10 square 
      2600.0        26200.0 (    10.077)   ShmGetImage 100x100 square 
       130.0         4250.0 (    32.692)   ShmGetImage 500x500 square 
      3260.0          202.0 (     0.062)   ShmGetImage XY 10x10 square 
        36.7          170.0 (     4.632)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     41700.0        50200.0 (     1.204)   GetImage 10x10 square 
      2520.0        19300.0 (     7.659)   GetImage 100x100 square 
       125.0         1360.0 (    10.880)   GetImage 500x500 square 
      3150.0          206.0 (     0.065)   GetImage XY 10x10 square 
        36.1          183.0 (     5.069)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square

Of course, this is all just x11perf, which doesn't represent real applications at all well. However, there are applications which end up doing more GetImage than would seem reasonable, and it's nice to have this kind of speed up.

Status

I'm running this on my crash box to get some performance numbers and continue testing it. I'll switch my desktop over when I feel a bit more comfortable with how it's working. But, I think it's feature complete at this point.

Where's the Code

As usual, the code is in my personal repository. It's on the 'glamor' branch.

git://people.freedesktop.org/~keithp/xf86-video-intel  glamor

21 Jul 2014 7:39am GMT

19 Jul 2014

Samuel Pitoiset: Implement MP counters for nv50 (compute only)

Hello,

As part of my Google Summer of Code project I implemented MP counters (for compute only) on nv50/tesla. This work follows the implementation of MP counters for nvc0/fermi I did last year.

Compute counters are used by OpenCL while graphics counters are used to count hardware-related activities of OpenGL applications. The distinction between these two types of counters made by NVIDIA is arbitrary and won't be present in my implementation. That's why compute counters can also be used to give detailed information about OpenGL applications, like the number of instructions processed per frame or the number of launched warps.

MP performance counters are local and per-context while performance counters, programmed through the PCOUNTER engine, are global. A MP counter is more accurate than a global counter because it counts hardware-related activities for each context separately while a global counter reports activities regardless of the context that generates it.

All of these MP counters have been reverse engineered using CUPTI, the NVIDIA CUDA profiling tools interface which only exposes compute counters. On nv50/tesla, CUPTI exposes 13 performance counters like instructions or warp_serialize. The nv50 family has 4 MP counters per TPC (Texture Processing Cluster).

Currently, this prototype implements an interface between the kernel and mesa which exposes these MP performance counters to the user through the Gallium HUD. Basically, this interface can configure and poll a counter using the push buffer and a set of software methods.

To configure a MP counter we use the command stream like the blob does. We have two methods: the first one configures the counter (mode, signal, unit and logic operation) and the second one is just used to reinitialize the counter. Then, to select the group of the MP counter we have added a software method. To poll counters we use a notifier buffer object which is allocated along with a channel. This notifier allows the kernel and mesa to communicate. This approach has already been explained in my latest article.

To sum up, this prototype adds support for 13 performance counters on nv50/tesla. All of the code is available on my github account. If you are interested, you can take a look at the mesa and the nouveau code.

Have a good day.


19 Jul 2014 5:41pm GMT

17 Jul 2014

Vincent Untz: Stepping down as openSUSE Board Chairman

Two years ago, I got appointed as chairman of the openSUSE Board. I was very excited about this opportunity, especially as it allowed me to keep contributing to openSUSE, after having moved to work on the cloud a few months before. I remember how I wanted to find new ways to participate in the project, and this was just a fantastic match for this. I had been on the GNOME Foundation board for a long time, so I knew it would not be easy and always fun, but I also knew I would pretty much enjoy it. And I did.

Fast-forward to today: I'm still deeply caring about the project and I'm still excited about what we do in the openSUSE board. However, some happy event to come in a couple of months means that I'll have much less time to dedicate to openSUSE (and other projects). Therefore I decided a couple of months ago that I would step down before the end of the summer, after we'd have prepared the plan for the transition. Not an easy decision, but the right one, I feel.

And here we are now, with the official news out: I'm no longer the chairman :-) (See also this thread) Of course I'll still stay around and contribute to openSUSE, no worry about that! But as mentioned above, I'll have less time for that as offline life will be more "busy".

openSUSE Board Chairman at oSC14

Since I mentioned that we were working on a transition... First, knowing the current board, I have no doubt everything will be kept pushed in the right direction. But on top of that, my good friend Richard Brown has been appointed as the new chairman. Richard knows the project pretty well and he has been on the board for some time now, so is aware of everything that's going on. I've been able to watch his passion for the project, and that's why I'm 100% confident that he will rock!

17 Jul 2014 2:40pm GMT

Luc Verhaegen: ARM Mali Midgard, as featured on Anandtech.

Anandtech recently went all out on the ARM Midgard architecture (Mali T series). This was quite astounding, as ARM MPD tends to be a pretty closed shop. The Anandtech coverage included an in-depth view of the Mali Midgard GPU, a (short) Q&A session with Jem Davies (the head honcho of ARM MPD, ARM's Media Processing Division, the part of ARM that develops the Mali and display and video engines) and a Google hangout with Jem Davies a week later.

This set of articles does not seem like the sort of thing that ARM MPD would have initiated itself. Since both Imagination Technologies and NVidia did something similar months earlier, my feeling is that this was either initiated by Anand Lal Shimpi himself, or that this was requested by ARM marketing in response to the other articles.

Several interesting observations can be made from this though, especially from the answers (or sometimes, lack thereof) to the Q&A and google hangout sessions.

Hiding behind Linaro.

First off, Mr Davies still does not see an open source driver as a worthwhile endeavour for ARM MPD, and this is a position that hasn't changed since I started the lima driver, when my former employer went and talked to ARM management. Rumour has it that most of ARM's engineers both in MPD and other departments would like this to be different, and that Mr Davies is mostly alone with his views, but that's currently just hearsay. He himself states that there are only business reasons against an open source driver for the Mali.

To give some weight to this, Mr Davies stated that he contributed to the linux kernel, and I called him out on that one, as I couldn't find any mention of him in a kernel git tree. It seems, however, that his contributions are from during the Bitkeeper days, and that the author trail on those changes probably got lost. But having contributed to a project at one point or another is, to me, not proof that one actively supports the idea of open source software; at best it proves that adding support to the kernel for a given ARM device or subsystem was simply necessary at one point.

Mr Davies also talked about how ARM is investing a lot in linaro, as proof of ARM's support of open source software. Linaro is a consortium to further linux on ARM, so by definition ARM plays a very big role in it. But it is not ARM MPD that drives linaro, it is ARM itself. So this is not proof of ARM MPD actively supporting open source software. Mr Davies did not claim differently, but this distinction should be made very clear in this context.

Then, linaro can be described as an industry consortium. For non-founding members of a consortium, such a construction is often used to park some less useful people while gaining the privilege to claim involvement as and when desired. The difference to other consortiums is that most of the members come from a deeply embedded background, where the word "open" never was spoken before, and, magically, simply by having joined linaro, those deeply embedded companies now feel like they successfully ticked the "open source" box on their marketing checklist. Several of linaro's members are still having severe difficulty conforming to the GPL, but they still proudly wear the linaro badge as proof of their open source...ness?

As a prominent member of the sunxi community, I am most familiar with Allwinner, a small Chinese designer of cheap SoCs. At the start of the year, we were seeing some solid signs of Allwinner opening up to our community directly. In March, however, Allwinner joined linaro and people were hopeful that this meant that a new era of openness had started for Allwinner. As usual, I was the only cynical voice and I warned that this could mean that Allwinner now wouldn't see the need to further engage with us. Ever since, we haven't been able to reach our contacts inside Allwinner anymore, and even our requests for compliance with the GPL get ignored.

Linaro membership does not absolve from limited open source involvement or downright license violation, but for many members, this is exactly how it is used. Linaro seems to be a get-out-of-jail-free card for several of its members. Linaro membership does not need to prove anything, linaro membership even seems to have the opposite effect in several cases.

ARM driving linaro is simply no proof that ARM MPD supports open source software.

The patent excuse.

I am amazed that people still attempt to use this as an argument against open source graphics drivers.

Usually this is combined with the claim that open source drivers are exposing too much of the inner workings of the hardware. But this logic in itself states that the hardware is the problem, not the software. The hardware itself might or might not have patent issues, and it is just a matter of time before the owner of said breached patents will come a-knocking. At best, an open source driver might speed up the discovery of said issues, but the driver itself never is the cause, as the problems will have been there all along.

One would actually think that the Anandtech article about the midgard architecture would reveal more about the hardware, and trigger more litigation, than the lima driver could ever do, especially given how neatly packaged an in depth anandtech article is. Yet ARM MPD seemed to have had no issue with exposing this much information in their marketing blitz.

I also do not believe that patents are such a big issue. If graphics hardware patents were such big business, you would expect that an industry expert in graphics, especially one who is a dab hand at reverse engineering, would be contacted all the time to help expose potential patent issues. Yet I have never been contacted, and I know of no-one who ever has been.

Similarly, the first bits of lima code were made available 2.5 years ago, with bits trickling out slowly (much to my regret), and there are still several unknowns today. If lima played any role in patent disputes, you would again expect that I would be asked to support those looking to assert their patents. Again, nothing.

GPU Patents are just an excuse, nothing more.

When I was at SuSE, we freed ATI for AMD, and we never did hear that excuse. AMD wanted a solid open source strategy for ATI as ATI was not playing ball after the merger, and the bad publicity was hurting server (CPU) sales. Once the decision was made to go the open source route, patents suddenly were not an issue anymore. We did however have to deal with IP issues (or actually, AMD did - we made very sure we didn't get anything that wasn't supposed to be free), such as HDCP and media decoding, which ATI was not at liberty to make public. Given the very heated war that ATI and Nvidia fought at the time, and the huge amount of revenue in this market, you would think that ATI would be a very likely candidate for patent litigation, yet this never stood in the way of an open source driver.

There is another reason as to why patents are that popular an excuse. The words "troll" and "legal wrangling" are often sprinkled around as well, so that images of shady deals being made by lawyers in smoky backrooms usually come to mind. Yet we never get to hear the details of patent cases, as even Mr Davies himself states that ARM is not making details available of ongoing cases. I also do not know of any public details on cases that have been closed already (not that I have actively looked - feel free to enlighten me). Patents are a perfect blanket excuse where proof apparently does not seem to be required.

We open source developers are very much aware of the damage that software patents do, and this makes the patent weapon perfect for deployment against those who support open source software. But there is a difference between software patents and the patent cases that ARM potentially has to deal with on the Mali. Yet we seem to have made patents our own kryptonite, and are way too easily lulled into backing off at the first mention of the word patent.

Patents are a poor excuse, as there is no direct relationship between an open source driver and the patent litigation around the hardware.

The Resources discussion.

For a hardware vendor (or IP provider), doing a free software driver is never free. A lot of developer time does need to be invested, and this is an ongoing commitment. So yes, a viable open source driver for the Mali will consume some amount of resources.

Mr Davies states that MPD would have to incur this cost on its own, as MPD seems to be a completely separate unit and further investment can only come from profit made within this group. In light of that information, I must apologize for ever having treated ARM and ARM MPD as one and the same with respect to this topic. I will from now on make it very clear that it is ARM MPD, and ARM MPD alone, that doesn't want an open source Mali driver.

I do believe that Mr Davies's cost-versus-gain calculations are too direct and do not allow for secondary effects.

I also believe that an ongoing refusal to support an open source strategy for the Mali will reflect badly on the sale of ARM processors and other IP, especially with ARM now pushing into the server market and getting into Intel territory. The actions of ARM MPD do affect ARM itself, and vice versa. Admittedly, not as much as with those vendors that more closely tie their in-house GPU to the rest of the system, but that's far from an absolute lack of shared dependency and responsibility.

The Mali binary problem.

One person in the Q&A section asked why ARM isn't doing redistributable drivers like Nvidia does for the Tegra. Mr Davies answered that this was a good idea, and that Linaro was doing something along those lines.

Today, ironically, I am the canonical source for mali-400 binaries. At the sunxi project, we got some binaries from the Cubietech people, built from code they received from Allwinner, and the legal terms they were under did not prevent them from releasing the built binaries to the public. Around them (or at least, using the binaries as a separate git module) I built a small make-based installation system which integrates with ARM's open source memory manager (UMP) and even included a quick GLES test from the lima tests. I stopped just short of Debian packaging. The sunxi-mali repository, and the wiki tutorial that goes with it, is now used by many other projects (like, for instance, linux-rockchip) as their canonical source for (halfway usable) GPU support.

There are several severe problems with these binaries, which we have either fixed directly, have been working around, or just have to live with. Direct fixes include adding missing library dependencies, and hollowing out a destructor function which made X complain. These are binary hacks. The xf86-video-fbturbo driver from Siarhei Siamashka works around the broken DRI2 buffer management, but it has to try to autodetect how to work around the issues, as it is differently broken on the different versions of the X11 binaries we have. Then there is the flaky coverage, as we only have binaries for a handful of kernel APIs, making it impossible to match them against all vendor-provided SoC/device kernels. We also only have binaries for fbdev or X11, and sometimes for Android, mostly for armhf, but not always... It's just one big mess, only slightly better than having nothing at all.

Much to our surprise, in October of last year, ARM MPD published a howto entry about setting up a working driver for the Mali Midgard on the Chromebook. It was a step in the right direction, but involved quite a bit of faff, and Connor Abbott (the brilliant teenager REing the Mali shaders) had to go and pour things into a proper git repository so that it would be more immediately useful. Another bout of insane irony, as this laudable step in the right direction by ARM MPD ultimately left something to be desired.

ARM MPD is not like ATI, Nvidia, or even Intel, Qualcomm or Broadcom. The Mali is built into many very different SoC families, and needs to be integrated with different display engines, 2D engines, media engines and memory/cache subsystems.

Even the distribution of drivers is different. From what I understand, Mali drivers are handled as follows. The Mali licensees get the relevant and/or latest Mali driver source code and access to some support from ARM MPD. The device makers, however, only rarely get their hands on source code themselves and usually have to make do with the binaries provided by the SoC vendor. Similarly, the device maker only rarely gets to deal with ARM MPD directly, and usually needs to deal with some proxy at the SoC vendor. This setup puts the responsibility of SoC integration squarely with the SoC vendor, and is well suited for the current mobile market: one image per device at release, and then almost no updates. But that market is changing with the likes of Cyanogenmod, and other markets are opening or are actively being opened by ARM, and those require a completely different mode of operation.

There is a gap in Mali driver support that ARM MPD's model of driver delivery does not cater for today, and ARM MPD knows about this. But MPD is going to be fighting an uphill battle to try to correct this properly.

Binary solutions?

So how can ARM MPD try to tackle this problem?

Would ARM MPD keep the burden of making suitable binaries available solely with SoC vendors or device makers? Not likely, as that is a pretty shaky affair that's actively hurting the Mali ecosystem. SoCs for the mobile market have incredibly short lives, and SoC and device software support is so fragmented that these vendors would be responsible for backporting bugfixes to a very wide array of kernels and SoC versions. On top of that, those vendors would only support a limited subset of windowing systems, possibly even only Android as this is their primary market. Then, they would have to set up the support infrastructure to appropriately deal with user queries and bug reports. Only very few vendors will end up even attempting to do this, and none are doing so today. In the end, any improvement at this end will bring no advantages to the Mali brand or ARM MPD. If this path is kept, we will not move on from the abysmal situation we are in today, and the Mali will continue to be seen as a very fragmented product.

ARM MPD has little other option but to try to tackle this itself, directly, and it should do so more proactively than by hiding behind Linaro. Unfortunately, to make any real headway here, this means providing binaries for every kernel driver interface, and the SoC vendor changes to those interfaces, on top of other bits of SoC-specific integration. But this also means dealing with user support directly, and these users will of course spend half their time asking questions which should be aimed at the SoC vendor. How is ARM MPD going to convince SoC vendors to participate here? Or is ARM MPD going to maintain most of the SoC integration work itself? Surely it will not keep the burden only at Linaro, wasting the resources of the rest of ARM and of Linaro partners?

ARM MPD is simply in a totally different position than the ATIs and Nvidias of this world. Providing binaries that will satisfy a sufficient part of the need is going to be a huge drain on resources. Sure, MPD is not spending the same amount of resources on optimizing for specific setups and specific games like ATI or Nvidia are doing, but it will instead have to spend them on the different SoCs and devices out there. And that's before we start talking about different windowing infrastructure, beyond surfaceflinger, fbdev or X11. Think Wayland, Mir, even DirectFB, or any other special requirements that people tend to have for their embedded hardware.

At best, ARM MPD itself will manage to support surfaceflinger, fbdev and X11 on just a handful of popular devices. But how will ARM MPD know beforehand which devices are going to be popular? How will ARM MPD keep on making sure that the binaries match the available vendor or device kernel trees? Would MPD take the insane route of maintaining their own kernel repositories with a suitable mali kernel driver for those few chosen devices, and backporting changes from the real vendor trees instead? No way.

Attempting to solve this very MPD specific problem with only binaries, to any degree of success, is going to be a huge drain on MPD resources, and in the end, people will still not be satisfied. The problem will remain.

The only fitting solution is an open source driver. Of course, the Samsungs of this world will not ship their flagship phones with just an open source GPU driver in the next few years. But an open source driver will fundamentally solve the issues people currently have with Mali, the issues which fuel both the demand for fitting distributable binaries and for an open source driver. Only an open source driver can be flexible and cost-effective enough to fill that gap. Only an open source driver can get silicon vendors, device makers, solution creators and users chipping in, satisfying their own, very varied, needs.

Change is coming.

The ARM world is rapidly changing. Hardware review sites, which used to only review PC hardware, are more and more taking notice of what is happening in the mobile space. Companies that are still mostly stuck in embedded thinking are having to more and more act like PC hardware makers. The lack of sufficiently broad driver support is becoming a real issue, and one that cannot be solved easily or cheaply with a quick binary fix, especially for those who sell no silicon of their own.

The Mali marketing show on Anandtech tells us that things are looking up. The market is forcing ARM MPD to be more open, and MPD has to either sink or swim. The next step was demonstrated by yours truly and some other very enterprising individuals, and now both Nvidia and Broadcom are going all the way. It is just a matter of time before ARM MPD has to follow, as they need this more than their more progressive competitors.

To finish off, at the end of the Q&A session, someone asked: "Would free drivers give greater value to the shareholders of ARM?". After a quick braindump, I concluded "Does ARM's lack of free drivers hurt shareholder value?" But we really should be asking "To what extent does ARM's lack of free drivers hurt shareholder value?".

17 Jul 2014 12:35am GMT

16 Jul 2014

feedplanet.freedesktop.org

Matthias Klumpp: AppStream 0.7 specification and library released

Today I am very happy to announce the release of AppStream 0.7, the second-largest release (judging by commit number) after 0.6. AppStream 0.7 brings many new features for the specification, adds lots of good stuff to libappstream, introduces a new libappstream-qt library for Qt developers and, as always, fixes some bugs.

Unfortunately we broke the API/ABI of libappstream, so please adjust your code accordingly. Apart from that, any other changes are backwards-compatible. So, here is an overview of what's new in AppStream 0.7:

Specification changes

Distributors may now specify a new <languages/> tag in their distribution XML, providing information about the languages a component supports and the completion percentage for each language. This allows software-centers to apply smart filtering on applications to highlight the ones which are available in the user's native language.

A new addon component type was added to represent software which is designed to be used together with a specific other application (think of a Firefox addon or GNOME-Shell extension). Software-center applications can group the addons together with their main application to provide an easy way for users to install additional functionality for existing applications.

The <provides/> tag gained a new dbus item-type to expose D-Bus interface names the component provides to the outside world. This means in future it will be possible to search for components providing a specific dbus service:

$ appstream-index what-provides dbus org.freedesktop.PackageKit.desktop system

(if you are using the cli tool)

A <developer_name/> tag was added to the generic component definition to define the name of the component developer in a human-readable form. Possible values are, for example "The KDE Community", "GNOME Developers" or even the developer's full name. This value can be (optionally) translated and will be displayed in software-centers.

An <update_contact/> tag was added to the specification, to provide a convenient way for distributors to reach upstream to talk about changes made to their metadata or issues with the latest software update. This tag was already used by some projects before, and has now been added to the official specification.

Timestamps in <release/> tags must now be UNIX epochs, YYYYMMDD is no longer valid (fortunately, everyone is already using UNIX epochs).

Last but not least, the <pkgname/> tag is now allowed multiple times per component. We still recommend creating metapackages according to the contents the upstream metadata describes and placing the file there. However, in some cases defining one component to be in multiple packages is a short way to make metadata available correctly without excessive package-tuning (which can become difficult if a <provides/> tag needs to be satisfied).

As a small side note: the multiarch path in /usr/share/appdata is now deprecated, because we think that we can live without it (by shipping -data packages per library and using smarter AppStream metadata generators which take advantage of the ability to define multiple <pkgname/> tags).

Documentation updates

In general, the documentation of the specification has been reworked to be easier to understand and to include less duplication of information. We now use extensive crosslinking to show you the information you need in order to write metadata for your upstream project, or to implement a metadata generator for your distribution.

Because the specification needs to define the allowed tags completely and contain as much information as possible, it is not very easy to digest for upstream authors who just want some metadata shipped quickly. In order to help them, we now have "Quickstart pages" in the documentation, which are rich in examples and contain the most important subset of information you need to write a good metadata file. These quickstart guides already exist for desktop applications and addons; more will follow in the future.

We also have an explicit section dealing with the question "How do I translate upstream metadata?" now.

More changes to the docs are planned for the next point releases. You can find the full project documentation at Freedesktop.

AppStream GObject library and tools

The libappstream library also received lots of changes. The most important one: We switched from using LGPL-3+ to LGPL-2.1+. People who know me know that I love the v3 license family of GPL licenses - I like it for tivoization protection, its explicit compatibility with some important other licenses, and cosmetic details, like entities not losing their right to use the software forever after a license violation. However, an LGPL-3+ library does not mix well with projects licensed under other open source licenses, mainly GPL-2-only projects. I want libappstream to be used by anyone without forcing the project to change its license. For some reason, using the library from proprietary code is easier than using it from a GPL-2-only open source project. The license change was also a popular request of people wanting to use the library, so I made the switch with 0.7. If you want to know more about the LGPL-3 issues, I recommend reading this blogpost by Nikos (GnuTLS).

On the code-side, libappstream received a large pile of bugfixes and some internal restructuring. This makes the cache builder about 5% faster (depending on your system and the amount of metadata which needs to be processed) and prepares for future changes (e.g. I plan to obsolete PackageKit's desktop-file-database in the long term).

The library also brings back support for legacy AppData files, which it can now read. However, appstream-validate will not validate these files (and kindly ask you to migrate to the new format).

The appstream-index tool received some changes, making its command-line interface a bit more modern. It is also possible now to place the Xapian cache at arbitrary locations, which is a nice feature for developers.

Additionally, the testsuite got improved and should now work on systems which do not have metadata installed.

Of course, libappstream also implements all features of the new 0.7 specification.

With the 0.7 release, some symbols were removed which have been deprecated for a few releases, most notably as_component_get/set_idname, as_database_find_components_by_str, as_component_get/set_homepage and the "pkgname" property of AsComponent (which is now a string array and called "pkgnames"). API level was bumped to 1.

Appstream-Qt

A Qt library to access AppStream data has been added. So if you want to use AppStream metadata in your Qt application, you can easily do that now without touching any GLib/GObject based code!

Special thanks to Sune Vuorela for his nice rework of the Qt library!

And that's it with the changes for now! Thanks to everyone who helped make 0.7 ready, be it with feedback, contributions to the documentation, translations or code. You can get the release tarballs at Freedesktop. Have fun!

16 Jul 2014 3:03pm GMT

14 Jul 2014

feedplanet.freedesktop.org

Peter Hutterer: pointer acceleration in libinput - an analysis

Following Christian's Wayland in Fedora Update post, and after Hans fixed the touchpad acceleration, I've been playing with pointer acceleration in libinput a bit. The main focus was not yet on changing it but rather on figuring out what we actually do and where the room for improvement is. There's a tool in my (rather messy) github wip/ptraccel-work branch to re-generate the graphs below.

This was triggered by a simple plan: I want a configuration interface in libinput that provides a sliding scale from -1 to 1 to adjust a device's virtual speed from slowest to fastest, with 0 being the default for that device. A user should not have to worry about the accel mechanism itself, which may be different for any given device, all they need to know is that the setting -0.5 means "halfway between default and 'holy cow this moves like molasses!'". The utopia is of course that for any given acceleration setting, every device feels equally fast (or slow). In order to do that, I needed the right knobs to tweak.

The code we currently have in libinput is pretty much 1:1 what's used in the X server. The X server sports a lot more configuration options, but what we have in libinput 0.4.0 is essentially what the default acceleration settings are in X. Armed with the knowledge that any #define is a potential knob for configuration, I went to investigate. There are two defines that are labelled as adjustable parameters: DEFAULT_THRESHOLD and DEFAULT_ACCELERATION.

But what do they mean, exactly? And what exactly does a value of 0.4 represent?
[side-note: threshold was 4 until I took the constant multiplier out, it's now 0.4 upstream and all the graphs represent that.]

Pointer acceleration is nothing more than mapping some input data to some potentially faster output data. How much faster depends on how fast the device moves, and to get there one usually needs a couple of steps. The trick of course is to make it predictable, so that despite the acceleration, your brain thinks that the visible cursor is an extension of your hand at all speeds.

Let's look at a high-level outline of our pointer acceleration code:

Calculating pointer speed

We don't just use dx/dy as values, rather, we use the pointer velocity. There's a simple reason for that: dx/dy depends on the device's poll rate (or interrupt frequency). A device that polls twice as often sends half the dx/dy values in each event for the same physical speed.

Calculating the velocity is easy: divide dx/dy by the delta time. We use a set of "trackers" that store previous dx/dy values with their timestamp. As long as we get movement in the same cardinal direction, we take those into account. So if we have 5 events in direction NE, the speed is averaged over those 5 events, smoothing out abrupt speed changes.
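
To make the tracker averaging a bit more concrete, here is a minimal sketch of the idea in C. This is not libinput's actual code: the struct layout, the function name and the newest-first ordering of the history are my assumptions, and error handling is omitted.

#include <math.h>
#include <stdint.h>

/* One stored motion event: deltas in device units, time in ms. */
struct tracker {
        double dx, dy;
        uint64_t time_ms;
        int dir;        /* cardinal direction of this delta */
};

/* Walk back through the history (assumed newest first) while the direction
 * matches, then divide the accumulated distance by the elapsed time, which
 * gives a velocity in units/ms. */
static double
calculate_velocity(const struct tracker *trackers, int ntrackers,
                   uint64_t now_ms, int current_dir)
{
        double distance = 0.0;
        uint64_t oldest = now_ms;
        int i;

        for (i = 0; i < ntrackers; i++) {
                if (trackers[i].dir != current_dir)
                        break;
                distance += hypot(trackers[i].dx, trackers[i].dy);
                oldest = trackers[i].time_ms;
        }

        if (now_ms <= oldest)
                return 0.0;

        return distance / (double)(now_ms - oldest);
}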

The acceleration function

The speed we just calculated is passed to the acceleration function to calculate an acceleration factor.

Figure 1: Mapping of velocity in unit/ms to acceleration factor (unitless). X axes here are labelled in units/ms and mm/s.

This function is the only place where DEFAULT_THRESHOLD/DEFAULT_ACCELERATION are used, but they mostly just stretch the graph. The shape stays the same.

The output of this function is a unit-less acceleration factor that is applied to dx/dy. A factor of 1 means leaving dx/dy untouched, 0.5 is half-speed, 2 is double-speed.

Let's look at the graph for the accel factor output (red): for very slow speeds we have an acceleration factor < 1.0, i.e. we're slowing things down. There is a distinct plateau up to the threshold of 0.4; after that it shoots up to roughly a factor of 1.6, where it flattens out a bit until we hit the max acceleration factor.

Now we can also put units to the two defaults: Threshold is clearly in units/ms, and the acceleration factor is simply a maximum. Whether those are mentally easy to map is a different question.
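
As an illustration only, the described shape could be sketched like this. This is not libinput's actual curve; the plateau value and the slope are invented for the sketch, and only the overall shape matters.

/* Illustrative approximation of the profile described above: a sub-1.0
 * plateau below the threshold, then a steep rise that is capped at the
 * maximum acceleration factor. Speed is in units/ms. */
static double
accel_profile(double speed, double threshold, double max_accel)
{
        const double plateau = 0.7;     /* hypothetical deceleration factor */
        const double slope = 4.0;       /* hypothetical ramp steepness */
        double factor;

        if (speed < threshold)
                return plateau;

        factor = plateau + (speed - threshold) * slope;
        return factor < max_accel ? factor : max_accel;
}

The factor such a function returns is then multiplied into dx/dy, so with the defaults anything below 0.4 units/ms is slowed down slightly and fast movements are scaled up towards the maximum factor.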

We don't use the output of the function as-is; rather, we smooth it out using Simpson's rule. The second (green) curve shows the accel factor after the smoothing took effect. This is a contrived example: the tool that generated this data simply increased the velocity, hence this particular line. For more random data, see Figure 2.

Figure 2: Mapping of velocity in unit/ms to acceleration factor (unitless) for a random data set. X axes here are labelled in units/ms and mm/s.

For the data set, I recorded the velocity from libinput while using Firefox a bit.

The smoothing takes history into account, so the data points we get depend on the usage. In this data set (and others I tested) we see that the majority of the points still lie on or close to the pure function, apparently the delta doesn't matter that much. Nonetheless, there are a few points that suggest that the smoothing does take effect in some cases.

It's important to note that this is already the second smoothing to take effect - remember that the velocity may be averaged over multiple events, which already smooths the input data. However, the two smoothing effects somewhat complement each other: velocity smoothing only happens when the pointer moves consistently without much change, while the Simpson's smoothing effect is most pronounced when the pointer moves erratically.

Ok, now we have the basic function, let's look at the effect.

Pointer speed mappings

Figure 3: Mapping raw unaccelerated dx to accelerated dx, in mm/s assuming a constant physical device resolution of 400 dpi that sends events at 125Hz. dx range mapped is 0..127

The graph was produced by sending 30 events with the same constant speed, then dividing by the number of events to reduce any effects tracker feeding has at the initial couple of events.

The two lines show the actual output speed in mm/s and the gain in mm/s, i.e. (output speed - input speed). We can see the little nook where the threshold kicks in, and that after it the acceleration is linear. Look at Figure 1 again: the linear acceleration is caused by the acceleration factor maxing out quickly.

Most of this graph is theoretical only though. On your average mouse you don't usually get a delta greater than 10 or 15 and this graph covers the theoretical range to 127. So you'd only ever be seeing the effect of up to ~120 mm/s. So a more realistic view of the graph is:

Figure 4: Mapping raw unaccelerated dx to accelerated dx, see Figure 3 for details. Zoomed in to a max of 120 mm/s (15 dx/event).

Same data as Figure 3, but zoomed to the realistic range. We go from a linear speed increase (no acceleration) to a quick bump once the threshold is hit and from then on to a linear speed increase once the maximum acceleration is hit.

And to verify, the ratio of output speed : input speed:

Figure 5: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, i.e. the ratio of accelerated:unaccelerated.

Looks pretty much exactly like the pure acceleration function, which is to be expected. What's important here though is that this is the effective speed, not some mathematical abstraction. And it shows one limitation: we go from 0 to full acceleration within a really small window.

Again, this is the full theoretical range, the more realistic range is:

Figure 6: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, i.e. the ratio of accelerated:unaccelerated. Zoomed in to a max of 120 mm/s (15 dx/event).

Same data as Figure 5, just zoomed in to a maximum of 120 mm/s. If we assume that 15 dx/event is roughly the maximum you can reach with a mouse you'll see that we've reached maximum acceleration at a third of the maximum speed and the window where we have adaptive acceleration is tiny.

Tweaking threshold/accel doesn't do that much. Below are the two graphs representing the default (threshold=0.4, accel=2), a doubled threshold (threshold=0.8, accel=2) and a doubled acceleration (threshold=0.4, accel=4).

Figure 6: Mapping raw unaccelerated dx to accelerated dx, see Figure 3 for details. Zoomed in to a max of 120 mm/s (15 dx/event). Graphs represent thresholds:accel settings of 0.4:2, 0.8:2, 0.4:4.
Figure 7: Mapping of the unit-less gain of raw unaccelerated dx to accelerated dx, see Figure 5 for details. Zoomed in to a max of 120 mm/s (15 dx/event). Graphs represent thresholds:accel settings of 0.4:2, 0.8:2, 0.4:4.

Doubling either setting just moves the adaptive window around, it doesn't change that much in the grand scheme of things.

Now, of course these were all fairly simple examples with constant speed, etc. Let's look at a diagram of what is essentially random movement, me clicking around in Firefox for a bit:

Figure 8: Mapping raw unaccelerated dx to accelerated dx on a fixed random data set.

And the zoomed-in version of this:

Figure 9: Mapping raw unaccelerated dx to accelerated dx on a fixed random data set, zoomed in to events 450-550 of that set.

This is more-or-less random movement reflecting some real-world usage. What I find interesting is that it's very hard to see any areas where smoothing takes visible effect. The accelerated curve largely looks like a stretched input curve. To be honest, I'm not sure what I should've expected here or how to read it; pointer acceleration data from real-world usage is notoriously hard to visualize.

Summary

So in summary: I think there is room for improvement. We have no acceleration up to the threshold, then we accelerate within too small a window. Acceleration stops adjusting to the speed soon. This makes us lose precision and small speed changes are punished quickly.

Increasing the threshold or the acceleration factor doesn't do that much. Any increase in acceleration makes the mouse faster but the adaptive window stays small. Any increase in threshold makes the acceleration kick in later, but the adaptive window stays small.

We've already merged a number of fixes into libinput, but some more work is needed. I think that to get a good pointer acceleration we need to get a larger adaptive window [Citation needed]. We're currently working on that (and figuring out how to evaluate whatever changes we come up with).

A word on units

The biggest issue I was struggling with when trying to understand the code was that of units. The code didn't document used units anywhere but it turns out that everything was either in device units ("mickeys"), device units/ms or (in the case of the acceleration factors) was unitless.

Device units are unfortunately a pretty useless base entity, only slightly more precise than using the length of a piece of string. A device unit depends on the device resolution and of course that differs between devices. An average USB mouse tends to have 400 dpi (15.75 units/mm) but it's common to have 800 dpi, 1000 dpi and gaming mice go up to 8200dpi. A touchpad can have resolutions of 1092 dpi (43 u/mm), 3277 dpi (129 u/mm), etc. and may even have different resolutions for x and y.

This explains why until commit e874d09b4 the touchpad felt slower than a "normal" mouse. We scaled to a magic constant of 10 units/mm, before hitting the pointer acceleration code. Now, as said above the mouse would likely have a resolution of 15.75 units/mm, making it roughly 50% faster. The acceleration would kick in earlier on the mouse, giving the touchpad and the mouse not only different speeds but a different feel altogether.

Unfortunately, there is not much we can do about mice feeling different depending on the resolution. To my knowledge there is no way to query the resolution on a device. But for absolute devices that need pointer acceleration (i.e. touchpads) we can normalize to a fake resolution of 400 dpi and base the acceleration code on that. This provides the same feel on the mouse and the touchpad, as much as that is possible anyway.
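
A hedged sketch of that normalization, assuming the device resolution is known in units per mm (the function and parameter names are made up for the example):

/* Scale an absolute device's deltas to the fake 400 dpi baseline before
 * they enter the pointer acceleration code. 400 dpi divided by 25.4 mm
 * per inch gives the ~15.75 units/mm mentioned above. */
#define NORMALIZED_DPI 400.0

static void
normalize_delta(double *dx, double *dy,
                double xres_units_per_mm, double yres_units_per_mm)
{
        const double target = NORMALIZED_DPI / 25.4;

        *dx *= target / xres_units_per_mm;
        *dy *= target / yres_units_per_mm;
}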

14 Jul 2014 10:02pm GMT

13 Jul 2014

feedplanet.freedesktop.org

Ben Widawsky: True PPGTT [part 3]

  • EDIT1: I forgot to include a diagram I did of the software state machine for some presentation. I long lost the SVG, and it got kind of messed up, but it's there at the bottom.
  • EDIT2: (Apologies to aggregators) Grammar fixes. Fixed some bugs in a couple of the images.
  • EDIT3: (Again, apologies to aggregators) s/indirect rendering/direct rendering. I had to fix this or else the sentence made no sense.
  • EDIT4 (2017-07-13): I was under the impression we were not yet allowed to talk about preemption. But apparently we are. So feature matrix at the bottom is updated.

The Per-Process Graphics Translation Tables provide real process isolation among the various graphics processes running within an i915 based system. When in use, the combination of the PPGTT and the Hardware Context provide the equivalent of the traditional CPU process. Most of the same capabilities can be provided, and most of the same limitations come with it. True PPGTT encompasses all of the functionality currently merged into the i915 kernel driver that support page tables and address spaces. It's called, "true" because the Aliasing PPGTT was introduced first and often was simply called, "PPGTT."

The True PPGTT patches represent one of the more challenging aspects of working on a project like the Linux kernel. The feature couldn't realistically be enabled in isolation from the existing driver. When regressions occur it's likely that the user gets no display. To say we get chided on occasion would be an understatement. Ipso facto, this feature is not enabled by default. There are quite a few patches on the mailing list that build new functionality on top of this support, and to help stabilize existing support. If one wishes to try enabling the real PPGTT, one must simply use the i915 module parameter: enable_ppgtt=2. I highly recommend that the stability patches be used unless you're reading this in some future where the stability problems are fixed upstream.

Unlike the previous posts where I tried to emphasize the hardware architecture for this feature, the following will go into almost no detail about how the hardware works. There won't be PRM references or hardware state machines. All of those mechanics have been described in part 1 and part 2.

A Brief History of the i915 Graphics Process

There have been three stages of the definition of a graphics process within the i915 driver. I believe that by explaining the stages one can get a better appreciation for the capabilities. In the following pictures there is meant to be a highlighted region (yellow in the first two, yellow, orange and blue in the last) that denote the scope of a GPU context/process with the specified feature. Incrementally the definition of a process begins to bleed between the CPU, and the GPU.

Unfortunately I have some overlap with my earlier post about Hardware Contexts. I found no good way to write this post without doing so. If you read that post, consider this a refresher.

File Descriptors

Initially all GPU state was shared by every GPU client. The only partition was done via the operating system. Every process that does direct rendering will get a file descriptor for the device. The file descriptor is the thing through which commands are submitted. This could be used by the i915 driver to help disambiguate "who" was doing "what." This permitted the i915 kernel driver to prevent one GPU client from directly referencing the buffers owned by a different GPU client. By making the buffer object handles per file descriptor (this is very easy to implement, it's just an idr in the kernel) there exists no mechanism to reference buffer handles from a different file descriptor. For applications which do not require their context to be saved, for non-buggy apps, and for non-malicious apps, this separation is still perfectly sufficient. As an example, BO handle #1 for the X server is not the same as BO handle #1 for xonotic since each has a different file descriptor1. Even though we had this partition at the software level, nothing was enforced by the hardware. Provided a GPU client could guess where another buffer resided, it could easily operate on that buffer. Similarly, a GPU client could not expect the GPU state it had set previously to be preserved for any amount of time.

File descriptor isolation. Before hardware contexts.

Hardware Contexts

The next step towards isolation was the Hardware Context2. The hardware contexts built upon the isolation provided by the original file descriptor mechanism. The hardware context was an opt-in interface which meant that those not wishing to use the interface received the old behavior: they could purposefully or accidentally use the state from another GPU client3. There was quite a bit of discussion around this at the time the patches were in review, and there's not really any point in lamenting about how it could be better, now.

The context exists within the domain of the process/file descriptor in the same way that a BO exists in that domain. Contexts cannot be shared [intentionally]. The interface created was, and remains extremely simple.

struct drm_i915_gem_context_create {
        /* output: id of new context*/
        __u32 ctx_id;
        __u32 pad;
};

struct drm_i915_gem_context_destroy {
        __u32 ctx_id;
        __u32 pad;
};

As you can see from the two IOCTL payloads above, I wasn't lying about the simplicity. Because there was not a great deal of variable functionality, there just wasn't a lot to add in terms of the interface. Destroy is an optional call because we have the file descriptor and can clean up if a process does not. The primary motivation for destroy() is simply to allow very meticulous and memory-conscious GPU clients to keep things tidy. Earlier I had a list of 3 types of GPU clients that could survive without this separation. Considering their inverse, this takes one of those off the list.

  • GPU clients that need their HW context preserved
  • Buggy applications writing to random memory
  • Malicious applications
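
For illustration, here is a hedged usage sketch of the two IOCTL payloads shown above, using libdrm's drmIoctl() on an already open DRM file descriptor. Error handling is omitted and the header paths may differ on your system.

#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

static uint32_t create_context(int fd)
{
        struct drm_i915_gem_context_create create = { 0 };

        /* The kernel fills in ctx_id on success. */
        drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);
        return create.ctx_id;
}

static void destroy_context(int fd, uint32_t ctx_id)
{
        struct drm_i915_gem_context_destroy destroy = { .ctx_id = ctx_id };

        /* Optional, as noted above: closing the fd cleans up anyway. */
        drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);
}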

The block diagram is quite similar to the above diagram, with the exception that now there are discrete blocks for the persistent state. I was a bit lazy with the separation on this drawing. Hopefully, you get the idea.

Hardware context isolation

Full PPGTT

The last piece was to provide a discrete virtual address space for each GPU client. For completeness, I will provide the diagram, but by now you should already know what to expect.

PPGTT, full isolation

If I wrote about this picture, there would be no point in continuing with an organized blog post :-). So I'll continue to explain this topic. Take my word for it that this addresses the other two types of GPU clients:

  • GPU clients that need their HW context preserved
  • Buggy applications writing to random memory
  • Malicious applications

Since the GGTT isn't really mentioned much in this post, I'd like to point out that the GTT still exists as you can see in this diagram. It is required for several components that were listed in my previous blog post.

VMAs and Address Spaces (AKA VMs)

The patch series which began to implement PPGTT was actually a separate series. It was the one that introduced the Virtual Memory Area for the PPGTT, simply referred to as, VMA4. You can think of a VMA in a very similar way to a GEM BO. It is an identifiable, continuous range within an address space. Conceptually there isn't much difference between it and a GEM BO. To try to define it in my horrible math jargon: a logical grouping of virtual addresses representing an operand for some GPU operation within a given PPGTT domain. A VMA is uniquely identified via the tuple (BO, Address space). In the likely case that I made no sense just there, a VMA is just another handle on a chunk of GPU memory used for rendering.

Sharing VMAs

You can't (see the note at the bottom). There's not a whole lot I can say without doing another post about DMA-Buf, and/or Flink. Perhaps someday I will, but for now I'll keep things general and brief.

It is impossible to share a VMA. To repeat, a VMA is uniquely identifiable by the address space and a BO. It remains possible to share a BO. An address space exists for an individual GPU client's process. Therefore it makes no sense to share a VMA since the address space cannot be shared5. As a result of using the existing sharing interfaces, a shared BO ends up with multiple VMAs referencing it, one per address space. Trying to go back to the math jargon again:

  1. VMA: (BO, Address Space) // Some BO mapped by the address space.
  2. VMA′: (BO′, Address Space) // Another BO mapped into the address space
  3. VMA″: (BO, Address Space′) // The same BO as 1, mapped into a different address space.
VMA : PPGTT :: BO : GGTT


In case it's still unclear, I'll use an example (which is kind of a simplified/false demonstration). The scanout buffer is the thing which is displayed on the screen. When doing frontbuffer rendering, one directly renders to that buffer. If we remember my previous post, the Display Engine requires a GGTT mapping. Therefore we know we have VMA_global. Jumping ahead, a GPU client cannot have a global mapping, therefore, to render to the frontbuffer it too has a VMA, VMA_pp. There you have two VMAs pointing to the same Buffer Object.

NOTE: You can actually share VMAs if you are already sharing a Context/PPGTT. I can't think of any real world examples off of the top of my head, but it is possible, and potentially a useful thing to do.

Data Structures

Here are the relevant data structures cropped for the sake of brevity.

struct i915_address_space {
        struct drm_mm mm;
        unsigned long start;            /* Start offset always 0 for dri2 */
        size_t total;           /* size addr space maps (ex. 2GB for ggtt) */
        struct list_head active_list;
        struct list_head inactive_list;
};

struct i915_hw_ppgtt {
        struct i915_address_space base;
        int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
                         struct intel_engine_cs *ring,
                         bool synchronous);

};
struct i915_vma {
        struct drm_mm_node node;
        struct drm_i915_gem_object *obj;
        struct i915_address_space *vm;
};

The struct i915_hw_ppgtt is a subclass of a struct i915_address_space. Only two implementors of i915_address_space exist: the i915_hw_ppgtt (a PPGTT), and the i915_gtt (the GGTT). It might make some sense to create a new PPGTT subclass for GEN8+ but I've not opted to do this. I feel there is too much duplication for not enough benefit.

I've already explained in different words that a range of used address space is the VMA. If the address space has the drm_mm, then it should make direct sense that the VMA has the drm_mm_node because this is the used part of the address space6. In the i915_vma struct above is a pointer to the address space for which the VMA exists, and the object the VMA is referencing. This provides the tuple that defines the VMA.

An example address space layout:

HOLE 0x0->0x64000
VMA 1 0x64000->0x69000
HOLE 0x69000->512M
VMA 2 512M->512.004M
HOLE ~512M->2GB
Allocated space: 0x6000, Free space: 0x7fffa000
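
To make the tuple concrete, here is a conceptual sketch of how one might look up the VMA for a given (BO, address space) pair. This is not the driver's actual helper; the per-object list and its field names are assumed for the example.

/* Walk a hypothetical per-object list of VMAs and return the one bound
 * into the given address space, if any. */
struct i915_vma *
lookup_vma(struct drm_i915_gem_object *obj, struct i915_address_space *vm)
{
        struct i915_vma *vma;

        list_for_each_entry(vma, &obj->vma_list, vma_link) /* assumed fields */
                if (vma->vm == vm)
                        return vma;

        return NULL;
}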

Relation to the Hardware Context

struct intel_context {
        struct kref ref;
        int id;
        ...
        struct i915_address_space *vm;
};

With the 3 elements discussed a few times already: file descriptor, context, PPGTT, we get real GPU process isolation. Since the context was historically an opt-in interface, changes needed to be made in order to keep the opt-in behavior yet provide isolation behind the scenes regardless of what the GPU client tried to do. If this was not done, then innocent GPU clients could feel the wrath. The file descriptor was already intimately connected with the direct rendering process (one cannot render without getting a file descriptor), so it made sense to hook off of that to create the contexts and PPGTTs.

Implicit Context ("private default context")

From here on out we can consider a "context" as the 3 elements: fd, HW context, and a PPGTT. In the driver as it exists today, if a GPU client does not provide a context for rendering, it cannot rely on GPU state being preserved. A context is created for GPU clients that do not provide one, but the state of this context should be considered completely opaque to all GPU clients. I've called this the Private Default Context as it very much resembles the default context that exists for the whole system (again, let me point you to the previous blog post on contexts). The driver will isolate the various contexts within the system from implicit contexts, and vice versa. Hardware state is undefined while using the private default context; the hardware simply retains its state from the previous render operation when using the IOCTLs.

The behavior of the implicit context does result in waste when userspace uses contexts (as mesa/libgl does). There are a few solutions to this problem, and I've submitted patches for all of them (I can count 3 off the top of my head). Perhaps one day in the not too distant future, this above section will be false and we can just say - every process will get a context when they open the DRI file. If they want more contexts, they can use the IOCTL.

Multi Context

A GPU client can create more than one context. The context they wish to use for a given rendering command is built into the execbuffer2 API (note that KMS is not context savvy).

struct drm_i915_gem_execbuffer2 {
        /**
         * List of gem_exec_object2 structs
         */
        __u64 buffers_ptr;
        __u32 buffer_count;

        /** Offset in the batchbuffer to start execution from. */
        __u32 batch_start_offset;
        /** Bytes used in batchbuffer from batch_start_offset */
        __u32 batch_len;
        ...
        __u64 flags;
        __u64 rsvd1; /* now used for context info */
        __u64 rsvd2;
};

A process may wish to create several GL contexts. The API allows this, and for reasons I don't understand, it's something some applications wish to do. If there were no mechanism to create new contexts, userspace would be forced to open a new file descriptor for each GL context, or else it would not reap the benefits of everything we've discussed for a GL context.
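
As a hedged sketch of how a GPU client selects one of its contexts at submission time (the context id goes into the rsvd1 field shown above; error handling is omitted, and batch_handle, batch_len and ctx_id are assumed to come from earlier GEM create and context create calls):

#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

static void execute_in_context(int fd, uint32_t batch_handle,
                               uint32_t batch_len, uint32_t ctx_id)
{
        struct drm_i915_gem_exec_object2 obj = { .handle = batch_handle };
        struct drm_i915_gem_execbuffer2 execbuf = {
                .buffers_ptr = (uintptr_t)&obj,
                .buffer_count = 1,
                .batch_len = batch_len,
                .flags = I915_EXEC_RENDER,      /* submit to the render ring */
                .rsvd1 = ctx_id,                /* context to execute in */
        };

        drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}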

The Big Picture - literally

Overview

Context:PPGTT

One of the more contentious topics in the very early stages of development was the relationship and connection of a PPGTT and a HW context.

Quoting myself from one of my earlier public declarations, here:

My long term vision is for contexts to have a 1:1 relationship with a PPGTT. Sharing objects between address spaces would work similarly to the flink/dmabuf model if needed.

My idea was to embed the PPGTT within the context structure, so that creating a context always resulted in a new PPGTT. Creating a PPGTT by itself would have been impossible. This is not what we ended up doing. The implementation allows multiple hardware contexts to share a PPGTT. I'm still unclear exactly what is needed to support share groups within OpenGL, but it has been speculated that sharing an address space is a requirement for share groups. Fundamentally this would allow the client to create multiple GPU contexts that share an address space (it resembles what you'd get back when there were only HW contexts). The execbuffer2 IOCTL allows one to specify the context. Behaviorally however, my proposal matches what is in use currently. I think it's a bit easier to think of things this way too.

Current Mesa
Current DDX
2 hypothetical scenarios

Conclusion

Please feel free to send me issues or questions.
Oh yeah. Here is a state machine that I did for a presentation on this. Things got rendered weird, and I lost the original SVG file, but perhaps it will be of some value to someone.

State Machine

TODO

As I alluded to earlier, there is still some work left to do in order to get this feature turned on by default. I gave the links to some patches, and the parameter to make it happen. If you feel motivated to help get this stuff moving forward, test it, report bugs, try to fix stuff, don't yell at me when things break :-).

Summary

That's most of it. I like to give the 10 second summary.

  1. i915_vma, i915_hw_ppgtt, i915_address_space: important things.
  2. The GPU has a virtual address space per DRI file descriptor.
  3. There is a connection between the PPGTT, and a Hardware Context.
  4. VMAs are backed by BOs which are backed by physical pages.
  5. GPU clients have some flexibility with how they interact with contexts, and therefore the PPGTT.

And finally, since I compared our now well defined notion of a GPU process to the traditional CPU process, I wanted to create a quick list of what I think are some interesting data points regarding the capabilities of the processors.

Thing                  Modern X86 CPU   Modern i915 GPU
Phys Address Limit     48b?             ~40b
Process Isolation      Yes              Yes (with True PPGTT)
Virtual Address Space  Yes              Yes
64b VA Space           Yes              GEN8+ 48b only
PTE access controls    Yes              No
Page Fault Handling    Yes              No
Preemption7            Yes              *With execlists

So while True PPGTT brings the GPU closer to having all of the [what I consider to be] interesting features of a modern x86 CPU - it still has a ways to go. I would be surprised if things didn't continue going in this direction.

SVG Links

As usual, please feel free to do something useful with the images I've created. Also as usual, they are really poorly named.
https://bwidawsk.net/blog/wp-content/uploads/2014/07/pre-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/post-ppgtt.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma-bo-page.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/vma.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/ppgtt-context.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/multi-context.svg


  1. It's technically possible to make them be the same BO through the two buffer sharing mechanisms.

  2. Around the same time Hardware Contexts were introduced, so was the Aliasing PPGTT. The Aliasing PPGTT was interesting, however it does not contribute to any part of the GPU "process"

  3. Hardware contexts use a mechanism which will inhibit the restoration of state when not opted-in. This means if one GPU client does opt-in, and another does not, the client without contexts can reuse the state of the client with contexts. As the address space is still shared, this is actually a really dangerous thing to allow.

  4. I would have preferred the reservation of a space within the address space be called a, "GVMA", but that was shot down during review

  5. There's a whole section below describing how this statement could be false. For now, let's pretend address spaces can't be shared

  6. For those unfamiliar with the Direct Rendering Manager memory manager, a drm_mm is the structure for the memory manager provided by the DRM midlayer. It does all the things you'd expect out of a memory manager, like finding free nodes, allocating nodes, freeing up nodes… A drm_mm_node is a structure representing an allocation from the memory manager. The PPGTT code relies entirely on the drm_mm and the DRM helper functions in order to actually do the address space allocations and frees.

  7. I am defining the word preemption as the ability to switch at an arbitrary point in time between contexts. On the CPU this is easily accomplished. The GPU running the i915 driver as of today has no way to do this. Once a batch is running it cannot be interrupted except for RC6.

13 Jul 2014 5:48pm GMT

12 Jul 2014

feedplanet.freedesktop.org

Ben Widawsky: i915 command submission via gem_exec_nop

EDIT1 (2014-07-12): Apologies to planets for update.

Disclaimer: Everything documented below is included in the Intel public documentation. Anything I say which offends you are my own words and not those of Intel. Sadly, anything I say that is of monetary value are those of Intel.

Intro

Goal

My goal is to lay down a basic understanding of how GEN GPU execution works using gem_exec_nop from the intel-gpu-tools suite as an example. One who puts in the time to read this should understand how command submission works for the i915 driver, and how gem_exec_nop tests command submission. You should also have a decent idea of how the hardware handles execution. I intentionally skip topics like relocations, and how graphics virtual addresses are maintained. They are not directly related towards execution, and would make the blog entry too long.

Ideally, I am hoping this will enable people who are interested to file better bugs, improve our tests, or write their own tests.

Terminology

Source Code

The source code in this post is found primarily in two places. Note that the links below are both from very fast moving code bases.

The test case: http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/tree/tests/gem_exec_nop.c

The driver internals: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_execbuffer.c

GEN Hardware

Before going over gem_exec_nop, I'd like to give an overview of modern GEN hardware:

Coarse GEN block diagram.

I don't want to say this is the exhaustive list, and indeed, each block above has many sub-components. In the part of the driver I work on this is a pretty logical way to split it. Each of the blocks share very little. The common denominator is a Graphics Virtual Address which is understood by all blocks. This provides an easy communication for work needing to be sent between components. As an example, the command streamer might want the display engine to flip to a new surface. It does so by sending a special message to the display engine along with the address of the surface to flip to. The display engine may respond "out of band" via interrupts (flip completion). There are also built in synchronization primitives that allow the command streamer to wait on events sent by the display engine (we'll get to the command streamer in more detail later).

Excluding audio, since I know nothing about audio… by a very rough estimate, 85% of the Linux i915.ko code falls into "Other." Of the remaining 15% in graphics processing engine, the kernel driver tends to utilize very little of the Fixed Func/EU block above. Total lines of code outside of the kernel driver for the EU block is enormous, given that the X 2d driver (DDX), mesa, libva, and beignet all have tons of lines of code just for utilizing that part of the hardware.

gem_exec_nop

gem_exec_nop is one of my favorite tests. For me, it's the first test I run to determine whether or not to even bother with the rest of the test suite.

It's not a perfect test, some of the things which are missing:

gem_exec_nop flowchart

NOTE: I will explain more about what a batchbuffer is later.

execbuf_5_steps

* (step 1) The docs say we must always follow MI_BATCH_BUFFER_END with an MI_NOOP. The presumed reason for this is that the hardware may prefetch the next instruction, and I think the designers wanted to dumb down the fact that they can't handle a pagefault on the prefetch, so they simply demand a MI_NOOP.
** (step1) MI_NOOP is defined as a dword of value 0x00000000. GEM BOs are 0 by default. So we have an implicit buffer full of MI_NOOP.
  1. Creating a batchbuffer is done using GEM APIs. Here we create a batchbuffer of size 4096, and fill in two instructions. The batchbuffer is the basic unit of execution. The only pertinent point to keep in mind is this is the only buffer being created for this test. Note that this step, or a similar one, is done in almost every test.
  2. Here we set up the data structure that will be passed to the kernel in an IOCTL. There's a pointer to the list of buffers; in our case, just the one batchbuffer created in step one. The size of 8 (each of the two instructions is 4 bytes) and some flags, which we'll skip for now, are also included in the struct (a sketch of steps 1 and 2 follows this list).
  3. The dotted line through step 3 denotes the userspace/kernel barrier. Above the line is gem_exec_nop.c, below is i915_gem_execbuffer.c. DRM, which is a common subsystem interface, actually dispatches the IOCTLs to the i915 driver.
  4. The kernel handles the data it received. Talked about in more detail later.
  5. Submit to the GPU for execution. Also, detailed later.
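
Below is a hedged sketch of steps 1 through 3 using the raw IOCTLs. It is not gem_exec_nop.c verbatim, error handling is omitted, and the MI_BATCH_BUFFER_END encoding should be double-checked against the intel-gpu-tools headers.

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

#define MI_BATCH_BUFFER_END     (0xA << 23)     /* check against your headers */
#define MI_NOOP                 0x00000000

static void exec_nop(int fd)
{
        const uint32_t batch[2] = { MI_BATCH_BUFFER_END, MI_NOOP };
        struct drm_i915_gem_create create = { .size = 4096 };
        struct drm_i915_gem_pwrite pwrite;
        struct drm_i915_gem_exec_object2 obj;
        struct drm_i915_gem_execbuffer2 execbuf;

        /* Step 1: create the 4096-byte batchbuffer BO and write the two
         * instructions into it. */
        drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create);

        memset(&pwrite, 0, sizeof(pwrite));
        pwrite.handle = create.handle;
        pwrite.offset = 0;
        pwrite.size = sizeof(batch);
        pwrite.data_ptr = (uintptr_t)batch;
        drmIoctl(fd, DRM_IOCTL_I915_GEM_PWRITE, &pwrite);

        /* Step 2: one BO in the list, 8 bytes of commands, no flags. */
        memset(&obj, 0, sizeof(obj));
        obj.handle = create.handle;

        memset(&execbuf, 0, sizeof(execbuf));
        execbuf.buffers_ptr = (uintptr_t)&obj;
        execbuf.buffer_count = 1;
        execbuf.batch_len = sizeof(batch);

        /* Step 3: cross into the kernel. */
        drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}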

Execbuf2 IOCTL and Hardware Execution

i915.ko execbuffer2 handling (step 4 and 5 in the picture above)

The eventual goal of the kernel driver is to take the batchbuffer passed in from userspace, make sure it is visible to the GPU by mapping it, and then submit it to the GPU for execution. The aforementioned operations are synchronous with respect to the IOCTL1. In other words, by the time the execution returns to the application, the GPU knows about the work. The work is completed asynchronously.

I'll detail some of the steps a bit. Unfortunately, I do not have pretty pictures for this one. You can follow along in i915_gem_execbuffer.c; i915_gem_do_execbuffer()

  1. copy_from_user - Copy the BOs in from userspace. Remember that the BO is a handle and not actual memory being copied; this allows a relatively small and fast copy to take place. In gem_exec_nop, there is exactly 1 BO: the batchbuffer.
  2. some sanity checks - not interesting
  3. look up - Do a lookup of all the handles for the BOs passed in via the buffers_ptr member (copied in during #1). Make sure the buffers still exist and so on. In our case this is only one buffer and it's unlikely that it would be destroyed before execbuffer completes2
  4. Space reservation - Make sure there is enough address space in the GPU for the objects. This also includes checking for various alignment restrictions, and a few other details not really relevant to this specific topic. For our example, we'll have to make sure we have enough space for 1 buffer of size 4096, and no special alignment requirements. It's the second simplest request possible (first would be to have no buffers).
  5. Relocations - save for another day.
  6. Ring synchronization - Also not pertinent to gem_exec_nop. Since it involves the command streamer, I'll include a brief description as a footnote3
  7. Dispatch - Finally we can tell the GEN hardware about the work that we just got. This means using some architectural registers to point the hardware at the batchbuffer which was submitted by userspace. More on this shortly…
  8. Some more relocation stuff - save for another day

Execution part I (Command Streamer/Ringbuffer)

Fundamentally, all work is submitted via a hardware ringbuffer, and fetched via the command streamer. A command streamer is many things, but for now, saying it's a DMA engine for copying in commands and associated data is a good enough definition. The ringbuffer is a canonical ringbuffer with a HEAD and TAIL pointer (to be clear: TAIL is the one incremented by the CPU, and read by the GPU. HEAD is written by the GPU and read by the CPU). There is a third pointer known as ACTHD (or Active HEAD) - more on this later. At driver initialization, the space for the ringbuffer is allocated, and the address and size are written to hardware registers. When the driver wants to submit work, it writes data at the current TAIL pointer, and increments the TAIL pointer. Once the TAIL is incremented, the hardware will start reading in commands (via DMA), and increment the HEAD (and ACTHD) pointer as commands are retired.
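
A conceptual sketch of the TAIL side of this, in C. This is not the driver's actual code; write_tail_register() stands in for the real MMIO write, and the wrap handling is simplified.

#include <stdint.h>

struct ringbuffer {
        uint32_t *virt;         /* CPU mapping of the ring */
        uint32_t size;          /* in bytes, power of two */
        uint32_t tail;          /* byte offset, owned by the CPU */
};

/* Hypothetical helper that writes the ring's TAIL register. */
void write_tail_register(uint32_t tail);

static void ring_emit(struct ringbuffer *ring, uint32_t dword)
{
        ring->virt[ring->tail / 4] = dword;
        ring->tail = (ring->tail + 4) & (ring->size - 1);       /* wrap around */
}

static void ring_advance(struct ringbuffer *ring)
{
        /* Writing the new TAIL kicks off the command streamer's DMA; it
         * fetches commands and moves HEAD (and ACTHD) forward as they
         * retire. */
        write_tail_register(ring->tail);
}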

Early GEN hardware had only 1 command streamer. It was referred to as "CS." When Ironlake introduced the VCS, or video engine command streamer, they renamed (in some places) the original CS to RCS, for render engine command streamer. Sandybridge introduced the blit engine command streamer, BCS, and Haswell the video enhancement command streamer, or VECS. Each command streamer supports its own instruction set, though many instructions are the same on multiple command streamers; MI_NOOP, for example, is supported on all of them :P Having multiple command streamers not only provides an easy way to add new instructions, but it also allows an asynchronous way to submit work, which can be very useful if you are trying to do two orthogonal tasks. As an example, take an OpenCL application running in conjunction with your favorite 3d benchmark. The 3d benchmark internally will only use the 3d and blit hardware, while the OCL application will use the GPGPU hardware. It doesn't make sense to have either one wait for a single command streamer to fetch the data (especially since I glossed over some other details which make it an even worse idea) if there will be few [or no] data dependencies.

The kernel driver is the only entity which can insert commands into the ringbuffer. The ringbuffer is therefore considered trusted, and all commands supported by the hardware may be run here (the docs use the word, "secure" but this gets confusing quickly). The way in which the batchbuffer we created in gem_exec_nop gets executed will be explained a bit further shortly, but the contents of that batchbuffer are not directly inserted into the ringbuffer [4]. Take a quick peek at the text in the image below for how it works.

Here is a pretty basic picture describing the above. The HEAD and TAIL point to the next instruction to be executed, so this would be midway through step #5 in the flowchart above.

ringbuffer

Execution part II (MI_BATCH_BUFFER_START, batchbuffer)

A batchbuffer is the way in which we can submit work to the GPU without having to write into the hardware ringbuffer (since only the kernel driver can do that). A batchbuffer is submitted to the GPU for execution via a command called MI_BATCH_BUFFER_START, which is inserted into the ringbuffer and read by the command streamer. Batchbuffers share an instruction set with the command streamer that dispatched them (ie. batches run by the blit engine can issue blit commands), and the execution flow is very similar to that of the command streamer as described in the first diagram. On the other hand, there are quite a few differences. Batchbuffer execution is not guided by HEAD and TAIL pointers. The hardware will continue to execute every instruction in a batchbuffer until it hits another MI_BATCH_BUFFER_START command, or an MI_BATCH_BUFFER_END. Yes, you can get into an infinite loop of batchbuffers with this nesting of MI_BATCH_BUFFER_START commands. The hardware has an internal HEAD pointer which is exposed for debug purposes called ACTHD. This pointer works exactly like a HEAD pointer would, except it is never compared against TAIL to determine the end of execution [5]. MI_BATCH_BUFFER_END will directly guide execution back to the hardware ringbuffer. In other words, you need only one MI_BATCH_BUFFER_END to break the chain of n MI_BATCH_BUFFER_STARTs.
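
Tying this back to the ring sketch above: dispatching a batchbuffer boils down to writing two dwords into the ring, the MI_BATCH_BUFFER_START command and the graphics address of the batch. The encoding below only shows the opcode (bits 28:23 on GEN); the generation-specific flag bits real hardware wants are deliberately left out, and the helper names are the hypothetical ones from the earlier sketch.

    #include <stdint.h>

    /* From the hypothetical ring sketch earlier in this post. */
    struct ring_sketch;
    void ring_emit(struct ring_sketch *ring, uint32_t dword);
    void ring_advance(struct ring_sketch *ring);

    /* MI_BATCH_BUFFER_START opcode in bits 28:23; generation-specific flag
     * bits (address space / security selection) are intentionally omitted. */
    #define MI_BATCH_BUFFER_START_SKETCH (0x31u << 23)

    /* Dispatch a batchbuffer by pointing the command streamer at it. */
    static void emit_batch_start_sketch(struct ring_sketch *ring, uint32_t batch_addr)
    {
        ring_emit(ring, MI_BATCH_BUFFER_START_SKETCH); /* dword 1: the command */
        ring_emit(ring, batch_addr);                   /* dword 2: batch graphics address */
        ring_advance(ring);                            /* publish TAIL; hardware starts fetching */
    }

This is, in effect, what steps 4-6 of the walkthrough below do with the made-up address 0x22000.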

Getting back to gem_exec_nop specifically for a sec: this is what we set up in step #1. Recall it had 2 instructions: MI_BATCH_BUFFER_END, MI_NOOP.

batch

Here is our graphical representation of the batchbuffer from gem_exec_nop. Notice that the batchbuffer doesn't have a tail pointer, only ACTHD.
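
For completeness, here is a hedged sketch of how such a two-dword batchbuffer could be created and filled from userspace with the stock GEM IOCTLs (DRM_IOCTL_I915_GEM_CREATE and DRM_IOCTL_I915_GEM_PWRITE). The actual gem_exec_nop test uses intel-gpu-tools helpers instead, so treat this as the general shape rather than the test's source; the function name create_nop_batch() is mine.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <i915_drm.h>   /* from libdrm; build with `pkg-config --cflags libdrm` */

    #define MI_BATCH_BUFFER_END (0xAu << 23)  /* 0x05000000 */
    #define MI_NOOP             0x00000000u

    /* Create a 4096-byte BO and write the two-instruction batch into it.
     * Returns the GEM handle, or 0 on failure. fd is an open DRM device. */
    static uint32_t create_nop_batch(int fd)
    {
        const uint32_t batch[2] = { MI_BATCH_BUFFER_END, MI_NOOP };
        struct drm_i915_gem_create create;
        struct drm_i915_gem_pwrite pwrite;

        memset(&create, 0, sizeof(create));
        create.size = 4096;                       /* one page, the minimum BO size */
        if (ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create))
            return 0;

        memset(&pwrite, 0, sizeof(pwrite));
        pwrite.handle   = create.handle;
        pwrite.offset   = 0;
        pwrite.size     = sizeof(batch);          /* 8 bytes */
        pwrite.data_ptr = (uintptr_t)batch;
        if (ioctl(fd, DRM_IOCTL_I915_GEM_PWRITE, &pwrite))
            return 0;

        return create.handle;
    }

The handle returned here is what the execbuffer2 sketch near the top of this section passes in via obj.handle.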

Hardware states

The following macro-level state machine/flowchart hybrid can be used to describe both ringbuffer execution and batchbuffer execution, though the descriptions differ slightly. By "macro-level" I mean each state may not match exactly to a state within the hardware's state machines. It's more of a state in the data flow. The "state machines" for both ringbuffers and batchbuffers are pretty similar. What follows is a diagram that mostly works for both, and a description of each state.

cs_state_machine

I'll use "RSn" for ringbuffer state n, and "BSn" for batchbuffer state n.

gem_exec_nop state walkthrough

With the above knowledge, we can now step through the actual stuff from gem_exec_nop. This combines pretty much all the diagrams above (ie. you might want to reference them); I tried to keep everything factually correct along the way, minus the address I make up below. Assume HEAD = 0x30, TAIL = 0x30, ACTHD = 0x30.

  1. Hardware is in RS0.
  2. gem_exec_nop runs; submits previously discussed setup to i915.
  3. *** kernel picks address 0x22000 for the batchbuffer (remember I said we're ignoring how graphics addresses work for now, so just play along)
  4. i915.ko writes 4 bytes, MI_BATCH_BUFFER_START to hardware ringbuffer.
  5. i915.ko writes 4 bytes, 0x22000, to the hardware ringbuffer.
  6. i915.ko increments the tail pointer by the command length (8). TAIL := 0x38
  7. RS0->RS1: DMA fetches TAIL-HEAD bytes. (0x38-0x30) = 8B
  8. RS1->RS2: DMA completes. Parsing will find that the command is MI_BATCH_BUFFER_START, and it needs 1 extra dword to proceed. This 8B command is then ready to move on.
  9. RS2->RS3: Command was successfully parsed. There is a batchbuffer to be fetched, and once that completes we need to execute it.
  10. RS3->RS4: Execution was okay, DMA fetch of the batchbuffer at 0x22000 starts…completes
  11. RS4->BS0: ACTHD := 0x22000
  12. BS0->BS1: We're in a batchbuffer. The commands we need to fetch are in our local cache, fetched by the ringbuffer just before so no need to do anything more.
  13. BS1->BS2: Parsing of the batchbuffer begins. The first command pointed to by ACTHD is MI_BATCH_BUFFER_END. It is only 4B.
  14. BS2->BS3: Parse was successful. Execute the command MI_BATCH_BUFFER_END; ACTHD advances by 4 (the size of the command). There are no extra requirements for this command.
  15. BS3->RS0: Batchbuffer told us to end, so we go back to the ring. Increment our HEAD pointer by the size of the last command (8B). Set ACTHD equal to HEAD. HEAD := 0x38. ACTHD := 0x38.
  16. HEAD == TAIL… we're idle.

Summary

Userspace builds up a command and a list of buffers, then tells the kernel about it via an IOCTL. The kernel does some work on the command to find all the buffers and so on, then submits it to the hardware. Some time later, userspace can see the results of the commands (not discussed in detail). On the hardware side, we've got a ringbuffer with a head and tail pointer, a way to dispatch commands which are located sparsely in our address space, and a way to get execution back to the ringbuffer.

SVG links

https://bwidawsk.net/blog/wp-content/uploads/2014/07/gen_block_diagram.svg
https://bwidawsk.net/blog/wp-content/uploads/2014/07/execbuf_5_steps.svg

  1. The synchronous nature of the IOCTL is something which has been discussed several times. One usage model which would really like to break that is a GPU scheduler. In the case of a scheduler, we'd want to queue up work and return to userspace as soon as possible; but that work might not yet have made it into the hardware.

  2. Buffer objects are managed with a reference count. When a buffer is created, it gets a ref count of 1, and the refcount is decremented either when the object is explicitly destroyed or when the application ceases to exist. Therefore, the only way gem_exec_nop can fail during the look up portion of execbuffer is if the application somehow dies after creating the batchbuffer, but before calling the execbuffer IOCTL.

  3. As I showed in the first diagram, we consider the commands executed to be "in order." Here this means that commands are executed sequentially, and (hand-waving over some caching details) the side effects of earlier commands are complete by the time later commands execute. This made the implicit synchronization that is baked into the GEM API really easy to handle (the API has no way to explicitly add synchronization objects). To put this another way, if a GPU client submits a command that operates on object X, then a second command also operating on object X, they are guaranteed to execute in that order (as long as there was no race condition in userspace submitting the commands). However, when you have multiple instances of the in-order command streamers, synchronization is no longer free. If a command is submitted to command streamer 1 referencing object X, and then a second command is submitted to command streamer 2 also referencing object X… no guarantees are made by the hardware about the order of the commands. In this case, synchronization can be achieved in two ways: hardware-based semaphores, or stalling the second command until the first one completes.

  4. Certain commands which may pose security risks are not allowed to be executed by untrusted entities. If the hardware parses such a command from an untrusted entity, it will convert it into an MI_NOOP. Batchbuffers can be executed in a trusted manner, but implementing such a thing is complex.

  5. When the CS is executing from the ring, HEAD == ACTHD. Once the CS jumps into the batchbuffer, ACTHD will take on the address within the batchbuffer, while HEAD remains relevant only to its position in the ring. We use this fact to help us debug whether we hung in the batch or in the ring.

12 Jul 2014 5:57pm GMT

10 Jul 2014

feedplanet.freedesktop.org

Christian Schaller: Desktop Containers – The Way Forward

One feature we are spending quite a bit of effort on around the Workstation is container technologies for the desktop. This has been on the wishlist for quite some time and luckily the pieces for it are now coming together. Thanks to strong collaboration between Red Hat and Docker we have a great baseline to start from. One of the core members of the desktop engineering team, Alex Larsson, has been leading the Docker integration effort inside Red Hat and we are now preparing to build onwards on that work, using the desktop container roadmap created by Lennart Poettering.

So while Lennart's LinuxApps ideas predate Docker, they do provide a great set of steps we need to turn Docker into a container solution not just for server and web applications, but also for desktop applications. And luckily a lot of the features we need for the desktop are also useful for the other use cases; for instance, one of the main things Red Hat has been working on with our friends at Docker is integrating systemd with Docker.

There is a set of other components as part of this plan too. One of the big ones is Wayland, and I assume that if you are reading this you have already seen my Wayland in Fedora updates.

Two other core technologies we identified are kdbus and overlayfs. Alex Larsson has already written an overlayfs backend for Docker, and Fedora Workstation Steering committee member, Josh Bowyer, just announced the availability of a Copr which includes experimental kernels for Fedora with overlayfs and kdbus enabled.

In parallel with this, David King has been prototyping a version of Cheese that can run inside a container and that uses the concept the LinuxApps proposal calls 'Portals', which are basically D-Bus APIs for accessing resources outside the container, like the webcam and microphone in the case of Cheese. For those interested, he will be presenting his work at GUADEC at the end of the month, on Monday the 28th of July. The talk is called 'Cheese: TNG (less libcheese, more D-Bus)'.

So all in all the pieces are really starting to come together now, and we expect to have some sessions during both GUADEC and Flock this year to try to hammer out the remaining details. If you are interested in learning more or joining the effort, be sure to check the two conferences' notice boards for the time and place of the container sessions.

There is still a lot of work to do, but I am confident we have the right team assembled to do it. In addition to the people already mentioned, we have, for instance, Allan Day, who is kicking off an effort to look at the user experience we want to provide around the container-hosted application bundles in terms of upgrades and installation. And we will also work with the wider Docker community to make sure great composition tools for creating these container images are available for developers on Fedora.

10 Jul 2014 8:48am GMT

04 Jul 2014

feedplanet.freedesktop.org

Lennart Poettering: FUDCON + GNOME.Asia Beijing 2014

Thanks to the funding from FUDCON I had the chance to attend and keynote at the combined FUDCON Beijing 2014 and GNOME.Asia 2014 conference in Beijing, China.

My talk was about systemd's present and future, what we have achieved and where we are going. I tried to explain a bit where we are coming from, and how we changed focus from being purely an init system to being a set of basic building blocks to build an OS from. Most of the talk was about where we still intend to take systemd, which areas we believe should be covered by systemd, and of course also the always difficult question of where to draw the line and what is clearly outside the focus of systemd. You can find the slides of my talk online. (No video recording that I am aware of, sorry.)

The combined conferences were a lot of fun, and as usual, the best discussions were in the hallway track, talking about Linux and systemd.

A number of pictures of the conference are now online. Enjoy!

After the conference I stayed for a few more days in Beijing, doing a bit of sightseeing. What a fantastic city! The food was amazing; we tried all kinds of fantastic stuff, from Peking duck to Sichuan-style bullfrog. Yummy. And one of these days I am sure I will find the time to actually sort my photos and put them online, too.

I am really looking forward to the next FUDCON/GNOME.Asia!

04 Jul 2014 4:43pm GMT

Christian Schaller: Transmageddon 1.2 ‘All bugs must die’

Update: I had actually managed to disable the VAAPI encoding in 1.2, so I just rolled a 1.3 release which re-enables it. Apart from that it is identical.

So I finally managed to put out a new Transmageddon release today. It is primarily a bugfix release, but considering how many critical bugs I ended up fixing for this release I am actually a bit embarrassed about my earlier 1.x releases. There was, for instance, some stupidity in my code that triggered thread-safety issues, which I know hit some of my users quite badly. But there were other things not working properly either, like dropping the video stream from a file. Anyway, I know some people think that filing bugs doesn't help, but I think I fixed every reported Transmageddon bug with this release (although not every feature request bugzilla item). So if you have issues with Transmageddon 1.2 please let me know and I will try my best to fix them. I do try to keep to a policy that it is better to have limited functionality where what is there is solid, as opposed to having a lot of features that are unreliable or outright broken.

That said, I couldn't help myself, so there are a few new features in this release. First of all, if you have the GStreamer VAAPI plugins installed (and be sure to have the driver too), then the VAAPI GPU encoder will be used for H.264 and MPEG-2.

Secondly, I brought back the so-called 'xvid' codec (even though xvid isn't really a separate codec, but a name used to refer to the MPEG-4 Video codec using the Advanced Simple profile).

So as the screenshot below shows, there are not a lot of UI changes since the last version, just some smaller layout and string fixes, but stability is hopefully greatly improved.
transmageddon-1.2

I am currently looking at a range of things as the next feature for Transmageddon including:

If you have any preference for which I should tackle first, feel free to let me know in the comments, and I will try to let popular will decide what I do first :)

P.S. I would love to have a high-contrast icon for Transmageddon (HighContrast App icon guidelines) - so if there are any graphics artists out there willing to create one for me, I would be duly grateful.

04 Jul 2014 8:47am GMT

03 Jul 2014

feedplanet.freedesktop.org

Christian Schaller: Wayland in Fedora Update

As we are approaching Fedora Workstation 21, we recently held a meeting inside Red Hat to review our Wayland efforts for Fedora Workstation. Switching to a new core technology like Wayland is a major undertaking, and there are always big and small surprises that come along the way. The summary is that while we expect to have a version of Wayland in Fedora Workstation 21 that will be able to run a fully functional desktop, there are some missing pieces we now know will not make it. Since we want to ship at least one Fedora release with a feature-complete Wayland as an option before making it the default, that means Fedora Workstation 23 is the earliest Wayland can be the default.

Anyway, here is what you can expect from Wayland in Fedora 21.

We hope to have F21 testbuilds available soon that the wider community can use to help us test, because even when all the big 'checkboxes' are filled in there will of course be a host of smaller issues and outright bugs that need ironing out before Wayland is ready to replace X completely. We really hope the community will get involved with testing Wayland so that we can iron out all major bugs before F21.

How to get involved with the Fedora Workstation effort

To help more people get involved, we recently put up a tasklist for the Fedora Workstation. It is a work in progress, but we hope it will help more people get involved and move the project forward.

Update: Peter Hutterer posted this blog entry explaining pointer acceleration and what we are looking at to improve it.

03 Jul 2014 2:55pm GMT