18 Jun 2013

feedplanet.freedesktop.org

Tollef Fog Heen: An otter, please (or, a better notification system)

Recently, there's been discussions on IRC and the debian-devel mailing list about how to notify users, typically from a cron script or a system daemon needing to tell the user their hard drive is about to expire. The current way is generally "send email to root" and for some bits "pop up a notification bubble, hoping the user will see it". Emailing me means I get far too many notifications. They're often not actionable (apt-get update failed two days ago) and they're not aggregated.

I think we need a system that at its core has level and edge triggers and some way of doing flap detection. Level interrupts means "tell me if a disk is full right now". Edge means "tell me if the checksums have changed, even if they now look ok". Flap detection means "tell me if the nightly apt-get update fails more often than once a week". It would be useful if it could extrapolate some notifications too, so it could tell me "your disk is going to be full in $period unless you add more space".

The system needs to be able to take in input in a variety of formats: syslog, unstructured output from cron scripts (including their exit codes), snmp, nagios notifications, sockets and fifos and so on. Based on those inputs and any correlations it can pull out of it, it should try to reason about what's happening on the system. If the conclusion there is "something is broken", it should see if it's something that it can reasonably fix by itself. If so, fix it and record it (so it can be used for notification if appropriate: I want to be told if you restart apache every two minutes). If it can't fix it, notify the admin.

It should also group similar messages so a single important message doesn't drown in a million unimportant ones. Ideally, this should be cross-host aggregation. The notifications should be possible to escalate if they're not handled within some time period.

I'm not aware of such a tool. Maybe one could be rigged together by careful application of logstash, nagios, munin/ganglia/something and sentry. If anybody knows of such a tool, let me know, or if you're working on one, also please let me know.

18 Jun 2013 8:15am GMT

14 Jun 2013

feedplanet.freedesktop.org

José Fonseca: Basic GLX_EXT_texture_from_pixmap support in ApiTrace

Apitrace now has basic support for the GLX_EXT_texture_from_pixmap extension, frequently used by Linux desktop compositors.

The difficulty with this extension was that it basically allows sharing textures between different processes, but when replaying those external processes and textures are not available. The solution was to emit fake glTexImage2D calls with the contents of the shared textures whenever they are used. (The same technique was already being for tracing OpenGL's GL_OES_EGL_image extension, and Direct3D's shared resources.)

This does yield large traces which are inadequate for profiling (something that can be eventually addressed with more effort.) But replays should be now visually faithful, therefore useful for debugging correctness/rendering issues.

14 Jun 2013 11:30pm GMT

09 Jun 2013

feedplanet.freedesktop.org

Lennart Poettering: GNOME.Asia and LinuxCon Japan

Two weeks ago I attended GNOME.Asia/Seoul and LinuxCon Japan/Tokyo, thanks to sponsoring by the GNOME Foundation and the Linux Foundation. At GNOME.Asia I spoke about Sandboxed Applications for GNOME, and at LinuxCon Japan about the first three years of systemd. (I think at least the latter one was videotaped, and recordings might show up on the net eventually). I like to believe both talks went pretty well, and helped getting the message across to community what we are working on and what the roadmap for us is, and what we expect from the various projects, and especially GNOME. However, for me personally the hallway track was the most interesting part. The personal Q&A regarding our work on kdbus, cgroups, systemd and related projects where highly interesting. In fact, at both conferences we had something like impromptu hackfests on the topics of kdbus and cgroups, with some conferences attendees. I also enjoyed the opportunity to be on Karen's upcoming GNOME podcast, recorded in a session at Gyeongbokgung Palace in Seoul (what better place could there be for a podcast recording?).

I'd like to thank the GNOME and Linux foundations for sponsoring my attendance to these conferences. I'd especially like to thank the organizers of GNOME.Asia for their perfectly organized conference!

GNOME Travel Badge

09 Jun 2013 2:30pm GMT

05 Jun 2013

feedplanet.freedesktop.org

Rob Clark: fedora 19 installer for nexus4

As promised in my previous post, now there is an f19 installer for nexus4:

https://github.com/freedreno/nexus4-fedora

So if you have an n4 and a bit of free space, you can play around with accelerated open-source gpu goodness :-)

05 Jun 2013 8:38pm GMT

04 Jun 2013

feedplanet.freedesktop.org

Keith Packard: dri3 extension

Completing the DRI3 Extension

This week marks a pretty significant milestone for the glorious DRI3000 future. The first of the two new extensions is complete and running both full Gnome and KDE desktops.

DRI3 Extension Overview

The DRI3 extension provides facilities for building direct rendering libraries to work with the X window system. DRI3 provides three basic mechanisms:

  1. Open a DRM device.

  2. Share kernel objects associated with X pixmaps. The direct rendering client may allocate kernel objects itself and ask the X server to construct a pixmap referencing them, or the client may take an existing X pixmap and discover the underlying kernel object for it.

  3. Synchronize access to the kernel objects. Within the X server, Sync Fences are used to serialize access to objects. These Sync Fences are exposed via file descriptors which the underlying driver can use to implement synchronization. The current Intel DRM driver passes a shared page containing a Linux Futex.

Opening the DRM Device

Ideally, the DRM application would be able to just open the graphics device and start drawing, sending the resulting buffers to the X server for display. There's work going on to make this possible, but the current situation has the X server in charge of 'blessing' the file descriptors used by DRM clients.

DRI2 does this by having the DRM client fetch a 'magic cookie' from the kernel and pass that to the X server. The cookie is then passed to the kernel which matches it up with the DRM client and turns on rendering access for that application.

For DRI3, things are much simpler-the DRM client asks the X server to pass back a file descriptor for the device. The X server opens the device, does the magic cookie dance all by itself (at least for now), and then passes the file descriptor back to the application.

┌───
    DRI3Open
    drawable: DRAWABLE
    driverType: DRI3DRIVER
    provider: PROVIDER
      ▶
    nfd: CARD8
    driver: STRING
    device: FD
└───
    Errors: Drawable, Value, Match

    This requests that the X server open the direct rendering
    device associated with drawable, driverType and RandR
    provider. The provider must support SourceOutput or SourceOffload.

    The direct rendering library used to implement the specified
    'driverType' is returned in 'driver'. The file
    descriptor for the device is returned in 'device'. 'nfd' will
    be set to one (this is strictly a convenience for XCB which
    otherwise would need request-specific information about how
    many file descriptors were associated with this reply).

Sharing Kernel Pixel Buffers

An explicit non-goal of DRI3 is support for sharing buffers that don't map directly to regular X pixmaps. So, GL ancillary buffers like depth and stencil just don't apply here.

The shared buffers in DRI3 are regular X pixmaps in the X server. This provides a few obvious benefits over the DRI2 scheme: In the kernel, the buffers are referenced by DMA-BUF handles, which provides a nice driver-independent mechanism.

  1. Lifetimes are easily managed. Without being associated with a separate drawable, it's easy to know when to free the Pixmap.

  2. Regular X requests apply directly. For instance, copying between buffers can use the core CopyArea request.

To create back- and fake-front- buffers for Windows, the application creates a kernel buffer, associates a DMA-BUF file descriptor with that and then sends the fd to the X server with a pixmap ID to create the associated pixmap. Doing it in this direction avoids a round trip.

┌───
    DRI3PixmapFromBuffer
    pixmap: PIXMAP
    drawable: DRAWABLE
    size: CARD32
    width, height, stride: CARD16
    depth, bpp: CARD8
    buffer: FD
└───
    Errors: Alloc, Drawable, IDChoice, Value, Match

    Creates a pixmap for the direct rendering object associated
    with 'buffer'. Changes to pixmap will be visible in that
    direct rendered object and changes to the direct rendered
    object will be visible in the pixmap.

    'size' specifies the total size of the buffer bytes. 'width',
    'height' describe the geometry (in pixels) of the underlying
    buffer. 'stride' specifies the number of bytes per scanline in
    the buffer. The pixels within the buffer may not be arranged
    in a simple linear fashion, but 'size' will be at least
    'height' * 'stride'.

    Precisely how any additional information about the buffer is
    shared is outside the scope of this extension.

    If buffer cannot be used with the screen associated with
    drawable, a Match error is returned.

    If depth or bpp are not supported by the screen, a Value error
    is returned.

To provide for texture-from-pixmap, the application takes the pixmap ID and passes that to the X server which returns the a file descriptor for a DMA-BUF which is associated with the underlying kernel buffer.

┌───
    DRI3BufferFromPixmap
    pixmap: PIXMAP
      ▶
    depth: CARD8
    size: CARD32
    width, height, stride: CARD16
    depth, bpp: CARD8
    buffer: FD
└───
    Errors: Pixmap, Match

    Pass back a direct rendering object associated with
    pixmap. Changes to pixmap will be visible in that
    direct rendered object and changes to the direct rendered
    object will be visible in the pixmap.

    'size' specifies the total size of the buffer bytes. 'width',
    'height' describe the geometry (in pixels) of the underlying
    buffer. 'stride' specifies the number of bytes per scanline in
    the buffer. The pixels within the buffer may not be arranged
    in a simple linear fashion, but 'size' will be at least
    'height' * 'stride'.

    Precisely how any additional information about the buffer is
    shared is outside the scope of this extension.

    If buffer cannot be used with the screen associated with
    drawable, a Match error is returned.

Tracking Window Size Changes

When Eric Anholt and I first started discussing DRI3, we hoped to avoid needing to learn about the window size from the X server. The thought was that the union of all of the viewports specified by the application would form the bounds of the drawing area. When the window size changed, we expected the application would change the viewport.

Alas, this simple plan isn't sufficient here-a few GL functions are not limited to the viewport. So, we need to track the actual window size and monitor changes to it.

DRI2 does this by delivering 'invalidate' events to the application whenever the current buffer isn't valid; the application discovers that this event has been delivered and goes to as the X server for the new buffers. There are a couple of problems with this approach:

  1. Any outstanding DRM rendering requests will still draw to the old buffers.

  2. The Invalidate events must be captured before the application sees the related ConfigureNotify event so that the GL library can react appropriately.

The first problem is pretty intractable within DRI2-the application has no way of knowing whether a frame that it has drawn was delivered to the correct buffer as the underlying buffer object can change at any time. DRI3 fixes this by having the application in control of buffer management; it can easily copy data from the previous back buffer to the new back buffer synchronized to its own direct rendering.

The second problem was solved in DRI2 by using the existing Xlib event hooks; the GL library directly implements the Xlib side of the DRI2 extension and captures the InvalidateBuffers events within that code, delivering those to the driver code. The problem with this solution is that Xlib holds the Display structure mutex across this whole mess, and Mesa must be very careful not to make any Xlib calls during the invalidate call.

For DRI3, I considered placing the geometry data in a shared memory buffer, but my future plans for the "Present" extension led me to want an X event instead (more about the "Present" extension in a future posting).

An X ConfigureNotify event is sufficient for the current requirements to track window sizes accurately. However, there's no easy way for the GL library to ensure that ConfigureNotify events will be delivered to the application-other application code may (and probably will) adjust the window event mask for its own uses. I considered adding the necessary event mask tracking code within XCB, but again, knowing that the "Present" extension would probably need additional information anyhow, decided to create a new event instead.

Using an event requires that XCB provide some mechanism to capture those events, keep them from the regular X event stream, and deliver them to the GL library. A further requirement is that the GL library be absolutely assured of receiving notification about these events before the regular event processing within the application will see a core ConfigureNotify event.

The method I came up with for XCB is fairly specific to my requirements. The events are always 'XGE' events, and are tagged with a special 'event context ID', an XID allocated for this purpose. The combination of the extension op-code, the event type and this event context ID are used to split off these events to custom event queues using the following APIs:

/**
 * @brief Listen for a special event
 */
xcb_special_event_t *xcb_register_for_special_event(xcb_connection_t *c,
                                                    uint8_t extension,
                                                    uint16_t evtype,
                                                    uint32_t eid,
                                                    uint32_t *stamp);

This creates a special event queue which will contain only events matching the specified extension/type/event-id triplet.

/**
 * @brief Returns the next event from a special queue
 */
xcb_generic_event_t *xcb_check_for_special_event(xcb_connection_t *c,
                                                 xcb_special_event_t *se);

This pulls an event from a special event queue. These events will not appear in the regular X event queue and so applications will never see them.

There's one more piece of magic here-the 'stamp' value passed to xcbregisterforspecialevent. This pointer refers to a location in memory which will be incremented every time an event is placed in the special event queue. The application can cheaply monitor this memory location for changes and known when to check the queue for events.

Within GL, the value used is the existing dri2 'stamp' value. That is checked at the top of the rendering operation; if it has changed, the drawing buffers will be re-acquired. Part of the buffer acquisition process is a check for special events related to the window.

For now, I've placed these events in the DRI3 extension. However, they will move to the Present extension once that is working.

┌───
    DRI3SelectInput
    eventContext: DRI3EVENTID
    window: WINDOW
    eventMask: SETofDRI3EVENT
└───
    Errors: Window, Value, Match, IDchoice

    Selects the set of DRI3 events to be delivered for the
    specified window and event context. DRI3SelectInput can
    create, modify or delete event contexts. An event context is
    associated with a specific window; using an existing event
    context with a different window generates a Match error.

    If eventContext specifies an existing event context, then if
    eventMask is empty, DRI3SelectInput deletes the specified
    context, otherwise the specified event context is changed to
    select a different set of events.

    If eventContext is an unused XID, then if eventMask is empty
    no operation is performed. Otherwise, a new event context is
    created selecting the specified events.

The events themselves look a lot like a configure notify event:

┌───
    DRI3ConfigureNotify
    type: CARD8         XGE event type (35)
    extension: CARD8        DRI3 extension request number
    length: CARD16          2
    evtype: CARD16          DRI3_ConfigureNotify
    eventID: DRI3EVENTID
    window: WINDOW
    x: INT16
    y: INT16
    width: CARD16
    height: CARD16
    off_x: INT16
    off_y: INT16
    pixmap_width: CARD16
    pixmap_height: CARD16
    pixmap_flags: CARD32
└───

    'x' and 'y' are the parent-relative location of 'window'.

Note that there are a couple of odd additional fields-offx, offy, pixmapwidth, pixmapheight and pixmap_flags are all place-holders for what I expect to end up in the Present extension. For now, in DRI3, they should be ignored.

Synchronization

The DRM application needs to know when various X requests related to its buffers have finished. In particular, when performing a buffer swap, the client wants to know when that completes, and be able to block until it has. DRI2 does this by having the application make a synchronous request from the X server to get the names of the new back buffer for drawing the next frame. This has two problems:

  1. The synchronous round trip to the X server isn't free. Other running applications may cause fairly arbitrary delays in getting the reply back from the X server.

  2. Synchronizing with the X server doesn't ensure that GPU operations are necessarily serialized between the application and the X server.

What we want is a serialization guarantee between the X server and the DRM application that operates at the GPU level.

I've written a couple of times (dri3k first steps and Shared Memory Fences) about using X Sync extension Fences (created by James Jones and Aaron Plattner) for this synchronization and wanted to get a bit more specific here.

With the X server, a Sync extension Fence is essentially driver-specific, allowing the hardware design to control how the actual synchronization is performed. DRI3 creates a way to share the underlying operating system object by passing a file descriptor from application to the X server which somehow references that device object. Both sides of the protocol need to tacitly agree on what it means.

┌───
    DRI3FenceFromFD
    drawable: DRAWABLE
    fence: FENCE
    initially-triggered: BOOL
    fd: FD
└───
    Errors: IDchoice, Drawable

    Creates a Sync extension Fence that provides the regular Sync
    extension semantics along with a file descriptor that provides
    a device-specific mechanism to manipulate the fence directly.
    Details about the mechanism used with this file descriptor are
    outside the scope of the DRI3 extension.

For the current GEM kernel interface, because all GPU access is serialized at the kernel API, it's sufficient to serialize access to the kernel itself to ensure operations are serialized on the GPU. So, for GEM, I'm using a shared memory futex for the DRI3 synchronization primitive. That does not mean that all GPUs will share this same mechanism. Eliminate the kernel serialization guarantee and some more GPU-centric design will be required.

What about Swap Buffers?

None of the above stuff actually gets bits onto the screen. For now, the GL implementation is simply taking the X pixmap and copying it to the window at SwapBuffers time. This is sufficient to run applications, but doesn't provide for all of the fancy swap options, like limiting to frame rate or optimizing full-screen swaps.

I've decided to relegate all of that functionality to the as-yet-unspecified 'Present' extension.

Because the whole goal of DRI3 was to get direct rendered application contents into X pixmaps, the 'Present' extension will operate on those X objects directly. This means it will also be usable with non-DRM applications that use simple X pixmap based double buffering, a class which includes most existing non-GL based Gtk+ and Qt applications.

So, I get to reduce the size of the DRI3 extension while providing additional functionality for non direct-rendered applications.

Current Status

As I said above, all of the above functionality is running on my systems and has booted both complete KDE and Gnome sessions. There have been some recent DMA-BUF related fixes in the kernel, so you'll need to run the latest 3.9.x stable release or a 3.10 release candidate.

Here's references to all of the appropriate git repositories:

DRI3 protocol and spec:

git://people.freedesktop.org/~keithp/dri3proto      master

XCB protocol

git://people.freedesktop.org/~keithp/xcb/proto.git  dri3

XCB library

git://people.freedesktop.org/~keithp/xcb/libxcb.git dri3

xshmfence library:

git://people.freedesktop.org/~keithp/libxshmfence.git   master

X server:

git://people.freedesktop.org/~keithp/xserver.git    dri3

Mesa:

git://people.freedesktop.org/~keithp/mesa.git       dri3

Next Steps

Now it's time to go write the Present extension and get that working. I'll start coding and should have another posting here next week.

04 Jun 2013 10:28pm GMT

Luc Verhaegen: Old and new limare code, and management overhead...

I just pushed updated limare code and a fix to ioquake3.

In almost 160 patches, loads of things change:

Some of this code was already published to allow the immediate use of the OGT enabled ioquake3. But that branch is now going to be removed, as the new code replaces it fully.

As for performance, this is no better or worse than the FOSDEM code. 47fps on timedemo on the Allwinner A10 at 1024x600. But now on the Exynos 4, there are some new numbers... With the CPU clocked to 2GHz and the Mali clocked to 800MHz (!!!) we hit 145fps at 720p and 127fps at 1080p. But more on that a bit further in this post.

Upcoming: Userspace memory management.


Shortly after FOSDEM, i blogged about the 2% performance advantage over the binary driver when running Q3A.

As you might remember, we are using ARMs kernel driver, and despite all the pain that this is causing us due to shifting IOCTL numbers (whoever at ARM decided that IOCTL numbers should be defined as enums should be laid off immediately) I still think that this is a useful strategy. This allows us to immediately throw in the binary driver, and immediately compare Lima to the binary, and either help hard reverse engineering, or just make performance comparisons. Rewriting this kernel driver, or turning this into a fully fledged DRM driver is currently more than just a waste of time, it is actually counterproductive right now.

But now, while bringing up a basic mesa driver, it became clear that I needed to work on some form of memory management. Usually, you have the DRM driver handling all of that, (even for small allocations i think - not that i have checked). We do not have a DRM driver, and I do not intend to write one in the very near future either, and all I have is the big block mapping that the mali kernel driver offers (which is not bad in itself).

So in the train on the way back from linuxtag this year, I wrote up a small binary allocator to divide up the 2GB of address space that the Mali MMU gives us. On top of that, I now have 2 types of memory, sequential and persistent (next to UMP and external, for mapping the destination buffer into Mali memory), and limare can now allocate and map blocks of either at will.

The sequential memory is meant for per-frame data, holding things like draws and varyings and such, stuff that gets thrown away after the frame has been rendered. This simply tracks the amount of memory used, adds the newly requested memory at the end, and returns an address and a pointer. No tracking whatsoever. Very lightweight.

The persistent memory is the standard linked list type, with the overhead that that incurs. But this is ok, as this memory is meant for shaders, textures and attribute and element buffers. You do not create these _every_ draw, and you tend to reuse them, so it's acceptable if their management is a bit less optimized.

Normally, more management makes things worse, but this memory tracking allowed me to sanitize away some frame specific state tracking. Suddenly, Q3A at 720p which originally ran at 145fps on the exynos, ran at 176fps. A full 21% faster. Quite some difference.

I now have a board with a Samsung Exynos 4412 prime. This device has the quad A9s clocked at 1.7GHz, 2GB LP-DDR2 memory at 880MHz, and a quad PP Mali-400MP4 at 440MHz. This is quite the powerhouse compared to the 1GHz single A8 and single PP Mali-400 at 320MHz. Then, this Exynos chip I got actually clocks the A9s to 2GHz and the mali to a whopping 800MHz (81% faster than the base clock). Simply insane.

The trouble with the exynos device, though, is that there are only X11 binaries. This involves a copy of the rendered buffer to the framebuffer which totally kills performance. I cannot properly compare these X11 binaries with my limare code. So I did take my new memory management code to the A10 again, and at 1024x600 it ran the timedemo at 49.5fps. About a 6% margin over the binary framebuffer driver, or tripling my 2% lead at FOSDEM. Not too bad for increased management, right?

Anyway, with the overclocking headroom of the exynos, it was time for a proper round of benchmarking with limare on exynos.

Benchmark, with a pretty picture!


Limare Q3A benchmark results on exynos4412

The above picture, which I quickly threw together manually, maps it out nicely.

Remember, this is an Exynos 4412 prime, with 4 A9s clocked from 1.7-2.0GHz, 2GB LP-DDR2 at 880MHz, and a Mali-400MP4 which clocks from 440MHz to an insane 800MHz. The test is the quake 3 arena timedemo, running on top of limare. Quake 3 Arena is single threaded, so apart from the limare job handling, the other 3 A9 cores simply sit idle. It's sadly the only good test I have, if someone wants to finish the work to port Doom3 to gles, I am sure that many people will really appreciate it.

At 720p, we are fully CPU limited. At some points in the timedemo (as not all scenes put the same load on cpu and/or gpu), the difference in mali clock makes us slightly faster if the cpu can keep up, but this levels out slightly above 533MHz. Everything else is simply scaling linearly with the cpu clock. Every change in cpu clock is a 80% change in framerate. We end up hitting 176.4fps.

At 1080p, it is a different story. 1080p is 2.25 times the amount of screen real estate of 720p (if that number rings a bell, 2.25MB equals two banks of Tseng ET6x00 MDRAM :p). 2.25 times the amount of pixels that need to pushed out. Here clearly the CPU is not the limiting factor. Scaling linearly from the original 91fps at 440MHz is a bit pointless, as the Q3A benchmark is not always stressing CPU and GPU equally over the whole run. I've drawn the continuation of the 440-533MHz increase, and that would lead to 150fps, but instead we run into 135.1fps. I think that we might be stressing the memory subsystem too much. At 135fps, we are pushing over 1GBps out to the framebuffer, this while the display is refreshing at 60fps, so reading in half a gigabyte. And all of this before doing a single texture lookup (of which we have loads).

It is interesting to see the CPU become measurably relevant towards 800MHz. There must be a few frames where the GPU load is such that the faster CPU is making a distinguishable difference. Maybe there is more going on than just memory overload... Maybe in future i will get bored enough to properly implement the mali profiling support of the kernel, so that we can get some actual GP and PP usage information, and not just the time we spent waiting for the kernel job to return.

ARM Management and the Lima driver


I have recently learned, from a very reliable source, that ARM management seriously dislikes the Lima driver project.

To put it nicely, they see no advantage in an open source driver for the Mali, and believe that the Lima driver is already revealing way too much of the internals of the Mali hardware. Plus, their stance is that if they really wanted an open source driver, they could simply open up their own codebase, and be done.

Really?

We can debate endlessly about not seeing an advantage to an open source driver for the Mali. In the end ARMs direct customers will decide on that one. I believe that there is already 'a slight bit of' traction for the general concept of open source software, I actually think that a large part of ARMs high margin products depend on that concept right now, and this situation is not going to get any better with ARMv8. Silicon vendors and device makers are also becoming more and more aware of the pain of having to deal with badly integrated code and binary blobs. As Lima becomes more complete, ARMs customers will more and more demand support for the Lima driver from ARM, and ARM gets to repeat that mantra: "We simply do not see the advantage"...

About revealing the internals of the Mali, why would this be an issue? Or, let me rephrase that, what is ARM afraid of?

If they are afraid of IP issues, then the damage was done the second the Mali was poured into silicon and sold. Then the simple fact that ARM is that apprehensive should get IP trolls' mouths watering. Hey IP Trolls! ARM management believes that there are IP issues with the Mali! Here is the rainbow! Start searching for your pot of gold now!

Maybe they are afraid that what is being revealed by the Lima driver is going to help the competition. If that is the case, then it shows that ARM today has very little confidence in the strength of their Mali product or in their own market position. And even if Nvidia or Qualcomm could learn something today, they will only be able to make use of that two years or even further down the line. How exactly is that going to hurt the Mali in the market it is in, where 2 years is an eternity?

If ARM really believes in their Mali product, both in the Mali's competitivity and in the originality of its implementation, then they have no tangible reason to be afraid of revealing anything about its internals.

Then there is the view that ARM could just open source their own driver. Perhaps they could, it really could be that they have had very strict agreements with their partners, and that ARM is free to do what they want with the current Mali codebases. I personally think it is rather unlikely that everything is as watertight as ARM management imagines. And even then, given that they are afraid of IP issues... How certain are ARMs lawyers that nothing contentious slipped into the code over the years? How long will it take ARMs legal department to fully review this code and assess that risk?

The only really feasible solution tends to be a freshly written driver, with a full development history available publically. And if ARM wants to occupy their legal department, then they could try to match intel (AMD started so well, but ATI threw in the towel so quickly, but luckily the AMD GPGPU guys continued part of it), and provide the Technical Reference Manual and other documents to the Mali. That would be much more productive, especially as that will already be more legal overhead than ARM management would be willing to spare, when they do finally end up seeing the light.

So. ARM management hates us. But guess what. Apart from telling us to change our name (there was apparently the "fear" of a trademark issue with us using Remali, so we ended up calling it Lima instead), there was nothing that they could do to stop us a year and a half ago. And there is even less that ARM can do to stop us today :)

A full 6.0%...

04 Jun 2013 1:17am GMT

03 Jun 2013

feedplanet.freedesktop.org

Aleksander Morgado: Changing modes and capabilities in ModemManager

I don't really recall when my first interaction with ModemManager was, I just remember saying "wait, this modem just works?". But I do remember one day when I spent a couple of hours trying to understand why my modem wouldn't switch to 2G-only mode even if I explicitly selected it in the network-manager-applet. Truth be told, I didn't dig much in the issue that day; I barely knew what ModemManager was, or how it would interface with NetworkManager, or how to really debug it. And here I am possibly 4 years after that day, trying to get that same issue fixed.

allowed-and-preferred-modes

ModemManager has always allowed to specify which 'network type' to use; or, rather than network type, 'allowed and preferred' modes as we name them. It basically is a way to tell your modem that you want to use one technology preferred over another (e.g. allow 2G and 3G, but prefer 3G), or even tell the modem to use only one technology type (e.g. 3G only). Even modern phones allow you to turn off 3G support and only use 2G, in order to save battery. The main problem with ModemManager's way of handling this issue was that there was a predefined set of combinations to select (as exposed by the applet), and that not all combinations are supported by all modems. Even worse, ModemManager may not know how to use them, or the modem itself may not support mode switching at all (which was actually what was happening with my modem 4 years ago). Therefore, the UI would just try to show all the options and hope for the best when launching the mobile broadband connection. And there it comes the next issue; allowing to select mode preferences during the connection setup just makes the modem restart the whole radio stack, and the connection attempt may end up timing out, as the whole network registration process needs to be done from scratch before connecting…

Allowed and Preferred modes

In the new ModemManager interfaces, each Modem object will expose a "SupportedModes" property listing all the mode combinations (allowed + preferred) the modem actually supports. Graphical user interfaces will therefore be able to provide mode switching options listing only those combinations that will work. If a modem doesn't support mode switching, no such list should be provided, and the user will not get confused. At any time, the Modem object will also expose a "CurrentModes" property, showing which is the currently selected combination of allowed and preferred modes. And the "SetCurrentModes()" method will allow to switch current modes, accepting as input only combinations which are given in "SupportedModes" (or the special allowed=ANY and preferred=NONE).

Also, changing current modes directly when calling Simple.Connect() will no longer be possible. This means that NetworkManager will never request allowed mode switching during a connection attempt (and hence no radio stack reloading in the modem causing timeouts). The logical place to put allowed mode switching is therefore a system configuration application like the GNOME Control Center or similar, which should allow mode switching at any time, not only just during a connection attempt. A good side effect of this change is that the NetworkManager connection settings now contain only connection-related configuration, which in the case of 3GPP devices can be linked to the SIM in use, leaving out all modem-specific configuration.

Capabilities

There was a time when modems were either 3GPP (GSM/GPRS/UMTS/HSPA…) or 3GPP2 (CDMA/EV-DO…). Nowadays, modems with multiple capabilities are pretty common, specially since LTE is around (LTE, even if 3GPP, is also 3GPP2′s blessed 4G technology, instead of the superhero named one which is forgotten by everyone already). ModemManager will now allow to change capabilities in addition to allowed and preferred modes; so a user with a modem which can work both in 3GPP and 3GPP2 networks will be able to switch from one to the other directly from the user interface. Of course, if the modem supports this (currently only QMI-based modems).

The new "SupportedCapabilities" property will expose all capability combinations supported by the modem, while "CurrentCapabilities" will expose which are the current ones being used at any given time. For example, a modem with "gsm-umts", "cdma-evdo" and "lte" capabilities may support configuring only "cdma-evdo", or "gsm-umts"+"lte". Changing current capabilities is now possible through the "SetCurrentCapabilities()" method, which has a logic very similar to that of the "SetCurrentModes()" method. If a modem supports multiple capability combinations as exposed by "SupportedCapabilities", this method will allow changing between them. The main difference with mode changing is that we will force a device power-cycle when this change is done, so the modem will disappear and reappear again with the new capabilities.

Capabilities and allowed/preferred modes have a lot in common, so much that there is a single interface in QMI based modems to change them. Therefore, when a modem allows changing capabilities, the list of allowed/preferred mode combinations may (and very likely will) be different depending on the current capabilities in the modem. For example, LTE-enabled QMI-powered modems will not be able to switch allowed/preferred modes when they have "lte" among the current capabilities, but they will be able if the capabilities are changed to only "gsm-umts". This is not a big deal, as mode preference (e.g. 3G preferred) is not applicable when the modem does LTE (there is no way of saying allow 2G, 3G and 4G but prefer 3G).

Bands

ModemManager also allows to specify which frequency bands to use in the modem, but unlike with modes and capabilities, the "SupportedBands" property is not a list of all possible band combinations supported. Instead, it's just a bitmask with all supported bands, without specifying whether an actual combination is going to work in "SetCurrentBands()" or not. Listing combinations instead of just the bitmask would be truly too much… But anyway, changing frequency bands is not a feature that a normal user should play with, so just don't do it. I actually bricked a Pantech UML290 myself playing with this…

aleksander/api-breaks

All these updates, plus some other ones, are available in the 'aleksander/api-breaks' branch in the ModemManager git repository, which should hit git master very soon, likely this week. These ones should be the last API breaks done before releasing the new ModemManager, and will be kept stable after that.


Filed under: Development, FreeDesktop Planet, GNOME Planet, Lanedo Planet, Planets Tagged: gnome, ModemManager, NetworkManager, QMI

03 Jun 2013 4:12pm GMT

02 Jun 2013

feedplanet.freedesktop.org

Rob Clark: freedreno + gnome-shell on nexus4/a320

We don't need no binary blobz :-)



gallium/freedreno + xf86-video-freedreno using XA gallium state tracker on fedora F18. Gnome-shell, compiz, xonontic, ioquake all work. Just need to clean up the patches for XA and freedreno a bit more before they are ready for upstream. And hopefully in the next couple days I'll have some time to put together a sort of make-shift installer for anyone else who wants to try.

02 Jun 2013 11:42pm GMT

31 May 2013

feedplanet.freedesktop.org

Julien Danjou: OpenStack Ceilometer Havana-1 milestone released

Yesterday, the first milestone of the Havana developement branch of Ceilometer has been released and is now available for testing and download. This means the first quarter of the OpenStack Havana development has passed!

New features

Ten blueprints have been implemented as you can see on the release page. I'm going to talk through some of them here, that are the most interesting for users.

Ceilometer can now counts the scheduling attempt of instances done by nova-scheduler. This can be useful to eventually bill such information or for audit (implemented by me for eNovance).

People using the HBase backend can now do requests filtering on any of the counter fields, something we call metadata queries, and which was missing for this backend driver. Thanks to Shengjie Min (Dell) for the implementation.

Counters can now be sent over UDP instead of the Oslo RPC mechanism (AMQP based by default). This allows counter transmission to be done in a much faster way, though less reliable. The primary use case being not audit or billing, but the alarming features that we are working on (implemented by me for eNovance).

The initial alarm API has been designed and implemented, thanks to Mehdi Abaakouk (eNovance) and Angus Salkled (RedHat) who tackled this. We're now able to do CRUD actions on these.

Posting of meters via the HTTP API is now possible. This is now another conduct that can be used to publish and collector meter. Thanks to Angus Salkled (RedHat) for implementing this.

I've been working on an somewhat experimental notifier driver for Oslo notification that publishes Ceilometer counters instead of the standard notification, using the Ceilometer pipeline setup.

Sandy Walsh (Rackspace) has put in place the base needed to store raw notifications (events), with the final goal of bringing more functionnalities around these into Ceilometer.

Obviously, all of this blueprint and bugfixes wouldn't be implemented or fixed without the harden eyes of our entire team, reviewing code and advising restlessly the developers. Thanks to them!

Bug fixes

Thirty-one bugs were fixed, though most of them might not interest you so I won't elaborate too much on that. Go read the list if you are curious.

Toward Havana 2

We now have 21 blueprints targetting the Ceilometer's second Havana milestone, with some of them are already started. I'll try to make sure we'll get there without too much trouble for the 18th July 2013. Stay tuned!

31 May 2013 11:15am GMT

26 May 2013

feedplanet.freedesktop.org

Daniel Vetter: i915/GEM Crashcourse: Overview

Now that the entire series is done I've figured a small overview would be in order.

Part 1 talks about the different address spaces that a i915 GEM buffer object can reside in and where and how the respective page tables are set up. Then it also covers different buffer layouts as far as they're a concern for the kernel, namely how tiling, swizzling and fencing works.

Part 2 covers all the different bits and pieces required to submit work to the gpu and keep track of the gpu's progress: Command submission, relocation handling, command retiring and synchronization are the topics.

Part 3 looks at some of the details of the memory management implement in the i915.ko driver. Specifically we look at how we handle running out of GTT space and what happens when we're generally short on memory.

Finally part 4 discusses coherency and caches and how to most efficiently transfer between the gpu coherency domains and the cpu coherncy domain under different circumstances.

Happy reading!
Update: There's now also a new article with a few questions and answers about some details in the i915 gem code.

26 May 2013 2:42pm GMT

Daniel Vetter: i915/GEM Q&A

So apparently people do indeed read my my i915/GEM crashcourse and a bunch of follow-up questions popped up in private mails. Since I'm a lazy bastard I've clean some of the common questions&answers up to be able to easily point at them. And hopefully they also help someone else to clarify things a bit.

Question: What's the significance of i915_gem_sw_finish_ioctl ? It seems to flush cpu caches, but only conditional on obj->pin_count != 0. Why does it no unconditionally flush the cpu caches like e.g. when we move an unsnooped/not LLC-cached object into a gpu domain?

Answer: i915_gem_sw_finish_ioctlis only used to flush out cpu rendering to the display (and in current userspace it's not used at all). obj->pin_count != 0 is used as a proxy for "this a scanout buffer". Obviously more intelligent userspace should know whether it is doing cpu rendering to a displayed buffer or not and force the expensive clflushing with e.g. the set_domain ioctl only when really required. But the sw_finish ioctl is called from the libdrm cpu mmap unmap function, which does not have this knowledge at hand, hence the check in the kernel. Furthermore for efficient integration of cpu rendering into the gpu render pipeline we want to use snoopable objects even on non-LLC platforms which means that this ioctl shouldn't really be used any more for new code.

Question: So the cpu can only access a GEM object through the GTT when it's in the mappable part of the GTT, i.e. when gtt_offset + size <= gtt_mappable_end. But the i915_gem_object_set_to_gtt_domain function does not check that whether this condition is satisfied or not and simply goes ahead with the domain change. Why is that done so, even though the cpu won't be able to access the buffer object at its current place?

Answer:The GTT domain is purely about coherency, i.e. a buffer object is in the GTT domain if reads/writes through the GTT would see the correct values. The other big domain is cpu domain, i.e. the data (when accessed directly in the physical memory location, not going through the GTT) is coherent with cpu caches. Shifting between these two domains requires flushing/invalidating cpu caches.

Note that on recent kernels that doesn't even mean that there's a global GTT mapping allocated for that buffer object: This is used to optimize away the redundant cache flushing when moving an object around, e.g. when moving it into the mappable range to serve a cpu access page fault. In the future this will be even more common once we have proper per-process GTT address spaces. Then an object could be fully coherent with the GTT domain, read by the gpu through a PPGTT mapping, but don't have an offset allocated for it in the global GTT at all.

The mappable GTT address range on the other hand is a different concept and simply means the object has a GTT mapping visible to the cpu (on gpus without PPGTT the global GTT can be up to 2g, but only 256m are usually visible in the pci bar). Note that GEM object can be mappable but can be (at the same time) in the cpu domain. This happens when userspace writes to the buffer object through the cpu mappings.

Question: How does the the i915_gem_fault function handle a page fault when it itself is invoked through a page fault in the i915 GEM kernel code? Like suppose if fault_in_pages_readable function is called which dereferences a user pointer - won't that cause issues with deadlocks?

Answer:Yes, this can happen and we need to be careful that we cannot possible deadlock with our own pagefault handlers. And it's not just theoretical, it happens in the wild when a GL client tries to use a pointer obtained from one of the texture mapping funtions (which can use a GTT memory mapping internally) to upload data (which could use the pwrite GEM ioctl).

These potential deadlocks are resolved by instructing the linux memory subsystem to not serve pagefaults when accessing userspace memory but instead fail it. Then our code can release any resources and locks required by our own page fault handler and retry the operation in a slowpath. Often this requires that we copy the data into a (unfaultable) temporary buffer in kernel's memory space. These atomic sections are often implicit, but we have a few places where we need to explicitly disable page fault handler with pagefault_disable/enable() calls.

Question: Is obj->fenced_gpu_access ever set on modern platforms - it seems not? Or could this cause a stall waiting for the gpu when all fences are in use and we need a few fence to handle a GTT page fault?

Answer: No, this is only ever set on Gen2/3 devices. Those gpus use the same GTT fences used on all platforms for detiling cpu access also for gpu access, at least for some gpu rendering functions. So this is irrelevant on modern platforms and can't lead to a stall in the pagefault handler when accessing an otherwise idle buffer object.

Question: What is this wedeged stuff - there's lots of references to it in the i915 GEM code?

Answer: This is part of the gpu hang detection and reset handling code, which I didn't really cover in my crashcourse. It is set when we've detected a hang but failed to reset the gpu. It will cause all subsequent command submission from userspace to fail with -EIO, which is used by userspace as a signal to fall back to software rendering. The i915 hang detection and reset code has been (and still is) under pretty active development and is nowadays a rather complex piece of code. I plan to cover it more in-depth hopefully soon.

Question: In the use_cpu_reloc function, why is the obj->cache_level != I915_CACHE_NONE condition used?

Answer: That's just crazy optimization - it's always faster to write relocations through cpu maps if LLC caching is enabled. But without caching it's faster to use global GTT access - but then only if we have the mappable mapping already set up. Note that pwrite ioctl code has similar tricks.

26 May 2013 2:40pm GMT

25 May 2013

feedplanet.freedesktop.org

Søren Sandmann Pedersen: The Radix Heap

The Radix Heap is a priority queue that has better caching behavior than the well-known binary heap, but also two restrictions: (a) that all the keys in the heap are integers and (b) that you can never insert a new item that is smaller than all the other items currently in the heap.

These restrictions are not that severe. The Radix Heap still works in many algorithms that use heaps as a subroutine: Dijkstra's shortest-path algorithm, Prim's minimum spanning tree algorithm, various sweepline algorithms in computational geometry.

Here is how it works. If we assume that the keys are 32 bit integers, the radix heap will have 33 buckets, each one containing a list of items. We also maintain one global value last_deleted, which is initially MIN_INT and otherwise contains the last value extracted from the queue.

The invariant is this:

The items in bucket $k$ differ from last_deleted in bit $k - 1$, but not in bit $k$ or higher. The items in bucket 0 are equal to last_deleted.

For example, if we compare an item from bucket 10 to last_deleted, we will find that bits 31-10 are equal, bit 9 is different, and bits 8-0 may or may not be different.

Here is an example of a radix heap where the last extracted value was 7:

As an example, consider the item 13 in bucket 4. The bit pattern of 7 is 0111 and the bit pattern of 13 is 1101, so the highest bit that is different is bit number 3. Therefore the item 13 belongs in bucket $3 + 1 = 4$. Buckets 1, 2, and 3 are empty, but that's because a number that differs from 7 in bits 0, 1, or 2 would be smaller than 7 and so isn't allowed in the heap according to restriction (b).

Operations
When a new item is inserted, it has to be added to the correct bucket. How can we compute the bucket number? We have to find the highest bit where the new item differs from last_deleted. This is easily done by XORing them together and then finding the highest bit in the result. Adding one then gives the bucket number:

bucket_no = highest_bit (new_element XOR last_deleted) + 1

where highest_bit(x) is a function that returns the highest set bit of x, or $-1$ if x is 0.

Inserting the item clearly preserves the invariant because the new item will be in the correct bucket, and last_deleted didn't change, so all the existing items are still in the right place.

Extracting the minimum involves first finding the minimal item by walking the lowest-numbered non-empty bucket and finding the minimal item in that bucket. Then that item is deleted and last_deleted is updated. Then the bucket is walked again and all the items are redistributed into new buckets according to the new last_deleted item.

The extracted item will be the minimal one in the data structure because we picked the minimal item in the redistributed bucket, and all the buckets with lower numbers are empty. And if there were a smaller item in one of the buckets with higher numbers, it would be differing from last_deleted in one of the more significant bits, say bit $k$. But since the items in the redistributed bucket are equal to last_deleted in bit $k$, the hypothetical smaller item would then have to also be smaller than last_deleted, which it can't be because of restriction (b) mentioned in the introduction. Note that this argument also works for two-complement signed integers.

We have to be sure this doesn't violate the invariant. First note that all the items that are being redistributed will satisfy the invariant because they are simply being inserted. The items in a bucket with a higher number $k$ were all different from the old last_deleted in the $(k-1)$th bit. This bit must then necessarily also be different from the $(k-1)$th bit in the new last_deleted, because if it weren't, the new last_deleted would itself have belonged in bucket $k$. And finally, since the bucket being redistributed is the lowest-numbered non-empty one, there can't be any items in a bucket with a lower number. So the invariant still holds.

In the example above, if we extract the two '7's from bucket 0 and the '8' from bucket 4, the new heap will look like this:

Notice that bucket 4, where the '8' came from, is now empty.

Performance
Inserting into the radix heap takes constant time because all we have to do is add the new item to a list. Determining the highest set bit can be done in constant time with an instruction such as bsr.

The performance of extraction is dominated by the redistribution of items. When a bucket is redistributed, it ends up being empty. To see why, remember that all the items are different from last_deleted in the $(k - 1)$th bit. Because the new last_deleted comes from bucket $k$, the items are now all equal to last_deleted in the $(k - 1)th$ bit. Hence they will all be redistributed to a lower-numbered bucket.

Now consider the life-cycle of a single element. In the worst case it starts out being added to bucket 31 and every time it is redistributed, it moves to a lower-numbered bucket. When it reaches bucket 0, it will be next in line for extraction. It follows that the maximum number of redistributions that an element can experience is 31.

Since a redistribution takes constant time per element distributed, and since an element will only be redistributed $d$ times, where $d$ is the number of bits in the element, it follows that the amortized time complexity of extraction is $O(d)$. In practice we will often do better though, because most items will not move through all the buckets.

Caching performance
Some descriptions of the radix heap recommend implementing the buckets as doubly linked lists, but that would be a mistake because linked lists have terrible cache locality. It is better to implement them as dynamically growing arrays. If you do that, the top of the buckets will tend to be hot which means the per-item number of cache misses during redistribution of a bucket will tend to be $O(1/B)$, where $B$ is the number of integers in a cache line. This means the amortized cache-miss complexity of extraction will be closer to $O(d/B)$ than to $O(d)$.

In a regular binary heap, both insertion and extraction require $\Theta(\log n)$ swaps in the worst case, and each swap (except for those very close to the top of the heap) will cause a cache miss.

In other words, if $d = \Theta(\log n)$, extraction from a radix heap will tend to generate $\Theta(\log n / B)$ cache misses, where a binary heap will require $\Theta(\log n)$.

25 May 2013 12:00am GMT

24 May 2013

feedplanet.freedesktop.org

Pekka Paalanen: Weston on Raspberry Pi Accelerated

Raspberry Pi is a nice tiny computer with a relatively powerful VideoCore graphics processor, and an ARM core bolted on the side running Linux. Around October 2012 I was bringing Wayland to it, and in November the Weston rpi-backend was merged upstream. Unfortunately, somehow I did not get around to write about it. In spring 2013 I did a follow-on project on the rpi-backend for the Raspberry Pi Foundation as part of my work for Collabora. We are now really pushing Wayland forward on the Raspberry Pi, and strengthening Collabora's Wayland expertise on all fronts. In the following I will explain what I did and how the new rpi-backend for Weston works in technical terms. If you are more interested in why this was done, I refer you to the excellent post by Daniel Stone: Weston on Raspberry Pi.

Bringing Wayland to Raspberry Pi in 2012

Raspberry Pi has EGL and GL ES 2 support, so the easiest way to bring Wayland was to port Weston. Fortunately unlike most Android-based devices, Raspberry Pi supports normal Linux distributions, and specifically Raspbian, which is a variant of Debian. That means very standard Linux desktop stuff, and easy to target. Therefore I only had to write a new Raspberry Pi specific backend to Weston. I could not use any existing backend, because the graphics stack does not support DRM nor GBM, and running on top of (fbdev) X server would void the whole point. No other usable backends existed at the time.

The proprietary graphics API on RPi is Dispmanx. Dispmanx basically offers a full 2D compositor, but since Weston composited with GL ES 2, I only needed enough Dispmanx to get a full-screen surface for EGL. Half of the patch was just boilerplate to support input and VT handling. All that was fairly easy, but left the Dispmanx API largely unused, not hooking up to the real performance of the VideoCore. Sure, GL ES 2 is accelerated on the VideoCore, too, but it is a much more complex API.

I continued to take more advantage of the hardware compositor Dispmanx exposes. At the time, the way to do that was to implement support for Weston planes. Weston planes were developed for taking advantage of overlay hardware. A backend can take suitable surfaces out from the scenegraph and composite them directly in hardware, bypassing the GL ES 2 renderer of Weston. A major motivation behind it was to offload video display to dedicated hardware, and avoid YUV-RGB color conversion and scaling in GL shaders. Planes allow also the use of hardware cursors.

The hardware compositor on RPi is partially firmware-based. This means that it does not have a constant limit in number of overlays. Standard PC hardware has at most a few overlays if any, the hardware cursor included. The RPi hardware however offers a lot more. In fact, it is possible to assign all surfaces into overlay elements. That is what I implemented, and in an ideal case (no surface transformations) I managed to put everything into overlay elements, and the GL renderer was left with nothing to do.

The hardware compositor does have its limitations. It can do alpha blending, but it cannot rotate surfaces. It also does have a limit on how many elements it can handle, but the actual number depends on many things. Therefore, I had an automatic fallback to the GL renderer. The Weston plane infrastructure made that very easy.

The fallback had some serious downsides, though. There was no way to synchronize all the overlay elements with the GL rendering, and switches between fallback and overlays caused glitches. What is worse, memory consumption exploded through the roof. We only support wl_shm buffers, which need to be copied into GL textures and Dispmanx resources (hardware buffers). As we would jump between GL and overlays arbitrarily and per surface, and I did not want to copy each attached buffer to both of texture and resouce, I had to keep the wl_shm buffer around, just in case it needs to jump and copy as needed. That means that clients will be double-buffered, as they do not get the buffer back until they send a new one. In Dispmanx, the elements, too, need to be double-buffered to ensure that there cannot be glitches, so they needed two resources per element. In total, that means 2 wl_shm buffers, 1 GL texture, and 2 resources. That is 5 surface-sized buffers for every surface! But it worked.

The first project ended, and time passed. Weston got the pixman-renderer, and the renderer interfaces matured. EGL and GL were decoupled from the Weston core. This made the next project possible.

Introducing the Rpi-renderer in Spring 2013

Since Dispmanx offers a full hardware compositor, it was decided that the GL renderer is dropped from Weston's rpi-backend. We lose arbitrary surface transformations like rotation, but on all other aspects it is a win: memory usage, glitches, code and APIs, and presumably performance and power consumption. Dispmanx allows scaling, output transforms, and alpha channel mixed with full-surface alpha. No glitches as we do not jump between GL and overlays anymore. All on-screen elements can be properly synchronized. Clients are able to use single buffering. The Weston renderer API is more complete than the plane API. We do not need to manipulate complex GL state and create vertex buffers, or run the geometry decimation code; we only compute clips, positions, and sizes.

The rpi-backend's plane code had all the essential bits for Dispmanx to implement the rpi-renderer, so lots of the code was already there. I took me less than a week to kick out the GL renderer and have the rpi-renderer show the desktop for the first time. The rest of a month's time was spent on adding features and fixing issues, pretty much.

Graphics Details

Rpi-backend

The rpi-renderer and rpi-backend are tied together, since they both need to do their part on driving the Dispmanx API. The rpi-backend does all the usual stuff like opens evdev input devices, and initializes Dispmanx. It configures a single output, and manages its updates. The repaint callback for the output starts a Dispmanx update cycle, calls into the rpi-renderer to "draw" all surfaces, and then submits the update.

Update submission is asynchronous, which means that Dispmanx does a callback in a different thread, when the update is completed and on screen, including the synchronization to vblank. Using a thread is slightly inconvenient, since that does not plug in to Weston's event loop directly. Therefore I use a trick: rpi_flippipe is essentially a pipe, a pair of file descriptors connected together. Write something into one end, and it pops out the other end. The callback rpi_flippipe_update_complete(), which is called by Dispmanx in a different thread, only records the current timestamp and writes it to the pipe. The other end of the pipe has been registered with Weston's event loop, so eventually rpi_flippipe_handler() gets called in the right thread context, and we can actually handle the completion by calling rpi_output_update_complete().

Rpi-renderer

Weston's renderer API is pretty small:

The rpi-renderer per-surface state is struct rpir_surface. Among other things, it contains a handle to a Dispmanx element (essentially an overlay) that shows this surface, and two Dispmanx resources (hardware pixel buffers); the front and the back. To show a picture, a resource is assigned to an element for scanout.

The attach callback basically only grabs a reference to the given wl_shm buffer. When Weston core starts an output repaint cycle, it calls flush_damage, where the buffer contents are copied to the back resource. Damage is tracked, so that in theory, only the changed parts of the buffer are copied. In reality, the implementation of vc_dispmanx_resource_write_data() does not support arbitrary sub-region updates, so we are forced to copy full scanlines with the same stride as the resource was created with. If stride does not match, the resource is reallocated first. Then flush_damage drops the wl_shm buffer reference, allowing the compositor to release the buffer, and the client can continue single-buffered. The pixels are saved in the back resource.

Copying the buffer involves also another quirk. Even though the Dispmanx API allows to define an image with a pre-multiplied alpha channel, and mix that with a full-surface (element) alpha, a hardware issue causes it to produce wrong results. Therefore we cannot use pre-multiplied alpha, since we want the full-surface alpha to work. This is solved by setting the magic bit 31 of the pixel format argument, which causes vc_dispmanx_resource_write_data() to un-pre-multiply, that is divide, the alpha channel using the VideoCore. The contents of the resource become not pre-multiplied, and mixing with full-surface alpha works.

The repaint_output callback first recomputes the output transformation matrix, since Weston core computes it in GL coordinate system, and we use framebuffer coordinates more or less. Then the rpi-renderer iterates over all surfaces in the repaint list. If a surface is completely obscured by opaque surfaces, its Dispmanx element is removed. Otherwise, the element is created as necessary and updated to the new front resource. The element's source and destination pixel rectangles are computed from the surface state, and clipped by the resource and the output size. Also output transformation is taken into account. If the destination rectangle turns out empty, the element is removed, because every existing Dispmanx element requires VideoCore cycles, and it is best to use as few elements as possible. The new state is set to the Dispmanx element.

After all surfaces in the repaint list are handled, rpi_renderer_repaint_output() goes over all other Dispmanx elements on screen, and removes them. This makes sure that a surface that was hidden, and therefore is not in the repaint list, will really get removed from the screen. Then execution returns to the rpi-backend, which submits the whole update in a single batch.

Once the update completes, the rpi-backend calls rpi_renderer_finish_frame(), which releases unneeded Dispmanx resources, and destroys orphaned per-surface state. These operations cannot be done any earlier, since we need to be sure the related Dispmanx elements have really been updated or removed to avoid possible visual glitches.

The rpi-renderer implements surface_set_color by allocating a 1×1 Dispmanx resource, writing the color into that single pixel, and then scaling it to the required size in the element. Dispmanx also offers a screen capturing function, which stores a snapshot of the output into a resource.

Conclusion

While losing some niche features, we gained a lot by pushing all compositing into the VideoCore and the firmware. Memory consumption is now down to a reasonable level of three buffers per surface, or just two if you force single-buffering of Dispmanx elements. Two is on par with Weston's GL renderer on DRM. We leverage the 2D hardware for compositing directly, which should perform better. Glitches and jerks should be gone. You may still be able to cause the compositing to malfunction by opening too many windows, so instead of the compositor becoming slow, you get bad stuff on screen, which is probably the only downside here. "Too many" is perhaps around 20 or more windows visible at the same time, depending.

If the user experience of Weston on Raspberry Pi was smooth earlier, especially compared to X (see the video), it is even smoother now. Just try the desktop zoom (Win+MouseWheel), for instance! Also, my fellow collaborans wrote some new desktop effects for Weston in this project. Should you have a company needing assistance with Wayland, Collabora is here to help.

The code is available in the git branch raspberrypi-dispmanx, and in the Wayland mailing list. On May 23rd, 2013, the Raspberry Pi specific patches are already merged upstream, and the demo candy patches are waiting for review.

Further related links:
Raspberry Pi Foundation, Wayland preview
Collabora, press release

24 May 2013 6:27am GMT

23 May 2013

feedplanet.freedesktop.org

Daniel Stone: Weston on Raspberry Pi

One of the platforms we've been working on for a while at Collabora is the Raspberry Pi. Obviously the $25 pricepoint makes it hugely appealing to a lot of people - including free software developers who up until now have managed to avoid the agonyjoy we experience on a daily basis working on embedded and mobile platforms - but there are a couple of aspects which speak specifically to us as a company.

Firstly, we did quite a bit of work on OLPC through the years, which had a similar, very laudable, educational mission encouraging not just deep computer literacy in children, but also open source involvement. The Raspberry Pi has broadly the same aims, a very education-friendly pricepoint, and has seen huge success.

Less loftily, it's a great example of a number of architectures we've been quietly working on for quite some time, where hugely powerful special-purpose (i.e. not OpenGL ES) graphics hardware goes nearly unused, in favour of heavily loading the less powerful CPU, or pushing everything through GL.


background etc

The Raspberry Pi has a Broadcom BCM2385 SoC in it, containing an extremely beefy (roughly set-top-box-grade) video, media and graphics processor called the VideoCore (somewhat akin to a display controller, GPU and DSP hybrid), and a … somewhat less beefy general-purpose ARMv61 CPU. The ARM side does everything you'd expect, whereas the VideoCore is a multi-functional beast, acting as the GPU for OpenGL ES, the display engine for outputs/overlays/etc, and also any general-purpose processing (e.g. accelerated JPEG decode).

In terms of how this looks from the ARM, the VideoCore exposes its display functionality through DispManX, an API for display control similar, in capability at least, to KMS. The DispManX exposes a number of output displays, each of which can have a number of planes (sometimes called overlays or sprites) which can each be in different colourspaces (think: video), scaled, alpha-blended, or variously stacked. No surprise there, as this is how most GPUs and display controllers look everywhere: from your phone, to your desktop, to your set-top box.

A recurring theme for us is how to properly use and expose these overlays. There's a huge benefit in doing so over using GL: not only are they a lot faster, but they also have hugely better quality when doing colourspace conversion and scaling, and extra filters for much better image quality2. There's also a pretty strong power argument to be made; at one stage, measuring on a phone, we found a 20% runtime difference - from 4 hours to a bit over 5 - when watching videos using the overlay, compared to pumping them through GL ES. And that was without the zerocopy support we enjoy nowadays!

The X Video extension (Xv) exposes overlays in a very limiting and frustrating way, meaning they're only really suitable for video, and even then aren't capable of zerocopy. DRI2 video support was proposed to fix this, but so far TI's OMAP is the only real deployment of this, and client support is very patchy.

Even then, this is only applicable to video. Under the X11 model, effectively the only way to do compositing is for an external process to render the entire screen as a single image, then pass that image to the X server to display. This is fine for GL, since that's exactly what it's built for, but it means you're never going to get to use all your lovely 2D compositing hardware. Either that, or you do offload compositing inside the X server: something very difficult which almost no-one does, not only because it means no compositing. And no compositing means that your desktop is all back to 1995, with those attractive flickering backgrounds and jittery resizes. :(

i thought this was about rpi … ?

We've been working with the Raspberry Pi Foundation since last year, when we first brought up Weston on Raspberry Pi. At that stage, Weston was still very heavily GLES-based, but was able to opportunistically pull individual surfaces out into overlays if the conditions were right. This worked, and was all well and good, but was an awkward abuse of Weston's internal model, and made RPi look rather unlike any other backend.

Fast forward to a couple of months ago, and Weston had come a long way along the road towards 1.1. A pretty vital piece for us was the renderer infrastructure: whereas Weston previously only had pluggable backends controlling the final display output backend (think: KMS, fbdev, RDP), it now gained a similar split for the renderers.

The original split was between the original GLES renderer and a purely-software Pixman renderer (which was still capable of everything the GLES renderer was!), but it was also a perfect fit for DispManX. So, the first thing Pekka did, and the base of all our work since, was to split the RPi backend into a backend and a renderer, allowing us to guarantee that we'd never fall back to GLES. This lets us feed the entire set of windows into the hardware, complete with stacking order, alpha blending, and 2D transforms, and have the VideoCore take care of everything, all fully synchronised and tear-free. Without using GL!

Pekka's blog has more of the gory details on both the Weston internals, and how DispManX works.

shiny demos

Seemingly no-one can talk about open source graphics without getting all misty-eyed over wobbly windows; what good is straight tech with nothing to show for it? So, while Pekka worked on the renderer, Louis-Francis and myself knocked up a couple of demos to show off what we could do with 2D compositing alone.

The first, and most compelling, demo is in our case study, in which Louis-Francis drags some windows around under both X11 and Weston for the camera. X11 struggles badly, whereas Weston with its VideoCore backend keeps up like a champ.

Louis-Francis then added an optional fade layer, where all the background windows were dimmed, complete with a nice fading animation between foreground and background. This is done internally with one or two solid-colour surfaces, with varying alpha.

Last but not least, I added a reimplementation of GNOME3′s window overview mode. It is a little on the rudimentary side, and does certainly suffer from not having an established layout/packing framework to use, but does work, and I think provides a pretty good demonstration of the kinds of smooth and fluid animations that are totally possible without ever touching GLES. And the best part is that windows zoom around, scaled and alpha-blended, at 60fps, whereas X11 struggles to get to 10fps just moving an unscaled, opaque, window. Yeesh.

And then some of the details were just that: details. Pekka add a new background scaling mode which preserves aspect ratio, Louis-Francis and Pekka worked to make sure the memory usage was always as low as humanly possible, including an optional mode where we trade strict visual correctness for performance and memory usage, we fixed some bugs in the XWayland emulation, and some bugfixes.

sounds great; how do i get it?

The source is already winging its way upstream; currently, the renderer has been merged, but some of the effects are still oustanding. In the meantime, you can clone Pekka's repository:

$ git clone git://git.collabora.co.uk/git/user/pq/weston
$ cd weston
$ git checkout origin/raspberrypi-dispmanx-wip

Or we've also got Raspbian packages, based on the stable 1.0.x rather than unreleased git:

# echo deb http://raspberrypi.collabora.com wheezy rpi >> /etc/apt/sources.list
# apt-get update
# apt-get install weston

tl;dr

A lot of hardware - particularly in the media/embedded/mobile space - ships with hugely powerful special-purpose graphics hardware, aside from the usual OpenGL ES. Up until now, this special-purpose hardware has gone unused, but for equally special-purpose hacks. The work we've done with Weston and Raspberry Pi, using only its dedicated 2D compositing hardware, is a fantastic example of what we can do with Wayland and these kinds of platforms in general, without using GL ES.

now convince your boss

Would you love us to shut up and take your money, but just need that little bit extra to persuade those humourless lot up in purchasing? Fear not: we have an entire case study for the work we've done on the Raspberry Pi, as well as a general graphics page. Raspberry Pi have a writeup themselves too. Or maybe you're more impressed (ha) by press releases? Either way, our trained operators are standing by to take your call.

  1. ARMv6 is the sixth iteration of the ARM architecture; the chip family is ARM11, which was the generation immediately pre-Cortex. Various ARM11s were used in, e.g., the Nokia N810, the iPhone 3G, and the Nintendo 3DS. If you're confused by the ARM nomenclature, Wikipedia has a really good table.
  2. Fun side note: if you want zero-copy video through GL, thanks to the unbelievably harsh wording of GL_OES_EGL_image_external, then you're only allowed to use linear or nearest filtering. Not even bilinear, and definitely nothing near what overlays can do.

23 May 2013 6:22pm GMT

22 May 2013

feedplanet.freedesktop.org

Nagappan Alagappan: [Ann]: Cobra 3.5 - Windows GUI test automation tool


New features:
* ooldtp python client
* Support setting text on combo box
* Added simple command line options
* Support state.editable in hasstate
* Handle valuepattern in click API
* Support ToolBar type on click
* Write to log file if environment variable is set (set LDTP_LOG_FILE=c:\ldtp.log)
* Support control type Table, DataItem in Tree implementation
* Added scrollbar as supported type

New API:
* MouseMove
* setcellvalue
* guitimeout
* oneup
* onedown
* oneleft
* oneright
* scrollup
* scrolldown
* scrollright
* scrollleft

Bugs fixed:
* Fix to support taskbar with consistent index
* istextstateenabled API
* Fallback to object state enabled if value pattern is not available
* Fix to support InvokePattern on Open button
* Use width, height if provided while capturing screenshot
* Work around for copying text to clip board
* QT 5.0.2 specific changes
* Check errno attribute to support cygwin environment
* Fix keyboard APIs with new supported key controls (+, -, :, ;, ~, `, arrow up, down, right, left)
* Don't grab focus if type is tab item

Java client:
* Fixed selectRow arguments
* Fixed compilation issues
Python client:
* Fix optional argument issue in doesrowexist
C# client:
* Added new APIs (scrollup, scrolldown, scrollleft, scrollright, oneup, onedown, oneleft, oneright)
Ruby/Perl client: No changes

Credit:

Nagappan Alagappan, John Yingjun Li, Helen Wu, Eyas Kopty, VMware colleagues

Please spread the word and also share your feedback with us (email me).

About LDTP:

Cross Platform GUI Automation tool Linux version is LDTP, Windows version is Cobra and Mac version is PyATOM.

* Linux version is known to work on GNOME / KDE (QT >= 4.8) / Java Swing / LibreOffice / Mozilla application on all major Linux distribution.
* Windows version is known to work on application written in .NET / C++ / Java / QT on Windows XP SP3 / Windows 7 / Windows 8 development version.
* Mac version is currently under development and verified only on OS X Lion. Where ever PyATOM runs, LDTP should work on it.

Download source / binary (Windows XP / Vista / 7 / 8)
System requirement: .NET 3.5, refer README.txt after installation

Documentation references: For detailed information on LDTP framework and latest updates visit http://ldtp.freedesktop.org

LDTP API doc / Java doc
Report bugs

22 May 2013 10:36pm GMT

Aleksander Morgado: libmbim 1.0.0 released!

Just tagged a 1.0.0 release for libmbim, a library which helps you talk to MBIM-capable modems. You can read more about the MBIM protocol in the libmbim introduction blogpost I wrote some months ago. The 1.0.0 tarball is ready for download from freedesktop.org:

http://www.freedesktop.org/software/libmbim/libmbim-1.0.0.tar.xz

If you want to easily talk to a MBIM device from a GLib-based application, you may want to check the libmbim API documentation.

libmbim is currently used by ModemManager (git master), but you can also now use it in standalone mode with either mbimcli (the command line utility) or mbim-network (a helper script to launch a connection):

# echo "APN=Internet" > /etc/mbim-network.conf

# mbim-network /dev/cdc-wdm0 start
Loading profile...
APN: Internet
Starting network with 'mbimcli -d /dev/cdc-wdm0 --connect=Internet --no-close'...
Network started successfully

# mbim-network /dev/cdc-wdm0 status
Loading profile...
APN: Internet
Getting status with 'mbimcli -d /dev/cdc-wdm0 --query-connection-state --no-close'...
Status: activated

# mbim-network /dev/cdc-wdm0 stop
Loading profile...
APN: Internet
Stopping network with 'mbimcli -d /dev/cdc-wdm0 --disconnect'...
Network stopped successfully

As with libqmi's qmi-network script, you'll still need to run a DHCP client on the wwan interface after getting connected through MBIM. Note that your modem may not support DHCP… if that's your case then patches are welcome to update the script to dump the IP configuration :) Or just use ModemManager, which works nicely with the static IP setup.

Enjoy!


Filed under: Development, FreeDesktop Planet, GNOME Planet, Lanedo Planet, Planets Tagged: libmbim, MBIM, ModemManager

22 May 2013 3:28pm GMT