28 Feb 2015


Corbin Simpson: Monte: Typhon Profiling

Work continues on Typhon. I've recently yearned for a way to study the Monte-level call stacks for profiling feedback. After a bit of work, I think that I've built some things that will help me.

My initial desire was to get the venerable perf to work with Typhon. perf's output is easy to understand, with a little practice, and describes performance problems pretty well.

I'm going to combine this with Brendan Gregg's very cool flame graph system for visually summarizing call stacks, in order to show off how the profiling information is being interpreted. I like flame graphs and they were definitely a goal of this effort.

Maybe perf doesn't need any help. I was able to use it to isolate some Typhon problems last week. I'll use my Monte webserver for all of these tests, putting it under some stress and then looking at the traces and flame graphs.

Now seems like a good time to mention that my dorky little webserver is not production-ready; it is literally just good enough to respond to Firefox, siege, and httperf with a 200 OK and a couple bytes of greeting. This is definitely a microbenchmark.

With that said, let's look at what perf and flame graphs say about webserver performance:

An unhelpful HTTP server profile

You can zoom in on this by clicking. Not that it'll help much. This flame graph has two big problems:

  1. Most of the time is spent in the mysterious "[unknown]" frames. I bet that those are just caused by the JIT's code generation, but perf doesn't know that they're meaningful or how to label them.
  2. The combination of JIT and builtin objects with builtin methods results in totally misleading call stacks, because most object calls don't result in new methods being added to the stack.

I decided to tackle the first problem first, because it seemed easier. Digging a bit, I found a way to generate information on JIT-created code objects and get that information to perf via a temporary file.

The technique is only documented via tribal knowledge and arcane blog entries. (I suppose that, in this regard, I am not helping.) It is described both in this kernel patch implementing the feature, and also in this V8 patch. The Typhon JIT hooks show off my implementation of it.
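For readers who have not met this interface: perf looks for a file named /tmp/perf-<pid>.map and reads one symbol per line, each consisting of a hexadecimal start address, a hexadecimal size, and a symbol name. The sketch below only illustrates that file format in Python. It is not the Typhon hook code, and the function names and the example address are made up for illustration.

```python
import os


def perf_map_path(pid):
    """Path where perf looks for the JIT symbols of process `pid`."""
    return "/tmp/perf-%d.map" % pid


def format_perf_map_entry(start, size, name):
    """One map line: hex start address, hex size (no 0x prefix), symbol name."""
    return "%x %x %s\n" % (start, size, name)


def record_jit_code(start, size, name, path=None):
    """Append an entry for a freshly generated piece of machine code.

    Hypothetical helper: a real JIT would call something like this from
    the hook that fires after code generation.
    """
    if path is None:
        path = perf_map_path(os.getpid())
    with open(path, "a") as handle:
        handle.write(format_perf_map_entry(start, size, name))
```

Calling record_jit_code(0x7f3a10000000, 0x80, "loop_17") would append the line "7f3a10000000 80 loop_17" to the map file of the current process.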

So, does it work? What does it look like?

I didn't upload a picture of this run, because it doesn't look different from the earlier graph! The big [unknown] frames aren't improved at all. Sufficient digging will reveal the specific newly-annotated frames being nearly never called. Clearly this was not a winning approach.

At this point, I decided to completely change my tack. I wrote a completely new call stack tracer inside Typhon. I wanted to do a sampling profiler, but sampling is hard in RPython. The vmprof project might fix that someday. For now, I'll have to do a precise profiler.

Unlabeled HTTP server profile with correct atoms

I omitted the coding montage. Every time a call is made from within SmallCaps, the profiler takes a time measurement before and after the call. This is pretty great! But can we get more useful names?
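To make the idea concrete, here is a minimal Python sketch of a precise (non-sampling) profiler that times every call and emits the "folded" stack lines that the flame graph scripts consume (frames joined by semicolons, followed by a count). It is an illustration under assumptions, not Typhon's profiler: the class and method names are invented, and subtracting the time spent in callees, so that each line carries only self time, is my choice for making the output add up in a flame graph.

```python
import time
from collections import defaultdict


class PreciseProfiler(object):
    """Times every call made through call() and accumulates self time per stack."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock
        self.stack = []        # names of the calls currently in progress
        self.child_time = []   # time spent in callees, one slot per active call
        self.totals = defaultdict(int)

    def call(self, name, func, *args):
        self.stack.append(name)
        self.child_time.append(0)
        key = ";".join(self.stack)
        start = self.clock()   # measurement before the call
        try:
            return func(*args)
        finally:
            elapsed = self.clock() - start   # measurement after the call
            children = self.child_time.pop()
            self.stack.pop()
            self.totals[key] += elapsed - children
            if self.child_time:
                # Charge the whole call to the caller's "time in callees".
                self.child_time[-1] += elapsed

    def folded(self):
        """Lines in folded-stack form: 'outer;inner <ticks>'."""
        return ["%s %d" % item for item in sorted(self.totals.items())]
```

Writing the lines returned by folded() to a file and feeding that file to flamegraph.pl should produce a graph in which frame widths are proportional to time rather than to sample counts.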

Names in Monte are different from names in, say, Python or Java. Python and Java both have class names. Since Monte does not have classes, Monte doesn't have a class name. A compromise which we accept here is to use the "display name" of the class, which will be the pattern used to bind a user-level object literal, and will be the name of the class for all of the runtime object classes. This is acceptable.

HTTP server profile with correct atoms and useful display names

Note how the graphs are differently shaped; all of the frames are being split out properly and the graph is more detailed as a result. The JIT is still active during this entire venture, and it'd be cool to see what the JIT is doing. We can use RPython's rpython.rlib.jit.we_are_jitted() function to mark methods as being JIT'd, and we can ask the flame graph generator to colorize them.


Oh man! This is looking pretty cool. Let's colorize the frames that are able to sit directly below JIT entry points. I do this with a heuristic (regular expression).


This isn't even close to the kind of precision and detail from the amazing Java-on-Illumos profiles on Gregg's site, but it's more than enough to help my profiling efforts.

28 Feb 2015 4:13pm GMT

26 Feb 2015


Bastien Nocera: Another fake flash story

I recently purchased a 64GB mini SD card to slot in to my laptop and/or tablet, keeping media separate from my home directory pretty full of kernel sources.

This Samsung card looked fast enough, and at 25€ including shipping, seemed good enough value.

Hmm, no mention of the SD card size?

The packaging looked rather bare, with no mention of the card's size. I opened up the packaging, and looked over the card.

Made in Taiwan?

What made it weirder is that it says "made in Taiwan", rather than "Made in Korea" or "Made in China/PRC". Samsung apparently makes some cards in Taiwan, I've learnt, but I didn't know that before getting suspicious.

After modifying gnome-multiwriter's fake flash checker, I tested the card, and sure enough, it's an 8GB card, with its firmware modified to show up as 67GB (67GB!). The device (identified through the serial number) is apparently well-known in swindler realms.

Buyer beware, do not buy from "carte sd" on Amazon.fr, and always check for fake flash memory using F3 or h2testw, until udisks gets support for this.
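For the curious, checkers of this kind work by writing known data across the whole advertised capacity and then reading it back. The toy Python model below shows why that exposes a fake card. Everything in it is hypothetical: the FakeFlash class simply assumes that writes beyond the real capacity wrap around and overwrite earlier blocks, which is only one of the ways such cards can misbehave, and it is not how gnome-multiwriter, F3 or h2testw are implemented.

```python
class FakeFlash(object):
    """Simulated card that advertises more blocks than it can really store."""

    def __init__(self, advertised_blocks, real_blocks):
        self.advertised_blocks = advertised_blocks
        self.real_blocks = real_blocks
        self.storage = {}

    def write(self, block, data):
        # Assumed failure mode: addresses silently wrap around.
        self.storage[block % self.real_blocks] = data

    def read(self, block):
        return self.storage.get(block % self.real_blocks)


def count_good_blocks(dev):
    """Write each block's own index everywhere, then count what survived."""
    for block in range(dev.advertised_blocks):
        dev.write(block, block)
    return sum(1 for block in range(dev.advertised_blocks)
               if dev.read(block) == block)
```

With one block standing for one gigabyte, count_good_blocks(FakeFlash(64, 8)) returns 8, while an honest FakeFlash(64, 64) returns 64.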

Amazon were prompt in reimbursing me, but the Comité national anti-contrefaçon and Samsung were completely uninterested in pursuing this further.

In short:

26 Feb 2015 10:57am GMT

23 Feb 2015


Christian Schaller: Reliable BIOS updates in Fedora

Some years ago I bought myself a new laptop, deleted the Windows partition and installed Fedora on the system, only to later realize that the system had a bug that required a BIOS update to fix and that the only tool for doing such updates was available for Windows only. And while some tools and methods have been available from a subset of vendors, BIOS updates on Linux have always been something of a hit-and-miss situation. Well, luckily it seems that we will finally get a proper solution to this problem.
Peter Jones, who is Red Hat's representative to the UEFI working group and who is working on making sure we have everything needed to support this on Linux, approached me some time ago to let me know of the latest incoming update to the UEFI standard, which provides a mechanism for doing BIOS updates. This means that any system that supports UEFI 2.5 will in theory be one where we can initiate the BIOS update from Linux. Systems supporting this version of the UEFI spec are expected to become available through the course of this year, and if you are lucky your hardware vendor might even provide a BIOS update bringing UEFI 2.5 support to your existing hardware, although you would of course need to do that one BIOS update in the old way.

So with Peter's help we got hold of some prototype hardware from our friends at Intel which already has UEFI 2.5 support. This hardware is currently in the hands of Richard Hughes. Richard will be working on incorporating the use of this functionality into GNOME Software, so that you can do any needed BIOS updates through GNOME Software along with all your other software update needs.

Peter and Richard will, as part of this, be working to define a specification/guideline for hardware vendors for how they can make their BIOS updates available in a manner we can consume and automatically check for updates. We will try to align ourselves with the requirements from Microsoft in this area to allow the vendors to either use the exact same package for both Windows and Linux or at least only need small changes to them. We can hopefully get this specification up on freedesktop.org for wider consumption once it's done.

I am also already speaking with a couple of hardware vendors to see if we can pilot this functionality with them, to both encourage them to support UEFI 2.5 as quickly as possible and also work with them to figure out the finer details of how to make the updates available in an easily consumable fashion.

Our hope here is that you eventually can get almost any hardware and know that if you ever need a BIOS update you can just fire up Software and it will tell you what if any BIOS updates are available for your hardware, and then let you download and install them. For people running Fedora servers we have had some initial discussions about doing BIOS updates through Cockpit, in addition of course to the command line tools that Peter is writing for this.

I mentioned in an earlier blog post that one of our goals with the Fedora Workstation is to drain the swamp in terms of fixing the real issues that make using a Linux desktop challenging. Well, this is another piece of that puzzle, and I am really glad we had Peter working with the UEFI standards group to ensure the final specification was useful also for Linux users.

Anyway, as soon as I have some data on concrete hardware that will support this I will make sure to let you know.

23 Feb 2015 4:17pm GMT

22 Feb 2015


Rob Clark: a4xx shaping up

So, I finally figured out the bug that was causing some incorrect rendering in xonotic (and, it turns out, the same bug was plaguing a lot of other games/webgl/etc). The fix is pushed to upstream mesa master (and I guess I should probably push it to the 10.5 stable branch too). Now that xonotic renders correctly, I think I can finally call freedreno a4xx support usable:

Also, for fun, a little comparison between the ifc6540 board (snapdragon 805, aka apq8084), and my laptop (i5-4310U). Both have 1920x1080 resolution, both running gnome-shell and firefox (with identical settings). The laptop runs fedora f21 while the ifc6540 runs rawhide, but it is quite close to an apples-to-apples comparison:

Obviously not a rigorous benchmark, so please don't read too much into the results. The intel is still faster overall (as it should be at its size/price/power budget), but it is amazing that the gap is becoming so small between something that can be squeezed into a cell phone and dedicated laptop class chips.

22 Feb 2015 3:41pm GMT

21 Feb 2015


Samuel Pitoiset: Reverse engineering Windows or Linux PCI drivers with Intel VT-d and QEMU – Part 1

Today, I will describe a new way to reverse engineer PCI drivers by creating a PCI passthrough with a QEMU virtual machine. In this article, I will show you how to use the Intel VT-d technology in order to trace memory mapped input/output (MMIO) accesses of a QEMU VM. As I am a member of the Nouveau community, this howto focuses only on NVIDIA's proprietary driver, but it should be pretty similar for all PCI drivers.


Reverse engineering NVIDIA's proprietary driver is not an easy task, especially on Windows, because there we have support for neither mmiotrace, a toolbox for tracing memory mapped I/O accesses within the Linux kernel, nor valgrind-mmt, which allows tracing application accesses to mmapped memory.

When I started to reverse engineer NVIDIA Perfkit on Windows (for graphics performance counters) in-between the Google Summer of Code 2013 and 2014, I wrote some tools for dumping the configuration of these performance counters, but it was very painful to find multiplexers because I couldn't really trace MMIO accesses. I would have liked to use Intel VT-d but my old computer didn't support that recent technology, but recently I got a new computer and my life has changed. ;)

But what is VT-d, and how do you use it with QEMU?

An input/output memory management unit (IOMMU) allows guest virtual machines to directly use peripheral devices, such as Ethernet adapters and accelerated graphics cards, through DMA and interrupt remapping. This is called VT-d at Intel and AMD-Vi at AMD.

QEMU can use that technology through the VFIO driver, which is an IOMMU/device agnostic framework for exposing direct device access to userspace in a secure, IOMMU protected environment. In other words, this allows safe, non-privileged, userspace drivers. Initially developed by Cisco, VFIO is now maintained by Alex Williamson at Red Hat.

In this howto, I will use Fedora as the guest OS, but the approach should work for both Linux and Windows guests. Let's get started.

Tested hardware

Motherboard: ASUS B85 PRO GAMER

CPU: Intel Core i5-4460 3.20GHz

GPU: NVIDIA GeForce 210 (host) and NVIDIA GeForce 9500 GT (guest)

OS: Arch Linux (host) and Fedora 21 (guest)


Your CPU needs to support both virtualization and IOMMU (Intel VT-d technology, Core i5 at least). You will also need two NVIDIA GPUs and two monitors, or one monitor with two different inputs (one plugged into your host GPU, one into your guest GPU). I would also recommend having a separate keyboard and mouse for the guest OS.

Step 1: Hardware setup

Check if your CPU supports virtualization.

egrep -i '^flags.*(svm|vmx)' /proc/cpuinfo

If so, enable CPU virtualization support and Intel VT-d from the BIOS.
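If you prefer to script this check, the same test can be sketched in Python. The function below is an illustration only (its name is made up): it takes the text of /proc/cpuinfo and reports which of the two virtualization flags appear, vmx for Intel and svm for AMD. As with the egrep command, this only tells you about CPU virtualization; IOMMU support is checked later through dmesg.

```python
def virtualization_flags(cpuinfo_text):
    """Return the subset of {'vmx', 'svm'} listed on any 'flags' line."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "flags":
            found.update(flag for flag in value.split()
                         if flag in ("vmx", "svm"))
    return found


if __name__ == "__main__":
    with open("/proc/cpuinfo") as handle:
        flags = virtualization_flags(handle.read())
    print("virtualization flags: %s" % (", ".join(sorted(flags)) or "none"))
```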

Step 2: Kernel config

1) Modify kernel config
Device Drivers --->
    [*] IOMMU Hardware Support  --->
        [*]   Support for Intel IOMMU using DMA Remapping Devices
        [*]   Support for Interrupt Remapping
Device Drivers --->
    [*] VFIO Non-Privileged userspace driver framework  --->
        [*]   VFIO PCI support for VGA devices
Bus options (PCI etc.) --->
    [*] PCI Stub driver
2) Build kernel
3) Reboot, and check if your system has support for both IOMMU and DMA remapping
dmesg | grep -e IOMMU -e DMAR
[    0.000000] ACPI: DMAR 0x00000000BD9373C0 000080 (v01 INTEL  HSW      00000001 INTL 00000001)
[    0.019360] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap d2008c20660462 ecap f010da
[    0.019362] IOAPIC id 8 under DRHD base  0xfed90000 IOMMU 0
[    0.292166] DMAR: No ATSR found
[    0.292235] IOMMU: dmar0 using Queued invalidation
[    0.292237] IOMMU: Setting RMRR:
[    0.292246] IOMMU: Setting identity map for device 0000:00:14.0 [0xbd8a6000 - 0xbd8b2fff]
[    0.292269] IOMMU: Setting identity map for device 0000:00:1a.0 [0xbd8a6000 - 0xbd8b2fff]
[    0.292288] IOMMU: Setting identity map for device 0000:00:1d.0 [0xbd8a6000 - 0xbd8b2fff]
[    0.292301] IOMMU: Prepare 0-16MiB unity mapping for LPC
[    0.292307] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]

!!! If you have no output, you have to fix this before continuing !!!

Step 3: Build QEMU

git clone git://git.qemu-project.org/qemu.git --depth 1
cd qemu
./configure --python=/usr/bin/python2 # Python 3 is not yet supported
make && make install

You can also install QEMU from your favorite package manager, but I would recommend getting the source code if you want to enable VFIO tracing support.

Step 4: Unbind the GPU with pci-stub

According to my hardware config, I have two NVIDIA GPUs, so blacklisting the Nouveau kernel module is not so good. Instead, I will use pci-stub in order to unbind the GPU which will be assigned to the guest OS.

NOTE: If pci-stub was built as a module, you'll need to modify /etc/mkinitcpio.conf, add pci-stub in the MODULES section, and update your initramfs.

01:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)
05:00.0 VGA compatible controller: NVIDIA Corporation G96 [GeForce 9500 GT] (rev a1)
lspci -n
01:00.0 0300: 10de:0a65 (rev a2) # GT218
05:00.0 0300: 10de:0640 (rev a1) # G96

Now add the following kernel parameter to your bootloader.

pci-stub.ids=10de:0640

Reboot, and check.

dmesg | grep pci-stub
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-nouveau root=UUID=5f64607c-5c72-4f65-9960-d5c7a981059e rw quiet pci-stub.ids=10de:0640
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-nouveau root=UUID=5f64607c-5c72-4f65-9960-d5c7a981059e rw quiet pci-stub.ids=10de:0640
[    0.295763] pci-stub: add 10DE:0640 sub=FFFFFFFF:FFFFFFFF cls=00000000/00000000
[    0.295768] pci-stub 0000:05:00.0: claimed by stub
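The pci-stub.ids value used above is simply the vendor:device pair that lspci -n prints for the guest GPU. As a small illustration, the hypothetical Python helper below derives the parameter from lspci -n output for a chosen list of slots. It is a convenience sketch and not part of any existing tool.

```python
def pci_stub_parameter(lspci_n_output, slots):
    """Build a pci-stub.ids=... kernel parameter for the given PCI slots.

    Each line of `lspci -n` looks like '05:00.0 0300: 10de:0640 (rev a1)':
    slot, class, then the vendor:device pair that pci-stub matches on.
    """
    ids = []
    for line in lspci_n_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] in slots and fields[2] not in ids:
            ids.append(fields[2])
    return "pci-stub.ids=" + ",".join(ids)
```

Note that pci-stub matches on IDs rather than slots, so two identical cards would both be claimed by the stub.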

Step 5: Bind the GPU with VFIO

Now, it's time to bind the GPU (the G96 card in this example) with VFIO in order to pass it through to the VM. You can use this script to make life easier:


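#!/bin/bash

modprobe vfio-pci

# For each PCI address given on the command line (e.g. 0000:05:00.0):
for dev in "$@"; do
        vendor=$(cat /sys/bus/pci/devices/$dev/vendor)
        device=$(cat /sys/bus/pci/devices/$dev/device)
        # Detach the device from its current driver, if it has one.
        if [ -e /sys/bus/pci/devices/$dev/driver ]; then
                echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
        fi
        # Tell vfio-pci to claim devices with this vendor/device ID.
        echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
done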

Bind the GPU:

./vfio-bind.sh 0000:05:00.0 # G96

Step 6: Testing KVM VGA-Passthrough

Let's test if it works, as root:

qemu-system-x86_64 \
    -enable-kvm \
    -M q35 \
    -m 2G \
    -cpu host,kvm=off \
    -device vfio-pci,host=05:00.0,multifunction=on,x-vga=on

If it works fine, you should see a black QEMU window with the message "Guest has not initialized the display (yet)". You will need to pass -vga none, otherwise it won't work. I'll show you all the options I use a bit later.

NOTE: kvm=off is required for some recent NVIDIA proprietary drivers because the driver won't load if it detects KVM…

Step 7: Add USB support

At this step, we have assigned the GPU to the virtual machine, but it would be a good idea to be able to use that guest OS with a keyboard, for example. To do this, we need to add USB support to the VM. The preferred way is to pass through an entire USB controller like we already did for the GPU.

lspci | grep USB
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)

Add the following line to QEMU, example for 00:14.0:

-device vfio-pci,host=00:14.0,bus=pcie.0

Before trying USB support inside the VM, you need to assign that USB controller to VFIO, but you will lose your keyboard and your mouse from the host in case they are connected to that controller.

./vfio-bind.sh 0000:00:14.0

In order to re-enable the USB support from the host, you will need to unbind the controller, and to bind it to xhci_hcd.

echo 0000:00:14.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo 0000:00:14.0 > /sys/bus/pci/drivers/xhci_hcd/bind

If you get an error with USB support, you might simply try a different controller, or try to assign USB devices by ID.

Step 8: Install guest OS

Now, it's time to install the guest OS. I installed Fedora 21 because it's just not possible to run Arch Linux inside QEMU due to a bug in syslinux… Whatever, install your favorite Linux OS and go ahead. I would also recommend installing envytools (a collection of tools developed by the members of the Nouveau community) in order to easily test the tracing support.

You can use the script below to launch a VM with VGA and USB passthrough, and all the stuff we need.


#!/bin/bash

# Bind one PCI device (full address, e.g. 0000:05:00.0) to vfio-pci.
vfio_bind()
{
        dev="$1"
        vendor=$(cat /sys/bus/pci/devices/$dev/vendor)
        device=$(cat /sys/bus/pci/devices/$dev/device)
        if [ -e /sys/bus/pci/devices/$dev/driver ]; then
                echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
        fi
        echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
}

# Bind devices.
modprobe vfio-pci
vfio_bind 0000:05:00.0  # GPU (NVIDIA G96)
vfio_bind 0000:00:14.0  # USB controller

qemu-system-x86_64 \
    -enable-kvm \
    -M q35 \
    -m 2G \
    -hda fedora.img \
    -boot d \
    -cpu host,kvm=off \
    -vga none \
    -device vfio-pci,host=05:00.0,multifunction=on,x-vga=on \
    -device vfio-pci,host=00:14.0,bus=pcie.0

# Restore USB controller
echo 0000:00:14.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo 0000:00:14.0 > /sys/bus/pci/drivers/xhci_hcd/bind

Step 9: Enable VFIO tracing support for QEMU

1) Configure QEMU to enable tracing

Enable the stderr trace backend. Please refer to docs/tracing.txt if you want to change the backend.

./configure --python=/usr/bin/python2 --enable-trace-backends=stderr
2) Disable MMAP support

Disabling MMAP support uses the slower read/write accesses to MMIO space that will get traced. To do this, open the file include/hw/vfio/vfio-common.h, and change #define VFIO_ALLOW_MMAP from 1 to 0.

 /* Extra debugging, trap acceleration paths for more logging */
-#define VFIO_ALLOW_MMAP 1
+#define VFIO_ALLOW_MMAP 0

Re-build QEMU.

3) Add the trace points you want to observe

Create an events.txt file and add the vfio_region_write trace point, which dumps MMIO write accesses of the GPU.

echo "vfio_region_write" > events.txt

VFIO tracing support is now enabled and configured, really easy, huh?

Thanks to Alex Williamson for these hints.

Step 10: Trace MMIO write accesses

Let's now test VFIO tracing support. Enable event tracing by adding the following line to the script which launches the VM.

-trace events=events.txt

Launch the VM. You should see a lot of traces on the standard error output, which is good news.

Open a terminal in the VM, go to the directory where envytools has been built, and run (as root) the following command.

./nvahammer 0xa404 0xdeadbeef

This command writes a 32-bit value (0xdeadbeef) to the MMIO register at 0xa404 and repeats the write in an infinite loop. It needs to be manually aborted.

Go back to the host, and you should see the following traces if it works fine.

12347@1424299207.289770:vfio_region_write  (0000:05:00.0:region0+0xa404, 0xdeadbeef, 4)
12347@1424299207.289774:vfio_region_write  (0000:05:00.0:region0+0xa404, 0xdeadbeef, 4)
12347@1424299207.289778:vfio_region_write  (0000:05:00.0:region0+0xa404, 0xdeadbeef, 4)

In this example, we have only traced MMIO write accesses, but of course, if you want to trace read accesses, you just have to change vfio_region_write to vfio_region_read.
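Since the stderr backend mixes the accesses of every passed-through device, a little post-processing helps. The Python sketch below parses vfio_region_write lines in exactly the form shown above and keeps only the writes to one device and address range. The names are invented for illustration, the meaning of the leading number (a process or thread id) is an assumption on my part, and read traces are not handled because their line format may differ.

```python
import re
from collections import namedtuple

RegionWrite = namedtuple(
    "RegionWrite", "ident timestamp device region offset value size")

# Matches e.g.:
# 12347@1424299207.289770:vfio_region_write  (0000:05:00.0:region0+0xa404, 0xdeadbeef, 4)
WRITE_RE = re.compile(
    r"^(\d+)@(\d+\.\d+):vfio_region_write\s+"
    r"\((\S+):region(\d+)\+(0x[0-9a-fA-F]+), (0x[0-9a-fA-F]+), (\d+)\)")


def parse_write(line):
    """Return a RegionWrite for a vfio_region_write trace line, else None."""
    match = WRITE_RE.match(line.strip())
    if match is None:
        return None
    ident, stamp, device, region, offset, value, size = match.groups()
    # The timestamp is kept as a string so that no precision is lost.
    return RegionWrite(int(ident), stamp, device, int(region),
                       int(offset, 16), int(value, 16), int(size))


def filter_writes(lines, device, low, high):
    """Yield writes to `device` whose offset lies in [low, high)."""
    for line in lines:
        access = parse_write(line)
        if access and access.device == device and low <= access.offset < high:
            yield access
```

For example, filter_writes(sys.stdin, "0000:05:00.0", 0xa000, 0xb000) would drop the USB controller traffic and keep only writes in that register range of the GPU.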


In this article I showed you how to trace MMIO accesses using a PCI passthrough with QEMU, Intel VT-d and VFIO. However, all PCI accesses are currently traced, including those of the USB controller, and this is not ideal, unlike mmiotrace which only dumps accesses for one peripheral. It would also be a good idea to have the same format as mmiotrace in order to use the decoding tools we already have for it in envytools.

Future work

- do not trace all PCI accesses (device and subrange address filtering)

- convert VFIO traces to the mmiotrace format

- compare performance when tracing support is enabled or not

Related resources

KVM VGA-Passthrough on ArchLinux

VGA-Passthrough on Debian

VFIO documentation

QEMU VFIO tracing documentation

21 Feb 2015 11:13am GMT

18 Feb 2015


Luc Verhaegen: FOSDEM and early Tamil driver work.


It was another great FOSDEM this year. Even with its 5,000 to 10,000 attendees, the formula of being absolutely free, with limited sponsorship, and making only small changes each year is an absolute winner. There is just no conference which comes even close to FOSDEM.

For those on ICE14 on Friday, the highspeed train from Frankfurt to Brussels south at 14:00, who were so nice to participate in my ad-hoc visitor count: 66. I counted 66 people, but i might have skipped a few as people were sometimes too engrossed in their laptops to hear me over their headphones. On a ~400 seat train, that's a pretty high number, and i never see the same level of geekiness on the Frankfurt to Brussels trains as on the Friday before FOSDEM. If it didn't sound like an organizational nightmare, it might have been a good idea to talk to DB and get a whole carriage reserved especially for FOSDEM goers.

With the Graphics DevRoom we returned to the K building this year, and i absolutely love the cozy 80-person classrooms we have there. With good airflow, freely movable tables, and an easy way to put in my powersockets, i have an easy time as a devroom organizer. While some speakers probably prefer bigger rooms to have a higher number of attendees, there is nothing like the more direct interaction of the rooms in the K building. With us sitting on the top floor, we also only had people who were very actively interested in the topics of the graphics devroom.

Despite the fact that FOSDEM has no equal, and anyone who does anything with open source software in the European Union should attend, i always have a very difficult time recruiting a full set of speakers for the graphics DevRoom. Perhaps the biggest reason for this is the fact that it is a free conference, and it lacks the elitist status of a paid-for conference. Everyone can attend your talk, even people who do not work on the kernel or on graphics drivers, and the potential speaker might feel as if he needs to waste his time on people who are below his own perceived station. Another reason may be that it is harder to convince the beancounters to sponsor a visit to a free conference. In that case, if you live in the European Union and you are paid to do open source software, you should be able to afford going to the must-visit FOSDEM by yourself.

As for next year, i am not sure whether there will be a graphics devroom again. Speaker count really was too low and perhaps it is time for another hiatus. Perhaps it is once again time to show people that talking in a devroom at FOSDEM truly is a privilege, and not to be taken for granted.

Tamil "Driver" talk.

My talk this year, or rather, my incoherent mumble finished off with a demo, was about showing my work on the ARM Mali Midgard GPUs. For those who had to endure it, my apologies for my ill-preparedness; i poured all my efforts into the demo (which was finished on Wednesday), and spent too much time doing devroom stuff (which ate Thursday) and of course in drinking up, ahem, the event that is FOSDEM (which ate the rest of the weekend). I will try to make up for it now in this blog post.

Current Tamil Status.

As some might remember, in September and October 2013, i did some preliminary work on the Mali T-series. I spent about 3 to 3.5 weeks building the infrastructure to capture the command stream and replay it. At the same time I also dug deep into the Mali binary driver to expose the binary shader compiler. These two feats gave me all the prerequisites for doing the job of reverse engineering the command stream of the Mali Midgard.

During the Christmas holidays of 2014 and in January 2015, i spent my time building a command stream parser. This is a huge improvement over my work on lima, where i only ended up doing so later on in the process (while bringing up Q3A). As i built up the capabilities of this parser, i threw ever more complex GLES2 programs at it. A week before FOSDEM, my parser was correctly handling multiple draws and frames, uniforms, attributes, varyings and textures. Instead of having raw blobs of memory, i had C structs and tables, allowing me to easily see the differences between streams.

I then took the parsed result of my most complex test and slowly turned that into actual C code, using the shader binary produced by the binary compiler, and adding a trivial sequential memory allocator. I then added rotation into the mix, and this is the demo as seen on FOSDEM (and now uploaded to youtube).
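For readers unfamiliar with the term, a "trivial sequential memory allocator" hands out consecutive chunks of one big buffer and never frees anything. The Python sketch below only illustrates the idea. The real demo code is C, and the class name, the buffer size and the 64-byte alignment here are all assumptions made for the example.

```python
class SequentialAllocator(object):
    """Hands out aligned offsets into a fixed-size buffer, one after another."""

    def __init__(self, size, alignment=64):
        self.size = size
        self.alignment = alignment   # assumed value, not taken from the driver
        self.offset = 0              # first byte that is still unused

    def alloc(self, length):
        # Round the current position up to the next aligned offset.
        start = -(-self.offset // self.alignment) * self.alignment
        if start + length > self.size:
            raise MemoryError("buffer exhausted")
        self.offset = start + length
        return start
```

Nothing is ever released, which is fine for a demo that sets up all of its buffers once and then only replays draws.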

All the big things are known.

As for textures, I only have simple texture support at this time: no mipmapping or cubemapping yet, and only RGB565 and RGBA32 are supported. I also still have not figured out how to disable swizzling; instead i re-use the texture swizzling code from lima, the only place where I was able to re-use code in tamil. This missing knowledge is just some busywork, and a bit of coding away.

As for programs, while both the Mali Utgard (M-series) and Midgard (T-series) binary compilers output in a format called MBS (Mali Binary Shader), the contents of each file are significantly different. I had no option but to rewrite the MBS parser for tamil.

Instead of rewriting the vertex shader binaries like ARM's binary driver does, i reorder the entries in the attribute descriptor table to match the order as described by the shader compiler output. This avoids adding a whole lot of logic to handle this correctly, even though MBS now describes which bits to alter in the binary. I still lay uniforms, attributes and varyings out by hand though, and i similarly have only limited knowledge of typing at this point. This mostly is a bit of busywork of writing up the actual logic, and trying out a few different things.

I know only very few things about most of the GL state. Again, mostly busywork with a bit of testing and coding up the results. And while many values left and right are still magic, nothing big is hiding any more.

Unlike lima, i am refraining from building up more infrastructure (than necessary to show the demo) outside of Mesa. The next step really is writing up a mesa driver. Since my lima driver for mesa was already pretty advanced, i should be able to re-use a lot of the knowledge gained there, and perhaps some code.

The demo

The demo was shown on a Samsung ARM Chromebook, with a kernel and linux installation from september 2013 (when i brought up cs capture and exposed the shader compiler). The exynos KMS driver on this 3.4.0 kernel is terrible. It only accepts a few fixed resolutions (as if I never existed and modesetting wasn't a thing), and then panics when you even look at it. Try to leave X while using HDMI: panic. Try to use a KMS plane to display the resulting render: panic.

In the end, i used fbdev and memcpy the rendered frame over to the console. On this kernel, i cannot even disable the console, so some of the visible flashing is the console either being overwritten by or overwriting the copied render.

The youtube video shows a capture of the Chromebook's built-in LCD, at 1280x720 on a 1366x768 display. At FOSDEM, i was lucky that the projector accepted the 1280x720 mode the exynos hdmi driver produced. My dumb HDMI->VGA converter (which makes the image darker) was willing to pass this through directly. I have a more intelligent HDMI->VGA adapter which also does scaling and which keeps colours nice, but that one just refused the output of the exynos driver. The video that was captured in our devroom probably does not show the demo correctly, as that should've been at 1024x768.

The demo shows 3 cubes rotating in front of the milky way. It is 4 different draws, using 3 different textures, and 3 different programs. These cubes currently rotate at 47fps, with the memcpy. During the talk, the chromebook slowed down progressively down to 26fps and even 21fps at one point, but i have not seen that behaviour before or since. I do know of an issue that makes the demo fail at frame 79530, which is 100% reproducible. I still need to track this down, it probably is an issue with my job handling code.


With Lima and Tamil i am in a very unique position. Unlike on adreno, tegra or Videocore, i have to deal with many different SoCs. Apart from the difference in kernel GPU drivers, i also have to deal with differences in display drivers and run into a whole new world of hurts every time i move over to a new target device. The information for doing a proper linux installation on an android or chrome device is usually dispersed, not up to date, and not too good, and i get to do a lot of the legwork for myself every time, knowing full well that a lot of others have done so already but couldn't be bothered to document things (hence my role in the linux-sunxi community).

The ARM chromebook and its broken kms driver is much of the same. Last FOSDEM i complained how badly supported and documented the Samsung ARM chromebook is, despite its popularity, and appealed for more linux-sunxi style, SoC specific communities, especially since I, as a graphics driver developer, cannot go and spend as much time in each and every of the SoC projects as i have done with sunxi.

During the questions round of this talk, one guy in the audience asked what needed to be done to fix the SoC pain. At first i completely missed the question, upon which he went and rephrased his question. My answer was: provide the infrastructure, make some honest noise and people will come. Usually, when someone asks such a question, nothing ever comes from it. But Merlijn "Wizzup" Wajer and his buddy S.J.R. "Swabbles" van Schaik really came through.

Today there is the linux-exynos.org wiki, the linux-exynos mailinglist, and the #linux-exynos irc channel. While the community is not as large as linux-sunxi, it is steadily growing. So if you own exynos based hardware, or if your company is basing a product on the exynos chipset, head to linux-exynos.org and help these guys out. Linux-exynos deserves your support.

18 Feb 2015 9:05am GMT

16 Feb 2015


Julien Danjou: Hacking Python AST: checking methods declaration

A few months ago, I wrote the definitive guide about Python method declaration, which had quite a good success. I still fight every day in OpenStack to have the developers declare their methods correctly in the patches they submit.

Automation plan

The thing is, I really dislike doing the same things over and over again. Furthermore, I'm not perfect either, and I miss a lot of this kind of problem in the reviews I do. So I decided to replace myself with a program - a more scalable and less error-prone version of my brain.

In OpenStack, we rely on flake8 to do static analysis of our Python code in order to spot common programming mistakes.

But we are really pedantic, so we wrote some extra hacking rules that we enforce on our code. To that end, we wrote a flake8 extension called hacking. I really like these rules, and I even recommend applying them in your own project. Though I might be biased or a victim of Stockholm syndrome. Your call.

Anyway, it's pretty clear that I need to add a check for method declaration in hacking. Let's write a flake8 extension!

Typical error

The typical error I spot is the following:

class Foo(object):
    # self is not used, the method does not need
    # to be bound, it should be declared static
    def bar(self, a, b, c):
        return a + b - c

That would be the correct version:

class Foo(object):
    @staticmethod
    def bar(a, b, c):
        return a + b - c

This kind of mistake is not a show-stopper. It's just not optimized. Why you have to manually declare static or class methods might be a language issue, but I don't want to debate about Python misfeatures or design flaws.


We could probably use some big magical regular expression to catch this problem. flake8 is based on the pep8 tool, which can do a line by line analysis of the code. But this method would make it very hard and error prone to detect this pattern.

Though it's also possible to do an AST based analysis on a per-file basis with pep8. So that's the method I pick as it's the most solid.

AST analysis

I won't dive deeply into Python AST and how it works. You can find plenty of sources on the Internet, and I even talk about it a bit in my book The Hacker's Guide to Python.

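To check correctly if all the methods in a Python file are correctly declared, we need to do the following:

  1. Iterate over all the statement nodes of the AST.
  2. Check whether the statement is a class definition (ast.ClassDef).
  3. Iterate over all the function definitions (ast.FunctionDef) of that class to check whether each one is already declared with @staticmethod.
  4. If the method is not declared static, check whether its first argument (self) is used somewhere in the method.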

Flake8 plugin

In order to register a new plugin in flake8 via hacking, we just need to add an entry in setup.cfg:

[entry_points]
flake8.extension =
    H904 = hacking.checks.other:StaticmethodChecker
    H905 = hacking.checks.other:StaticmethodChecker

We register 2 hacking codes here. As you will notice later, we are actually going to add an extra check in our code for the same price. Stay tuned.

The next step is to write the actual plugin. Since we are using an AST based check, the plugin needs to be a class following a certain signature:

class StaticmethodChecker(object):
    def __init__(self, tree, filename):
        self.tree = tree

    def run(self):
        pass

So far, so good and pretty easy. We store the tree locally, then we just need to use it in run() and yield the problems we discover following the signature pep8 expects, which is a tuple of (lineno, col_offset, error_string, code).

This AST is made for walking ♪ ♬ ♩

The ast module provides the walk function, which makes it easy to iterate over a tree. We'll use that to run through the AST. First, let's write a loop that ignores the statements that are not class definitions.

import ast


class StaticmethodChecker(object):
    def __init__(self, tree, filename):
        self.tree = tree

    def run(self):
        for stmt in ast.walk(self.tree):
            # Ignore non-class
            if not isinstance(stmt, ast.ClassDef):
                continue

We still don't check for anything, but we know how to ignore statements that are not class definitions. The next step is to ignore whatever is not a function definition. We just iterate over the body of the class definition.

for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it's a class, iterate over its body member to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue

We're all set for checking the method, which is body_item. First, we need to check if it's already declared as static. If so, we don't have to do any further check and we can bail out.

for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it's a class, iterate over its body member to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
                    and decorator.id == 'staticmethod'):
                # It's a static function, it's OK
                break
        else:
            # Function is not static, we do nothing for now
            pass

Note that we use the special for/else form of Python, where the else is evaluated unless we used break to exit the for loop.
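For readers who have not met for/else before, here is a minimal sketch of the construct on its own, outside of the checker. The function name has_staticmethod and the sample lists are made up for this illustration.

```python
def has_staticmethod(decorator_names):
    """Return True if 'staticmethod' appears in the given list of names."""
    for name in decorator_names:
        if name == 'staticmethod':
            # Leaving the loop with break skips the else clause
            found = True
            break
    else:
        # Runs only when the loop finished without hitting break,
        # which includes the case of an empty list
        found = False
    return found


print(has_staticmethod(['property', 'staticmethod']))
print(has_staticmethod(['property']))
print(has_staticmethod([]))
```

The else branch is the natural place for the "nothing matched" case, which is exactly how the checker uses it: the method is only inspected further when no staticmethod decorator was found.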

for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it's a class, iterate over its body member to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
                    and decorator.id == 'staticmethod'):
                # It's a static function, it's OK
                break
        else:
            try:
                first_arg = body_item.args.args[0]
            except IndexError:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    "H905: method misses first argument",
                    "H905",
                )
                # Check next method
                continue

We finally added some check! We grab the first argument from the method signature. Unless it fails, and in that case, we know there's a problem: you can't have a bound method without the self argument, therefore we raise the H905 code to signal a method that misses its first argument.

Now you know why we registered this second pep8 code along with H904 in setup.cfg. We have here a good opportunity to kill two birds with one stone.

The next step is to check if that first argument is used in the code of the method.

# Needs "import ast" and "import six" at module level
for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it's a class, iterate over its body member to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
                    and decorator.id == 'staticmethod'):
                # It's a static function, it's OK
                break
        else:
            try:
                first_arg = body_item.args.args[0]
            except IndexError:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    "H905: method misses first argument",
                    "H905",
                )
                # Check next method
                continue
            for func_stmt in ast.walk(body_item):
                if six.PY3:
                    if (isinstance(func_stmt, ast.Name)
                            and first_arg.arg == func_stmt.id):
                        # The first argument is used, it's OK
                        break
                else:
                    if (func_stmt != first_arg
                            and isinstance(func_stmt, ast.Name)
                            and func_stmt.id == first_arg.id):
                        # The first argument is used, it's OK
                        break
            else:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    "H904: method should be declared static",
                    "H904",
                )

To that end, we iterate using ast.walk again and we look for the use of the same variable name (usually self, but it could be anything, like cls for @classmethod) in the body of the function. If not found, we finally yield the H904 error code. Otherwise, we're good.
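To try the whole thing outside of flake8, here is a self-contained sketch that runs the two checks on a small source string. It is a simplification and not the patch submitted to hacking: it assumes Python 3 only (so the six branch is dropped), and the SAMPLE source and its method names are invented for the demonstration.

```python
import ast


class StaticmethodChecker(object):
    """Python 3 only sketch of the H904/H905 checks described above."""

    def __init__(self, tree, filename=None):
        self.tree = tree

    def run(self):
        for stmt in ast.walk(self.tree):
            # Ignore non-class
            if not isinstance(stmt, ast.ClassDef):
                continue
            for body_item in stmt.body:
                # Not a method, skip
                if not isinstance(body_item, ast.FunctionDef):
                    continue
                for decorator in body_item.decorator_list:
                    if (isinstance(decorator, ast.Name)
                            and decorator.id == 'staticmethod'):
                        # Already static, nothing to check
                        break
                else:
                    try:
                        first_arg = body_item.args.args[0]
                    except IndexError:
                        yield (body_item.lineno, body_item.col_offset,
                               "H905: method misses first argument", "H905")
                        continue
                    for func_stmt in ast.walk(body_item):
                        if (isinstance(func_stmt, ast.Name)
                                and func_stmt.id == first_arg.arg):
                            # The first argument is used, it's OK
                            break
                    else:
                        yield (body_item.lineno, body_item.col_offset,
                               "H904: method should be declared static",
                               "H904")


# Invented sample input: one method of each interesting kind
SAMPLE = '''
class Foo(object):
    def unused_self(self, a, b):
        return a + b

    def uses_self(self, a):
        return self.x + a

    @staticmethod
    def already_static(a):
        return a

    def no_args():
        return 1
'''

errors = list(StaticmethodChecker(ast.parse(SAMPLE)).run())
for lineno, col_offset, message, code in errors:
    print(lineno, col_offset, message)
```

With this sample, unused_self is reported as H904 and no_args as H905, while uses_self and already_static pass.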


I've submitted this patch to hacking, and, fingers crossed, it might be merged one day. If it's not, I'll create a new Python package with that check for flake8. The actual submitted code is a bit more complex, to take into account the use of the abc module and to include some tests.

As you may have noticed, the code walks over the module AST definition several times. There might be a couple of optimizations to browse the AST in only one pass, but I'm not sure it's worth it considering the actual usage of the tool. I'll leave that as an exercise for the reader interested in contributing to OpenStack. 😉

Happy hacking!

The Hacker's Guide to Python

A book I wrote talking about designing Python applications, state of the art, advice to apply when building your application, various Python tips, etc. Interested? Check it out.

16 Feb 2015 11:39am GMT

14 Feb 2015


Corbin Simpson: Monte: Typhon and SmallCaps

Two years of having a real job have made me bitter and grumpy, so I'm gonna stop with the cutesy blog post titles.

Today, I'm gonna talk a bit about Monte. Specifically, I'm going to talk about Typhon's current virtual machine. Typhon used to use an abstract syntax tree interpreter ("AST interpreter") as its virtual machine. It now uses a variant of the SmallCaps bytecode machine. I'll explain how and why.

When I started designing Typhon, I opted for an AST VM because it seemed like it matched Monte's semantics well. Monte is an expression language, which means that every syntactic node is an expression which can be evaluated to a single value. Monte is side-effecting, so evaluation must happen in an environment which can record the side effects. An AST interpreter could have an object representing each syntactic node, and each node could be evaluated within an environment to produce a value.

class Node(object):

    def evaluate(self, environment):
        pass

Subclasses of Node can override evaluate() to have behavior specific to that node. The Assign class can assign values to names in the environment, the Object class can create new objects, and the Call class can pass messages to objects. This technique is amenable to RPython, since Node.evaluate() has the same signature for all nodes.
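As a rough illustration of this design, here is a toy AST interpreter in plain Python. It is a sketch and not Typhon's code: the Literal, Noun and Sequence classes, the dictionary environment, and the use of Python's getattr to stand in for message passing are all assumptions made for the example.

```python
class Node(object):
    def evaluate(self, environment):
        raise NotImplementedError


class Literal(Node):
    """A constant value."""
    def __init__(self, value):
        self.value = value

    def evaluate(self, environment):
        return self.value


class Noun(Node):
    """A name lookup in the environment."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, environment):
        return environment[self.name]


class Assign(Node):
    """Evaluate the right-hand side and record it in the environment."""
    def __init__(self, name, rvalue):
        self.name = name
        self.rvalue = rvalue

    def evaluate(self, environment):
        value = self.rvalue.evaluate(environment)
        environment[self.name] = value
        return value


class Call(Node):
    """Pass a message (here: call a Python method) to a receiver."""
    def __init__(self, target, verb, args):
        self.target = target
        self.verb = verb
        self.args = args

    def evaluate(self, environment):
        receiver = self.target.evaluate(environment)
        args = [arg.evaluate(environment) for arg in self.args]
        return getattr(receiver, self.verb)(*args)


class Sequence(Node):
    """Evaluate every node and return the value of the last one."""
    def __init__(self, nodes):
        self.nodes = nodes

    def evaluate(self, environment):
        result = None
        for node in self.nodes:
            result = node.evaluate(environment)
        return result


program = Sequence([
    Assign("greeting", Literal("hello")),
    Call(Noun("greeting"), "upper", []),
])
env = {}
result = program.evaluate(env)
print(result)
```

Every node returns a single value and every side effect lands in the environment, which is the property that made this design look like a good match for an expression language.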

How well does this work with the RPython just-in-time ("JIT") compiler? It does alright. The JIT has little problem seeing that the nodes are compile-time constant, and promotion applied strategically allows the JIT to see into things like Call nodes, which cause methods to be inlined relatively well. However, there are some problems.

The first problem is that Monte has no syntactic loops. Since RPython's JIT is designed to compile loops, this means that some extra footwork is needed to set up the JIT. I took Monte's looping object, __loop, and extended it to detect when its argument is likely to yield a JIT trace. The detection is simple, examining whether the loop body is a user-defined object and only entering the JIT when that is the case. In theory, most bodies will be user-defined, since Monte turns this:

object SubList:
    to coerce(specimen, ej):
        def [] + l exit ej := specimen
        for element in l:
            subguard.coerce(element, ej)
        return specimen

into this:

object SubList:
    method coerce(specimen, ej):
        def via (__splitList.run([0])) [l] exit ej := specimen
        var validFlag__6 := true
            __loop.run([l, object _ {
                method run(_, value__8) {
                    subguard.coerce([value__8, ej])
            validFlag__6 := false

This example is from Typhon's prelude. The expansion of loops, and other syntax, is defined to turn the full Monte language into a smaller language, Kernel-Monte, which resembles Kernel-E. Now, in this example, the nonce loop object generated by the expander is very custom, and probably couldn't be simplified any further. However, it is theoretically possible that an optimizer could detect loops that call a single method repeatedly and simplify them more aggressively. In practice, that optimization doesn't exist, so Typhon thinks that all loops are user-defined and allows the JIT to trace all of them.

The next hurdle has to do with names. Monte's namespaces are static, which means that it's possible to always know in advance which names are in a scope. This is a powerful property for compilers, since it opens up many kinds of compilation and lowering which aren't available to languages with dynamic namespaces like Python. However, Typhon doesn't have access to the scoping information from the expander, only the AST. This means that Typhon has to redo most of the name and scope analysis in order to figure out things like how big namespaces should be and where to store everything. I initially did all of this at runtime, in the JIT, but it is very slow. This is because the JIT has problems seeing into dictionaries, and it cannot trust that a dictionary is actually constant-size or constant-keyed.

RPython does have an answer to this, called virtual and virtualizable objects. A virtual object is one that is never constructed within a JIT trace, because the JIT knows that the object won't leave the trace and can be totally elided. (The literature talks at length of "escape analysis", the formal term for this.) Typhon's AST nodes occasionally generated virtual objects, but only rarely, because most objects are assigned to names in the environment, and the JIT refuses to ignore modifications to the environment.

Virtualizable objects solve this problem neatly. A virtualizable object has one or more attributes, called virtualizable attributes, which can be "out-of-sync" during a JIT trace. A virtualizable can delay being updated, as long as it's updated at some point before the end of the JIT trace. RPython allows fields of integers and floating-point values to be virtualizable, as well as constant-size lists. However, Typhon uses mappings of names for its environments, backed by dictionaries, and dictionaries can't be virtualizable.

The traditional solution to this problem involves assigning an index to every name within a scope, and using a constant-size list with the indices. I did this, but it was arduous and didn't have the big payoff that I had wanted. Why not? Well, every AST node introduces a new scope! This caused scope creation to be expensive. I realized that I needed to compute closures as well.
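The index-per-name technique can be sketched in a few lines of plain Python. This is only an illustration under my own naming: the Scope and Frame classes are hypothetical and are not taken from Typhon.

```python
class Scope(object):
    """Compile-time view: gives each name in a scope a fixed index."""

    def __init__(self):
        self.indices = {}

    def add(self, name):
        # A name seen again keeps the index it was given the first time
        if name not in self.indices:
            self.indices[name] = len(self.indices)
        return self.indices[name]

    def size(self):
        return len(self.indices)


class Frame(object):
    """Run-time view: a list whose length is fixed at creation."""

    def __init__(self, size):
        self.slots = [None] * size

    def get(self, index):
        return self.slots[index]

    def set(self, index, value):
        self.slots[index] = value


scope = Scope()
x = scope.add("x")
y = scope.add("y")
again = scope.add("x")

# No dictionary is involved once the frame exists, only integer indices
frame = Frame(scope.size())
frame.set(x, 42)
frame.set(y, 5)
print(frame.get(x) + frame.get(y))
```

The dictionary only exists while names are being resolved. At run time there is just a constant-size list, which is the shape a virtualizable needs.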

Around this time, a few months ago, I began to despair because debugging the AST VM is hard. RPython's JIT logging and tooling is all based around the assumption that there is a low-level virtual machine of some sort which has instructions and encapsulation, and the AST was just too hard to manage in this way. I had had to invent my own tracebacks, my own logging, and my own log viewer. This wasn't going well. I wanted to join the stack-based VM crew and not have piles of ASTs to slog through whenever something wasn't going right. So, I decided to try to implement SmallCaps, from E. E, of course, is the inspiration for Monte, and shares many features with Monte. SmallCaps was based on old Smalltalk systems, but was designed to work with unique E features like ejectors.

So, enough talk, time for some code. First, let's lay down some ground rules. These are the guiding semantics of SmallCaps in Typhon. Keep in mind that we are describing a single-stack automaton with a side stack for exception handling and an environment with frames for local values and closed-over "global" values.

With these rules, the compiler's methods become very obvious.

class Str(Node):
    """
    A literal string.
    """

    def compile(self, compiler):
        index = compiler.literal(StrObject(self._s))
        compiler.addInstruction("LITERAL", index)

Strings are compiled into a single LITERAL instruction that places a string on the stack. Simple enough.

class Sequence(Node):
    """
    A sequence of nodes.
    """

    def compile(self, compiler):
        for node in self._l[:-1]:
            node.compile(compiler)
            compiler.addInstruction("POP", 0)
        self._l[-1].compile(compiler)

Here we compile sequences of nodes by compiling each node in the sequence, and then using POP to remove each intermediate node's result, since they aren't used. This nicely mirrors the semantics of sequences, which are to evaluate every node in the sequence and then return the value of the ultimate node's evaluation.

This also shows off a variant on the Visitor pattern which Allen, Mike, and I are calling the "Tourist pattern", where an accumulator is passed from node to node in the structure and recursion is directed by each node. This makes managing the Expression Problem much easier, since nodes completely contain all of the logic for each accumulation, and makes certain transformations much easier. More on that in a future post.
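To make the accumulator idea concrete, here is a minimal sketch in plain Python (not RPython, and not Typhon's actual code). The Compiler class, its literal pool and the run() loop are simplified stand-ins invented for this illustration, and only the LITERAL and POP instructions described above are modelled.

```python
class Compiler(object):
    """The accumulator that is handed from node to node."""

    def __init__(self):
        self.instructions = []
        self.literals = []

    def literal(self, value):
        # Store the value in the literal pool and return its index
        self.literals.append(value)
        return len(self.literals) - 1

    def addInstruction(self, name, index):
        self.instructions.append((name, index))


class Str(object):
    def __init__(self, s):
        self._s = s

    def compile(self, compiler):
        index = compiler.literal(self._s)
        compiler.addInstruction("LITERAL", index)


class Sequence(object):
    def __init__(self, l):
        self._l = l

    def compile(self, compiler):
        # Each node directs its own recursion; intermediate results
        # are popped because nothing uses them
        for node in self._l[:-1]:
            node.compile(compiler)
            compiler.addInstruction("POP", 0)
        self._l[-1].compile(compiler)


def run(compiler):
    """A toy single-stack machine for the two instructions above."""
    stack = []
    for name, index in compiler.instructions:
        if name == "LITERAL":
            stack.append(compiler.literals[index])
        elif name == "POP":
            stack.pop()
        else:
            raise ValueError("unknown instruction: %s" % name)
    return stack


compiler = Compiler()
Sequence([Str(u"first"), Str(u"second"), Str(u"third")]).compile(compiler)
print(compiler.instructions)
print(run(compiler))
```

After running the compiled sequence, only the value of the last node is left on the stack, which matches the semantics of sequences described above.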

class FinalPattern(Pattern):

    def compile(self, compiler):
        # [specimen ej]
        if self._g is None:
            compiler.addInstruction("POP", 0)
            # [specimen]
        else:
            self._g.compile(compiler)
            # [specimen ej guard]
            compiler.addInstruction("ROT", 0)
            compiler.addInstruction("ROT", 0)
            # [guard specimen ej]
            compiler.call(u"coerce", 2)
            # [specimen]
        index = compiler.addFrame(u"_makeFinalSlot")
        compiler.addInstruction("NOUN_FRAME", index)
        compiler.addInstruction("SWAP", 0)
        # [_makeFinalSlot specimen]
        compiler.call(u"run", 1)
        index = compiler.addLocal(self._n)
        compiler.addInstruction("BINDSLOT", index)
        # []

This pattern is compiled to insert a specimen into the environment, compiling the optional guard along the way and ensuring order of operations. The interspersed comments represent the top of stack in-between operations, because it helps me keep track of how things are compiled.

With this representation, the Compiler is able to see the names and indices of every binding introduced during compilation, which means that creating index-based frames as constant-size lists is easy. (I was going to say "trivial," but it was not trivial!)

I was asked on IRC about why I chose to adapt SmallCaps instead of other possible VMs. The answer is mostly that SmallCaps was designed and implemented by people that were much more experienced than me, and that I trust their judgement. I tried several years ago to design a much purer concatenative semantics for Kernel-E, and failed. SmallCaps works, even if it's not the simplest thing to implement. I did briefly consider even smaller semantics, like those of the Self language, but I couldn't find anything expressive enough to capture all of Kernel-E's systems. Ejectors are tricky.

That's all for now. Peace.

14 Feb 2015 1:51pm GMT

Alberto Ruiz: Despicable SPI

So apparently, this happened, and then this. Long story short, the elementary OS guys had been offered the use of SPI as the legal entity to represent the project, something they didn't need at all, and since they didn't, Joshua Drake, apparently a director at SPI, decided to threaten them with bad press all over if they didn't agree to join SPI. Which he then did: he started several threads on reddit and wrote a blog post trying to undermine the project. The post is now deleted, and there is this aberration of an apology (which is total BS and shows how much of an ass he is).

I seriously don't get why this guy has not been fired from the SPI organization immediately, this sort of bullying behaviour should not be allowed and, at least in my book, an apology means nothing. Someone like that does not belong to an organization that is supposed to help free software thrive and protect its communities.

I don't get how SPI expects the community to trust them at all after this.

I am really angry at this and I would like to express all my support to the elementary OS guys.

14 Feb 2015 11:58am GMT

11 Feb 2015


Daniel Vetter: Neat drm/i915 Stuff for 3.20

Linux 3.19 was just released and my usual overview of what the next merge window will bring is more than overdue. The big thing overall is certainly all the work around atomic display updates, but read on for what else all has been done.

Let's first start with all the driver internal rework to support atomic. The big thing with atomic is that it requires a clean split between code that checks display updates and the code that commits a new display state to the hardware. The corollary from that is that any derived state that's computed in the validation code and needed in the commit code must be stored somewhere in the state object. Gustavo Padovan and Matt Roper have done all that work to support atomic plane updates. This is the code that's now in 3.20 as a tech preview. The big things missing for proper atomic plane updates are async commit support (which has already landed for 3.21) and support to adjust watermark settings on the fly. Patches for that from Ville have been around for a long time, but need to be rebased, reviewed and extended for recently added platforms.

On the modeset side Ander Conselvan de Oliveira has done a lot of the necessary work already. Unfortunately converting the modeset code is much more involved for mainly two reasons: First there's a lot more derived state that needs to be handled, and the existing code already has structures and code for this. Conceptually the code has been prepared for an atomic world since the big display rewrite and the introduction of CRTC configuration structures. But converting the i915 modeset code to the new DRM atomic structures and interface is still a lot of work. Most of these refactorings have landed in 3.20. The other hold-up is shared resources and the software state to handle that. This is mostly for handling display PLLs, but also other shared resources like the display FIFO buffers. Patches to handle this are still in-flight.

Continuing with modeset work Jani Nikula has reworked the DSI support to use the shared DSI helpers from the DRM core. Jani also reworked the DSI code in preparation for dual-link DSI support, which Gaurav Singh implemented. Rodrigo Vivi and others provided a lot of patches to improve PSR support and enable it for Baytrail/Braswell. Unfortunately there are still issues with the automated testcase and so PSR stays disabled by default for now. Rodrigo also wrote some nice DocBook documentation for fbc, another step towards fully documenting i915 driver internals.

Moving on to platform enabling there has been a lot of work from Ville on Cherryview: RPS/gpu turbo and pipe CRC support (used for automated display testing) are both improved. On Skylake almost all the basic enabling is merged now: PM patches, enabling mmio pageflips and fastboot support from Damien have landed. Tvrtko Ursulin also created the infrastructure for global GTT views. This will be used for some upcoming features on Skylake. And to simplify enabling future platforms Chris Wilson and Mika Kuoppala have completely rewritten the forcewake domains handling code.

Also really important for Skylake is that the execlist support for gen8+ command submission is finally in good enough shape to be used by default - on Skylake that's the only supported path, legacy ring submission has been deprecated. Around that feature and a few other closely related ones a lot of code was refactored: John Harrison implemented the conversion from raw sequence numbers to request objects for gpu progress tracking. This is also prep work for landing a gpu scheduler. Nick Hoath removed the special execlist request tracking structures, simplifying the code. The engine initialization code was also refactored for a cleaner split between software and hardware initialization, leading to more robust code for driver load and system resume. Dave Gordon has also reworked the code tracking and computing the ring free space. On top of that we've also enabled full ppgtt again, but for now only where execlists are available since there are still issues with the legacy ring-based pagetable loading.

For generic GEM work there's the really nice support for write-combine cpu memory mappings from Akash Goel and Chris Wilson. On Atom SoC platforms, which lack the giant LLC that bigger chips have, this gives a really fast way to upload textures. And even with LLC it's useful for uploading to scanout targets since those are always uncached. But like the special-purpose upload paths for LLC platforms the cpu mmap views do not detile textures, hence special-purpose fastpaths need to be written in mesa and other userspace to fully exploit this. In other GEM features the shadow batch copying code for the command parser has now also landed.

Finally there's the redesign from Imre Deak to use the component framework for the snd-hda/i915 interactions. Modern platforms need a lot of coordination between the graphics and sound driver side for audio over hdmi, and thus far this was done with some ad-hoc dynamic symbol lookups. That results in a lot of headaches to get the ordering correct for driver load or system suspend and resume. With the component framework this dependency is now explicit, which means we will be able to get rid of a few hacks. It's also much easier to extend for the future - new platforms tend to integrate different components even more.

11 Feb 2015 11:20pm GMT

Pekka Paalanen: Weston repaint scheduling

Now that Presentation feedback has finally landed in Weston (feedback, flags), people are starting to pay attention to the output timings as now you can better measure them. I have seen a couple of complaints already that Weston has an extra frame of latency, and this is true. I also have a patch series to fix it that I am going to propose.

To explain how the patch series affects Weston's repaint loop, I made some JSON-timeline recordings before and after, and produced some graphs with Wesgr. Here I will explain how the repaint loop works timing-wise.

The old algorithm

The old repaint scheduling algorithm in Weston repaints immediately on receiving the pageflip completion event. This maximizes the time available for the compositor itself to repaint, but it also means that clients can never hit the very next vblank / pageflip.

Figure 1. The old algorithm, the client paints as response to frame callbacks.

Frame callback events are sent at the "post repaint" step. This gives clients almost a full frame's time to draw and send their content before the compositor goes to "begin repaint" again. In Figure 1 you see that if a client paints extremely fast, the latency to screen is almost two frame periods. The frame latency can never be less than one frame period, because the compositor samples the surface contents (the "repaint flush" point) immediately after the previous vblank.

Figure 2. The old algorithm, the client paints as response to Presentation feedback events.

While frame callback driven clients still get to the full frame rate, the situation is worse if the client painting is driven by presentation_feedback.presented events. The intent is to draw and show a new frame as soon as the old frame was shown. Because Weston starts repaint immediately on the pageflip completion, which is essentially the same time when Presentation feedback is sent, the client cannot hit the repaint of this frame and gets postponed to the next. This is the same two frame latency as with frame callbacks, but here the framerate is halved because the client waits for the frame to be actually shown before continuing, as is evident in Figure 2.

Figure 3. The old algorithm, client posts a frame while the compositor is idle.

Figure 3. shows a less relevant case, where the compositor is idle while a client posts a new frame ("damage commit"). When the compositor is idle graphics-wise (the gray background in the figure), it is not repainting continuously according to the output scanout cycle. To start painting again, Weston waits for an extra vblank first, then repaints, and then the new frame is shown on the next vblank. This is also a 1-2 frame period latency, but it is unrelated to the other two cases, and is not changed by the patches.

The modification to the algorithm

The modification is simple, yet perhaps counter-intuitive at first. We reduce the latency by adding a delay. The "delay before repaint" is in all the figures, and the old algorithm is essentially using a zero delay. The compositor's repaint is delayed so that clients have a chance to post a new frame before the compositor samples the surface contents.

Picking a good amount of delay is a hard question. With too small a delay, clients do not have time to act. With too long a delay, the compositor itself will be in danger of missing the vblank deadline. I do not know what a good amount is or how to derive it, so I just made it configurable. You can set the repaint window length in milliseconds in weston.ini. The repaint window is the time from starting repaint to the deadline, so the delay is the frame period minus the repaint window. If the repaint window is too long for a frame period, the algorithm will reduce to the old behaviour.
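The arithmetic is small enough to write down. The following is only a sketch of the relationship described above, not Weston's code; the function name repaint_delay_ms is made up, and clamping to zero is my reading of "reduce to the old behaviour", since the old algorithm is essentially a zero delay.

```python
def repaint_delay_ms(refresh_hz, repaint_window_ms):
    """Delay between the pageflip completion and the start of repaint."""
    frame_period_ms = 1000.0 / refresh_hz
    delay = frame_period_ms - repaint_window_ms
    # A repaint window longer than the frame period leaves no room for
    # a delay, which is the old repaint-immediately behaviour
    return max(delay, 0.0)


# 60 Hz output with a 7 millisecond repaint window
print(round(repaint_delay_ms(60, 7), 2))
# A window longer than the frame period falls back to zero delay
print(repaint_delay_ms(60, 20))
```

At 60 Hz the frame period is about 16.67 ms, so a 7 ms repaint window leaves clients roughly 9.67 ms after the pageflip completion to post a frame for the very next vblank.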

The new algorithm

The following figures are made with a 60 Hz refresh and a 7 millisecond repaint window.

Figure 4. The new algorithm, the client paints as response to frame callback.

When a client paints as response to the frame callback (Figure 4), it still has a whole frame period of time to paint and post the frame. The total latency to screen is a little shorter now, by the length of the delay before compositor's repaint. It is a slight improvement.

Figure 5. The new algorithm, the client paints as response to Presentation feedback.

A significant improvement can be seen in Figure 5. A client that uses the Presentation extension to wait for a frame to be actually shown before painting again is now able to reach the full output frame rate. It just needs to paint and post a new frame during the delay before compositor's repaint. This mode of operation provides the shortest possible latency to screen as the client is able to target the very next vblank. The latency is below one frame period if the deadlines are met.


This is a relatively simple change that should reduce display latency, but analyzing how exactly it affects things is not trivial. That is why Wesgr was born.

This change does not really allow clients to wait some additional time before painting to reduce the latency even more, because nothing tells clients when the compositor will repaint exactly. The risk of missing an unknown deadline grows the later a client paints. Would knowing the deadline have practical applications? I'm not sure.

These figures also show the difference between the frame callback and Presentation feedback. When a client's repaint loop is driven by frame callbacks, it maximizes the time available for repainting, which reduces the possibility to miss the deadline. If a client drives its repaint loop by Presentation feedback events, it minimizes the display latency at the cost of increased risk of missing the deadline.

All the above ignores a few things. First, we assume that the time of display is the point of vblank which starts to scan out the new frame. Scanning out a frame actually takes most of the frame period, it's not instantaneous. Going deeper, updating the framebuffer during scanout period instead of vblank could allow reducing latency even more, but the matter becomes complicated and even somewhat subjective. I hear some people prefer tearing to reduce the latency further. Second, we also ignore any fencing issues that might come up in practise. If a client submits a GPU job that takes a long while, there is a good chance it will cause everything to miss a deadline or more.

As usual, this work and most of the development of JSON-timeline and Wesgr were sponsored by Collabora.

PS. Latency and timing issues are nothing new. Owen Taylor has several excellent posts on related subjects in his blog.

11 Feb 2015 3:03pm GMT

09 Feb 2015


Aleksander Morgado: Dell-branded Sierra Wireless 3G/4G modem not online?


Your Dell modem not getting online?

It's not uncommon to find weird mobile broadband modems that for one reason or another don't end up working as expected with NetworkManager/ModemManager; but the new 3G/4G modems in Dell laptops are at a totally different level. These Dell-branded devices are really Sierra Wireless powered modems, e.g. the Dell 5808 is a Sierra Wireless MC7355, or the Dell DW5570 is a Sierra Wireless MC8805.

Late last year we started to receive several bugreports in the ModemManager and libqmi mailing lists for this kind of device. Basically, the modem would never get to a proper online mode with the RF transceivers powered and therefore would never even get registered in the mobile network. This was happening to both QMI and MBIM based configurations, and the direct error message reported by libqmi when trying to get into online mode was just… not very helpful.

$ sudo qmicli -d /dev/cdc-wdm1 --dms-get-operating-mode
[/dev/cdc-wdm1] Operating mode retrieved:
Mode: 'low-power'
HW restricted: 'no'

$ sudo qmicli -d /dev/cdc-wdm1 --dms-set-operating-mode=online
error: couldn't set operating mode: QMI protocol error (3): 'Internal'

The issue was reported to the kernel, assuming that this would likely be a new missing rfkill related setup in newer Dell laptops. One of the users reported in that same bugreport that actually using Sierra's GobiNet driver instead of qmi_wwan would end up putting the modem in online mode, so just switching drivers during boot would make it work. WTF?

Digging in Sierra's GobiNet QMI driver

Well, without much hope of finding anything, and given that I had just bought such a Dell modem myself for testing a new "Dell" plugin, I decided to dig into Sierra's kernel driver sources. Apart from some already known things (e.g. they use the WDA service to set the net data format in new modems instead of the old CTL service), these lines popped:

if (is9x15)
{
    // Set FCC Authentication
    result = QMIDMSSWISetFCCAuth( pDev );
    if (result != 0)
    {
        return result;
    }
}

The Sierra GobiNet driver is sending some magic "FCC auth" command during boot to the modem; which according to the driver sources maps to command 0x555F in the DMS service. Hey I should try that!

Adding the new command support in libqmi wasn't difficult, so within a few minutes I was ready to test it… and it worked.

$ sudo qmicli -d /dev/cdc-wdm1 --dms-get-operating-mode
[/dev/cdc-wdm1] Operating mode retrieved:
Mode: 'low-power'
HW restricted: 'no'

$ sudo qmicli -d /dev/cdc-wdm1 --dms-set-fcc-authentication
[/dev/cdc-wdm1] Successfully set FCC authentication

$ sudo qmicli -d /dev/cdc-wdm1 --dms-get-operating-mode
[/dev/cdc-wdm1] Operating mode retrieved:
Mode: 'online'
HW restricted: 'no'

Support for this is already available automatically when using libqmi and ModemManager git master. It will likely hit the next stable releases as well.
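For anyone stuck on an older release, the manual sequence above could be scripted roughly as follows. This is only a sketch: it uses the three qmicli options shown in this post, while the helper names (parse_operating_mode, plan_commands, bring_online) are made up for illustration. Issuing an explicit --dms-set-operating-mode=online after the FCC authentication is an assumption; in the transcript above the modem already reported 'online' right after the authentication step.

```python
import re
import subprocess

# Matches the "Mode: 'low-power'" line printed by --dms-get-operating-mode.
MODE_RE = re.compile(r"^\s*Mode:\s*'([^']+)'", re.MULTILINE)


def parse_operating_mode(output):
    """Return the mode string from qmicli output, or None if absent."""
    match = MODE_RE.search(output)
    return match.group(1) if match else None


def plan_commands(device, mode):
    """List the qmicli invocations to run when the modem is not online."""
    if mode == "online":
        return []
    base = ["qmicli", "-d", device]
    return [
        # FCC auth first: without it, setting online fails with 'Internal'.
        base + ["--dms-set-fcc-authentication"],
        # Assumption: explicitly request online mode afterwards.
        base + ["--dms-set-operating-mode=online"],
    ]


def bring_online(device):
    """Hypothetical helper: query the modem, then run the planned commands."""
    query = ["qmicli", "-d", device, "--dms-get-operating-mode"]
    result = subprocess.run(query, capture_output=True, text=True, check=True)
    for command in plan_commands(device, parse_operating_mode(result.stdout)):
        subprocess.run(command, check=True)
```

Calling bring_online("/dev/cdc-wdm1") as root would mirror the session shown above; the parsing and planning functions can be exercised without a modem attached.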


Well, I don't know if there is any command in MBIM to do the same operation (likely there is in a Sierra-specific service), but one thing we could try anyway is to use "QMI embedded in MBIM", which Bjørn has already tested a few times. I'll try to test that some day, but I'll need to get another modem as my DW5570 only comes up with a QMI configuration. For now, if you're stuck with this problem using MBIM, you can likely just select USB configuration #1 using usb_modeswitch and get the modem switched to QMI mode.


Dell-branded Sierra Wireless modems need the "FCC Auth" command (QMI DMS service, 0x555F) before they can be brought online; supported in libqmi and ModemManager already.

These fixes have already been released in ModemManager 1.4.4 and libqmi 1.12.4.


09 Feb 2015 1:06pm GMT

06 Feb 2015


Peter Hutterer: libinput device groups

I just pushed a patchset into libinput to introduce the concept of device groups. This post will explain what they are in this context and why they are needed.

libinput exposes kernel devices as an opaque struct libinput_device. It only recognises evdev devices at this point, though this may change in the future if we see a need for it. libinput also exposes a few bits of information about the device such as the name, PID/VID and a handle to the struct udev_device that matches this device. The latter enables callers to get more information from the device. libinput also provides a bunch of configuration settings for each device. Pointer devices get acceleration settings, absolute devices have calibration, etc. For most devices this works just fine.

Some devices like Wacom tablets are represented as multiple event nodes. On a 3.19 kernel you'd get three event nodes for an Intuos 5 touch - the pad (i.e. the tablet itself), a touch node and one node for all the tools (stylus, eraser, etc. multiplexed). libinput exposes each of these nodes as a separate device, but that is problematic when applying certain configuration settings. For example, applying a left-handed configuration to the tablet means it's rotated by 180 degrees so we need to rotate the coordinates accordingly. Of course, such a rotation would have to apply to both the touch and the stylus devices but now the caller is left with having to figure out which other devices to set.

The original idea was to present such devices as a single, merged struct libinput_device with multiple capabilities, i.e. a single physical device that can do touch, tablet and pad buttons. A configuration setting like left-handed-ness would then apply to all devices transparently. The API is clean, usage is simple, everybody is happy. Except when they aren't - this doesn't actually work particularly well. First, having such merged devices means we require devices to change at runtime, adding/removing capabilities on-the-fly which puts a burden on the callers to handle this correctly. Second, not all configuration options apply to all subdevices. If the Intuos is used as a touchpad you may want natural scrolling enabled on the touchpad but the wheel on the Wacom mouse should probably still work normally. Third, the subdevices may have different PID/VIDs and certainly have different udev devices. So now libinput needs a way to get to those. In short, a merged device looks nice in theory but the implementation of it would make the libinput API cumbersome to use for little benefit.

The solution to this is device groups: each device in libinput is now part of a struct libinput_device_group. This is just an opaque object that doesn't do anything but sit there but it's enough to identify how devices are grouped together. If two devices return the same device group, they logically belong together. The caller can then decide what to do with it, e.g. loop through all devices of a group to apply a certain configuration setting to all devices. The basic approach is thus:

new_device = libinput_event_get_device(event);
new_group = libinput_device_get_device_group(new_device);

/* pseudocode: walk the (device, group) pairs the caller stored earlier */
for each (device, group) in previously_stored_devices {
    if (group == new_group)
        printf("This device shares a group with %s", device);
}

The device group's lifetime is as you'd expect: it is created for the first device in the group and ceases to exist once the last device in the group is removed. It's not deleted until the last reference is deleted, but it won't get recycled. In other words, if you keep unplugging and re-plugging that Intuos tablet, the device group will be new after every plug.

Note that we're intentionally not providing ways to get the devices from a device group, or counting the devices within a group, etc. This avoids race conditions (the view libinput has of the devices isn't the same as the caller has while going through the event queue) but it also makes the API simpler. libinput's callers are mainly compositors which use toolkits with advanced datastructures (glib, Qt, etc.). Using a pointer as key into a hashmap is simpler and less buggy than using whatever hand-crafted hashmap/list implementation we can provide through the libinput API.
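To illustrate the "group handle as hashmap key" idea from the caller's side, here is a minimal sketch in Python. Nothing in it is libinput API: GroupRegistry and its methods are hypothetical, the devices are plain strings, and a bare object() stands in for the opaque group pointer a real caller would get from libinput_device_get_device_group().

```python
from collections import defaultdict


class GroupRegistry:
    """Tracks which devices share an opaque device-group handle."""

    def __init__(self):
        # group handle -> list of devices currently in that group
        self._members = defaultdict(list)

    def add(self, device, group):
        """Store a new device and return the devices already in its group."""
        peers = list(self._members[group])
        self._members[group].append(device)
        return peers

    def remove(self, device, group):
        """Forget a device; drop the group once its last device is gone."""
        members = self._members.get(group, [])
        if device in members:
            members.remove(device)
        if not members:
            self._members.pop(group, None)

    def apply_to_group(self, group, setting):
        """Call setting(device) for every device in the group."""
        for device in self._members.get(group, []):
            setting(device)
```

A compositor-like caller would call add() on every device-added event, remove() on device-removed, and apply_to_group() when, say, the user flips the tablet to left-handed. Because the registry is updated only while the caller walks its own event queue, it never disagrees with the caller's view of the devices.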

06 Feb 2015 1:43am GMT

29 Jan 2015


Peter Hutterer: Lenovos X1 Carbon 3rd touchpad woes

Lenovo released a new set of laptops for 2015 with a new (old) feature: the trackpoint device has the physical buttons back. Last year's experiment apparently didn't work out so well.

What do we do in Linux with the last generation's touchpads? The kernel marks them with INPUT_PROP_TOPBUTTONPAD based on the PNPID [1]. In the X.Org synaptics driver and libinput we take that property and emulate top software buttons based on it. That took us a while to get sorted and feed into the myriad of Linux distributions out there but at some point last year the delicate balance of nature was restored and the touchpad-related rage dropped to the usual background noise.

Slow-forward to 2015 and Lenovo introduces the new series. In the absence of unnecessary creativity they are called the X1 Carbon 3rd, T450, T550, X250, W550, L450, etc. Lenovo did away with the un(der)-appreciated top software buttons and re-introduced physical buttons for the trackpoint. Now, that's bound to make everyone happy again. However, as we learned from Agent Smith, happiness is not the default state of humans so Lenovo made sure the harvest is safe.

What we expected to happen was that the trackpoint device has BTN_LEFT, BTN_MIDDLE, BTN_RIGHT and the touchpad has BTN_LEFT and is marked with INPUT_PROP_BUTTONPAD (i.e. it is a Clickpad). That is the case on the x220 generation and the T440 generation, though the latter doesn't actually have trackpoint buttons and we emulated them in software.

On the X1 Carbon 3rd, the trackpoint has BTN_LEFT, BTN_MIDDLE, BTN_RIGHT but they never send events. The touchpad has BTN_LEFT and BTN_0, BTN_1 and BTN_2 [2]. Clicking the left button on the trackpoint generates BTN_0 on the touchpad device, clicking the right button generates BTN_1 on the touchpad device. So in short, Lenovo has decided to wire the newly re-introduced trackpoint buttons to the touchpad, not the trackpoint. [3] The middle button is currently dead, which is a kernel bug. Meanwhile we think of it as a security feature - never accidentally paste your password into your IRC session again!
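The wiring described above amounts to a small lookup table. The sketch below shows the kind of rerouting that has to happen somewhere in the stack; it is not the code from the actual patches. The event codes are the values from linux/input-event-codes.h, the device labels and the reroute() helper are hypothetical, and mapping BTN_2 to the middle button is an assumption, since the middle button currently sends nothing at all.

```python
# Event codes as defined in linux/input-event-codes.h.
BTN_0, BTN_1, BTN_2 = 0x100, 0x101, 0x102
BTN_LEFT, BTN_RIGHT, BTN_MIDDLE = 0x110, 0x111, 0x112

# Hypothetical labels for the two kernel devices.
TOUCHPAD, TRACKPOINT = "touchpad", "trackpoint"

# BTN_0 and BTN_1 are as observed on the X1 Carbon 3rd;
# BTN_2 -> BTN_MIDDLE is an assumption.
TRACKPOINT_BUTTONS = {
    BTN_0: BTN_LEFT,
    BTN_1: BTN_RIGHT,
    BTN_2: BTN_MIDDLE,
}


def reroute(device, code):
    """Return the (device, code) pair the rest of the stack should see."""
    if device == TOUCHPAD and code in TRACKPOINT_BUTTONS:
        return TRACKPOINT, TRACKPOINT_BUTTONS[code]
    # Everything else, including the clickpad's own BTN_LEFT, is untouched.
    return device, code
```

With this in place, a BTN_0 press arriving on the touchpad comes out as BTN_LEFT on the trackpoint, while a physical click on the pad itself still reports as the touchpad's BTN_LEFT.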

What does this mean for us? Neither synaptics nor evdev nor libinput currently support this so we've been busy writing patches like crazy. The patch goes into the kernel and udev.... The two patches needed go into the kernel and udev, and libinput. No, the three patches needed go into the kernel, udev and libinput, and synaptics. The four patches, no, wait. Amongst the projects needing patches are the kernel, udev, libinput and synaptics. I'll try again:

With those put together, things pretty much work as they're supposed to. libinput handles middle button scrolling as well this way but synaptics won't, much for the same reason it didn't work in the previous generation: synaptics can't talk to evdev and vice versa. And given that synaptics is on life support or in palliative care, depending on how you look at it, I recommend not holding your breath for a fix. Otherwise you may join it quickly.

Note that all the patches are fresh off the presses and there may be a few bits changing before they are done. If you absolutely can't live without the trackpoint's buttons you can work around it by disabling the synaptics kernel driver until the patches have trickled down to your distribution.

The tracking bug for all this is Bug 88609. Feel free to CC yourself on it. Bring popcorn.

Final note: I haven't seen logs from the T450, T550, ... devices yet, so this is so far only confirmed on the X1 Carbon. Given the hardware is essentially identical I expect it to be true for the rest of the series though.

[1] We also apply quirks for the 2013 generation because the firmware was buggy - a problem Synaptics Inc. has since fixed (but currently gives us slight headaches).
[2] It is also marked with INPUT_PROP_TOPBUTTONPAD which is a bug. It uses a new PNPID but one that was in the range we previously believed was for pads without trackpoint buttons. That's an easy thing to fix.
[3] The reason for that seems to be HW design: this way they can keep the same case/keyboard and just swap the touchpad bits.
[4] synaptics is old enough to support dedicated scroll buttons: buttons that used to send BTN_0 and BTN_1 and are thus interpreted as scroll up/down events.

29 Jan 2015 5:03am GMT

28 Jan 2015


Daniel Vetter: Update for Atomic Display Updates

Another kernel release is imminent and a lot of things happened since my last big blog post about atomic modeset. Read on for what new shiny things 3.20 will bring to this area.

Atomic IOCTL and Properties

The big thing for sure is that the actual atomic IOCTL from Ville has finally landed for 3.20. That together with all the work from Rob Clark to add all the new atomic properties to make it functional (there's no IOCTL structs for any standard values, everything is a property now) means userspace can finally start using atomic. Well, it's still hidden behind the experimental module option drm.atomic=1 but otherwise it's all there. There's a few big differences compared to earlier iterations:

Another missing piece that's now resolved is the DPMS support. On a high level this greatly reduces the complexity of the legacy DPMS settings into a simple per-CRTC boolean. Contemporary userspace really wants nothing more, and a simple on/off is the only thing that current hardware can support anyway. Furthermore all the bookkeeping is done in the helpers, which call down into drivers to enable or disable entire display pipelines as needed. Which means that drivers can rip out tons of code for the legacy DPMS support by just wiring up drm_atomic_helper_connector_dpms.
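The reduction from legacy DPMS levels to one boolean can be modelled in a few lines. This is a toy model in Python, not the helper's actual code: the four constants are the legacy DPMS levels from the DRM uapi headers, whereas clamp_dpms() and crtc_active() are hypothetical names, and the rule that a CRTC stays active as long as any of its connectors is on is an assumption about how the bookkeeping aggregates connectors.

```python
# Legacy DPMS levels as defined in the DRM uapi headers.
DRM_MODE_DPMS_ON = 0
DRM_MODE_DPMS_STANDBY = 1
DRM_MODE_DPMS_SUSPEND = 2
DRM_MODE_DPMS_OFF = 3


def clamp_dpms(mode):
    """Collapse the four legacy levels into plain on/off."""
    if mode == DRM_MODE_DPMS_ON:
        return DRM_MODE_DPMS_ON
    return DRM_MODE_DPMS_OFF


def crtc_active(connector_modes):
    """Assumed rule: the CRTC is active if any attached connector is on."""
    return any(clamp_dpms(mode) == DRM_MODE_DPMS_ON for mode in connector_modes)
```

Standby and suspend both end up as off, so a driver only ever has to answer one question per CRTC: is the pipeline running or not.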

New Driver Conversions and Backend Hooks

Another big thing for 3.20 is that driver support has moved forward greatly: Tegra now has most of the basic bits ready, MSM is already converted. Both still lack conversion and testing for DPMS since that landed very late though. There's a lot of prep work for exynos, but unfortunately nothing yet towards atomic support. And i915 is in the process of being converted to the new structures and will have very preliminary atomic plane updates support in 3.20, hidden behind a module option for now.

And all that work resulted in piles of little fixes. Like making sure that the legacy cursor IOCTLs don't suddenly stall a lot more, breaking old userspace. But we also added a bunch more hooks for driver backends to simplify driver code:

Finally driver conversions showed that vblank handling requirements imposed by the atomic helpers are a lot stricter than what userspace tends to cope with. i915 has recently reworked all its vblank handling and improved the core helpers with the addition of drm_crtc_vblank_off() and drm_crtc_vblank_on(). If these are used in the CRTC disable and enable hooks the vblank code will automatically reject vblank waits when the pipe is off. Which is the behaviour both the atomic helpers and the transitional helpers expect.

One complication is that on load the driver must ensure manually that the vblank state matches up with the hardware and atomic software state with a manual call to these functions. In the simple case where drivers reset everything to off (which is what the reset implementations provided by the atomic helpers presume) this just means calling drm_crtc_vblank_off() at some point in the driver load code. Drivers that read out the actual hardware state need to call either _off() or _on(), matching the actual display pipe status.
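As a rough mental model of that contract, consider the following Python toy. It is emphatically not kernel code: the Crtc class, wait_vblank() and sync_on_load() are hypothetical, and vblank_on()/vblank_off() merely stand in for drm_crtc_vblank_on() and drm_crtc_vblank_off().

```python
class Crtc:
    """Toy CRTC that only remembers whether vblank handling is enabled."""

    def __init__(self):
        self.vblank_enabled = False

    def vblank_on(self):
        # Stand-in for drm_crtc_vblank_on(), called from the enable hook.
        self.vblank_enabled = True

    def vblank_off(self):
        # Stand-in for drm_crtc_vblank_off(), called from the disable hook.
        self.vblank_enabled = False

    def wait_vblank(self):
        """Reject waits while the pipe is off instead of stalling."""
        if not self.vblank_enabled:
            raise RuntimeError("vblank wait rejected: pipe is off")
        return "vblank"


def sync_on_load(crtc, hw_pipe_active):
    """At driver load, make the vblank state match the hardware state."""
    if hw_pipe_active:
        crtc.vblank_on()
    else:
        crtc.vblank_off()
```

A driver that resets everything to off corresponds to always calling sync_on_load(crtc, False); a driver that reads out hardware state passes whatever it found for each pipe.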

Future Work

Of course there's still a few things left to do before atomic display updates can be rolled out to the masses. And a few things that would be rather nice to have, too:

It promises to stay interesting!

28 Jan 2015 5:18pm GMT

27 Jan 2015


Matthias Klumpp: AppStream 0.8 released!

Yesterday I released version 0.8 of AppStream, the cross-distribution standard for software metadata, which is currently used by GNOME-Software, Muon and Apper to display rich metadata about applications and other software components.

What's new?

The new release contains some tweaks to AppStream's documentation, and extends the specification with a few more tags and refinements. For example we now recommend sizes for screenshots. The recommended sizes are the ones GNOME-Software already uses today, and it is a good idea to ship those to make software-centers look great, as other SCs are planning to use them as well. Normal sizes as well as sizes for HiDPI displays are defined. This change affects only the distribution-generated data, the upstream metadata is unaffected by this (the distro-specific metadata generator will resize the screenshots anyway).

Another addition to the spec is the introduction of an optional <source_pkgname/> tag, which holds the source package name the packages defined in <pkgname/> tags are built from. This is mainly for internal use by the distributor, e.g. it can decide to use this information to link to internal resources (like bugtrackers, package-watch etc.). It may also be used by software-center applications as additional information to group software components.
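As a sketch of the grouping use case, the snippet below collects component IDs by their source package. The XML is a hand-written, hypothetical fragment of distro metadata (all package names are made up), and group_by_source() is an illustrative helper rather than part of any AppStream library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical distro metadata; only the tags relevant here are included.
SAMPLE = """
<components version="0.8">
  <component type="desktop">
    <id>foo-editor.desktop</id>
    <pkgname>foo-editor</pkgname>
    <source_pkgname>foo</source_pkgname>
  </component>
  <component type="desktop">
    <id>foo-viewer.desktop</id>
    <pkgname>foo-viewer</pkgname>
    <source_pkgname>foo</source_pkgname>
  </component>
  <component type="desktop">
    <id>bar.desktop</id>
    <pkgname>bar</pkgname>
  </component>
</components>
"""


def group_by_source(xml_text):
    """Map each source package name to the component IDs built from it."""
    groups = defaultdict(list)
    for component in ET.fromstring(xml_text).iter("component"):
        source = component.findtext("source_pkgname")
        if source:  # the tag is optional, so skip components without it
            groups[source].append(component.findtext("id"))
    return dict(groups)
```

Since the tag is optional, a consumer has to cope with components that lack it, which is why bar.desktop simply does not show up in any group here.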

Furthermore, we introduced a <bundle/> tag for future use with 3rd-party application installation solutions. The tag notifies a software-installer about the presence of a 3rd-party application bundle, and provides the necessary information on how to install it. In order to do that, the software-center needs to support the respective installation solution. Currently, the Limba project and Xdg-App bundles are supported. For software managers, it is a good idea to implement support for 3rd-party app installers, as soon as the solutions are ready. Currently, the projects are worked on heavily. The new tag is currently already used by Limba, which is the reason why it depends on the latest AppStream release.

How do I get it?

All AppStream libraries, libappstream, libappstream-qt and libappstream-glib, support the 0.8 specification in their latest versions - so in case you are using one of these, you don't need to do anything. For Debian, the DEP-11 spec is being updated at this time, and the changes will land in the DEP-11 tools soon.

Improve your metadata!

This call goes especially to many KDE projects! Getting good data is partly a task for the distributor, since packaging issues can result in incorrect or broken data, screenshots need to be properly resized etc. However, flawed upstream data can also prevent software from being shown, since software with broken data or missing data will not be incorporated in the distro XML AppStream data file.

Richard Hughes of Fedora has created a nice overview of software failing to be included. You can see the failed-list here - the data can be filtered by desktop environment etc. For KDE projects, a Comment= field is often missing in their .desktop files (or a <summary/> tag needs to be added to their AppStream upstream XML file). Keep in mind that you are not only helping Fedora by fixing these issues, but also all other distributions consuming the metadata you ship upstream.
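If you want to check your own .desktop files before a distributor does it for you, something like the following may be enough. It is a minimal sketch using only Python's standard configparser; has_comment() is a hypothetical helper, and it deliberately looks only at the unlocalised Comment key, not at translated Comment[xx] variants.

```python
import configparser


def has_comment(desktop_text):
    """Return True if [Desktop Entry] has a non-empty Comment= field."""
    # Interpolation is disabled so that field codes like %U in Exec= are
    # treated as plain text; strict=False tolerates duplicate keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key case: "Comment", not "comment"
    parser.read_string(desktop_text)
    comment = parser.get("Desktop Entry", "Comment", fallback="")
    return bool(comment.strip())
```

Running this over the text of every .desktop file in a source tree gives a quick list of candidates that need either a Comment= line or a <summary/> tag in the upstream XML.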

For Debian, we will have a similar overview soon, since it is also a very helpful tool to find packaging issues.

If you want to get more information on how to improve your upstream metadata, and what new metadata should look like, take a look at the quickstart guide in the AppStream documentation.

27 Jan 2015 4:48pm GMT