02 Sep 2014
I've had a couple of questions about whether there's a way for others to contribute to the VC4 driver project. There is! I haven't posted about it before because things aren't as ready as I'd like for others to do development (it has a tendency to lock up, and the X implementation isn't really ready yet so you don't get to see your results), but that shouldn't actually stop anyone.
To get your environment set up, build the kernel (https://github.com/anholt/linux.git vc4 branch), Mesa (git://anongit.freedesktop.org/mesa/mesa) with --with-gallium-drivers=vc4, and piglit (git://anongit.freedesktop.org/git/piglit). For working on the Pi, I highly recommend having a serial cable and doing NFS root so that you don't have to write things to slow, unreliable SD cards.
You can run an existing piglit test that should work, to check your environment: env PIGLIT_PLATFORM=gbm VC4_DEBUG=qir ./bin/shader_runner tests/shaders/glsl-algebraic-add-add-1.shader_test -auto -fbo -- you should see a dump of the IR for this shader, and a pass report. The kernel will make some noise about how it's rendered a frame.
Now the actual work: I've left some of the TGSI opcodes unfinished (SCS, DST, DPH, and XPD, for example), so the driver just aborts when a shader tries to use them. How they work is described in src/gallium/docs/source/tgsi.rst. The TGSI-to_QIR code is in vc4_program.c (where you'll find all the opcodes that are implemented currently), and vc4_qir.h has all the opcodes that are available to you and helpers for generating them. Once it's in QIR (which I think should have all the opcodes you need for this work), vc4_qpu_emit.c will turn the QIR into actual QPU code like you find described in the chip specs.
You can dump the shaders being generated by the driver using VC4_DEBUG=tgsi,qir,qpu in the environment (that gets you 3/4 stages of code dumped -- at times you might want some subset of that just to quiet things down).
Since we've still got a lot of GPU hangs, and I don't have reset wokring, you can't even complete a piglit run to find all the problems or to test your changes to see if your changes are good. What I can offer currently is that you could run PIGLIT_PLATFORM=gbm VC4_DEBUG=norast ./piglit-run.py tests/quick.py results/vc4-norast; piglit-summary-html.py --overwrite summary/mysum results/vc4-norast will get you a list of all the tests (which mostly failed, since we didn't render anything), some of which will have assertion failed. Now that you have which tests were assertion failing from the opcode you worked on, you can run them manually, like PIGLIT_PLATFORM=gbm /home/anholt/src/piglit/bin/shader_runner /home/anholt/src/piglit/generated_tests/spec/glsl-1.10/execution/built-in-functions/vs-asin-vec4.shader_test -auto (copy-and-pasted from the results) or PIGLIT_PLATFORM=gbm PIGLIT_TEST="XPD test 2 (same src and dst arg)" ./bin/glean -o -v -v -v -t +vertProg1 --quick (also copy and pasted from the results, but note that you need the other env var for glean to pick out the subtest to run).
Other things you might want eventually: I do my development using cross-builds instead of on the Pi, install to a prefix in my homedir, then rsync that into my NFS root and use LD_LIBRARY_PATH/LIBGL_DRIVERS_PATH on the Pi to point my tests at the driver in the homedir prefix. Cross-builds were a *huge* pain to set up (debian's multiarch doesn't ship the .so symlink with the libary, and the -dev packages that do install them don't install simultaneously for multiple arches), but it's worth it in the end. If you look into cross-build, what I'm using is rpi-tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64/bin/arm-linux-gnueabihf-gcc and you'll want --enable-malloc0returnsnull if you cross-build a bunch of X-related packages.
02 Sep 2014 12:19am GMT
01 Sep 2014
So two years ago my family and I moved to Brno in the Czech Republic due to me starting a new job at Red Hat. It has been two roller coaster years with a lot of changes happening both inside Red Hat and with the world that the Linux desktop operates in. During those years my wife and I have gotten to love Brno, which both of us find a bit surprising as we where both quite skeptical to the city in the outset.
I think having grown up in west europe during the cold war I had some preconceptions about what life was like in the former east europe and Brno specifically is struggling a bit with being the second city in Czech after Prague, due to Prague so often being hailed internationally as a beautiful and exciting city.
But I think during these two years Brno has proven itself to us as a place that is great to live, especially if you have a little child. Brno has a lot of beautiful outdoors areas which are great for hiking or relaxing, it is packed full of these childrens cafes where you can take your kid to play while you sit down and have a coffee or a tea, a vibrant expat community, affordable housing, a good range of restaurants, short distance to major cities like Vienna, Prague and Budapest. And lot of old castles and towns around to explore in the vicinity. I think Telc has to be one of our topmost favorites in that regard. And it has very little crime, my wife has been telling her friends how Brno is the first city she has ever lived in where she feels that as a woman she can walk along through the city in the evening or at night and feel safe.
But that said the time has come for us to move on. Due to one of these changes inside Red Hat I mentioned I am getting moved to our US Engineering office in Westford, Massachusetts. For those not familiar with Westford it is close to a city you probably do know, Boston.
So tomorrow the moving company will arrive at our flat here in Brno and pack up everything for the transport to the US. The furniture will take some time to arrive there, so while our stuff is sailing across the ocean we will live with my family in Norway, while I take advantage of the Red Hat office in downtown Oslo. So by mid-October I expect us to be fully set up in the Boston area, although we are heading over there next week for a final house hunting trip so that the furniture has a place to arrive to
So goodbye to Brno for now, and looking forward to seeing new and old friends in Boston!
01 Sep 2014 12:53pm GMT
31 Aug 2014
In a previous blog story I discussed Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems, I now want to take the opportunity to explain a bit where we want to take this with systemd in the longer run, and what we want to build out of it. This is going to be a longer story, so better grab a cold bottle of Club Mate before you start reading.
Traditional Linux distributions are built around packaging systems like RPM or dpkg, and an organization model where upstream developers and downstream packagers are relatively clearly separated: an upstream developer writes code, and puts it somewhere online, in a tarball. A packager than grabs it and turns it into RPMs/DEBs. The user then grabs these RPMs/DEBs and installs them locally on the system. For a variety of uses this is a fantastic scheme: users have a large selection of readily packaged software available, in mostly uniform packaging, from a single source they can trust. In this scheme the distribution vets all software it packages, and as long as the user trusts the distribution all should be good. The distribution takes the responsibility of ensuring the software is not malicious, of timely fixing security problems and helping the user if something is wrong.
However, this scheme also has a number of problems, and doesn't fit many use-cases of our software particularly well. Let's have a look at the problems of this scheme for many upstreams:
Upstream software vendors are fully dependent on downstream distributions to package their stuff. It's the downstream distribution that decides on schedules, packaging details, and how to handle support. Often upstream vendors want much faster release cycles then the downstream distributions follow.
Realistic testing is extremely unreliable and next to impossible. Since the end-user can run a variety of different package versions together, and expects the software he runs to just work on any combination, the test matrix explodes. If upstream tests its version on distribution X release Y, then there's no guarantee that that's the precise combination of packages that the end user will eventually run. In fact, it is very unlikely that the end user will, since most distributions probably updated a number of libraries the package relies on by the time the package ends up being made available to the user. The fact that each package can be individually updated by the user, and each user can combine library versions, plug-ins and executables relatively freely, results in a high risk of something going wrong.
Since there are so many different distributions in so many different versions around, if upstream tries to build and test software for them it needs to do so for a large number of distributions, which is a massive effort.
The distributions are actually quite different in many ways. In fact, they are different in a lot of the most basic functionality. For example, the path where to put x86-64 libraries is different on Fedora and Debian derived systems..
Developing software for a number of distributions and versions is hard: if you want to do it, you need to actually install them, each one of them, manually, and then build your software for each.
Since most downstream distributions have strict licensing and trademark requirements (and rightly so), any kind of closed source software (or otherwise non-free) does not fit into this scheme at all.
This all together makes it really hard for many upstreams to work nicely with the current way how Linux works. Often they try to improve the situation for them, for example by bundling libraries, to make their test and build matrices smaller.
The toolbox approach of classic Linux distributions is fantastic for people who want to put together their individual system, nicely adjusted to exactly what they need. However, this is not really how many of today's Linux systems are built, installed or updated. If you build any kind of embedded device, a server system, or even user systems, you frequently do your work based on complete system images, that are linearly versioned. You build these images somewhere, and then you replicate them atomically to a larger number of systems. On these systems, you don't install or remove packages, you get a defined set of files, and besides installing or updating the system there are no ways how to change the set of tools you get.
The current Linux distributions are not particularly good at providing for this major use-case of Linux. Their strict focus on individual packages as well as package managers as end-user install and update tool is incompatible with what many system vendors want.
The classic Linux distribution scheme is frequently not what end users want, either. Many users are used to app markets like Android, Windows or iOS/Mac have. Markets are a platform that doesn't package, build or maintain software like distributions do, but simply allows users to quickly find and download the software they need, with the app vendor responsible for keeping the app updated, secured, and all that on the vendor's release cycle. Users tend to be impatient. They want their software quickly, and the fine distinction between trusting a single distribution or a myriad of app developers individually is usually not important for them. The companies behind the marketplaces usually try to improve this trust problem by providing sand-boxing technologies: as a replacement for the distribution that audits, vets, builds and packages the software and thus allows users to trust it to a certain level, these vendors try to find technical solutions to ensure that the software they offer for download can't be malicious.
Existing Approaches To Fix These Problems
Now, all the issues pointed out above are not new, and there are sometimes quite successful attempts to do something about it. Ubuntu Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of this problem set, usually with a strict focus on one facet of Linux systems. For example, Ubuntu Apps focus strictly on end user (desktop) applications, and don't care about how we built/update/install the OS itself, or containers. Docker OTOH focuses on containers only, and doesn't care about end-user apps. Software Collections tries to focus on the development environments. ChromeOS focuses on the OS itself, but only for end-user devices. CoreOS also focuses on the OS, but only for server systems.
The approaches they find are usually good at specific things, and use a variety of different technologies, on different layers. However, none of these projects tried to fix this problems in a generic way, for all uses, right in the core components of the OS itself.
Linux has come to tremendous successes because its kernel is so generic: you can build supercomputers and tiny embedded devices out of it. It's time we come up with a basic, reusable scheme how to solve the problem set described above, that is equally generic.
What We Want
The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom Gundersen, David Herrmann, and yours truly) recently met in Berlin about all these things, and tried to come up with a scheme that is somewhat simple, but tries to solve the issues generically, for all use-cases, as part of the systemd project. All that in a way that is somewhat compatible with the current scheme of distributions, to allow a slow, gradual adoption. Also, and that's something one cannot stress enough: the toolbox scheme of classic Linux distributions is actually a good one, and for many cases the right one. However, we need to make sure we make distributions relevant again for all use-cases, not just those of highly individualized systems.
Anyway, so let's summarize what we are trying to do:
We want an efficient way that allows vendors to package their software (regardless if just an app, or the whole OS) directly for the end user, and know the precise combination of libraries and packages it will operate with.
We want to allow end users and administrators to install these packages on their systems, regardless which distribution they have installed on it.
We want a unified solution that ultimately can cover updates for full systems, OS containers, end user apps, programming ABIs, and more. These updates shall be double-buffered, (at least). This is an absolute necessity if we want to prepare the ground for operating systems that manage themselves, that can update safely without administrator involvement.
We want our images to be trustable (i.e. signed). In fact we want a fully trustable OS, with images that can be verified by a full trust chain from the firmware (EFI SecureBoot!), through the boot loader, through the kernel, and initrd. Cryptographically secure verification of the code we execute is relevant on the desktop (like ChromeOS does), but also for apps, for embedded devices and even on servers (in a post-Snowden world, in particular).
What We Propose
So much about the set of problems, and what we are trying to do. So, now, let's discuss the technical bits we came up with:
The scheme we propose is built around the variety of concepts of btrfs and Linux file system name-spacing. btrfs at this point already has a large number of features that fit neatly in our concept, and the maintainers are busy working on a couple of others we want to eventually make use of.
As first part of our proposal we make heavy use of btrfs sub-volumes and introduce a clear naming scheme for them. We name snapshots like this:
usr:<vendorid>:<architecture>:<version>-- This refers to a full vendor operating system tree. It's basically a /usr tree (and no other directories), in a specific version, with everything you need to boot it up inside it. The
<vendorid>field is replaced by some vendor identifier, maybe a scheme like
<architecture>field specifies a CPU architecture the OS is designed for, for example
<version>field specifies a specific OS version, for example
23.4. An example sub-volume name could hence look like this:
root:<name>:<vendorid>:<architecture>-- This refers to an instance of an operating system. Its basically a root directory, containing primarily /etc and /var (but possibly more). Sub-volumes of this type do not contain a populated /usr tree though. The
<name>field refers to some instance name (maybe the host name of the instance). The other fields are defined as above. An example sub-volume name is
runtime:<vendorid>:<architecture>:<version>-- This refers to a vendor runtime. A runtime here is supposed to be a set of libraries and other resources that are needed to run apps (for the concept of apps see below), all in a /usr tree. In this regard this is very similar to the
usrsub-volumes explained above, however, while a
usrsub-volume is a full OS and contains everything necessary to boot, a runtime is really only a set of libraries. You cannot boot it, but you can run apps with it. An example sub-volume name is:
framework:<vendorid>:<architecture>:<version>-- This is very similar to a vendor runtime, as described above, it contains just a /usr tree, but goes one step further: it additionally contains all development headers, compilers and build tools, that allow developing against a specific runtime. For each runtime there should be a framework. When you develop against a specific framework in a specific architecture, then the resulting app will be compatible with the runtime of the same vendor ID and architecture. Example:
app:<vendorid>:<runtime>:<architecture>:<version>-- This encapsulates an application bundle. It contains a tree that at runtime is mounted to
/opt/<vendorid>, and contains all the application's resources. The
<vendorid>could be a string like
<runtime>refers to one the vendor id of one specific runtime the application is built for, for example
<version>refer to the architecture the application is built for, and of course its version. Example:
home:<user>:<uid>:<gid>-- This sub-volume shall refer to the home directory of the specific user. The
<user>field contains the user name, the
<gid>fields the numeric Unix UIDs and GIDs of the user. The idea here is that in the long run the list of sub-volumes is sufficient as a user database (but see below). Example:
btrfs partitions that adhere to this naming scheme should be clearly identifiable. It is our intention to introduce a new GPT partition type ID for this.
How To Use It
After we introduced this naming scheme let's see what we can build of this:
When booting up a system we mount the root directory from one of the
rootsub-volumes, and then mount /usr from a matching
usrsub-volume. Matching here means it carries the same
<architecture>. Of course, by default we should pick the matching
usrsub-volume with the newest version by default.
When we boot up an OS container, we do exactly the same as the when we boot up a regular system: we simply combine a
usrsub-volume with a
When we enumerate the system's users we simply go through the list of
When a user authenticates and logs in we mount his home directory from his snapshot.
When an app is run, we set up a new file system name-space, mount the
/opt/<vendorid>/, and the appropriate
runtimesub-volume the app picked to
/usr, as well as the user's
/home/$USERto its place.
When a developer wants to develop against a specific runtime he installs the right framework, and then temporarily transitions into a name space where
/usris mounted from the framework sub-volume, and
/home/$USERfrom his own home directory. In this name space he then runs his build commands. He can build in multiple name spaces at the same time, if he intends to builds software for multiple runtimes or architectures at the same time.
Instantiating a new system or OS container (which is exactly the same in this scheme) just consists of creating a new appropriately named
root sub-volume. Completely naturally you can share one vendor OS copy in one specific version with a multitude of container instances.
Everything is double-buffered (or actually, n-fold-buffered), because
app sub-volumes can exist in multiple versions. Of course, by default the execution logic should always pick the newest release of each sub-volume, but it is up to the user keep multiple versions around, and possibly execute older versions, if he desires to do so. In fact, like on ChromeOS this could even be handled automatically: if a system fails to boot with a newer snapshot, the boot loader can automatically revert back to an older version of the OS.
Note that in result this allows installing not only multiple end-user applications into the same btrfs volume, but also multiple operating systems, multiple system instances, multiple runtimes, multiple frameworks. Or to spell this out in an example:
Let's say Fedora, Mageia and ArchLinux all implement this scheme, and provide ready-made end-user images. Also, the GNOME, KDE, SDL projects all define a runtime+framework to develop against. Finally, both LibreOffice and Firefox provide their stuff according to this scheme. You can now trivially install of these into the same btrfs volume:
In the example above, we have three vendor operating systems installed. All of them in three versions, and one even in a beta version. We have four system instances around. Two of them of Fedora, maybe one of them we usually boot from, the other we run for very specific purposes in an OS container. We also have the runtimes for two GNOME releases in multiple versions, plus one for KDE. Then, we have the development trees for one version of KDE and GNOME around, as well as two apps, that make use of two releases of the GNOME runtime. Finally, we have the home directories of two users.
Now, with the name-spacing concepts we introduced above, we can actually relatively freely mix and match apps and OSes, or develop against specific frameworks in specific versions on any operating system. It doesn't matter if you booted your ArchLinux instance, or your Fedora one, you can execute both LibreOffice and Firefox just fine, because at execution time they get matched up with the right runtime, and all of them are available from all the operating systems you installed. You get the precise runtime that the upstream vendor of Firefox/LibreOffice did their testing with. It doesn't matter anymore which distribution you run, and which distribution the vendor prefers.
Also, given that the user database is actually encoded in the sub-volume list, it doesn't matter which system you boot, the distribution should be able to find your local users automatically, without any configuration in /etc/passwd.
With this naming scheme plus the way how we can combine them on execution we already came quite far, but how do we actually get these sub-volumes onto the final machines, and how do we update them? Well, btrfs has a feature they call "send-and-receive". It basically allows you to "diff" two file system versions, and generate a binary delta. You can generate these deltas on a developer's machine and then push them into the user's system, and he'll get the exact same sub-volume too. This is how we envision installation and updating of operating systems, applications, runtimes, frameworks. At installation time, we simply deserialize an initial send-and-receive delta into our btrfs volume, and later, when a new version is released we just add in the few bits that are new, by dropping in another send-and-receive delta under a new sub-volume name. And we do it exactly the same for the OS itself, for a runtime, a framework or an app. There's no technical distinction anymore. The underlying operation for installing apps, runtime, frameworks, vendor OSes, as well as the operation for updating them is done the exact same way for all.
Of course, keeping multiple full /usr trees around sounds like an awful lot of waste, after all they will contain a lot of very similar data, since a lot of resources are shared between distributions, frameworks and runtimes. However, thankfully btrfs actually is able to de-duplicate this for us. If we add in a new app snapshot, this simply adds in the new files that changed. Moreover different runtimes and operating systems might actually end up sharing the same tree.
Even though the example above focuses primarily on the end-user, desktop side of things, the concept is also extremely powerful in server scenarios. For example, it is easy to build your own
usr trees and deliver them to your hosts using this scheme. The
usr sub-volumes are supposed to be something that administrators can put together. After deserializing them into a couple of hosts, you can trivially instantiate them as OS containers there, simply by adding a new
root sub-volume for each instance, referencing the
usr tree you just put together. Instantiating OS containers hence becomes as easy as creating a new btrfs sub-volume. And you can still update the images nicely, get fully double-buffered updates and everything.
And of course, this scheme also applies great to embedded use-cases. Regardless if you build a TV, an IVI system or a phone: you can put together you OS versions as
usr trees, and then use btrfs-send-and-receive facilities to deliver them to the systems, and update them there.
Many people when they hear the word "btrfs" instantly reply with "is it ready yet?". Thankfully, most of the functionality we really need here is strictly read-only. With the exception of the
home sub-volumes (see below) all snapshots are strictly read-only, and are delivered as immutable vendor trees onto the devices. They never are changed. Even if btrfs might still be immature, for this kind of read-only logic it should be more than good enough.
Note that this scheme also enables doing fat systems: for example, an installer image could include a Fedora version compiled for x86-64, one for i386, one for ARM, all in the same btrfs volume. Due to btrfs' de-duplication they will share as much as possible, and when the image is booted up the right sub-volume is automatically picked. Something similar of course applies to the apps too!
This also allows us to implement something that we like to call Operating-System-As-A-Virus. Installing a new system is little more than:
- Creating a new GPT partition table
- Adding an EFI System Partition (FAT) to it
- Adding a new btrfs volume to it
- Deserializing a single
usrsub-volume into the btrfs volume
- Installing a boot loader into the EFI System Partition
Now, since the only real vendor data you need is the
usr sub-volume, you can trivially duplicate this onto any block device you want. Let's say you are a happy Fedora user, and you want to provide a friend with his own installation of this awesome system, all on a USB stick. All you have to do for this is do the steps above, using your installed
usr tree as source to copy. And there you go! And you don't have to be afraid that any of your personal data is copied too, as the
usr sub-volume is the exact version your vendor provided you with. Or with other words: there's no distinction anymore between installer images and installed systems. It's all the same. Installation becomes replication, not more. Live-CDs and installed systems can be fully identical.
Note that in this design apps are actually developed against a single, very specific runtime, that contains all libraries it can link against (including a specific glibc version!). Any library that is not included in the runtime the developer picked must be included in the app itself. This is similar how apps on Android declare one very specific Android version they are developed against. This greatly simplifies application installation, as there's no dependency hell: each app pulls in one runtime, and the app is actually free to pick which one, as you can have multiple installed, though only one is used by each app.
Also note that operating systems built this way will never see "half-updated" systems, as it is common when a system is updated using RPM/dpkg. When updating the system the code will either run the old or the new version, but it will never see part of the old files and part of the new files. This is the same for apps, runtimes, and frameworks, too.
Where We Are Now
We are currently working on a lot of the groundwork necessary for this. This scheme relies on the ability to monopolize the vendor OS resources in /usr, which is the key of what I described in Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems a few weeks back. Then, of course, for the full desktop app concept we need a strong sandbox, that does more than just hiding files from the file system view. After all with an app concept like the above the primary interfacing between the executed desktop apps and the rest of the system is via IPC (which is why we work on kdbus and teach it all kinds of sand-boxing features), and the kernel itself. Harald Hoyer has started working on generating the btrfs send-and-receive images based on Fedora.
Getting to the full scheme will take a while. Currently we have many of the building blocks ready, but some major items are missing. For example, we push quite a few problems into btrfs, that other solutions try to solve in user space. One of them is actually signing/verification of images. The btrfs maintainers are working on adding this to the code base, but currently nothing exists. This functionality is essential though to come to a fully verified system where a trust chain exists all the way from the firmware to the apps. Also, to make the
home sub-volume scheme fully workable we actually need encrypted sub-volumes, so that the sub-volume's pass-phrase can be used for authenticating users in PAM. This doesn't exist either.
Working towards this scheme is a gradual process. Many of the steps we require for this are useful outside of the grand scheme though, which means we can slowly work towards the goal, and our users can already take benefit of what we are working on as we go.
Also, and most importantly, this is not really a departure from traditional operating systems:
Each app, each OS and each app sees a traditional Unix hierarchy with /usr, /home, /opt, /var, /etc. It executes in an environment that is pretty much identical to how it would be run on traditional systems.
There's no need to fully move to a system that uses only btrfs and follows strictly this sub-volume scheme. For example, we intend to provide implicit support for systems that are installed on ext4 or xfs, or that are put together with traditional packaging tools such as RPM or dpkg: if the the user tries to install a runtime/app/framework/os image on a system that doesn't use btrfs so far, it can just create a loop-back btrfs image in /var, and push the data into that. Even us developers will run our stuff like this for a while, after all this new scheme is not particularly useful for highly individualized systems, and we developers usually tend to run systems like that.
Also note that this in no way a departure from packaging systems like RPM or DEB. Even if the new scheme we propose is used for installing and updating a specific system, it is RPM/DEB that is used to put together the vendor OS tree initially. Hence, even in this scheme RPM/DEB are highly relevant, though not strictly as an end-user tool anymore, but as a build tool.
So Let's Summarize Again What We Propose
We want a unified scheme, how we can install and update OS images, user apps, runtimes and frameworks.
We want a unified scheme how you can relatively freely mix OS images, apps, runtimes and frameworks on the same system.
We want a fully trusted system, where cryptographic verification of all executed code can be done, all the way to the firmware, as standard feature of the system.
We want to allow app vendors to write their programs against very specific frameworks, under the knowledge that they will end up being executed with the exact same set of libraries chosen.
We want to allow parallel installation of multiple OSes and versions of them, multiple runtimes in multiple versions, as well as multiple frameworks in multiple versions. And of course, multiple apps in multiple versions.
We want everything double buffered (or actually n-fold buffered), to ensure we can reliably update/rollback versions, in particular to safely do automatic updates.
We want a system where updating a runtime, OS, framework, or OS container is as simple as adding in a new snapshot and restarting the runtime/OS/framework/OS container.
We want a system where we can easily instantiate a number of OS instances from a single vendor tree, with zero difference for doing this on order to be able to boot it on bare metal/VM or as a container.
We want to enable Linux to have an open scheme that people can use to build app markets and similar schemes, not restricted to a specific vendor.
I'll be talking about this at LinuxCon Europe in October. I originally intended to discuss this at the Linux Plumbers Conference (which I assumed was the right forum for this kind of major plumbing level improvement), and at linux.conf.au, but there was no interest in my session submissions there...
Of course this is all work in progress. These are our current ideas we are working towards. As we progress we will likely change a number of things. For example, the precise naming of the sub-volumes might look very different in the end.
Of course, we are developers of the systemd project. Implementing this scheme is not just a job for the systemd developers. This is a reinvention how distributions work, and hence needs great support from the distributions. We really hope we can trigger some interest by publishing this proposal now, to get the distributions on board. This after all is explicitly not supposed to be a solution for one specific project and one specific vendor product, we care about making this open, and solving it for the generic case, without cutting corners.
If you have any questions about this, you know how you can reach us (IRC, mail, G+, ...).
The future is going to be awesome!
31 Aug 2014 10:00pm GMT
29 Aug 2014
We currently have a large influx of new people contributing to i915 - for the curious just check the git logs. As part of ramping them up I've done a few trainings about upstream review, and a bunch of people I've talked with at KS in Chicago were interested in that, too. So I've cleaned up the slides a bit and dropped the very few references to Intel internal resources. No speaker notes or video recording, but I think this is useful all in itself. And of course if you have comments or see big gaps - feedback is very much welcome:
Upstream Review Training Slides
29 Aug 2014 4:14pm GMT
21 Aug 2014
Today I finally got X up on my vc4 driver using glamor. As you can see, there are a bunch of visual issues, and what you can't see is that after a few frames of those gears the hardware locked up and didn't come back. It's still major progress.
The code can be found in my vc4 branch of mesa and linux-2.6, and the glamor branch of my xf86-video-modesetting. I think the driver's at the point now that someone else could potentially participate. I've intentionally left a bunch of easy problems -- things like supporting the SCS, DST, DPH, and XPD opcodes, for which we have piglit tests (in glean) and are just a matter of translating the math from TGSI's vec4 instruction set (documented in tgsi.rst) to the scalar QIR opcodes.
21 Aug 2014 11:58pm GMT
19 Aug 2014
I've switched my Git repositories to GitHub recently, and started to watch my contributions statistics, which were very low considering I spend my days hacking on open source software, especially OpenStack.
OpenStack hosts its Git repositories on its own infrastructure at git.openstack.org, but also mirrors them on GitHub. Logically, I was expecting GitHub to track my commits there too, as I'm using the same email address everywhere.
It turns out that it was not the case, and the help page about that on GitHub describes the rule in place to compute statistics. Indeed, according to GitHub, I had no relations to the OpenStack repositories, as I never forked them nor opened a pull request on them (OpenStack uses Gerrit).
Starring a repository is enough to build a relationship between a user and a repository, so this is was the only thing needed to inform GitHub that I have contributed to those repositories. Considering OpenStack has hundreds of repositories, I decided to star them all by using a small Python script using pygithub.
And voilà, my statistics are now including all my contributions to OpenStack!
19 Aug 2014 5:00pm GMT
18 Aug 2014
A little more than 2 years ago, the Ceilometer project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms for functionalities such as billing, alarming or capacity planning.
In this article, I would like to relate what I've been doing with other Ceilometer developers in the last 5 months. I've lowered my involvement in Ceilometer itself directly to concentrate on solving one of its biggest issue at the source, and I think it's largely time to take a break and talk about it.
Ceilometer early design
For the last years, Ceilometer didn't change in its core architecture. Without diving too much in all its parts, one of the early design decision was to build the metering around a data structure we called samples. A sample is generated each time Ceilometer measures something. It is composed of a few fields, such as the the resource id that is metered, the user and project id owning that resources, the meter name, the measured value, a timestamp and a few free-form metadata. Each time Ceilometer measures something, one of its components (an agent, a pollster…) constructs and emits a sample headed for the storage component that we call the collector.
This collector is responsible for storing the samples into a database. The Ceilometer collector uses a pluggable storage system, meaning that you can pick any database system you prefer. Our original implementation has been based on MongoDB from the beginning, but we then added a SQL driver, and people contributed things such as HBase or DB2 support.
The REST API exposed by Ceilometer allows to execute various reading requests on this data store. It can returns you the list of resources that have been measured for a particular project, or compute some statistics on metrics. Allowing such a large panel of possibilities and having such a flexible data structure allows to do a lot of different things with Ceilometer, as you can almost query the data in any mean you want.
The scalability issue
We soon started to encounter scalability issues in many of the read requests made via the REST API. A lot of the requests requires the data storage to do full scans of all the stored samples. Indeed, the fact that the API allows you to filter on any fields and also on the free-form metadata (meaning non indexed key/values tuples) has a terrible cost in terms of performance (as pointed before, the metadata are attached to each sample generated by Ceilometer and is stored as is). That basically means that the sample data structure is stored in most drivers in just one table or collection, in order to be able to scan them at once, and there's no good "perfect" sharding solution, making data storage scalability painful.
It turns out that the Ceilometer REST API is unable to handle most of the requests in a timely manner as most operations are O(n) where n is the number of samples recorded (see big O notation if you're unfamiliar with it). That number of samples can grow very rapidly in an environment of thousands of metered nodes and with a data retention of several weeks. There is a few optimizations to make things smoother in general cases fortunately, but as soon as you run specific queries, the API gets barely usable.
During this last year, as the Ceilometer PTL, I discovered these issues first hand since a lot of people were feeding me back with this kind of testimony. We engaged several blueprints to improve the situation, but it was soon clear to me that this was not going to be enough anyway.
Thinking outside the box
Unfortunately, the PTL job doesn't leave him enough time to work on the actual code nor to play with anything new. I was coping with most of the project bureaucracy and I wasn't able to work on any good solution to tackle the issue at its root. Still, I had a few ideas that I wanted to try and as soon as I stepped down from the PTL role, I stopped working on Ceilometer itself to try something new and to think a bit outside the box.
When one takes a look at what have been brought recently in Ceilometer, they can see the idea that Ceilometer actually needs to handle 2 types of data: events and metrics.
Events are data generated when something happens: an instance start, a volume is attached, or an HTTP request is sent to an REST API server. These are events that Ceilometer needs to collect and store. Most OpenStack components are able to send such events using the notification system built into oslo.messaging.
Metrics is what Ceilometer needs to store but that is not necessarily tied to an event. Think about an instance CPU usage, a router network bandwidth usage, the number of images that Glance is storing for you, etc… These are not events, since nothing is happening. These are facts, states we need to meter.
Computing statistics for billing or capacity planning requires both of these data sources, but they should be distinct. Based on that assumption, and the fact that Ceilometer was getting support for storing events, I started to focus on getting the metric part right.
I had been a system administrator for a decade before jumping into OpenStack development, so I know a thing or two on how monitoring is done in this area, and what kind of technology operators rely on. I also know that there's still no silver bullet - this made it a good challenge.
The first thing that came to my mind was to use some kind of time-series database, and export its access via a REST API - as we do in all OpenStack services. This should cover the metric storage pretty well.
A cloud of gnocchis!
At the end of April 2014, this led met to start a new project code-named Gnocchi. For the record, the name was picked after confusing so many times the OpenStack Marconi project, reading OpenStack Macaroni instead. At least one OpenStack project should have a "pasta" name, right?
The point of having a new project and not send patches on Ceilometer, was that first I had no clue if it was going to make something that would be any better, and second, being able to iterate more rapidly without being strongly coupled with the release process.
The first prototype started around the following idea: what you want is to meter things. That means storing a list of tuples of (timestamp, value) for it. I've named these things "entities", as no assumption are made on what they are. An entity can represent the temperature in a room or the CPU usage of an instance. The service shouldn't care and should be agnostic in this regard.
One feature that we discussed for several OpenStack summits in the Ceilometer sessions, was the idea of doing aggregation. Meaning, aggregating samples over a period of time to only store a smaller amount of them. These are things that time-series format such as the RRDtool have been doing for a long time on the fly, and I decided it was a good trail to follow.
I assumed that this was going to be a requirement when storing metrics into Gnocchi. The user would need to provide what kind of archiving it would need: 1 second precision over a day, 1 hour precision over a year, or even both.
The first driver written to achieve that and store those metrics inside Gnocchi was based on whisper. Whisper is the file format used to store metrics for the Graphite project. For the actual storage, the driver uses Swift, which has the advantages to be part of OpenStack and scalable.
Storing metrics for each entities in a different whisper file and putting them in Swift turned out to have a fantastic algorithm complexity: it was O(1). Indeed, the complexity needed to store and retrieve metrics doesn't depends on the number of metrics you have nor on the number of things you are metering. Which is already a huge win compared to the current Ceilometer collector design.
However, it turned out that whisper has a few limitations that I was unable to circumvent in any manner. I needed to patch it to remove a lot of its assumption about manipulating file, or that everything is relative to now (
time.time()). I've started to hack on that in my own fork, but… then everything broke. The whisper project code base is, well, not the state of the art, and have 0 unit test. I was starring at a huge effort to transform whisper into the time-series format I wanted, without being sure I wasn't going to break everything (remember, no test coverage).
I decided to take a break and look into alternatives, and stumbled upon Pandas, a data manipulation and statistics library for Python. Turns out that Pandas support time-series natively, and that it could do a lot of the smart computation needed in Gnocchi. I built a new file format leveraging Pandas for computing the time-series and named it carbonara (a wink to both the Carbon project and pasta, how clever!). The code is quite small (a third of whisper's, 200 SLOC vs 600 SLOC), does not have many of the whisper limitations and… it has test coverage. These Carbonara files are then, in the same fashion, stored into Swift containers.
Anyway, Gnocchi storage driver system is designed in the same spirit that the rest of OpenStack and Ceilometer storage driver system. It's a plug-in system with an API, so anyone can write their own driver. Eoghan Glynn has already started to write a InfluxDB driver, working closely with the upstream developer of that database. Dina Belova started to write an OpenTSDB driver. This helps to make sure the API is designed directly in the right way.
Measuring individual entities is great and needed, but you also need to link them with resources. When measuring the temperature and the number of a people in a room, it is useful to link these 2 separate entities to a resource, in that case the room, and give a name to these relations, so one is able to identify what attribute of the resource is actually measured. It is also important to provide the possibility to store attributes on these resources, such as their owners, the time they started and ended their existence, etc.
Relationship of entities and resources
Once this list of resource is collected, the next step is to list and filter them, based on any criteria. One might want to retrieve the list of resources created last week or the list of instances hosted on a particular node right now.
Resources also need to be specialized. Some resources have attributes that must be stored in order for filtering to be useful. Think about an instance name or a router network.
All of these requirements led to to the design of what's called the indexer. The indexer is responsible for indexing entities, resources, and link them together. The initial implementation is based on SQLAlchemy and should be pretty efficient. It's easy enough to index the most requested attributes (columns), and they are also correctly typed.
We plan to establish a model for all known OpenStack resources (instances, volumes, networks, …) to store and index them into the Gnocchi indexer in order to request them in an efficient way from one place. The generic resource class can be used to handle generic resources that are not tied to OpenStack. It'd be up to the users to store extra attributes.
Dropping the free form metadata we used to have in Ceilometer makes sure that querying the indexer is going to be efficient and scalable.
The indexer classes and their relations
All of this is exported via a REST API that was partially designed and documented in the Gnocchi specification in the Ceilometer repository; though the spec is not up-to-date yet. We plan to auto-generate the documentation from the code as we are currently doing in Ceilometer.
The REST API is pretty easy to use, and you can use it to manipulate entities and resources, and request the information back.
Macroscopic view of the Gnocchi architecture
Roadmap & Ceilometer integration
All of this plan has been exposed and discussed with the Ceilometer team during the last OpenStack summit in Atlanta in May 2014, for the Juno release. I led a session about this entire concept, and convinced the team that using Gnocchi for our metric storage would be a good approach to solve the Ceilometer collector scalability issue.
It was decided to conduct this project experiment in parallel of the current Ceilometer collector for the time being, and see where that would lead the project to.
Some engineers from Mirantis did a few benchmarks around Ceilometer and also against an early version of Gnocchi, and Dina Belova presented them to us during the mid-cycle sprint we organized in Paris in early July.
The following graph sums up pretty well the current Ceilometer performance issue. The more you feed it with metrics, the more slow it becomes.
For Gnocchi, while the numbers themselves are not fantastic, what is interesting is that all the graphs below show that the performances are stable without correlation with the number of resources, entities or measures. This proves that, indeed, most of the code is built around a complexity of O(1), and not O(n) anymore.
Clément drawing the logo
While the Juno cycle is being wrapped-up for most projects, including Ceilometer, Gnocchi development is still ongoing. Fortunately, the composite architecture of Ceilometer allows a lot of its features to be replaced by some other code dynamically. That, for example, enables Gnocchi to provides a Ceilometer dispatcher plugin for its collector, without having to ship the actual code in Ceilometer itself. That should help the development of Gnocchi to not be slowed down by the release process for now.
The Ceilometer team aims to provide Gnocchi as a sort of technology preview with the Juno release, allowing it to be deployed along and plugged with Ceilometer. We'll discuss how to integrate it in the project in a more permanent and strong manner probably during the OpenStack Summit for Kilo that will take place next November in Paris.
18 Aug 2014 3:00pm GMT
We have an opening in our Graphics Team to work on improving the state of open source GPU drivers. Your tasks would include working on various types of hardware and make sure it works great under linux and improving the general state of the Linux graphics stack. Since the work would include working on some specific pieces of hardware it would require the candidate to relocate to our Westford office, just north of Boston.
We are open to candidates with a range of backgrounds, but of course previous history with linux kernel codebase or the X.org codebase or Wayland is an advantage.
Please contact me at cschalle-at-redhat-com if you are interested.
18 Aug 2014 11:26am GMT
16 Aug 2014
There hasn't been a progress-report on DEP-11 for some time, but that doesn't mean there was no work going on on it.
DEP-11 is Debian's implementation of AppStream, as well as an effort to enhance the metadata available about software in Debian. While initially, AppStream was only about applications, DEP-11 was designed with a larger scope, to collect data about libraries, binaries and things like Python modules. Now, since AppStream 0.6, DEP-11 and AppStream have essentially the same scope, with the difference of DEP-11 metadata being described in YAML, while official AppStream data is XML. That was due to a request by our ftpmasters team, which doesn't like XML (which is also not used anywhere in Debian, as opposed to YAML). But this doesn't mean that people will have to deal with the YAML file format: The libappstream library will just take DEP-11 data as another data source for it's Xapian database, allowing anything using libappstream to access that data just like the XML stuff. Richards libappstream-glib will also receive support for the DEP-11 format soon, filling it's in-memory data cache and enabling the use of GNOME-Software on Debian.
So, what has been done so far? The past months, my Google Summer of Code student. Abhishek Bhattacharjee, was working hard to integrate DEP-11 support into dak, the Debian Archive Kit, which maintains the whole Debian archive. The result will be an additional metadata table in our internal Postgres database, storing detailed information about the software available in a Debian package, as well as "Components-<arch>.yml.gz" files in the Debian repositories. Dak will also produce an application icon-cache and a screenshots repository. During the time of the SoC, Abhishek focused mainly on the applications part of things, and less on the other components (like extracting data about Python modules or libraries) - these things can easily be implemented later.
The remaining steps will be to polish the code and make it merge-ready for Debian's dak (as soon as it has received enough testing, we will likely give it a try on the Tanglu Debian derivative). Following that, Apt will be extended to fetch the DEP-11 data on-demand on systems where it is useful (which is currently mostly desktop-systems) - if you want to save a little bit of space, you will be able to disable downloading this extra metadata in Apt. From there, libappstream will take the data for it's Xapian db. This will lead to the removal of the much-hated (from ftpmasters and maintainers side) app-install-data package, which has not been updated for two years and only contains a small fraction of the metadata provided by DEP-11.
What Debian will ultimately gain from this effort is support for software-centers like GNOME-Software, and improved support for tools like Apper and Muon in displaying applications. Long-term, with more metadata being available, It would be cool to add support for it to "specialized package managers", like Python's pip, npm or gem, to make them fetch information about available distribution software and install that instead of their own copies from 3rd-party repositories, if possible. This should ultimatively lead to less code duplication on distributions and will likely result in fewer security issues, since the officially maintained and integrated distribution packages can easily be used, if possible. This is no attempt to make tools like pip obsolete, but an attempt to have the different tools installing software on your machine communicate better, instead of creating parallel worlds in terms of software management. Another nice sideeffect of more metadata will be options to search for tools handling mimetypes in the software repos (in case you can't open a file), smart software centers installing missing firmware, and automatic suggestions for developers which software they need to install in order to build a specific software package. Also, the data allows us to match software across distributions, for that, I will have some news soon (not sure how soon though, as I am currently in thesis-writing-mode, and therefore have not that much spare time). Since the goal is to have these features available on all distributions supporting AppStream, it will take longer to realize - but we are on a good way.
So, if you want some more information about my student's awesome work, you can read his blogpost about it. He will also be at Debconf'14 (Portland). (I can't make it this time, but I surely won't miss the next Debconf)
Sadly, I only see a very small chance to have the basic DEP-11 stuff land in-time for Jessie (lots of review work needs to be done, and some more code needs to be written), but we will definitively have it in Jessie+1.
16 Aug 2014 2:50pm GMT
15 Aug 2014
Everyone has been blogging about GUADEC, but I'd like to talk about my other favorite conference of the year, which is GNOME.Asia. This year, it was in Beijing, a mightily interesting place. Giant megapolis, with grandiose architecture, but at the same time, surprisingly easy to navigate with its efficient metro system and affordable taxis. But the air quality is as bad as they say, at least during the incredibly hot summer days where we visited.
The conference itself was great, this year, co-hosted with FUDCon's asian edition, it was interesting to see a crowd that's really different from those who attend GUADEC. Many more people involved in evangelising, deploying and using GNOME as opposed to just developing it, so it allows me to get a different perspective.
On a related note, I was happy to see a healthy delegation from Asia at GUADEC this year!
15 Aug 2014 4:50am GMT
10 Aug 2014
I'll talk again about the interface between the Linux kernel and the userspace (mesa). After few weeks of work, I now have a full implementation which exposes NVIDIA's performance counters in Nouveau. I actually have two versions with different approachs. The first one is almost "all-userspace" which means that the configuration and the logic of performance counters are stored in the userspace, while the second one is almost "all-kernelspace" and only exposes what events can be monitored from the userspace. These two approachs use a set of software methods and the perfmon engine of Nouveau, initially written by Ben Skeggs, in order to set up performance counters.
This post will only focus on global counters, please refer to my latest article about MP counters on nv50/Tesla if you are interested. Before we continue, let me recall what is a performance counter for NVIDIA.
PCOUNTER: The performance counters engine
A hardware performance counter is a set of special registers which are used to store the counts of hardware-related activities. Hardware counters are oftenly used by developers to identify bottlenecks in their applications.
PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided in 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of one multiplexer. PCOUNTER uses global counters. Counters do not sample one 8-bits signal, they sample a macro signal. A macro signal is the aggregation of 4 signals which have been combined using a function. An overview of this logic is represented by the figure below.
Now, let me talk a bit about graphics counter exposed by NVIDIA on nv50/Tesla family.
Graphics counter for 3D applications
Graphics counter can be used to give detailled information for OpenGL/Direct3D applications. These performance counters are only exposed by NVIDIA PerfKit, an advanced software suite for profiling OpenCL and Direct3D/OpenGL applications on Windows (only). Last year, I reverse engineered most of these graphics counter. You can take a quick look at the documentation for nva3 (for example), this will introduce the notion of complex hardware events.
Overview of complex hardware events
A complex hardware event is composed by one or two macro signals which have been combined with a counter mode. Some of them are sometimes multiplexed and thus a multiplexer (a tuple address and value) needs to be configured in the engine which generates the signal. Hardware events are so the aggregation of multiple 8-bits signals and they are harder to monitor than a simple signal. Some events are also too complex to be monitored at one time and thus need multiple passes. As perfkit polls counters after each frame, an event that requires multiple passes will need the same amount of frame to be monitored. For instance, for frame x, the counters are set for the pass #0 while they are set up for pass #1 at frame x+1. The results of the two passes are then combined to create the result of the event. Multi-passes events are thus less accurate because they need more frames to be monitored
The main goal of the interface between the kernel and mesa is to expose these complex hardware events to the userspace.
The first interface ("all-userspace" approach)
The main idea of this interface is to store the configuration of complex hardware events inside mesa. In this approach, the kernel only knows the list of 8-bits signals and exposes them with a unique string identifier, for example, the signal 0xcb on nva3 is associated to 'gr_idle' on the set 1. Then, the userspace can build complex events and send the configuration to the kernel through an ioctl call which allocates a NOUVEAU_PERFCTR_CLASS object. A NOUVEAU_PERFCTR_CLASS object is used to init, poll and read performance counters.
This interface is based on a set of softwared methods used to control performance counters. Basically, we first allocate a NOUVEAU_PERFCTR_CLASS object with the configuration (8-bits signal/function/mode …) of the counter. Then, before a frame is rendering (using the begin_query() hook of gallium) we send the handle of this object with a software method to start monitoring. At this time, the configuration is written to PCOUNTER and the counter starts to count hardware related activities. After the frame, we send a sequence number with an another software method to read out values using a notify buffer object which is allocated along the current channel. If you are interested, a previous post gives more details about that interface.
With this "all-userspace" approach, the kernel is not able to monitor complex hardware events because the configuration and the logic is stored in the userspace. Actually, the configuration is shared between the kernel and mesa. The kernel only knows 8-bits signals while the userspace knows the configuration of hardware events.
Perf also called perf_events, is a kernel-based interface for profiling Linux which is able to monitor performance counters like the number of instructions executed. Thus, if the configuration of hardware events is stored in the userspace, this will be a problem for exposing them in perf because we don't want to duplicate the configuration. I also talked with Daniel Vetter, the maintener of the i965 driver and the responsible of the major part of DRM, and he seems to be agree with the idea that it could be good to expose hardware events in perf.
We also have an another problem related to muxs because the userspace knows the configuration while the kernel does not. So, the kernel has to check address of muxs in order to avoid security issues.
The last problem is that the interface is closely based on the perfmon engine, so if perfmon changes in the future, this will require to add a new interface. But, we don't want to add another driver private ioctl or design a new interface in case of perfmon must be evolved in the future. However, with the "all-kernelspace" approach we don't have this problem since the kernel knows the logic and only exposes a list of monitorable events.
However, the "all-userspace" approach has the advantages to reduce the amount of code in the kernel and to facilitate the configuration of counters since all the logic is located in the userspace.
If you are interested you can take a look at the code :
mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_pcounter_pm
libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfctr_class
nouveau source code: https://github.com/hakzsam/nouveau/commits/expose_perfctr_class
The second interface ("all-kernelspace" approach)
This interface is kernel-based like Perf. The configuration and the logic (except multi-pass events which need two frames) are stored in the kernel only. The kernel exposes a list of monitorable events. Thus, the userspace just has to allocate a NOUVEAU_PERFEVENT_CLASS used to init, read and poll complex hardware events.
Like the previous interface, this is one is also based on a set of software methods used to control performance counters. The behaviour is almost the same than before except that we allocate a NOUVEAU_PERFEVENT_CLASS object which represents a complex hardware event instead of a NOUVEAU_PERFCTR_CLASS.
With this approach it's easy to monitor complex hardware events inside Nouveau and to expose them to Perf in the future. Also, there is no security issues because muxs are configured from and by the kernel, we don't have to check their address.
Since, the kernel only exposes a list of events and stores the configuration, pefmon can change without any impacts to the interface between the kernel and the userspace in the future. Basically, the userspace only knows the name of events, and some flags used to do scheduling. However, it's hard to expose to the userspace what events are monitorable simultaneously or not.
On nv50/Tesla, we have 8 domains (or sets) and 4 counters per domain. Thus, if all complex events only use one counter per domain, we can monitor 32 events simultaneously. Good! But actually not… Because some events use 2 counters per domain. To handle this case, the userspace can retrieve the number of available domains and the number of counters per domain through an ioctl call. Then, we expose the domain ID and the number of counters needed by an event. With this information, we can schedule events from the userspace. But we still have one problem, how to handle the case where two events on the same domain share a mux?
Some events are multiplexed but two or more events can use the same mux with a different value. To handle this special case, we expose conflicts to the userspace using some 64 bits flags. Thus, the userspace just has to do a AND comparison to check if two events can be monitored simultaneously.
The source code of this "all-kernelspace" version is available below :
mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_kernelspace_version
libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfevent_class
nouveau source code: https://github.com/hakzsam/nouveau/commits/nv50_kernelspace_version
What is the best approach ? pros & cons
- reduce the amount of code in the kernel
- easy to apply logic of performance counters
- not possible to monitor complex hardware events inside Nouveau and perf (Linux)
- configuration of counters is shared between the userspace (complex events) and the kernelspace (8-bits signals)
- possible security issues (the kernel must know address of muxs to check queries)
- the interface (and the userspace) must be changed if perfmon changes in the future
- possible to monitor complex hardware events inside Nouveau and perf (Linux)
- configuration and logic (except multi-pass events) are stored in the kernel only
- no security issues (muxs are configured by the kernel)
- perfmon can evolve without any impacts regarding the interface since it only exposes a list of events
- add more code in the kernel
- hard to expose to the userspace what events are monitorable simultaneously or not
These two interfaces have different pros and cons, but in my opinion, I think the "all-kernelspace" is more elegant and more future-proof since we can monitor complex hardware events inside Nouveau and expose them to perf (Linux) .
To sum up, we still have to choose one version of the interface between the kernel and mesa. I'll talk about this with Ben Skeggs, the maintener of Nouveau to get his opinion. We hope to get the code upstream in september or october, and before Linux 3.19.
Have a good day!
10 Aug 2014 10:04pm GMT
08 Aug 2014
At Collabora, we're always on the lookout for cool opportunities involving Wayland and we noticed recently that Mozilla had started to show some interest in porting Firefox to Wayland. In short, the Wayland display server is becoming very popular for being lightweight, versatile yet powerful and is designed to be a replacement for X11. Chrome and Webkit already got Wayland ports and we think that Firefox should have that too.
Some months ago, we wrote a simple proof-of-concept basically starting from actual Gecko's GTK3 paths and stripping all the MOZ_X11 ifdefs out of the way. We did a bunch of quick hacks fixing broken stuff but rather easily and quickly (couple days), we got Firefox to run on Weston (Wayland official reference compositor). Ok, because of hard X11 dependencies, keyboard input was broken and decorations suffered a little, but that's a very good start! Take a look at the below screenshot :)
08 Aug 2014 1:44pm GMT
07 Aug 2014
Over the past few months, working at Collabora, I have helped Mozilla get rid of Xlib surfaces for content on Linux platform. This task was the primary problem keeping Mozilla from turning OpenGL layers on by default on Linux, which is one of their long-term goals. I'll briefly explain this long-term goal and will thereafter give details about how I got rid of Xlib surfaces.
LONG-TERM GOAL - Enabling Skia layers by default on Linux
My work integrated into a wider, long-term goal that Mozilla currently has : To enable Skia layers by default on Linux (Bug 1038800). And for a glimpse into how Mozilla initially made Skia layers work on linux, see bug 740200. At the time of writing this article, Skia layers are still not enabled by default because there are some open bugs about failing Skia reftests and OMTC (off-main-thread compositing) not being fully stable on linux at the moment (Bug 722012). Why is OMTC needed to get Skia layers on by default on linux ? Simply because by design, users that choose OpenGL layers are being grandfathered OMTC on Linux… and since the MTC (main-thread compositing) path has been dropped lately, we must tackle the OMTC bugs before we can dream about turning Skia layers on by default on Linux.
For a more detailed explanation of issues and design considerations pertaining turning Skia layers on by default on Linux, see this wiki page.
MY TASK - Getting rig of Xlib surfaces for content
Xlib surfaces for content rendering have been used extensively for a long time now, but when OpenGL got attention as a means to accelerate layers, we quickly ran into interoperability issues between XRender and Texture_From_Pixmap OpenGL extension… issues that were assumed insurmountable after initial analysis. Also, and I quote Roc here, "We [had] lots of problems with X fallbacks, crappy X servers, pixmap usage, weird performance problems in certain setups, etc. In particular we [seemed] to be more sensitive to Xrender implementation quality that say Opera or Webkit/GTK+." (Bug 496204)
So for all those reasons, someone had to get rid of Xlib surfaces, and that someone was… me ;)
So problem was to get rid of Xlib surfaces (gfxXlibSurface) for content under Linux/GTK platform and implicitly, of course, replace them with Image surfaces (gfxImageSurface) so they become regular memory buffers in which we can render with GL/gles and from which we can composite using GPU. Now, it's pretty easy to force creation of Image surfaces (instead of Xlib ones) for just all content layers in gecko gfx/layers framework, just force gfxPlatformGTK::CreateOffscreenSurfaces(…) to create gfxImageSurfaces in any case.
Problem is, naively doing so gives rise to a series of perf. regressions and sub-optimal paths being taken, for example, to copy image buffers around when passing them across process boundaries, or unnecessary copying when compositing under X11 with Xrender support. So the real work was to fix everything after having pulled the gfxXlibSurface plug ;)
First glitch on the way was that GTK2 theme rendering, per design, *had* to happen on Xlib surfaces. We didn't have much choice as to narrow down our efforts to the GTK3 branch alone. What's nice with GTK3 on that front is that it makes integral use of cairo, thus letting theme rendering happen on any type of cairo_surface_t. For more detail on that decision, read this.
Upfront, we noticed that the already implemented GL compositor was properly managing and buffering image layer contents, which is a good thing, but on the way, we saw that the 'basic' compositor did not. So we started streamlining basic compositor under OMTC for GTK3.
The core of the solution here was about implementing server-side buffering of layer contents that were using image backends. Since targetted platform was Linux/GTK3 and since Xrender support is rather frequent, the most intuitive thing to do was to subclass BasicCompositor into a new X11BasicCompositor and make it use a new specialized DataTextureSource (that we called X11DataTextureSourceBasic) that basically buffers upcoming layer content in ::Update() to an gfxXlibSurface that we keep alive for the TextureSource lifetime (unless surface changes size and/or format).
Performance results were satisfying. For 64 bit systems, we had around 75% boost in tp5o_shutdown_paint, 6% perf gain for 'cart', 14% for 'tresize', 33% for tscrollx and 12% perf gain on tcanvasmark.
For complete details about this effort, design decisions and resulting performance numbers, please read the corresponding bugzilla ticket.
To see the code that we checked-in to solve this, look at those 2 patches :
07 Aug 2014 8:21pm GMT
- If you have an orientation sensor in your laptop that works under Windows 8, this tool might be of interest to you.
- Mattias will use that code as a base to add Compass support to Geoclue (you're on the hook!)
- I've made a hack to load games metadata using Grilo and Lua plugins (everything looks like nail when you have a hammer ;)
- I've replaced a Linux phone full of binary blobs by another Linux phone full of binary blobs
- I believe David Herrmann missed out on asking for a VT, and getting something nice in return.
- Cosimo will be writing some more animations for me! (and possibly for himself)
- I now know more about core dumps and stack traces than I would want to, but far less than I probably will in the future.
- Get Andrea to approve Timm Bädert's git account so he can move Corebird to GNOME. Don't forget to try out Charles, Timm!
- My team won FreeFA, and it's not even why I'm smiling ;)
- The cathedral has two towers!
Unfortunately for GUADEC guests, Bretzel Airlines opened its new (and first) shop on Friday, the last days of the BoFs.
(Lovely city, great job from Alexandre, Nathalie, Marc and all the volunteers, I'm sure I'll find excuses to come back :)
07 Aug 2014 6:39pm GMT
So with the 3.16 kernel out of the door it's time to look at what's queued up for the Intel graphics driver in 3.17.
This release features the universal plane support from Matt Roper, all enabled already by default. This is prep work for atomic modesetting and pageflipping support: Since a while we support additional (overlay) planes in the DRM core and the i915 driver, but there have always been two implicit planes directly attached to the CRTC: The primary plane used by the SetCrtc and PageFlip functions, and the optional cursor support. But with the atomic ioctl these implicit planes it's easier to handle everything as an explicit plane, so Matt's patches split them away into separate real plane objects. This is a nice cleanup of the kms api in general since a lot of SoC hardware has unified plane hardware, where cursor, primary plane and any overlays are fully interchangeable. So we already expose this to userspace, if it sets the corresponding feature flag.
Another big feature on the display side is the improved PSR support, which is now enabled by default on Haswell and Broadwell. The tricky bit with PSR (and also with FBC) and the reason we didn't yet enable this by default is correctly support legacy frontbuffer rendering (for example for X). The hardware provides a bit of support to do that, but it doesn't catch all possible frontbuffer rendering and has a lot of other limitations. To finally fix this for real we've added accurate frontbuffer tracking in software. This should finally allow us to enable a lot of display power saving features by default like PSR on Baytrail, FBC (on all platforms) and DRRS (dynamic refresh rate switching).
On actual platform display enabling we have lots of improvements all over: Baytrail MIPI DSI support has greatly stabilized, backlight and power sequencer fixes, mmio based flips to work around issues with stalls and hangs for blitter ring based flips and plenty of other work. The core drm pieces for plane rotation support have also landed, unfortunately the i915 parts didn't make the cut for 3.17.
Another big area, as usual, has been general power management improvements. We now support runtime PM for DPMS Off and not just when the output is completely disabled. This was fairly invasive work since our current modesetting code assumed that a DPMS Off/On cycle will not destroy register state, but that's exactly what runtime PM can do. On the plus side this reorganization greatly cleaned up the code base and prepared the driver for atomic modesetting, which requires a similar separation between state computation and actual hw state updating like this feature.
Jesse Barnes implemented S0ix support for system suspend/resume. Marketing has some crazy descriptions for this, but essentially this means that we use the same power saving knobs for system suspend as for runtime PM - the entire machine is still running, just at a very low power state. Long-term this should simplify our system suspend code a bit since we can just reuse all the code used to implement runtime PM.
Moving on to the render side of the gpu there have been again improvements to the rps code. Chris Wilson further tuned the rps boost logic, and Ville and Deepak implemented rps support for Cherrytrail.
Jesse contributed ppgtt support for Baytrail which will be a lot more interesting once we enable full ppgtt again (hopefully in 3.18).
For Broadwell semaphores support from Ben and Rodrigo was merged, but it looks like we need to disable that again due to stability issues. Oscar Mateo also implemented a large pile of interrupt handling improvements which hopefully address the small races and bugs we've had in the past on some platforms. There's also a lot of refactoring patches to prepare for execlist support from Oscar. Excelists are the new way of submitting work to the gpu, first supported on Broadwell (but not yet mandatory). The key feature compared to legacy ringbuffer submission is that we'll finally be able to preempt gpu tasks.
And as usual there have been tons of bugsfixes and improvements all over. Oh and: User mode setting has moved one step further on the path to deprecation and is now fully disabled. If no one complains about this we can finally rip out all that code in one of the next kernel releases.
07 Aug 2014 3:36pm GMT
04 Aug 2014
In April, when Solaris 11.2 Beta was released, I posted a list of changes to bundled software packages between Solaris 11.1 & 11.2. Now that the final release of Solaris 11.2 reached General Availability last week, I've gone back to compare the beta release via the GA release.
As you would expect, there are many fewer changes in the three months between beta & GA than in the 18 months before that. Most of the change came from upgrading the OpenStack packages from the Grizzly (2013.1) release to the Havana (2013.2) release, and adding the Swift OpenStack Object Storage components and other packages like Django which the new OpenStack components needed. There are also some general bug fix or security fix updates, such as upgrading OpenSSL from 1.0.1g to 1.0.1h.
One other change that showed up when gathering data for this list was that the Oracle Database 12c prerequisites package was renamed between beta & GA to better match the database naming style - previously it was called group/prerequisite/oracle/oracle-rdbms-server-12cR1-preinstall but is now group/prerequisite/oracle/oracle-rdbms-server-12-1-preinstall. Fortunately, you don't have to type in the whole FMRI to install it, pkg install oracle-rdbms-server-12-1-preinstall is enough.
Detailed list of changes
This table shows most of the changes to the bundled packages between the 11.2 beta released in April, and the 11.2 GA release in July.
As before, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn't change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.
|Package||Upstream||11.2 Beta||11.2 GA|
(Java SE 7u55)
(Java SE 7u65)
(Java SE 8u5)
(Java SE 8u11)
|library/python/cffi||Python CFFI||not included||0.8.2|
|library/python/six||pypi six||not included||1.6.1|
(Java SE 7u55)
(Java SE 7u65)
(Java SE 8u5)
(Java SE 8u11)
04 Aug 2014 5:13pm GMT