21 Aug 2014
Today I finally got X up on my vc4 driver using glamor. As you can see, there are a bunch of visual issues, and what you can't see is that after a few frames of those gears the hardware locked up and didn't come back. It's still major progress.
The code can be found in my vc4 branch of mesa and linux-2.6, and the glamor branch of my xf86-video-modesetting. I think the driver's at the point now that someone else could potentially participate. I've intentionally left a bunch of easy problems -- things like supporting the SCS, DST, DPH, and XPD opcodes, for which we have piglit tests (in glean) and are just a matter of translating the math from TGSI's vec4 instruction set (documented in tgsi.rst) to the scalar QIR opcodes.
21 Aug 2014 11:58pm GMT
19 Aug 2014
I've switched my Git repositories to GitHub recently, and started to watch my contributions statistics, which were very low considering I spend my days hacking on open source software, especially OpenStack.
OpenStack hosts its Git repositories on its own infrastructure at git.openstack.org, but also mirrors them on GitHub. Logically, I was expecting GitHub to track my commits there too, as I'm using the same email address everywhere.
It turns out that it was not the case, and the help page about that on GitHub describes the rule in place to compute statistics. Indeed, according to GitHub, I had no relations to the OpenStack repositories, as I never forked them nor opened a pull request on them (OpenStack uses Gerrit).
Starring a repository is enough to build a relationship between a user and a repository, so this is was the only thing needed to inform GitHub that I have contributed to those repositories. Considering OpenStack has hundreds of repositories, I decided to star them all by using a small Python script using pygithub.
And voilà, my statistics are now including all my contributions to OpenStack!
19 Aug 2014 5:00pm GMT
18 Aug 2014
A little more than 2 years ago, the Ceilometer project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms for functionalities such as billing, alarming or capacity planning.
In this article, I would like to relate what I've been doing with other Ceilometer developers in the last 5 months. I've lowered my involvement in Ceilometer itself directly to concentrate on solving one of its biggest issue at the source, and I think it's largely time to take a break and talk about it.
Ceilometer early design
For the last years, Ceilometer didn't change in its core architecture. Without diving too much in all its parts, one of the early design decision was to build the metering around a data structure we called samples. A sample is generated each time Ceilometer measures something. It is composed of a few fields, such as the the resource id that is metered, the user and project id owning that resources, the meter name, the measured value, a timestamp and a few free-form metadata. Each time Ceilometer measures something, one of its components (an agent, a pollster…) constructs and emits a sample headed for the storage component that we call the collector.
This collector is responsible for storing the samples into a database. The Ceilometer collector uses a pluggable storage system, meaning that you can pick any database system you prefer. Our original implementation has been based on MongoDB from the beginning, but we then added a SQL driver, and people contributed things such as HBase or DB2 support.
The REST API exposed by Ceilometer allows to execute various reading requests on this data store. It can returns you the list of resources that have been measured for a particular project, or compute some statistics on metrics. Allowing such a large panel of possibilities and having such a flexible data structure allows to do a lot of different things with Ceilometer, as you can almost query the data in any mean you want.
The scalability issue
We soon started to encounter scalability issues in many of the read requests made via the REST API. A lot of the requests requires the data storage to do full scans of all the stored samples. Indeed, the fact that the API allows you to filter on any fields and also on the free-form metadata (meaning non indexed key/values tuples) has a terrible cost in terms of performance (as pointed before, the metadata are attached to each sample generated by Ceilometer and is stored as is). That basically means that the sample data structure is stored in most drivers in just one table or collection, in order to be able to scan them at once, and there's no good "perfect" sharding solution, making data storage scalability painful.
It turns out that the Ceilometer REST API is unable to handle most of the requests in a timely manner as most operations are O(n) where n is the number of samples recorded (see big O notation if you're unfamiliar with it). That number of samples can grow very rapidly in an environment of thousands of metered nodes and with a data retention of several weeks. There is a few optimizations to make things smoother in general cases fortunately, but as soon as you run specific queries, the API gets barely usable.
During this last year, as the Ceilometer PTL, I discovered these issues first hand since a lot of people were feeding me back with this kind of testimony. We engaged several blueprints to improve the situation, but it was soon clear to me that this was not going to be enough anyway.
Thinking outside the box
Unfortunately, the PTL job doesn't leave him enough time to work on the actual code nor to play with anything new. I was coping with most of the project bureaucracy and I wasn't able to work on any good solution to tackle the issue at its root. Still, I had a few ideas that I wanted to try and as soon as I stepped down from the PTL role, I stopped working on Ceilometer itself to try something new and to think a bit outside the box.
When one takes a look at what have been brought recently in Ceilometer, they can see the idea that Ceilometer actually needs to handle 2 types of data: events and metrics.
Events are data generated when something happens: an instance start, a volume is attached, or an HTTP request is sent to an REST API server. These are events that Ceilometer needs to collect and store. Most OpenStack components are able to send such events using the notification system built into oslo.messaging.
Metrics is what Ceilometer needs to store but that is not necessarily tied to an event. Think about an instance CPU usage, a router network bandwidth usage, the number of images that Glance is storing for you, etc… These are not events, since nothing is happening. These are facts, states we need to meter.
Computing statistics for billing or capacity planning requires both of these data sources, but they should be distinct. Based on that assumption, and the fact that Ceilometer was getting support for storing events, I started to focus on getting the metric part right.
I had been a system administrator for a decade before jumping into OpenStack development, so I know a thing or two on how monitoring is done in this area, and what kind of technology operators rely on. I also know that there's still no silver bullet - this made it a good challenge.
The first thing that came to my mind was to use some kind of time-series database, and export its access via a REST API - as we do in all OpenStack services. This should cover the metric storage pretty well.
A cloud of gnocchis!
At the end of April 2014, this led met to start a new project code-named Gnocchi. For the record, the name was picked after confusing so many times the OpenStack Marconi project, reading OpenStack Macaroni instead. At least one OpenStack project should have a "pasta" name, right?
The point of having a new project and not send patches on Ceilometer, was that first I had no clue if it was going to make something that would be any better, and second, being able to iterate more rapidly without being strongly coupled with the release process.
The first prototype started around the following idea: what you want is to meter things. That means storing a list of tuples of (timestamp, value) for it. I've named these things "entities", as no assumption are made on what they are. An entity can represent the temperature in a room or the CPU usage of an instance. The service shouldn't care and should be agnostic in this regard.
One feature that we discussed for several OpenStack summits in the Ceilometer sessions, was the idea of doing aggregation. Meaning, aggregating samples over a period of time to only store a smaller amount of them. These are things that time-series format such as the RRDtool have been doing for a long time on the fly, and I decided it was a good trail to follow.
I assumed that this was going to be a requirement when storing metrics into Gnocchi. The user would need to provide what kind of archiving it would need: 1 second precision over a day, 1 hour precision over a year, or even both.
The first driver written to achieve that and store those metrics inside Gnocchi was based on whisper. Whisper is the file format used to store metrics for the Graphite project. For the actual storage, the driver uses Swift, which has the advantages to be part of OpenStack and scalable.
Storing metrics for each entities in a different whisper file and putting them in Swift turned out to have a fantastic algorithm complexity: it was O(1). Indeed, the complexity needed to store and retrieve metrics doesn't depends on the number of metrics you have nor on the number of things you are metering. Which is already a huge win compared to the current Ceilometer collector design.
However, it turned out that whisper has a few limitations that I was unable to circumvent in any manner. I needed to patch it to remove a lot of its assumption about manipulating file, or that everything is relative to now (
time.time()). I've started to hack on that in my own fork, but… then everything broke. The whisper project code base is, well, not the state of the art, and have 0 unit test. I was starring at a huge effort to transform whisper into the time-series format I wanted, without being sure I wasn't going to break everything (remember, no test coverage).
I decided to take a break and look into alternatives, and stumbled upon Pandas, a data manipulation and statistics library for Python. Turns out that Pandas support time-series natively, and that it could do a lot of the smart computation needed in Gnocchi. I built a new file format leveraging Pandas for computing the time-series and named it carbonara (a wink to both the Carbon project and pasta, how clever!). The code is quite small (a third of whisper's, 200 SLOC vs 600 SLOC), does not have many of the whisper limitations and… it has test coverage. These Carbonara files are then, in the same fashion, stored into Swift containers.
Anyway, Gnocchi storage driver system is designed in the same spirit that the rest of OpenStack and Ceilometer storage driver system. It's a plug-in system with an API, so anyone can write their own driver. Eoghan Glynn has already started to write a InfluxDB driver, working closely with the upstream developer of that database. Dina Belova started to write an OpenTSDB driver. This helps to make sure the API is designed directly in the right way.
Measuring individual entities is great and needed, but you also need to link them with resources. When measuring the temperature and the number of a people in a room, it is useful to link these 2 separate entities to a resource, in that case the room, and give a name to these relations, so one is able to identify what attribute of the resource is actually measured. It is also important to provide the possibility to store attributes on these resources, such as their owners, the time they started and ended their existence, etc.
Relationship of entities and resources
Once this list of resource is collected, the next step is to list and filter them, based on any criteria. One might want to retrieve the list of resources created last week or the list of instances hosted on a particular node right now.
Resources also need to be specialized. Some resources have attributes that must be stored in order for filtering to be useful. Think about an instance name or a router network.
All of these requirements led to to the design of what's called the indexer. The indexer is responsible for indexing entities, resources, and link them together. The initial implementation is based on SQLAlchemy and should be pretty efficient. It's easy enough to index the most requested attributes (columns), and they are also correctly typed.
We plan to establish a model for all known OpenStack resources (instances, volumes, networks, …) to store and index them into the Gnocchi indexer in order to request them in an efficient way from one place. The generic resource class can be used to handle generic resources that are not tied to OpenStack. It'd be up to the users to store extra attributes.
Dropping the free form metadata we used to have in Ceilometer makes sure that querying the indexer is going to be efficient and scalable.
The indexer classes and their relations
All of this is exported via a REST API that was partially designed and documented in the Gnocchi specification in the Ceilometer repository; though the spec is not up-to-date yet. We plan to auto-generate the documentation from the code as we are currently doing in Ceilometer.
The REST API is pretty easy to use, and you can use it to manipulate entities and resources, and request the information back.
Macroscopic view of the Gnocchi architecture
Roadmap & Ceilometer integration
All of this plan has been exposed and discussed with the Ceilometer team during the last OpenStack summit in Atlanta in May 2014, for the Juno release. I led a session about this entire concept, and convinced the team that using Gnocchi for our metric storage would be a good approach to solve the Ceilometer collector scalability issue.
It was decided to conduct this project experiment in parallel of the current Ceilometer collector for the time being, and see where that would lead the project to.
Some engineers from Mirantis did a few benchmarks around Ceilometer and also against an early version of Gnocchi, and Dina Belova presented them to us during the mid-cycle sprint we organized in Paris in early July.
The following graph sums up pretty well the current Ceilometer performance issue. The more you feed it with metrics, the more slow it becomes.
For Gnocchi, while the numbers themselves are not fantastic, what is interesting is that all the graphs below show that the performances are stable without correlation with the number of resources, entities or measures. This proves that, indeed, most of the code is built around a complexity of O(1), and not O(n) anymore.
Clément drawing the logo
While the Juno cycle is being wrapped-up for most projects, including Ceilometer, Gnocchi development is still ongoing. Fortunately, the composite architecture of Ceilometer allows a lot of its features to be replaced by some other code dynamically. That, for example, enables Gnocchi to provides a Ceilometer dispatcher plugin for its collector, without having to ship the actual code in Ceilometer itself. That should help the development of Gnocchi to not be slowed down by the release process for now.
The Ceilometer team aims to provide Gnocchi as a sort of technology preview with the Juno release, allowing it to be deployed along and plugged with Ceilometer. We'll discuss how to integrate it in the project in a more permanent and strong manner probably during the OpenStack Summit for Kilo that will take place next November in Paris.
18 Aug 2014 3:00pm GMT
We have an opening in our Graphics Team to work on improving the state of open source GPU drivers. Your tasks would include working on various types of hardware and make sure it works great under linux and improving the general state of the Linux graphics stack. Since the work would include working on some specific pieces of hardware it would require the candidate to relocate to our Westford office, just north of Boston.
We are open to candidates with a range of backgrounds, but of course previous history with linux kernel codebase or the X.org codebase or Wayland is an advantage.
Please contact me at cschalle-at-redhat-com if you are interested.
18 Aug 2014 11:26am GMT
16 Aug 2014
There hasn't been a progress-report on DEP-11 for some time, but that doesn't mean there was no work going on on it.
DEP-11 is Debian's implementation of AppStream, as well as an effort to enhance the metadata available about software in Debian. While initially, AppStream was only about applications, DEP-11 was designed with a larger scope, to collect data about libraries, binaries and things like Python modules. Now, since AppStream 0.6, DEP-11 and AppStream have essentially the same scope, with the difference of DEP-11 metadata being described in YAML, while official AppStream data is XML. That was due to a request by our ftpmasters team, which doesn't like XML (which is also not used anywhere in Debian, as opposed to YAML). But this doesn't mean that people will have to deal with the YAML file format: The libappstream library will just take DEP-11 data as another data source for it's Xapian database, allowing anything using libappstream to access that data just like the XML stuff. Richards libappstream-glib will also receive support for the DEP-11 format soon, filling it's in-memory data cache and enabling the use of GNOME-Software on Debian.
So, what has been done so far? The past months, my Google Summer of Code student. Abhishek Bhattacharjee, was working hard to integrate DEP-11 support into dak, the Debian Archive Kit, which maintains the whole Debian archive. The result will be an additional metadata table in our internal Postgres database, storing detailed information about the software available in a Debian package, as well as "Components-<arch>.yml.gz" files in the Debian repositories. Dak will also produce an application icon-cache and a screenshots repository. During the time of the SoC, Abhishek focused mainly on the applications part of things, and less on the other components (like extracting data about Python modules or libraries) - these things can easily be implemented later.
The remaining steps will be to polish the code and make it merge-ready for Debian's dak (as soon as it has received enough testing, we will likely give it a try on the Tanglu Debian derivative). Following that, Apt will be extended to fetch the DEP-11 data on-demand on systems where it is useful (which is currently mostly desktop-systems) - if you want to save a little bit of space, you will be able to disable downloading this extra metadata in Apt. From there, libappstream will take the data for it's Xapian db. This will lead to the removal of the much-hated (from ftpmasters and maintainers side) app-install-data package, which has not been updated for two years and only contains a small fraction of the metadata provided by DEP-11.
What Debian will ultimately gain from this effort is support for software-centers like GNOME-Software, and improved support for tools like Apper and Muon in displaying applications. Long-term, with more metadata being available, It would be cool to add support for it to "specialized package managers", like Python's pip, npm or gem, to make them fetch information about available distribution software and install that instead of their own copies from 3rd-party repositories, if possible. This should ultimatively lead to less code duplication on distributions and will likely result in fewer security issues, since the officially maintained and integrated distribution packages can easily be used, if possible. This is no attempt to make tools like pip obsolete, but an attempt to have the different tools installing software on your machine communicate better, instead of creating parallel worlds in terms of software management. Another nice sideeffect of more metadata will be options to search for tools handling mimetypes in the software repos (in case you can't open a file), smart software centers installing missing firmware, and automatic suggestions for developers which software they need to install in order to build a specific software package. Also, the data allows us to match software across distributions, for that, I will have some news soon (not sure how soon though, as I am currently in thesis-writing-mode, and therefore have not that much spare time). Since the goal is to have these features available on all distributions supporting AppStream, it will take longer to realize - but we are on a good way.
So, if you want some more information about my student's awesome work, you can read his blogpost about it. He will also be at Debconf'14 (Portland). (I can't make it this time, but I surely won't miss the next Debconf)
Sadly, I only see a very small chance to have the basic DEP-11 stuff land in-time for Jessie (lots of review work needs to be done, and some more code needs to be written), but we will definitively have it in Jessie+1.
16 Aug 2014 2:50pm GMT
15 Aug 2014
Everyone has been blogging about GUADEC, but I'd like to talk about my other favorite conference of the year, which is GNOME.Asia. This year, it was in Beijing, a mightily interesting place. Giant megapolis, with grandiose architecture, but at the same time, surprisingly easy to navigate with its efficient metro system and affordable taxis. But the air quality is as bad as they say, at least during the incredibly hot summer days where we visited.
The conference itself was great, this year, co-hosted with FUDCon's asian edition, it was interesting to see a crowd that's really different from those who attend GUADEC. Many more people involved in evangelising, deploying and using GNOME as opposed to just developing it, so it allows me to get a different perspective.
On a related note, I was happy to see a healthy delegation from Asia at GUADEC this year!
15 Aug 2014 4:50am GMT
10 Aug 2014
I'll talk again about the interface between the Linux kernel and the userspace (mesa). After few weeks of work, I now have a full implementation which exposes NVIDIA's performance counters in Nouveau. I actually have two versions with different approachs. The first one is almost "all-userspace" which means that the configuration and the logic of performance counters are stored in the userspace, while the second one is almost "all-kernelspace" and only exposes what events can be monitored from the userspace. These two approachs use a set of software methods and the perfmon engine of Nouveau, initially written by Ben Skeggs, in order to set up performance counters.
This post will only focus on global counters, please refer to my latest article about MP counters on nv50/Tesla if you are interested. Before we continue, let me recall what is a performance counter for NVIDIA.
PCOUNTER: The performance counters engine
A hardware performance counter is a set of special registers which are used to store the counts of hardware-related activities. Hardware counters are oftenly used by developers to identify bottlenecks in their applications.
PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided in 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of one multiplexer. PCOUNTER uses global counters. Counters do not sample one 8-bits signal, they sample a macro signal. A macro signal is the aggregation of 4 signals which have been combined using a function. An overview of this logic is represented by the figure below.
Now, let me talk a bit about graphics counter exposed by NVIDIA on nv50/Tesla family.
Graphics counter for 3D applications
Graphics counter can be used to give detailled information for OpenGL/Direct3D applications. These performance counters are only exposed by NVIDIA PerfKit, an advanced software suite for profiling OpenCL and Direct3D/OpenGL applications on Windows (only). Last year, I reverse engineered most of these graphics counter. You can take a quick look at the documentation for nva3 (for example), this will introduce the notion of complex hardware events.
Overview of complex hardware events
A complex hardware event is composed by one or two macro signals which have been combined with a counter mode. Some of them are sometimes multiplexed and thus a multiplexer (a tuple address and value) needs to be configured in the engine which generates the signal. Hardware events are so the aggregation of multiple 8-bits signals and they are harder to monitor than a simple signal. Some events are also too complex to be monitored at one time and thus need multiple passes. As perfkit polls counters after each frame, an event that requires multiple passes will need the same amount of frame to be monitored. For instance, for frame x, the counters are set for the pass #0 while they are set up for pass #1 at frame x+1. The results of the two passes are then combined to create the result of the event. Multi-passes events are thus less accurate because they need more frames to be monitored
The main goal of the interface between the kernel and mesa is to expose these complex hardware events to the userspace.
The first interface ("all-userspace" approach)
The main idea of this interface is to store the configuration of complex hardware events inside mesa. In this approach, the kernel only knows the list of 8-bits signals and exposes them with a unique string identifier, for example, the signal 0xcb on nva3 is associated to 'gr_idle' on the set 1. Then, the userspace can build complex events and send the configuration to the kernel through an ioctl call which allocates a NOUVEAU_PERFCTR_CLASS object. A NOUVEAU_PERFCTR_CLASS object is used to init, poll and read performance counters.
This interface is based on a set of softwared methods used to control performance counters. Basically, we first allocate a NOUVEAU_PERFCTR_CLASS object with the configuration (8-bits signal/function/mode …) of the counter. Then, before a frame is rendering (using the begin_query() hook of gallium) we send the handle of this object with a software method to start monitoring. At this time, the configuration is written to PCOUNTER and the counter starts to count hardware related activities. After the frame, we send a sequence number with an another software method to read out values using a notify buffer object which is allocated along the current channel. If you are interested, a previous post gives more details about that interface.
With this "all-userspace" approach, the kernel is not able to monitor complex hardware events because the configuration and the logic is stored in the userspace. Actually, the configuration is shared between the kernel and mesa. The kernel only knows 8-bits signals while the userspace knows the configuration of hardware events.
Perf also called perf_events, is a kernel-based interface for profiling Linux which is able to monitor performance counters like the number of instructions executed. Thus, if the configuration of hardware events is stored in the userspace, this will be a problem for exposing them in perf because we don't want to duplicate the configuration. I also talked with Daniel Vetter, the maintener of the i965 driver and the responsible of the major part of DRM, and he seems to be agree with the idea that it could be good to expose hardware events in perf.
We also have an another problem related to muxs because the userspace knows the configuration while the kernel does not. So, the kernel has to check address of muxs in order to avoid security issues.
The last problem is that the interface is closely based on the perfmon engine, so if perfmon changes in the future, this will require to add a new interface. But, we don't want to add another driver private ioctl or design a new interface in case of perfmon must be evolved in the future. However, with the "all-kernelspace" approach we don't have this problem since the kernel knows the logic and only exposes a list of monitorable events.
However, the "all-userspace" approach has the advantages to reduce the amount of code in the kernel and to facilitate the configuration of counters since all the logic is located in the userspace.
If you are interested you can take a look at the code :
mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_pcounter_pm
libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfctr_class
nouveau source code: https://github.com/hakzsam/nouveau/commits/expose_perfctr_class
The second interface ("all-kernelspace" approach)
This interface is kernel-based like Perf. The configuration and the logic (except multi-pass events which need two frames) are stored in the kernel only. The kernel exposes a list of monitorable events. Thus, the userspace just has to allocate a NOUVEAU_PERFEVENT_CLASS used to init, read and poll complex hardware events.
Like the previous interface, this is one is also based on a set of software methods used to control performance counters. The behaviour is almost the same than before except that we allocate a NOUVEAU_PERFEVENT_CLASS object which represents a complex hardware event instead of a NOUVEAU_PERFCTR_CLASS.
With this approach it's easy to monitor complex hardware events inside Nouveau and to expose them to Perf in the future. Also, there is no security issues because muxs are configured from and by the kernel, we don't have to check their address.
Since, the kernel only exposes a list of events and stores the configuration, pefmon can change without any impacts to the interface between the kernel and the userspace in the future. Basically, the userspace only knows the name of events, and some flags used to do scheduling. However, it's hard to expose to the userspace what events are monitorable simultaneously or not.
On nv50/Tesla, we have 8 domains (or sets) and 4 counters per domain. Thus, if all complex events only use one counter per domain, we can monitor 32 events simultaneously. Good! But actually not… Because some events use 2 counters per domain. To handle this case, the userspace can retrieve the number of available domains and the number of counters per domain through an ioctl call. Then, we expose the domain ID and the number of counters needed by an event. With this information, we can schedule events from the userspace. But we still have one problem, how to handle the case where two events on the same domain share a mux?
Some events are multiplexed but two or more events can use the same mux with a different value. To handle this special case, we expose conflicts to the userspace using some 64 bits flags. Thus, the userspace just has to do a AND comparison to check if two events can be monitored simultaneously.
The source code of this "all-kernelspace" version is available below :
mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_kernelspace_version
libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfevent_class
nouveau source code: https://github.com/hakzsam/nouveau/commits/nv50_kernelspace_version
What is the best approach ? pros & cons
- reduce the amount of code in the kernel
- easy to apply logic of performance counters
- not possible to monitor complex hardware events inside Nouveau and perf (Linux)
- configuration of counters is shared between the userspace (complex events) and the kernelspace (8-bits signals)
- possible security issues (the kernel must know address of muxs to check queries)
- the interface (and the userspace) must be changed if perfmon changes in the future
- possible to monitor complex hardware events inside Nouveau and perf (Linux)
- configuration and logic (except multi-pass events) are stored in the kernel only
- no security issues (muxs are configured by the kernel)
- perfmon can evolve without any impacts regarding the interface since it only exposes a list of events
- add more code in the kernel
- hard to expose to the userspace what events are monitorable simultaneously or not
These two interfaces have different pros and cons, but in my opinion, I think the "all-kernelspace" is more elegant and more future-proof since we can monitor complex hardware events inside Nouveau and expose them to perf (Linux) .
To sum up, we still have to choose one version of the interface between the kernel and mesa. I'll talk about this with Ben Skeggs, the maintener of Nouveau to get his opinion. We hope to get the code upstream in september or october, and before Linux 3.19.
Have a good day!
10 Aug 2014 10:04pm GMT
08 Aug 2014
At Collabora, we're always on the lookout for cool opportunities involving Wayland and we noticed recently that Mozilla had started to show some interest in porting Firefox to Wayland. In short, the Wayland display server is becoming very popular for being lightweight, versatile yet powerful and is designed to be a replacement for X11. Chrome and Webkit already got Wayland ports and we think that Firefox should have that too.
Some months ago, we wrote a simple proof-of-concept basically starting from actual Gecko's GTK3 paths and stripping all the MOZ_X11 ifdefs out of the way. We did a bunch of quick hacks fixing broken stuff but rather easily and quickly (couple days), we got Firefox to run on Weston (Wayland official reference compositor). Ok, because of hard X11 dependencies, keyboard input was broken and decorations suffered a little, but that's a very good start! Take a look at the below screenshot :)
08 Aug 2014 1:44pm GMT
07 Aug 2014
Over the past few months, working at Collabora, I have helped Mozilla get rid of Xlib surfaces for content on Linux platform. This task was the primary problem keeping Mozilla from turning OpenGL layers on by default on Linux, which is one of their long-term goals. I'll briefly explain this long-term goal and will thereafter give details about how I got rid of Xlib surfaces.
LONG-TERM GOAL - Enabling Skia layers by default on Linux
My work integrated into a wider, long-term goal that Mozilla currently has : To enable Skia layers by default on Linux (Bug 1038800). And for a glimpse into how Mozilla initially made Skia layers work on linux, see bug 740200. At the time of writing this article, Skia layers are still not enabled by default because there are some open bugs about failing Skia reftests and OMTC (off-main-thread compositing) not being fully stable on linux at the moment (Bug 722012). Why is OMTC needed to get Skia layers on by default on linux ? Simply because by design, users that choose OpenGL layers are being grandfathered OMTC on Linux… and since the MTC (main-thread compositing) path has been dropped lately, we must tackle the OMTC bugs before we can dream about turning Skia layers on by default on Linux.
For a more detailed explanation of issues and design considerations pertaining turning Skia layers on by default on Linux, see this wiki page.
MY TASK - Getting rig of Xlib surfaces for content
Xlib surfaces for content rendering have been used extensively for a long time now, but when OpenGL got attention as a means to accelerate layers, we quickly ran into interoperability issues between XRender and Texture_From_Pixmap OpenGL extension… issues that were assumed insurmountable after initial analysis. Also, and I quote Roc here, "We [had] lots of problems with X fallbacks, crappy X servers, pixmap usage, weird performance problems in certain setups, etc. In particular we [seemed] to be more sensitive to Xrender implementation quality that say Opera or Webkit/GTK+." (Bug 496204)
So for all those reasons, someone had to get rid of Xlib surfaces, and that someone was… me ;)
So problem was to get rid of Xlib surfaces (gfxXlibSurface) for content under Linux/GTK platform and implicitly, of course, replace them with Image surfaces (gfxImageSurface) so they become regular memory buffers in which we can render with GL/gles and from which we can composite using GPU. Now, it's pretty easy to force creation of Image surfaces (instead of Xlib ones) for just all content layers in gecko gfx/layers framework, just force gfxPlatformGTK::CreateOffscreenSurfaces(…) to create gfxImageSurfaces in any case.
Problem is, naively doing so gives rise to a series of perf. regressions and sub-optimal paths being taken, for example, to copy image buffers around when passing them across process boundaries, or unnecessary copying when compositing under X11 with Xrender support. So the real work was to fix everything after having pulled the gfxXlibSurface plug ;)
First glitch on the way was that GTK2 theme rendering, per design, *had* to happen on Xlib surfaces. We didn't have much choice as to narrow down our efforts to the GTK3 branch alone. What's nice with GTK3 on that front is that it makes integral use of cairo, thus letting theme rendering happen on any type of cairo_surface_t. For more detail on that decision, read this.
Upfront, we noticed that the already implemented GL compositor was properly managing and buffering image layer contents, which is a good thing, but on the way, we saw that the 'basic' compositor did not. So we started streamlining basic compositor under OMTC for GTK3.
The core of the solution here was about implementing server-side buffering of layer contents that were using image backends. Since targetted platform was Linux/GTK3 and since Xrender support is rather frequent, the most intuitive thing to do was to subclass BasicCompositor into a new X11BasicCompositor and make it use a new specialized DataTextureSource (that we called X11DataTextureSourceBasic) that basically buffers upcoming layer content in ::Update() to an gfxXlibSurface that we keep alive for the TextureSource lifetime (unless surface changes size and/or format).
Performance results were satisfying. For 64 bit systems, we had around 75% boost in tp5o_shutdown_paint, 6% perf gain for 'cart', 14% for 'tresize', 33% for tscrollx and 12% perf gain on tcanvasmark.
For complete details about this effort, design decisions and resulting performance numbers, please read the corresponding bugzilla ticket.
To see the code that we checked-in to solve this, look at those 2 patches :
07 Aug 2014 8:21pm GMT
- If you have an orientation sensor in your laptop that works under Windows 8, this tool might be of interest to you.
- Mattias will use that code as a base to add Compass support to Geoclue (you're on the hook!)
- I've made a hack to load games metadata using Grilo and Lua plugins (everything looks like nail when you have a hammer ;)
- I've replaced a Linux phone full of binary blobs by another Linux phone full of binary blobs
- I believe David Herrmann missed out on asking for a VT, and getting something nice in return.
- Cosimo will be writing some more animations for me! (and possibly for himself)
- I now know more about core dumps and stack traces than I would want to, but far less than I probably will in the future.
- Get Andrea to approve Timm Bädert's git account so he can move Corebird to GNOME. Don't forget to try out Charles, Timm!
- My team won FreeFA, and it's not even why I'm smiling ;)
- The cathedral has two towers!
Unfortunately for GUADEC guests, Bretzel Airlines opened its new (and first) shop on Friday, the last days of the BoFs.
(Lovely city, great job from Alexandre, Nathalie, Marc and all the volunteers, I'm sure I'll find excuses to come back :)
07 Aug 2014 6:39pm GMT
So with the 3.16 kernel out of the door it's time to look at what's queued up for the Intel graphics driver in 3.17.
This release features the universal plane support from Matt Roper, all enabled already by default. This is prep work for atomic modesetting and pageflipping support: Since a while we support additional (overlay) planes in the DRM core and the i915 driver, but there have always been two implicit planes directly attached to the CRTC: The primary plane used by the SetCrtc and PageFlip functions, and the optional cursor support. But with the atomic ioctl these implicit planes it's easier to handle everything as an explicit plane, so Matt's patches split them away into separate real plane objects. This is a nice cleanup of the kms api in general since a lot of SoC hardware has unified plane hardware, where cursor, primary plane and any overlays are fully interchangeable. So we already expose this to userspace, if it sets the corresponding feature flag.
Another big feature on the display side is the improved PSR support, which is now enabled by default on Haswell and Broadwell. The tricky bit with PSR (and also with FBC) and the reason we didn't yet enable this by default is correctly support legacy frontbuffer rendering (for example for X). The hardware provides a bit of support to do that, but it doesn't catch all possible frontbuffer rendering and has a lot of other limitations. To finally fix this for real we've added accurate frontbuffer tracking in software. This should finally allow us to enable a lot of display power saving features by default like PSR on Baytrail, FBC (on all platforms) and DRRS (dynamic refresh rate switching).
On actual platform display enabling we have lots of improvements all over: Baytrail MIPI DSI support has greatly stabilized, backlight and power sequencer fixes, mmio based flips to work around issues with stalls and hangs for blitter ring based flips and plenty of other work. The core drm pieces for plane rotation support have also landed, unfortunately the i915 parts didn't make the cut for 3.17.
Another big area, as usual, has been general power management improvements. We now support runtime PM for DPMS Off and not just when the output is completely disabled. This was fairly invasive work since our current modesetting code assumed that a DPMS Off/On cycle will not destroy register state, but that's exactly what runtime PM can do. On the plus side this reorganization greatly cleaned up the code base and prepared the driver for atomic modesetting, which requires a similar separation between state computation and actual hw state updating like this feature.
Jesse Barnes implemented S0ix support for system suspend/resume. Marketing has some crazy descriptions for this, but essentially this means that we use the same power saving knobs for system suspend as for runtime PM - the entire machine is still running, just at a very low power state. Long-term this should simplify our system suspend code a bit since we can just reuse all the code used to implement runtime PM.
Moving on to the render side of the gpu there have been again improvements to the rps code. Chris Wilson further tuned the rps boost logic, and Ville and Deepak implemented rps support for Cherrytrail.
Jesse contributed ppgtt support for Baytrail which will be a lot more interesting once we enable full ppgtt again (hopefully in 3.18).
For Broadwell semaphores support from Ben and Rodrigo was merged, but it looks like we need to disable that again due to stability issues. Oscar Mateo also implemented a large pile of interrupt handling improvements which hopefully address the small races and bugs we've had in the past on some platforms. There's also a lot of refactoring patches to prepare for execlist support from Oscar. Excelists are the new way of submitting work to the gpu, first supported on Broadwell (but not yet mandatory). The key feature compared to legacy ringbuffer submission is that we'll finally be able to preempt gpu tasks.
And as usual there have been tons of bugsfixes and improvements all over. Oh and: User mode setting has moved one step further on the path to deprecation and is now fully disabled. If no one complains about this we can finally rip out all that code in one of the next kernel releases.
07 Aug 2014 3:36pm GMT
04 Aug 2014
In April, when Solaris 11.2 Beta was released, I posted a list of changes to bundled software packages between Solaris 11.1 & 11.2. Now that the final release of Solaris 11.2 reached General Availability last week, I've gone back to compare the beta release via the GA release.
As you would expect, there are many fewer changes in the three months between beta & GA than in the 18 months before that. Most of the change came from upgrading the OpenStack packages from the Grizzly (2013.1) release to the Havana (2013.2) release, and adding the Swift OpenStack Object Storage components and other packages like Django which the new OpenStack components needed. There are also some general bug fix or security fix updates, such as upgrading OpenSSL from 1.0.1g to 1.0.1h.
One other change that showed up when gathering data for this list was that the Oracle Database 12c prerequisites package was renamed between beta & GA to better match the database naming style - previously it was called group/prerequisite/oracle/oracle-rdbms-server-12cR1-preinstall but is now group/prerequisite/oracle/oracle-rdbms-server-12-1-preinstall. Fortunately, you don't have to type in the whole FMRI to install it, pkg install oracle-rdbms-server-12-1-preinstall is enough.
Detailed list of changes
This table shows most of the changes to the bundled packages between the 11.2 beta released in April, and the 11.2 GA release in July.
As before, some were excluded for clarity, or to reduce noise and duplication. All of the bundled packages which didn't change the version number in their packaging info are not included, even if they had updates to fix bugs, security holes, or add support for new hardware or new features of Solaris.
|Package||Upstream||11.2 Beta||11.2 GA|
(Java SE 7u55)
(Java SE 7u65)
(Java SE 8u5)
(Java SE 8u11)
|library/python/cffi||Python CFFI||not included||0.8.2|
|library/python/six||pypi six||not included||1.6.1|
(Java SE 7u55)
(Java SE 7u65)
(Java SE 8u5)
(Java SE 8u11)
04 Aug 2014 5:13pm GMT
A bit more than a year ago, I ordered a Geeksphone Peak, one of the first widely available Firefox OS phones to explore this new OS.
Those notes are probably not very useful on their own, but they might give a few hints to stuck Android developers.
The device has a Qualcomm Snapdragon S4 MSM8225Q SoC, which uses the Adreno 203 and a 540x960 Protocol A (4 touchpoints) touchscreen.
The Adreno 203 (Note: might have been 205) is not supported by Freedreno, and is unlikely to be. It's already a couple of generations behind the latest models, and getting a display working on this device would also require (re-)writing a working panel driver.
At least the CPU is an ARMv7 with a hardware floating-point (unlike the incompatible ARMv6 used by the Raspberry Pi), which means that much more software is available for it.
Getting a shell
Start by installing the android-tools package, and copy the udev rules file to the correct location (it's mentioned with the rules file itself).
Then, on the phone, turn on the developer mode. Plug it in, and run "adb devices", you should see something like:
$ adb devices
List of devices attached
Now run "adb shell" and have a browse around. You'll realise that the kernel, drivers, init system, baseband stack, and much more, is plain Android. That's a good thing, as I could then order Embedded Android, and dive in further.
If you're feeling a bit restricted by the few command-line applications available, download an all-in-one precompiled busybox, and push it to the device with "adb push".
You can also use aafm, a simple GUI file manager, to browse around.
Getting a Fedora chroot
After formatting a MicroSD card in ext4 and unpacking a Fedora system image in it, I popped it inside the phone. You won't be able to use this very fragile script to launch your chroot just yet though, as we lack a number of kernel features that are required to run Fedora. You'll also note that this is an old version of Fedora. There are probably newer versions available around, but I couldn't pinpoint them while writing this article.
Runnning Fedora, even in a chroot, on such a system will allow us to compile natively (I wouldn't try to build WebKit on it though) and run against a glibc setup rather than Android's bionic libc.
Let's recompile the kernel to be able to use our new chroot.
Avoiding the brick
Before recompiling the kernel and bricking our device, we'll probably want to make sure that we have the ability to restore the original software. Nothing worse than a bricked device, right?
First, we'll unlock the bootloader, so we can modify the kernel, and eventually the bootloader. I took the instructions from this page, but ignored the bits about flashing the device, as we'll be doing that a different way.
You can grab the restore image from my Fedora people page, as, as seems to be the norm for Android(-ish) devices makers to deny any involvement in devices that are more than a couple of months old. No restore software, no product page.
The recovery should be as easy as
$ adb reboot-bootloader
$ fastboot flash boot boot.img
$ fastboot flash system system.img
$ fastboot flash userdata userdata.img
$ fastboot reboot
$ export ARCH=arm
$ export CROSS_COMPILE=/usr/bin/arm-linux-gnu-
$ make C8680_defconfig
# Make sure that CONFIG_DEVTMPFS and CONFIG_EXT4_FS_SECURITY get enabled in the .config
$ adb shell 'cat /dev/block/mmcblk0p8 > /mnt/sdcard/disk.img'
$ adb pull /mnt/sdcard/disk.img
$ bootunpack boot.img
$ mkbootimg --kernel /path/to/kernel-source/out/arch/arm/boot/zImage --ramdisk p8.img-ramdisk.cpio.gz --base 0x200000 --cmdline 'androidboot.hardware=qcom loglevel=1' --pagesize 4096 -o boot.img
$ adb reboot-bootloader
$ fastboot flash boot boot.img
/dev/mmcblk1 / ext4 defaults 1 1
$ dracut -o "dm dmraid dmsquash-live lvm mdraid multipath crypt mdraid dasd zfcp i18n" initramfs.img
04 Aug 2014 2:02pm GMT
30 Jul 2014
or why publishing code is STEP ZERO.
If you've been developing code internally for a kernel contribution, you've probably got a lot of reasons not to default to working in the open from the start, you probably don't work for Red Hat or other companies with default to open policies, or perhaps you are scared of the scary kernel community, and want to present a polished gem.
If your company is a pain with legal reviews etc, you have probably spent/wasted months of engineering time on internal reviews and stuff, so think all of this matters later, because why wouldn't it, you just spent (wasted) a lot of time on it, so it must matter.
So you have your polished codebase, why wouldn't those kernel maintainers love to merge it.
Then you publish the source code.
Oh, look you just left your house. The merging of your code is many many miles distant and you just started walking that road, just now, not when you started writing it, not when you started legal review, not when you rewrote it internally the 4th time. You just did it this moment.
You might have to rewrite it externally 6 times, you might never get it merged, it might be something your competitors are also working on, and the kernel maintainers would rather you cooperated with people your management would lose their minds over, that is the kernel development process.
step zero: publish the code. leave the house.
(lately I've been seeing this problem more and more, so I decided to write it up, and it really isn't directed at anyone in particular, I think a lot of vendors are guilty of this).
30 Jul 2014 1:51am GMT
25 Jul 2014
Now that we have a few years of experience with the Wayland protocol, I thought I would put some of my observations in writing. This, what will hopefully become a series rather than just one post, considers how to design Wayland protocol extensions the right way.
This first post considers protocol object lifespan and the related races between the compositor/server and the client. I assume that the reader is already aware of the Wayland protocol basics. If not, I suggest reading Chapter 4. Wayland Protocol and Model of Operation.
How protocol objects are created
On a new Wayland connection, the only object that exists is the wl_display which is a specially constructed object. You always have it, and there is no wire protocol for creating it.
The only thing the client can create next is a wl_registry through the wl_display. Registry is the root of the whole interface (class) hierarchy. Wl_registry advertises the global objects by numerical name, and using wl_registry.bind request to bind to a global is the first normal way to create a protocol object.
Binding is slightly special still, as the protocol specification in XML for wl_registry uses the new_id argument type, but does not specify the interface (class) for the new object. In the wire protocol, this special argument gets turned into three arguments: interface name (string), interface version (uint32_t), and the new object ID (uint32_t). This is unique in the Wayland core protocol.
The usual way to create a new protocol object is for the client to send a request that has a new_id type of argument. The protocol specification (XML) defines what the interface is, so there is no need to communicate the interface type over the wire. All that is needed on the wire is the new object ID. Almost all object creation happens this way.
Although rare, also the server may create protocol objects for the client. This happens by having a new_id type of argument in an event. Every time the client receives this event, it receives a new protocol object.
As all requests and events are always part of some interface (like a member of a class), this creates an interface hierarchy. For example, wl_compositor objects are created from wl_registry, and wl_surface objects are created from wl_compositor.
Object creation never fails. Once the request or event is sent, the new objects it creates exists, period. This keeps the protocol asynchronous, as there is no need to reply or check that the creation succeeded.
How protocol objects are destroyed
There are two ways to destroy a protocol object. By far the most common one is to have a request in the interface that is specified to be a destructor. Most often this request is called "destroy". When the client code calls the function wl_foobar_destroy(), the request is sent to the server and the client side proxy (struct wl_proxy) for the object gets destroyed. The server then handles the destructor request at some point in the future.
The other way is to destroy the object by an event. In that case, no destructor must be defined in the interface's protocol specification, and the event must be clearly documented to be destructive as there is no automation nor safeties for this. This is for cases where the server decides when an object dies, and requires extreme care in protocol design to work right in all cases. When a client receives such an event, all it can do is destroy the proxy. The (in)famous example of an interface like this is wl_callback.
Enter the boogeyman: races
It is very important that both the client and the server agree on which protocol objects exist. If the client sends a request on, or references as an argument, an object that does not exist in the server's opinion, the server raises a protocol error, and disconnects the client. Obviously this should never happen, nor should it happen that the server sends an event to an object that the client destroyed.
Wayland being a completely asynchronous protocol, we have no implicit guarantees. The server may send an event at the same time as the client destroys the object, and now the event targets an object the client does not know about anymore. Rather than the client shooting itself dead (that's the server's job), we have a trick in libwayland-client: it silently ignores events to destroyed objects, until the server confirms that the object is truly gone.
This works very well for interfaces where the destructor is a request. If the client first sends the destructor request and then sends another request on the destroyed object, it just shot its own head off - no race needed.
Things get tricky for the other case, destructor events. The server may send the destructor event at the same time the client is sending a request on the same object. When the server finally gets the request, the object is already gone, and the client gets taken behind the shed and shot. Therefore pretty much the only safe way to use destructor events is if the interface does not define any requests at all. Ever, not even in future extensions. Furthermore, objects with that interface should not be used as arguments anywhere, or you may hit the race. That is why destructor events are difficult to use right.
The boogeyman's brother
There is yet another nasty race with events that create objects, i.e. server-created objects. If the client is destroying the (parent) object at the same time as the server is sending an event on that object, creating a new (child) object, the server cannot know if the client actually handled the event or not. If the client ignored the event, it will never tell the server to destroy that new object, and you leak in the server.
You could try to make your way out of that pitfall by writing in your protocol specification, that when the (parent) object is destroyed, all the child objects will be destroyed implicitly. But then the client must not send the destructor request for the child objects after it has destroyed the parent, because otherwise the server sees requests on objects it does not know about, and kicks you in the groin, hard. If the child interface defines a destructor, the client cannot destroy its proxies after destroying the parent object. If the child interface does not define a destructor, you can never free the server-side resources until the parent gets destroyed.
The client could destroy all the child objects with a defined destructor in one go, and then immediately destroy the parent object. I am not sure if that works, but it might. If it does not, you have to specify a whole tear-down protocol sequence. The client tells the server it wants to destroy the parent object, the server acks and guarantees it no longer sends any events on it, then the client actually destroys the parent object. Hey, you have a round-trip and just turned a beautiful asynchronous protocol into synchronous, congratulations!
Concluding with recommendations
Here are my recommendations when designing Wayland protocol extensions:
- Always make sure there is a guaranteed way to destroy all objects. This may sound obvious, but we have fixed several cases in the Wayland core protocol where there was no way to destroy a created protocol object such, that all resources on both server and client side could be freed. And there are still some cases not fixed.
- Always define a destructor request. If you have any doubt whether your new interface needs a destructor request, just put it there. It is more awkward to add later than normal requests. If you do not have one, the client cannot tell the server to free those protocol object resources.
- Do not use destructor events. They are hard to design right, and extending the interface later will be a bitch. The client cannot tell the server to free the resources, so objects with destructor events should be short-lived, and the destruction must be guaranteed.
- Do not use server-side created objects without a serious thought. Designing the destruction sequence such that it never leaks nor explodes is tricky.
25 Jul 2014 7:01pm GMT
23 Jul 2014
DRI3 has plenty of necessary fixes for X.org and Wayland, but it's still young in its integration. It's been integrated in the upcoming Fedora 21, and recently in Arch as well.
If WebKitGTK+ applications hang or become unusably slow when an HTML5 video is supposed to be, you might be hitting this bug.
If Totem crashes on startup, it's likely this problem, reported against cogl for now.
Feel free to add a comment if you see other bugs related to DRI3, or have more information about those.
Update: Wayland is already perfect, and doesn't use DRI3. The "DRI2" structures in Mesa are just that, structures. With Wayland, the DRI2 protocol isn't actually used.
23 Jul 2014 12:18pm GMT