26 Oct 2016


Bastien Nocera: Dual-GPU integration in GNOME

Thanks to the work of Hans de Goede and many others, dual-GPU (aka NVidia Optimus or AMD Hybrid Graphics) support works better than ever in Fedora 25.

On my side, I picked up some work I originally did for Fedora 24 but that ended up being blocked by hardware support. This brings better integration into GNOME.

The Details Settings panel now shows which video cards you have in your (most likely) laptop.

dual-GPU Graphics

The second feature is what Blender users and 3D gamers have been waiting for: a contextual menu item to launch the application on the more powerful GPU in your machine.

Mooo Powaa!

This demonstration uses a slightly modified GtkGLArea example, which shows in its title bar which of the GPUs is used to render the application.

on the integrated GPU

on the discrete GPU

Behind the curtain

Behind those two features, we have a simple D-Bus service which runs automatically on boot and stays running to offer a single property (HasDualGpu) that system components can use to detect what UI to present. This requires the "switcheroo" driver to work on the machine in question.
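For system components written in C, reading that property is straightforward with GDBus. The bus name, object path and interface below are assumptions modelled on switcheroo-control rather than something spelled out in this post, so treat this as a minimal sketch:

#include <gio/gio.h>

int
main (void)
{
  GError *error = NULL;
  GDBusProxy *proxy;
  GVariant *value;

  /* Service/interface names are assumptions; adjust them to whatever
   * the D-Bus service actually exports on your system. */
  proxy = g_dbus_proxy_new_for_bus_sync (G_BUS_TYPE_SYSTEM,
                                         G_DBUS_PROXY_FLAGS_NONE,
                                         NULL,
                                         "net.hadess.SwitcherooControl",
                                         "/net/hadess/SwitcherooControl",
                                         "net.hadess.SwitcherooControl",
                                         NULL, &error);
  if (proxy == NULL)
    {
      g_printerr ("Could not contact the switcheroo service: %s\n", error->message);
      g_error_free (error);
      return 1;
    }

  /* The property is cached by the proxy, so this does not block. */
  value = g_dbus_proxy_get_cached_property (proxy, "HasDualGpu");
  g_print ("HasDualGpu: %s\n",
           value && g_variant_get_boolean (value) ? "yes" : "no");

  if (value)
    g_variant_unref (value);
  g_object_unref (proxy);
  return 0;
}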

Because of the way applications are launched on the discrete GPU, we cannot currently support D-Bus activated applications, but GPU-heavy D-Bus-integrated applications are few and far between right now.

Future plans

There's plenty more to do in this area, to polish the integration. We might want applications to tell us whether they'd prefer being run on the integrated or discrete GPU, as live switching between renderers is still something that's out of the question on Linux.

Wayland dual-GPU support, as well as support for the proprietary NVidia drivers are also things that will be worked on, probably by my colleagues though, as the graphics stack really isn't my field.

And if the hardware becomes more widely available, we'll most certainly want to support hardware with hotpluggable graphics support (whether gaming laptop "power-ups" or workstation docks).


All the patches necessary to make this work are now available in GNOME git (targeted at GNOME 3.24), and backports are integrated in Fedora 25, due to be released shortly.

26 Oct 2016 1:37pm GMT

24 Oct 2016


Nicolai Hähnle: Compiling shaders: dynamically uniform variables and "convergent" intrinsics

There are some program transformations that are obviously correct when compiling regular single-threaded or even multi-threaded code, but that cannot be used for shader code. For example:

v = texture(u_sampler, texcoord);
if (cond) {
    gl_FragColor = v;
} else {
    gl_FragColor = vec4(0.);
}

... cannot be transformed to ...

if (cond) {
    // The implicitly computed derivative of texcoord
    // may be wrong here if neighbouring pixels don't
    // take the same code path.
    gl_FragColor = texture(u_sampler, texcoord);
} else {
    gl_FragColor = vec4(0.);
}

... but the reverse transformation is allowed.

Another example is:

if (cond) {
    v = texelFetch(u_sampler[1], texcoord, 0);
} else {
    v = texelFetch(u_sampler[2], texcoord, 0);
}

... cannot be transformed to ...

v = texelFetch(u_sampler[cond ? 1 : 2], texcoord, 0);
// Incorrect, unless cond happens to be dynamically uniform.

... but the reverse transformation is allowed.

Using GL_ARB_shader_ballot, yet another example is:

bool cond = ...;
uint64_t v = ballotARB(cond);
if (other_cond) {
    use(v);
}

... cannot be transformed to ...

bool cond = ...;
if (other_cond) {
    // Here, ballotARB returns 1-bits only for threads/work items
    // that take the if-branch.
    uint64_t v = ballotARB(cond);
    use(v);
}

... and the reverse transformation is also forbidden.

These restrictions are all related to the GPU-specific SPMD/SIMT execution model, and they need to be taught to the compiler. Unfortunately, we partially fail at that today.

Here are some types of restrictions to think about (each of these restrictions should apply on top of any other restrictions that are expressible in the usual, non-SIMT-specific ways, of course):

  1. An instruction can be moved from location A to location B only if B dominates or post-dominates A.

    This restriction applies e.g. to instructions that take derivatives (like in the first example) or that explicitly take values from neighbouring threads (like in the third example). It also applies to barrier instructions.

    This is LLVM's convergent function attribute as I understand it.

  2. An instruction can be moved from location A to location B only if A dominates or post-dominates B.

    This restriction applies to the ballot instruction above, but it is not required for derivative computations or barrier instructions.

    This is in a sense dual to LLVM's convergent attribute, so it's co-convergence? Divergence? Not sure what to call this.

  3. Something vague about not introducing additional non-uniformity in the arguments of instructions / intrinsic calls.

    This last one applies to the sampler parameter of texture intrinsics (for the second example), to the ballot instruction, and also to the texture coordinates on sampling instructions that implicitly compute derivatives.

For the last type of restriction, consider the following example:

uint idx = ...;
if (idx == 1u) {
    v = texture(u_sampler[idx], texcoord);
} else if (idx == 2u) {
    v = texture(u_sampler[idx], texcoord);
}

... cannot be transformed to ...

uint idx = ...;
if (idx == 1u || idx == 2u) {
    v = texture(u_sampler[idx], texcoord);
}

In general, whenever an operation has this mysterious restriction on its arguments, then the second restriction above must apply: we can move it from A to B only if A dominates or post-dominates B, because only then can we be certain that the move introduces no non-uniformity. (At least, this rule applies to transformations that are not SIMT-aware. A SIMT-aware transformation might be able to prove that idx is dynamically uniform even without the predication on idx == 1u or idx == 2u.)

However, the control flow rule is not enough:

 v1 = texture(u_sampler[0], texcoord);
v2 = texture(u_sampler[1], texcoord);
v = cond ? v1 : v2;

... cannot be transformed to ...

v = texture(u_sampler[cond ? 0 : 1], texcoord);

The transformation does not break any of the CFG-related rules, and it would clearly be correct for a single-threaded program (given the knowledge that texture(...) is an operation without side effects). So the CFG-based restrictions really aren't sufficient to model the real set of restrictions that apply to the texture instruction. And it gets worse:

 v1 = texelFetch(u_sampler, texcoord[0], 0);
v2 = texelFetch(u_sampler, texcoord[1], 0);
v = cond ? v1 : v2;

... is equivalent to ...

v = texelFetch(u_sampler, texcoord[cond ? 0 : 1], 0);

After all, texelFetch computes no implicit derivatives.

Calling the three kinds of restrictions 'convergent', 'co-convergent', and 'uniform', we get:

texture(uniform sampler, uniform texcoord)    convergent (co-convergent)
texelFetch(uniform sampler, texcoord, lod)    (co-convergent)
ballotARB(uniform cond)                       convergent co-convergent
barrier()                                     convergent

For the texturing instructions, I put 'co-convergent' in parentheses because these instructions aren't inherently 'co-convergent'. The attribute is only there because of the 'uniform' function argument.

Actually, looking at the examples, it seems that co-convergent only appears when a function has a uniform argument. Then again, the texelFetch function can be moved freely in the CFG by a SIMT-aware pass that can prove that the move doesn't introduce non-uniformity to the sampler argument, so being able to distinguish functions that are inherently co-convergent (like ballotARB) from those that are only implicitly co-convergent (like texture and texelFetch) is still useful.

For added fun, things get muddier when you notice that in practice, AMDGPU doesn't even flag texturing intrinsics as 'convergent' today. Conceptually, the derivative-computing intrinsics need to be convergent to ensure that the texture coordinates for neighbouring pixels are preserved (as in the very first example). However, the AMDGPU backend does register allocation after the CFG has been transformed into the wave-level control-flow graph. So register allocation automatically preserves neighbouring pixels even when a texture instruction is sunk into a location with additional control-flow dependencies.

When we reach a point where vector register allocation happens with respect to the thread-level control-flow graph, then texture instructions really need to be marked as convergent for correctness. (This change would be beneficial overall, but is tricky because scalar register allocation must happen with respect to the wave-level control flow graph. LLVM currently wants to allocate all registers in one pass.)

24 Oct 2016 7:36pm GMT

Eric Anholt: This week in vc4 (2016-10-24): HDMI audio, DSI, simulator, linux-next

The most interesting work I did last week was starting to look into HDMI audio. I've been putting this off because really I'm a 3D developer, not modesetting, and I'm certainly not an audio developer. But the work needs to get done, so here I am.

HDMI audio on SOCs looks like it's not terribly hard. VC4's HDMI driver needs to instantiate a platform device for audio when it gets loaded, and attach a function table to it for calls to be made when starting/stopping audio. Then I write a sound driver that will bind to the node that VC4 created, accepting a parameter of which register to DMA into. That sound driver will tell ALSA about what formats it can accept and actually drive the DMA engine to write audio frames into HDMI. On the VC4 HDMI side, we do a bit of register setup (clocks, channel mappings, and the HDMI audio infoframe) when the sound driver tells us it wants to start.
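A rough sketch of that shape in kernel C is below. The device name, callback table and register field are made up for illustration; the real vc4 code may structure this differently:

#include <linux/platform_device.h>
#include <linux/types.h>
#include <linux/err.h>

/* Callbacks the HDMI driver hands to a future ASoC card driver. */
struct vc4_hdmi_audio_ops {
	int  (*startup)(struct device *hdmi_dev);
	void (*shutdown)(struct device *hdmi_dev);
	int  (*hw_params)(struct device *hdmi_dev,
			  unsigned int sample_rate, unsigned int channels);
	u32  audio_data_reg;	/* register the sound driver should DMA into */
};

static int vc4_hdmi_register_audio_device(struct device *hdmi_dev,
					   const struct vc4_hdmi_audio_ops *ops)
{
	struct platform_device *pdev;

	/* The ops struct is copied into the child's platform_data, where a
	 * sound driver binding against "vc4-hdmi-audio" can pick it up. */
	pdev = platform_device_register_data(hdmi_dev, "vc4-hdmi-audio",
					     PLATFORM_DEVID_AUTO,
					     ops, sizeof(*ops));
	return PTR_ERR_OR_ZERO(pdev);
}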

It all sounds pretty straightforward so far. I'm currently at the "lots of typing" stage of development of this.

Second, I did a little poking at the DSI branch again. I've rebased onto 4.9 and tested an idea I had for how DSI might be failing. Still no luck. There's code on drm-vc4-dsi-panel-restart-cleanup-4.9, and the length of that name suggests how long I've been bashing my head against DSI. I would love for anyone out there with the inclination to do display debug to spend some time looking into it. I know the history there is terrible, but I haven't prioritized cleanup while trying to get the driver working in the first place.

I also finished the simulator cleanup from last week, found a regression in the major functional improvement in the cleanup, and landed what worked. Dave Airlie pointed out that he took a different approach for testing with virgl called vtest, and I took a long look at that. Now that I've got my partial cleanup in place, the vtest-style simulator wrapper looks a lot more doable, and if it doesn't have the major failing that swrast did with visuals then it should actually make vc4 simulation more correct.

I landed a very tiny optimization to Mesa that I wrote while debugging DEQP failures.

Finally, I put together the -next branches for upstream bcm2835 now that 4.9-rc1 is out. So far I've landed a major cleanup of pinctrl setup in the DT (started by me and finished by Gerd Hoffmann from Red Hat, to whom I'm very grateful) and a little USB setup fix. I did a lot of review of VCHIQ patches for staging, and sort of rewrote Linus Walleij's RFC patch for getting descriptive names for the GPIOs in lsgpio.

24 Oct 2016 6:32pm GMT

Hans de Goede: Switchable / Hybrid Graphics support in Fedora 25

Recently I've been working on improving hybrid graphics support for the upcoming Fedora 25 release. Although Fedora 25 Workstation will use Wayland by default for its GNOME 3 desktop, my work has been on hybrid gfx support under X11 (Xorg), as GNOME 3 on Wayland does not yet support hybrid gfx.

So no Wayland yet, but there are still a lot of noticeable hybrid gfx improvements, and users of laptops with hybrid gfx using the open-source drivers should have a much smoother user experience than before. Here is an (incomplete) list of generic improvements:

Besides this, a lot of work has been done on fixing hybrid gfx issues in the modesetting driver. This is important since in Fedora we use the modesetting driver on Skylake and newer Intel integrated gfx, as well as on Maxwell and newer Nvidia discrete GPUs. The following issues have been fixed in the modesetting driver (and thus on laptops with a Skylake CPU and/or a Maxwell or newer GPU):

Note: this coming Wednesday (October 26th) we're having a Fedora Better Switchable Graphics Test Day. If you have a laptop with hybrid gfx, please join us to help further improve hybrid gfx support.

24 Oct 2016 12:56pm GMT

20 Oct 2016


Iago Toral: OpenGL terrain renderer update: Bloom and Cascaded Shadow Maps

Bloom Filter

The bloom filter makes bright objects glow and bleed through other objects positioned between them and the camera. It is a common post-processing effect used all the time in video games and animated movies. The demo supports a couple of configuration options that control the intensity and behavior of the filter; here are some screenshots with different settings:

Bloom filter Off
Bloom filter On, default settings
Bloom filter On, intensity increased

I particularly like the glow effect that this brings to the specular reflections on the water surface, although to really appreciate that you need to run the demo and see it in motion.

Cascaded Shadow Maps

I should really write a post about basic shadow mapping before going into the details of Cascaded Shadow Maps, so for now I'll just focus on the problem they try to solve.

One of the problems with shadow mapping is rendering high resolution shadows, especially for shadows that are rendered close to the camera. Generally, basic shadow mapping provides two ways in which we can improve the resolution of the shadows we render:

1. Increase the resolution of the shadow map textures. This one is obvious but comes at a high performance (and memory) hit.

2. Reduce the distance at which we can render shadows. But this is not ideal of course.

One compromise solution is to notice that, as usual with 3D computer graphics, it is far more important to render nearby objects in high quality than distant ones.

Cascaded Shadow Maps allow us to use different levels of detail for shadows that are rendered at different distances from the camera. Instead of having a single shadow map for all the shadows, we split the viewing frustum into slices and render shadows in each slice to a different shadow map.
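To make the slicing concrete, here is one common way to compute the split distances of the cascade, blending uniform and logarithmic schemes. This is just a sketch of the general technique; the demo may pick its slices differently:

#include <math.h>

/* Fills splits[0..num_cascades-1] with the far distance of each cascade slice.
 * lambda = 0 gives uniform splits, lambda = 1 gives logarithmic splits. */
static void
compute_cascade_splits(float *splits, int num_cascades,
                       float near_plane, float far_plane, float lambda)
{
   for (int i = 1; i <= num_cascades; i++) {
      float s = (float) i / num_cascades;
      float log_split = near_plane * powf(far_plane / near_plane, s);
      float lin_split = near_plane + (far_plane - near_plane) * s;

      splits[i - 1] = lambda * log_split + (1.0f - lambda) * lin_split;
   }
}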

There are two immediate benefits of this technique:

1. We have flexibility to define the resolution of the shadow maps for each level of the cascade, allowing us, for example, to increase the resolution of the levels closest to the camera and maybe reduce those that are further away.

2. Each level only records shadows in a slice of the viewing frustum, which increases shadow resolution even if we keep the same texture resolution we used with the original shadow map implementation for each shadow map level.

This approach also has some issues:

1. We need to render multiple shadow maps, which can be a serious performance hit depending on the resolutions of the shadow maps involved. This is why we usually lower the resolution of the shadow maps as distance from the camera increases.

2. As we move closer to or further from shadowed objects we can see the changes in shadow quality pop-in. Of course we can control this by avoiding drastic quality changes between consecutive levels in the cascade.

Here is an example that illustrates the second issue (in this case I have lowered the resolution of the 2nd and 3rd cascade levels to 50% and 25% respectively so that the effect is more obvious). The screenshots show the rendering of the shadows at different distances. We can see how the shadows in the close-up shot are very sharp, and as the distance increases they become blurrier due to the use of a lower resolution shadow map:

CSM level 0 (4096×4096)
CSM level 1 (2048×2048)
CSM level 2 (1024×1024)

The demo supports up to 4 shadow map levels although the default configuration is to use 3. The resolution of each level can be configured separately too, in the default configuration I lowered the shadow resolution of the second and third levels to 75% and 50% respectively. If we configure the demo to run on a single level (with 100% texture resolution), we are back to the original shadow map implementation, so it is easy to experiment with both techniques.

I intend to cover the details behind shadow mapping and the implementation of the bloom filter in more detail in a future post, so again, stay tuned for more!

20 Oct 2016 10:32am GMT

18 Oct 2016


Eric Anholt: This week in vc4 (2016-10-18): X performance, DEQP, simulation

The most visible change this week is a fix to Mesa for texture upload performance. A user had reported that selection rectangles in LXDE's file manager were really slow. I brought up sysprof, and it showed that we were doing uncached reads from the GPU in a situation that should have been entirely write-combined writes.

The bug was that when the size of the texture wasn't aligned to utiles, we were loading the previous contents into the temporary buffer before writing the new texture data in and uploading, even if the full texture was being updated. I've fixed it to check for when the full texture is being uploaded, and not do the initial download. This bug was getting hit on almost any Cairo vector graphics operation with its Xlib backend, so hopefully this helps a lot of people's desktops.

I also worked on a cleanup of the simulator mode. I use the closed source simulator regularly as part of my work -- it's fairly accurate to hardware behavior, and allows you to trace what's happening when things go wrong, all with the driver code running like "normal" for apps on my x86 desktop.

However, my simulator support is a little invasive to the driver, replacing vc4 ioctl calls with alternative ioctls to the i915 driver. I started on a series to make vc4_simulator.c take plain VC4 ioctls and translate them, so that the simulator code is entirely contained. The last step I think I have is figuring out which level to put the simulator's copying in and out of window system framebuffers at.

Last, I got DEQP's GLES2 tests up and running on Raspberry Pi. These are approximately equivalent to the official conformance tests. The initial results were quite good -- 18630/19098 (97.5%) passing when run in the piglit framework. I found a couple of fixes to be made in glClear() support, one of which will affect all gallium drivers. Of the remainder, there are a few tests that are failing due to test bugs (we expose extensions that allow features the tests don't expect), but most are failing in register allocation. For the register allocation failures, I see a relatively quick fix that should reduce register pressure in loops.

18 Oct 2016 6:59pm GMT

Julien Danjou: Running an open source and upstream oriented team in agile mode

For the last 3 years, I've been working in the OpenStack Telemetry team at eNovance, and then at Red Hat. Our mission is to maintain the OpenStack Telemetry stack, both upstream and downstream (i.e. inside Red Hat products). Besides the technical challenges, the organization of the team has always played a major role in our accomplishments.

Here, I'd like to share some of my hindsight with you, faithful readers.

Meet the team

The team I work in changed a bit during those 3 years, but the core components have always been the same: a few software engineers, a QE engineer, a product owner, and an engineering manager. That means the team size has always been between 6 and 8 people.

I cannot emphasize enough how important team size is. Not having more than 8 people in a team fits with the two-pizza rule from Jeff Bezos, which turned out to be key in our team composition.

The group dynamic in teams no bigger than this is excellent. It offers the possibility to know and connect with everyone - each team member has only up to 7 people to talk to on a daily basis, which means only 28 communication axes between people. Having a team of e.g. 16 people means 120 different links in your team. Double your team size, and you multiply your communication overhead by 4. My experience shows that the fewer communication axes you have in a team, the less overhead you will have and the swifter your team will be.

With all team members being remote workers, it is even more challenging to build relationships and bond. We had the opportunity to get to know each other during the OpenStack summit twice a year, and doing regular video-conferences via Google Hangout or BlueJeans really helped.

The atmosphere you set up in your team will also forge the outcome of your team's work. Run your team with trust, peace and humor (remember I'm on the team 🤣) and awesome things will happen. Run your team with fear, pressure, and finger-pointing, and nothing good will happen.

There's little chance that when a team is built, everyone will be on the same level. We were no exception: we had more and less experienced engineers. But the most experienced engineers took the time needed to invest in and mentor the less experienced ones. That also helped to build trust and communication links between members of the team. And over the long run, everyone gets more efficient: the less experienced engineers are getting better and the more experienced can delegate a lot of stuff to their fellows.

Then they can chill or work on bigger stuff. Win-win.

It's actually not that different from the way you should run an open source team, as I already claimed in a previous article on FOSS projects management.

Practicing agility

I might be bad at practicing agility: contrary to many people, I don't see agility as a set of processes. I see it as a state of mind, as a team organization based on empowerment. No more, no less.

And each time I meet people and explain that our team is "agile", they start shivering, explaining how they hate sprints, daily stand-ups, scrum, and planning poker, that this is all a waste of time and energy.

Well, it turns out that you can be agile without all of that.

Planning poker

In our team, we tried at first to run 2-week sprints and used planning poker to schedule our user stories from our product backlog (= todo list). It never worked as expected.

First, most people had the feeling of losing their time because they already knew exactly what they were supposed to do. If they had any doubt, they would just go and talk to the product owner or another fellow engineer.

Secondly, some stories were really specialized and only one of the team members was able to understand them in detail and evaluate them. So most of the time, a lot of the team members playing planning poker would just vote a random number based on the length of the story teller's explanation. For example, if an engineer said "I just need to change that flag in the configuration file" then everyone would vote 1. If they started rambling for 5 minutes about how the configuration option is easy to switch, but there might be other things to change at the same time, things to check whose impact might be bigger than expected, and code refactoring to do, then most people would just announce a score of 13 on that story, just because the guy talked for several minutes straight and everything sounded complicated and out of their scope.

That meant that the poker score had no meaning to us. We never managed to have a number of points that we knew we could accomplish during a sprint (the team velocity as they call it).

The only benefit that we identified from planning poker, in our case, is that it forces people to sit down and communicate about a user story. Though, it turned out that making people communicate was not a problem we needed to solve in our team, so we decided to stop doing it. But it can be a pretty good tool to get people talking to each other.

Therefore, the 2-week sprint never made much sense, as we were unable to schedule our work reliably. Furthermore, doing most of our daily job in open source communities, we were unable to schedule anything. When sending patches to an upstream project, you have no clue when they will be reviewed. What you know for sure is that in order to maximize your code merge throughput with this high latency of code review, you need to parallelize your patch submission a lot. So as soon as you receive some feedback from your reviewers, you need to (almost) drop everything, rework your code and resubmit it.

There's no need to explain why this absolutely does not work with a sprint approach. Most of the scrum framework relies on the fact that you own your workflow from top to bottom, which is far from being true when working in open source communities.

Daily stand-up meetings

We used to run a stand-up meeting every day, then every other day. Doing that remotely kills the stand-up part, obviously, so there is less guarantee that the meeting will be short. Considering all team members work remotely in different time zones, with some freedom to organize their schedules, it was very difficult to synchronize those meetings. With members spread from the US to Eastern Europe, the meeting was in the middle of the afternoon for me. I found it frustrating to have to stop my activities in the middle of every afternoon to chat with my team. We all know the cost of context switching to us, humans.

So we drifted from our 10 minutes daily meeting to a one-hour weekly meeting with the whole team. It's way easier to synchronize for a large chunk of time once a week and to have this high-throughput communication channel.

Our (own) agile framework

Drifting from the original scrum implementation, we ended up running our own agility framework. It turned out to have similarities with kanban - you don't always have to invent new things!

Our main support is a Trello board that we share with the whole team. It consists of different columns, where we put cards representing small user stories or simple to-do items. Each column represents the state of a card, and we move cards from left to right:

We started to automate some of our Trello workflow in a tool called Trelloha. For example, it allows us to track upstream patches sent through Gerrit or GitHub and tick the checkbox items in any card when those are merged.

We actually don't put much effort into our Trello board. It's just a slightly organized chaos, as are upstream projects. We use it as a lightweight system for taking notes, organizing our thoughts and letting others know what we're doing and why we're doing it. That's where Trello is wonderful, because using it has very low friction: creating, updating and moving a card is a one-click operation.

One bias of most engineers is to overthink and over-engineer their workflow, trying to rationalize it. Most of the time, they end up automating everything, which means building processes and bureaucracy. It just slows things down and builds frustration for everyone. Just embrace chaos and spend time on what matters.

Most of the things we do are linked to external Launchpad bugs, Gerrit reviews or GitHub issues. That means the cards in Trello carry very little information, as everything happens outside, in the wild Internet of open source communities. This is very important as we need to avoid any kind of retention of knowledge and information from contributors outside the company. This also makes sure that our internal way of running does not leak outside and (badly) influence outside communities.


We also run a retrospective every 2 weeks, which might be the only thing we kept from the scrum practice. It's actually a good opportunity for us to share our feelings, concerns or jokes. We used to do it using the six thinking hats method, but it slowly faded away. In the end, we now use a different Trello board with those columns:

All teammates fill the board with the cards they want, and everyone is free to add themselves to any card. We then run through each card and let the people who added their name to it talk about it. The column "Action Items" is usually filled as we speak and discover things we should do. We can then move cards created there to our regular board, in the To Do column.

Central communication

Sure, people have different roles in a team, but we dislike bottlenecks and single points of failure. Therefore, we are using an internal mailing list where we ask people to send their requests and messages. If people send things related to our team's job to one of us personally, we just forward or Cc the list when replying, so everyone is aware of what one might be discussing with people external to the team.

This is very important, as it emphasizes that no team member should be considered special. Nobody owns more information and knowledge than others, and anybody can jump into a conversation if they have valuable knowledge to share.

The same applies for our internal IRC channel.

We also make sure that we discuss only company-specific things on this list or on our internal IRC channel. Everything that can be public and is related to upstream is discussed on external communication channels (IRC, upstream mailing lists, etc). This is very important to make sure that we are not blocking anybody outside Red Hat from joining us and contributing to the projects or ideas we work on. We also want to make sure that people working in our company are no more special than other contributors.


We're pretty happy with our set-up right now, and the team has been running pretty smoothly for a few months. We're still trying to improve, and having a general sense of trust among team members makes sure we can openly speak about whatever problem we might have.

Feel free to share your feedback and own experience of running your own teams in the comment section.

18 Oct 2016 5:21pm GMT

13 Oct 2016


José Fonseca: Apitrace maintenance

Lots of individuals and companies have made substantial contributions to Apitrace. But maintenance has always rested with me, the original author.

I would have preferred to share that responsibility with a wider team, but things haven't turned out that way. For several reasons, I suppose:

Apitrace has always been something I worked on in my spare time, or whenever I had an itch to scratch. That is still true, with the exception that after having a kid I have scarcely any free time left.

Furthermore, the future is not bright: I believe Apitrace will have a long life in graphics driver test automation, and perhaps whenever somebody needs to debug an old OpenGL application, but I doubt it will flourish beyond that. And this fact weighs in whenever I need to decide whether to spend some time on Apitrace vs everything else.

The end result is that I haven't been a responsive maintainer for some time (taking a long time to merge patches, provide feedback, resolve issues, etc), and I'm afraid that will continue for the foreseeable future.

I don't feel any obligation to do more (after all, the license does say the software is provided as is), but I do want to set the right expectations to avoid frustrating users and contributors who might otherwise expect timely feedback, hence this post.

13 Oct 2016 9:50pm GMT

Iago Toral: OpenGL terrain renderer: rendering the terrain mesh

In this post I'll discuss how I set up and render the terrain mesh in the OpenGL terrain rendering demo. Most of the relevant code for this is in the ter-terrain.cpp file.

Setting up a grid of vertices

Unless you know how to use a 3D modeling program properly, a reasonable way to create a decent mesh for a terrain consists of using a grid of vertices and elevating them according to a height map image. In order to create the grid we only need to decide how many rows and columns we want. This, in the end, determines the number of polygons and the resolution of the terrain.

We need to map these vertices to world coordinates too. We do that by defining a tile size, which is the distance between consecutive vertices in world units. Larger tile sizes increase the size of the terrain but lower the resolution by creating larger polygons.

The image below shows an 8×6 grid that defines 35 tiles. Each tile is rendered using 2 triangles:

8×6 terrain grid

The next step is to elevate these vertices so we don't end up with a boring flat surface. We do this by sampling the height map image for each vertex in the grid. A height map is a gray scale image where the values of the pixels represent altitudes at different positions. The closer the color is to white, the more elevated it is.

Heightmap image

Adding more vertices to the grid increases the number of sampling points from the height map and reduces the sampling distances, leading to a smoother and more precise representation of the height map in the resulting terrain.

Sampling the heightmap to compute vertex heights

Of course, we still need to map the height map samples (gray scale colors) to altitudes in world units. In the demo I do this by normalizing the color values to [-1,+1] and then applying a scale factor to compute the altitude values in world space. By playing with the scaling factor we can make our terrain look more or less abrupt.

Altitude scale=6.0
Altitude scale=12.0

For reference, the height map sampling is implemented in ter_terrain_set_heights_from_texture().
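In code, the mapping from a height map sample to a world-space altitude boils down to something like the following sketch (names are mine, not the demo's):

#include <stdint.h>

/* Map an 8-bit height map sample to a world-space altitude.
 * 'scale' is the altitude scale factor mentioned above (e.g. 6.0 or 12.0). */
static float
height_from_sample(uint8_t sample, float scale)
{
   float normalized = (sample / 255.0f) * 2.0f - 1.0f;   /* [0,255] -> [-1,+1] */
   return normalized * scale;
}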

Creating the mesh

At this point we know the full position (x, y, z) in world coordinates of all the vertices in our grid. The next step is to build the actual triangle mesh that we will use to render the terrain and the normal vectors for each triangle. This process is described below and is implemented in the ter_terrain_build_mesh() function.

Computing normals

In order to get nice lighting on our terrain we need to compute normals for each vertex in the mesh. A simple way to achieve this would be to compute the normal for each face (triangle) and use that normal for each vertex in the triangle. This works, but it has 3 problems:

1. Every vertex of each triangle has the same exact normal, which leads to a rather flat result.

2. Adjacent triangles with different orientations showcase abrupt changes in the normal value, leading to significantly different lighting across the surfaces that highlight the individual triangles in the mesh.

3. Because each vertex in the mesh can have a different normal value for each triangle it participates in, we need to replicate the vertices when we render, which is not optimal.

Alternatively, we can compute the normals for each vertex considering the heights of its neighboring vertices. This solves all the problems mentioned above and leads to much better results thanks to the interpolation of the normal vectors across the triangles, which leads to smooth lighting reflection transitions:

Flat normals
Smooth normals

The implementation for this is in the function calculate_normal(), which takes the column and row indices of the vertex in the grid and calculates the normal by sampling the heights of the 4 nearby vertices in the grid.
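A sketch of that idea using central differences is shown below. It assumes an interior vertex in a row-major heights[] array with tile_size spacing; the actual calculate_normal() in ter-terrain.cpp may differ in the details:

#include <math.h>

static void
vertex_normal(const float *heights, int cols, int col, int row,
              float tile_size, float out[3])
{
   /* Heights of the 4 neighbouring vertices (assumes 0 < col < cols - 1
    * and 0 < row < rows - 1; edges need clamping). */
   float hl = heights[row * cols + (col - 1)];
   float hr = heights[row * cols + (col + 1)];
   float hd = heights[(row - 1) * cols + col];
   float hu = heights[(row + 1) * cols + col];

   /* Central differences give the slope along X and Z; the normal is the
    * (normalized) vector perpendicular to both tangents. */
   float nx = (hl - hr) / (2.0f * tile_size);
   float nz = (hd - hu) / (2.0f * tile_size);
   float ny = 1.0f;
   float len = sqrtf(nx * nx + ny * ny + nz * nz);

   out[0] = nx / len;
   out[1] = ny / len;
   out[2] = nz / len;
}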

Preparing the draw call

Now that we know the positions of all the vertices and their normal vectors we have all the information that we need to render the terrain. We still have to decide how exactly we want to render all the polygons.

The simplest way to render the terrain using a single draw call is to setup a vertex buffer with data for each triangle in the mesh (including position and normal information) and use GL_TRIANGLES for the primitive of the draw call. This, however, is not the best option from the point of view of performance.

Because the terrain will typically contain a large number of vertices and most of them participate in multiple triangles, we end up uploading a large amount of vertex data to the GPU and processing a lot of vertices in the draw call. The result is large memory requirements and suboptimal performance.

For reference, the terrain I used in the demo from my original post used a 251×251 grid. This grid represents 250×250 tiles, each one rendered as two triangles (6 vertices/tile), so we end up with 250x250x6=375,000 vertices. For each of these vertices we need to upload 24 bytes of vertex data with the position and normal, so we end up with a GPU buffer that is almost 9MB large.

One obvious way to reduce this is to render the terrain using triangle strips. The problem with this is that, in theory, we can't render the terrain with just one strip; we would need one strip (and so, one draw call) per tile column, or one strip per tile row. Fortunately, we can use degenerate triangles to link the separate strips for each column into a single draw call. With this we trim down the number of vertices to 126,000 and the size of the buffer to a bit below 3 MB. This alone produced a 15%-20% performance increase in the demo.

We can do better though. A lot of the vertices in the terrain mesh participate in various triangles across the large triangle strip in the draw call, so we can reduce memory requirements by using an index buffer to render the strip. If we do this, we trim things down to 63,000 vertices and ~1.5MB. This added another 4%-5% performance bonus over the original implementation.
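To make the degenerate-triangle trick concrete, here is a sketch of how the index buffer for such a single strip can be built for a cols x rows vertex grid. This is illustrative code rather than the demo's exact implementation; with cols = rows = 251 it produces the roughly 126,000 indices mentioned above:

/* Build a single triangle strip covering a cols x rows vertex grid (row-major),
 * joining the per-row strips with degenerate triangles.  Returns the number of
 * indices written; 'indices' must have room for 2*cols*(rows-1) + 2*(rows-2). */
static unsigned
build_terrain_strip_indices(unsigned *indices, unsigned cols, unsigned rows)
{
   unsigned n = 0;

   for (unsigned row = 0; row < rows - 1; row++) {
      for (unsigned col = 0; col < cols; col++) {
         indices[n++] = row * cols + col;         /* vertex on this row */
         indices[n++] = (row + 1) * cols + col;   /* vertex on the next row */
      }

      if (row < rows - 2) {
         /* Degenerate triangles: repeat the last index of this strip and the
          * first index of the next one so the bridge renders as zero-area
          * triangles and winding order is preserved. */
         indices[n++] = (row + 1) * cols + (cols - 1);
         indices[n++] = (row + 1) * cols;
      }
   }

   return n;
}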


So far we have been rendering the full mesh of the terrain in each frame and we do this by uploading the vertex data to the GPU just once (for example in the first frame). However, depending on where the camera is located and where it is looking at, just a fraction of the terrain may be visible.

Although the GPU will discard all the geometry and fragments that fall outside the viewport, it still has to process each vertex in the vertex shader stage before it can clip non-visible triangles. Because the number of triangles in the terrain is large, this is suboptimal and to address this we want to do CPU-side clipping before we render.

Doing CPU-side clipping comes with some additional complexities though: it requires that we compute the visible region of the terrain and upload new vertex data to the GPU in each frame, all while preventing GPU stalls.

In the demo, we implement the clipping by computing a quad sub-region of the terrain that includes the visible area that we need to render. Once we know the sub-region that we want to render, we compute the new indices of the vertices that participate in the region so we can render it using a single triangle strip. Finally, we upload the new index data to the index buffer for use in the follow-up draw call.

Avoiding GPU stalls

Although all the above is correct, it actually leads, as described, to much worse performance in general. The reason for this is that our uploads of vertex data in each frame lead to frequent GPU stalls. This happens in two scenarios:

1. In the same frame, because we need to upload different vertex data for the rendering of the terrain for the shadow map and the scene (the shadow map renders the terrain from the point of view of the light, so the visible region of the terrain is different). This creates stalls because the rendering of the terrain for the shadow map might not have completed before we attempt to upload new data to the index buffer in order to render the terrain for the scene.

2. Between different frames. Because the GPU might not be completely done rendering the previous frame (and thus, still needs the index buffer data available) before we start preparing the next frame and attempt to upload new terrain index data for it.

In the case of the Intel Mesa driver, these GPU stalls can be easily identified by using the environment variable INTEL_DEBUG=perf. When using this, the driver will detect these situations and produce warnings informing about the stalls, the buffers affected and the regions of the buffers that generate the stall, such as:

Stalling on glBufferSubData(0, 503992) (492kb) to a busy (0-1007984)
buffer object.  Use glMapBufferRange() to avoid this.

The solution to this problem that I implemented (other than trying to put as much work as possible between read/write accesses to the index buffer) comes in two forms:

1. Circular buffers

In this case, we allocate a larger buffer than we need so that each subsequent upload of new index data happens in a separate sub-region of the allocated buffer. I set up the demo so that each circular buffer is large enough to hold the index data required for all updates of the index buffer happening in each frame (the shadow map and the scene).

2. Multi-buffering

We allocate more than one circular buffer. When we don't have enough free space at the end of the current buffer to upload the new index buffer data, we upload it to a different circular buffer instead. When we run out of buffers we circle back to the first one (which at this point will hopefully be free to be re-used again).

So why not just use a single, very large circular buffer? Mostly because there are limits to the size of the buffers that the GPU may be able to handle correctly (or efficiently). Also, why not having many smaller independent buffers instead of circular buffers? That would work just fine, but using fewer, larger buffers reduces the number of objects we need to bind/unbind and is better to prevent memory fragmentation, so that's a plus.
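Here is a small sketch of how that combination of circular buffers and multi-buffering can look with plain glBufferSubData(). Buffer count and sizes are arbitrary placeholder values, not the demo's, and GL function loading is omitted:

#include <GL/gl.h>
#include <stddef.h>

#define NUM_INDEX_BUFFERS 2
#define INDEX_BUFFER_SIZE (512 * 1024)

struct index_buffer_ring {
   GLuint buffers[NUM_INDEX_BUFFERS];
   unsigned current;     /* circular buffer currently being filled */
   size_t offset;        /* next free byte in the current buffer */
};

/* Uploads 'size' bytes of index data, binds the buffer that received them and
 * returns the byte offset to pass to glDrawElements() via the indices pointer. */
static size_t
index_ring_upload(struct index_buffer_ring *ring, const void *data, size_t size)
{
   size_t offset;

   if (ring->offset + size > INDEX_BUFFER_SIZE) {
      /* Not enough room left: switch to the next circular buffer, which is
       * hopefully no longer in use by the GPU by now. */
      ring->current = (ring->current + 1) % NUM_INDEX_BUFFERS;
      ring->offset = 0;
   }

   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ring->buffers[ring->current]);
   glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, ring->offset, size, data);

   offset = ring->offset;
   ring->offset += size;
   return offset;
}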

Final touches

We are almost done, at this point we only need to add a texture to the terrain surface, add some slight fog effect for distant pixels to create a more realistic look, add a skybox (it is important to choose the color of the fog so it matches the color of the sky!) and tweak the lighting parameters to get a nice result:

Final rendering

I hope to cover some of these aspects in future posts, so stay tuned for more!

13 Oct 2016 2:38pm GMT

11 Oct 2016


Eric Anholt: This week in vc4 (2016-10-11): QPU user shaders, Processing performance

Last week I started work on making hello_fft work with the vc4 driver loaded.

hello_fft is a demo program for doing FFTs using the QPUs (the shader core in vc4). Instead of drawing primitives into a framebuffer like a GL pipeline does, though, it uses the User QPU pipeline, which just hands a uniform stream to a QPU shader instance.

Originally I didn't build a QPU user shader ioctl for vc4 because there's no good way to use this pipeline of the VC4. The hardware isn't quite capable enough to be exposed as OpenCL or GL compute shaders (some research had been done into this, and the designers' conclusion was that the memory access support wasn't quite good enough to be useful). That leaves you with writing VC4 shaders in raw assembly, which I don't recommend.

The other problem for vc4 exposing user shaders is that, since the GPU sits directly on main memory with no MMU in between, GPU shader programs could access anywhere in system memory. For 3D, there's no need (in GL 2.1) to do general write back to system memory, so the kernel reads your shader code and rejects shaders if they try to do it. It also makes sure that you bounds-check all reads through the uniforms or texture sampler. For QPU user shaders, though, the expected mode of using them is to have the VPM DMA units do general loads and stores, so we'd need new validation support.

If it was just this one demo, we might be willing to lose support for it in the transition to the open driver. However, some modes of accelerated video decode also use QPU shaders at some stage, so we have to have some sort of solution.

My plan is basically to not do validation and have root-only execution of QPU shaders for now. The firmware communication driver captures hello_fft's request to have the firmware pass QPU shaders to the V3D hardware, and it redirects it into VC4. VC4 then maintains the little queue of requests coming in, powers up the hardware, feeds them in, and collects the interrupts from the QPU shaders for when they're done (each job request is required to "mov interrupt, 1" at the end of its last shader to be run).

Dom is now off experimenting in what it will take to redirect video decode's QPU usage onto this code.

The other little project last week was fixing Processing's performance on vc4. It's one of the few apps that is ported to the closed driver, so it's a likely thing to be compared on.

Unfortunately, Processing was clearing its framebuffers really inefficiently. For non-tiled hardware it was mostly OK, with just a double-clear of the depth buffer, but for vc4 its repeated glClear()s (rather than a single glClear with all the buffers to be cleared set) were triggering flushes and reloads of the scene, and in one case a render of a full screen quad (and a full screen load before doing so) rather than fast clearing.

The solution was to improve the tracking of what buffers are cleared and whether any primitives have been drawn yet, so that I can coalesce sets of repeated or partial clears together while only updating the colors to be cleared. There's always a tension in this sort of optimization: Should the GL driver be clever to work around apps behaving badly, or should it push that off to the app developer (more like Vulkan does)? In this case, tiled renderer behavior is sufficiently different from non-tiled renderers, and enough apps will hit this path, that it's well worth it. For Processing, I saw an improvement of about 10% on the demo I was looking at.

Sadly, Processing won't be a good comparison for open vs closed drivers even after these fixes. For the closed driver Processing uses EGL, which says that depth buffers become undefined on eglSwapBuffers. In its default mode, though, processing uses GLX, which says that the depth buffer is retained on glXSwapBuffers(). To avoid this extra overhead in its GLX mode, you'd need to use glDiscardFramebufferEXT() right before the swap. Unfortunately, this isn't connected in gallium yet but shouldn't be hard to do once we have an app that we can test.
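If an app wanted to opt in to that on GLX, the call would look roughly like the sketch below, assuming GL_EXT_discard_framebuffer were exposed (function pointer loading via glXGetProcAddress is omitted); on modern desktop GL, glInvalidateFramebuffer() plays the same role:

#include <GL/glx.h>

/* Tell a tiled driver the depth buffer doesn't need to be stored after this
 * frame, then swap.  GL_DEPTH_EXT is the attachment token the extension
 * defines for the default framebuffer's depth buffer. */
static void
swap_buffers_discarding_depth(Display *dpy, GLXDrawable drawable)
{
   const GLenum attachments[] = { GL_DEPTH_EXT };

   glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, attachments);
   glXSwapBuffers(dpy, drawable);
}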

11 Oct 2016 8:28pm GMT

07 Oct 2016


Daniel Vetter: Neat drm/i915 Stuff for 4.8

I procrastinated rather badly on this one, so instead of going out around the previous kernel release, this post arrives with the v4.8 release already out of the door. Read on for my slightly more terse catch-up report.

Since I'm this late I figured instead of the usual comprehensive list I'll do something new and just list some of the work that landed in 4.8, but with a bit more focus on the impact and why things have been done.

Midlayers, Be Gone!

The first thing I want to highlight is the driver de-midlayering. In the Linux kernel community the mid-layer mistake or helper library design pattern (see the linked article from LWN) is a set of rules for designing subsystems and common support code for drivers. The underlying rule is that the driver itself must be in control of everything, like allocating memory and handling all requests. Common code is only shared in helper library functions, which the driver can call if they are suitable. The reason for that is that there is always some hardware which needs special treatment, and when you have a special case and there's a midlayer, it will get in the way.

Due to the shared history with BSD kernels, DRM originally had a full-blown midlayer, but over time this has been fixed. For example, kernel modesetting was designed from the start with the helper library pattern. The last holdout is the device structure itself, and for the Intel driver this is now fixed. This has two main benefits:

Thundering Herds

GPUs process rendering asynchronously, and sometimes the CPU needs to wait for them. For this purpose there's a wait queue in the driver. Userspace processes block on it until the interrupt handler wakes them up. The trouble is that thus far there was just one wait queue per engine, which means every time the GPU completed something all waiters had to be woken up. Then they checked whether the work they needed to wait for had completed, and if not, they blocked on the wait queue again until the next batch job completed. That's all rather inefficient. On top of that, there's only one per-engine knob to enable interrupts, which means that even if there was only one waiting process, it was woken for every completed job. And GPUs can have a lot of jobs in flight.

In summary, waiting for the GPU worked more like a frantic herd trampling all over things instead of something orderly. To fix this the request and completion tracking was entirely revamped, to make sure that the driver has a much better understanding of what's going on. On top there's now also an efficient search structure of all current waiting processes. With that the interrupt handler can quickly check whether the just completed GPU job is of interest, and if so, which exact process should be woken up.
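As a simplified illustration of the idea (not the actual i915 code), the interrupt handler can walk a seqno-ordered tree of waiters and wake only the tasks whose request has actually completed:

#include <linux/rbtree.h>
#include <linux/sched.h>

struct gpu_waiter {
	struct rb_node node;	/* kept sorted by seqno */
	u32 seqno;		/* request this task waits for */
	struct task_struct *task;
};

static void wake_completed_waiters(struct rb_root *waiters, u32 completed)
{
	struct rb_node *n = rb_first(waiters);

	while (n) {
		struct gpu_waiter *w = rb_entry(n, struct gpu_waiter, node);

		/* Signed delta handles seqno wrap-around. */
		if ((s32)(completed - w->seqno) < 0)
			break;	/* tree is ordered: nothing later has completed */

		n = rb_next(n);
		/* Removal from the tree is left to the woken task. */
		wake_up_process(w->task);
	}
}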

But this wasn't just done to make the driver more efficient. Better tracking of pending and completed GPU requests is an important foundation for implementing proper GPU scheduling on top. And it's also needed to interface the completion tracking with other drivers, to finally fix tearing for multi-GPU machines. Having a thundering herd in your own backyard is unsightly, letting it loose on your neighbours is downright bad! A lot of this follow-up work already landed for the 4.9 kernel, hence I will talk more about this in a future installment of this series.

07 Oct 2016 12:00pm GMT

06 Oct 2016


Aleksander Morgado: Naming devices in ModemManager


No more "which is now the index of this modem…?"

DBus object path and index

When modems are detected by ModemManager and exposed in DBus, they are assigned a unique DBus object path, with a common prefix and a unique index number, e.g.:

/org/freedesktop/ModemManager1/Modem/0

This path is the one used by the mmcli command line tool to operate on a modem, so users can identify the device by the full path or just by the index, e.g. these two calls are totally equivalent:

$ mmcli -m /org/freedesktop/ModemManager1/Modem/0
$ mmcli -m 0

This logic looks good, except for the fact that there isn't a fixed DBus object path for each modem detected: i.e. the index given to a device is the next one available, and if the device is power cycled or unplugged and replugged, a different index will be given to it.


Systems like NetworkManager handle this index change gracefully, just by assuming that the exposed device isn't the same one as the one exposed earlier with a different index. If settings need to be applied to a specific device, they will be stored associated with the EquipmentIdentifier property of the modem, which is the same across reboots (i.e. the IMEI for GSM/UMTS/LTE devices).

User-provided names

The 1.8 stable release of ModemManager will come with support for user-provided names assigned to devices. A use case for this new feature is, for example, custom systems where the user would like to assign a name to a device based on the USB port to which it is connected (e.g. assuming the USB hardware layout doesn't change across reboots).

The user can specify the names (UID, unique IDs) just by tagging in udev the physical device that owns all ports of a modem with the new ID_MM_PHYSDEV_UID property. These tags need to be applied before the ID_MM_CANDIDATE properties, and therefore the rules file should be named so that it comes before the 80-mm-candidate.rules one, for example something like this (the actual match will depend on your hardware layout):

$ cat /lib/udev/rules.d/78-mm-naming.rules

ACTION!="add|change|move", GOTO="mm_naming_rules_end"
# Illustrative rule only: match the physical device (here by its devpath,
# which is specific to your hardware) and assign it a custom UID.
DEVPATH=="/devices/pci0000:00/0000:00:1d.0/usb4/4-1", ENV{ID_MM_PHYSDEV_UID}="USB4"
LABEL="mm_naming_rules_end"

The value of the new ID_MM_PHYSDEV_UID property will be used in the Device property exposed in the DBus object, and can also be used directly in mmcli calls instead of the path or index, e.g.:

$ mmcli -m USB4
 System | device: 'USB4'
        | drivers: 'qmi_wwan, qcserial'
        | plugin: 'Sierra'
        | primary port: 'cdc-wdm2'

Given that the same property value will always be set for the modem in a specific device path, these user-provided names may unequivocally identify a specific modem even when the device is power-cycled, unplugged and replugged, or even the whole system rebooted.

Binding the property to the device path is just an example of what could be done. There is no restriction on the logic used to apply the ID_MM_PHYSDEV_UID property, so users may also choose other, different approaches.

This support is already in ModemManager git master, and as already said, will be included in the stable 1.8 release, whenever that is.

TL;DR? ModemManager now supports assigning unique names to devices that stay even across full system reboots.


06 Oct 2016 11:40am GMT

05 Oct 2016


Iago Toral: Source code for the OpenGL terrain renderer demo

I have been quite busy with various things in the last few weeks, but I have finally found some time to clean up and upload the code of the OpenGL terrain render demo to Github.

Since this was intended as a programming exercise I have not tried to be very elegant or correct during the implementation, so expect things like error handling to be a bit rough around the edges, but otherwise I think the code should be easy enough to follow.

Notice that I have only tested this on Intel GPUs. I know it works on NVIDIA too (thanks to Samuel and Chema for testing this) but there are a couple of rendering artifacts there, specifically at the edges of the skybox and some "pillars" showing up in the distance sometimes, probably because I am rendering one too many "rows" of the terrain and I end up rendering garbage. I may fix these some day.

The code I uploaded to the repository includes a few new features too:

Here is a new screenshot:


In future posts I will talk a bit about some of the implementation details, so having the source code around will be useful. Enjoy!

05 Oct 2016 12:25pm GMT

04 Oct 2016


Lennart Poettering: systemd.conf 2016 Over Now

systemd.conf 2016 is Over Now!

A few days ago systemd.conf 2016 ended, our second conference of this kind. I personally enjoyed this conference a lot: the talks, the atmosphere, the audience, the organization, the location, they all were excellent!

I'd like to take the opportunity to thank everybody involved. In particular I'd like to thank Chris, Daniel, Sandra and Henrike for organizing the conference, your work was stellar!

I'd also like to thank our sponsors, without which the conference couldn't take place like this, of course. In particular I'd like to thank our gold sponsor, Red Hat, our organizing sponsor Kinvolk, as well as our silver sponsors CoreOS and Facebook. I'd also like to thank our bronze sponsors Collabora, OpenSUSE, Pantheon, Pengutronix, our supporting sponsor Codethink and last but not least our media sponsor Linux Magazin. Thank you all!

I'd also like to thank the Video Operation Center ("VOC") for their amazing work on live-streaming the conference and making all talks available on YouTube. It's amazing how efficient the VOC is, it's simply stunning! Thank you guys!

In case you missed this year's iteration of the conference, please have a look at our YouTube Channel. You'll find all of this year's talks there, as well the ones from last year. (For example, my welcome talk is available here). Enjoy!

We hope to see you again next year, for systemd.conf 2017 in Berlin!

04 Oct 2016 10:00pm GMT

Eric Anholt: This week in vc4 (2016-10-03): HDMI, X testing and cleanups, VCHIQ

Last week Adam Jackson landed my X testing series, so now at least 3 of us pushing code are testing glamor with every commit. I used the series to test a cleanup of the Render extension related code in glamor, deleting 550 lines of junk. I did this while starting to work on using GL sampler objects, which should help reduce our pixmap allocation and per-draw Render overhead. That patch is failing make check so far.

The big project for the week was tackling some of the HDMI bug reports. I pulled out my bigger monitor that actually takes HDMI input, and started working toward getting it to display all of its modes correctly. We had an easy bug in the clock driver that broke changing to low resolutions: trying to drive the PLL too slow, instead of driving the PLL in spec and using the divider to get down to the clock we want. We needed infoframes support to fix purple borders on the screen. We were doing our interlaced video mode programming pretty much entirely wrong (it's not perfect yet, but 1 of the 3 interlaced modes on my monitor displays correctly and they all display something). Finally, I added support for double-clocked CEA modes (generally the low resolution, low refresh ones in your list; these are probably particularly interesting for RPi due to all the people running emulators).

Finally, one of the more exciting things for me was that I got general agreement from gregkh to merge the VCHIQ driver through the staging tree, and he even put the initial patch together for me. If we add in a couple more small helper drivers, we should be able to merge the V4L2 camera driver, and hopefully get closer to doing a V4L2 video decode driver for upstream. I've now tested a messy series that at least executes "vcdbg log msg." Those other drivers will definitely need some cleanups, though.

04 Oct 2016 12:28am GMT

03 Oct 2016


Julien Danjou: Gnocchi 3.0 release

After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped 3.0.0. It was very challenging, as we wanted to implement a few big changes in it.

Gnocchi is now using reno to its maximum and you can read the release notes of the 3.0 branch online. Some notes might be missing as it is our first release with it, but we are making good progress at writing changelogs for most of our user facing and impacting changes.

Therefore, I'll only write here about our big major feature that made us bump the major version number.

New storage engine

And so the most interesting thing that went into the 3.0 release is the new storage engine that has been built by me and Gordon Chung during those last months. The original approach to writing data in Gnocchi was really naive, so we have had an iterative improvement process since version 1.0, and we're getting close to something very solid.

This new version leverages several important features which increase performance by a large factor on Ceph (using write(offset) rather than read()+write() to append new points), our recommended back-end.

read+write (Swift and file drivers) vs offset (Ceph)

To summarize, since most data points are sent sequentially and ordered, we enhanced the data format to profit from that fact and be able to be appended without reading anything. That only works on Ceph though, which provides the needed features.
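In librados terms (Gnocchi itself is Python, so this is just a C sketch of the primitive involved), appending a batch of already-ordered points is a single positional write, with no need to read the object first:

#include <rados/librados.h>

/* Append 'len' bytes of serialized, time-ordered points at a known offset of
 * the object holding the time series.  rados_write() extends the object as
 * needed, so no read-modify-write cycle is required. */
static int
append_points(rados_ioctx_t ioctx, const char *object_name,
              const char *buf, size_t len, uint64_t offset)
{
    return rados_write(ioctx, object_name, buf, len, offset);
}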

We also enabled data compression on all storage drivers by enabling LZ4 compression (see my previous article and research on the subject), which obviously offers its own set of challenges when using append-only write. The results are tremendous and decrease data usage by a huge factor:

Disk size comparison between Gnocchi 2 and Gnocchi 3

The rest of the processing pipeline also has been largely improved:

Processing time of new measures
Compression time comparison

Overall, we're delighted with the performance improvements we achieved, and we're looking forward to making even more progress. Gnocchi is now one of the most performant and scalable timeseries databases out there.

Upcoming challenges

With that big change done, we're now heading toward a set of more lightweight improvements. Our bug tracker is a good place to learn what might be on our mind (check for the wishlist bugs).

Improving our API features and offering a better experience for those coming from outside the realm of OpenStack are now at the top of my priority list.

But let me know if there's any itch you'd like us to scratch, obviously. 😎

03 Oct 2016 3:58pm GMT