18 Sep 2020

Mike Blumenkrantz: Long Week

Blog Returns

Once again, I ended up not blogging for most of the week. When this happens, there's one of two possibilities: I'm either taking a break or I'm so deep into some code that I've forgotten about everything else in my life including sleep.

This time was the latter. I delved into the deepest parts of zink and discovered that the driver is, in fact, functioning only through a combination of sheer luck and a truly unbelievable amount of driver stalls that provide enough forced synchronization and slow things down enough that we don't explode into a flaming mess every other frame.

Oops.

I've fixed all of the crazy things I found, and, in the process, made some sizable performance gains that I'm planning to spend a while blogging about in considerable depth next week.

And when I say sizable, I'm talking in the range of 50-100% fps gains.

But it's Friday, and I'm sure nobody wants to just see numbers or benchmarks. Let's get into something that's interesting on a technical level.

Samplers

Yes, samplers.

In Vulkan, samplers have a lot of rules to follow. Specifically, I'm going to be examining part of the spec that states "If a VkImageView is sampled with VK_FILTER_LINEAR as a result of this command, then the image view's format features must contain VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT".

This is a problem for zink. Gallium gives us info about the sampler in the struct pipe_context::create_sampler_state hook, but the created sampler won't actually be used until draw time. As a result, there's no way to know which image is going to be sampled, and thus there's no way to know what features the sampled image's format flags will contain. This only becomes known at the time of draw.

The way I saw it, there were two options:

- defer the actual VkSampler creation until draw time, when the sampled image and its format features are known
- create two VkSampler objects up front for every LINEAR sampler, one with LINEAR filtering and one with NEAREST, and bind whichever one the sampled image supports at draw time

In theory, the first option is probably more performant in the best case scenario where a sampler is only ever used with a single image, as it would then only ever create a single sampler object.

Unfortunately, this isn't realistic. Just as an example, u_blitter creates a number of samplers up front, and it also makes assumptions about filtering based on ideal operations which may not be in sync with the underlying Vulkan driver's capabilities. So for these persistent samplers, the first option may initially allow the sampler to be created with LINEAR filtering, but the sampler may later be used for an image which can't support it.

So I went with the second option. Now any time a LINEAR sampler is created by gallium, we're actually creating both types so that the appropriate one can be used, ensuring that we can always comply with the spec and avoid any driver issues.
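To make that concrete, here's a rough sketch of what the create hook can do (the struct layout and helper usage here are illustrative, not the actual zink code):

struct zink_sampler_state {
   VkSampler sampler[2]; /* [0] = LINEAR, [1] = NEAREST fallback */
};

static void *
zink_create_sampler_state(struct pipe_context *pctx,
                          const struct pipe_sampler_state *state)
{
   struct zink_screen *screen = zink_screen(pctx->screen);
   struct zink_sampler_state *sampler = CALLOC_STRUCT(zink_sampler_state);
   VkSamplerCreateInfo sci = {
      .sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO,
      .magFilter = VK_FILTER_LINEAR,
      .minFilter = VK_FILTER_LINEAR,
      /* ...the remaining members translated from 'state'... */
   };
   vkCreateSampler(screen->dev, &sci, NULL, &sampler->sampler[0]);

   /* a second variant with filtering forced to NEAREST, for images whose
    * format lacks VK_FORMAT_FEATURE_SAMPLED_IMAGE_FILTER_LINEAR_BIT */
   sci.magFilter = sci.minFilter = VK_FILTER_NEAREST;
   vkCreateSampler(screen->dev, &sci, NULL, &sampler->sampler[1]);
   return sampler;
}

At draw time, the format features of the bound image view determine which of the two samplers gets used.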

Hooray.

18 Sep 2020 12:00am GMT

14 Sep 2020

Mike Blumenkrantz: Draw Parameters

Hoo Boy

Let's talk about ARB_shader_draw_parameters. Specifically, let's look at gl_BaseVertex.

In OpenGL, this shader variable's value depends on the parameters passed to the draw command, and the value is always zero if the command has no base vertex.

In Vulkan, the value here is only zero if the first vertex is zero.

The difference here means that for arrayed draws without base vertex parameters, GL always expects zero, and Vulkan expects first vertex.

Hooray.

Compatibilizing

The easiest solution here would be to just throw a shader key at the problem, producing variants of the shader for use with indexed vs non-indexed draws, and using NIR passes to modify the variables for the non-indexed case and zero the value. It's quick, it's easy, and it's not especially great for performance since it requires compiling the shader multiple times and creating multiple pipeline objects.

This is where push constants come in handy once more.

Avid readers of the blog will recall the last time I used push constants was for TCS injection when I needed to generate my own TCS and have it read the default inner/outer tessellation levels out of a push constant.

Since then, I've created a struct to track the layout of the push constant:

struct zink_push_constant {
   unsigned draw_mode_is_indexed;
   float default_inner_level[2];
   float default_outer_level[4];
};

Now just before draw, I update the push constant value for draw_mode_is_indexed:

if (ctx->gfx_stages[PIPE_SHADER_VERTEX]->nir->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)) {
   unsigned draw_mode_is_indexed = dinfo->index_size > 0;
   vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
                      offsetof(struct zink_push_constant, draw_mode_is_indexed), sizeof(unsigned),
                      &draw_mode_is_indexed);
}

And now the shader can be made aware of whether the draw mode is indexed.

Now comes the NIR, as is the case for most of this type of work.

static bool
lower_draw_params(nir_shader *shader)
{
   if (shader->info.stage != MESA_SHADER_VERTEX)
      return false;

   if (!(shader->info.system_values_read & (1ull << SYSTEM_VALUE_BASE_VERTEX)))
      return false;

   return nir_shader_instructions_pass(shader, lower_draw_params_instr, nir_metadata_dominance, NULL);
}

This is the future, so I'm now using Eric Anholt's recent helper function to skip past iterating over the shader's function/blocks/instructions, instead just passing the lowering implementation as a parameter and letting the helper create the nir_builder for me.

static bool
lower_draw_params_instr(nir_builder *b, nir_instr *in, void *data)
{
   if (in->type != nir_instr_type_intrinsic)
      return false;
   nir_intrinsic_instr *instr = nir_instr_as_intrinsic(in);
   if (instr->intrinsic != nir_intrinsic_load_base_vertex)
      return false;

I'm filtering out everything except for nir_intrinsic_load_base_vertex here, which is the instruction for loading gl_BaseVertex.

   
   b->cursor = nir_after_instr(&instr->instr);

I'm modifying instructions after this one, so I set the cursor after.

   nir_intrinsic_instr *load = nir_intrinsic_instr_create(b->shader, nir_intrinsic_load_push_constant);
   load->src[0] = nir_src_for_ssa(nir_imm_int(b, 0));
   nir_intrinsic_set_range(load, 4);
   load->num_components = 1;
   nir_ssa_dest_init(&load->instr, &load->dest, 1, 32, "draw_mode_is_indexed");
   nir_builder_instr_insert(b, &load->instr);

I'm loading the first 4 bytes of the push constant variable that I created according to my struct, which is the draw_mode_is_indexed value.

   nir_ssa_def *composite = nir_build_alu(b, nir_op_bcsel,
                                          nir_build_alu(b, nir_op_ieq, &load->dest.ssa, nir_imm_int(b, 1), NULL, NULL),
                                          &instr->dest.ssa,
                                          nir_imm_int(b, 0),
                                          NULL);

This adds a new ALU instruction of type bcsel, AKA the ternary operator (condition ? true : false). The condition here is another ALU of type ieq, AKA integer equals, and I'm testing whether the loaded push constant value is equal to 1. If true, this is an indexed draw, so I continue using the loaded gl_BaseVertex value. If false, this is not an indexed draw, so I need to use zero instead.

   nir_ssa_def_rewrite_uses_after(&instr->dest.ssa, nir_src_for_ssa(composite), composite->parent_instr);

With my bcsel composite gl_BaseVertex value constructed, I can now rewrite all subsequent uses of gl_BaseVertex in the shader to use the composite value, which will automatically swap between the Vulkan gl_BaseVertex and zero based on the value of the push constant without the need to rebuild the shader or make a new pipeline.

   return true;
}

And now the shader gets the expected value and everything works.

Billy Mays

It's also worth pointing out here that gl_DrawID from the same extension has a similar problem: gallium doesn't pass multidraws in full to the driver, instead iterating for each draw, which means that the shader value is never what's expected either. I've employed a similar trick to jam the draw index into the push constant and read that back in the shader to get the expected value there too.
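A sketch of that trick (the draw_id field and its placement are hypothetical; the real layout may differ): extend the push constant struct, then push the current index as gallium iterates the sub-draws.

struct zink_push_constant {
   unsigned draw_mode_is_indexed;
   unsigned draw_id; /* hypothetical field for gl_DrawID */
   float default_inner_level[2];
   float default_outer_level[4];
};

/* in the per-draw loop, if the vertex shader reads SYSTEM_VALUE_DRAW_ID: */
unsigned draw_id = dinfo->drawid;
vkCmdPushConstants(batch->cmdbuf, gfx_program->layout, VK_SHADER_STAGE_VERTEX_BIT,
                   offsetof(struct zink_push_constant, draw_id), sizeof(unsigned),
                   &draw_id);

A NIR pass can then rewrite nir_intrinsic_load_draw_id into a push constant load at that offset, exactly like the load_base_vertex lowering above.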

Extensions.

14 Sep 2020 12:00am GMT

10 Sep 2020

Adam Jackson: worse is better: making late buffer swaps tear

In an ideal world, every frame your application draws would appear on the screen exactly on time. Sadly, as anyone living in the year 2020 CE can attest, this is far from an ideal world. Sometimes the scene gets more complicated and takes longer to draw than you estimated, and sometimes the OS scheduler just decides it has more important things to do than pay attention to you.

When this happens, for some applications, it would be best if you could just get the bits on the screen as fast as possible rather than wait for the next vsync. The Present extension for X11 has an option to let you do exactly this:

If 'options' contains PresentOptionAsync, and the 'target-msc'
is less than or equal to the current msc for 'window', then
the operation will be performed as soon as possible, not
necessarily waiting for the next vertical blank interval.

But you don't usually use Present directly; rather, Present is the mechanism for GLX and Vulkan to put bits on the screen. So, today I merged some code to Mesa to enable the corresponding features in those APIs, namely GLX_EXT_swap_control_tear and VK_PRESENT_MODE_FIFO_RELAXED_KHR. If all goes well these should be included in Mesa 21.0, with a backport to 20.2.x not out of the question. As the GLX extension name suggests, this can introduce some visual tearing when the buffer swap does come in late, but for fullscreen games or VR displays that can be an acceptable tradeoff in exchange for reduced stuttering.
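From the application side, opting in is roughly a one-liner in each API (support checks omitted):

/* GLX: with GLX_EXT_swap_control_tear, a negative interval means
 * "sync to vblank, but tear rather than wait if the swap is late" */
glXSwapIntervalEXT(dpy, drawable, -1);

/* Vulkan: pick the relaxed FIFO mode at swapchain creation, after
 * verifying vkGetPhysicalDeviceSurfacePresentModesKHR() reports it */
VkSwapchainCreateInfoKHR info = {
   .sType = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,
   /* ...surface, format, extent, etc... */
   .presentMode = VK_PRESENT_MODE_FIFO_RELAXED_KHR,
};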

10 Sep 2020 7:44pm GMT

Bastien Nocera: power-profiles-daemon: new project announcement

Despite what this might look like, I don't actually enjoy starting new projects: it's a lot easier to clean up some build warnings, or add a CI, than it is to start from an empty directory.

But sometimes needs must, and I've just released version 0.1 of such a project. Below you'll find an excerpt from the README, which should answer most of the questions. Please read the README directly in the repository if you're getting to this blog post more than a couple of days after it was first published.

Feel free to file new issues in the tracker if you have ideas on possible power-saving or performance enhancements. Currently the only "Performance" mode supported interacts with Intel CPUs with P-State support. More hardware support is planned.

TL;DR: this setting should be in the GNOME 3.40 development branch soon, Fedora packages are done, and API docs are available.

From the README:

Introduction

power-profiles-daemon offers to modify system behaviour based upon user-selected power profiles. There are 3 different power profiles: a "balanced" default mode, a "power-saver" mode, and a "performance" mode. The first 2 of those are available on every system. The "performance" mode is only available on select systems and is implemented by different "drivers" based on the system or systems it targets.

In addition to those 2 or 3 modes (depending on the system), "actions" can be hooked up to change the behaviour of a particular device. For example, this can be used to disable the fast-charging for some USB devices when in power-saver mode.

GNOME's Settings and shell both include interfaces to select the current mode, but they are also expected to adjust the behaviour of the desktop depending on the mode, such as turning the screen off after inaction more aggressively when in power-saver mode.

Note that power-profiles-daemon does not save the currently active profile across system restarts and will always start with the "balanced" profile selected.

Why power-profiles-daemon

The power-profiles-daemon project was created to help provide a solution for two separate use cases, for desktops, laptops, and other devices running a "traditional Linux desktop".

The first one is a "Low Power" mode, that users could toggle themselves, or have the system toggle for them, with the intent to save battery. Mobile devices running iOS and Android have had a similar feature available to end-users and application developers alike.

The second use case was to allow a "Performance" mode on systems where the hardware maker would provide and design such a mode. The idea is that the Linux kernel would provide a way to access this mode which usually only exists as a configuration option in some machines' "UEFI Setup" screen.

This second use case is the reason why we didn't implement the "Low Power" mode in UPower, as was originally discussed.

As the daemon would change kernel settings, we would need to run it as root, and make its API available over D-Bus, as has been customary for more than 10 years. We would also design that API to be as easily usable to build graphical interfaces as possible.
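As a sketch of what driving that API looks like from a client (the net.hadess.PowerProfiles names are taken from the project's API docs; treat the exact invocation as illustrative):

gdbus call --system \
      --dest net.hadess.PowerProfiles \
      --object-path /net/hadess/PowerProfiles \
      --method org.freedesktop.DBus.Properties.Set \
      net.hadess.PowerProfiles ActiveProfile "<'power-saver'>"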

Why not...

This section will contain explanations of why this new daemon was written rather than re-using, or modifying an existing one. Each project obviously has its own goals and needs, and those comparisons are not meant as a slight on the project.

As the code bases of the projects listed below and of power-profiles-daemon are ever-evolving, the comments were understood to be correct when made.

thermald

thermald only works on Intel CPUs, and is very focused on allowing maximum performance based on a "maximum temperature" for the system. As such, it could be seen as complementary to power-profiles-daemon.

tuned and TLP

Both projects have similar goals, allowing for tweaks to be applied for a variety of workloads that go far beyond the workloads and use cases that power-profiles-daemon targets.

A fair number of the tweaks that could apply to devices running GNOME or another free desktop are either potentially destructive (eg. some of the SATA power-saving mode resulting in corrupted data), or working well enough to be put into place by default (eg. audio codec power-saving), even if we need to disable the power saving on some hardware that reacts badly to it.

Both are good projects to use for the purpose of experimenting with particular settings to see if they'd be something that can be implemented by default, or to put some fine-grained, static, policies in place on server-type workloads which are not as fluid and changing as desktop workloads can be.

auto-cpufreq

It doesn't take user-intent into account, doesn't have a D-Bus interface and seems to want to work automatically by monitoring the CPU usage, which kind of goes against a user's wishes as a user might still want to conserve as much energy as possible under high-CPU usage.

10 Sep 2020 6:00pm GMT

Bastien Nocera: Avoid “Tag: v-3.38.0-fixed-brown-paper-bag”

Over the past couple of (gasp!) decades, I've had my fair share of release blunders: forgetting to clean the tree before making a tarball by hand, forgetting to update the NEWS file, forgetting to push after creating the tarball locally, forgetting to update the appdata file (causing problems on Flathub)...

That's where check-news.sh comes in, to replace the check-news function of the autotools. Ideally you would:

- make sure your CI runs a dist job

- always use a merge request to do releases

- integrate check-news.sh into your meson build (though I would relax the appdata checks for devel releases)

10 Sep 2020 3:47pm GMT

09 Sep 2020

Mike Blumenkrantz: Query Stats

Long, Long Day

I fell into the abyss of query code again, so this is just a short post to (briefly) touch on the adventure that is pipeline statistics queries.

Pipeline statistics queries include a collection of statistics about things that happened while the query was active, such as the count of shader invocations.

Thankfully, these weren't too difficult to plug into the ever-evolving zink query architecture, as my previous adventure into QBOs (more on this another time) ended up breaking out a lot of helper functions for various things that simplified the process.

Mapping

The first step, as always, is figuring out how to map the gallium API to Vulkan. The below table handles the basic conversion for single query types:

unsigned map[] = {
   [PIPE_STAT_QUERY_IA_VERTICES] = VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_VERTICES_BIT,
   [PIPE_STAT_QUERY_IA_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_INPUT_ASSEMBLY_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_VS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_VERTEX_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_GS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_GS_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_GEOMETRY_SHADER_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_C_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_CLIPPING_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_C_PRIMITIVES] = VK_QUERY_PIPELINE_STATISTIC_CLIPPING_PRIMITIVES_BIT,
   [PIPE_STAT_QUERY_PS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_FRAGMENT_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_HS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_TESSELLATION_CONTROL_SHADER_PATCHES_BIT,
   [PIPE_STAT_QUERY_DS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_TESSELLATION_EVALUATION_SHADER_INVOCATIONS_BIT,
   [PIPE_STAT_QUERY_CS_INVOCATIONS] = VK_QUERY_PIPELINE_STATISTIC_COMPUTE_SHADER_INVOCATIONS_BIT
};

Pretty straightforward.

With this in place, it's worth mentioning that OpenGL has facilities for performing either "all" statistics queries, which include all the above types, or "single" statistics queries, which count just one of the types.
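On the Vulkan side, either flavor is backed by a query pool created with the desired statistics bits set; for the "all" case that's every bit in the map above. Something like this (a sketch, not the exact zink code):

VkQueryPoolCreateInfo pool_info = {
   .sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO,
   .queryType = VK_QUERY_TYPE_PIPELINE_STATISTICS,
   .queryCount = NUM_QUERIES, /* hypothetical pool size */
   /* the low 11 statistic bits, i.e. everything in map[] above; each
    * enabled bit adds one value to every query's result */
   .pipelineStatistics = (1u << 11) - 1,
};
vkCreateQueryPool(screen->dev, &pool_info, NULL, &query->query_pool);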

Building on that, I'm going to reach way back to the original loop that I've been using for handling query results. That's now its own helper function called check_query_results(). It's invoked from get_query_result() like so:

int result_size = 1;
/* these query types emit 2 values */
if (query->vkqtype == VK_QUERY_TYPE_TRANSFORM_FEEDBACK_STREAM_EXT ||
    query->type == PIPE_QUERY_PRIMITIVES_GENERATED ||
    query->type == PIPE_QUERY_PRIMITIVES_EMITTED)
   result_size = 2;
else if (query->type == PIPE_QUERY_PIPELINE_STATISTICS)
   result_size = 11;

if (query->type == PIPE_QUERY_PIPELINE_STATISTICS)
   num_results = 1;
for (unsigned last_start = query->last_start; last_start + num_results <= query->curr_query; last_start++) {
   /* verify that we have the expected number of results pending */
   assert(num_results <= ARRAY_SIZE(results) / result_size);
   VkResult status = vkGetQueryPoolResults(screen->dev, query->query_pool,
                                           last_start, num_results,
                                           sizeof(results),
                                           results,
                                           sizeof(uint64_t),
                                           flags);
   if (status != VK_SUCCESS)
      return false;

   if (query->type == PIPE_QUERY_PRIMITIVES_GENERATED) {
      status = vkGetQueryPoolResults(screen->dev, query->xfb_query_pool[0],
                                              last_start, num_results,
                                              sizeof(xfb_results),
                                              xfb_results,
                                              2 * sizeof(uint64_t),
                                              flags | VK_QUERY_RESULT_64_BIT);
      if (status != VK_SUCCESS)
         return false;

   }

   check_query_results(query, result, num_results, result_size, results, xfb_results);
}

Beautiful, isn't it?

For the case of any xfb-related query, 2 result values are returned (and then also PRIMITIVES_GENERATED gets its own separate xfb pool for correctness), while most others only get 1 result value.

And then there's PIPELINE_STATISTICS, which gets eleven. Since this is a weird result size to handle, I decided to just grab the results for each query one at a time in order to avoid any kind of craziness with allocating enough memory for the actual result values. Plenty of time for that later.

The handling for the results is fun too:

case PIPE_QUERY_PIPELINE_STATISTICS: {
   uint64_t *statistics = (uint64_t*)&result->pipeline_statistics;
   for (unsigned j = 0; j < 11; j++)
      statistics[j] += results[i + j];
   break;
}

Each bit set in the query object returns its own query result, so there are 11 of them that need to be accumulated for each query result.

Surprisingly though, that's the craziest part of implementing this. Not really much compared to some of the other insanity.

Corrections

The other day, I made wild claims about zink being the fastest graphics driver in history.

I'm not going to say they were inaccurate.

What I am going to say is that I had a great chat with well-known, long-time mesa developer Kenneth Graunke, who pointed out that most of the time differences in the softfp64 tests were due to NIR optimizations related to loop unrolling and the different settings used between IRIS and zink there.

He also helpfully pointed out that I forgot to set a number of important pipe caps, including the one that holds the value for gl_MaxVaryingComponents. As a result, zink has been advertising a very illegal and too-small number of varyings, which made the shaders smaller, which made us faster.

This has since been fixed, and zink is now fully compliant with the required GL specs for this value.

However.

This has only increased the time of the one test I showcased from ~2 seconds to ~12 seconds.

Due to different mechanics between GLSL and SPIRV shader construction, all XFB outputs have to be explicitly specified when they're converted in zink, and they count against the regular output limits. As a result, it's necessary to reserve half of the driver's advertised vertex outputs for potential XFB usage. This leaves far fewer available output slots for user shaders in order to preserve XFB functionality.

09 Sep 2020 12:00am GMT

08 Sep 2020

Bastien Nocera: Videos in GNOME 3.38

This is going to be a short post, as changes to Videos have been few and far between in the past couple of releases.

The major change to the latest release is that we've gained Tracker 3 support through a grilo plugin (which meant very few changes to our own code). But the Tracker 3 libraries are incompatible with the Tracker 2 daemon that's usually shipped in distributions, including on this author's development system.

So we made use of the ability of Tracker to run inside a Flatpak sandbox along with the video player, removing the need to have Tracker installed by the distribution, on the host. This should also make it easier to give users control of the directories they want to use to store their movies, in the future.

The release candidate for GNOME 3.38 is available right now as the stable version on Flathub.

08 Sep 2020 1:31pm GMT

07 Sep 2020

Alejandro Piñeiro: v3dv status update 2020-09-07

So here is a new update on the evolution of the Vulkan driver for the rpi4 (Broadcom GPU).

Features

Since my last update we finished the support for two features. Robust buffer access and multisampling.

Robust buffer access is a feature that allows specifying that accesses to buffers are bounds-checked against the range of the buffer descriptor. Usually this is used as a debug tool during development, but disabled on release (this is explained in more detail in this ARM guide). So sorry, no screenshot here.

On my last update I mentioned that we had started the support for multisampling, enough to get some demos working. Since then we were able to finish the rest of the multisampling support, and even implemented the optional sample rate shading feature. So now the following Sascha Willems's demo is working:

Sascha Willems deferred multisampling demo run on rpi4

Bugfixing

Taking into account that most of the features towards supporting Vulkan core 1.0 are implemented now, a lot of the effort since the last update went into bugfixing, focusing on the specifics of the driver. Our main reference for this is Vulkan CTS, the official Khronos testsuite for Vulkan and OpenGL.

As usual, here are some screenshots from the nice Sascha Willems's demos, showing demos that were failing when I wrote the last update and are working now thanks to the bugfixing work.

Sascha Willems hdr demo run on rpi4

Sascha Willems gltf skinning demo run on rpi4

Next

At this point there are no full features pending implementation to fulfill the support for Vulkan core 1.0, so our focus will be on getting all the Vulkan CTS tests to pass.

Previous updates

Just in case you missed any of the updates of the Vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31

07 Sep 2020 2:37pm GMT

Mike Blumenkrantz: Funday

Just For Fun

I finally managed to get a complete piglit run over the weekend, and, for my own amusement, I decided to check the timediffs against a reference run from the IRIS driver. Given that the Intel drivers are of extremely high quality (and are direct interfaces to the underlying hardware that I happen to be using), I tend to use ANV and IRIS as my references whenever I'm trying to debug things.

Both runs used the same base checkout from mesa, so all the core/gallium/nir parts were identical.

The results weren't what I expected.

My expectation when I clicked into the timediffs page was that zink would be massively slower in a huge number of tests, likely to a staggering degree in some cases.

We were, but then also on occasion we weren't.

As a final disclaimer before I dive into this, I feel like given the current state of people potentially rushing to conclusions I need to say that I'm not claiming zink is faster than a native GL driver, only that for some cases, our performance is oddly better than I expected.

The Good

[Image: piglit-misc-bench.png — timediffs for miscellaneous piglit tests]

The first thing to take note of here is that IRIS is massively better than zink in successful test completion, with a near-perfect 99.4% pass rate compared to zink's measly 91%, and that's across 2500 more tests too. This is important also since timediff only compares between passing tests.

With that said, somehow zink's codepath is significantly faster when it comes to dealing with high numbers of varying outputs, and also, weirdly, a bunch of dmat4 tests, even though they're both using the same softfp64 path since my icelake hardware doesn't support native 64bit operations.

I was skeptical about some of the numbers here, particularly the ext_transform_feedback max-varying-arrays-of-arrays cases, but manual tests were even weirder:

time MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink bin/ext_transform_feedback-max-varyings -auto -fbo

MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=zink  -auto -fbo  2.13s user 0.03s system 98% cpu 2.197 total

time MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=iris bin/ext_transform_feedback-max-varyings -auto -fbo

MESA_GLSL_CACHE_DISABLE=true MESA_LOADER_DRIVER_OVERRIDE=iris  -auto -fbo  301.64s user 0.52s system 99% cpu 5:02.45 total

wat.

I don't have a good explanation for this since I haven't dug into it other than to speculate that ANV is just massively better at handling large numbers of varying outputs.

The Bad

By contrast, zink gets thrashed pretty decisively in arb_map_buffer_alignment-map-invalidate-range, and we're about 150x slower.

Yikes. Looks like that's going to be a target for some work since potentially an application might hit that codepath.

The Weird

[Image: piglit-fp64-bench.png — timediffs for fp64 piglit tests]

Somehow, zink is noticeably slower in a bunch of other fp64 tests (and this isn't the full list, only a little over half). It's strange to me that zink can perform better in certain fp64 cases but then also worse in others, but I'm assuming this is just the result of different shader optimizations happening between the drivers, shifting them onto slightly less slow parts of the softfp64 codepath in certain cases.

Possibly something to look into.

Probably not in too much depth since softfp64 is some pretty crazy stuff.

In Closing

Tests (and especially piglit ones) are not indicative of real world performance.

07 Sep 2020 12:00am GMT

Alyssa Rosenzweig: Fun and Games with Exposure Notifications

Exposure Notifications is a protocol developed by Apple and Google for facilitating COVID-19 contact tracing on mobile phones by exchanging codes with nearby phones over Bluetooth, implemented within the Android and iOS operating systems, now available here in Toronto.

Wait - phones? Android and iOS only? Can't my Debian laptop participate? It has a recent Bluetooth chip. What about phones running GNU/Linux distributions like the PinePhone or Librem 5?

Exposure Notifications breaks down neatly into three sections: a Bluetooth layer, some cryptography, and integration with local public health authorities. Linux is up to the task, via BlueZ, OpenSSL, and some Python.

Given my background, will this turn out to be a reverse-engineering epic resulting in a novel open stack for a closed system?

Not at all. The specifications for Exposure Notifications are available for both the Bluetooth protocol and the underlying cryptography. A partial reference implementation is available for Android, as is an independent Android implementation in microG. In Canada, the key servers run an open source stack originally built by Shopify and now maintained by the Canadian Digital Service, including open protocol documentation.

All in all, this is looking to be a smooth-sailing weekend [1] project.

The devil's in the details.

Bluetooth

Exposure Notifications operates via Bluetooth Low Energy "advertisements". Scanning for other devices is as simple as scanning for advertisements, and broadcasting is as simple as advertising ourselves.

On an Android phone, this is handled deep within Google Play Services. Can we drive the protocol from userspace on a regular GNU/Linux laptop? It depends. Not all laptops support Bluetooth, not all Bluetooth implementations support Bluetooth Low Energy, and I hear not all Bluetooth Low Energy implementations properly support undirected transmissions ("advertising").

Luckily in my case, I develop on a Debianized Chromebook with a Wi-Fi/Bluetooth module. I've never used the Bluetooth, but it turns out the module has full support for advertisements, verified with the lescan (Low Energy Scan) command of the hcitool Bluetooth utility.

hcitool is a part of BlueZ, the standard Linux library for Bluetooth. Since lescan is able to detect nearby phones running Exposure Notifications, poring through its source code is a good first step to our implementation. With some minor changes to hcitool to dump packets as raw hex and to filter for the Exposure Notifications protocol, we can print all nearby Exposure Notifications advertisements. So far, so good.

That's about where the good ends.

While scanning is simple with reference code in hcitool, advertising is complicated by BlueZ's lack of an interface at the time of writing. While a general "enable advertising" routine exists, routines to set advertising parameters and data per the Exposure Notifications specification are unavailable. This is not a showstopper, since BlueZ is itself an open source userspace library. We can drive the Bluetooth module the same way BlueZ does internally, filling in the necessary gaps in the API, while continuing to use BlueZ for the heavy-lifting.

Some care is needed to multiplex scanning and advertising within a single thread while remaining power efficient. The key is that advertising, once configured, is handled entirely in hardware without CPU intervention. On the other hand, scanning does require CPU involvement, but it is not necessary to scan continuously. Since COVID-19 is thought to transmit from sustained exposure, we only need to scan every few minutes. (Food for thought: how does this connect to the sampling theorem?)

Thus we can order our operations as:

- configure advertising with a fresh random address and payload, and enable it in hardware
- sleep for a few minutes while the chip advertises on its own
- wake up, scan briefly, and record any nearby advertisements
- repeat

Since most of the time the program is asleep, this loop is efficient. It additionally allows us to reconfigure advertising every ten to fifteen minutes, in order to change the Bluetooth address to prevent tracking.

All of the above amounts to a few hundred lines of C code, treating the Exposure Notifications packets themselves as opaque random data.
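As a compressed sketch of that loop using BlueZ's HCI helpers from <bluetooth/hci_lib.h> (the configure/read helpers are hypothetical stand-ins for the code described above, and error handling is omitted):

#include <unistd.h>
#include <bluetooth/bluetooth.h>
#include <bluetooth/hci.h>
#include <bluetooth/hci_lib.h>

int main(void)
{
   int dd = hci_open_dev(hci_get_route(NULL));

   for (;;) {
      /* reconfigure the advertisement: fresh random address and payload;
       * once enabled, the hardware broadcasts with no CPU involvement */
      configure_advertising(dd);               /* hypothetical helper */
      hci_le_set_advertise_enable(dd, 1, 1000);

      /* a brief scan is enough: sustained exposure is what matters */
      hci_le_set_scan_enable(dd, 1, 0, 1000);
      read_advertisements(dd);                 /* hypothetical helper */
      hci_le_set_scan_enable(dd, 0, 0, 1000);

      sleep(10 * 60); /* sleep ~10 minutes, then rotate and repeat */
   }
}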

Cryptography

Yet the data is far from random; it is the result of a series of operations in terms of secret keys defined by the Exposure Notifications cryptography specification. Every day, a "temporary exposure key" is generated, from which a "rolling proximity identifier key" and an "associated encrypted metadata key" are derived. These are used to generate a "rolling proximity identifier" and the "associated encrypted metadata", which are advertised over Bluetooth and changed in lockstep with the Bluetooth random addresses.

There are lots of moving parts to get right, but each derivation reuses a common encryption primitive: HKDF-SHA256 for key derivation, AES-128 for the rolling proximity identifier, and AES-128-CTR for the associated encrypted metadata. Ideally, we would grab a state-of-the-art library of cryptography primitives like NaCl or libsodium and wire everything up.

First, some good news: once these routines are written, we can reliably unit test them. Though the specification states that "test vectors… are available upon request", it isn't clear who to request from. But Google's reference implementation is itself unit-tested, and sure enough, it contains a TestVectors.java file, from which we can grab the vectors for a complete set of unit tests.

After patting ourselves on the back for writing unit tests, we'll need to pick a library to implement the cryptography. Suppose we try NaCl first. We'll quickly realize the primitives we need are missing, so we move on to libsodium, which is backwards-compatible with NaCl. For a moment, this will work - libsodium has upstream support for HKDF-SHA256. Unfortunately, the version of libsodium shipping in Debian testing is too old for HKDF-SHA256. Not a big problem - we can backport the implementation, written in terms of the underlying HMAC-SHA256 operations, and move on to the AES.

AES is a standard symmetric cipher, so libsodium has excellent support… for some modes. However standard, AES is not one cipher; it is a family of ciphers with different key lengths and operating modes, with dramatically different security properties. "AES-128-CTR" in the Exposure Notifications specification is clearly 128-bit AES in CTR (Counter) mode, but what about "AES-128" alone, stated to operate on a "single AES-128 block"?

The mode implicitly specified is known as ECB (Electronic Codebook) mode and is known to have fatal security flaws in most applications. Because AES-ECB is generally insecure, libsodium does not have any support for this cipher mode. Great, now we have two problems - we have to rewrite our cryptography code against a new library, and we have to consider if there is a vulnerability in Exposure Notifications.

ECB's crucial flaw is that for a given key, identical plaintext will always yield identical ciphertext, regardless of position in the stream. Since AES is block-based, this means identical blocks yield identical ciphertext, leading to trivial cryptanalysis.

In Exposure Notifications, ECB mode is used only to derive rolling proximity identifiers from the rolling proximity identifier key and the timestamp, by the equation:

RPI_ij = AES_128_ECB(RPIK_i, PaddedData_j)

…where PaddedData is a function of the quantized timestamp. Thus the issue is avoided, as every plaintext will be unique (since timestamps are monotonically increasing, unless you're trying to contact trace Back to the Future).

Nevertheless, libsodium doesn't know that, so we'll need to resort to a ubiquitous cryptography library that doesn't, uh, take security quite so seriously…

I'll leave the implications up to your imagination.
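For the record, the two derivations are compact. A sketch using OpenSSL 1.1.1's EVP interfaces (illustrating the spec's HKDF-SHA256 and single-block AES-128-ECB steps; return-value checks omitted, little-endian host assumed, and this isn't liben's actual code):

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/kdf.h>

/* RPIK_i = HKDF-SHA256(TEK_i, no salt, info = "EN-RPIK", 16 bytes) */
static void derive_rpik(const uint8_t tek[16], uint8_t rpik[16])
{
   size_t len = 16;
   EVP_PKEY_CTX *pctx = EVP_PKEY_CTX_new_id(EVP_PKEY_HKDF, NULL);
   EVP_PKEY_derive_init(pctx);
   EVP_PKEY_CTX_set_hkdf_md(pctx, EVP_sha256());
   EVP_PKEY_CTX_set1_hkdf_key(pctx, tek, 16);
   EVP_PKEY_CTX_add1_hkdf_info(pctx, (const unsigned char *)"EN-RPIK", 7);
   EVP_PKEY_derive(pctx, rpik, &len);
   EVP_PKEY_CTX_free(pctx);
}

/* RPI_ij = AES-128-ECB(RPIK_i, "EN-RPI" || 6 zero bytes || ENIN_j) */
static void derive_rpi(const uint8_t rpik[16], uint32_t enin, uint8_t rpi[16])
{
   uint8_t padded[16] = "EN-RPI";  /* bytes 6..11 remain zero */
   int outl;
   memcpy(&padded[12], &enin, 4);  /* interval number, little-endian */

   EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
   EVP_EncryptInit_ex(ctx, EVP_aes_128_ecb(), NULL, rpik, NULL);
   EVP_CIPHER_CTX_set_padding(ctx, 0);
   EVP_EncryptUpdate(ctx, rpi, &outl, padded, 16);
   EVP_CIPHER_CTX_free(ctx);
}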

Database

While the Bluetooth and cryptography sections are governed by upstream specifications, making sense of the data requires tracking a significant amount of state. At minimum, we must:

- record every rolling proximity identifier we observe, along with its associated encrypted metadata and a timestamp
- persist our own temporary exposure keys for the last 14 days, so they can be uploaded in case of a diagnosis

If we were so inclined, we could handwrite all the serialization and concurrency logic and hope we don't have a bug that results in COVID-19 mayhem.

A better idea is to grab SQLite, perhaps the most deployed software in the world, and express these actions as SQL queries. The database persists to disk, and we can even express natural unit tests with a synthetic in-memory database.
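A sketch of what the core table and the exposure check can look like (schema invented for illustration):

CREATE TABLE IF NOT EXISTS advertisements (
    rpi  BLOB NOT NULL,    -- rolling proximity identifier (16 bytes)
    aem  BLOB NOT NULL,    -- associated encrypted metadata
    rssi INTEGER,          -- signal strength, useful for distance estimates
    seen INTEGER NOT NULL  -- UNIX timestamp of the sighting
);

-- exposure check: how often did we see an identifier reconstructed
-- from a diagnosed person's published keys?
SELECT COUNT(*) FROM advertisements WHERE rpi = ?;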

With this infrastructure, we're now done with the primary daemon, recording Exposure Notification identifiers to the database and broadcasting our own identifiers. That's not interesting if we never do anything with that data, though. Onwards!

Key retrieval

Once per day, Exposure Notifications implementations are expected to query the server for Temporary Encryption Keys associated with diagnosed COVID-19 cases. From these keys, the cryptography implementation can reconstruct the associated Rolling Proximity Identifiers, for which we can query the database to detect if we have been exposed.

Per Google's documentation, the servers are expected to return a zip file containing two files:

- export.bin: the Temporary Exposure Keys of diagnosed cases, serialized as a protocol buffer
- export.sig: a signature over the export from the public health authority

The signature is not terribly interesting to us. On Android, it appears the system pins the public keys of recognized public health agencies as an integrity check for the received file. However, this public key is given directly to Google; we don't appear to have an easy way to access it.

Does it matter? For our purposes, it's unlikely. The Canadian key retrieval server is already transport-encrypted via HTTPS, so tampering with the data would already require compromising a certificate authority in addition to intercepting the requests to https://canada.ca. Broadly speaking, that limits attackers to nation-states, and since Canada has no reason to attack its own infrastructure, that limits our threat model to foreign nation-states. International intelligence agencies probably have better uses of resources than getting people to take extra COVID tests.

It's worth noting other countries' implementations could serve this zip file over plaintext HTTP, in which case this signature check becomes important.

Focusing then on export.bin, we may import the relevant protocol buffer definitions to extract the keys for matching against our database. Since this requires only read-only access to the database and executes infrequently, we can safely perform this work from a separate process written in a higher-level language like Python, interfacing with the cryptography routines over the Python foreign function interface ctypes. Extraction is easy with the Python protocol buffers implementation, and downloading should be as easy as a GET request with the standard library's urllib, right?

Here we hit a gotcha: the retrieval endpoint is guarded behind an HMAC, requiring authentication to download the zip. The protocol documentation states:

Of course there's no reliable way to truly authenticate these requests in an environment where millions of devices have immediate access to them upon downloading an Application: this scheme is purely to make it much more difficult to casually scrape these keys.

Ah, security by obscurity. Calculating the HMAC itself is simple given the documentation, but it requires a "secret" HMAC key specific to the server. As the documentation is aware, this key is hardly secret, but it's not available on the Canadian Digital Service's official repositories. Interoperating with the upstream servers would require some "extra" tricks.

From purely academic interest, we can write and debug our implementation without any such authorization by running our own sandbox server. Minus the configuration, the server source is available, so after spinning up a virtual machine and fighting with Go versioning, we can test our Python script.

Speaking of a personal sandbox…

Key upload

There is one essential edge case to the contact tracing implementation, one that we can't test against the Canadian servers. And edge cases matter. In effect, the entire Exposure Notifications infrastructure is designed for the edge cases. If you don't care about edge cases, you don't care about digital contact tracing (so please, stay at home.)

The key feature - and key edge case - is uploading Temporary Exposure Keys to the Canadian key server in case of a COVID-19 diagnosis. This upload requires an alphanumeric code generated by a healthcare provider upon diagnosis, so if we used the shared servers, we couldn't test an implementation. With our sandbox, we can generate as many alphanumeric codes as we'd like.

Once sandboxed, there isn't much to the implementation itself: the keys are snarfed out of the SQLite database, we handshake with the server over protocol buffers marshaled over POST requests, and we throw in some public-key cryptography via the Python bindings to libsodium.

This functionality neatly fits into a second dedicated Python script which does not interface with the main library. It's exposed as a command line interface with flow resembling that of the mobile application, adhering reasonably to the UNIX philosophy. Admittedly I'm not sure wrestling with the command line is top on the priority list of a Linux hacker ill with COVID-19. Regardless, the interface is suitable for higher-level (graphical) abstractions.

Problem solved, but of course there's a gotcha: if the request is malformed, an error should be generated as a key robustness feature. Unfortunately, while developing the script against my sandbox, a bug led the request to be dropped unexpectedly, rather than returning with an error message. On the server implemented in Go, there was an apparent nil dereference. Oops. Fixing this isn't necessary for this project, but it's still a bug, even if it requires a COVID-19 diagnosis to trigger. So I went and did the Canadian thing and sent a pull request.

Conclusion

All in all, we end up with a Linux implementation of Exposure Notifications functional in Ontario, Canada. What's next? Perhaps supporting contact tracing systems elsewhere in the world - patches welcome. Closer to home, while functional, the aesthetics are not (yet) anything to write home about - perhaps we could write a touch-based Linux interface for mobile Linux interfaces like Plasma Mobile and Phosh, maybe even running it on a Android flagship flashed with postmarketOS to go full circle.

Source code for liben is available for anyone who dares go near. Compiling from source is straightforward but necessary at the time of writing. As for packaging?

Here's hoping COVID-19 contact tracing will be obsolete by the time liben hits Debian stable.


  1. Today (Monday) is Labour Day, so this is a 3-day weekend. But I started on Saturday and posted this today, so it technically counts.

07 Sep 2020 12:00am GMT

04 Sep 2020

Christian Schaller: PipeWire Late Summer Update 2020

Wim Taymans

Wim Taymans talking about current state of PipeWire


Wim Taymans did an internal demonstration yesterday for the desktop team at Red Hat of the current state of PipeWire. For those still unaware PipeWire is our effort to bring together audio, video and pro-audio under Linux, creating a smooth and modern experience. Before PipeWire there was PulseAudio for consumer audio, Jack for Pro-audio and just unending pain and frustration for video. PipeWire is being done with the aim of being ABI compatible with ALSA, PulseAudio and JACK, meaning that PulseAudio and Jack apps should just keep working on top of PipeWire without the need for rewrites (and with the same low latency for JACK apps).

As Wim reported yesterday, things are coming together, with the PulseAudio, Jack and ALSA backends all being usable if not 100% feature complete yet. Wim has been running his system with PipeWire as the only sound server for a while now and things are now in a state where we feel ready to ask the wider community to test and help provide feedback and test cases.

Carla running on PipeWire

Carla as shown above is a popular Jack application, and it provides among other things this patchbay view of your audio devices and applications. I recommend you all to click in and take a close look at the screenshot above. That is the Jack application Carla running, and as you see, PulseAudio applications like GNOME Settings and Google Chrome are also showing up now thanks to the unified architecture of PipeWire, alongside Jack apps like Hydrogen. All of this without any changes to Carla or any of the other applications shown.

At the moment Wim is primarily testing using Cheese, GNOME Control center, Chrome, Firefox, Ardour, Carla, vlc, mplayer, totem, mpv, Catia, pavucontrol, paman, qsynth, zrythm, helm, Spotify and Calf Studio Gear. So these are the applications you should be getting the most mileage from when testing, but most others should work too.

Anyway, let me quickly go over some of the highlights from Wim's presentation.

Session Manager

PipeWire now has a functioning session manager that handles policy decisions such as device setup and stream routing.

Currently this is a simple sample session manager that Wim created himself, but we also have a more advanced session manager called Wireplumber being developed by Collabora, originally for automotive Linux use cases, which we will probably be moving to over time for the desktop as well.

Human readable handling of Audio Devices

Wim took the code and configuration data in PulseAudio for ALSA Card Profiles and created a standalone library that can be shared between PipeWire and PulseAudio. This library handles ALSA sound card profiles, devices, mixers and UCM (the use case manager used to configure newer audio chips, like the one in the Lenovo X1 Carbon), and lets PipeWire provide the correct information to things like GNOME Control Center or pavucontrol. Using the same code as has been used in PulseAudio for this has the added benefit that when you switch from PulseAudio to PipeWire your devices don't change names. So everything should look and feel just like PulseAudio from an application perspective. In fact just below is a screenshot of pavucontrol, the PulseAudio mixer application, running on top of PipeWire without a problem.

Pavucontrol, the PulseAudio mixer, running on PipeWire

Creating audio sink devices with Jack

PipeWire now allows you to create new audio sink devices with Jack. So the example command below creates a PipeWire sink node out of calfjackhost and sets it up so that we can output, for instance, the audio from Firefox into it. At the moment you can do that by running your Jack apps like this:

PIPEWIRE_PROPS="media.class=Audio/Sink" calfjackhost

But eventually we hope to move this functionality into the GNOME Control Center or similar so that you can do this setup graphically. The screenshot below shows us using CalfJackHost as an audio sink, outputting the audio from Firefox (a PulseAudio application) and CalfJackHost generating an analyzer graph of the audio.

CalfJackHost being used as an audio sink for Firefox

Creating devices with GStreamer

We can also use GStreamer to create PipeWire devices now. The command below takes the popular Big Buck Bunny animation created by the great folks over at Blender and lets you set it up as a video source in PipeWire. So if, for instance, you always wanted to play back a video inside Cheese, to apply the Cheese effects to it, you can do that this way without Cheese needing to change to handle video playback. As one can imagine, this opens up the ability to string together a lot of applications in interesting ways to achieve things that there might not be an application for yet. Of course application developers can also take more direct advantage of this to easily add features to their applications; for instance, I am really looking forward to something like OBS Studio taking full advantage of PipeWire.

gst-launch-1.0 uridecodebin uri=file:///home/wim/data/BigBuckBunny_320x180.mp4 ! pipewiresink mode=provide stream-properties="props,media.class=Video/Source,node.description=BBB"

Cheese playing a video provided by GStreamer through PipeWire.

How to get started testing PipeWire

Ok, so after seeing all of this you might be thinking, how can I test all of this stuff out and find out how my favorite applications work with PipeWire? Well the first thing you should do is make sure you are running Fedora Workstation 32 or later, as that is where we are developing all of this. Once you've done that you need to make sure you've got all the needed pieces installed:

sudo dnf install pipewire-libpulse pipewire-libjack pipewire-alsa

Once that dnf command finishes you run the following to get PulseAudio replaced by PipeWire.


cd /usr/lib64/

sudo ln -sf pipewire-0.3/pulse/libpulse-mainloop-glib.so.0 /usr/lib64/libpulse-mainloop-glib.so.0.999.0
sudo ln -sf pipewire-0.3/pulse/libpulse-simple.so.0 /usr/lib64/libpulse-simple.so.0.999.0
sudo ln -sf pipewire-0.3/pulse/libpulse.so.0 /usr/lib64/libpulse.so.0.999.0

sudo ln -sf pipewire-0.3/jack/libjack.so.0 /usr/lib64/libjack.so.0.999.0
sudo ln -sf pipewire-0.3/jack/libjacknet.so.0 /usr/lib64/libjacknet.so.0.999.0
sudo ln -sf pipewire-0.3/jack/libjackserver.so.0 /usr/lib64/libjackserver.so.0.999.0

sudo ldconfig

(You can also find those commands here.)

Once you run these commands you should be able to run

pactl info

and see this as the first line returned:
Server String: pipewire-0

I do recommend rebooting, to be 100% sure you are on a PipeWire system with everything outputting through PipeWire. Once that is done you are ready to start testing!

Our goal is to use the remainder of the Fedora Workstation 32 lifecycle and the Fedora Workstation 33 lifecycle to stabilize and finish the last major features of PipeWire and then start relying on it in Fedora Workstation 34. So I hope this article will encourage more people to get involved and join us on gitlab and on the PipeWire IRC channel at #pipewire on Freenode.

As we are trying to stabilize PipeWire we are working on it on a bug by bug basis atm, so if you end up testing out the current state of PipeWire then be sure to report issues back to us through the PipeWire issue tracker, but do try to ensure you have a good test case/reproducer as we are still so early in the development process that we can't dig into 'obscure/unreproducible' bugs.

Also if you want/need to go back to PulseAudio you can run the commands here

Also if you just want to test a single application and not switch your whole system over you should be able to do that by using the following commands:

pw-pulse

or

pw-jack
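For example (assuming the wrappers take the application to launch as their argument):

pw-pulse firefox
pw-jack carla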

Next Steps

So what are our exact development plans at this point? Well here is a list in somewhat priority order:

  1. Stabilize - Our top priority now is to make PipeWire so stable that the power users we hope to attract as our first batch of users are comfortable running PipeWire as their only audio server. This is critical to build up a userbase that can help us identify and prioritize remaining issues and ensure that when we do switch Fedora Workstation over to using PipeWire as the default and only supported audio server it will be a great experience for users.
  2. Jackdbus - We want to implement support for the jackdbus API soon, as we know it's an important feature for the Fedora Jam folks. So we hope to get to this in the not too distant future.
  3. Flatpak portal for JACK/audio applications - The future of application packaging is Flatpaks and being able to sandbox Jack applications properly inside a Flatpak is something we want to enable.
  4. Bluetooth - Bluetooth has been supported in PipeWire from the start, but as Wim's focus has moved elsewhere it has gone a little stale. So we are looking at cycling back to it and cleaning it up to get it production ready. This includes proper support for things like LDAC and AAC passthrough, which is currently not handled in PulseAudio. Wim hopes to push an updated PipeWire out in Fedora next week which should at least get Bluetooth into a basic working state, but the big fix will come later.
  5. Pulse effects - Wim has looked at this, but there are some bugs that block data from moving through the pipeline.
  6. Latency compensation - We want complete latency compensation implemented. This is not actually in Jack currently, so it would be a net new feature.
  7. Network audio - PulseAudio style network audio is not implemented yet.

04 Sep 2020 4:47pm GMT

Peter Hutterer: User-specific XKB configuration - putting it all together

This is the continuation from these posts: part 1, part 2, part 3

This is the part where it all comes together, with (BYO) fireworks and confetti, champagne and hoorays. Or rather, this is where I'll describe how to actually set everything up. It's a bit complicated because while libxkbcommon does the parsing legwork now, we haven't actually changed other APIs and the file formats which are still 1990s-style nerd cool and requires significant experience in CS [1] to understand what goes where.

The below relies on software using libxkbcommon and libxkbregistry. At the time of writing, libxkbcommon is used by all mainstream Wayland compositors but not by the X server. libxkbregistry is not yet used because I'm typing this before we had a release for it. But at least now I have a link to point people to.

libxkbcommon has an xkbcli-scaffold-new-layout tool that creates the template files shown below. At the time of writing, this tool must be run from the git repo build directory; it is not installed.

I'll explain here how to add the us(banana) variant and the custom:foo option, and I will optimise for simplicity and brevity.

Directory structure

First, create the following directory layout:


$ tree $XDG_CONFIG_HOME/xkb
/home/user/.config/xkb
├── compat
├── keycodes
├── rules
│   ├── evdev
│   └── evdev.xml
├── symbols
│   ├── custom
│   └── us
└── types

If $XDG_CONFIG_HOME is unset, fall back to $HOME/.config.

Rules files

Create the rules file and add an entry to map our custom:foo option to a section in the symbols/custom file.


$ cat $XDG_CONFIG_HOME/xkb/rules/evdev
! option = symbols
custom:foo = +custom(foo)

// Include the system 'evdev' file
! include %S/evdev

Note that no entry is needed for the variant, that is handled by wildcards in the system rules file. If you only want a variant and no options, you technically don't need this rules file.

Second, create the xml file used by libxkbregistry to display your new entries in the configuration GUIs:


$ cat $XDG_CONFIG_HOME/xkb/rules/evdev.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xkbConfigRegistry SYSTEM "xkb.dtd">
<xkbConfigRegistry version="1.1">
  <layoutList>
    <layout>
      <configItem>
        <name>us</name>
      </configItem>
      <variantList>
        <variant>
          <configItem>
            <name>banana</name>
            <shortDescription>banana</shortDescription>
            <description>US(Banana)</description>
          </configItem>
        </variant>
      </variantList>
    </layout>
  </layoutList>
  <optionList>
    <group allowMultipleSelection="true">
      <configItem>
        <name>custom</name>
        <description>custom options</description>
      </configItem>
      <option>
        <configItem>
          <name>custom:foo</name>
          <description>This option does something great</description>
        </configItem>
      </option>
    </group>
  </optionList>
</xkbConfigRegistry>

Our variant needs to be added as a layoutList/layout/variantList/variant, the option to the optionList/group/option. libxkbregistry will combine this with the system-wide evdev.xml file in /usr/share/X11/xkb/rules/evdev.xml.

Overriding and adding symbols

Now to the actual mapping. Add a section to each of the symbols files that matches the variant or option name:


$ cat $XDG_CONFIG_HOME/xkb/symbols/us
partial alphanumeric_keys modifier_keys
xkb_symbols "banana" {
    name[Group1]= "Banana (us)";

    include "us(basic)"

    key <CAPS> { [ Escape ] };
};

With this, the us(banana) layout will be a US keyboard layout but with the CapsLock key mapped to Escape. What about our option? Mostly the same, let's map the tilde key to nothing:


$ cat $XDG_CONFIG_HOME/xkb/symbols/custom
partial alphanumeric_keys modifier_keys
xkb_symbols "foo" {
    key <TLDE> { [ VoidSymbol ] };
};

A note here: NoSymbol means "don't overwrite it" whereas VoidSymbol is "map to nothing".

Notes

You may notice that the variant and option sections are almost identical. XKB doesn't care about variants vs options, it only cares about components to combine. So the sections do what we expect of them: variants include enough other components to make them a full keyboard layout, options merely define a few keys so they can be combined with layouts(variants). Due to how the lookups work, you could load the option template as layout custom(foo).

For the actual documentation of keyboard configuration, you'll need to google around; there are quite a few posts on how to map keys. All that has changed is where things are loaded from and how, not the actual data formats.

If you wanted to install this as system-wide custom rules, replace $XDG_CONFIG_HOME with /etc.

The above is a replacement for xmodmap. It does not require manually running a script to apply the config; the existing XKB integration takes care of it. It will work in Wayland (but, as said above, not in X, at least not for now).
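A quick way to verify the files are picked up is to compile the keymap directly with libxkbcommon's xkbcli tool (assuming your libxkbcommon is recent enough to ship it):


$ xkbcli compile-keymap --layout us --variant banana

If the printed keymap shows Escape on the <CAPS> key, the custom variant resolved correctly.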

A final word

Now, I fully agree that this is cumbersome, clunky and feels outdated. This is easy to fix: all that is needed is for someone to develop a better file format, make sure it's backwards compatible with the full spec of the XKB parser (the above is a fraction of what it can do), ensure that the old files can be generated from the new format to reduce maintenance, and then maintain backwards compatibility with the current format for the next ten or so years. Should be a good Google Decade of Code beginner project.


04 Sep 2020 4:36am GMT

Peter Hutterer: No user-specific XKB configuration in X

This is the continuation from these posts: part 1, part 2, part 3 and part 4.

In the posts linked above, I describe how it's possible to have custom keyboard layouts in $HOME or /etc/xkb that will get picked up by libxkbcommon. This only works for the Wayland stack, the X stack doesn't use libxkbcommon. In this post I'll explain why it's unlikely this will ever happen in X.

As described in the previous posts, users configure a keymap with rules, models, layouts, variants and options (RMLVO). What XKB uses internally, though, are keycodes, compat, geometry, symbols and types (KcCGST) [1].

There are, effectively, two KcCGST keymap compilers: libxkbcommon and xkbcomp. libxkbcommon can go from RMLVO to a full keymap; xkbcomp relies on other tools (e.g. setxkbmap), which in turn use a utility library called libxkbfile to parse the rules files. The X server has a copy of the libxkbfile code. It doesn't use libxkbfile itself, but it relies on the header files provided by it for some structs.
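To make the split concrete, here's a minimal sketch of the libxkbcommon path from RMLVO to a full keymap (error handling omitted; the RMLVO names match the examples from the previous posts):

#include <xkbcommon/xkbcommon.h>

struct xkb_keymap *
compile_from_rmlvo(void)
{
    struct xkb_context *ctx = xkb_context_new(XKB_CONTEXT_NO_FLAGS);
    struct xkb_rule_names names = {
        .rules = "evdev",
        .model = "pc105",
        .layout = "us",
        .variant = "banana",
        .options = "custom:foo",
    };
    /* resolve the RMLVO via the rules files, then compile the resulting
     * KcCGST components into a single keymap object */
    struct xkb_keymap *keymap =
        xkb_keymap_new_from_names(ctx, &names, XKB_KEYMAP_COMPILE_NO_FLAGS);
    xkb_context_unref(ctx);
    return keymap; /* NULL if compilation failed */
}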

Wayland's keyboard configuration works like this:

- the compositor decides on the keyboard configuration (RMLVO), e.g. based on the user's settings in GNOME or KDE
- the compositor compiles that RMLVO into a full keymap with libxkbcommon
- the compositor passes that full keymap to the client as a string
- the client parses the keymap string, again with libxkbcommon

The advantage we have here is that only the full keymap is passed between entities. Changing how that keymap is generated does not affect the client. This, coincidentally [2], is also how Xwayland gets the keymap passed to it and why Xwayland works with user-specific layouts.
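In code, that hand-over is roughly this sketch (the fd plumbing through the wl_keyboard.keymap event is omitted; ctx and keymap are from the example above):

/* compositor side: serialize the compiled keymap to a string */
char *str = xkb_keymap_get_as_string(keymap, XKB_KEYMAP_FORMAT_TEXT_V1);

/* client side: parse the received string back into a keymap object */
struct xkb_keymap *client_keymap =
    xkb_keymap_new_from_string(ctx, str, XKB_KEYMAP_FORMAT_TEXT_V1,
                               XKB_KEYMAP_COMPILE_NO_FLAGS);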

X works differently. Notably, KcCGST can come in two forms: the partial form specifying component names only, and the full keymap. The partial form looks like this:


$ setxkbmap -print -layout fr -variant azerty -option ctrl:nocaps
xkb_keymap {
    xkb_keycodes { include "evdev+aliases(azerty)" };
    xkb_types { include "complete" };
    xkb_compat { include "complete" };
    xkb_symbols { include "pc+fr(azerty)+inet(evdev)+ctrl(nocaps)" };
    xkb_geometry { include "pc(pc105)" };
};

This defines the component names but not the actual keymap, punting that to the next part in the stack. This will turn out to be the Achilles heel. Keymap handling in the server has two distinct approaches:

- during keyboard device initialisation, the server takes the configured RMLVO, converts it to KcCGST component names with its builtin copy of the libxkbfile code, spawns xkbcomp to compile those components, and loads the resulting binary XKM file
- at runtime, a client sends an XkbGetKeyboardByName request containing KcCGST component names; the server again spawns xkbcomp and loads the compiled XKM file

This has been the approach for decades. To give you an indication of how fast-moving this part of the server is: XKM caching was the latest feature added... in 2009.

Driver initialisation is nice, but barely used these days. You set your keyboard layout in e.g. GNOME or KDE and that will apply it in the running session. Or run setxkbmap, for those with a higher affinity to neckbeards. setxkbmap works like this:

- it parses the rules file (using libxkbfile) to convert the RMLVO into KcCGST component names
- it sends those KcCGST names to the server in an XkbGetKeyboardByName request
- the server compiles and applies the keymap as described above

Notably, the RMLVO to KcCGST conversion is done on the client side, not the server side. And the only way to send a keymap to the server is the XkbGetKeyboardByName request, which only takes KcCGST; you can't even pass it a full keymap. This is also a long-standing potential issue with XKB: if your client tools use different XKB data files than the server, you don't get the keymap you expected.

Other parts of the stack do basically the same as setxkbmap, which is just a thin wrapper around libxkbfile anyway.

Now, you can use xkbcomp on the client side to generate a keymap, but you can't hand it as-is to the server. xkbcomp can upload it (using libxkbfile) by updating the server's XKB state piece by piece (XkbSetMap, XkbSetCompatMap, XkbSetNames, etc.). But at that point you're asking the server to knowingly compile a wrong keymap first, only to then update parts of it.

So, realistically, the only way to get user-specific XKB layouts into the X server would be to update libxkbfile to provide the same behaviour as libxkbcommon, update the server to actually use libxkbfile instead of its own copy, and update xkbcomp to support the changes from part 2 and part 3. All while ensuring no regressions in code that's decades old, barely maintained, has no tests, and is, let's be honest, not particularly pretty to look at. User-specific XKB layouts are a somewhat niche case to begin with, so I don't expect anyone to ever volunteer to do this work [3], much less find the resources to review and merge that code. The X server is unlikely to see another real release, and this is definitely not something you want to sneak into a minor update.

The other option would be to extend XKB-the-protocol with a request that passes a full keymap to the server. Given the inertia involved and that the server won't see more full releases, this is not going to happen.

So as a summary: if you want custom keymaps on your machine, switch to Wayland (and/or fix any remaining issues preventing you from doing so) instead of hoping this will ever work on X. xmodmap will remain your only solution for X.

[1] Geometry is so pointless that libxkbcommon doesn't even implement it. It is a complex format for rendering a picture of your keyboard, but it'd be a per-model thing and with evdev everyone is using the same model, so ...
[2] totally not coincidental btw
[3] libxkbcommon has been around for a decade now and no-one has volunteered to do this in the years since, so...

04 Sep 2020 12:39am GMT

Mike Blumenkrantz: More Map Gains

The Optimizations Continue

Optimizing transfer_map is one of the first issues I created, and it's definitely one of the most important, at least as it pertains to unit tests. So many unit tests perform reads on buffers that it's crucial to ensure no unnecessary flushing or stalling is happening here.

Today, I've made further strides in this direction for piglit's spec@glsl-1.30@execution@texelfetch fs sampler2d 1x281-501x281:

before

MESA_LOADER_DRIVER_OVERRIDE=zink bin/texelFetch 4.41s user 1.92s system 71% cpu 8.801 total

after

MESA_LOADER_DRIVER_OVERRIDE=zink bin/texelFetch 4.22s user 1.72s system 76% cpu 7.749 total

More Speed Loops

As part of ensuring test coherency, a lot of explicit fencing was added around transfer_map and transfer_unmap. Part of this was to work around the lack of good barrier usage, and part of it was just to make sure that everything was properly synchronized.

An especially non-performant case of this was in transfer_unmap, where I added a fence to block after the buffer was unmapped. In reality, this makes no sense other than for the case of synchronizing with the compute batch, which is a bit detached from everything else.

The reason this fence continued to be needed comes down to barrier usage for descriptors. At the start, all image resources used for descriptors just used VK_IMAGE_LAYOUT_GENERAL for the barrier layout, which is fine and good; certainly this is spec compliant. It doesn't, however, optimally inform the underlying driver about the usage in the case where the resource was previously used with VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL to copy data back from a staging image.

Instead, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL can be used for sampler images since they're read-only. It differs from VK_IMAGE_LAYOUT_GENERAL in that VK_IMAGE_LAYOUT_GENERAL permits both reads and writes, which is not what's actually going on here given that sampler images are read-only.
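For illustration, the transition in question looks roughly like this in the Vulkan API (a sketch, not zink's actual code; cmdbuf and image are assumed to be in scope):

VkImageMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
    .oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
    .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .image = image,
    .subresourceRange = {
        .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
        .levelCount = 1,
        .layerCount = 1,
    },
};
/* the staging copy must complete before any shader samples the image */
vkCmdPipelineBarrier(cmdbuf,
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                     0, 0, NULL, 0, NULL, 1, &barrier);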

No Other Results

I'd planned to post more timediffs from some piglit results, but I hit some weird i915 bugs and so the tests have yet to complete after ~14 hours. More to come next week.

04 Sep 2020 12:00am GMT

03 Sep 2020

feedplanet.freedesktop.org

Mike Blumenkrantz: Buffer Ranging

A Quick Optimization

As I mentioned last week, I'm turning a small part of my attention now to doing some performance improvements. One of the low-hanging fruits here is adding buffer ranges; in short, this means that for buffer resources (i.e., not images), the driver tracks which ranges of the buffer have had data written to them, which allows it to avoid GPU stalls when reading or writing a range of the buffer that's known to contain no data.

util_range

The util_range API in gallium is extraordinarily simple. Here are the key parts:

struct util_range {
   unsigned start; /* inclusive */
   unsigned end; /* exclusive */

   /* for the range to be consistent with multiple contexts: */
   simple_mtx_t write_mutex;
};

/* This is like a union of two sets. */
static inline void
util_range_add(struct pipe_resource *resource, struct util_range *range,
               unsigned start, unsigned end)
{
   if (start < range->start || end > range->end) {
      if (resource->flags & PIPE_RESOURCE_FLAG_SINGLE_THREAD_USE) {
         range->start = MIN2(start, range->start);
         range->end = MAX2(end, range->end);
      } else {
         simple_mtx_lock(&range->write_mutex);
         range->start = MIN2(start, range->start);
         range->end = MAX2(end, range->end);
         simple_mtx_unlock(&range->write_mutex);
      }
   }
}

static inline boolean
util_ranges_intersect(const struct util_range *range,
                      unsigned start, unsigned end)
{
   return MAX2(start, range->start) < MIN2(end, range->end);
}

When the driver writes to the buffer, util_range_add() is called with the range being modified. When the user then tries to map the buffer using a specified range, util_ranges_intersect() is called with the range being mapped to determine if a stall is needed or if it can be skipped.
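As a sketch of how a driver might tie these together (the buffer struct, field names and stall function here are hypothetical, invented for illustration):

/* write path: record that [offset, offset + size) now holds data */
util_range_add(&buf->base, &buf->valid_buffer_range, offset, offset + size);

/* map path: only stall if the mapped window overlaps written data */
if (util_ranges_intersect(&buf->valid_buffer_range, offset, offset + size))
    wait_for_gpu(buf); /* hypothetical driver-specific stall */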

Where do these ranges get added for the buffer?

Lots of places. Broadly, any path through which the buffer's contents can be written:

- transfer_map calls with write access, and the corresponding buffer_subdata uploads
- being the destination of a buffer copy or blit
- being bound as a streamout (transform feedback) target
- being written from shaders, e.g. as an SSBO or shader image

Some Numbers

The tests are still running to see what happens here, but after rebasing my branch yesterday for the first time in a while and running the tests again overnight, I found some interesting improvements using my piglit timediff display:

piglit-dmat.png

I haven't made any changes to this codepath myself (that I'm aware of), so it looks like I've pulled in some improvement that's massively cutting down the time required for the codepath that handles implicit 32bit -> 64bit conversions, and timediff picked it up.

03 Sep 2020 12:00am GMT

01 Sep 2020

feedplanet.freedesktop.org

Mike Blumenkrantz: Test Xzibiting

Benchmarking vs Testing

Just a quick post for today to talk about a small project I undertook this morning.

As I've talked about repeatedly, I run a lot of tests.

It's practically all I do when I'm not dumping my random wip code by the truckload into the repo.

The problem with this approach is that it doesn't leave much time for performance improvements. How can I possibly have time to run benchmarks if I'm already spending all my time running tests?

Motivation

As one of the greatest bench markers of all time once said, "If you want to turn a vision into reality, you have to give 100% and never stop believing in your dream."

Thus I decided to look for those gains while I ran tests.

Piglit already possesses facilities for providing HTML summaries of all tests (sidebar: the results that I posted last week are actually a bit misleading since they included a number of regressions that have since been resolved, but I forgot to update the result set). This functionality includes providing the elapsed time for each test. It has not, however, included any way of displaying changes in time between result sets.

Until now.

The Future Is Here

With my latest MR, the HTML view can show off those big gains (or losses) that have been quietly accumulating through all these high volume sets of unit tests:

piglit-timing.png
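For reference, this view is produced by handing piglit two or more result sets to compare, along the lines of (paths are placeholders):


$ ./piglit summary html ./summary-dir results/before results/after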

Is it more generally useful to other drivers?

Maybe? It'll pick up any unexpected performance regressions (inverse gains) too, and it doesn't need any extra work to display, so it's an easy check, at the least.

01 Sep 2020 12:00am GMT