14 Jan 2021

planet.freedesktop.org

Alyssa Rosenzweig: Desktop OpenGL 3.1 on Mali GPUs with Panfrost

The open source Panfrost driver for Arm Mali Midgard and Bifrost GPUs now provides non-conformant OpenGL ES 3.0 on Bifrost and desktop OpenGL 3.1 on Midgard (Mali T760 and newer) and Bifrost, in time for Mesa's first release of 2021. This follows the OpenGL ES 3.0 support on Midgard that landed over the summer, as well as the initial OpenGL ES 2.0 support that recently debuted for Bifrost. OpenGL ES 3.0 is now tested on Mali G52 in Mesa's continuous integration, achieving a 99.9% pass rate on the corresponding drawElements Quality Program tests.

Architecturally, Bifrost shares most of its fixed-function data structures with Midgard, but features a brand new instruction set. Our work for bringing up OpenGL ES 3.0 on Bifrost reflects this division. Some fixed-function features, like instancing and transform feedback, worked without any Bifrost-specific changes since we already did bring-up on Midgard. Other shader features, like uniform buffer objects, required "from scratch" implementations in the Bifrost compiler, a task facilitated by the compiler's maturing intermediate representation with first-class builder support. Yet other features like multiple render targets required some Bifrost-specific code while leveraging other code shared with Midgard. All in all, the work progressed much more quickly the second time around, a testament to the power of code sharing. But there is no need to limit sharing to just Panfrost GPUs; open source drivers can share code across vendors.

Indeed, since Mali is an embedded GPU, the proprietary driver only exposes OpenGL ES, not desktop OpenGL. However, desktop OpenGL 3.1 support comes nearly "for free" for us as an upstream Mesa driver by leveraging common infrastructure. This milestone shows the technical advantage of open source development: compared to layered implementations of desktop GL like gl4es or Zink, Panfrost's desktop OpenGL support is native, reducing CPU overhead. Furthermore, applications can make use of the hardware's hidden features, like explicit primitive restart indices, alpha testing, and quadrilaterals. Although these features could be emulated, the native solutions are more efficient.
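For instance, desktop GL lets an application pick its own primitive restart index, where OpenGL ES 3.0 only offers the fixed all-ones index. A sketch in standard GL calls (nothing Panfrost-specific here, and it assumes Mesa-style GL headers):

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

/* Desktop GL 3.1 exposes an explicit restart index; GLES 3.0 only has
 * GL_PRIMITIVE_RESTART_FIXED_INDEX with the all-ones value. */
static void use_explicit_restart_index(GLuint index)
{
    glEnable(GL_PRIMITIVE_RESTART);
    glPrimitiveRestartIndex(index);
}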

Mesa's shared code also extends to OpenCL support via Clover. Once a driver supports compute shaders and sufficient compiler features, baseline OpenCL is just a few patches and a bug-fixing spree away. While OpenCL implementations could be layered (for example with clvk), an open source Mesa driver avoids the indirection.

I would like to thank Collaboran Boris Brezillon, who has worked tirelessly to bring OpenGL ES 3.0 support to Bifrost, as well as the prolific Icecream95, who has spearheaded OpenCL and desktop OpenGL support.

Originally posted on Collabora's blog

14 Jan 2021 2:16pm GMT

Mike Blumenkrantz: Overhead

A Quintessential Metric

There's been a lot of talk about driver overhead in the Mesa community as of late, in large part begun by Marek Olšák and his daredevil stunts driving RadeonSI through flaming hoops while juggling chainsaws.

While zink isn't quite at that level yet (and neither am I), there's still some progress being made that I'd like to dig into a bit.

What Is Overhead?

As in all software, overhead is the performance penalty incurred as compared to a baseline measurement. In Mesa, a lot of people know of driver overhead as "Gallium sucks" and/or "A Gallium-based driver is slow", because Gallium does incur some amount of overhead compared to the old-style immediate mode DRI drivers.

While it's true that some performance is lost by using Gallium in this sense, it's also true that the performance gained is much greater. The reason for this is that Gallium is able to batch commands and state changes for every driver using it, allowing redundant calls to be filtered out before they trigger any work in the GPU.
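As an illustration, here is a minimal sketch (not actual Mesa code) of the kind of filtering a Gallium driver gets almost for free, since constant state objects are immutable and can be compared by pointer:

struct blend_state; /* an immutable CSO, created once and rebound many times */

struct context {
    const struct blend_state *blend;
    unsigned dirty; /* bitmask of state to re-emit at the next draw */
};

#define DIRTY_BLEND (1u << 0)

static void bind_blend_state(struct context *ctx, const struct blend_state *cso)
{
    if (ctx->blend == cso)
        return; /* redundant bind: no dirty flag, no GPU work */
    ctx->blend = cso;
    ctx->dirty |= DIRTY_BLEND; /* actual emission is deferred and batched */
}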

It also makes it easier to profile and improve the CPU usage required to handle the state changes emitted by Gallium. Instead of a ton of core Mesa callbacks that need to be handled, each one potentially a no-op that the driver has to analyze and defer, Gallium provides a more cohesive API where each driver hook is a necessary change that must be handled. Because of this, the job of optimizing for those changes is simplified.

How Can Overhead Be Measured?

Other than the obvious method of running apps on a driver and checking the fps counter, piglit provides a facility for this: the drawoverhead test. This test has over a hundred subtests which perform sequences of draw operations with various state changes, each with its own result relative to a baseline, enabling a developer to easily profile and optimize a given codepath.
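The core of each subtest is just a timed loop; a rough sketch (not piglit's actual code) of the measurement looks something like this:

#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* draw_fn is one subtest body, e.g. "change one texture, then draw" */
static double thousands_draws_per_sec(void (*draw_fn)(void), int n)
{
    double t0 = now_sec();
    for (int i = 0; i < n; i++)
        draw_fn();
    /* a real test also flushes so queued GPU work is accounted for */
    return n / (now_sec() - t0) / 1000.0;
}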

How Is Zink Doing Here?

To answer this, let's look at some preliminary results from zink in master, the code which will soon be shipping in Mesa 21.0.0. All numbers here are, in contrast to my usual benchmarking, done on an AMD 5700XT GPU. More on this later.

ZINK: MASTER

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  818, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  686, 83.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  411, 50.3%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  232, 28.4%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  258, 31.5%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             87, 10.7%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             162, 19.9%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 150, 18.3%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                120, 14.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     192, 23.5%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    146, 17.9%

After this point, the test aborts because shader images are not yet implemented, but it's enough for a baseline.

These numbers are…not great. Primarily, at least to start, I'll be focusing on the first row where zink is performing 818,000 draws per second.

Let's check out some performance from zink-wip (20201230 snapshot), specifically with GALLIUM_THREAD=0 set to disable threaded context. This means I'm adding in descriptor caching and unlimited command buffer counts (vs forcing a stall after every submit from the 4th one onwards to reset a batch):

ZINK: WIP (CACHED, NO THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  766, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  633, 82.6%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  407, 53.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  500, 65.3%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  449, 58.6%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             85, 11.2%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             235, 30.7%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 159, 20.8%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                128, 16.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     179, 23.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    139, 18.2%

This is actually worse for a lot of cases!

But why is that?

It turns out that in the base draw case, descriptor caching and extra command buffers don't pay off without threaded context. There are sizable gains in the baseline texture cases (+100% or so each) and for vertex attribute changes (+50%), but fundamentally the driver's overhead seems higher.

What happens if threading is enabled though?

ZINK: WIP (CACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5206, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5149, 98.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5187, 99.6%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5210, 100.1%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 4684, 90.0%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            137, 2.6%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             252, 4.8%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 243, 4.7%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                222, 4.3%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     213, 4.1%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    208, 4.0%

blink.gif

Indeed, threading yields almost a 700% performance improvement for the baseline cases. It turns out that synchronously performing expensive tasks like computing hash values for descriptor sets is bad. Who could have guessed.

State Changes

Looking at the other values, however, is a bit more pertinent for the purpose of this post. Overhead is incurred when state changes are triggered by descriptors being changed, and this is much closer to a real world scenario (i.e., gaming) than simply running draw calls with no changes. Caching yields roughly a 50% performance improvement for this case.
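To make the trade-off concrete, here is a minimal sketch of the idea behind descriptor caching, using hypothetical names rather than zink's actual code: hash the bound descriptor state, and on a hit reuse a set that was already allocated and written, at the cost of hashing on every draw.

#include <stdint.h>
#include <string.h>

typedef uint64_t descriptor_set_t; /* stand-in for a VkDescriptorSet handle */

struct desc_key {
    uint64_t bindings[8]; /* handles of the resources bound to one set */
};

struct cache_entry {
    struct desc_key key;
    descriptor_set_t set;
    int valid;
};

#define CACHE_SIZE 256
static struct cache_entry cache[CACHE_SIZE];

/* FNV-1a over the key: cheap, but still per-draw CPU work even when
 * nothing changed, which is exactly the synchronous cost threading hides */
static uint64_t hash_key(const struct desc_key *k)
{
    uint64_t h = 0xcbf29ce484222325ull;
    const unsigned char *p = (const unsigned char *)k;
    for (size_t i = 0; i < sizeof(*k); i++)
        h = (h ^ p[i]) * 0x100000001b3ull;
    return h;
}

descriptor_set_t lookup_or_create(const struct desc_key *key,
                                  descriptor_set_t (*create)(const struct desc_key *))
{
    struct cache_entry *e = &cache[hash_key(key) % CACHE_SIZE];
    if (e->valid && !memcmp(&e->key, key, sizeof(*key)))
        return e->set; /* hit: skip allocation and descriptor writes */
    e->key = *key;     /* miss (or eviction): build a new set */
    e->set = create(key);
    e->valid = 1;
    return e->set;
}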

Further Improvements

As I'd mentioned previously, I'm doing some work now on descriptor management with an aim to further lower this overhead. Let's see what that looks like.

ZINK: TEST (UNCACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5426, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5423, 99.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5432, 100.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5246, 96.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 5177, 95.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            153, 2.8%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             229, 4.2%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 247, 4.6%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                228, 4.2%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     237, 4.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    223, 4.1%

While there's a small (~4%) improvement for the baseline numbers, what's much more interesting is the values where descriptor states are changed. They are, in fact, about as good or even slightly better than the caching version of descriptor management.

This is huge. Specifically, it's huge because it means that I can likely port some of the techniques used in this approach over to the cached version in order to drive further reductions in overhead.

Closing Remarks

Before I go, let's check out some numbers from a real driver. Specifically, RadeonSI: the pinnacle of Gallium-based drivers.

RADEONSI

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 6221, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 6261, 100.7%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 6236, 100.2%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 6263, 100.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 6243, 100.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            217, 3.5%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,            1467, 23.6%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 374, 6.0%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                218, 3.5%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     680, 10.9%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    318, 5.1%

Yikes. Especially intimidating here is the relative performance for vertex attribute changes, where RadeonSI retains almost 25% of its baseline performance while zink doesn't even manage 5%.

Hopefully these figures get closer to each other in the future, but this just shows that there's still a long way to go.

14 Jan 2021 12:00am GMT

13 Jan 2021

planet.freedesktop.org

Peter Hutterer: Parsing HID Unit Items

This post explains how to parse the HID Unit Global Item as explained by the HID Specification, page 37. The table there is quite confusing and it took me a while to fully understand it (Benjamin Tissoires was really the one who cracked it). I couldn't find any better explanation online which means either I'm incredibly dense and everyone's figured it out or no-one has posted a better explanation. On the off-chance it's the latter [1], here are the instructions on how to parse this item.

We know a HID Report Descriptor consists of a number of items that describe the content of each HID Report (read: an event from a device). These Items include things like Logical Minimum/Maximum for axis ranges, etc. A HID Unit item specifies the physical unit to apply. For example, a Report Descriptor may specify that X and Y axes are in mm which can be quite useful for all the obvious reasons.

Like most HID items, a HID Unit Item consists of a one-byte item tag and 1, 2 or 4 byte payload. The Unit item in the Report Descriptor itself has the binary value 0110 01nn where the nn is either 1, 2, or 3 indicating 1, 2 or 4 bytes of payload, respectively. That's standard HID.

The payload is divided into nibbles (4-bit units) and goes from LSB to MSB. The lowest-order 4 bits (first byte & 0xf) define the unit System to apply: one of SI Linear, SI Rotation, English Linear or English Rotation (well, or None/Reserved). The rest of the nibbles are in this order: "length", "mass", "time", "temperature", "current", "luminous intensity". In something resembling code this means:


system = value & 0xf
length_exponent = (value & 0xf0) >> 4
mass_exponent = (value & 0xf00) >> 8
time_exponent = (value & 0xf000) >> 12
...

The System defines which unit is used for length (e.g. SILinear means length is in cm). The actual value of each nibble is the exponent for the unit in use [2]. In something resembling code:


switch (system)
case SILinear:
print("length is in cm^{length_exponent}");
break;
case SIRotation:
print("length is in rad^{length_exponent}");
break;
case EnglishLinear:
print("length is in in^{length_exponent}");
break;
case EnglishRotation:
print("length is in deg^{length_exponent}");
break;
case None:
case Reserved:
print("boo!");
break;

For example, the value 0x321 means "SI Linear" (0x1) so the remaining nibbles represent, in ascending nibble order: Centimeters, Grams, Seconds, Kelvin, Ampere, Candela. The length nibble has a value of 0x2 so it's square cm, the mass nibble has a value of 0x3 so it is cubic grams (well, it's just an example, so...). This means that any report containing this item comes in cm²g³. As a more realistic example: 0xF011 would be cm/s.

If we changed the lowest nibble to English Rotation (0x4), i.e. our value is now 0x324, the units represent: Degrees, Slug, Seconds, F, Ampere, Candela [3]. The length nibble 0x2 means square degrees, the mass nibble is cubic slugs. As a more realistic example, 0xF014 would be degrees/s.

Any nibble with value 0 means the unit isn't in use, so the example from the spec with value 0x00F0D121 is SI linear, units cm² g s⁻³ A⁻¹, which is... Voltage! Of course you knew that and totally didn't have to double-check with wikipedia.
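Putting the pieces together in real C (a sketch of my own, not from any HID library), here is the spec's 0x00F0D121 example decoded, including the two's complement handling from [2]:

#include <stdint.h>
#include <stdio.h>

/* Each nibble is a signed 4-bit (two's complement) exponent:
 * 0x1-0x7 are 1 to 7, 0x8-0xf are -8 to -1 */
static int nibble_exponent(uint32_t unit, unsigned nibble)
{
    int v = (unit >> (nibble * 4)) & 0xf;
    return v >= 8 ? v - 16 : v;
}

int main(void)
{
    uint32_t unit = 0x00F0D121; /* the Voltage example */
    static const char *const dims[] = {
        "length", "mass", "time", "temperature", "current", "luminous intensity",
    };

    printf("system: 0x%x\n", unit & 0xf); /* 0x1 == SI Linear */
    for (unsigned i = 0; i < 6; i++) {
        int exp = nibble_exponent(unit, i + 1);
        if (exp) /* zero means the dimension is unused */
            printf("%s exponent: %d\n", dims[i], exp);
    }
    return 0; /* prints cm^2, g, s^-3, A^-1 */
}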

Because bits are expensive and the base units are of course either too big or too small or otherwise not quite right, HID also provides a Unit Exponent item. The Unit Exponent item (a separate item from Unit in the Report Descriptor) then describes the exponent to be applied to the actual value in the report. For example, a Unit Exponent of -3 means 10⁻³ is to be applied to the value. If the report descriptor specifies an item of Unit 0x00F0D121 (i.e. V) and Unit Exponent -3, the value of this item is mV (millivolt); a Unit Exponent of 3 would make it kV (kilovolt).
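In code, applying the exponent is a one-liner (apply_unit_exponent is a made-up name):

#include <math.h>

/* Scale a raw report value into the unit given by the Unit item.
 * With Unit 0x00F0D121 (V) and Unit Exponent -3, raw values count
 * millivolts, so raw * 10^-3 yields Volts. */
static double apply_unit_exponent(double raw, int unit_exponent)
{
    return raw * pow(10.0, unit_exponent);
}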

Now, in hindsight all this is pretty obvious and maybe even sensible. It'd have been nice if the spec would've explained it a bit clearer but then I would have nothing to write about, so I guess overall I call it a draw.

[1] This whole adventure was started because there's a touchpad out there that measures touch pressure in radians, so at least one other person out there struggled with the docs...
[2] The nibble value is twos complement (i.e. it's a signed 4-bit integer). Values 0x1-0x7 are exponents 1 to 7, values 0x8-0xf are exponents -8 to -1.
[3] English Linear should've trolled everyone and used Centimetres instead of the Centimeters in SI Linear.

13 Jan 2021 11:28am GMT

10 Nov 2011

Planet GNOME

Brad Taylor: Reports of Snowy’s death have been greatly exaggerated

Browsing foundation-list recently, I was honored to see Snowy (and Tomboy Online) hosting mentioned as one of the GNOME CEO goals (scroll to the bottom) for 2010! Unfortunately, the pace of Snowy's development has slowed in the last few months, due in part to both Sandy's schedule and mine. Despite that, we wouldn't want Stormy to get a bad reputation because of our slacking, so we're going to change that.

We're hosting an IRC meeting in the #snowy channel on irc.gimp.net on Saturday, 23 Jan 2010 at 11:00 AM EDT (16:00 GMT, other time zones) to get ourselves organized, and to recruit your help.

So, if you are a graphic designer that wants to help beautify an awesome open source project, if you're a hacker who knows or wants to learn Django, or even if you're just interested in Snowy, stop on by!

See you there!

10 Nov 2011 4:55am GMT

Brad Taylor: Mono Accessibility 2.0 unleashed!

Today, I'm proud to announce the 2.0 release of the Mono Accessibility project. Spanning a year of intensive work and fixing over 500 bugs, this is truly our best release ever.

This release enables all types of users to access System.Windows.Forms and Silverlight applications from Linux using Orca and other ATK-based Assistive Technologies (ATs), as well as access Linux applications from UI Automation (UIA) based ATs.

What's changed since version 1.0?

Added:

What is Mono Accessibility:

The Mono Accessibility project enables Winforms and Silverlight applications to be fully accessible on Linux, and allows Assistive Technologies (ATs) like screen readers and test automation tools that depend on UI Automation APIs to work on Linux.

Mono Accessibility is released under the MIT/X11 license.

Get it!

Mono Accessibility is available for a variety of Linux distributions, including:

A Note About at-spi2

Accessing GTK+ applications with the UIA Client API requires the most recent development version of the new dbus-based at-spi2, which is known to cause system instability.

In Fedora, at-spi2 repeatedly causes GDM to segfault. If you do not need this feature, do not install the latest at-spi2 and atk, or our packages which depend on them, which are at-spi-sharp and AtspiUiaSource.

We are working hard to identify these issues and hope to aid the GNOME Accessibility Team in stabilizing at-spi2 in the near future.

Find out more

Navigate to our homepage for all the latest information, and ways to contact us.

10 Nov 2011 4:55am GMT

Calum Benson: What’s new on the Solaris 11 desktop?

This entry is cross-posted from my Oracle blog… clearly, seasoned GNOME blog readers will be less excited about GNOME 2.30, compiz and Firefox 6 than my audience over there, many of whom have been using GNOME 2.6 on Solaris 10 for the past 7 years :)

Much has been written about the enterprise and cloud features of Oracle Solaris 11, which launched today, but what's new for those of us who just like to have the robustness and security of Solaris on our desktop machines? Here are a few of the Solaris 11 desktop highlights:

Solaris 11 is free to download and use for most non-commercial purposes (but IANAL, so do check the OTN License Agreement on the download page first - it's short and sweet, as these things go), and you can download various flavours, including a Live CD and a USB install image, right here.

10 Nov 2011 12:08am GMT