24 Nov 2020

feedLXer Linux News

Portwell and Congatec spin Elkhart Lake modules in multiple form factors

Portwell unveiled a "PQ7-M109" Qseven module with Intel's Atom x-6000. Congatec recently announced x6000 modules in Qseven (Conga-QA7), SMARC, (Conga-SA7), Mini Type 10 (Conga-MA7), and Compact Type 6 (Conga-TCA7) form factors. Portwell has announced the PQ7-M109, its first product based on Intel's 10nm fabricated Elkhart Lake family of low-power system-on-chips, which includes several Atom x-6000, […]

24 Nov 2020 10:39pm GMT

How to Install Gibbon LMS on Ubuntu 20.04

Gibbon is a free and open-source school management system specially designed for teachers, students, parents and leaders. It helps teachers to find, contact and help their students.

24 Nov 2020 8:56pm GMT

Linus Torvalds would like to use an M1 Mac for Linux, but…

Yes, Torvalds said he[he]#039[/he]d love to have one of the new M1-powered Apple laptops, but it won[he]#039[/he]t run Linux and, in an exclusive interview he explains why getting Linux to run well on it isn[he]#039[/he]t worth the trouble.

24 Nov 2020 7:13pm GMT

How to Install Arch Linux Beginner's Guide

This beginner's guide explains the steps on how to install Arch Linux - in the easiest and friendly way.

24 Nov 2020 5:30pm GMT

How to create a Cloudwatch Event Rule in AWS

A near-real-time stream of system events that describe changes in AWS resources is delivered by CloudWatch Events. We can create a rule that matches events and route them to one or more target functions.

24 Nov 2020 3:47pm GMT

Build a motion detection system with a Raspberry Pi

If you want a home security system to tell you if someone is lurking around your property, you don[he]#039[/he]t need an expensive, proprietary solution from a third-party vendor. You can set up your own system using a Raspberry Pi, a passive infrared (PIR) motion sensor, and an LTE modem that will send SMS messages whenever it detects movement.

24 Nov 2020 2:04pm GMT

Raspberry Pi automation add-on offers ADCs, DIDO, and UPS

Edgedevices.io's second-gen "Pi-oT 2" Raspberry Pi automation add-on offers 8x ADCs, 6x digital outputs, and Ethernet plus options including a 2-hour UPS, RS485, 4x 24V digital inputs, and a 12-24VDC input. Last year, Cleveland-based Pi-oT, now called Edgedevices.io, launched a Kickstarter campaign for a Pi-oT industrial controller add-on for the Raspberry Pi that is housed […]

24 Nov 2020 12:21pm GMT

How to use Bash file test operators in Linux

File Test Operators are used in Linux to check and verify attributes of files like ownership or if they are a symlink. In this article, you will learn to test files using the if statement followed by some important test operators in Linux.

24 Nov 2020 10:38am GMT

Create a machine learning model with Bash

Machine learning is a powerful computing capability for predicting or forecasting things that conventional algorithms find challenging. The machine learning journey begins with collecting and preparing data-a lot of it-then it builds mathematical models based on that data. While multiple tools can be used for these tasks, I like to use the shell.

24 Nov 2020 8:55am GMT

Linux-driven module and starter kit tap Renesas RZ/G2

TQ's "TQMaRZG2x" module runs Linux on a dual- to octa-core, Cortex-A57 and -A53 based RZ/G2 processor with up to 8GB LPDDR4 and 64GB eMMC plus an optional dev kit and -40 to 85°C support. When reporting on the SMARC 2.0 SoM collaboration between Renesas and RelySys last week featuring Renesas' scalable, 64-bit RZ/G2 processor, we […]

24 Nov 2020 7:12am GMT

Synchronize Files Between Multiple Systems With Syncthing In Linux

This guide explains what is Syncthing, how to install Syncthing on Linux, how to synchronize files between multiple Linux systems.

24 Nov 2020 5:29am GMT

An introduction to Prometheus metrics and performance monitoring

Use Prometheus to gather metrics into usable, actionable entries, giving you the data you need to manage alerts and performance information in your environment.

24 Nov 2020 3:46am GMT

Vulkan Ray Tracing becomes official with Vulkan 1.2.162

The day has arrived, it seems with Vulkan 1.2.162 the Ray Tracing extensions have become part of Vulkan bringing it from provisional status to official.

24 Nov 2020 2:03am GMT

6 predictions for JavaScript build tools

Code used in production is different from development code. In production, you need to build packages that run fast, manage dependencies, automate tasks, load external modules, and more. JavaScript tools that make it possible to turn development code into production code are called build tools.

24 Nov 2020 12:20am GMT

23 Nov 2020

feedLXer Linux News

How to use Nginx to redirect all traffic from http to https

If your website is hosted with NGINX and it has SSL enabled, it's best practice to disable HTTP completely and force all incoming traffic over to the HTTPS version of the website. This avoids having duplicate content and ensures that all of the site's users are only browsing the secure version of your website. You should also see an SEO boost, as search engines prefer non-redundant and secured web pages.

23 Nov 2020 10:38pm GMT

Linus Torvalds worried Linux kernel might get messy around Christmas

LTS release 5.10 is currently unruly and looks like colliding with the holiday season. Linus Torvalds has expressed some worries about progress of version 5.10 of the Linux kernel.…

23 Nov 2020 8:55pm GMT

13 Nov 2020

feedKernel Planet

Dave Airlie (blogspot): lavapipe: a *software* swrast vulkan layer FAQ

(project was renamed from vallium to lavapipe)

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.


What is lavapipe?

The lavapipe layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a lowlevel API, it allows the user to allocate memory, create resources, record command buffers amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
In order to bridge the gap, the lavapipe layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.

That sounds horrible, isn't it slow?


Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow, the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower.
For real HW drivers which are meant to record their own command buffers in the GPU domain and submit them direct to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead and one that can't easily be removed from the lavapipe layer.

The lavapipe execution context is also pretty horrible, it has to connect all the state pieces like shaders etc to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue, one context to be used. A lot of hardware exposes more queues etc that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient driver hw usage. They are one of the most difficult to understand and hard to get right pieces of writing a vulkan driver. For a software rasterizer they are also mostly unneeded. When I get a barrier I just completely hardflush the gallium context because I know the sw driver behind it. For a real hardware driver this would be a horrible solution. You spend a lot of time trying to make anything optimal here.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or expose features for vulkan this can't be fixed with a software layer that just introduces overhead.

13 Nov 2020 2:16am GMT

12 Nov 2020

feedKernel Planet

Dave Airlie (blogspot): Linux graphics, why sharing code with Windows isn't always a win.

A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.

I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.

tl;dr there is a big difference between open source released and open source developed projects in terms of sustainability and community.

The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open source vendor agnostic practices. There is no vendor controlling either project and both projects have a goal of try to maximise shared code and shared processes/coding standards across drivers from all vendors.

This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.

The value to distros is they have central places to pick up driver stacks with good release cycles and a minimal number of places they have to deal with to interact with those communities. Now usually hardware vendors don't see the value in the external communities as much as Linux distros do. From a hardware vendor internal point of view they see more benefit in creating a single stack shared between their Windows and Linux to maximise their return on investment, or make their orgchart prettier or produce less powerpoints about why their orgchart isn't optimal.

A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.

Why is it a bad idea?

I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of open source development model, however most vendors seem to fail at this. They see open source as a release model, they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of cycles. They build products containing these open source pieces, but they never expend the time building projects or communities around them.

As an example take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee. Even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa where it allowed Valve to contribute the ACO backend compiler and provide better results than AMD vendor shared code could ever have done, with far less investement and manpower.

Intel have a non-mesa compiler called Intel Graphics Compiler mentioned in the article. This is fully developed by intel internally, there is little info on project direction or how to get involved or where the community is. There doesn't seem to be much public review, patches seem to get merged to the public repo by igcbot which may mean they are being mirrored from some internal repo. There are not using github merge requests etc. Compare this to development of a Mesa NIR backend where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.

One area where it has mostly sort of worked out what with the AMD display code in the kernel. I believe this code to be shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring was a pain. So even in the best case it's not an optimal outcome even for the vendor. They have to work hard to make the shared code be capable of supporting different OS interactions.

How would I do it?

If I had to share Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example if I needed to create a Windows GL driver, I could:

a) write a complete GL implementation and throw it over the wall every few weeks. and make Windows/Linux use it, Linux users lose out on the shared stack, distros lose out on one dependency instead having to build a stack of multiple per vendor deps, Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).

b) use Mesa and upstream my driver to share with the Linux stack, add the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors and Windows gains that benefit, and Linux retains the benefits to it's ecosystem.

A warning then to anyone wishing for more vendor code sharing between OSes it generally doesn't end with Linux being better off, it ends up with Linux being more fragmented, harder to support and in the long run unsustainable.

12 Nov 2020 12:05am GMT

04 Nov 2020

feedKernel Planet

Brendan Gregg: BPF binaries: BTF, CO-RE, and the future of BPF perf tools

Two new technologies, BTF and CO-RE, are paving the way for BPF to become a billion dollar industry. Right now there are many BPF (eBPF) startups building networking, security, and performance products (and more in stealth), yet requiring customers to install LLVM, Clang, and kernel headers - which can consume over 100 Mbytes of storage - to use BPF is an adoption drag. BTF and CO-RE eliminate these dependencies at runtime, not only making BPF more practical for embedded Linux environments, but for adoption everywhere. These technologies are: - BTF: BPF Type Format, which provides struct information to avoid needing Clang and kernel headers. - CO-RE: BPF Compile-Once Run-Everywhere, which allows compiled BPF bytecode to be relocatable, avoiding the need for recompilation by LLVM. Clang and LLVM are still required for compilation, but the result is a lightweight ELF binary that includes the precompiled BPF bytecode and can be run everywhere. The BCC project has a collection of these, called libbpf tools. As an example, I ported over my opensnoop(8) tool:

# ./opensnoop
PID    COMM              FD ERR PATH
27974  opensnoop         28   0 /etc/localtime
1482   redis-server       7   0 /proc/1482/stat
1657   atlas-system-ag    3   0 /proc/stat

This opensnoop(8) is an ELF binary that doesn't use libLLVM or libclang:

# file opensnoop
opensnoop: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/l, for GNU/Linux 3.2.0, BuildID[sha1]=b4b5320c39e5ad2313e8a371baf5e8241bb4e4ed, with debug_info, not stripped

# ldd opensnoop
        linux-vdso.so.1 (0x00007ffddf3f1000)
        libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f9fb7836000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9fb7619000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9fb7228000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9fb7c76000)

# ls -lh opensnoop opensnoop.stripped
-rwxr-xr-x 1 root root 645K Feb 28 23:18 opensnoop
-rwxr-xr-x 1 root root 151K Feb 28 23:33 opensnoop.stripped

... and stripped is only 151 Kbytes. Now imagine a BPF product: instead of requiring customers install various heavyweight (and brittle) dependencies, a BPF agent may now be a single tiny binary that works on any kernel that has BTF. ## How this works It's not just a matter of saving the BPF bytecode in ELF and then sending it to any other kernel. Many BPF programs walk kernel structs that can change from one kernel version to another. Your BPF bytecode may still execute on different kernels, but it may be reading the wrong struct offsets and printing garbage output! opensnoop(8) doesn't walk kernel structs since it instruments stable tracepoints and their arguments, but many other tools do. This is an issue of *relocation*, and both BTF and CO-RE solve this for BPF binaries. BTF provides type information so that struct offsets and other details can be queried as needed, and CO-RE records which parts of a BPF program need to be rewritten, and how. CO-RE developer Andrii Nakryiko has written long posts explaining this in more depth: [BPF Portability and CO-RE] and [BTF Type Information]. ## CONFIG_DEBUG_INFO_BTF=y These new BPF binaries are only possible if this kernel config option is set. It adds about 1.5 Mbytes to the kernel image (this is tiny in comparison to DWARF debuginfo, which can be hundreds of Mbytes). Ubuntu 20.10 has already made this config option the default, and all other distros should follow. Note to distro maintainers: it requires pahole >= 1.16. ## The future of BPF performance tools, BCC Python, and bpftrace For BPF performance tools, you should start with running [BCC] and [bpftrace] tools, and then coding in bpftrace. The BCC tools should eventually be switched from Python to libbpf C under the hood, but will work the same. **Coding performance tools in BCC Python is now considered deprecated** as we move to libbpf C with BTF and CO-RE (although we still have library work to do, such as for USDT support, so the Python versions will be needed for a while). Note that there are other use cases of BCC that may continue to use the Python interface; both BPF co-maintainer Alexei Starovoitov and myself briefly discussed this on [iovisor-dev]. My [BPF Performance Tools] book focused on running BCC tools and coding in bpftrace, and that doesn't change. However, **Appendix C's Python programming examples are now considered deprecated.** Apologies for the inconvenience. Fortunately it's only 15 pages of appendix material out of the 880-page book. What about bpftrace? It does support BTF, and in the future we're looking at reducing its installation footprint as well (it can currently get to [29 Mbytes], and we think it can go a lot smaller). Given an average libbpf program size of 229 Kbytes (based on the current libbpf tools, stripped), and an average bpftrace program size of 1 Kbyte (my book tools), a large collection of bpftrace tools plus the bpftrace binary may become a smaller installation footprint than the equivalent in libbpf. Plus the bpftrace versions can be modified on the fly. libbpf is better suited for more complex and mature tools that needs custom arguments and libraries. As screenshots, the future of BPF performance tools is this:

# ls /usr/share/bcc/tools /usr/sbin/*.bt
argdist       drsnoop         mdflush         pythongc     tclobjnew
bashreadline  execsnoop       memleak         pythonstat   tclstat
/usr/sbin/bashreadline.bt    /usr/sbin/mdflush.bt    /usr/sbin/tcpaccept.bt
/usr/sbin/biolatency.bt      /usr/sbin/naptime.bt    /usr/sbin/tcpconnect.bt

... and this:

# bpftrace -e 'BEGIN { printf("Hello, World!\n"); }'
Attaching 1 probe...
Hello, World!

... and **not** this:


from bcc import BPF
from bcc.utils import printb

prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;

Thanks to Yonghong Song (Facebook) for leading development of BTF, Andrii Nakryiko (Facebook) for leading development of CO-RE, and everyone else involved in making this happen. [BPF Portability and CO-RE]: https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html [BTF Type Information]: https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html [BPF Performance Tools]: /bpf-performance-tools-book.html [29 Mbytes]: https://github.com/iovisor/bpftrace/issues/342 [iovisor-dev]: https://lists.iovisor.org/g/iovisor-dev/topic/future_of_bcc_python_tools/77827559?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,77827559 [BCC]: https://github.com/iovisor/bcc [bpftrace]: https://github.com/iovisor/bpftrace

04 Nov 2020 8:00am GMT

02 Nov 2020

feedKernel Planet

Michael Kerrisk (manpages): man-pages-5.09 is released

I've released man-pages-5.09. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from more than 40 contributors. The release includes more than 500 commits that changed nearly 600 pages. Nine new pages were added in this release.

The most notable of the changes in man-pages-5.09 are the following:

As is probably clear, Alejandro Colomar owns this release. With 265 commits, he was by some margin the top contributor, and I'm very happy to report that he beat me into second place as a contributor to this release (something that happened only once before since I became maintainer).

02 Nov 2020 5:55am GMT

30 Oct 2020

feedKernel Planet

Dave Airlie (blogspot): llvmpipe is OpenGL 4.5 conformant.

(I just sent the below email to mesa3d developer list).

Just to let everyone know, a month ago I submitted the 20.2 llvmpipe
driver for OpenGL 4.5 conformance under the SPI/X.org umbrella, and it
is now official[1].

Thanks to everyone who helped me drive this forward, and to all the
contributors both to llvmpipe and the general Mesa stack that enabled

Big shout out to Roland Scheidegger for helping review the mountain of
patches I produced in this effort.

My next plans involved submitting lavapipe for Vulkan 1.0, it's at 99%
or so CTS, but there are line drawing, sampler accuracy and some snorm
blending failure I have to work out.
I also ran the OpenCL 3.0 conformance suite against clover/llvmpipe
yesterday and have some vague hopes of driving that to some sort of

(for GL 4.6 only texture anisotropy is really missing, I've got
patches for SPIR-V support, in case someone was feeling adventurous).


[1] https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_272

30 Oct 2020 8:25pm GMT

Andy Grover: Upgrading to Fedora 33: Removing Your Old Swap File on EFI Machine

Fedora 33 adds a compressed-memory-based swap device using zram. Cool! Now you can remove your old swap device, if you were a curmudgeon like me and even had one in the first place.

If you are NOT on an EFI system or not using LVM, be aware of this and make changes to these steps as needed. (Specifically, the path given in step 6 will be different.)

  1. After upgrading to Fedora 33, run free. Notice that swap size is the sum of the 4G zram device plus your previous disk-based swap device. Try zramctl and lsblk commands for more info.
  2. Stop swapping to the swap device we're about to remove. If using LVM, expect the VG and LV names to be different.
    swapoff /dev/vg0/swap
  3. If LVM, remove the no-longer-needed logical volume.
    lvremove /dev/vg0/swap
  4. Edit /etc/fstab and remove (or comment out) the line for your swap device.
  5. Edit /etc/default/grub.
    In the GRUB_CMDLINE_LINUX line, remove the "resume=" part referring to the now-gone swap partition, and the "rd.lvm.lv=" part that also refers to it.
  6. Apply above changes to actual GRUB configuration:
    grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

Reboot and your system should come back up. Enjoy using that reclaimed disk space for more useful things.

30 Oct 2020 7:01pm GMT

29 Oct 2020

feedKernel Planet

Paul E. Mc Kenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Parts IV and V

Continuing further into the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file uncovers RCU's final two fundamental guarantees:

  1. The common-case RCU primitives are unconditional, and
  2. RCU users can perform a guaranteed read-to-write upgrade.

The first guarantee is trivially verified by inspection of the RCU API. The type of rcu_read_lock(), rcu_read_unlock(), synchronize_rcu(), call_rcu(), and rcu_assign_pointer() are all void. These API members therefore have no way to indicate failure. Even primitives like rcu_dereference(), which do have non-void return types, will succeed any time a load of their pointer argument would succeed. That is, if you do rcu_dereference(*foop), where foop is a NULL pointer, then yes, you will get a segmentation fault. But this segmentation fault will be unconditional, as advertised!

The second guarantee is a consequence of the first four guarantees, and must be tested not within RCU itself, but rather within the code using RCU to carry out the read-to-write upgrade.

Thus for these last two fundamental guarantees there is no code in rcutorture. So perhaps even rcutorture deserves a break from time to time! ;-)

29 Oct 2020 11:27pm GMT

Paul E. Mc Kenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Part III

Even more reading of the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file encounters RCU's memory-barrier guarantees. These guarantees are a bit ornate, but roughly speaking guarantee that RCU read-side critical sections lapping over one end of a given grace period are fully ordered with anything past the other end of that same grace period. RCU's overall approach towards this guarantee is shown in the Linux-kernel Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst file, so one approach would be to argue that these guarantees are proven by a combination of this documentation along with periodic code inspection. Although this approach works well for some properties, the periodic code inspections require great attention to detail spanning a large quantity of intricate code. As such, these inspections are all too vulnerable to human error.

Another approach is formal verification, and in fact RCU's guarantees have been formally verified. Unfortunately, these formal-verification efforts, groundbreaking though they are, must be considered to be one-off tours de force. In contrast, RCU needs regular regression testing.

This leaves rcutorture, which has the advantage of being tireless and reasonably thorough, especially when compared to human beings. Except that rcutorture does not currently test RCU's memory-barrier guarantees.

Or at least it did not until today.

A new commit on the -rcu tree enlists the existing RCU readers. Each reader frequently increments a free-running counter, which can then be used to check memory ordering: If the counter appears to have counted backwards, something is broken. Each reader samples and records a randomly selected reader's counter, and assigns some other randomly selected reader to check for backwardsness. A flag is set at the end of each grace period, and once this flag is set, that other reader takes another sample of that same counter and compares them.

Of course, the reality is a bit more involved, and probably will become even more involved as review and testing proceeds. But in the meantime, the interested reader can find the initial state of this rcutorture enhancement here.

The test strategy for this particular fundamental property of RCU is more complex and likely less effective than the memory-ordering property described earlier, but life is like that sometimes.

29 Oct 2020 10:47pm GMT

14 Oct 2020

feedKernel Planet

Paul E. Mc Kenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Part II

Further reading of the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file encounters RCU's publish/subscribe guarantee. This guarantee ensures that RCU readers that traverse a newly inserted element of an RCU-protected data structure never see pre-initialization garbage in that element. In CONFIG_PREEMPT_NONE=y kernels, this guarantee combined with the grace-period guarantee permits RCU readers to traverse RCU-protected data structures using exactly the same sequence of instructions that would be used if these data structures were immutable. As always, free is a very good price!

However, some care is required to make use of this publish-subscribe guarantee. When inserting a new element, updaters must take care to first initialize everything that RCU readers might access and only then use an RCU primitive to carry out the insertion. Such primitives include rcu_assign_pointer() and list_add_rcu(), but please see The RCU API, 2019 edition or the Linux-kernel source code for the full list.

For their part, readers must use an RCU primitive to carry out their traversals, for example, rcu_dereference() or list_for_each_entry_rcu(). Again, please see The RCU API, 2019 edition or the Linux-kernel source code for the full list of such primitives.

Of course, rcutorture needs to test this publish/subscribe guarantee. It does this using yet another field in the rcu_torture structure:

struct rcu_torture {
  struct rcu_head rtort_rcu;
  int rtort_pipe_count;
  struct list_head rtort_free;
  int rtort_mbtest;

This additional field is ->rtort_mbtest, which is set to zero when a given rcu_torture structure is freed for reuse (see the rcu_torture_pipe_update_one() function), and then set to 1 just before that structure is made available to readers (see the rcu_torture_writer() function). For its part, the rcu_torture_one_read() function checks to see if this field is zero, and if so flags the error by atomically incrementing the global n_rcu_torture_mberror counter. As you would expect, any run ending with a non-zero value in this counter is considered to be a failure.

Thus we have an important fundamental property of RCU that nevertheless happens to have a simple but effective test strategy. To the best of my knowledge, this was also the first aspect of Linux-kernel RCU that was subjected to an automated proof of correctness.

Sometimes you get lucky! ;-)

14 Oct 2020 11:16pm GMT

12 Oct 2020

feedKernel Planet

Linux Plumbers Conference: LPC 2020 Survey Results

We had 185 responses to the Linux Plumbers survey in 2020, which, given the total number of conference registrants of 809, has provided confidence in the feedback. Given that the event was held virtually this year, it's encouraging to see the community remaining engaged. So we are pleased to offer an especially heartfelt "thank you" to everyone who participated in this survey!

98.4% of respondents were positive or neutral about the event, with only 1.6% indicating they were dissatisfied. Given the fact we had to shift the event to be online this year, that is a very encouraging result. Co-location with the Kernel Summit continues to prove popular (67.5% considered it helpful/very helpful), and the first time introduction of the GNU Tools track was very well received with 68% of the respondents considering it helpful/very helpful as well. One thing we were a bit worried about is whether the online format would enable discussions to help resolve problems 73% found them useful, which compared to most online events was a great result.

The BOF track was very popular and we're looking to include this again in 2021. Conference participation was up from 2019 and even though we increased the capacity to 810, we sold out of regular tickets again. Given that the participants adhered to the guidelines for online we didn't bump into the capacity limits we were worried about, so are considering raising the cap next year if we need to be virtual. From the survey, the overwhelming majority of attendees prefer us to try to hold the conference in person, with a fall back to virtual. With this in mind, we're working with the Linux Foundation events team to identify options in Dublin for a hybrid event, but may fall back to be entirely online.

Based on the fact we sold out, we live-streamed and videotaped all of the sessions. All the live streams are available for playback now on our YouTube channel. There are over 120 hours of video for 2020 already and we are adding more. The committee is in the process of re-rendering them and linking them to the detailed schedule. The Microconferences are recorded as one long video block, but clicking on the video link of a particular discussion topic will take you to the time index in that file where the chosen discussion begins. The recorded BoFs will also be posted soon.

In terms of track feedback, Linux Plumbers Refereed track and Kernel Summit track were indicated as very relevant by almost all respondents who attended. The BOFs track was positively received and will continue. The hallway track continues to be regarded as very important and appreciated. Based on the feedback, if we have to be virtual again, we will look at options of making more hack rooms available, as they were well received for follow on conversations. If we are able to meet in person, we will evaluate options for making private meeting rooms available for groups who need to meet onsite.

The emails from the committee continue to be positively received as was our new website. There were some excellent suggestions in this year's write-in comments, we'll be looking into options to incorporate. In particular, because were online, it was possible for a more people to join us who would not have been able to get travel funding or visas approved.

There were lots of great suggestions to the "what one thing would you like to see changed" question, and the program committee has been studying them to see what is possible to implement this year. Thank you again to the participants for their input and help on improving the Linux Plumbers Conference. More information on the 2021 conference will be shared early in the new year.

12 Oct 2020 3:03pm GMT

09 Oct 2020

feedKernel Planet

Paul E. Mc Kenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Part I

A quick look at the beginning of the Documentation/RCU/Design/Requirements/Requirements.rst file in a recent Linux-kernel source tree might suggest that testing RCU's fundamental requirements is Job One. And that suggestion would be quite correct. This post describes how rcutorture tests RCU's grace-period guarantee, which is usually used to make sure that data is not freed out from under an RCU reader. Later posts will describe how the other fundamental guarantees are tested.

What Exactly is RCU's Fundamental Grace-Period Guarantee?

Any RCU reader that started before the start of a given grace period is guaranteed to complete before that grace period completes. This is shown in the following diagram:

Diagram of RCU grace-period guarantee 1

Similarly, any RCU reader that completes after the end of a given grace period is guaranteed to have started after that grace period started. And this is shown in this diagram:

Diagram of RCU grace-period guarantee 2

More information is available in the aforementioned Documentation/RCU/Design/Requirements/Requirements.rst file.

Whose Fault is This rcutorture Failure, Anyway?

Suppose an rcutorture test fails, perhaps by triggering a WARN_ON() that normally indicates a problem in some other area of the kernel. But how do we know this failure is not instead RCU's fault?

One straightforward way to test RCU's grace-period guarantee would be to maintain a single RCU-protected pointer (let's call it rcu_torture_current) to a single structure, perhaps defined as follows:

struct rcu_torture {
  struct rcu_head rtort_rcu;
  atomic_t rtort_nreaders;
  int rtort_pipe_count;
} *rcu_torture_current;

Readers could then do something like this in a loop:

p = rcu_dereference(rcu_torture_current);

An updater could do something like this, also in a loop:

p = kzalloc(sizeof(*p), GFP_KERNEL);
q = xchg(&rcu_torture_current, p);
call_rcu(&q->rtort_rcu, rcu_torture_cb);

And the rcu_torture_cb() function might be defined as follows:

static void rcu_torture_cb(struct rcu_head *p)
  struct rcu_torture *rp = container_of(p, struct rcu_torture, rtort_rcu);

  WRITE_ONCE(rp->rtort_pipe_count, 1);

This approach is of course problematic, never mind that one of rcutorture's predecessors actually did something like this. For one thing, a reader might be interrupted or (in CONFIG_PREEMPT=y kernels) preempted between its rcu_dereference() and its atomic_inc(). Then a too-short RCU grace period could result in the above reader doing its atomic_inc() on some structure that had already been freed and allocated as some other data structure used by some other part of the kernel. This could in turn result in a confusing failure in that other part of the kernel that was really RCU's fault.

In addition, the read-side atomic_inc() will result in expensive cache misses that will end up synchronizing multiple tasks concurrently executing the RCU reader code shown above. This synchronization will reduce read-side concurrency, which will in turn likely reduce the probability of these readers detecting a too-short grace period.

Finally, using the passage of time for synchronization is almost always a bad idea, so burn_a_bit_more_cpu_time() really needs to go. One might suspect that burn_a_random_amount_of_cpu_time() is also a bad idea, but we will see the wisdom in it.

Making rcutorture Preferentially Break RCU

The rcutorture module reduces the probability of false-positive non-RCU failures using these straightforward techniques:

  1. Allocate the memory to be referenced by rcu_torture_current in an array, whose elements are only ever used by rcutorture.
  2. Once an element is removed from rcu_torture_current, keep it in a special rcu_torture_removed list for some time before allowing it to be reused.
  3. Keep the random time delays in the rcutorture readers.
  4. Run rcutorture on an otherwise idle system, or, more commonly these days, within an otherwise idle guest OS.
  5. Make rcutorture place a relatively heavy load on RCU.

Use of the array keeps rcutorture from use-after-free clobbering of other kernel subsystems' data structures, keeping to-be-freed elements on the rcu_torture_removed list increases the probability that rcutorture will detect a too-short grace period, the delays in the readers increases the probability that a too-short grace period will be detected, and ensuring that most of the RCU activity is done at rcutorture's behest decreases the probability that any too-short grace periods will clobber other kernel subsystems.

The rcu_torture_alloc() and rcu_torture_free() functions manage a freelist of array elements. The freelist is a simple list creatively named rcu_torture_freelist and guarded by a global rcu_torture_lock. Because allocation and freeing happen at most once per grace period, this global lock is just fine: It is nowhere near being any sort of performance or scalability bottleneck.

The rcu_torture_removed list is handled by the rcu_torture_pipe_update_one() function that is invoked by rcutorture callbacks and the rcu_torture_pipe_update() function that is invoked by rcu_torture_writer() after completing a synchronous RCU grace period. The rcu_torture_pipe_update_one() function updates only the specified array element, and the rcu_torture_pipe_update() function updates all of the array elements residing on the rcu_torture_removed list. These updates each increment the -&gtrtort_pipe_count field. When the value of this field reaches RCU_TORTURE_PIPE_LEN (by default 10), the array element is freed for reuse.

The rcu_torture_reader() function handles the random time delays and leverages the awesome power of multiple kthreads to maintain a high read-side load on RCU. The rcu_torture_writer() function runs in a single kthread in order to simplify synchronization, but it enlists the help of several other kthreads repeatedly invoking the rcu_torture_fakewriter() in order to keep the update-side load on RCU at a respectable level.

This blog post described RCU's fundamental grace-period guarantee and how rcutorture stress-tests it. It also described a few simple ways that rcutorture increases the probability that any failures to provide this guarantee are attributed to RCU and not to some hapless innocent bystander.

09 Oct 2020 8:49pm GMT

21 Sep 2020

feedKernel Planet

Kees Cook: security things in Linux v5.7

Previously: v5.6

Linux v5.7 was released at the end of May. Here's my summary of various security things that caught my attention:

arm64 kernel pointer authentication
While the ARMv8.3 CPU "Pointer Authentication" (PAC) feature landed for userspace already, Kristina Martsenko has now landed PAC support in kernel mode. The current implementation uses PACIASP which protects the saved stack pointer, similar to the existing CONFIG_STACKPROTECTOR feature, only faster. This also paves the way to sign and check pointers stored in the heap, as a way to defeat function pointer overwrites in those memory regions too. Since the behavior is different from the traditional stack protector, Amit Daniel Kachhap added an LKDTM test for PAC as well.

The kernel's Linux Security Module (LSM) API provide a way to write security modules that have traditionally implemented various Mandatory Access Control (MAC) systems like SELinux, AppArmor, etc. The LSM hooks are numerous and no one LSM uses them all, as some hooks are much more specialized (like those used by IMA, Yama, LoadPin, etc). There was not, however, any way to externally attach to these hooks (not even through a regular loadable kernel module) nor build fully dynamic security policy, until KP Singh landed the API for building LSM policy using BPF. With CONFIG_BPF_LSM=y, it is possible (for a privileged process) to write kernel LSM hooks in BPF, allowing for totally custom security policy (and reporting).

execve() deadlock refactoring
There have been a number of long-standing races in the kernel's process launching code where ptrace could deadlock. Fixing these has been attempted several times over the last many years, but Eric W. Biederman and Ernd Edlinger decided to dive in, and successfully landed the a series of refactorings, splitting up the problematic locking and refactoring their uses to remove the deadlocks. While he was at it, Eric also extended the exec_id counter to 64 bits to avoid the possibility of the counter wrapping and allowing an attacker to send arbitrary signals to processes they normally shouldn't be able to.

slub freelist obfuscation improvements
After Silvio Cesare observed some weaknesses in the implementation of CONFIG_SLAB_FREELIST_HARDENED's freelist pointer content obfuscation, I improved their bit diffusion, which makes attacks require significantly more memory content exposures to defeat the obfuscation. As part of the conversation, Vitaly Nikolenko pointed out that the freelist pointer's location made it relatively easy to target too (for either disclosures or overwrites), so I moved it away from the edge of the slab, making it harder to reach through small-sized overflows (which usually target the freelist pointer). As it turns out, there were a few assumptions in the kernel about the location of the freelist pointer, which had to also get cleaned up.

RISCV page table dumping
Following v5.6's generic page table dumping work, Zong Li landed the RISCV page dumping code. This means it's much easier to examine the kernel's page table layout when running a debug kernel (built with PTDUMP_DEBUGFS), visible in /sys/kernel/debug/kernel_page_tables.

array index bounds checking
This is a pretty large area of work that touches a lot of overlapping elements (and history) in the Linux kernel. The short version is: C is bad at noticing when it uses an array index beyond the bounds of the declared array, and we need to fix that. For example, don't do this:

int foo[5];
foo[8] = bar;

The long version gets complicated by the evolution of "flexible array" structure members, so we'll pause for a moment and skim the surface of this topic. While things like CONFIG_FORTIFY_SOURCE try to catch these kinds of cases in the memcpy() and strcpy() family of functions, it doesn't catch it in open-coded array indexing, as seen in the code above. GCC has a warning (-Warray-bounds) for these cases, but it was disabled by Linus because of all the false positives seen due to "fake" flexible array members. Before flexible arrays were standardized, GNU C supported "zero sized" array members. And before that, C code would use a 1-element array. These were all designed so that some structure could be the "header" in front of some data blob that could be addressable through the last structure member:

/* 1-element array */
struct foo {
    char contents[1];

/* GNU C extension: 0-element array */
struct foo {
    char contents[0];

/* C standard: flexible array */
struct foo {
    char contents[];

instance = kmalloc(sizeof(struct foo) + content_size);

Converting all the zero- and one-element array members to flexible arrays is one of Gustavo A. R. Silva's goals, and hundreds of these changes started landing. Once fixed, -Warray-bounds can be re-enabled. Much more detail can be found in the kernel's deprecation docs.

However, that will only catch the "visible at compile time" cases. For runtime checking, the Undefined Behavior Sanitizer has an option for adding runtime array bounds checking for catching things like this where the compiler cannot perform a static analysis of the index values:

int foo[5];
for (i = 0; i < some_argument; i++) {
    foo[i] = bar;

It was, however, not separate (via kernel Kconfig) until Elena Petrova and I split it out into CONFIG_UBSAN_BOUNDS, which is fast enough for production kernel use. With this enabled, it's now possible to instrument the kernel to catch these conditions, which seem to come up with some regularity in Wi-Fi and Bluetooth drivers for some reason. Since UBSAN (and the other Sanitizers) only WARN() by default, system owners need to set panic_on_warn=1 too if they want to defend against attacks targeting these kinds of flaws. Because of this, and to avoid bloating the kernel image with all the warning messages, I introduced CONFIG_UBSAN_TRAP which effectively turns these conditions into a BUG() without needing additional sysctl settings.

Fixing "additive" snprintf() usage
A common idiom in C for building up strings is to use sprintf()'s return value to increment a pointer into a string, and build a string with more sprintf() calls:

/* safe if strlen(foo) + 1 < sizeof(string) */
wrote  = sprintf(string, "Foo: %s\n", foo);
/* overflows if strlen(foo) + strlen(bar) > sizeof(string) */
wrote += sprintf(string + wrote, "Bar: %s\n", bar);
/* writing way beyond the end of "string" now ... */
wrote += sprintf(string + wrote, "Baz: %s\n", baz);

The risk is that if these calls eventually walk off the end of the string buffer, it will start writing into other memory and create some bad situations. Switching these to snprintf() does not, however, make anything safer, since snprintf() returns how much it would have written:

/* safe, assuming available <= sizeof(string), and for this example
 * assume strlen(foo) < sizeof(string) */
wrote  = snprintf(string, available, "Foo: %s\n", foo);
/* if (strlen(bar) > available - wrote), this is still safe since the
 * write into "string" will be truncated, but now "wrote" has been
 * incremented by how much snprintf() *would* have written, so "wrote"
 * is now larger than "available". */
wrote += snprintf(string + wrote, available - wrote, "Bar: %s\n", bar);
/* string + wrote is beyond the end of string, and availabe - wrote wraps
 * around to a giant positive value, making the write effectively 
 * unbounded. */
wrote += snprintf(string + wrote, available - wrote, "Baz: %s\n", baz);

So while the first overflowing call would be safe, the next one would be targeting beyond the end of the array and the size calculation will have wrapped around to a giant limit. Replacing this idiom with scnprintf() solves the issue because it only reports what was actually written. To this end, Takashi Iwai has been landing a bunch scnprintf() fixes.

That's it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.8.

© 2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

21 Sep 2020 11:32pm GMT

17 Sep 2020

feedKernel Planet

James Bottomley: Creating a Home IPv6 Network

One of the recent experiences of Linux Plumbers Conference convinced me that if you want to be part of a true open source WebRTC based peer to peer audio/video interaction, you need an internet address that's not behind a NAT. In reality, the protocol still works as long as you can contact a stun server to tell you what your external address is and possibly a turn server to proxy the packets if both endpoints are NATed but all this seeking external servers takes time as those of you who complained about the echo test found. The solution to all this is to connect over IPv6 which has an address space large enough to support every device on the planet having its own address. All modern Linux distributions support IPv6 out of the box so the chances are you've actually accidentally used it without ever noticing, which is one of the beauties of IPv6 autoconfiguration (it's supposed to just work).

However, I recently moved, and so lost my fibre internet connection to cable but cable that did come with an IPv6 address, so this is my story of getting it all to work. If you don't really care about the protocol basics, you can skip down to the how. This guide is also focussed on a "dual stack" configuration (one that has both IPv6 and IPv4 addresses). Pure IPv6 configurations are possible, but because some parts of the internet are still IPv4 only, they're not complete unless you set up an IPv4 encapsulating bridge.

The Basics of IPv6

IPv6 has been a mature protocol for a long time now, so I erroneously assumed there'd be a load of good HOWTOs about it. However, after reading 20 different descriptions of how the IPv6 128 bit address space works and not much else, I gave up in despair and read the RFCs instead. I'll assume you've read at least one of these HOWTOS, so I don't have to go into IPv6 address prefixes, suffixes, interface IDs or subnets so I'll begin where most of the HOWTOs end.

How does IPv6 Just Work?

In IPv4 there's a protocol called dynamic host configuration protocol (DHCP) so as long as you can find a DHCP server you can get all the information you need to connect (local address, router, DNS server, time server, etc). However, this service has to be set up by someone and IPv6 is designed to configure a network without it.

The first assumption IPv6 StateLess Address AutoConfiguration (SLAAC) makes is that it's on a /64 subnet (So every subnet in IPv6 contains 1010 times as many addresses as the entire IPv4 internet). This means that, since most real subnets contain <100 systems, they can simply choose a random address and be very unlikely to clash with the existing systems. In fact, there are three current ways of choosing an address in the /64:

  1. EUI-64 (RFC 4291) based on the MAC address which is basically the MAC with one bit flipped and ff:fe placed in the middle.
  2. Stable Private (RFC 7217) which generate from a hash based on a static key, interface, prefix and a counter (the counter is incremented if there is a clash). These are preferred to the EUI-64 ones which give away any configuration associated with the MAC address (such as what type of network card you have)
  3. Privacy Extension Addresses (RFC 4941) which are very similar to stable private addresses except they change over time using the IPv6 address deprecation mechanism and are for client systems who want to preserve anonymity.

The next problem in Linux is who configures the interface? The Kernel IPv6 stack is actually designed to do it, and will unless told not to, but most of the modern network controllers (like NetworkManager) are control freaks and turn off the kernel's auto configuration so they can do it themselves. They also default to stable private addressing using a static secret maintained in the filesystem (/var/lib/NetworkManager/secret_key).

The next thing to understand about IPv6 addresses is that they are divided into scopes, the most important being link local (unrouteable) addresses which conventionally always have the prefix fe80::/64. The link local address is configured first using one of the above methods and then used to probe the network.

Multicast and Neighbour Discovery

Unlike IPv4, IPv6 has no broadcast capability so all discovery is done by multicast. Nodes coming up on the network subscribe to particular multicast addresses, via special packets intercepted by the switch, and won't receive any multicast to which they're not subscribed. Conventionally, all link local multicast addresses have the prefix ff02::/64 (for other types of multicast address see RFC 4291). All nodes subscribe to the "all nodes" multicast address ff02::1 and also must subscribe to their own solicited node multicast address at ff02::1:ffXX:XXXX where the last 24 bits correspond to the lowest 24 bits of the node's IPv6 address. This latter is to avoid the disruption that used to occur in IPv4 from ARP broadcasts because now you can target a specific subset of nodes for address resolution.

The IPV6 address resolution protocol is called Neighbour Solicitation (NS), described in RFC 4861 and it's use with SLAAC described in RFC 4862, and is done by sending a multicast to the neighbor solicitation address of the node you want to discover containing the full IPv6 address you want to know, a node with the matching address replies with its link layer (MAC) address in a Neighbour Advertisement (NA) packet.

Once a node has chosen its link local address, it first sends out a NS packet to its chosen address to see if anyone replies and if no-one does it assumes it is OK to keep it otherwise it follows the collision avoidance protocol associated with its particular form of address. Once it has found a unique address, the node configures this link local address and looks for a router. Note that if an IPv6 network isn't present, discovery stops here, which is why most network interfaces always show a link local IPv6 address.

Router Discovery

Once the node has its own unique link local address, it uses it to send out Router Solicitation (RS) packets to the "all routers" multicast address ff02::2. Every router on the network responds with a Router Advertisement (RA) packet which describes (among other things) the the router lifetime, the network MTU, a set of one or more prefixes the router is responsible for, the router's link address and a set of option flags including the M (Managed) and O (Other Configuration) flag and possibly a set of DNS servers.

Each advertised prefix contains the prefix and prefix length, a set of flags including the A (autonomous configuration) and L (link local) and a set of lifetimes. The Link Local prefixes tell you what global prefixes the local network users (there may be more than one) and whether you are allowed to do SLAAC on the global prefix (if the A flag is clear, you must ask the router for an address using DHCPv6). If the router has a non zero lifetime, you may assume it is a default router for the subnet.

Now that the node has discovered one or more routers it may configure its own global address (note that every IPv6 routeable node has at least two addresses: a link local and a global). How it does this depends on the router and prefix flags

Global Address Configuration

The first thing a node needs to know is whether to use SLAAC for the global address or DHCPv6. This is entirely determined by the A flag of any link local prefix in the RA packet. If A is set, then the node may use SLAAC and if A is clear then the node must use DHCPv6 to obtain an address. If A is set and also the M (Managed) flag then the node may use either SLAAC or DHCPv6 (or both) to obtain an address and if the M flag is clear, but the O (Other Config) flag is present then the node must use SLAAC but may use DHCPv6 to obtain other information about the network (usually DNS).

Once the node has a global address in now needs a default route. It forms the default route list from the RA packets that have a non-zero router Lifetime. All of these are configured as default routes to their link local address with the RA specified hop count. Finally, the node may add specific prefix routes from RA packets with zero router LifeTimes but non link local prefixes.

DHCPv6 is a fairly complex configuration protocol (see RFC 8415) but it cannot specify either prefix length (meaning all obtained addresses are configured as /128) or routes (these must be obtained from RA packets). This leads to a subtlety of outbound address selection in that the most specific is always preferred, so if you configure both by SLAAC and DHCPv6, the SLAAC address will be added as /64 and the DHCPv6 address as /128 meaning your outbound IP address will always be the DHCPv6 one (although if an external entity knows your SLAAC address, they will still be able to reach you on it).

The How: Configuring your own Home Router

One of the things you'd think from the above is that IPv6 always auto configures and, while it is true that if you simply plug your laptop into the ethernet port of a cable modem it will just automatically configure, most people have a more complex home setup involving a router, which needs some special coaxing before it will work. That means you need to obtain additional features from your ISP using special DHCPv6 requests.

This section is written from my own point of view: I have a rather complex IPv4 network which has a completely open but bandwidth limited (to untrusted clients) wifi network, and several protected internal networks (one for my lab, one for my phones and one for the household video cameras), so I need at least 4 subnets to give every device in my home an IPv6 address. I also use OpenWRT as my router distribution, so all the IPv6 configuration information is highly specific to it (although it should be noted that things like NetworkManager can also do all of this if you're prepared to dig in the documentation).

Prefix Delegation

Since DHCPv6 only hands out a /128 address, this isn't sufficient because it's the IP address of the router itself. In order to become a router, you must request delegation of part of the IPv6 address space via the Identity Association for Prefix Delection (IA_PD) option of DHCPv6. Once this is done the router IP address will be assumed by the ISP to be the route for all of the delegated prefixes. The subtlety here is that if you want more than one subnet, you have to ask for it specifically (the client must specify the exact prefix length it's looking for) and since it's a prefix length, and your default subnet should be /64, if you request a prefix length of 64 you only have one subnet. If you request 63 you have 2 and so on. The problem is how do you know how many subnets the ISP is willing to give you? Unfortunately there's no way of finding this (I had to do an internet search to discover my ISP, Comcast, was willing to delegate a prefix length of 60, meaning 16 subnets). If searching doesn't tell you how much your ISP is willing to delegate, you could try starting at 48 and working your way to 64 in increments of 1 to see what the largest delegation you can get away with (There have been reports of ISPs locking you at your first delegated prefix length, so don't start at 64). The final subtlety is that the prefix you're delegated may not be the same prefix as the address your router obtained (my current comcast configuration has my router at 2001:558:600a:… but my delegated prefix is 2601:600:8280:66d0:/60). Note you can run odhcp6c manually with the -P option if you have to probe your ISP to find out what size of prefix you can get.

Configuring the Router for Prefix Delegation

In OpenWRT terms, the router WAN DHCP(v6) configuration is controlled by /etc/default/network. You'll already have a WAN interface (likely called 'wan') for DHCPv4, so you simply add an additional 'wan6' interface to get an additional IPv6 and become dual stack. In my configuration this looks like

config interface 'wan6'
        option ifname '@wan'
        option proto 'dhcpv6'
        option reqprefix 60

The slight oddity is the ifname: @wan simply tells the config to use the same ifname as the 'wan' interface. Naming it this way is essential if your wan is a bridge, but it's good practice anyway. The other option 'reqprefix' tells DHCPv6 to request a /60 prefix delegation.

Handing Out Delegated Prefixes

This turns out to be remarkably simple. Firstly you have to assign a delegated prefix to each of your other interfaces on the router, but you can do this without adding a new OpenWRT interface for each of them. My internal IPv4 network has all static addresses, so you add three directives to each of the interfaces:

config interface 'lan'
        ... interface designation (bridge for me)
        option proto 'static'
        ... ipv4 addresses
        option ip6assign '64'
        option ip6hint '1'
        option ip6ifaceid '::ff'

ip6assign 'N' means you are a /N network (so this is always /64 for me) and ip6hint 'N' means use N as your subnet id and ip6ifaceid 'S' means use S as the IPv6 suffix (This defaults to ::1 so if you're OK with that, omit this directive). So given I have a 2601:600:8280:66d0::/60 prefix, the global address of this interface will be 2601:600:8280:66d1::ff. Now the acid test, if you got this right, this global address should be pingable from anywhere on the IPv6 internet (if it isn't, it's likely a firewall issue, see below).

Advertising as a Router

Simply getting delegated a delegated prefix on a local router interface is insufficient . Now you need to get your router to respond to Router Solicitations on ff02::2 and optionally do DHCPv6. Unfortunately, OpenWRT has two mechanisms for doing this, usually both installed: odhcpd and dnsmasq. What I found was that none of my directives in /etc/config/dhcp would take effect until I disabled odhcpd completely

/etc/init.d/odhcpd stop
/etc/init.d/odhcpd disable

and since I use dnsmasq extensively elsewhere (split DNS for internal/external networks), that suited me fine. I'll describe firstly what options you need in dnsmasq and secondly how you can achieve this using entries in the OpenWRT /etc/config/dhcp file (I find this useful because it's always wise to check what OpenWRT has put in the /var/etc/dnsmasq.conf file).

The first dnsmasq option you need is 'enable-ra' which is a global parameter instructing dnsmasq to handle router advertisements. The next parameter you need is the per-interface 'ra-param' which specifies the global router advertisement parameters and must appear once for every interface you want to advertise on. Finally the 'dhcp-range' option allows more detailed configuration of the type of RA flags and optional DHCPv6.

SLAAC or DHCPv6 (or both)

In many ways this is a matter of personal choice. If you allow SLAAC, hosts which want to use privacy extension addresses (like Android phones) can do so, which is a good thing. If you also allow DHCPv6 address selection you will have a list of addresses assigned to hosts and dnsmasq will do DNS resolution for them (although it can do DNS for SLAAC addresses provided it gets told about them). A special tag 'constructor' exists for the 'dhcp-range' option which tells it to construct the supplied address (for either RA or DHCPv6) from the IPv6 global prefix of the specified interface, which is how you pass out our delegated prefix addresses. The modes for 'dhcp-range' are 'ra-only' to disallow DHCPv6 entirely, 'slaac' to allow DHCPv6 address selection and 'ra-stateless' to disallow DHCPv6 address selection but allow other DHCPv6 configuration information.

Based on trial and error (and finally examining the scripting in /etc/init.d/dnsmasq) the OpenWRT options required to achieve the above dnsmasq options are

config dhcp lan
        option interface lan
        option start 100
        option limit 150
        option leasetime 1h
        option dhcpv6 'server'
        option ra_management '1'
        option ra 'server'

with 'ra_management' as the key option with '0' meaning SLAAC with DHCPv6 options, '1' meaning SLAAC with full DHCPv6, '2' meaning DHCPv6 only and '3' meaning SLAAC only. Another OpenWRT oddity is that there doesn't seem to be a way of setting the lease range: it always defaults to either static only or ::1000 to ::ffff.

Firewall Configuration

One of the things that trips people up is the fact that Linux has two completely separate firewalls: one for IPv4 and one for IPv6. If you've ever written any custom rules for them, the chances are you did it in the OpenWRT /etc/firewall.user file and you used the iptables command, which means you only added the rules to the IPv4 firewall. To add the same rule for IPv6 you need to duplicate it using the ip6tables command. Another significant problem, if you're using a connection tracking for port knock detection like I am, is that Linux connection tracking has difficulty with IPv6 multicast, so packets that go out to a multicast but come back as unicast (as most of the discovery protocols do) get the wrong conntrack state. To fix this, I eventually had to have an INPUT rule just accepting all ICMPv6 and DHCPv6 (udp ports 546 [client] and 547 [server]). The other firewall considerations are that now everyone has their own IP address, there's no need to NAT (OpenWRT can be persuaded to take care of this automatically, but if you're duplicating the IPv4 rules manually, don't duplicate the NAT rules). The final one is likely more applicable to me: my wifi interface is designed to be an extension of the local internet and all machines connecting to it are expected to be able to protect themselves since they'll migrate to such hostile environments as airport wifi, thus I do complete exposure of wifi connected devices to the general internet for all ports, including port probes. For my internal devices, I have a RELATED,ESTABLISHED rule to make sure they're not probed since they're not designed to migrate off the internal network.

Now the problems with OpenWRT: since you want NAT on IPv4 but not on IPv6 you have to have two separate wan zones for them: if you try to combine them (as I first did), then OpenWRT will add an IPv6 -ctstate INVALID rule which will prevent Neighbour Discovery from working because of the conntrack issues with IPv6 multicast, so my wan zones are (well, this is a lie because my firewall is now hand crafted, but this is what I checked worked before I put the hand crafted firewall in place):

config zone
        option name 'wan'
        option network 'wan'
        option masq '1'

config zone
        option name 'wan6'
        option network 'wan6'

And the routing rules for the lan zone (fully accessible) are

config forwarding
        option src 'lan'
        option dest 'wan'

config forwarding
        option src 'lan'
        option dest 'wan6'

config forwarding
        option src 'wan6'
        option dest 'lan'

Putting it Together: Getting the Clients IPv6 Connected

Now that you have your router configured, everything should just work. If it did, your laptop wifi interface should now have a global IPv6 address

ip -6 address show dev wlan0

If that comes back empty, you need to enable IPv6 on your distribution. If it has only a link local (fe80:: prefix) address, IPv6 is enabled but your router isn't advertising (suspect firewall issues with discovery packets or dnsmasq misconfiguration). If you see a global address, you're done. Now you should be able to go to https://testv6.com and secure a 10/10 score.

The final piece of the puzzle is preferring your new IPv6 connection when DNS offers a choice of IPv4 or IPv6 addresses. All modern Linux clients should prefer IPv6 when available if connected to a dual stack network, so try … if you ping, say, www.google.com and see an IPv6 address you're done. If not, you need to get into the murky world of IPv6 address labelling (RFC 6724) and gai.conf.


Adding IPv6 to and existing IPv4 setup is currently not a simple plug in and go operation. However, provided you understand a handful of differences between the two protocols, it's not an insurmountable problem either. I have also glossed over many of the problems you might encounter with your ISP. Some people have reported that their ISPs only hand out one IPv6 address with no prefix delegation, in which case I think finding a new ISP would be wisest. Others report that the ISP will only delegate one /64 prefix so your choice here is either to only run one subnet (likely sufficient for a lot of home networks), or subnet at greater than /64 and forbid SLAAC, which is definitely not a recommended configuration. However, provided your ISP is reasonable, this blog post should at least help get you started.

17 Sep 2020 10:23pm GMT

07 Sep 2020

feedKernel Planet

Paul E. Mc Kenney: The Old Man and His Smartphone, 2020 “See You in September” Episode

The continued COVID-19 situation continues to render my smartphone's location services less than useful, though a number of applications will still beg me to enable it, preferring to know my present location rather than consider my past habits. One in particular does have a "Don't ask me again" link, but it asks each time anyway. Given that I have only ever used one of that business's locations, you would think that it would not be all that hard to figure out which location I was going to be using next. But perhaps I am the only one who habitually disables location services.

Using the smartphone for breakfast-time Internet browsing has avoided almost all flat-battery incidents. One recent exception occurred while preparing for a hike. But I still have my old digital camera, so I plugged the smartphone into its charger and took my digital camera instead. I have previously commented on the excellent quality of my smartphone's cameras, but there is nothing quite like going back to the old digital camera (never mind my long-departed 35mm SLR) to drive that lesson firmly home.

I was recently asked to text a photo, and saw no obvious way to do this. There was some urgency, so I asked for an email address and emailed the photo instead. This did get the job done, but let's just say that it appears that asking for an email address is no longer a sign of youth, vigor, or with-it-ness. Thus chastened, I experimented in a calmer time, learning that the trick is to touch the greater-than icon to the left of the text-message-entry bar, which produces an option to select from your gallery and also to include a newly taken picture.

The appearance of Comet Neowise showcased my smartphone's ability to orient and to display the relevant star charts. Nevertheless, my wife expressed confidence in this approach only after seeing the large number of cars parked in the same area that my smartphone and I had selected. I hadn't intended to take a photo of the comet because the professionals a much better job, especially those who are willing to travel far away from city lights and low altitudes. But here was my smartphone and there was the comet, so why not? The resulting photo was quite unsatisfactory, with so much pixelated noise that the comet was just barely discernible.

It was some days later that I found the smartphone's night mode. This is quite impressive. In this mode, the smartphone can form low-light images almost as well as my eyes can, which is saying something. It is also extremely good with point sources of light.

One recent trend in clothing is pockets for smartphones. This trend prompted my stepfather to suggest that the smartphone is the pocket watch of the 21st century. This might well be, but I still wear a wristwatch.

My refusal to use my smartphone's location services does not mean that location services cannot get me in trouble. Far from it! One memorable incident took place on BPA Road in Forest Park. A group of hikers asked me to verify their smartphone's chosen route, which would have taken them past the end of Firelane 13 and eventually down a small cliff. I advised them to choose a different route.

But I had seen the little line that their smartphone had drawn, and a week or so later found myself unable to resist checking it out. Sure enough, when I peered through the shrubbery marking the end of Firelane 13, I saw an unassuming but very distinct trail. Of course I followed it. Isn't that what trails are for? Besides, maybe someone had found a way around the cliff I knew to be at the bottom of that route.

To make a long story short, no one had found a way around that cliff. Instead, the trail went straight down it. For all but about eight feet of the trail, it was possible to work my way down via convenient handholds in the form of ferns, bushes, and trees. My plan for that eight feet was to let gravity do the work, and to regain control through use of a sapling at the bottom of that stretch of the so-called trail. Fortunately for me, that sapling was looking out for this old man, but unfortunately this looking out took the form of ensuring that I had a subcutaneous hold on its bark. Thankfully, the remainder of the traverse down the cliff was reasonably uneventful.

Important safety tip: If you absolutely must use that trail, wear a pair of leather work gloves!

07 Sep 2020 4:02am GMT

05 Sep 2020

feedKernel Planet

Paul E. Mc Kenney: Stupid RCU Tricks: Enlisting the Aid of a Debugger

Using Debuggers With rcutorture

So rcutorture found a bug, you have figured out how to reproduce it, git bisect was unhelpful (perhaps because the bug has been around forever), and the bug happens to be one of those rare RCU bugs for which a debugger might be helpful. What can you do?

What I have traditionally done is to get partway through figuring out how to make gdb work with rcutorture, then suddenly realize what the bug's root cause must be. At this point, I of course abandon gdb in favor of fixing the bug. As a result, although I have tried to apply gdb to the Linux kernel many times over the past 20 years, I never have actually succeeded in doing so. Now, this is not to say that gdb is useless to Linux-kernel hackers. Far from it! For one thing, the act of trying to use gdb has inspired me to perceive the root cause of a great many bugs, which means that it has served as a great productivity aid. For another thing, I frequently extract Linux-kernel code into a usermode scaffolding and use gdb in that context. And finally, there really are a number of Linux-kernel hackers who make regular use of gdb.

One of these hackers is Omar Sandoval, who happened to mention that he had used gdb to track down a Linux-kernel bug. And without first extracting the code to userspace. I figured that it was time for this old dog to learn a new trick, so I asked Omar how he made this happen.

Omar pointed out that because rcutorture runs in guest OSes, gdb can take advantage of the debugging support provided by qemu. To make this work, you build a kernel with CONFIG_DEBUG_INFO=y (which supplies gdb with additional symbols), provide the nokaslr kernel boot parameter (which prevents kernel address-space randomization from invalidating these symbols), and supply qemu with the -s -S command-line arguments (which causes it to wait for gdb to connect instead of immediately booting the kernel). You then specify the vmlinux file's pathname as the sole command-line argument to gdb. Once you see the (gdb) prompt, the target remote :1234 command will connect to qemu and then the continue command will boot the kernel.

I tried this, and it worked like a charm.

Alternatively, you can now use the shiny new rcutorture --gdb command-line argument in the -rcu tree, which will automatically set up the kernel and qemu, and will print out the required gdb commands, including the path to the newly built vmlinux file.

And yes, I do owe Omar a --drgn command-line argument, which I will supply once he lets me know how to connect drgn to qemu. :-)

In the meantime, the following sections cover a couple of uses I have made of --gdb, mostly to get practice with this approach to Linux-kernel debugging.

Case study 1: locktorture

For example, let's use gdb to investigate a long-standing locktorture hang when running scenario LOCK05:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --torture lock \
    --duration 3 --configs LOCK05 --gdb

This will print out the following once the kernel is built and qemu has started:

Waiting for you to attach a debug session, for example:
    gdb gdb /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2020.08.27-14.51.45/LOCK05/vmlinux
After symbols load and the "(gdb)" prompt appears:
    target remote :1234

Once you have started gdb and entered the two suggested commands, the kernel will start. You can track its console output by locating its console.log file as described in an earlier post. Or you can use the ps command to dump the qemu command line, looking for the -serial file: command, which is following by the pathname of the file receiving the console output.

Once the kernel is sufficiently hung, that is, more than 15 seconds elapses after the last statistics output line (Writes: Total: 27668769 Max/Min: 27403330/34661 Fail: 0), you can hit control-C at gdb. The usual info threads command will show the CPUs' states, here with the 64-bit hexadecimal addresses abbreviated:

(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 1 (CPU#0 [running]) stutter_wait (title=0xf... "lock_torture_writer")
    at kernel/torture.c:615
  2    Thread 2 (CPU#1 [running]) 0xf... in stutter_wait (
    title=0xf... "lock_torture_writer") at kernel/torture.c:615
  3    Thread 3 (CPU#2 [halted ]) default_idle () at arch/x86/kernel/process.c:689
  4    Thread 4 (CPU#3 [halted ]) default_idle () at arch/x86/kernel/process.c:689

It is odd that CPUs 0 and 1 are in stutter_wait(), spinning on the global variable stutter_pause_test. Even more odd is that the value of this variable is not zero, as it should be at the end of the test, but rather the value two. After all, all paths out of torture_stutter() should zero this variable.

But maybe torture_stutter() is still stuck in the loop prior to the zeroing of stutter_pause_test. A quick look at torture_stutter_init shows us that the task_struct pointer to the task running torture_stutter lives in stutter_task, which is non-NULL, meaning that this task still lives. One might hope to use sched_show_task(), but this sadly fails with Could not fetch register "fs_base"; remote failure reply 'E14'.

The value of stutter_task.state is zero, which indicates that this task is running. But on what CPU? CPUs 0 and 1 are both spinning in stutter_wait, and the other two CPUs are in the idle loop. So let's look at stutter_task.on_cpu, which is zero, as in not on a CPU. In addition, stutter_task.cpu has the value one, and CPU 1 is definitely running some other task.

It would be good to just be able to print the stack of the blocked task, but it is also worth just rerunning this test, but this time with the locktorture.stutter module parameter set to zero. This test completed successfully, in particular, with no hangs. Given that no other locktorture or rcutorture scenario suffers from similar hangs, perhaps the problem is in rt_mutex_lock() itself. To check this, let's restart the test, but with the default value of the locktorture.stutter module parameter. After letting it hang and interrupting it with control-C, even though it still feels strange to control-C a kernel:

(gdb)  print torture_rtmutex
$1 = {wait_lock = {raw_lock = {{val = {counter = 0}, {locked = 0 '\000', pending = 0 '\000'}, {
          locked_pending = 0, tail = 0}}}}, waiters = {rb_root = {rb_node = 0xffffc9000025be50}, 
    rb_leftmost = 0xffffc90000263e50}, owner = 0x1 <fixed_percpu_data+1>}

The owner = 0x1 looks quite strange for a task_struct pointer, but the block comment preceding rt_mutex_set_owner() says that this value is legitimate, and represents one of two transitional states. So maybe it is time for CONFIG_DEBUG_RT_MUTEXES=y, but enabling this Kconfig option produces little additional enlightenment.

However, the torture_rtmutex.waiters field indicates that there really is something waiting on the lock. Of course, it might be that we just happened to catch the lock at this point in time. To check on this, let's add a variable to capture the time of the last lock release. I empirically determined that it is necessary to use WRITE_ONCE() to update this variable in order to prevent the compiler from optimizing it out of existence. Learn from my mistakes!

With the addition of WRITE_ONCE(), the next run showed that the last lock operation was more than three minutes in the past and that the transitional lock state still persisted, which provides strong evidence that this is the result of a race condition in the locking primitive itself. Except that a quick scan of the code didn't immediately identify a race condition. Furthermore, the failure happens even with CONFIG_DEBUG_RT_MUTEXES=y, which disables the lockless fastpaths (or the obvious lockless fastpaths, anyway).

Perhaps this is instead a lost wakeup? This would be fortuitous given that there are rare lost-IPI issues, and having this reproduce so easily on my laptop would be extremely convenient. And adding a bit of debug code to mark_wakeup_next_waiter() and lock_torture_writer() show that there is a task that was awakened, but that never exited from rt_mutex_lock(). And this task is runnable, that is, its ->state value is zero. But it is clearly not running very far! And further instrumentation demonstrates that control is not reaching the __smp_call_single_queue() call from __ttwu_queue_wakelist(). The chase is on!

Except that the problem ended up being in stutter_wait(). As the name suggests, this function controls stuttering, that is, periodically switching between full load and zero load. Such stuttering can expose bugs that a pure full-load stress test would miss.

The stutter_wait() uses adaptive waiting, so that schedule_timeout_interruptible() is used early in each no-load interval, but a tight loop containing cond_resched() is used near the end of the interval. The point of this is to more tightly synchronize the transition from no-load to full load. But the LOCK05 scenario's kernel is built with CONFIG_PREEMPT=y, which causes cond_resched() to be a no-op. In addition, the kthreads doing the write locking lower their priority using set_user_nice(current, MAX_NICE), which appears to be preventing preemption. (We can argue that even MAX_NICE should not indefinitely prevent preemption, but the multi-minute waits that have been observed are for all intents and purposes indefinite.)

The fix (or workaround, as the case might be) is for stutter_wait() to block periodically, thus allowing other tasks to run.

Case study 2: RCU Tasks Trace

I designed RCU Tasks Trace for the same grace-period latency that I had designed RCU Tasks for, namely roughly one second. Unfortunately, this proved to be about 40x too slow, so adjustments were called for.

After those reporting the issue kindly verified for me that this was not a case of too-long readers, I used --gdb to check statistics and state. I used rcuscale, which is a member of the rcutorture family designed to measure performance and scalability of the various RCU flavors' grace periods:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus \
    --configs TRACE01 --bootargs "rcuscale.nreaders=0 rcuscale.nwriters=10" \
    --trust-make --gdb

Once the (gdb) prompt appears, we connect to qemu, set a break point, and then continue execution:

(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in exception_stacks ()
(gdb) b rcu_scale_cleanup
Breakpoint 1 at 0xffffffff810d27a0: file kernel/rcu/rcuscale.c, line 505.
(gdb) cont
Remote connection closed

Unfortunately, as shown above, this gets us Remote connection closed instead of a breakpoint. Apparently, the Linux kernel does not take kindly to debug exception instructions being inserted into its code. Fortunately, gdb also supplies a hardware breakpoint command:

(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in exception_stacks ()
(gdb) hbreak rcu_scale_cleanup
Hardware assisted breakpoint 1 at 0xffffffff810d27a0: file kernel/rcu/rcuscale.c, line 505.
(gdb) cont
[Switching to Thread 12]

Thread 12 hit Breakpoint 1, rcu_scale_cleanup () at kernel/rcu/rcuscale.c:505
505     {

This works much better, and the various data structures may now be inspected to check the validity of various optimization approaches. Of course, as the optimization effort continued, hand-typing gdb commands became onerous, and was therefore replaced with crude but automatic accumulation and display of relevant statistics.

Of course, Murphy being who he is, the eventual grace-period speedup also caused a few heretofore latent race conditions to be triggered by a few tens of hours of rctorture. These race conditions resulted in rcu_torture_writer() stalls, along with the occasional full-fledged RCU-Tasks-Trace CPU stall warning.

Now, rcutorture does dump out RCU grace-period kthread state when these events occur, but in the case of the rcu_torture_writer() stalls, this state is for vanilla RCU rather than the flavor of RCU under test. Which is an rcutorture bug that will be fixed. But in the meantime, gdb provides a quick workaround by setting a hardware breakpoint on the ftrace_dump() function, which is called when either of these sorts of stalls occur. When the breakpoint triggers, it is easy to manually dump the data pertaining to the grace-period kthread of your choice.

For those who are curious, the race turned out to be an IPI arriving between a pair of stores in rcu_read_unlock_trace() that could leave the corresponding task forever blocking the current RCU Tasks Trace grace period. The solution, as with vanilla RCU in the v3.0 timeframe, is to set the read-side nesting value to a negative number while clearing the .need_qs field indicating that a quiescent state is required. The buggy code is as follows:

if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
    // BUG: IPI here sets .need_qs after check!!!
    WRITE_ONCE(t->trc_reader_nesting, nesting);
    return;  // We assume shallow reader nesting.

Again, the fix is to set the nesting count to a large negative number, which allows the IPI handler to detect this race and refrain from updating the .need_qs field when the ->trc_reader_nesting field is negative, thus avoiding the grace-period hang:

WRITE_ONCE(t->trc_reader_nesting, INT_MIN); // FIX
if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
    WRITE_ONCE(t->trc_reader_nesting, nesting);
    return;  // We assume shallow reader nesting.

This experience of course suggests testing with grace period latencies tuned much more aggressively than they are in production, with an eye to finding additional low-probability race conditions.

Case study 3: x86 IPIs

Tracing the x86 IPI code path can be challenging because function pointers are heavily used. Unfortunately, some of these function pointers are initialized at runtime, so simply running gdb on the vmlinux binary does not suffice. However, we can again set a breakpoint somewhere in the run and check these pointers after initialization is complete:

tools/testing/selftests/rcutorture/bin/kvm.sh --torture scf --allcpus --duration 5 --gdb --configs "NOPREEMPT" --bootargs "scftorture.stat_interval=15 scftorture.verbose=1"

We can then set a hardware-assisted breakpoint as shown above at any convenient runtime function.

Once this breakpoint is encountered:

(gdb) print smp_ops
$2 = {smp_prepare_boot_cpu = 0xffffffff82a13833 , 
  smp_prepare_cpus = 0xffffffff82a135f9 , 
  smp_cpus_done = 0xffffffff82a13897 , 
  stop_other_cpus = 0xffffffff81042c40 , 
  crash_stop_other_cpus = 0xffffffff8104d360 , 
  smp_send_reschedule = 0xffffffff81047220 , 
  cpu_up = 0xffffffff81044140 , 
  cpu_disable = 0xffffffff81044aa0 , 
  cpu_die = 0xffffffff81044b20 , 
  play_dead = 0xffffffff81044b80 , 
  send_call_func_ipi = 0xffffffff81047280 , 
  send_call_func_single_ipi = 0xffffffff81047260 }

This shows that smp_ops.send_call_func_single_ipi is native_send_call_func_single_ipi(), which helps to demystify arch_send_call_function_single_ipi(). Except that this native_send_call_func_single_ipi() function is just a wrapper around apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR). So:

(gdb) print *apic
$4 = {eoi_write = 0xffffffff8104b8c0 , 
  native_eoi_write = 0x0 , write = 0xffffffff8104b8c0 , 
  read = 0xffffffff8104b8d0 , 
  wait_icr_idle = 0xffffffff81046440 , 
  safe_wait_icr_idle = 0xffffffff81046460 , 
  send_IPI = 0xffffffff810473c0 , 
  send_IPI_mask = 0xffffffff810473f0 , 
  send_IPI_mask_allbutself = 0xffffffff81047450 , 
  send_IPI_allbutself = 0xffffffff81047500 , 
  send_IPI_all = 0xffffffff81047510 , 
  send_IPI_self = 0xffffffff81047520 , dest_logical = 0, disable_esr = 0, 
  irq_delivery_mode = 0, irq_dest_mode = 0, 
  calc_dest_apicid = 0xffffffff81046f90 , 
  icr_read = 0xffffffff810464f0 , 
  icr_write = 0xffffffff810464b0 , 
  probe = 0xffffffff8104bb00 , 
  acpi_madt_oem_check = 0xffffffff8104ba80 , 
  apic_id_valid = 0xffffffff81047010 , 
  apic_id_registered = 0xffffffff8104b9c0 , 
  check_apicid_used = 0x0 , 
  init_apic_ldr = 0xffffffff8104b9a0 , 
  ioapic_phys_id_map = 0x0 , setup_apic_routing = 0x0 ,
  cpu_present_to_apicid = 0xffffffff81046f50 ,
  apicid_to_cpu_present = 0x0 , 
  check_phys_apicid_present = 0xffffffff81046ff0 , 
  phys_pkg_id = 0xffffffff8104b980 , 
  get_apic_id = 0xffffffff8104b960 , 
  set_apic_id = 0xffffffff8104b970 , wakeup_secondary_cpu = 0x0 , 
  inquire_remote_apic = 0xffffffff8104b9b0 , 
  name = 0xffffffff821f0802 "physical flat"}

Thus, in this configuration the result is default_send_IPI_single_phys(cpu, CALL_FUNCTION_SINGLE_VECTOR). And this function invokes __default_send_IPI_dest_field() with interrupts disabled, which in turn, after some setup work, writes a command word that includes the desired IPI vector to location 0x300 offset by the APIC_BASE.

To be continued...

05 Sep 2020 12:12am GMT

03 Sep 2020

feedKernel Planet

James Bottomley: Lessons from the GNOME Patent Troll Incident

First, for all the lawyers who are eager to see the Settlement Agreement, here it is. The reason I can do this is that I've released software under an OSI approved licence, so I'm covered by the Releases and thus entitled to a copy of the agreement under section 10, but I'm not a party to any of the Covenants so I'm not forbidden from disclosing it.

Analysis of the attack

The Rothschild Modus Operandi is to obtain a fairly bogus patent (in this case, patent 9,936,086), form a limited liability corporation (LLC) that only holds the one patent and then sue a load of companies with vaguely related businesses for infringement. A key element of the attack is to offer a settlement licensing the patent for a sum less than it would cost even to mount an initial defence (usually around US$50k), which is how the Troll makes money: since the cost to file is fairly low, as long as there's no court appearance, the amount gained is close to US$50k if the target accepts the settlement offer and, since most targets know how much any defence of the patent would cost, they do.

One of the problems for the target is that once the patent is issued by the USPTO, the court must presume it is valid, so any defence that impugns the validity of the patent can't be decided at summary judgment. In the GNOME case, the sued project, shotwell, predated the filing of the patent by several years, so it should be obvious that even if shotwell did infringe the patent, it would have been prior art which should have prevented the issuing of the patent in the first place. Unfortunately such an obvious problem can't be used to get the case tossed on summary judgement because it impugns the validity of the patent. Put simply, once the USPTO issues a patent it's pretty much impossible to defend against accusations of infringement without an expensive trial which makes the settlement for small sums look very tempting.

If the target puts up any sort of fight, Rothschild, knowing the lack of merits to the case, will usually reduce the amount offered for settlement or, in extreme cases, simply drop the lawsuit. The last line of defence is the LLC. If the target finds some way to win damages (as ADS did in 2017) , the only thing on the hook is the LLC with the limited liability shielding Rothschild personally.

How it Played out Against GNOME

This description is somewhat brief, for a more in-depth description see the Medium article by Amanda Brock and Matt Berkowitz.

Rothschild performed the initial attack under the LLC RPI (Rothschild Patent Imaging). GNOME was fortunate enough to receive an offer of Pro Bono representation from Shearman and Sterling and immediately launched a defence fund (expecting that the cost of at least getting into court would be around US$200k, even with pro bono representation). One of its first actions, besides defending the claim was to launch a counterclaim against RPI alleging exceptional practices in bringing the claim. This serves two purposes: firstly, RPI can't now simply decide to drop the lawsuit, because the counterclaim survives and secondly, by alleging potential misconduct it seeks to pierce the LLC liability shield. GNOME also decided to try to obtain as much as it could for the whole of open source in the settlement.

As it became clear to Rothschild that GNOME wouldn't just pay up and they would create a potential liability problem in court, the offers of settlement came thick and fast culminating in an offer of a free licence and each side would pay their own costs. However GNOME persisted with the counter claim and insisted they could settle for nothing less than the elimination of the Rothschild patent threat from all of open source. The ultimate agreement reached, as you can read, does just that: gives a perpetual covenant not to sue any project under an OSI approved open source licence for any patent naming Leigh Rothschild as the inventor (i.e. the settlement terms go far beyond the initial patent claim and effectively free all of open source from any future litigation by Rothschild).

Analysis of the Agreement

Although the agreement achieves its aim, to rid all of Open Source of the Rothschild menace, it also contains several clauses which are suboptimal, but which had to be included to get a speedy resolution. In particular, Clause 10 forbids the GNOME foundation or its affiliates from publishing the agreement, which has caused much angst in open source circles about how watertight the agreement actually was. Secondly Clause 11 prohibits GNOME or its affiliates from pursuing any further invalidity challenges to any Rothschild patents leaving Rothschild free to pursue any non open source targets.

Fortunately the effect of clause 10 is now mitigated by me publishing the agreement and the effect of clause 11 by the fact that the Open Invention Network is now pursuing IPR invalidity actions against the Rothschild patents.

Lessons for the Future

The big lesson is that Troll based attacks are a growing threat to the Open Source movement. Even though the Rothschild source may have been neutralized, others may be tempted to follow his MO, so all open source projects have to be prepared for a troll attack.

The first lesson should necessarily be that if you're in receipt of a Troll attack, tell everyone. As an open source organization you're not going to be able to settle and you won't get either pro bono representation or the funds to fight the action unless people know about it.

The second lesson is that the community will rally, especially with financial aid, if you put out a call for help (and remember, you may be looking at legal bills in the six figure range).

The third lesson is always file a counter claim to give you significant leverage over the Troll in settlement negotiations.

And the fourth lesson is always refuse to settle for nothing less than neutralization of the threat to the entirety of open source.


While the lessons above should work if another Rothschild like Troll comes along, it's by no means guaranteed and the fact that Open Source project don't have the funding to defend themselves (even if they could raise it from the community) makes them look vulnerable. One thing the entire community could do to mitigate this problem is set up a community defence fund. We did this once before 16 years ago when SCO was threatening to sue Linux users and we could do it again. Knowing there was a deep pot to draw on would certainly make any Rothschild like Troll think twice about the vulnerability of an Open Source project, and may even deter the usual NPE type troll with more resources and better crafted patents.

Finally, it should be noted that this episode demonstrates how broken the patent system still is. The key element Rothschild like trolls require is the presumption of validity of a granted patent. In theory, in the light of the Alice decision, the USPTO should never have granted the patent but it did and once that happened the troll targets have no option than either to pay up the smaller sum requested or expend a larger sum on fighting in court. Perhaps if the USPTO can't stop the issuing of bogus patents it's time to remove the presumption of their validity in court … or at least provide some sort of prima facia invalidity test to apply at summary judgment (like the project is older than the patent, perhaps).

03 Sep 2020 4:53pm GMT

11 Nov 2011

feedLinux Today

Tech Comics: "How to Live with Non-Geeks"

Datamation: Geeks must realize that non-geeks simply don't understand some very basics things.

11 Nov 2011 11:00pm GMT

How To Activate Screen Saver In Ubuntu 11.10

AddictiveTip: Ubuntu 11.10 does not come with a default screen saver, and even Gnome 3 provides nothing but a black screen when your system is idle.

11 Nov 2011 10:00pm GMT

XFCE: Your Lightweight, Speedy, Fully-Fledged Linux Desktop

MakeUseOf: As far as Linux goes, customization is king

11 Nov 2011 9:00pm GMT

Fedora Scholarship Recognizes Students for Their Contributions to Open Source Software

Red Hat: The Fedora Scholarship is awarded to one student each year to assist with the recipient's college or university education.

11 Nov 2011 8:00pm GMT

Digital Divide Persists Even as Broadband Adoption Grows

Datamation: New report from Dept. of Commerce shows that the 'have nots' - continue to have not when it comes to Internet.

11 Nov 2011 7:00pm GMT

Why GNOME refugees love Xfce

The Register: Thunar rather than later...

11 Nov 2011 6:00pm GMT

Everything should be open source, says WordPress founder

Between the Lines: "It's a bold statement, but it's the ethos that Mullenweg admirably stuck to, pointing out that sites like Wikipedia replaced Encyclopedia Britannica, and how far Android has gone for mobile."

11 Nov 2011 5:02pm GMT

The Computer I Need

LXer: "Before I had a cell phone I did not realize that I needed one. As of one week ago, I did not realize that I needed a tablet either but I can sense that it might be a similar experience."

11 Nov 2011 4:01pm GMT

GPL violations in Android: Same arguments, different day

IT World: "IP attorney Edward J. Naughton is repeating his arguments that Google's use of Linux kernel header files within Android may be in violation of the GNU General Public License (GPLv2), and tries to discredit Linus Torvalds' thoughts on the matter along the way."

11 Nov 2011 3:04pm GMT

No uTorrent for Linux by Year's End

Softpedia: "When asked why there's no uTorrent client version of Linux users out, BitTorrent Inc. said that the company has other priorities at the moment."

11 Nov 2011 2:01pm GMT

Keep an Eye on Your Server with phpSysInfo

Linux Magazine: "There are quite a few server monitoring solutions out there, but most of them are overkill for keeping an eye on a single personal server."

11 Nov 2011 1:03pm GMT

At long last, Mozilla Releases Lightning 1.0 Calendar

InternetNews: From the 'Date and Time' files:

11 Nov 2011 12:00pm GMT

Richard Stallman's Personal Ad

Editors' Note: You can't make this stuff up...

11 Nov 2011 10:00am GMT

Linux Top 5: Fedora 16 Aims for the Cloud

LinuxPlanet: There are many things to explore on the Linux Planet. This week, a new Fedora release provides plenty of items to examine. The new Fedora release isn't the only new open source release this week, as the Linux Planet welcomes new KDE and Firefox releases as well.

11 Nov 2011 9:00am GMT

Orion Editor Ships in Firefox 8

Planet Orion: Firefox 8 now includes the Orion code editor in its scratchpad feature.

11 Nov 2011 6:00am GMT