01 Mar 2021
LXer Linux News
Red Hat RHEL For FREE, Kali Linux, GNOME 40, Modular Laptops, DIY N64 ROMs | This Week in Linux 140
on TWIL 140: Red Hat's RHEL Free for Open-Source Infrastructure, Mageia 8, Kali Linux, GNOME 40 Beta Released, Framework Modular Laptops, WINE On Wayland, RetroArch, Firefox 86, Tiny Core Linux 12.0
01 Mar 2021 9:34am GMT
How to create an RPM package from a tarball
Creating RPM package files can be both as easy or complicated as you desire. If you're needing to create an RPM package from a tarball (.tar.gz) that a vendor sent you, this tutorial will be beneficial for you. In this tutorial, I will describe a step-by-step procedure for building an RPM package. The procedure includes creating the required directory structure, configuring a .spec file required for the rpmbuild process, and building and installing the RPM package. An additional step is added for those who may need to create more RPM packages in the future via a skeleton file.
01 Mar 2021 7:22am GMT
Linus Torvalds went six days without electricity, swears smaller 5.12 kernel is co-incidental
Devs told if they really, really, need more time for merges they can have itLinux overlord Linus Torvalds has revealed that inclement weather in the USA meant he recently endured six electricity-free days in his Portland, Oregon, home during which he was unable to tend to the kernel. As a result he therefore pondered adding an extra week to the merge window for version 5.12 of the Linux kernel.…
01 Mar 2021 5:11am GMT
First Linux Kernel 5.12 Release Candidate Is Now Available for Public Testing
Here we go again! Linus Torvalds just announced today the general availability for public testing of the first Release Candidate (RC) milestone of the upcoming Linux 5.12 kernel series.
01 Mar 2021 2:59am GMT
How to run the Raspberry Pi Os in a virtual machine with Qemu and Kvm
Although many operating system are available for the Raspberry Pi, the official one is the Raspberry Pi Os. The operating system is made to run for the arm architecture, and can be easily installed on the SD card which will be used as the main Raspberry Pi storage device. Sometimes we may want to perform some tests or try some applications without having a physical Raspberry Pi machine; in this tutorial we will see how we can create a virtual machine with the Raspberry Pi Os system using Qemu and Kvm (Kernel Virtual Machine).
01 Mar 2021 1:45am GMT
28 Feb 2021
LXer Linux News
Mageia 8 Released with Better ARM Support and More
The Mageia team announced the release of Mageia 8 which brings some new features and enhancements. This is what's new.
28 Feb 2021 8:29pm GMT
Edit video on Linux with this Python app
In 2021, there are more reasons why people love Linux than ever before. In this series, I'll share 21 different reasons to use Linux. Here's how I use Linux to edit videos.read more
28 Feb 2021 7:13pm GMT
Xfce’s Apps Update for February 2021 Improves the Task Manager, Thunar, and More
In February 2021, Xfce, one of the lightweight desktop environments for GNU/Linux distributions, received a bunch of updates for its apps and core components.
28 Feb 2021 5:59pm GMT
How to Verify SHA256 Checksum of File in Linux using sha256sum
The last week I intend to install Red Hat on my System. I was able to download and create a bootable device using the dd command without any hassle.If you know, Red Hat has options to Test media [he] Install Red Hat Enterprises, so I have selected those options[/he] while it's verifying media, it shows the error of The file header checksum does not match computed checksum.
28 Feb 2021 2:37pm GMT
Nitrux 1.3.8 Is Here with KDE Plasma 5.21, Support for Linux Kernel 5.11 and Mesa-Git
Uri Herrera announced today the release and general availability of a new monthly update to the Nitrux Linux distribution, Nitrux 1.3.8, which comes with the latest GNU/Linux technologies and Open Source software.
28 Feb 2021 1:23pm GMT
How to manage Flatpak permissions graphicly?
The main purpose of Flatpak is to provide a centralized service for distributing applications. But while Penguin users enjoy the taste of updated and secured Linux apps, they have hard time managing Flatpak permissions for the lack of graphical front-end which helps them do so...
28 Feb 2021 12:08pm GMT
Building Programs from Source on any Linux Distribution in a simple way
For new Linux users, compiling and installing programs from source code for the first time might be a nightmare. But it's not; compiling from source code opens possibilities to use your favorite packages for any distribution.
28 Feb 2021 10:54am GMT
Contribute at the Fedora Audio, Kernel 5.11 and i18n test days
Fedora test days are events where anyone can help make sure changes in Fedora work well in an upcoming release. Fedora community members often participate, and the public is welcome at these events. If you've never contributed to Fedora before, this is a perfect way to get started. There are three upcoming test events in […]
28 Feb 2021 9:40am GMT
Linux Lite 5.4 Will Be Based on Ubuntu 20.04.2 LTS, Release Candidate Ready for Testing
Linux Lite maintainer and lead developer Jerry Bezencon announced today the availability for public testing of the Release Candidate of the upcoming Linux Lite 5.4 Ubuntu-based distribution.
28 Feb 2021 8:25am GMT
Copy file contents into Clipboard without displaying its contents, using Xclip and Xsel programs in Linux
This guide explains how to copy file contents into clipboard in Linux using Xclip and Xsel programs, without actually displaying the contents of the files.
28 Feb 2021 7:11am GMT
Happy birthday, Python, you're 30 years old this week: Easy to learn, and the right tool at the right time
Popular programming language, at the top of its game, still struggles to please everyone. Feature The 30th anniversary of Python this week finds the programming language at the top of its game, but not without challenges.…
28 Feb 2021 5:57am GMT
26 Feb 2021
Kernel Planet
Rusty Russell: A Model for Bitcoin Soft Fork Activation
TL;DR: There should be an option, taproot=lockintrue, which allows users to set lockin-on-timeout to true. It should not be the default, though.
As stated in my previous post, we need actual consensus, not simply the appearance of consensus. I'm pretty sure we have that for taproot, but I would like a template we can use in future without endless debate each time.
- Giving every group a chance to openly signal for (or against!) gives us the most robust assurance that we actually have consensus. Being able to signal opposition is vital, since everyone can lie anyway; making opposition difficult just reduces the reliability of the signal.
- Developers should not activate. They've tried to assure themselves that there's broad approval of the change, but that's not really a transferable proof. We should be concerned about about future corruption, insanity, or groupthink. Moreover, even the perception that developers can set the rules will lead to attempts to influence them as Bitcoin becomes more important. As a (non-Bitcoin-core) developer I can't think of a worse hell myself, nor do we want to attract developers who want to be influenced!
- Miner activation is actually brilliant. It's easy for everyone to count, and majority miner enforcement is sufficient to rely on the new rules. But its real genius is that miners are most directly vulnerable to the economic majority of users: in a fork they have to pick sides continuously knowing that if they are wrong, they will immediately suffer economically through missed opportunity cost.
- Of course, economic users are ultimately in control. Any system which doesn't explicitly encode that is fragile; nobody would argue that fair elections are unnecessary because if people were really dissatisfied they could always overthrow the government themselves! We should make it as easy for them to exercise this power as possible: this means not requiring them to run unvetted or home-brew modifications which will place them at more risk, so developers need to supply this option (setting it should also change the default User-Agent string, for signalling purposes). It shouldn't be an upgrade either (which inevitably comes with other changes). Such a default-off option provides both a simple method, and a Schelling point for the lockinontimeout parameters. It also means much less chance of this power being required: "Si vis pacem, para bellum".
This triumverate model may seem familiar, being widely used in various different governance systems. It seems the most robust to me, and is very close to what we have evolved into already. Formalizing it reduces uncertainty for any future changes, as well.
26 Feb 2021 2:17am GMT
21 Feb 2021
Kernel Planet
Matthew Garrett: Making hibernation work under Linux Lockdown
Linux draws a distinction between code running in kernel (kernel space) and applications running in userland (user space). This is enforced at the hardware level - in x86-speak[1], kernel space code runs in ring 0 and user space code runs in ring 3[2]. If you're running in ring 3 and you attempt to touch memory that's only accessible in ring 0, the hardware will raise a fault. No matter how privileged your ring 3 code, you don't get to touch ring 0.
Kind of. In theory. Traditionally this wasn't well enforced. At the most basic level, since root can load kernel modules, you could just build a kernel module that performed any kernel modifications you wanted and then have root load it. Technically user space code wasn't modifying kernel space code, but the difference was pretty semantic rather than useful. But it got worse - root could also map memory ranges belonging to PCI devices[3], and if the device could perform DMA you could just ask the device to overwrite bits of the kernel[4]. Or root could modify special CPU registers ("Model Specific Registers", or MSRs) that alter CPU behaviour via the /dev/msr interface, and compromise the kernel boundary that way.
It turns out that there were a number of ways root was effectively equivalent to ring 0, and the boundary was more about reliability (ie, a process running as root that ends up misbehaving should still only be able to crash itself rather than taking down the kernel with it) than security. After all, if you were root you could just replace the on-disk kernel with a backdoored one and reboot. Going deeper, you could replace the bootloader with one that automatically injected backdoors into a legitimate kernel image. We didn't have any way to prevent this sort of thing, so attempting to harden the root/kernel boundary wasn't especially interesting.
In 2012 Microsoft started requiring vendors ship systems with UEFI Secure Boot, a firmware feature that allowed[5] systems to refuse to boot anything without an appropriate signature. This not only enabled the creation of a system that drew a strong boundary between root and kernel, it arguably required one - what's the point of restricting what the firmware will stick in ring 0 if root can just throw more code in there afterwards? What ended up as the Lockdown Linux Security Module provides the tooling for this, blocking userspace interfaces that can be used to modify the kernel and enforcing that any modules have a trusted signature.
But that comes at something of a cost. Most of the features that Lockdown blocks are fairly niche, so the direct impact of having it enabled is small. Except that it also blocks hibernation[6], and it turns out some people were using that. The obvious question is "what does hibernation have to do with keeping root out of kernel space", and the answer is a little convoluted and is tied into how Linux implements hibernation. Basically, Linux saves system state into the swap partition and modifies the header to indicate that there's a hibernation image there instead of swap. On the next boot, the kernel sees the header indicating that it's a hibernation image, copies the contents of the swap partition back into RAM, and then jumps back into the old kernel code. What ensures that the hibernation image was actually written out by the kernel? Absolutely nothing, which means a motivated attacker with root access could turn off swap, write a hibernation image to the swap partition themselves, and then reboot. The kernel would happily resume into the attacker's image, giving the attacker control over what gets copied back into kernel space.
This is annoying, because normally when we think about attacks on swap we mitigate it by requiring an encrypted swap partition. But in this case, our attacker is root, and so already has access to the plaintext version of the swap partition. Disk encryption doesn't save us here. We need some way to verify that the hibernation image was written out by the kernel, not by root. And thankfully we have some tools for that.
Trusted Platform Modules (TPMs) are cryptographic coprocessors[7] capable of doing things like generating encryption keys and then encrypting things with them. You can ask a TPM to encrypt something with a key that's tied to that specific TPM - the OS has no access to the decryption key, and nor does any other TPM. So we can have the kernel generate an encryption key, encrypt part of the hibernation image with it, and then have the TPM encrypt it. We store the encrypted copy of the key in the hibernation image as well. On resume, the kernel reads the encrypted copy of the key, passes it to the TPM, gets the decrypted copy back and is able to verify the hibernation image.
That's great! Except root can do exactly the same thing. This tells us the hibernation image was generated on this machine, but doesn't tell us that it was done by the kernel. We need some way to be able to differentiate between keys that were generated in kernel and ones that were generated in userland. TPMs have the concept of "localities" (effectively privilege levels) that would be perfect for this. Userland is only able to access locality 0, so the kernel could simply use locality 1 to encrypt the key. Unfortunately, despite trying pretty hard, I've been unable to get localities to work. The motherboard chipset on my test machines simply doesn't forward any accesses to the TPM unless they're for locality 0. I needed another approach.
TPMs have a set of Platform Configuration Registers (PCRs), intended for keeping a record of system state. The OS isn't able to modify the PCRs directly. Instead, the OS provides a cryptographic hash of some material to the TPM. The TPM takes the existing PCR value, appends the new hash to that, and then stores the hash of the combination in the PCR - a process called "extension". This means that the new value of the TPM depends not only on the value of the new data, it depends on the previous value of the PCR - and, in turn, that previous value depended on its previous value, and so on. The only way to get to a specific PCR value is to either (a) break the hash algorithm, or (b) perform exactly the same sequence of writes. On system reset the PCRs go back to a known value, and the entire process starts again.
Some PCRs are different. PCR 23, for example, can be reset back to its original value without resetting the system. We can make use of that. The first thing we need to do is to prevent userland from being able to reset or extend PCR 23 itself. All TPM accesses go through the kernel, so this is a simple matter of parsing the write before it's sent to the TPM and returning an error if it's a sensitive command that would touch PCR 23. We now know that any change in PCR 23's state will be restricted to the kernel.
When we encrypt material with the TPM, we can ask it to record the PCR state. This is given back to us as metadata accompanying the encrypted secret. Along with the metadata is an additional signature created by the TPM, which can be used to prove that the metadata is both legitimate and associated with this specific encrypted data. In our case, that means we know what the value of PCR 23 was when we encrypted the key. That means that if we simply extend PCR 23 with a known value in-kernel before encrypting our key, we can look at the value of PCR 23 in the metadata. If it matches, the key was encrypted by the kernel - userland can create its own key, but it has no way to extend PCR 23 to the appropriate value first. We now know that the key was generated by the kernel.
But what if the attacker is able to gain access to the encrypted key? Let's say a kernel bug is hit that prevents hibernation from resuming, and you boot back up without wiping the hibernation image. Root can then read the key from the partition, ask the TPM to decrypt it, and then use that to create a new hibernation image. We probably want to prevent that as well. Fortunately, when you ask the TPM to encrypt something, you can ask that the TPM only decrypt it if the PCRs have specific values. "Sealing" material to the TPM in this way allows you to block decryption if the system isn't in the desired state. So, we define a policy that says that PCR 23 must have the same value at resume as it did on hibernation. On resume, the kernel resets PCR 23, extends it to the same value it did during hibernation, and then attempts to decrypt the key. Afterwards, it resets PCR 23 back to the initial value. Even if an attacker gains access to the encrypted copy of the key, the TPM will refuse to decrypt it.
And that's what this patchset implements. There's one fairly significant flaw at the moment, which is simply that an attacker can just reboot into an older kernel that doesn't implement the PCR 23 blocking and set up state by hand. Fortunately, this can be avoided using another aspect of the boot process. When you boot something via UEFI Secure Boot, the signing key used to verify the booted code is measured into PCR 7 by the system firmware. In the Linux world, the Shim bootloader then measures any additional keys that are used. By either using a new key to tag kernels that have support for the PCR 23 restrictions, or by embedding some additional metadata in the kernel that indicates the presence of this feature and measuring that, we can have a PCR 7 value that verifies that the PCR 23 restrictions are present. We then seal the key to PCR 7 as well as PCR 23, and if an attacker boots into a kernel that doesn't have this feature the PCR 7 value will be different and the TPM will refuse to decrypt the secret.
While there's a whole bunch of complexity here, the process should be entirely transparent to the user. The current implementation requires a TPM 2, and I'm not certain whether TPM 1.2 provides all the features necessary to do this properly - if so, extending it shouldn't be hard, but also all systems shipped in the past few years should have a TPM 2, so that's going to depend on whether there's sufficient interest to justify the work. And we're also at the early days of review, so there's always the risk that I've missed something obvious and there are terrible holes in this. And, well, given that it took almost 8 years to get the Lockdown patchset into mainline, let's not assume that I'm good at landing security code.
[1] Other architectures use different terminology here, such as "supervisor" and "user" mode, but it's broadly equivalent
[2] In theory rings 1 and 2 would allow you to run drivers with privileges somewhere between full kernel access and userland applications, but in reality we just don't talk about them in polite company
[3] This is how graphics worked in Linux before kernel modesetting turned up. XFree86 would just map your GPU's registers into userland and poke them directly. This was not a huge win for stability
[4] IOMMUs can help you here, by restricting the memory PCI devices can DMA to or from. The kernel then gets to allocate ranges for device buffers and configure the IOMMU such that the device can't DMA to anything else. Except that region of memory may still contain sensitive material such as function pointers, and attacks like this can still cause you problems as a result.
[5] This describes why I'm using "allowed" rather than "required" here
[6] Saving the system state to disk and powering down the platform entirely - significantly slower than suspending the system while keeping state in RAM, but also resilient against the system losing power.
[7] With some handwaving around "coprocessor". TPMs can't be part of the OS or the system firmware, but they don't technically need to be an independent component. Intel have a TPM implementation that runs on the Management Engine, a separate processor built into the motherboard chipset. AMD have one that runs on the Platform Security Processor, a small ARM core built into their CPU. Various ARM implementations run a TPM in Trustzone, a special CPU mode that (in theory) is able to access resources that are entirely blocked off from anything running in the OS, kernel or otherwise.
comments
21 Feb 2021 8:37am GMT
18 Feb 2021
Kernel Planet
Rusty Russell: Bitcoin Consensus and Solidarity
Bitcoin's consensus rules define what is valid, but this isn't helpful when we're looking at changing the rules themselves. The trend in Bitcoin has been to make such changes in an increasingly inclusive and conservative manner, but we are still feeling our way through this, and appreciating more nuance each time we do so.
To use Bitcoin, you need to remain in the supermajority of consensus on what the rules are. But you can never truly know if you are. Everyone can signal, but everyone can lie. You can't know what software other nodes or miners are running: even expensive testing of miners by creating an invalid block only tests one possible difference, may still give a false negative, and doesn't mean they can't change a moment later.
This risk of being left out is heightened greatly when the rules change. This is why we need to rely on multiple mechanisms to reassure ourselves that consensus will be maintained:
- Developers assure themselves that the change is technically valid, positive and has broad support. The main tools for this are open communication, and time. Developers signal support by implementing the change.
- Users signal their support by upgrading their nodes.
- Miners signal their support by actually tagging their blocks.
We need actual consensus, not simply the appearance of consensus. Thus it is vital that all groups know they can express their approval or rejection, in a way they know will be heard by others. In the end, the economic supermajority of Bitcoin users can set the rules, but no other group or subgroup should have inordinate influence, nor should they appear to have such control.
The Goodwill Dividend
A Bitcoin community which has consensus and knows it is not only safest from a technical perspective: the goodwill and confidence gives us all assurance that we can make (or resist!) changes in future.
It will also help us defend against the inevitable attacks and challenges we are going to face, which may be a more important effect than any particular soft-fork feature.
18 Feb 2021 3:29am GMT
09 Feb 2021
Kernel Planet
Kees Cook: security things in Linux v5.8
Previously: v5.7
Linux v5.8 was released in August, 2020. Here's my summary of various security things that caught my attention:
arm64 Branch Target Identification
Dave Martin added support for ARMv8.5's Branch Target Instructions (BTI), which are enabled in userspace at execve() time, and all the time in the kernel (which required manually marking up a lot of non-C code, like assembly and JIT code).
With this in place, Jump-Oriented Programming (JOP, where code gadgets are chained together with jumps and calls) is no longer available to the attacker. An attacker's code must make direct function calls. This basically reduces the "usable" code available to an attacker from every word in the kernel text to only function entries (or jump targets). This is a "low granularity" forward-edge Control Flow Integrity (CFI) feature, which is important (since it greatly reduces the potential targets that can be used in an attack) and cheap (implemented in hardware). It's a good first step to strong CFI, but (as we've seen with things like CFG) it isn't usually strong enough to stop a motivated attacker. "High granularity" CFI (which uses a more specific branch-target characteristic, like function prototypes, to track expected call sites) is not yet a hardware supported feature, but the software version will be coming in the future by way of Clang's CFI implementation.
arm64 Shadow Call Stack
Sami Tolvanen landed the kernel implementation of Clang's Shadow Call Stack (SCS), which protects the kernel against Return-Oriented Programming (ROP) attacks (where code gadgets are chained together with returns). This backward-edge CFI protection is implemented by keeping a second dedicated stack pointer register (x18
) and keeping a copy of the return addresses stored in a separate "shadow stack". In this way, manipulating the regular stack's return addresses will have no effect. (And since a copy of the return address continues to live in the regular stack, no changes are needed for back trace dumps, etc.)
It's worth noting that unlike BTI (which is hardware based), this is a software defense that relies on the location of the Shadow Stack (i.e. the value of x18
) staying secret, since the memory could be written to directly. Intel's hardware ROP defense (CET) uses a hardware shadow stack that isn't directly writable. ARM's hardware defense against ROP is PAC (which is actually designed as an arbitrary CFI defense - it can be used for forward-edge too), but that depends on having ARMv8.3 hardware. The expectation is that SCS will be used until PAC is available.
Kernel Concurrency Sanitizer infrastructure added
Marco Elver landed support for the Kernel Concurrency Sanitizer, which is a new debugging infrastructure to find data races in the kernel, via CONFIG_KCSAN
. This immediately found real bugs, with some fixes having already landed too. For more details, see the KCSAN documentation.
new capabilities
Alexey Budankov added CAP_PERFMON
, which is designed to allow access to perf()
. The idea is that this capability gives a process access to only read aspects of the running kernel and system. No longer will access be needed through the much more powerful abilities of CAP_SYS_ADMIN
, which has many ways to change kernel internals. This allows for a split between controls over the confidentiality (read access via CAP_PERFMON) of the kernel vs control over integrity (write access via CAP_SYS_ADMIN).
Alexei Starovoitov added CAP_BPF
, which is designed to separate BPF access from the all-powerful CAP_SYS_ADMIN
. It is designed to be used in combination with CAP_PERFMON
for tracing-like activities and CAP_NET_ADMIN
for networking-related activities. For things that could change kernel integrity (i.e. write access), CAP_SYS_ADMIN
is still required.
network random number generator improvements
Willy Tarreau made the network code's random number generator less predictable. This will further frustrate any attacker's attempts to recover the state of the RNG externally, which might lead to the ability to hijack network sessions (by correctly guessing packet states).
fix various kernel address exposures to non-CAP_SYSLOG
I fixed several situations where kernel addresses were still being exposed to unprivileged (i.e. non-CAP_SYSLOG
) users, though usually only through odd corner cases. After refactoring how capabilities were being checked for files in /sys
and /proc
, the kernel modules sections, kprobes, and BPF exposures got fixed. (Though in doing so, I briefly made things much worse before getting it properly fixed. Yikes!)
RISCV W^X detection
Following up on his recent work to enable strict kernel memory protections on RISCV, Zong Li has now added support for CONFIG_DEBUG_WX
as seen for other architectures. Any writable and executable memory regions in the kernel (which are lovely targets for attackers) will be loudly noted at boot so they can get corrected.
execve()
refactoring continues
Eric W. Biederman continued working on execve()
refactoring, including getting rid of the frequently problematic recursion used to locate binary handlers. I used the opportunity to dust off some old binfmt_script
regression tests and get them into the kernel selftests.
multiple /proc
instances
Alexey Gladkov modernized /proc
internals and provided a way to have multiple /proc
instances mounted in the same PID namespace. This allows for having multiple views of /proc
, with different features enabled. (Including the newly added hidepid=4 and subset=pid mount options.)
set_fs()
removal continues
Christoph Hellwig, with Eric W. Biederman, Arnd Bergmann, and others, have been diligently working to entirely remove the kernel's set_fs()
interface, which has long been a source of security flaws due to weird confusions about which address space the kernel thought it should be accessing. Beyond things like the lower-level per-architecture signal handling code, this has needed to touch various parts of the ELF loader, and networking code too.
READ_IMPLIES_EXEC
is no more for native 64-bit
The READ_IMPLIES_EXEC
flag was a work-around for dealing with the addition of non-executable (NX) memory when x86_64 was introduced. It was designed as a way to mark a memory region as "well, since we don't know if this memory region was expected to be executable, we must assume that if we need to read it, we need to be allowed to execute it too". It was designed mostly for stack memory (where trampoline code might live), but it would carry over into all mmap()
allocations, which would mean sometimes exposing a large attack surface to an attacker looking to find executable memory. While normally this didn't cause problems on modern systems that correctly marked their ELF sections as NX, there were still some awkward corner-cases. I fixed this by splitting READ_IMPLIES_EXEC
from the ELF PT_GNU_STACK
marking on x86 and arm/arm64, and declaring that a native 64-bit process would never gain READ_IMPLIES_EXEC
on x86_64 and arm64, which matches the behavior of other native 64-bit architectures that correctly didn't ever implement READ_IMPLIES_EXEC
in the first place.
array index bounds checking continues
As part of the ongoing work to use modern flexible arrays in the kernel, Gustavo A. R. Silva added the flex_array_size()
helper (as a cousin to struct_size()
). The zero/one-member into flex array conversions continue with over a hundred commits as we slowly get closer to being able to build with -Warray-bounds
.
scnprintf()
replacement continues
Chen Zhou joined Takashi Iwai in continuing to replace potentially unsafe uses of sprintf()
with scnprintf()
. Fixing all of these will make sure the kernel avoids nasty buffer concatenation surprises.
That's it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.9.
© 2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
09 Feb 2021 12:47am GMT
05 Feb 2021
Kernel Planet
Greg Kroah-Hartman: 8 bits are enough for a version number...
As was pointed out to us stable kernel maintainers last week, the overflow of the .y release number was going to happen soon, and our proposed solution for it (use 16 bits instead of 8), turns out to be breaking a userspace-visable api.
As we can't really break this, I did a release of the 4.4.256 and 4.9.256 releases today that contain nothing but a new version number. See the links for the full technical details if curious.
Right now I'm asking that everyone who uses these older kernel releases to upgrade to this release, and do a full rebuild of their systems in order to see what might, or might not, break. If problems happen, please let us know on the stable@vger.kernel.org
mailing list as soon as possible as I can only hold off on doing new stable releases for these branches for a single week only (i.e. February 12, 2021).
05 Feb 2021 2:29pm GMT
03 Feb 2021
Kernel Planet
Greg Kroah-Hartman: Helping out with LTS kernel releases
A recent email thread about "Why isn't the 5.10 stable kernel listed as supported for 6 years yet!" on the linux-kernel mailing list ended up generating a bunch of direct emails to me asking what could different companies and individuals due to help out. What exactly was I looking for here?
Instead of having to respond to private emails with the same information over and over, I figured it was better to just put it here so that everyone can see what exactly I am expecting with regards to support in order to be able to maintain a kernel for longer than 2 years:
What I need help with
All I request is that people test the -rc releases when I announce them, and let me know if they work or not for their systems/workloads/tests/whatever.
If you look at the -rc announcements today, you will see a number of different people/groups responding with this information. If they want, they can provide a Tested-by: ...
line that I will add to the release commit, or not, that's up to them.
Here and here and here and here and here are all great examples of how people let me know that all is ok with the -rc kernels so that I know it is "safe" to do the release.
I also have a few companies send me private emails that all is good, there's no requirement to announce this in public if you don't want to (but it is nice, as kernel development should be done in public.)
Some companies can't do tests on -rc releases due to their build infrastructures not handling that very well, so they email me after the stable release is out, saying all is good. Worst case, we end up reverting a patch in a released kernel, but it's better to quickly do that based on testing than to miss it entirely because no one is testing at all.
And that's it!
But, if you want to do more, I always really appreciate when people email me, or stable@vger.kernel.org
, git commit ids that are needed to be backported to specific stable kernel trees because they found them in their testing/development efforts. You know what problems you hit better than anyone, and once those issues are found and fixed, making sure they get backported is a good thing, so I always want to know that.
Again, if you look on the stable@vger.kernel.org
list, you will see different companies and developers providing backports of things they want backported, or just a list of the git commit ids if the backports apply cleanly.
Does that sound reasonable? I want to make sure that the LTS kernels that you rely on actually work for you without regressions, so testing is key, as is finding any fixes that are needed for them.
It's not much, but I can't do it alone :)
So, 6 years or not for 5.10?
The above is what I need in order to be able to support a kernel for 6 years, constant testing by users of the kernels. If we don't have that, then why even do these releases because that must mean that no one is using them? So email me and let me know.
As of this point in time (February 3, 2021), I do not have enough committments by companies to help out with this effort to be able to say I can do this for 6 years right now (note, no response yet from the company that originally asked this question…) Hopefully that changes soon, and if it does, the kernel.org release page will be updated with the new date.
03 Feb 2021 4:37pm GMT
31 Dec 2020
Kernel Planet
James Bottomley: Deploying Encrypted Images for Confidential Computing
In the previous post I looked at how you build an encrypted image that can maintain its confidentiality inside AMD SEV or Intel TDX. In this post I'll discuss how you actually bring up a confidential VM from an encrypted image while preserving secrecy. However, first a warning: This post represents the state of the art and includes patches that are certainly not deployed in distributions and may not even be upstream, so if you want to follow along at home you'll need to patch things like qemu, grub and OVMF. I should also add that, although I'm trying to make everything generic to confidential environments, this post is based on AMD SEV, which is the only confidential encrypted1 environment currently shipping.
The Basics of a Confidential Computing VM
At its base, current confidential computing environments are about using encrypted memory to run the virtual machine and guarding the encryption key so that the owner of the host system (the cloud service provider) can't get access to it. Both SEV and TDX have the encryption technology inside the main memory controller meaning the L1 cache isn't encrypted (still vulnerable to cache side channels) and DMA to devices must also be done via unencryped memory. This latter also means that both the BIOS and the Operating System of the guest VM must be enlightened to understand which pages to encrypted and which must not. For this reason, all confidential VM systems use OVMF2 to boot because this contains the necessary enlightening. To a guest, the VM encryption looks identical to full memory encryption on a physical system, so as long as you have a kernel which supports Intel or AMD full memory encryption, it should boot.
Each confidential computing system has a security element which sits between the encrypted VM and the host. In SEV this is an aarch64 processor called the Platform Security Processor (PSP) and in TDX it is an SGX enclave running Intel proprietary code. The job of the PSP is to bootstrap the VM, including encrypting the initial OVMF and inserting the encrypted pages. The security element also includes a validation certificate, which incorporates a Diffie-Hellman (DH) key. Once the guest owner obtains and validates the DH key it can use it to construct a one time ECDH encrypted bundle that can be passed to the security element on bring up. This bundle includes an encryption key which can be used to encrypt secrets for the security element and a validation key which can be used to verify measurements from the security element.
The way QEMU boots a Q35 machine is to set up all the configuration (including a disk device attached to the VM Image) load up the OVMF into rom memory and start the system running. OVMF pulls in the QEMU configuration and constructs the necessary ACPI configuration tables before executing grub and the kernel from the attached storage device. In a confidential VM, the first task is to establish a Guest Owner (the person whose encrypted VM it is) which is usually different from the Host Owner (the person running or controlling the Physical System). Ownership is established by transferring an encrypted bundle to the Secure Element before the VM is constructed.
The next step is for the VMM (QEMU in this case) to ask the secure element to provision the OVMF Firmware. Since the initial OVMF is untrusted, the Guest Owner should ask the Secure Element for an attestation of the memory contents before the VM is started. Since all paths lead through the Host Owner, who is also untrusted, the attestation contains a random nonce to prevent replay and is HMAC'd with a Guest Supplied key from the Launch Bundle. Once the Guest Owner is happy with the VM state, it supplies the Wrapped Key to the secure element (along with the nonce to prevent replay) and the Secure Element unwraps the key and provisions it to the VM where the Guest OS can use it for disc encryption. Finally, the enlightened guest reads the encrypted disk to unencrypted memory using DMA but uses the disk encryptor to decrypt it to encrypted memory, so the contents of the Encrypted VM Image are never visible to the Host Owner.
The Gaps in the System
The most obvious gap is that EFI booting systems don't go straight from the OVMF firmware to the OS, they have to go via an EFI bootloader (grub, usually) which must be an efi binary on an unencrypted vFAT partition. The second gap is that grub must be modified to pick the disk encryption key out of wherever the Secure Element has stashed it. The third is that the key is currently stashed in VM memory before OVMF starts, so OVMF must know not to use or corrupt the memory. A fourth problem is that the current recommended way of booting OVMF has a flash drive for persistent variable storage which is under the control of the host owner and which isn't part of the initial measurement.
Plugging The Gaps: OVMF
To deal with the problems in reverse order: the variable issue can be solved simply by not having a persistent variable store, since any mutable configuration information could be used to subvert the boot and leak the secret. This is achieved by stripping all the mutable variable handling out of OVMF. Solving key stashing simply means getting OVMF to set aside a page for a secret area and having QEMU recognise where it is for the secret injection. It turns out AMD were already working on a QEMU configuration table at a known location by the Reset Vector in OVMF, so the secret area is added as one of these entries. Once this is done, QEMU can retrieve the injection location from the OVMF binary so it doesn't have to be specified in the QEMU Machine Protocol (QMP) command. Finally OVMF can protect the secret and package it up as an EFI configuration table for later collection by the bootloader.
The final OVMF change (which is in the same patch set) is to pull grub inside a Firmware Volume and execute it directly. This certainly isn't the only possible solution to the problem (adding secure boot or an encrypted filesystem were other possibilities) but it is the simplest solution that gives a verifiable component that can be invariant across arbitrary encrypted boots (so the same OVMF can be used to execute any encrypted VM securely). This latter is important because traditionally OVMF is supplied by the host owner rather than being part of the VM image supplied by the guest owner. The grub script that runs from the combined volume must still be trusted to either decrypt the root or reboot to avoid leaking the key. Although the host owner still supplies the combined OVMF, the measurement assures the guest owner of its correctness, which is why having a fairly invariant component is a good idea … so the guest owner doesn't have potentially thousands of different measurements for approved firmware.
Plugging the Gaps: QEMU
The modifications to QEMU are fairly simple, it just needs to scan the OVMF file to determine the location for the injected secret and inject it correctly using a QMP command.. Since secret injection is already upstream, this is a simple find and make the location optional patch set.
Plugging the Gaps: Grub
Grub today only allows for the manual input of the cryptodisk password. However, in the cloud we can't do it this way because there's no guarantee of a secure tty channel to the VM. The solution, therefore, is to modify grub so that the cryptodisk can use secrets from a provider, in addition to the manual input. We then add a provider that can read the efi configuration tables and extract the secret table if it exists. The current incarnation of the proposed patch set is here and it allows cryptodisk to extract a secret from an efisecret provider. Note this isn't quite the same as the form expected by the upstream OVMF patch in its grub.cfg because now the provider has to be named on the cryptodisk command line thus
cryptodisk -s efisecret
but in all other aspects, Grub/grub.cfg works. I also discovered several other deviations from the initial grub.cfg (like Fedora uses /boot/grub2 instead of /boot/grub like everyone else) so the current incarnation of grub.cfg is here. I'll update it as it changes.
Putting it All Together
Once you have applied all the above patches and built your version of OVMF with grub inside, you're ready to do a confidential computing encrypted boot. However, you still need to verify the measurement and inject the encrypted secret. As I said before, this isn't easy because, due to replay defeat requirements, the secret bundle must be constructed on the fly for each VM boot. From this point on I'm going to be using only AMD SEV as the example because the Intel hardware doesn't yet exist and AMD kindly gave IBM research a box to play with (Anyone with a new EPYC 7xx1 or 7xx2 based workstation can likely play along at home, but check here). The first thing you need to do is construct a launch bundle. AMD has a tool called sev-tool to do this for you and the first thing you need to do is obtain the platform Diffie Hellman certificate (pdh.cert). The tool will extract this for you
sevtool --pdh_cert_export
Or it can be given to you by the cloud service provider (in this latter case you'll want to verify the provenance using sevtool -validate_cert_chain, which contacts the AMD site to verify all the details). Once you have a trusted pdh.cert, you can use this to generate your own guest owner DH cert (godh.cert) which should be used only one time to give a semblance of ECDHE. godh.cert is used with pdh.cert to derive an encryption key for the launch bundle. You can generate this with
sevtool --generate_launch_blob <policy>
The gory details of policy are in the SEV manual chapter 3, but most guests use 1 which means no debugging. This command will generate the godh.cert, the launch_blob.bin and a tmp_tk.bin file which you must save and keep secure because it contains the Transport Encryption and Integrity Keys (TEK and TIK) which will be used to encrypt the secret. Figuring out the qemu command line options needed to launch and pause a SEV guest is a bit of a palaver, so here is mine. You'll likely need to change things, like the QMP port and the location of your OVMF build and the launch secret.
Finally you need to get the launch measure from QMP, verify it against the sha256sum of OVMF.fd and create the secret bundle with the correct GUID headers. Since this is really fiddly to do with sevtool, I wrote this python script3 to do it all (note it requires qmp.py from the qemu git repository). You execute it as
sevsecret.py --passwd <disk passwd> --tiktek-file <location of tmp_tk.bin> --ovmf-hash <hash> --socket <qmp socket>
And it will verify the launch measure and encrypt the secret for the VM if the measure is correct and start the VM. If you got everything correct the VM will simply boot up without asking for a password (if you inject the wrong secret, it will still ask). And there you have it: you've booted up a confidential VM from an encrypted image file. If you're like me, you'll also want to fire up gdb on the qemu process just to show that the entire memory of the VM is encrypted …
Conclusions and Caveats
The above script should allow you to boot an encrypted VM anywhere: locally or in the cloud, provided you can access the QMP port (most clouds use libvirt which introduces yet another additional layering pain). The biggest drawback, if you refer to the diagram, is the yellow box: you must trust the secret element, which in both Intel and AMD is proprietary4, in order to get confidential computing to work. Although there is hope that in future the secret element could be fully open source, it isn't today.
The next annoyance is that launching a confidential VM is high touch requiring collaboration from both the guest owner and the host owner (due to the anti-replay nonce). For a single launch, this is a minor annoyance but for an autoscaling (launch VMs as needed) platform it becomes a major headache. The solution seems to be to have some Hardware Security Module (HSM), like the cloud uses today to store encryption keys securely, and have it understand how to measure and launch encrypted VMs on behalf of the guest owner.
The final conclusion to remember is that confidentiality is not security: your VM is as exploitable inside a confidential encrypted VM as it was outside. In many ways confidentiality and security are opposites, in that security in part requires reducing the trusted code and confidentiality requires pulling as much as possible inside. Confidential VMs do have an answer to the Cloud trust problem since the enterprise can now deploy VMs without fear of tampering by the cloud provider, but those VMs are as insecure in the cloud as they were in the Enterprise Data Centre. All of this argues that Confidential Computing, while an important milestone, is only one step on the journey to cloud security.
Patch Status
The OVMF patches are upstream (including modifications requested by Intel for TDX). The QEMU and grub patch sets are still on the lists.
31 Dec 2020 10:40pm GMT
30 Dec 2020
Kernel Planet
Paul E. Mc Kenney: Parallel Programming: December 2020 Update
This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features numerous improvments:
- LaTeX and build-system upgrades (including helpful error checking and reporting), formatting improvements (including much nicer display of hyperlinks and of Quick Quizzes, polishing of numerous figures and tables, plus easier builds for A4 paper), refreshing of numerous broken URLs, an improved "make help" command (see below), improved FAQ-BUILD material, and a prototype index, all courtesy of Akira Yokosawa.
- A lengthy Quick Quiz on the relationship of half-barriers, compilers, CPUs, and locking primitives, courtesy of Patrick Yingxi Pan.
- Updated performance results throughout the book, courtesy of a large x86 system kindly provided by Facebook.
- Compiler tricks, RCU semantics, and other material from the Linux-kernel memory model added to the memory-ordering and tools-of-the-trade chapters.
- Improved discussion of non-blocking-synchronization algorithms.
- Many new citations, cross-references, fixes, and touchups throughout the book.
A number of issues were spotted by Motohiro Kanda in the course of his translation of this book to Japanese, and Borislav Petkov, Igor Dzreyev, and Junchang Wang also provided much-appreciated fixes.
The output of the aforementioned make help is as follows:
Official targets (Latin Modern Typewriter for monospace font): Full, Abbr. perfbook.pdf, 2c: (default) 2-column layout perfbook-1c.pdf, 1c: 1-column layout Set env variable PERFBOOK_PAPER to change paper size: PERFBOOK_PAPER=A4: a4paper PERFBOOK_PAPER=HB: hard cover book other (default): letterpaper make help-full" will show the full list of available targets.
The following excerpt of the make help-full command's output might be of interest to those who find Quick Quizzes distracting:
Experimental targets: Full, Abbr. perfbook-qq.pdf, qq: framed Quick Quizzes perfbook-nq.pdf, nq: no inline Quick Quizzes (chapterwise Answers)
Thus, the make nq command creates a perfbook-nq.pdf with Quick Quizzes and their answers grouped at the end of each chapter, in the usual textbook style, while still providing PDF navigation from each Quick Quiz to the relevant portion of that chapter.
Finally, this release also happens to be the first release candidate for the long-awaited Second Edition, which should be available shortly.
30 Dec 2020 5:33am GMT
23 Dec 2020
Kernel Planet
James Bottomley: Building Encrypted Images for Confidential Computing
With both Intel and AMD announcing confidential computing features to run encrypted virtual machines, IBM research has been looking into a new format for encrypted VM images. The first question is why a new format, after all qcow2 only recently deprecated its old encrypted image format in favour of luks. The problem is that in confidential computing, the guest VM runs inside the secure envelope but the host hypervisor (including the QEMU process) is untrusted and thus runs outside the secure envelope and, unfortunately, even for the new luks format, the encryption of the image is handled by QEMU and so the encryption key would be outside the secure envelope. Thus, a new format is needed to keep the encryption key (and, indeed, the encryption mechanism) within the guest VM itself. Fortunately, encrypted boot of Linux systems has been around for a while, and this can be used as a practical template for constructing a fully confidential encrypted image format and maintaining that confidentiality within a hostile cloud environment. In this article, I'll explore the state of the art in encrypted boot, constructing EFI encrypted boot images, and finally, in the follow on article, look at deploying an encrypted image into a confidential environment and maintaining key secrecy in the cloud.
Encrypted Boot State of the Art
Luks and the cryptsetup toolkit have been around for a while and recently (in 2018), the luks format was updated to version 2. However, actually booting a linux kernel from an encrypted partition has always been a bit of a systems problem, primarily because the bootloader (grub) must decrypt the partition to actually load the kernel. Fortunately, grub can do this, but unfortunately the current grub in most distributions (2.04) can only read the version 1 luks format. Secondly, the user must type the decryption passphrase into grub (so it can pull the kernel and initial ramdisk out of the encrypted partition to boot them), but grub currently has no mechanism to pass it on to the initial ramdisk for mounting root, meaning that either the user has to type their passphrase twice (annoying) or the initial ramdisk itself has to contain a file with the disk passphrase. This latter is the most commonly used approach and only has minor security implications when the system is in motion (the ramdisk and the key file must be root read only) and the password is protected at rest by the fact that the initial ramdisk is also on the encrypted volume. Even more annoying is the fact that there is no distribution standard way of creating the initial ramdisk. Debian (and Ubuntu) have the most comprehensive documentation on how to do this, so the next section will look at the much less well documented systemd/dracut mechanism.
Encrypted Boot for Systemd/Dracut
Part of the problem here seems to be less that stellar systems co-ordination between the two components. Additionally, the way systemd supports passphraseless encrypted volumes has been evolving for a while but changed again in v246 to mirror the Debian method. Since cloud images are usually pretty up to date, I'll describe this new way. Each encrypted volume is referred to by UUID (which will be the UUID of the containing partition returned by blkid). To get dracut to boot from an encrypted partition, you must pass in
rd.luks.uuid=<UUID>
but you must also have a key file named
/etc/cryptsetup-keys.d/luks-<UUID>.key
And, since dracut hasn't yet caught up with this, you usually need a cryptodisk.conf file in /etc/dracut.conf.d/ which contains
install_items+=" /etc/cryptsetup-keys.d/* "
Grub and EFI Booting Encrypted Images
Traditionally grub is actually installed into the disk master boot record, but for EFI boot that changed and the disk (or VM image) must have an EFI System partition which is where the grub.efi binary is installed. Part of the job of the grub.efi binary is to find the root partition and source the /boot/grub1/grub.cfg. When you install grub on an EFI partition a search for the root by UUID is actually embedded into the grub binary. Another problem is likely that your distribution customizes the location of grub and updates the boot variables to tell the system where it is. However, a cloud image can't rely on the boot variables and must be installed in the default location (\EFI\BOOT\bootx64.efi). This default location can be achieved by adding the -removable flag to grub-install.
For encrypted boot, this becomes harder because the grub in the EFI partition must set up the cryptographic location by UUID. However, if you add
GRUB_ENABLE_CRYPTODISK=y
To /etc/default/grub it will do the necessary in grub-install and grub-mkconfig. Note that on Fedora, where every other GRUB_ENABLE parameter is true/false, this must be 'y', unfortunately grub-install will look for =y not =true.
Putting it all together: Encrypted VM Images
Start by extracting the root of an existing VM image to a tar file. Make sure it has all the tools you will need, like cryptodisk and grub-efi. Create a two partition raw image file and loopback mount it (I usually like 4GB) with a small efi partition (p1) and an encrypted root (p2):
truncate -s 4GB disk.img
parted disk.img mklabel gpt
parted disk.img mkpart primary 1Mib 100Mib
parted disk.img mkpart primary 100Mib 100%
parted disk.img set 1 esp on
parted disk.img set 1 boot on
Now setup the efi and cryptosystem (I use ext4, but it's not required). Note at this time luks will require a password. Use a simple one and change it later. Also note that most encrypted boot documents advise filling the encrypted partition with random numbers. I don't do this because the additional security afforded is small compared with the advantage of converting the raw image to a smaller qcow2 one.
losetup -P -f disk.img # assuming here it uses loop0
l=($(losetup -l|grep disk.img)) # verify with losetup -l
mkfs.vfat ${l}p1
blkid ${l}p1 # remember the EFI partition UUID
cryptsetup --type luks1 luksFormat ${l}p2 # choose temp password
blkid ${l}p2 # remember this as <UUID> you'll need it later
cryptsetup luksOpen ${l}p2 cr_root
mkfs.ext4 /dev/mapper/cr_root
mount /dev/mapper/cr_root /mnt
tar -C /mnt -xpf <vm root tar file>
for m in run sys proc dev; do mount --bind /$m /mnt/$m; done
chroot /mnt
Create or modify /etc/fstab to have root as /dev/disk/cr_root and the EFI partition by label under /boot/efi. Now set up grub for encrypted boot2
echo "GRUB_ENABLE_CRYPTODISK=y" >> /etc/default/grub
mount /boot/efi
grub-install --removable --target=x86_64-efi
grub-mkconfig -o /boot/grub/grub.cfg
For Debian, you'll need to add an /etc/crypttab entry for the encrypted disk:
cr_root UUID=<uuid> luks none
And then re-create the initial ramdisk. For dracut systems, you'll have to modify /etc/default/grub so the GRUB_CMDLINE_LINUX has a rd.luks.uuid=<UUID> entry. If this is a selinux based distribution, you may also have to trigger a relabel.
Now would also be a good time to make sure you have a root password you know or to install /root/.ssh/authorized_keys. You should unmount all the binds and /mnt and try EFI booting the image. You'll still have to type the password a couple of times, but once the image boots you're operating inside the encrypted envelope. All that remains is to create a fast boot high entropy low iteration password and replace the existing one with it and set the initial ramdisk to use it. This example assumes your image is mounted as SCSI disk sda, but it may be a virtual disk or some other device.
dd if=/dev/urandom bs=1 count=33|base64 -w 0 > /etc/cryptsetup-keys.d/luks-<UUID>.key
chmod 600 /etc/cryptsetup-keys.d/luks-<UUID>.key
cryptsetup --key-slot 1 luksAddKey /dev/sda2 # permanent recovery key
cryptsetup --key-slot 0 luksRemoveKey /dev/sda2 # remove temporary
cryptsetup --key-slot 0 --iter-time 1 luksAddKey /dev/sda2 /etc/cryptsetup-keys.d/luks-<UUID>.key
Note the "-w 0" is necessary to prevent the password from having a trailing newline which will make it difficult to use. For mkinitramfs systems, you'll now need to modify the /etc/crypttab entry
cr_root UUID=<UUID> /etc/cryptsetup-keys.d/luks-<UUID>.key luks
For dracut you need the key install hook in /etc/dracut.conf.d as described above and for Debian you need the keyfile pattern:
echo "KEYFILE_PATTERN=\"/etc/cryptsetup-keys.d/*\"" >>/etc/cryptsetup-initramfs/conf-hook
You now rebuild the initial ramdisk and you should now be able to boot the cryptosystem using either the high entropy password or your rescue one and it should only prompt in grub and shouldn't prompt again. This image file is now ready to be used for confidential computing.
23 Dec 2020 6:10pm GMT
22 Dec 2020
Kernel Planet
Michael Kerrisk (manpages): man-pages-5.10 is released
Starting with this release, Alejandro (Alex) Colomar has joined me as project comaintainer, and we've released man-pages-5.10. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.
This release resulted from patches, bug reports, reviews, and comments from around 25 contributors. The release includes just over 150 commits that changed around 140 pages.
The most notable of the changes in man-pages-5.10 are the following:
- I've added documentation of the faccessat2() system call (new in Linux 5.10) to the access(2) manual page.
- I've added a new subsection to the signal(7) manual page that provides a "big picture" of what happens when a signal handler is executed.
22 Dec 2020 9:56am GMT
16 Dec 2020
Kernel Planet
Pete Zaitcev: Google outage
It's very funny to hear about people who were unable to turn on their lights because their houses were "smart". Not a good look for Google Nest! But I had a real problem:
Google outage crashed my Thunderbird so good that the only fix is to delete the ~/.thunderbird and re-add all accounts.
Yes, really.
16 Dec 2020 6:20am GMT
13 Nov 2020
Kernel Planet
Dave Airlie (blogspot): lavapipe: a *software* swrast vulkan layer FAQ
(project was renamed from vallium to lavapipe)
I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.
tl;dr DO NOT USE LAVAPIPE OVER A GALLIUM HW DRIVER,
What is lavapipe?
How does it do that?
That sounds horrible, isn't it slow?
Why doesn't that matter for *software* drivers?
I still don't want to write a vulkan driver, give me more reasons.
Pipeline barriers:
Memory allocation:
Can this make my non-Vulkan capable hw run Vulkan?
No. If the hardware can't do virtual memory properly, or expose features for vulkan this can't be fixed with a software layer that just introduces overhead.
13 Nov 2020 2:16am GMT
12 Nov 2020
Kernel Planet
Dave Airlie (blogspot): Linux graphics, why sharing code with Windows isn't always a win.
A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.
I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.
tl;dr there is a big difference between open source released and open source developed projects in terms of sustainability and community.
The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open source vendor agnostic practices. There is no vendor controlling either project and both projects have a goal of try to maximise shared code and shared processes/coding standards across drivers from all vendors.
This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.
The value to distros is they have central places to pick up driver stacks with good release cycles and a minimal number of places they have to deal with to interact with those communities. Now usually hardware vendors don't see the value in the external communities as much as Linux distros do. From a hardware vendor internal point of view they see more benefit in creating a single stack shared between their Windows and Linux to maximise their return on investment, or make their orgchart prettier or produce less powerpoints about why their orgchart isn't optimal.
A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.
Why is it a bad idea?
I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of open source development model, however most vendors seem to fail at this. They see open source as a release model, they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of cycles. They build products containing these open source pieces, but they never expend the time building projects or communities around them.
As an example take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee. Even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa where it allowed Valve to contribute the ACO backend compiler and provide better results than AMD vendor shared code could ever have done, with far less investement and manpower.
Intel have a non-mesa compiler called Intel Graphics Compiler mentioned in the article. This is fully developed by intel internally, there is little info on project direction or how to get involved or where the community is. There doesn't seem to be much public review, patches seem to get merged to the public repo by igcbot which may mean they are being mirrored from some internal repo. There are not using github merge requests etc. Compare this to development of a Mesa NIR backend where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.
One area where it has mostly sort of worked out what with the AMD display code in the kernel. I believe this code to be shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring was a pain. So even in the best case it's not an optimal outcome even for the vendor. They have to work hard to make the shared code be capable of supporting different OS interactions.
How would I do it?
If I had to share Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example if I needed to create a Windows GL driver, I could:
a) write a complete GL implementation and throw it over the wall every few weeks. and make Windows/Linux use it, Linux users lose out on the shared stack, distros lose out on one dependency instead having to build a stack of multiple per vendor deps, Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).
b) use Mesa and upstream my driver to share with the Linux stack, add the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors and Windows gains that benefit, and Linux retains the benefits to it's ecosystem.
A warning then to anyone wishing for more vendor code sharing between OSes it generally doesn't end with Linux being better off, it ends up with Linux being more fragmented, harder to support and in the long run unsustainable.
12 Nov 2020 12:05am GMT
04 Nov 2020
Kernel Planet
Brendan Gregg: BPF binaries: BTF, CO-RE, and the future of BPF perf tools
Two new technologies, BTF and CO-RE, are paving the way for BPF to become a billion dollar industry. Right now there are many BPF (eBPF) startups building networking, security, and performance products (and more in stealth), yet requiring customers to install LLVM, Clang, and kernel headers - which can consume over 100 Mbytes of storage - to use BPF is an adoption drag. BTF and CO-RE eliminate these dependencies at runtime, not only making BPF more practical for embedded Linux environments, but for adoption everywhere. These technologies are: - BTF: BPF Type Format, which provides struct information to avoid needing Clang and kernel headers. - CO-RE: BPF Compile-Once Run-Everywhere, which allows compiled BPF bytecode to be relocatable, avoiding the need for recompilation by LLVM. Clang and LLVM are still required for compilation, but the result is a lightweight ELF binary that includes the precompiled BPF bytecode and can be run everywhere. The BCC project has a collection of these, called libbpf tools. As an example, I ported over my opensnoop(8) tool:
# ./opensnoop PID COMM FD ERR PATH 27974 opensnoop 28 0 /etc/localtime 1482 redis-server 7 0 /proc/1482/stat 1657 atlas-system-ag 3 0 /proc/stat […]
This opensnoop(8) is an ELF binary that doesn't use libLLVM or libclang:
# file opensnoop opensnoop: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/l, for GNU/Linux 3.2.0, BuildID[sha1]=b4b5320c39e5ad2313e8a371baf5e8241bb4e4ed, with debug_info, not stripped # ldd opensnoop linux-vdso.so.1 (0x00007ffddf3f1000) libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f9fb7836000) libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9fb7619000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9fb7228000) /lib64/ld-linux-x86-64.so.2 (0x00007f9fb7c76000) # ls -lh opensnoop opensnoop.stripped -rwxr-xr-x 1 root root 645K Feb 28 23:18 opensnoop -rwxr-xr-x 1 root root 151K Feb 28 23:33 opensnoop.stripped
... and stripped is only 151 Kbytes. Now imagine a BPF product: instead of requiring customers install various heavyweight (and brittle) dependencies, a BPF agent may now be a single tiny binary that works on any kernel that has BTF. ## How this works It's not just a matter of saving the BPF bytecode in ELF and then sending it to any other kernel. Many BPF programs walk kernel structs that can change from one kernel version to another. Your BPF bytecode may still execute on different kernels, but it may be reading the wrong struct offsets and printing garbage output! opensnoop(8) doesn't walk kernel structs since it instruments stable tracepoints and their arguments, but many other tools do. This is an issue of *relocation*, and both BTF and CO-RE solve this for BPF binaries. BTF provides type information so that struct offsets and other details can be queried as needed, and CO-RE records which parts of a BPF program need to be rewritten, and how. CO-RE developer Andrii Nakryiko has written long posts explaining this in more depth: [BPF Portability and CO-RE] and [BTF Type Information]. ## CONFIG_DEBUG_INFO_BTF=y These new BPF binaries are only possible if this kernel config option is set. It adds about 1.5 Mbytes to the kernel image (this is tiny in comparison to DWARF debuginfo, which can be hundreds of Mbytes). Ubuntu 20.10 has already made this config option the default, and all other distros should follow. Note to distro maintainers: it requires pahole >= 1.16. ## The future of BPF performance tools, BCC Python, and bpftrace For BPF performance tools, you should start with running [BCC] and [bpftrace] tools, and then coding in bpftrace. The BCC tools should eventually be switched from Python to libbpf C under the hood, but will work the same. **Coding performance tools in BCC Python is now considered deprecated** as we move to libbpf C with BTF and CO-RE (although we still have library work to do, such as for USDT support, so the Python versions will be needed for a while). Note that there are other use cases of BCC that may continue to use the Python interface; both BPF co-maintainer Alexei Starovoitov and myself briefly discussed this on [iovisor-dev]. My [BPF Performance Tools] book focused on running BCC tools and coding in bpftrace, and that doesn't change. However, **Appendix C's Python programming examples are now considered deprecated.** Apologies for the inconvenience. Fortunately it's only 15 pages of appendix material out of the 880-page book. What about bpftrace? It does support BTF, and in the future we're looking at reducing its installation footprint as well (it can currently get to [29 Mbytes], and we think it can go a lot smaller). Given an average libbpf program size of 229 Kbytes (based on the current libbpf tools, stripped), and an average bpftrace program size of 1 Kbyte (my book tools), a large collection of bpftrace tools plus the bpftrace binary may become a smaller installation footprint than the equivalent in libbpf. Plus the bpftrace versions can be modified on the fly. libbpf is better suited for more complex and mature tools that needs custom arguments and libraries. As screenshots, the future of BPF performance tools is this:
# ls /usr/share/bcc/tools /usr/sbin/*.bt argdist drsnoop mdflush pythongc tclobjnew bashreadline execsnoop memleak pythonstat tclstat [...] /usr/sbin/bashreadline.bt /usr/sbin/mdflush.bt /usr/sbin/tcpaccept.bt /usr/sbin/biolatency.bt /usr/sbin/naptime.bt /usr/sbin/tcpconnect.bt [...]
... and this:
# bpftrace -e 'BEGIN { printf("Hello, World!\n"); }' Attaching 1 probe... Hello, World! ^C
... and **not** this:
#!/usr/bin/python from bcc import BPF from bcc.utils import printb prog = """ int hello(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; } """ [...]
Thanks to Yonghong Song (Facebook) for leading development of BTF, Andrii Nakryiko (Facebook) for leading development of CO-RE, and everyone else involved in making this happen. [BPF Portability and CO-RE]: https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html [BTF Type Information]: https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html [BPF Performance Tools]: /bpf-performance-tools-book.html [29 Mbytes]: https://github.com/iovisor/bpftrace/issues/342 [iovisor-dev]: https://lists.iovisor.org/g/iovisor-dev/topic/future_of_bcc_python_tools/77827559?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,77827559 [BCC]: https://github.com/iovisor/bcc [bpftrace]: https://github.com/iovisor/bpftrace
04 Nov 2020 8:00am GMT
02 Nov 2020
Kernel Planet
Michael Kerrisk (manpages): man-pages-5.09 is released
I've released man-pages-5.09. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.
This release resulted from patches, bug reports, reviews, and comments from more than 40 contributors. The release includes more than 500 commits that changed nearly 600 pages. Nine new pages were added in this release.
The most notable of the changes in man-pages-5.09 are the following:
- Alejandro Colomar created a very useful new page, system_data_types(7), which has summary information about many commonly used system data types (currently about 45 types). In addition, the pages has links for each type, so that now it is possible to type a command such as man timespec to to get information about that type, the header files that define it, and the primary APIs that use the type.
- Alejandro also did a lot of work to split the former queue(3) page, which was a rather unwieldy description of dozen of APIs, into a number of smaller pages, each of which describes a subset of related APIs. The new pages are circleq(3), list(3), slist(3), stailq(3), and tailq(3). The now rather smaller queue(3) page, which has essentially become a summary of the APIs, has migrated to become queue(7).
- In addition, Alejandro made a large number of consistency fixes across many pages, especially in the code examples, where many pieces of C code were made more correct.
- I added a new pthread_attr_setsigmask_np(3) page, documenting the pthread_attr_setsigmask_np(3) and pthread_attr_getsigmask_np(3) functions that were added in glibc 2.32.
- A new kernel_lockdown(7) page, written by David Howells with improvements from Heinrich Schuchardt, describes the Kernel Lockdown feature that was merged in Linux 5.4.
- I added descriptions of the sigabbrev_np() and sigdescr_np() functions (new in glibc 2.32) to the strsignal(3) page.
- I added descriptions of the strerrorname_np() and strerrordesc_np() functions (also new in glibc 2.32) to the strerror(3) manual page.
02 Nov 2020 5:55am GMT
30 Oct 2020
Kernel Planet
Dave Airlie (blogspot): llvmpipe is OpenGL 4.5 conformant.
(I just sent the below email to mesa3d developer list).
driver for OpenGL 4.5 conformance under the SPI/X.org umbrella, and it
is now official[1].
Thanks to everyone who helped me drive this forward, and to all the
contributors both to llvmpipe and the general Mesa stack that enabled
this.
Big shout out to Roland Scheidegger for helping review the mountain of
patches I produced in this effort.
My next plans involved submitting lavapipe for Vulkan 1.0, it's at 99%
or so CTS, but there are line drawing, sampler accuracy and some snorm
blending failure I have to work out.
I also ran the OpenCL 3.0 conformance suite against clover/llvmpipe
yesterday and have some vague hopes of driving that to some sort of
completion.
(for GL 4.6 only texture anisotropy is really missing, I've got
patches for SPIR-V support, in case someone was feeling adventurous).
Dave.
[1] https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_272
30 Oct 2020 8:25pm GMT
11 Nov 2011
Linux Today
Tech Comics: "How to Live with Non-Geeks"
Datamation: Geeks must realize that non-geeks simply don't understand some very basics things.
11 Nov 2011 11:00pm GMT
How To Activate Screen Saver In Ubuntu 11.10
AddictiveTip: Ubuntu 11.10 does not come with a default screen saver, and even Gnome 3 provides nothing but a black screen when your system is idle.
11 Nov 2011 10:00pm GMT
XFCE: Your Lightweight, Speedy, Fully-Fledged Linux Desktop
MakeUseOf: As far as Linux goes, customization is king
11 Nov 2011 9:00pm GMT
Fedora Scholarship Recognizes Students for Their Contributions to Open Source Software
Red Hat: The Fedora Scholarship is awarded to one student each year to assist with the recipient's college or university education.
11 Nov 2011 8:00pm GMT
Digital Divide Persists Even as Broadband Adoption Grows
Datamation: New report from Dept. of Commerce shows that the 'have nots' - continue to have not when it comes to Internet.
11 Nov 2011 7:00pm GMT
Why GNOME refugees love Xfce
The Register: Thunar rather than later...
11 Nov 2011 6:00pm GMT
Everything should be open source, says WordPress founder
Between the Lines: "It's a bold statement, but it's the ethos that Mullenweg admirably stuck to, pointing out that sites like Wikipedia replaced Encyclopedia Britannica, and how far Android has gone for mobile."
11 Nov 2011 5:02pm GMT
The Computer I Need
LXer: "Before I had a cell phone I did not realize that I needed one. As of one week ago, I did not realize that I needed a tablet either but I can sense that it might be a similar experience."
11 Nov 2011 4:01pm GMT
GPL violations in Android: Same arguments, different day
IT World: "IP attorney Edward J. Naughton is repeating his arguments that Google's use of Linux kernel header files within Android may be in violation of the GNU General Public License (GPLv2), and tries to discredit Linus Torvalds' thoughts on the matter along the way."
11 Nov 2011 3:04pm GMT
No uTorrent for Linux by Year's End
Softpedia: "When asked why there's no uTorrent client version of Linux users out, BitTorrent Inc. said that the company has other priorities at the moment."
11 Nov 2011 2:01pm GMT
Keep an Eye on Your Server with phpSysInfo
Linux Magazine: "There are quite a few server monitoring solutions out there, but most of them are overkill for keeping an eye on a single personal server."
11 Nov 2011 1:03pm GMT
At long last, Mozilla Releases Lightning 1.0 Calendar
InternetNews: From the 'Date and Time' files:
11 Nov 2011 12:00pm GMT
Richard Stallman's Personal Ad
Editors' Note: You can't make this stuff up...
11 Nov 2011 10:00am GMT
Linux Top 5: Fedora 16 Aims for the Cloud
LinuxPlanet: There are many things to explore on the Linux Planet. This week, a new Fedora release provides plenty of items to examine. The new Fedora release isn't the only new open source release this week, as the Linux Planet welcomes new KDE and Firefox releases as well.
11 Nov 2011 9:00am GMT
Orion Editor Ships in Firefox 8
Planet Orion: Firefox 8 now includes the Orion code editor in its scratchpad feature.
11 Nov 2011 6:00am GMT