17 Jul 2018

LXer Linux News

The Perfect Server - Ubuntu 18.04 (Nginx, MySQL, PHP, Postfix, BIND, Dovecot, Pure-FTPD and ISPConfig 3.1)

This tutorial shows the steps to install an Ubuntu 18.04 (Bionic Beaver) server with Nginx, PHP, MariaDB, Postfix, pure-ftpd, BIND, Dovecot and ISPConfig 3.1. ISPConfig is a web hosting control panel that allows you to configure the installed services through a web browser. This setup provides a full hosting server with web, email (incl. spam and antivirus filtering), database, FTP and DNS services.

17 Jul 2018 4:52am GMT

Notes/Domino is alive! Second beta of version 10 is imminent

Analytical email, modern web dev tools and more, for both of you who still care

IBM's effort to make its Notes/Domino platform relevant for the future kicks up a gear this week, as the company prepares a second beta of a new version 10.…

17 Jul 2018 3:38am GMT

Is BDFL a death sentence?

A few days ago, Guido van Rossum, creator of the Python programming language and Benevolent Dictator For Life (BDFL) of the project, announced his intention to step away.

17 Jul 2018 2:24am GMT

Your Help Is Needed to Test VeraCrypt Support in the Tails Anonymous OS, GNOME

The team behind the famous Tails operating system, also known as the Amnesic Incognito Live System or simply Anonymous OS, needs your help to test the integration of the VeraCrypt disk encryption software.

17 Jul 2018 1:09am GMT

16 Jul 2018

LXer Linux News

Where to start with Rsync command : 8 Rsync Examples

The rsync command is used to transfer files between Linux systems. It uses an incremental file-transfer method: only the differences between the two copies of a file are sent, not the complete file, as is the case with other tools like scp. Rsync is lightweight, fast, and secure, and uses very little bandwidth for data copies.

16 Jul 2018 11:55pm GMT

Newsboat: A Snazzy Text-Based RSS Feed Reader

Newsboat is a sleek, open source RSS/Atom feed reader for the text console. The software is extremely configurable and offers a great feature set without any bloat.

16 Jul 2018 10:41pm GMT


Confessions of a recovering Perl hacker

opensource.com: As many have found, the lure of the Perl is hard to resist.

16 Jul 2018 10:00pm GMT

LXer Linux News

Confessions of a recovering Perl hacker

My name's MikeCamel, and I'm a Perl hacker. There, I've said it. That's the first step.

16 Jul 2018 9:26pm GMT


Linux history Command Tutorial for Beginners (8 Examples)

HowToForge: The history command gives you a list of commands you've executed earlier.

16 Jul 2018 9:00pm GMT

LXer Linux News

openSUSE Tumbleweed: a Linux Distro review

A Linux tyro reflects on a year with openSUSE's rolling distro Tumbleweed.

16 Jul 2018 8:12pm GMT


Newsboat: A Snazzy Text-Based RSS Feed Reader

Newsboat is a sleek, open source RSS/Atom feed reader for the text console.

16 Jul 2018 8:00pm GMT

Containers or virtual machines: Which is more secure? The answer will surprise you

ZDnet: IBM Research has created a new way to measure software security

16 Jul 2018 7:00pm GMT

LXer Linux News

Opinion: GitHub vs GitLab

So, Microsoft bought GitHub, and many people are confused or worried. It's not a new phenomenon when any large company buys any smaller company, and people are right to be worried, although I argue that their timing is wrong.

16 Jul 2018 6:58pm GMT


Where to start with Rsync command : 8 Rsync Examples

LinuxTechLab: The Rsync command is used for transferring files on Linux machines from one system to another.

16 Jul 2018 6:00pm GMT

LXer Linux News

Get our Linux networking cheat sheet

If your daily tasks include managing servers and the data center's network, the following Linux utilities and commands, from basic to advanced, will help make network management easier. In several of these commands, you'll see <fqdn>, which stands for "fully qualified domain name." When you see this, substitute your website URL or your server (e.g., server-name.company.com), as the case may be. Download the cheat sheet
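As a quick illustration of the <fqdn> substitution (localhost stands in for the placeholder here so the sketch works offline):

```shell
# Where the cheat sheet shows <fqdn>, substitute your own host, e.g.
#   ping -c 4 <fqdn>   becomes   ping -c 4 server-name.company.com
# Offline-friendly demonstration: ask the resolver about localhost instead
getent hosts localhost
```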

16 Jul 2018 5:43pm GMT


30 Bash Scripting Examples

LinuxHint: Bash scripts can be used for various purposes, such as executing a shell command, running multiple commands together, customizing administrative tasks, performing task automation etc.
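A minimal sketch combining a few of the uses listed above (the loop items are hypothetical):

```shell
#!/bin/bash
# Run multiple commands together: loop over a list, then chain a final step
for item in one two three; do
    echo "processing $item"
done && echo "all done"
```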

16 Jul 2018 5:00pm GMT

LXer Linux News

Raspbian OS And Why Everyone Is So Attached To It

Raspbian is a Debian-based Linux distribution created for the Raspberry Pi. It is worth mentioning that Raspbian is designed to function only on the Raspberry Pi; as for the rest of the alternatives out there, it is all about competition and making the Pi more adaptable and flexible.

16 Jul 2018 4:29pm GMT


Red Hat Looks Beyond Docker for Container Technology

ServerWatch: Red Hat has been busy leading multiple open-source efforts for managing, creating and deploying containers.

16 Jul 2018 4:00pm GMT

LXer Linux News

Apache Cassandra at 10: Making a community believe in NoSQL

A decade of technical promise and open-source fall-outs

Ten years ago this month, when Lehman Brothers was still just about in business and the term NoSQL wasn't even widely known, let alone an irritant, Facebook engineers open-sourced a distributed database system named Cassandra.…

16 Jul 2018 3:15pm GMT

3 cool productivity apps for Fedora 28

Productivity apps are especially popular on mobile devices. But when you sit down to do work, you're often at a laptop or desktop computer. Let's say you use a Fedora system for your platform. Can you find apps that help you get your work done? Of course! Read on for tips on apps to help […]

16 Jul 2018 2:00pm GMT


How The Update Framework Improves Software Distribution Security

eWEEK: Justin Cappos, assistant professor at New York University and founder of The Update Framework (TUF) open-source project, provides insight into the process of securing software updates.

16 Jul 2018 2:00pm GMT

How to Benchmark Your Linux System

LinuxConfig: There are a bunch of reasons that you'd want to benchmark your Linux system.

16 Jul 2018 1:00pm GMT

LXer Linux News

Linux history Command Tutorial for Beginners (8 Examples)

If your work involves running tools and scripts on the Linux command line, I am sure there are a lot of commands you would be running each day. Those new to the command line should know there exists a tool - dubbed history - that gives you a list of commands you've executed earlier.
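A quick sketch (history is a bash builtin; `history -s` stores an entry without executing it, which lets us demonstrate the list from a non-interactive shell):

```shell
# Seed the history list with two entries, then print it
bash -c '
history -s "df -h"
history -s "uname -r"
history          # prints the numbered command history
'
```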

16 Jul 2018 12:41pm GMT

How to enable developer mode on a Chrome OS tablet (and install Linux using Crouton)

Switching to developer channel gives you the option of using Google's Crostini feature to install a Linux virtual machine that lets you install desktop applications like LibreOffice and GIMP and launch them from the same app launcher you use to load Android and Chrome OS apps.

16 Jul 2018 8:35am GMT

How to Benchmark Your Linux System

There are a bunch of reasons that you'd want to benchmark your Linux system. Most people benchmark out of pure curiosity or to measure the system's performance for games. Benchmarking can also help you identify problems with your system, though, and improve weak points for a smoother and more efficient experience. Benchmarking also helps you identify possible software issues and problematic upgrades with regressions.

16 Jul 2018 5:43am GMT

Kernel Planet

Pete Zaitcev: Finally a use for code 451

Saw today at a respectable news site, which does not even nag about adblock:


We recognise you are attempting to access this website from a country belonging to the European Economic Area (EEA) including the EU which enforces the General Data Protection Regulation (GDPR) and therefore cannot grant you access at this time. For any issues, e-mail us at xxxxxxxx@xxxxxx.com or call us at xxx-xxx-4000.

What a way to brighten one's day. The phone without a country code is a cherry on top.

P.S. The only fly in this ointment is, I wasn't accessing it from the GDPR area. It was a geolocation failure.

16 Jul 2018 12:46am GMT

15 Jul 2018

Kernel Planet

James Bottomley: Measuring the Horizontal Attack Profile of Nabla Containers

One of the biggest problems with the current debate about Container vs Hypervisor security is that no-one has actually developed a way of measuring security, so the debate is all in qualitative terms (hypervisors "feel" more secure than containers because of the interface breadth) but no-one actually has done a quantitative comparison. The purpose of this blog post is to move the debate forwards by suggesting a quantitative methodology for measuring the Horizontal Attack Profile (HAP). For more details about Attack Profiles, see this blog post. I don't expect this will be the final word in the debate, but by describing how we did it I hope others can develop quantitative measurements as well.

We'll begin by looking at the Nabla technology through the relatively uncontroversial metric of performance. In most security debates, it's acceptable that some performance is lost by securing the application. As a rule of thumb, placing an application in a hypervisor loses anywhere between 10% and 30% of the native performance. Our goal here is to show that, for a variety of web tasks, the Nabla containers mechanism has an acceptable performance penalty.

Performance Measurements

We took some standard benchmarks: redis-bench-set, redis-bench-get, python-tornado and node-express; in the latter two we loaded up the web servers with simple external transactional clients. We then performed the same test for docker, gVisor, Kata Containers (as our benchmark for hypervisor containment) and nabla. In all the figures, higher is better (meaning more throughput):

The red Docker measure is included as the baseline. As expected, the Kata Containers measure is around 10-30% down on the docker one in each case because of the hypervisor penalty. However, in each case the Nabla performance is the same or higher than the Kata one, showing we pay less performance overhead for our security. A final note is that since the benchmarks are network ones, there's somewhat of a penalty paid by userspace networking stacks (which nabla necessarily has) for plugging into the docker network, so we show two values: one for the bridging plug-in (nabla-containers) required to orchestrate nabla with kubernetes, and one for a direct connection (nabla-raw) showing where the performance would be without the network penalty.

One final note is that, as expected, gVisor sucks because ptrace is a really inefficient way of connecting the syscalls to the sandbox. However, it is more surprising that gVisor-kvm (where the sandbox connects to the system calls of the container using hypercalls instead) is also pretty lacking in performance. I speculate this is likely because hypercalls exact their own penalty and hypervisors usually try to minimise them, which using them to replace system calls really doesn't do.

HAP Measurement Methodology

The Quantitative approach to measuring the Horizontal Attack Profile (HAP) says that we take the bug density of the Linux Kernel code and multiply it by the amount of unique code traversed by the running system after it has reached a steady state (meaning that it doesn't appear to be traversing any new kernel paths). For the sake of this method, we assume the bug density to be uniform and thus the HAP is approximated by the amount of code traversed in the steady state. Measuring this for a running system is another matter entirely, but, fortunately, the kernel has a mechanism called ftrace which can be used to provide a trace of all of the functions called by a given userspace process and thus gives a reasonable approximation of the number of lines of code traversed (note this is an approximation because we measure the total number of lines in the function taking no account of internal code flow, primarily because ftrace doesn't give that much detail). Additionally, this methodology works very well for containers where all of the control flow emanates from a well known group of processes via the system call information, but it works less well for hypervisors where, in addition to the direct hypercall interface, you also have to add traces from the back end daemons (like the kvm vhost kernel threads or dom0 in the case of Xen).
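The counting step can be sketched as follows, assuming a function trace has already been captured (capturing one requires root access to tracefs; a canned three-line trace stands in here, and the field positions are those of the standard ftrace output format):

```shell
# On a real system (as root) one would capture roughly like this:
#   echo function > /sys/kernel/debug/tracing/current_tracer
#   cat /sys/kernel/debug/tracing/trace > /tmp/trace.txt
# Canned sample trace for illustration:
cat > /tmp/trace.txt <<'EOF'
 bash-1234  [000] ....  100.000001: vfs_read <-ksys_read
 bash-1234  [000] ....  100.000002: vfs_read <-ksys_read
 bash-1234  [000] ....  100.000003: vfs_write <-ksys_write
EOF

# Count the unique kernel functions hit in the steady state; summing each
# function's line count would then approximate the HAP
awk '{ print $5 }' /tmp/trace.txt | sort -u | wc -l
```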

HAP Results

The results are for the same set of tests as the performance ones except that this time we measure the amount of code traversed in the host kernel:

As stated in our methodology, the height of the bar should be directly proportional to the HAP where lower is obviously better. On these results we can say that in all cases the Nabla runtime tender actually has a better HAP than the hypervisor contained Kata technology, meaning that we've achieved a container system with better HAP (i.e. more secure) than hypervisors.

Some of the other results in this set also bear discussing. For instance the Docker result certainly isn't 10x the Kata result as a naive analysis would suggest. In fact, the containment provided by docker looks to be only marginally worse than that provided by the hypervisor. Given all the hoopla about hypervisors being much more secure than containers this result looks surprising but you have to consider what's going on: what we're measuring in the docker case is the system call penetration of normal execution of the systems. Clearly anything malicious could explode this result by exercising all sorts of system calls that the application doesn't normally use. However, this does show clearly that a docker container with a well crafted seccomp profile (which blocks unexpected system calls) provides roughly equivalent security to a hypervisor.

The other surprising result is that, in spite of their claims to reduce the exposure to Linux System Calls, gVisor actually is either equivalent to the docker use case or, for the python tornado test, significantly worse than the docker case. This too is explicable in terms of what's going on under the covers: gVisor tries to improve containment by rewriting the Linux system call interface in Go. However, no-one has paid any attention to the amount of system calls the Go runtime is actually using, which is what these results are really showing. Thus, while current gVisor doesn't currently achieve any containment improvement on this methodology, it's not impossible to write a future version of the Go runtime that is much less profligate in the way it uses system calls by developing a Secure Go using the same methodology we used to develop Nabla.


On both tests, Nabla is far and away the best containment technology for secure workloads given that it sacrifices the least performance over docker to achieve the containment and, on the published results, is 2x more secure even than using hypervisor based containment.

Hopefully these results show that it is perfectly possible to have containers that are more secure than hypervisors and lays to rest, finally, the arguments about which is the more secure technology. The next step, of course, is establishing the full extent of exposure to a malicious application and to do that, some type of fuzz testing needs to be employed. Unfortunately, right at the moment, gVisor is simply crashing when subjected to fuzz testing, so it needs to become more robust before realistic measurements can be taken.

15 Jul 2018 5:54am GMT

James Bottomley: A New Method of Containment: IBM Nabla Containers

In the previous post about Containers and Cloud Security, I noted that most of the tenants of a Cloud Service Provider (CSP) could safely not worry about the Horizontal Attack Profile (HAP) and leave the CSP to manage the risk. However, there is a small category of jobs (mostly in the financial and allied industries) where the damage done by a Horizontal Breach of the container cannot be adequately compensated by contractual remedies. For these cases, a team at IBM research has been looking at ways of reducing the HAP with a view to making containers more secure than hypervisors. For the impatient, the full open source release of the Nabla Containers technology is here and here, but for the more patient, let me explain what we did and why. We'll have a follow-on post about the measurement methodology for the HAP and how we proved better containment than even hypervisor solutions.

The essence of the quest is a sandbox that emulates the interface between the runtime and the kernel (usually dubbed the syscall interface) with as little code as possible and a very narrow interface into the kernel itself.

The Basics: Looking for Better Containment

The HAP attack worry with standard containers is shown on the left: that a malicious application can breach the containment wall and attack an innocent application. This attack is thought to be facilitated by the breadth of the syscall interface in standard containers, so the guiding star in developing Nabla Containers was a methodology for measuring the reduction in the HAP (and hence the improvement in containment), but the initial impetus came from the observation that unikernel systems are nicely modular in the libOS approach, can be used to emulate system calls and, thanks to rumprun, have a wide set of support for modern web friendly languages (like python, node.js and go) with a fairly thin glue layer. Additionally they have a fairly narrow set of hypercalls that are actually used in practice (meaning they can be made more secure than conventional hypervisors). Code coverage measurements of standard unikernel based kvm images confirmed that they did indeed use a far narrower interface.

Replacing the Hypervisor Interface

One of the main elements of the hypervisor interface is the transition from a less privileged guest kernel to a more privileged host one via hypercalls and vmexits. These CPU mediated events are actually quite expensive, certainly a lot more expensive than a simple system call, which merely involves changing address space and privilege level. It turns out that the unikernel based kvm interface is really only nine hypercalls, all of which are capable of being rewritten as syscalls, so the approach to running this new sandbox as a container is to do this rewrite and seccomp restrict the interface to being only what the rewritten unikernel runtime actually needs (meaning that the seccomp profile is now CSP enforced). This vision, by the way, of a broad runtime above being mediated to a narrow interface is where the name Nabla comes from: The symbol for Nabla is an inverted triangle (∇) which is broad at the top and narrows to a point at the base.

Using this formulation means that the nabla runtime (or nabla tender) can be run as a single process within a standard container and the narrowness of the interface to the host kernel prevents most of the attacks that a malicious application would be able to perform.

DevOps and the ParaVirt conundrum

Back at the dawn of virtualization, there were arguments between Xen and VMware over whether a hypervisor should be fully virtual (capable of running any system supported by the virtual hardware description) or paravirtual (the system had to be modified to run on the virtualization system and thus would be incapable of running on physical hardware). Today, thanks in large part to CPU support for virtualization primitives, fully paravirtual systems have long since gone the way of the dodo and everyone expects any OS running on a hypervisor to be capable of running on physical hardware1. The death of paravirt also left the industry with an aversion to ever reviving it, which explains why most sandbox containment systems (gVisor, Kata) try to require no modifications to the image.

With DevOps, the requirement is that images be immutable and that to change an image you must take it through the full develop, build, test, deploy cycle. This development centric view means that, provided there's no impact to the images you use as the basis for your development, you can easily craft your final image to suit the deployment environment, which means a step like linking with the nabla tender is very easy. Essentially, this comes down to whether you take the Dev (we can rebuild to suit the environment) or the Ops (the deployment environment needs to accept arbitrary images) view. However, most solutions take the Ops view because of the anti-paravirt bias. For the Nabla tender, we take the Dev view, which is borne out by the performance figures.


Like most sandbox models, the Nabla containers approach is an alternative to namespacing for containment, but it still requires cgroups for resource management. The figures show that the containment HAP is actually better than that achieved with a hypervisor and the performance, while being marginally less than a namespaced container, is greater than that obtained by running a container inside a hypervisor. Thus we conclude that for tenants who have a real need for HAP reduction, this is a viable technology.

15 Jul 2018 5:54am GMT

12 Jul 2018

Kernel Planet

Pete Zaitcev: Guido van Rossum steps down

See a mailing list message:

I would like to remove myself entirely from the decision process. // I am not going to appoint a successor.

12 Jul 2018 6:01pm GMT

29 Jun 2018

Kernel Planet

Pete Zaitcev: The Proprietary Mind

Regarding the Huston missive, two quotes jumped at me the most. The first is just beautiful:

It may be slightly more disconcerting to realise that your electronic wallet is on a device that is using a massive compilation of open source software of largely unknown origin [...]

Yeah, baby. This moldy canard is still operational.

The second is from the narrative of the smartphone revolution:

Apple's iPhone, released in 2007, was a revolutionary device. [...] Apple's early lead was rapidly emulated by Windows and Nokia with their own offerings. Google's position was more as an active disruptor, using an open licensing framework for the Android platform [...]

Again, it's not like he's actually lying. He merely implies heavily that Nokia came next. I don't think the Nokia blunder even deserves a footnote, but to Huston, Google was too open. Google, Carl!

29 Jun 2018 12:58pm GMT

26 Jun 2018

Kernel Planet

James Morris: Linux Security Summit North America 2018: Schedule Published

The schedule for the Linux Security Summit North America (LSS-NA) 2018 is now published.

Highlights include:

and much more!

LSS-NA 2018 will be co-located with the Open Source Summit, and held over 27th-28th August, in Vancouver, Canada. The attendance fee is $100 USD. Register here.

See you there!

26 Jun 2018 9:11pm GMT

25 Jun 2018

Kernel Planet

Vegard Nossum: Compiler fuzzing, part 1

Much has been written about fuzzing compilers already, but there is not a lot that I could find about fuzzing compilers using more modern fuzzing techniques where coverage information is fed back into the fuzzer to find more bugs.

If you know me at all, you know I'll throw anything I can get my hands on at AFL. So I tried gcc. (And clang, and rustc -- but more about Rust in a later post.)

Levels of fuzzing

First let me summarise a post by John Regehr called Levels of Fuzzing, which my approach builds heavily on. Regehr presents a very important idea (which stems from earlier research/papers by others), namely that fuzzing can operate at different "levels". These levels correspond somewhat loosely to the different stages of compilation, i.e. lexing, parsing, type checking, code generation, and optimisation. In terms of fuzzing, the source code that you pass to the compiler has to "pass" one stage before it can enter the next; if you give the compiler a completely random binary file, it is unlikely to even get past the lexing stage, never mind to the point where the compiler is actually generating code. So it is in our interest (assuming we want to fuzz more than just the lexer) to generate test cases more intelligently than just using random binary data.

If we simply try to compile random data, we're not going to get very far.

In a "naïve" approach, we simply compile gcc with AFL instrumentation and run afl-fuzz on it as usual. If we give it a reasonable corpus of existing C code, it is possible that the fuzzer will find something interesting by randomly mutating the test cases. But more likely than not, it is mostly going to end up with random garbage like what we see above, and never actually progress to more interesting stages of compilation. I did try this -- and the results were as expected. It takes a long time before the fuzzer hits anything interesting at all. Now, Sami Liedes did this with clang back in 2014 and obtained some impressive results ("34 distinct assertion failures in the first 11 hours"). So clearly it was possible to find bugs in this way. When I tried this myself for GCC, I did not find a single crash within a day or so of fuzzing. And looking at the queue of distinct testcases it had found, it was very clear that it was merely scratching the very outermost surface of the input handling in the compiler -- it was not able to produce a single program that would make it past the parsing stage.

AFL has a few built-in mutation strategies: bit flips, "byte flips", arithmetic on bytes, 2-bytes, and 4-bytes, insertion of common boundary values (like 0, 1, powers of 2, -1, etc.), insertions of and substitution by "dictionary strings" (basically user-provided lists of strings), along with random splicing of test cases. We can already sort of guess that most of these strategies will not be useful for C and C++ source code. Perhaps the "dictionary strings" is the most promising for source code as it allows you to insert keywords and snippets of code that have at least some chance of ending up as a valid program. For the other strategies, single bit flips can change variable names, but changing variable names is not that interesting unless you change one variable into another (which both have to exist, as otherwise you would hit a trivial "undeclared" error). They can also create expressions, but if you somehow managed to change an 'h' into a '(', source code with this mutation would always fail unless you also inserted a ')' somewhere else to balance the expression. Source code has a lot of these "correspondences" where changing one thing also requires changing another thing somewhere else in the program if you want it to still compile (even though you don't generate an equivalent program -- that's not what we're trying to do here). Variable uses match up with variable declarations. Parentheses, braces, and brackets must all match up (and in the right order too!).

These "correspondences" remind me a lot of CRCs and checksums in other file formats, and they give the fuzzer problems for the exact same reason: without extra code it's hard to overcome having to change the test case simultaneously in two or more places, never mind making the exact change that will preserve the relationship between these two values. It's a game of combinatorics; the more things we have to change at once and the more possibilities we have for those changes, the harder it will be to get that exact combination when you're working completely at random. For checksums the answer is easy, and there are two very good strategies: either you disable the checksum verification in the code you're fuzzing, or you write a small wrapper to "fix up" your test case so that the checksum always matches the data it protects (of course, after mutating an input you may not really know where in the file the checksum will be located anymore, but that's a different problem).

For C and C++ source code it's not so obvious how to help the fuzzer overcome this. You can of course generate programs with a grammar (and some heuristics), which is what several C random code generators such as Csmith, ccg, and yarpgen do. This is in a sense on the completely opposite side of the spectrum when it comes to the levels of fuzzing. By generating programs that you know are completely valid (and correct, and free of undefined behaviour), you will breeze through the lexing, the parsing, and the type checking and target the code generation and optimization stages. This is what Regehr et al. did in "Taming compiler fuzzers", another very interesting read. (Their approach does not include instrumentation feedback, however, so it is more of a traditional black-box fuzzing approach than AFL, which is considered grey-box fuzzing.)

But if you use a C++ grammar to generate C++ programs, that will also exclude a lot of inputs that are not valid but nevertheless accepted by the compiler. This approach relies on our ability to express all programs that should be valid, but there may also be non-valid programs that crash the compiler. As an example, if our generator knows that you cannot add an integer to a function, or assign a value to a constant, then the code paths checking for those conditions in the compiler would never be exercised, despite the fact that those errors are more interesting than mere syntax errors. In other words, there is a whole range of "interesting" test cases which we will never be able to generate if we restrict ourselves only to those programs that are actually valid code.

Please note that I am not saying that one approach is better than the other! I believe we need all of them to successfully find bugs in all the areas of the compiler. By realising exactly what the limits of each method are, we can try to find other ways to fill the gaps.

Fuzzing with a loose grammar

So how can we fill the gap between the shallow syntax errors in the front end and the very deep of the code generation in the back end? There are several things we can do.

The main feature of my solution is to use a "loose" grammar. As opposed to a "strict" grammar which would follow the C/C++ specs to the dot, the loose grammar only really has one type of symbol, and all the production rules in the grammar create this type of symbol. As a simple example, a traditional C grammar will not allow you to put a statement where an expression is expected, whereas the loose grammar has no restrictions on that. It does, however, take care that your parentheses and braces match up. My grammar file therefore looks something like this (also see the full grammar if you're curious!):

"[void] [f] []([]) { [] }"
"[]; []"
"{ [] }"
"[0] + [0]"

Here, anything between "[" and "]" (call it a placeholder) can be substituted by any other line from the grammar file. An evolution of a program could therefore plausibly look like this:

void f () { }           // using the "[void] [f] []([]) { [] }" rule
void f () { ; } // using the "[]; []" rule
void f () { 0 + 0; } // using the "[0] + [0]" rule
void f ({ }) { 0 + 0; } // using the "{ [] }" rule

Wait, what happened at the end there? That's not valid C. No -- but it could still be an interesting thing to try to pass to the compiler. We did have a placeholder where the arguments usually go, and according to the grammar we can put any of the other rules in there. This does quickly generate a lot of nonsensical programs that stop the compiler completely dead in its track at the parsing stage. We do have another trick to help things along, though...

AFL doesn't care at all whether what we pass it is accepted by the compiler or not; it doesn't distinguish between success and failure, only between graceful termination and crashes. However, all we have to do is teach the fuzzer about the difference between exit codes 0 and 1; a 0 means the program passed all of gcc's checks and actually resulted in an object file. Then we can discard all the test cases that result in an error, and keep a corpus of test cases which compile successfully. It's really a no-brainer, but makes such a big difference in what the fuzzer can generate/find.

Enter prog-fuzz

prog-fuzz output

If it's not clear by now, I'm not using afl-fuzz to drive the main fuzzing process for the techniques above. I decided it was easier to write a fuzzer from scratch, just reusing the AFL instrumentation and some of the setup code to collect the coverage information. Without the fork server, it's surprisingly little code, on the order of 15-20 lines! (I do have support for the fork server on a different branch and it's not THAT much harder to implement, but I simply haven't gotten around to it yet; it also wasn't really needed to find a lot of bugs.)

You can find prog-fuzz on GitHub: https://github.com/vegard/prog-fuzz

The code is not particularly clean, it's a hacked-up fuzzer that gets the job done. I'll want to clean that up at some point, document all the steps to build gcc with AFL instrumentation, etc., and merge a proper fork server. I just want the code to be out there in case somebody else wants to have a poke around.


From the end of February until some time in April I ran the fuzzer on and off and reported just over 100 distinct gcc bugs in total (32 of them fixed so far, by my count).

Now, there are a few things to be said about these bugs.

First, these bugs are mostly crashes: internal compiler errors ("ICEs"), assertion failures, and segfaults. Compiler crashes are usually not very high priority bugs -- especially when you are dealing with invalid programs. Most of the crashes would never occur "naturally" (i.e. as the result of a programmer trying to write some program). They represent very specific edge cases that may not be important at all in normal usage. So I am under no delusions about the relative importance of these bugs; a compiler crash is hardly a security risk.

However, I still think there is value in fuzzing compilers. Personally I find it very interesting that the same technique on rustc, the Rust compiler, only found 8 bugs in a couple of weeks of fuzzing, and not a single one of them was an actual segfault. I think it does say something about the nature of the code base, code quality, and the relative dangers of different programming languages, in case it was not clear already. In addition, compilers (and compiler writers) should have these fuzz testing techniques available to them, because it clearly finds bugs. Some of these bugs also point to underlying weaknesses or to general cases where something really could go wrong in a real program. In all, knowing about the bugs, even if they are relatively unimportant, will not hurt us.

Second, I should also note that I did have conversations with the gcc devs while fuzzing. I asked if I should open new bugs or attach more test cases to existing reports if I thought the area of the crash looked similar, even if it wasn't the exact same stack trace, etc., and they always told me to file a new report. In fact, I would like to praise the gcc developer community: I have never had such a pleasant bug-reporting experience. Within a day of reporting a new bug, somebody (usually Martin Liška or Marek Polacek) would run the test case and mark the bug as confirmed as well as bisect it using their huge library of precompiled gcc binaries to find the exact revision where the bug was introduced. This is something that I think all projects should strive to do -- the small feedback of having somebody acknowledge the bug is a huge encouragement to continue the process. Other gcc developers were also very active on IRC and answered almost all my questions, ranging from silly "Is this undefined behaviour?" to "Is this worth reporting?". In summary, I have nothing but praise for the gcc community.

I should also add that I played briefly with LLVM/clang, and prog-fuzz found 9 new bugs there (2 of them fixed so far).

In addition to those, I also found a few other bugs that had already been reported by Sami Liedes back in 2014 which remain unfixed.

For rustc, I will write a more detailed blog post about how to set it up, as compiling rustc itself with AFL instrumentation is non-trivial and it makes more sense to detail those exact steps apart from this post.

What next?

I mentioned the efforts by Regehr et al. and Dmitry Babokin et al. on Csmith and yarpgen, respectively, as fuzzers that generate valid (UB-free) C/C++ programs for finding code generation bugs. I think there is work to be done here to find more code generation bugs; as far as I can tell, nobody has yet combined instrumentation feedback (grey-box fuzzing) with this kind of test case generator. Well, I tried to do it, but it requires a lot of effort to generate valid programs that are also interesting, and I stopped before finding any actual bugs. But I really think this is the future of compiler fuzzing, and I will outline the ideas that I think will have to go into it:

I don't have the time to continue working on this at the moment, but please do let me know if you would like to give it a try and I'll do my best to answer any questions about the code or the approach.


Thanks to John Regehr, Martin Liška, Marek Polacek, Jakub Jelinek, Richard Guenther, David Malcolm, Segher Boessenkool, and Martin Jambor for responding to my questions and bug reports!

Thanks to my employer, Oracle, for allowing me to do part of this fuzzing effort using company time and resources.

25 Jun 2018 7:35am GMT

22 Jun 2018

feedKernel Planet

Paul E. Mc Kenney: Stupid RCU Tricks: Changes to -rcu Workflow

The -rcu tree also takes LKMM patches, and I have been handling these completely separately, with one branch for RCU and another for LKMM. But this can be a bit inconvenient, and more importantly, can delay my response to patches to (say) LKMM if I am doing (say) extended in-tree RCU testing. So it is time to try something a bit different.

My current thought is to continue having separate LKMM and RCU branches (or, more often, sets of branches) containing the commits to be offered up to the next merge window. The -rcu branch lkmm would flag the LKMM branch (or, more often, merge commit) and a new -rcu branch rcu would flag the RCU branch (or, again more often, merge commit). Then the lkmm and rcu merge commits would be merged, with new commits on top. These new commits would be intermixed RCU and LKMM commits.

The tip of the -rcu development effort (both LKMM and RCU) would be flagged with a new dev branch, with the old rcu/dev branch being retired. The rcu/next branch will continue to mark the commit to be pulled into the -next tree, and will point to the merge of the rcu and lkmm branches during the merge window.

I will create the next-merge-window branches sometime around -rc1 or -rc2, as I have in the past. I will send RFC patches to LKML shortly thereafter. I will send a pull request for the rcu branch around -rc5, and will send final patches from the lkmm branch at about that same time.

Should continue to be fun! :-)

22 Jun 2018 9:17pm GMT

21 Jun 2018

feedKernel Planet

James Bottomley: Containers and Cloud Security


The idea behind this blog post is to take a new look at how cloud security is measured and what its impact is on the various actors in the cloud ecosystem. From the measurement point of view, we look at the vertical stack: all code that is traversed to provide a service all the way from input web request to database update to output response potentially contains bugs; the bug density is variable for the different components but the more code you traverse the higher your chance of exposure to exploitable vulnerabilities. We'll call this the Vertical Attack Profile (VAP) of the stack. However, even this axis is too narrow because the primary actors are the cloud tenant and the cloud service provider (CSP). In an IaaS cloud, part of the vertical profile belongs to the tenant (the guest kernel, guest OS and application) and part (the hypervisor and host OS) belongs to the CSP. However, the CSP vertical has the additional problem that any exploit in this piece of the stack can be used to jump into either the host itself or any of the other tenant virtual machines running on the host. We'll call this exploit causing a failure of containment the Horizontal Attack Profile (HAP). We should also note that any Horizontal Security failure is a potentially business destroying event for the CSP, so they care deeply about preventing them. Conversely any exploit occurring in the VAP owned by the Tenant can be seen by the CSP as a tenant only problem and one which the Tenant is responsible for locating and fixing. We correlate size of profile with attack risk, so the larger the profile, the greater the probability of being exploited.

From the Tenant point of view, improving security can be done in one of two ways, the first (and mostly aspirational) is to improve the security and monitoring of the part of the Vertical the Tenant is responsible for and the second is to shift responsibility to the CSP, so make the CSP responsible for more of the Vertical. Additionally, for most Tenants, a Horizontal failure mostly just means they lose trust in the CSP, unless the Tenant is trusting the CSP with sensitive data which can be exfiltrated by the Horizontal exploit. In this latter case, the Tenant still cannot do anything to protect the CSP part of the Security Profile, so it's mostly a contractual problem: SLAs and penalties for SLA failures.


To see how these interpretations apply to the various cloud environments, lets look at some of the Cloud (and pre-Cloud) models:

Physical Infrastructure

The left hand diagram shows a standard IaaS rented physical system. Since the Tenant rents the hardware from the CSP, it is shown in red indicating CSP ownership, and the two Tenants are shown in green and yellow. In this model, barring attacks from the actual hardware, the Tenant owns the entirety of the VAP. The nice thing for the CSP is that hardware provides air gap security, so there is no HAP, which means it is incredibly secure.

However, there is another (much older) model shown on the right, called the shared login model, where the Tenant only rents a login on the physical system. In this model, only the application belongs to the Tenant, so the CSP is responsible for much of the VAP (the expanded red area). Here the total VAP is the same, but the Tenant's VAP is much smaller: the CSP is responsible for maintaining and securing everything apart from the application. From the Tenant point of view this is a much more secure system since they're responsible for much less of the security. From the CSP point of view there is now a large HAP, because a tenant compromising the kernel can control the entire system and jump to other tenant processes. This actually has the worst HAP of all the systems considered in this blog.

Hypervisor based Virtual Infrastructure

In this model, the total VAP is unquestionably larger (worse) than the physical system above because there's simply more code to traverse (a guest and a host kernel). However, from the Tenant's point of view, the VAP should be identical to that of unshared physical hardware because the CSP owns all the additional parts. However, there is the possibility that the Tenant may be compromised by vulnerabilities in the Virtual Hardware Emulation. This can be a worry because an exploit here doesn't lead to a Horizontal security problem, so the CSP is apt to pay less attention to vulnerabilities in the Virtual Hardware simply because each guest has its own copy (even though that copy is wholly under the control of the CSP).

The HAP is definitely larger (worse) than the physical host because of the shared code in the Host Kernel/Hypervisor, but it has often been argued that because this is so deep in the Vertical stack that the chances of exploit are practically zero (although VENOM gave the lie to this hope: stack depth represents obscurity, not security).

However, there is another way of improving the VAP and that's to reduce the number of vulnerabilities that can be hit. One way that this can be done is to reduce the bug density (the argument for rewriting code in safer languages) but another is to restrict the amount of code which can be traversed by narrowing the interface (for example, see arguments in this hotcloud paper). On this latter argument, the host kernel or hypervisor does have a much lower VAP than the guest kernel because the hypercall interface used for emulating the virtual hardware is very narrow (much narrower than the syscall interface).

The important takeaways here are firstly that simply transferring ownership of elements in the VAP doesn't necessarily improve the Tenant VAP unless you have some assurance that the CSP is actively monitoring and fixing them. Conversely, when the threat is great enough (Horizontal Exploit), you can trust to the natural preservation instincts of the CSP to ensure correct monitoring and remediation because a successful Horizontal attack can be a business destroying event for the CSP.

Container Based Virtual Infrastructure

The total VAP here is identical to that of physical infrastructure. However, the Tenant component is much smaller (the kernel accounting for around 50% of all vulnerabilities). It is this reduction in the Tenant VAP that makes containers so appealing: the CSP is now responsible for monitoring and remediating about half of the physical system VAP, which is a great improvement for the Tenant. Plus when the CSP remediates on the host, every container benefits at once, which is much better than having to crack open every virtual machine image to do it. Best of all, the Tenant images don't have to be modified to benefit from these fixes; simply running on an updated CSP host is enough. However, the cost for this is that the HAP is the entire linux kernel syscall interface, meaning the HAP is much larger than the hypervisor virtual infrastructure case because the latter benefits from interface narrowing to only the hypercalls (qualitatively, assuming the hypercall interface is ~30 calls and the syscall interface is ~300 calls, then the HAP is 10x larger in the container case than the hypervisor case); however, thanks to protections from the kernel namespace code, the HAP is less than the shared login server case. Best of all, from the Tenant point of view, this entire HAP cost is borne by the CSP, which makes this an incredible deal: not only does the Tenant get a significant reduction in their VAP but the CSP is hugely motivated to keep on top of all vulnerabilities in their part of the VAP and remediate very fast because of the business implications of a successful horizontal attack. The flip side of this is that a large number of the world's CSPs are very unhappy about these potential risks and costs and actually try to shift responsibility (and risk) back to the Tenant by advocating nested virtualization solutions like running containers in hypervisors.
So remember, you're only benefiting from the CSP motivation to actively maintain their share of the VAP if your CSP runs bare metal containers because otherwise they've quietly palmed the problem back off on you.

Other Avenues for Controlling Attack Profiles

The assumption above was that defect density per component is roughly constant, so effectively the more code the more defects. However, it is definitely true that different code bases have different defect densities, so one way of minimizing your VAP is to choose the code you rely on carefully and, of course, follow bug reduction techniques in the code you write.

Density Reduction

The simplest way of reducing defects is to find and fix the ones in the existing code base (while additionally being careful about introducing new ones). This means it is important to know how actively defects are being searched for and how quickly they are being remediated. In general, the greater the user base for the component, the greater the size of the defect searchers and the faster the speed of their remediation, which means that although the Linux Kernel is a big component in the VAP and HAP, a diligent patch routine is a reasonable line of defence because a fixed bug is not an exploitable bug.

Another way of reducing defect density is to write (or rewrite) the component in a language which is less prone to exploitable defects. While this approach has many advocates, particularly among language partisans, it suffers from the defect decay issue: the idea that the maximum number of defects occurs in freshly minted code and the number goes down over time because the more time from release the more chance they've been found. This means that a newly rewritten component, even in a shiny bug reducing language, can still contain more bugs than an older component written in a more exploitable language, simply because a significant number of bugs introduced on creation have been found in the latter.
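The decay argument can be made concrete with a toy model (all numbers below are invented purely for illustration, not measured defect densities):

```python
def residual_defects(initial_defects, years_deployed, half_life_years=2.0):
    """Toy exponential-decay model: deployed code steadily loses defects as
    users find and fix them. Numbers are purely illustrative."""
    return initial_defects * 0.5 ** (years_deployed / half_life_years)

# An old component in an "exploitable" language: many initial defects, but a
# decade of people finding and fixing them.
old_component = residual_defects(initial_defects=500, years_deployed=10)
# A fresh rewrite in a safer language: fewer defects per line, zero decay.
fresh_rewrite = residual_defects(initial_defects=150, years_deployed=0)

print(f"old component:  ~{old_component:.0f} residual defects")
print(f"fresh rewrite: ~{fresh_rewrite:.0f} residual defects")
```

Under these made-up parameters the battle-tested old component ends up with fewer live defects than the shiny rewrite, which is exactly the point being made above.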

Code Reduction (Minimization Techniques)

It also stands to reason that, for a complex component, simply reducing the amount of code that is accessible to the upper components reduces the VAP because it directly reduces the number of defects. However, reducing the amount of code isn't as simple as it sounds: it can only really be done by components that are configurable and then only if you're not using the actual features you eliminate. Elimination may be done in two ways, either physically, by actually removing the code from the component or virtually by blocking access using a guard (see below).

Guarding and Sandboxing

Guarding is mostly used to do virtual code elimination by blocking access to certain code paths that the upper layers do not use. For instance, seccomp in the Linux Kernel can be used to block access to system calls you know the application doesn't use, meaning it also blocks any attempt to exploit code that would be in those system calls, thus reducing the VAP (and also reducing the HAP if the kernel is shared).
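As an illustration of how little such a guard involves, here is a hypothetical Python sketch that installs a raw seccomp BPF filter blocking getpid (constants are hand-copied from the kernel uapi headers and assume x86-64 Linux; production code would normally use libseccomp instead):

```python
import ctypes, errno, struct

# Constants hand-copied from linux/seccomp.h and linux/filter.h; the
# syscall number 39 == getpid assumes x86-64 Linux.
PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP, SECCOMP_MODE_FILTER = 38, 22, 2
SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO = 0x7FFF0000, 0x00050000
BPF_LD_W_ABS, BPF_JEQ_K, BPF_RET_K = 0x20, 0x15, 0x06
NR_GETPID = 39

def insn(code, jt, jf, k):
    # struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
    return struct.pack("HBBI", code, jt, jf, k)

# "Return EPERM for getpid, allow everything else": the getpid code path in
# the kernel is now unreachable for this process -- virtual code elimination.
filt = b"".join([
    insn(BPF_LD_W_ABS, 0, 0, 0),          # load seccomp_data.nr
    insn(BPF_JEQ_K, 0, 1, NR_GETPID),     # nr == getpid ? next : skip one
    insn(BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | errno.EPERM),
    insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),
])

class SockFprog(ctypes.Structure):
    _fields_ = [("len", ctypes.c_ushort), ("filter", ctypes.c_char_p)]

libc = ctypes.CDLL(None, use_errno=True)
prog = SockFprog(len(filt) // 8, filt)
assert libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0
assert libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ctypes.byref(prog), 0, 0) == 0

# A raw getpid syscall now fails with EPERM instead of reaching kernel code.
print("raw getpid returned:", libc.syscall(NR_GETPID))
```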

The deficiencies in the above are obvious: if the application needs to use a system call, you cannot block it, although you can filter it, which leads to huge and ever more complex seccomp policies. For system calls the application genuinely has to use, the solution can sometimes be guarding emulation. In this mode the guard code emulates all the effects of the system call without actually making the system call into the kernel. This approach, often called sandboxing, is certainly effective at reducing the HAP since the guards usually run in their own address space, which cannot be used to launch a horizontal attack. However, the sandbox may or may not reduce the VAP depending on the bugs in the emulation code vs the bugs in the original. One of the biggest potential disadvantages to watch out for with sandboxing is that the address space the sandbox runs in is often that of the tenant, often meaning the CSP has quietly switched ownership of that component back to the tenant as well.


First and foremost: security is hard. As a cloud Tenant, you really want to offload as much of it as possible to people who are much more motivated to actually do it than you are (i.e. the Cloud Service Provider).

The complete Vertical Attack Profile of a container bare metal system in the cloud is identical to a physical system and better than a Hypervisor based system; plus the tenant owned portion is roughly 50% of the total VAP meaning that Containers are by far the most secure virtualization technology available today from the Tenant perspective.

The increased Horizontal Attack profile that containers bring should all rightly belong to the Cloud Service Provider. However, CSPs are apt to shirk this responsibility and try to find creative ways to shift responsibility back to the tenant including spreading misinformation about the container Attack profiles to try to make Tenants demand nested solutions.

Before you, as a Tenant, start worrying about the CSP owned Horizontal Attack Profile, make sure that contractual remedies (like SLAs or Reputational damage to the CSP) would be insufficient to cover the consequences of any data loss that might result from a containment breach. Also remember that unless you, as the tenant, are under external compliance obligations like HIPAA or PCI, contractual remedies for a containment failure are likely sufficient and you should keep responsibility for the HAP where it belongs: with the CSP.

21 Jun 2018 5:31am GMT

19 Jun 2018

feedKernel Planet

Pete Zaitcev: Slasti py3

Got Slasti 2.1 released today, the main feature being support for Python 3. Some of the changes were somewhat... horrifying maybe? I tried to adhere to a general plan, where the whole of the application operates in unicode, and the UTF-8 data is encoded/decoded at the boundary. Unfortunately, in practice the boundary was rather leaky, so in several places I had to resort to isinstance(). I expected to always assign a type to all variables and fields, and then rigidly convert as needed. But WSGI had its own ideas.

Overall, the biggest source of issues was not the py3 model, but trying to make the code compatible. I'm not going to do that again if I can help it: either py2 or py3, but not both.

UPDATE: Looks like CKS agrees that compatible code is usually too hard. I'm glad the recommendation to avoid Python 3 entirely is no longer operational.

19 Jun 2018 2:54am GMT

18 Jun 2018

feedKernel Planet

James Morris: Linux Security BoF at Open Source Summit Japan

This is a reminder for folks attending OSS Japan this week that I'll be leading a Linux Security BoF session on Wednesday at 6pm.

If you've been working on a Linux security project, feel welcome to discuss it with the group. We will have a whiteboard and projector. This is also a good opportunity to raise topics for discussion, and to ask questions about Linux security.

See you then!

18 Jun 2018 8:26am GMT

15 Jun 2018

feedKernel Planet

Pete Zaitcev: Fedora 28 and IPv6 Neighbor Discovery

Finally updated my laptop to F28 and ssh connections started hanging. They hang for 15-20 seconds, then unstuck for a few seconds, then hang, and so on, cycling. I thought it was a WiFi problem at first. But eventually I narrowed it down to IPv6 ND being busted.

A packet trace on the laptop shows that traffic flows until the laptop issues a neighbor solicitation. The router replies with an advertisement, which I presume is getting dropped. Traffic stops - although what's strange, tcpdump still captures outgoing packets that the laptop sends. In a few seconds, the router sends a neighbor solicitation, but the laptop never replies. Presumably, dropped as well. This continues until a router advertisement resets the cycle.

Stopping firewalld lets solicitations in and the traffic resumes, so obviously a rule is busted somewhere. The IPv6 ICMP appears allowed, but the ip6tables rules generated by Firewalld are fairly opaque, I cannot be sure. Ended filing bug 1591867 for the time being and forcing ssh -4.

UPDATE: Looks like the problem is a "reverse path filter". Setting IPv6_rpfilter=no in /etc/firewalld/firewalld.conf fixes the issue (thanks to Victor for the tip). Here's an associated comment in the configuration file:

# Performs a reverse path filter test on a packet for IPv6. If a reply to the
# packet would be sent via the same interface that the packet arrived on, the
# packet will match and be accepted, otherwise dropped.
# The rp_filter for IPv4 is controlled using sysctl.

Indeed there's no such sysctl for v6. Obviously the problem is that packets with the source of fe80::/16 are mistakenly assumed to be martians and dropped. That's easy enough to fix, I hope. But it's fascinating that we have an alternative configuration method nowadays, only exposed by certain specialist tools. If I don't have firewalld installed, and want this setting changed, what then?

Remarkably, the problem was reported first in March (it's June now). This tells me that most likely the erroneous check itself is in the kernel somewhere, and firewalld is not at fault, which is why Erik isn't fixing it. He should've reassigned the bug to kernel, if so, but...

The commit cede24d1b21d68d84ac5a36c44f7d37daadcc258 looks like the fix. Unfortunately, it just missed the 4.17.

15 Jun 2018 5:39pm GMT

14 Jun 2018

feedKernel Planet

Kees Cook: security things in Linux v4.17

Previously: v4.16.

Linux kernel v4.17 was released last week, and here are some of the security things I think are interesting:

Jailhouse hypervisor

Jan Kiszka landed Jailhouse hypervisor support, which uses static partitioning (i.e. no resource over-committing), where the root "cell" spawns new jails by shrinking its own CPU/memory/etc resources and hands them over to the new jail. There's a nice write-up of the hypervisor on LWN from 2014.

Sparc ADI

Khalid Aziz landed the userspace support for Sparc Application Data Integrity (ADI or SSM: Silicon Secured Memory), which is the hardware memory coloring (tagging) feature in Sparc M7. I'd love to see this extended into the kernel itself, as it would kill linear overflows between allocations, since the base pointer being used is tagged to belong to only a certain allocation (sized to a multiple of cache lines). Any attempt to increment beyond, into memory with a different tag, raises an exception. Enrico Perla has some great write-ups on using ADI in allocators and a comparison of ADI to Intel's MPX.

new kernel stacks cleared on fork

It was possible that old memory contents would live in a new process's kernel stack. While normally not visible, "uninitialized" memory read flaws or read overflows could expose these contents (especially stuff "deeper" in the stack that may never get overwritten for the life of the process). To avoid this, I made sure that new stacks were always zeroed. Oddly, this "priming" of the cache appeared to actually improve performance, though it was mostly in the noise.


As part of further defense in depth against attacks like Stack Clash, Michal Hocko created MAP_FIXED_NOREPLACE. The regular MAP_FIXED has a subtle behavior not normally noticed (but used by some, so it couldn't just be fixed): it will replace any overlapping portion of a pre-existing mapping. This means the kernel would silently overlap the stack into mmap or text regions, since MAP_FIXED was being used to build a new process's memory layout. Instead, MAP_FIXED_NOREPLACE has all the features of MAP_FIXED without the replacement behavior: it will fail if a pre-existing mapping overlaps with the newly requested one. The ELF loader has been switched to use MAP_FIXED_NOREPLACE, and it's available to userspace too, for similar use-cases.
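The difference in behavior can be demonstrated from userspace; the following is an illustrative ctypes sketch (flag values are hand-copied from the uapi headers and assume x86-64 Linux, and the refusal itself requires a v4.17+ kernel, since older kernels silently ignore unknown mmap flags):

```python
import ctypes, errno

PROT_READ, PROT_WRITE = 0x1, 0x2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20
MAP_FIXED_NOREPLACE = 0x100000
MAP_FAILED = ctypes.c_void_p(-1).value  # (void *)-1 as an unsigned word
PAGE = 4096

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

# Grab an anonymous page anywhere.
addr = libc.mmap(None, PAGE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
assert addr not in (None, MAP_FAILED)

# Plain MAP_FIXED at the same address would silently replace the mapping;
# MAP_FIXED_NOREPLACE fails with EEXIST instead.
again = libc.mmap(addr, PAGE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0)
print("overlap refused:", again == MAP_FAILED,
      "errno:", errno.errorcode.get(ctypes.get_errno()))
```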

pin stack limit during exec

I used a big hammer and pinned the RLIMIT_STACK values during exec. There were multiple methods to change the limit (through at least setrlimit() and prlimit()), and there were multiple places the limit got used to make decisions, so it seemed best to just pin the values for the life of the exec so no games could get played with them. Too much assumed the value wasn't changing, so better to make that assumption actually true. Hopefully this is the last of the fixes for these bad interactions between stack limits and memory layouts during exec (which have all been defensive measures against flaws like Stack Clash).
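For reference, the userspace side of these limits looks like this (a small illustrative sketch using Python's resource module, which wraps getrlimit()/setrlimit()):

```python
import resource

# Read the current stack limit -- the value the kernel now samples once at
# the start of exec instead of re-reading while building the memory layout.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack soft limit:",
      "unlimited" if soft == resource.RLIM_INFINITY else soft)

# A process can lower its own soft limit via setrlimit(); prlimit() can do
# the same to *another* process, which is part of why pinning was needed.
new_soft = 8 * 1024 * 1024  # 8 MiB, the usual default
if soft == resource.RLIM_INFINITY or soft >= new_soft:
    resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))
    print("soft limit now:", resource.getrlimit(resource.RLIMIT_STACK)[0])
```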

Variable Length Array removals start

Following some discussion over Alexander Popov's ongoing port of the stackleak GCC plugin, Linus declared that Variable Length Arrays (VLAs) should be eliminated from the kernel entirely. This is great because it kills several stack exhaustion attacks, including weird stuff like stepping over guard pages with giant stack allocations. However, with several hundred uses in the kernel, this wasn't going to be an easy job. Thankfully, a whole bunch of people stepped up to help out: Gustavo A. R. Silva, Himanshu Jha, Joern Engel, Kyle Spiers, Laura Abbott, Lorenzo Bianconi, Nikolay Borisov, Salvatore Mesoraca, Stephen Kitt, Takashi Iwai, Tobin C. Harding, and Tycho Andersen. With Linus Torvalds and Martin Uecker, I also helped rewrite the max() macro to eliminate false positives seen by the -Wvla compiler option. Overall, about 1/3rd of the VLA instances were solved for v4.17, with many more coming for v4.18. I'm hoping we'll have entirely eliminated VLAs by the time v4.19 ships.

That's it for now! Please let me know if you think I missed anything. Stay tuned for v4.18; the merge window is open. :)

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

14 Jun 2018 11:23pm GMT

07 Jun 2018

feedKernel Planet

Pete Zaitcev: Fundamental knowledge

Colleagues working in space technologies discussed recently if fundamental education were necessary for a programmer, so just for a reference, here's a list of fundamental-ish areas I had trouble with in practice over a 30 year career.

Statistics. This should be obvious. Although in theory I'm educated in the topic, I always had difficulty with it, and barely passed my tests, decades ago.

Error correction. To be entirely honest, I blew this. Every time I had to do it, I ended up either using Phil Karn's library or relying on Kevin Greenan's erasure coding package. I think the only time I implemented something that worked was the UAT.

The DSP on Inphase/Quadrature data. This one is really vexing. I ended up with some ridiculous ad-hoc code, even though it's very interesting. In my defence, there were some difficult performance constraints, so even if I knew the underlying math, there would be no way to apply it.

Other than the above, I don't feel like I was held back by any kind of fundamental background, most of all not in CS. About the only time it mattered was when an interviewer asked me to implement an R-B tree.

07 Jun 2018 4:03pm GMT

Pavel Machek: Complex cameras coming to PCs

It seems PCs are getting complex cameras. Which is bad news for PCs, because the existing libv4l2 will not work there, but good news for OMAP3, as there will be greater pressure to fix stuff.

07 Jun 2018 12:34pm GMT

05 Jun 2018

feedKernel Planet

Davidlohr Bueso: Linux v4.17: Performance Goodies

With Linux v4.17 now released, there are some interesting performance changes worth looking at. As always, the term 'performance' can be vague in that gains in one area can negatively affect another, so take everything with a grain of salt.

sysvipc: introduce STAT_ANY commands

There was a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/shm (0444) and getting stat info (such as via the SHM_STAT shmctl command). The latter does permission checks for the object vs S_IRUGO. As such there can be cases where EACCES is returned via syscall but the info is displayed anyway in the procfs files. While this might have security implications via info leaking (albeit no writing to the shm metadata), this behavior goes way back, and showing all the objects regardless of the permissions was most likely an oversight - so we are stuck with it.

Some applications require getting this info without root privileges, and going through procfs can be rather slow in comparison with a syscall -- up to 500x slower in some reported cases. For this, the new {SEM,SHM,MSG}_STAT_ANY commands have been introduced.
[Commit c21a6970ae72, a280d6dc77eb, 23c8cec8cf67]
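The discrepancy is easy to see from an unprivileged process (a small illustrative sketch; it assumes a Linux kernel built with SysV IPC and /proc mounted):

```python
# /proc/sysvipc/shm is world-readable (0444), so any user can list every
# segment's metadata -- the same info SHM_STAT may refuse with EACCES.
with open("/proc/sysvipc/shm") as f:
    header = f.readline().split()
    segments = [line.split() for line in f]

print("columns:", header[:4])
print("segments visible without privileges:", len(segments))
```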

kvm: x86 paravirtualization hints and KVM_HINTS_DEDICATED

When dealing with CPU virtualization, many in-kernel heuristics and optimizations revolve around the overcommitted scenario. By introducing KVM_HINTS_DEDICATED, the hypervisor administrator can select this option when there are pinned 1:1 virtual to physical CPU scenarios; particularly reducing the paravirt overhead in locking and TLB flushing as the vCPU is most unlikely to get preempted. In these cases, native qspinlock may perform better than pvqspinlock as it disables paravirt spinlock slowpath optimizations. There is an older Xen equivalent available as a kernel parameter: xen_nopvspin.
[Commit b2798ba0b876, 34226b6b7098, 6beacf74c257]

sched: rework idle loop

Rework the idle loop in order to prevent CPUs from spending too much time in shallow idle states, by making it stop the scheduler tick before putting the CPU into an idle state only if the idle duration predicted by the idle governor is long enough. This reduces idle power on some systems by 10% or more and may improve the performance of workloads in which the idle loop overhead matters. It required the code to be reordered to invoke the idle governor before stopping the tick, among other things.
[Commit 0e7767687fda, 2aaf709a518d, ed98c3491998]

mm: pcpu pages optimizations around zone lock

Two optimizations around zone->lock in free_pcpupages_bulk() that yield around a 5% performance improvement in page-fault benchmarks (will-it-scale in this case). The first reduces the scope of the lock when freeing a batch of pages back to buddy. Given the per-cpu semantics, the lock was unnecessarily held while the pages to free were being chosen from the pcpu page's migratetype lists.

The second improvement adds a prefetch of the to-be-freed page's buddy outside of the lock, in the hope that accessing the buddy's page structure later with the lock held will be faster. Normally prefetching is frowned upon, particularly in microbenchmarks, but in this particular case the prefetched pointer will always be used.
[Commit 0a5f4e5b4562, 97334162e4d7]

mm: lockless list_lru_count_one()

When reclaiming slab caches of a memcg, shrink_slab() iterates over all registered shrinkers in the system, trying to count and consume objects related to the cgroup. Under memory pressure, the operation bottlenecked on acquiring the nlru->lock. By applying RCU to the data structure, the lookup can be done without taking the lock, which translates into the overall contention pretty much disappearing.
[Commit 0c7c1bed7e13]

memory hotplug optimizations

Such optimizations reduce the number of times struct pages are traversed during a memory hotplug operation, from three to one. Among other benefits, memory hotplug is made more similar to the boot-time memory initialization path, because struct pages are now initialized in a single function. This improves memory hotplug performance because the cache is not evicted several times, and it also reduces loop branching overhead.
[Commit d0dc12e86b31]

procfs: miscellaneous optimizations

Access to various files within procfs has been optimized by replacing calls to seq_printf() with lower-cost alternatives. The changes show some performance benefits in ad-hoc microbenchmarks.

btrfs: relax barrier when unlocking an extent buffer

Serializing checks for an active waitqueue requires a barrier, as it can race with the waiter side. Such is the case with btrfs_tree_unlock(), which was abusing the barrier semantics on architectures where atomic operations are already ordered, such as x86. A performance improvement is immediately noticeable from optimizing barrier usage while maintaining the necessary semantics.
[Commit 2e32ef87b074]

x86/pti: leave kernel text global for no PCID

From the patch: Global pages are bad for hardening because they potentially let an exploit read the kernel image via a Meltdown-style attack. But, global pages are good for performance because they reduce TLB misses when making user/kernel transitions, especially when PCIDs are not available, such as on older hardware, or where a hypervisor has disabled them for some reason.

This change implements a basic, sane policy: if PCIDs are available, only map a minimal amount of kernel text global; if not, map all kernel text global. This translates into a considerable throughput increase on an lseek microbenchmark.
[Commit 8c06c7740d19]

lib/raid6/altivec: Add vpermxor implementation for raid6 Q syndrome

This enhancement uses the vpermxor instruction to optimize the raid6 Q syndrome. This instruction was made available with POWER8, ISA version 2.07. It allows for both vperm and vxor instructions to be done in a single instruction. The benchmark results show a 35% speed increase over the best existing algorithm for powerpc (altivec).
[Commit 751ba79cc552]

05 Jun 2018 2:51pm GMT